Fig 1.
Five contextual types of harassment.
Table 1.
Summary of the related research.
Table 2.
Annotation statistics of our categorized corpus.
Table 3.
Agreement rate.
Table 4.
Statistics for the Golbeck corpus after our annotation with respect to contextual type.
Fig 2.
Significant LIWC features when comparing the harassing corpus to the non-harassing corpus for six categories.
The extreme red (green) color indicates that a given feature is significant in the harassing (non-harassing) corpus. For example, the negation feature, with the value 2.34 in the appearance harassing corpus, is significantly higher than in the non-harassing corpus. White indicates no difference for a given feature between the two corpora.
Fig 3.
Top-25 frequent words within each harassing corpus.
Fig 4.
Top-25 frequent words within each non-harassing corpus.
Table 5.
Percentage of type-dependent words among the top-15 frequent words within each sub-corpus.
H stands for the harassing corpus and NH stands for the non-harassing corpus.
Table 6.
Size of the training datasets for each type.
Fig 5.
Comparative study of the F-score of the major classifiers (SVM = Support Vector Machine, KNN = K-Nearest Neighbor, GBM = Gradient Boosting Machine, NB = Naive Bayes, NN = Neural Network).
Fig 6.
Comparative study of the various feature settings on the performance of the GBM classifier using measures such as precision, recall, F-score, accuracy, and specificity.
The extreme colors (purple, yellow, green, olive, and pink) indicate higher values, whereas white indicates lower values.
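For reference, the five measures named in this caption can all be derived from the counts of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions, not the authors' evaluation code; the function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F-score, accuracy, and specificity
    from confusion-matrix counts (harassing = positive class).

    Hypothetical helper for illustration; ratios with a zero
    denominator are reported as 0.0.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)      # flagged tweets that are truly harassing
    recall = safe_div(tp, tp + fn)         # harassing tweets that were flagged
    specificity = safe_div(tn, tn + fp)    # non-harassing tweets left unflagged
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    f_score = safe_div(2 * precision * recall, precision + recall)  # harmonic mean

    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
        "specificity": specificity,
    }


# Made-up counts, not taken from the paper.
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Specificity is reported alongside recall because, on an imbalanced corpus, a classifier can reach high accuracy while still mislabeling many non-harassing tweets.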
Table 7.
Performance of the GBM binary classifier on the combined corpus.
Table 8.
Performance of our multi-class classifier for predicting the type of harassment incident.
Table 9.
Performance of our classifier for predicting tweets in the Golbeck corpus.