
Cyberbullying Detection Based on Semantic-Enhanced Marginalized Denoising Auto-Encoder
Abstract
Introduction
Cyberbullying can be defined as aggressive, intentional actions performed by an individual or a group of people against a victim via digital communication methods, such as sending messages and posting comments. Cyberbullying on social media can take place anywhere at any time. Bullies are free to hurt their peers’ feelings because they do not need to face anyone and can hide behind the Internet. Victims are easily exposed to harassment because all of us, especially young people, are constantly connected to the Internet or social media.

Previous computational studies have shown that natural language processing and machine learning are powerful tools for studying bullying. Cyberbullying detection can be formulated as a supervised learning problem: a classifier is first trained on a cyberbullying corpus labeled by humans, and the learned classifier is then used to recognize bullying messages. Three kinds of information are often used in cyberbullying detection: text, user demographics, and social network features. This paper focuses on text-based cyberbullying detection.

In cyberbullying detection, the numerical representation of Internet messages should be robust and discriminative. Building on a deep learning method named stacked denoising autoencoder (SDA), this paper investigates a new text representation model based on a variant of SDA: marginalized stacked denoising autoencoders (mSDA), which adopts linear instead of nonlinear projection and marginalizes the corruption noise in order to learn more robust representations. We utilize semantic information to expand mSDA and develop Semantic-enhanced Marginalized Stacked Denoising Autoencoders (smSDA). The semantic information consists of bullying words. We propose an automatic extraction of bullying words based on word embeddings so that the human labor involved can be reduced. During training of smSDA, we attempt to reconstruct bullying features from other normal words by exploiting the correlation between bullying and normal words.

The main contributions of this work are:

* Our proposed Semantic-enhanced Marginalized Stacked Denoising Autoencoder is able to learn robust features from the bag-of-words (BoW) representation in an efficient and effective way. These robust features are learned by reconstructing the original input from corrupted (i.e., missing) ones.
* Semantic information is incorporated into the reconstruction process. In our framework, high-quality semantic information, i.e., bullying words, can be extracted automatically through word embeddings. Finally, these specialized modifications make bullying detection easier.
* Comprehensive experiments on real data sets have verified the performance of the proposed model.
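
The idea of learning features by reconstructing the original input from corrupted copies can be sketched as follows. This is a minimal illustration of a single marginalized denoising layer in the closed form commonly used for mSDA, not the semantic-enhanced variant proposed in this paper. The function name `marginalized_denoising_layer`, the corruption probability `p`, the ridge term `reg`, and the features-by-documents layout of `X` are all assumptions made for the example.

```python
import numpy as np


def marginalized_denoising_layer(X, p, reg=1e-5):
    """One marginalized denoising layer (illustrative sketch).

    X   : array of shape (d, n), one column per message (e.g. BoW counts).
    p   : probability that a feature is dropped (set to zero) by the noise.
    reg : small ridge term added for numerical stability (an assumption).

    Returns the linear mapping W of shape (d, d + 1) and the hidden
    representation H = tanh(W @ [X; 1]) of shape (d, n).
    """
    d, n = X.shape

    # Append a constant bias feature that is never corrupted.
    Xb = np.vstack([X, np.ones((1, n))])

    # q[i] is the probability that feature i survives the corruption.
    q = np.full(d + 1, 1.0 - p)
    q[-1] = 1.0

    # Scatter matrix of the uncorrupted data.
    S = Xb @ Xb.T

    # Expected scatter of the corrupted data: off-diagonal entries are
    # scaled by q[i] * q[j], diagonal entries by q[i] only.
    Q = S * np.outer(q, q)
    np.fill_diagonal(Q, q * np.diag(S))

    # Expected cross term between clean outputs and corrupted inputs.
    P = S[:d, :] * q

    # W = P Q^{-1}; Q is symmetric, so solve Q W^T = P^T instead of inverting.
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P.T).T

    # A nonlinearity is applied only after the linear mapping is learned.
    H = np.tanh(W @ Xb)
    return W, H
```

Because the expectation over the corruption is taken analytically, no corrupted copies of the data are ever generated. Stacking would amount to feeding `H` into another call of the same function, and the outputs of all layers could then be concatenated with the original features before training a classifier.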