Cyber Bullying Detection Based on Semantic-Enhanced Marginalized Denoising Auto-Encoder

Abstract

As a side effect of increasingly popular social media, cyberbullying has emerged as a serious problem afflicting children, adolescents and young adults. Machine learning techniques make automatic detection of bullying messages in social media possible, which could help to build a healthy and safe social media environment. In this research area, one critical issue is learning a robust and discriminative numerical representation of text messages. In this paper, we propose a new representation learning method to tackle this problem. Our method, named semantic-enhanced marginalized denoising auto-encoder (smSDA), is developed via a semantic extension of a popular deep learning model, the stacked denoising autoencoder (SDA). The semantic extension consists of semantic dropout noise and sparsity constraints, where the semantic dropout noise is designed based on domain knowledge and the word embedding technique. The proposed method is able to exploit the hidden feature structure of bullying information and learn a robust and discriminative representation of text. Comprehensive experiments on two public cyberbullying corpora (Twitter and MySpace) show that the proposed approach outperforms the baseline text representation learning methods.
 

Introduction

Cyberbullying can be defined as aggressive, intentional actions performed by an individual or a group of people against a victim via digital communication methods, such as sending messages and posting comments. Cyberbullying on social media can take place anywhere at any time. Bullies are free to hurt their peers' feelings because they do not need to face anyone and can hide behind the Internet. Victims are easily exposed to harassment, since all of us, especially young people, are constantly connected to the Internet or social media.

Previous computational studies have shown that natural language processing and machine learning are powerful tools for studying bullying. Cyberbullying detection can be formulated as a supervised learning problem: a classifier is first trained on a cyberbullying corpus labeled by humans, and the learned classifier is then used to recognize bullying messages. Three kinds of information are often used in cyberbullying detection: text, user demographics, and social network features. This paper focuses on text-based cyberbullying detection, where the numerical representation of Internet messages should be robust and discriminative.

Building on a deep learning method named stacked denoising autoencoder (SDA), this paper investigates a new text representation model based on a variant of SDA: marginalized stacked denoising autoencoders (mSDA), which adopts linear instead of nonlinear projection and marginalizes the noise distribution in order to learn more robust representations. We utilize semantic information to extend mSDA and develop Semantic-enhanced Marginalized Stacked Denoising Autoencoders (smSDA). The semantic information consists of bullying words. An automatic extraction of bullying words based on word embeddings is proposed so that the human labor involved can be reduced. During training of smSDA, we attempt to reconstruct bullying features from other normal words by exploiting the correlation between bullying and normal words.

The main contributions of this work are as follows:

* Our proposed Semantic-enhanced Marginalized Stacked Denoising Autoencoder is able to learn robust features from the BoW (bag-of-words) representation in an efficient and effective way. These robust features are learned by reconstructing the original input from corrupted (i.e., missing) ones.
* Semantic information is incorporated into the reconstruction process. In our framework, high-quality semantic information, i.e., bullying words, can be extracted automatically through word embeddings. These specialized modifications make the learned representation better suited to bullying detection.
* Comprehensive experiments on real data sets have verified the performance of the proposed model.
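
The idea of reconstructing the original input from corrupted (missing) features can be made concrete with a small sketch. The code below is not the authors' implementation; it is a minimal NumPy illustration of the closed-form marginalized denoising layer from the general mSDA literature, under stated assumptions. The names `msda_layer` and `msda_stack`, the ridge term `reg`, and the option to pass a per-feature dropout vector `p` (for example, a different corruption rate for bullying-word features) are all hypothetical choices made for illustration. The sparsity constraints and the embedding-based extraction of bullying words are not modeled here.

```python
import numpy as np


def msda_layer(X, p, reg=1e-5):
    """One marginalized denoising layer (illustrative sketch).

    X   : (d, n) array, features x samples (e.g. a BoW matrix).
    p   : scalar or length-d array of dropout probabilities per feature.
          A per-feature vector is an assumption used here to mimic
          feature-dependent ("semantic") dropout.
    reg : small ridge term that keeps the linear system well conditioned.
    Returns the mapping W of shape (d, d + 1) and the hidden output H (d, n).
    """
    d, n = X.shape
    # Append a constant bias row; the bias is never corrupted.
    Xb = np.vstack([X, np.ones((1, n))])
    p = np.broadcast_to(np.asarray(p, dtype=float), (d,))
    q = np.append(1.0 - p, 1.0)  # survival probability of each feature

    S = Xb @ Xb.T  # scatter matrix of the clean input
    # Expected scatter of the corrupted input:
    # off-diagonal entries scale by q_i * q_j, diagonal entries by q_i.
    Q = S * np.outer(q, q)
    np.fill_diagonal(Q, q * np.diag(S))
    # Expected cross term between clean targets and corrupted input.
    P = S[:d, :] * q

    # W = P (Q + reg * I)^{-1}; Q is symmetric, so solve the transposed system.
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P.T).T
    H = np.tanh(W @ Xb)  # nonlinearity applied after the linear mapping
    return W, H


def msda_stack(X, p, n_layers=2):
    """Stack several layers and concatenate the input with every hidden output."""
    outputs = [X]
    H = X
    for _ in range(n_layers):
        _, H = msda_layer(H, p)
        outputs.append(H)
    return np.vstack(outputs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy binary BoW matrix: 6 vocabulary terms, 50 messages.
    X_toy = (rng.random((6, 50)) < 0.3).astype(float)
    # Hypothetical setting: the first two terms are treated as bullying words
    # and are dropped more often than the remaining terms.
    p_toy = np.array([0.8, 0.8, 0.5, 0.5, 0.5, 0.5])
    features = msda_stack(X_toy, p_toy, n_layers=2)
    print(features.shape)
```

Because the expectation over corruptions is computed analytically, no corrupted copies of the data are ever sampled; each layer reduces to one linear solve. The stacked output could then be fed to any standard classifier for the supervised detection step described above.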