NetSpam: a Network-based Spam Detection Framework for Reviews in Online Social Media

January 20, 2019

Abstract

Nowadays, a large share of people rely on content available in social media when making decisions (e.g., reviews and feedback on a topic or product). The possibility that anybody can leave a review provides a golden opportunity for spammers to write spam reviews about products and services for different interests. Identifying these spammers and the spam content is a hot topic of research. Although a considerable number of studies have been done recently toward this end, the methodologies put forth so far still barely detect spam reviews, and none of them show the importance of each extracted feature type. In this paper, we propose a novel framework, named NetSpam, which uses spam features to model review data sets as heterogeneous information networks and maps the spam detection procedure into a classification problem in such networks. Using the importance of spam features helps us obtain better results in terms of different metrics on real-world review data sets from the Yelp and Amazon websites. The results show that NetSpam outperforms the existing methods and that, among four categories of features (review-behavioral, user-behavioral, review-linguistic, and user-linguistic), the first category performs better than the others.

Introduction

Online social media portals play an influential role in information propagation, which is an important source for producers in their advertising campaigns as well as for customers in selecting products and services. In recent years, people have relied heavily on written reviews in their decision-making, with positive and negative reviews encouraging or discouraging them in their selection of products and services. Written reviews also help service providers enhance the quality of their products and services. These reviews have thus become an important factor in the success of a business: while positive reviews can bring benefits for a company, negative reviews can damage its credibility and cause economic losses. The fact that anyone with any identity can leave comments as a review provides a tempting opportunity for spammers to write fake reviews designed to mislead users’ opinions. These misleading reviews are then multiplied by the sharing function of social media and by propagation over the web. Reviews written to change users’ perception of how good a product or a service is are considered spam, and are often written in exchange for money.

20% of the reviews on the Yelp website are actually spam reviews. On the other hand, a considerable amount of literature has been published on the techniques used to identify spam and spammers, as well as on different types of analysis of this topic. These techniques can be classified into different categories: some use linguistic patterns in text, mostly based on unigrams and bigrams; others rely on behavioral patterns, using features extracted from users’ behavior that are mostly metadata-based; and some use graphs, graph-based algorithms and classifiers. Despite this great deal of effort, many aspects have been missed or remain unsolved. One of them is a classifier that can calculate feature weights that show each feature’s level of importance in determining spam reviews.

The general concept of our proposed framework is to model a given review dataset as a Heterogeneous Information Network (HIN) and to map the problem of spam detection into a HIN classification problem. In particular, we model the review dataset as a HIN in which reviews are connected through different node types (such as features and users). A weighting algorithm is then employed to calculate each feature’s importance (or weight). These weights are used to calculate the final labels for reviews using both unsupervised and supervised approaches. To evaluate the proposed solution, we used two sample review datasets from the Yelp and Amazon websites. Based on our observations, when two views are defined for features (review-user and behavioral-linguistic), the features classified as review-behavioral have higher weights and yield better performance in spotting spam reviews in both the semi-supervised and unsupervised approaches. In addition, we demonstrate that using different levels of supervision, such as 1%, 2.5% and 5%, or using an unsupervised approach, makes no noticeable difference to the performance of our approach. We observed that feature weights can be added or removed for labeling, and hence time complexity can be scaled for a specific level of accuracy. As a result of this weighting step, we can use fewer features with higher weights to obtain better accuracy with less time complexity. In addition, categorizing features into four major categories (review-behavioral, user-behavioral, review-linguistic, user-linguistic) helps us understand how much each category of features contributes to spam detection. In summary, our main contributions are as follows:

We propose the NetSpam framework, a novel network-based approach that models review networks as heterogeneous information networks. The classification step uses different metapath types, which are innovative in the spam detection domain.
A new weighting method for spam features is proposed to determine the relative importance of each feature; it shows how effective each feature is in distinguishing spam from normal reviews. Previous works [12], [20] also aimed to address the importance of features, mainly in terms of obtained accuracy, but not as a built-in function of their framework (i.e., their approach depends on ground truth to determine each feature’s importance). As we explain in our unsupervised approach, NetSpam is able to find feature importance even without ground truth, relying only on the metapath definition and on values calculated for each review.
NetSpam improves the accuracy compared to the state of the art in terms of time complexity, which depends heavily on the number of features used to identify a spam review; hence, using features with higher weights results in detecting fake reviews more easily with less time complexity.
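
The idea of connecting reviews through feature nodes can be illustrated with a small sketch. It is not the authors' implementation: the function name `build_feature_metapaths`, the assumption that feature values are normalized to [0, 1], and the rule that two reviews are linked when a feature falls into the same one of `levels` equal-width buckets are all assumptions made here for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_feature_metapaths(review_features, levels=3):
    """Link reviews that share the same discretized level of a feature.

    review_features: {review_id: {feature_name: value in [0, 1]}}
    Returns {feature_name: set of frozenset({review_a, review_b})}.
    The bucketing rule is an illustrative assumption, not taken from the paper.
    """
    # Group reviews by (feature, level).
    buckets = defaultdict(lambda: defaultdict(list))
    for review_id, features in review_features.items():
        for name, value in features.items():
            level = min(int(value * levels), levels - 1)
            buckets[name][level].append(review_id)

    # Every pair of reviews in the same bucket is connected through that feature.
    metapaths = defaultdict(set)
    for name, by_level in buckets.items():
        for members in by_level.values():
            for a, b in combinations(sorted(members), 2):
                metapaths[name].add(frozenset((a, b)))
    return dict(metapaths)


if __name__ == "__main__":
    reviews = {
        "r1": {"rate_deviation": 0.9},
        "r2": {"rate_deviation": 0.8},
        "r3": {"rate_deviation": 0.1},
    }
    print(build_feature_metapaths(reviews))
```

In this toy input, r1 and r2 fall into the same bucket and are connected, while r3 stays isolated for that feature.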

Existing techniques can be classified into different categories: some use linguistic patterns in text, mostly based on unigrams and bigrams; others rely on behavioral patterns, using features extracted from users’ behavior that are mostly metadata-based; and some use graphs, graph-based algorithms and classifiers.
Existing systems can be summarized in three categories: Linguistic-based Methods, Behavior-based Methods and Graph-based Methods.
Feng et al. use unigrams, bigrams and their composition. Other studies use other features, such as pairwise features (features between two reviews, e.g. content similarity) and the percentage of CAPITAL words in a review, to find spam reviews.
Lai et al. used probabilistic language modeling to spot spam. This study demonstrates that 2% of reviews written on business websites are actually spam.
Deeper analysis of the literature shows that behavioral features work better than linguistic ones in terms of the accuracy they yield.
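
The linguistic features named above (unigrams, bigrams, and the percentage of capitalized words) can be sketched as follows. This is a minimal illustration under assumptions: the tokenizer regex, the function name `linguistic_features`, and the choice to count only fully upper-case words of two or more letters as CAPITAL words are not taken from the cited studies.

```python
import re
from collections import Counter


def linguistic_features(text):
    """Extract simple linguistic features from one review (illustrative only)."""
    # Assumed tokenizer: runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    lowered = [t.lower() for t in tokens]

    unigrams = Counter(lowered)
    bigrams = Counter(zip(lowered, lowered[1:]))

    # Assumed definition: a CAPITAL word is fully upper-case and longer than one letter.
    capital_words = [t for t in tokens if len(t) > 1 and t.isupper()]
    capital_ratio = len(capital_words) / len(tokens) if tokens else 0.0

    return {"unigrams": unigrams, "bigrams": bigrams, "capital_ratio": capital_ratio}


if __name__ == "__main__":
    print(linguistic_features("BEST pizza EVER best pizza"))
```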

Disadvantages

The fact that anyone with any identity can leave comments as a review provides a tempting opportunity for spammers to write fake reviews designed to mislead users’ opinions. These misleading reviews are then multiplied by the sharing function of social media and by propagation over the web.
Many aspects have been missed or remain unsolved.
Previous works also aimed to address the importance of features, mainly in terms of obtained accuracy, but not as a built-in function of their framework (i.e., their approach depends on ground truth to determine each feature’s importance).

Proposed System

The general concept of our proposed framework is to model a given review dataset as a Heterogeneous Information Network (HIN) and to map the problem of spam detection into a HIN classification problem.
In particular, we model the review dataset as a HIN in which reviews are connected through different node types (such as features and users). A weighting algorithm is then employed to calculate each feature’s importance (or weight). These weights are used to calculate the final labels for reviews using both unsupervised and supervised approaches.
We propose the NetSpam framework, a novel network-based approach that models review networks as heterogeneous information networks. The classification step uses different metapath types, which are innovative in the spam detection domain.
A new weighting method for spam features is proposed to determine the relative importance of each feature; it shows how effective each feature is in distinguishing spam from normal reviews.
NetSpam improves the accuracy compared to the state of the art in terms of time complexity, which depends heavily on the number of features used to identify a spam review; hence, using features with higher weights results in detecting fake reviews more easily with less time complexity.
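
The text does not give the weighting formula, so the sketch below shows only one plausible way to turn metapaths into feature weights. The assumption is that a feature's weight is the share of its review-to-review connections that join two reviews labeled as spam. The function name `feature_weights` and the input layout are hypothetical.

```python
def feature_weights(metapaths, spam_labels):
    """Weight each feature by how often its connections link two spam reviews.

    metapaths:   {feature_name: iterable of (review_a, review_b) pairs}
    spam_labels: {review_id: 1 for spam, 0 for genuine}; missing reviews count as 0
    This ratio is an illustrative assumption, not the formula from the paper.
    """
    weights = {}
    for feature, pairs in metapaths.items():
        pairs = list(pairs)
        if not pairs:
            weights[feature] = 0.0
            continue
        spam_links = sum(
            spam_labels.get(a, 0) * spam_labels.get(b, 0) for a, b in pairs
        )
        weights[feature] = spam_links / len(pairs)
    return weights


if __name__ == "__main__":
    metapaths = {
        "rate_deviation": [("r1", "r2"), ("r1", "r3")],
        "capital_ratio": [("r2", "r4")],
    }
    labels = {"r1": 1, "r2": 1, "r3": 0, "r4": 0}
    print(feature_weights(metapaths, labels))
```

Under this assumption, a feature whose connections mostly join known spam reviews receives a weight close to 1, and a feature that mostly joins genuine or mixed pairs receives a weight close to 0.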

Advantages

Improved Accuracy
Easier detection of fake reviews
Less time complexity
As we explain in our unsupervised approach, NetSpam is able to find feature importance even without ground truth, relying only on the metapath definition and on values calculated for each review.
No previous method uses the importance of features (known as weights in our proposed framework, NetSpam) in the classification step. By using these weights, on the one hand we involve feature importance in calculating the final labels, and hence the accuracy of NetSpam increases gradually.
On the other hand, we can determine which features provide better performance in terms of their involvement in connecting spam reviews (in the proposed network).
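
To make the role of weights in the final labels concrete, here is a hedged sketch of one possible scoring rule. It assumes a small set of known spam reviews and combines the weights of all features that connect an unlabeled review to that set with a noisy-OR, score = 1 - product of (1 - weight). Both the rule and the name `spam_scores` are assumptions for illustration and are not stated in the text.

```python
def spam_scores(metapaths, weights, known_spam):
    """Score unlabeled reviews by their weighted connections to known spam.

    metapaths:  {feature_name: iterable of (review_a, review_b) pairs}
    weights:    {feature_name: weight in [0, 1]}
    known_spam: set of review ids already labeled as spam
    The noisy-OR combination is an illustrative assumption.
    """
    remaining = {}  # review -> running product of (1 - weight)
    for feature, pairs in metapaths.items():
        w = weights.get(feature, 0.0)
        for a, b in pairs:
            # Check the connection in both directions.
            for review, neighbor in ((a, b), (b, a)):
                if neighbor in known_spam and review not in known_spam:
                    remaining[review] = remaining.get(review, 1.0) * (1.0 - w)
    return {review: 1.0 - product for review, product in remaining.items()}


if __name__ == "__main__":
    metapaths = {"f1": [("r1", "r3")], "f2": [("r1", "r3"), ("r2", "r4")]}
    weights = {"f1": 0.5, "f2": 0.2}
    scores = spam_scores(metapaths, weights, known_spam={"r1", "r2"})
    # Rank reviews from most to least suspicious.
    print(sorted(scores, key=scores.get, reverse=True))
```

Dropping a low-weight feature from `metapaths` removes its connections from the loop, which is one way to see how using fewer, higher-weighted features can reduce the work done per review.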

Related Work

In the last decade, a great number of research studies have focused on the problem of spotting spammers and spam reviews. However, since the problem is non-trivial and challenging, it remains far from fully solved. We can summarize our discussion of previous studies in the following three categories.

Linguistic-based Methods This approach extracts linguistic features to find spam reviews. Feng et al. use unigrams, bigrams and their composition. Other studies use other features, such as pairwise features (features between two reviews, e.g. content similarity) and the percentage of CAPITAL words in a review, to find spam reviews. Lai et al. use probabilistic language modeling to spot spam. This study demonstrates that 2% of reviews written on business websites are actually spam.

Behavior-based Methods Approaches in this group mostly use review metadata to extract features that capture the normal patterns of a reviewer’s behavior. Feng et al. in [21] focus on the distribution of spammers’ ratings on different products and trace them. In [34], Jindal et al. extract 36 behavioral features and use a supervised method to find spammers on Amazon, and [14] indicates that behavioral features reveal spammers’ identity better than linguistic ones. Xue et al. in [32] use the rate deviation of a specific user and a trust-aware model to find the relationship between users for calculating the final spamicity score. Minnich et al. in [8] use temporal and location features of users to find unusual behavior of spammers. Li et al. in [10] use some basic features (e.g. polarity of reviews) and then run an HNC (Heterogeneous Network Classifier) to find final labels on the Dianping dataset. Mukherjee et al. in [16] mostly employ behavioral features such as rate deviation and extremity. Xie et al. in [17] also use a temporal pattern (time window) to find singleton reviews (reviews written just once) on Amazon. Luca et al. in [26] use behavioral features to show that increasing competition between companies leads to a very large expansion of spam reviews on products. Crawford et al. in [28] indicate that different classification approaches need different numbers of features to attain the desired performance, and propose approaches that use fewer features to attain that performance and hence recommend .
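
Rate deviation, one of the behavioral features mentioned above, can be sketched as follows. The definition used here is an assumption: the absolute difference between a review's rating and the average rating of the same item, divided by the largest possible gap on a 1 to 5 star scale (`max_gap=4.0`). The cited works may define it differently.

```python
from collections import defaultdict


def rate_deviation(reviews, max_gap=4.0):
    """Return one normalized rate deviation per review.

    reviews: list of (user_id, item_id, rating) tuples
    max_gap: largest possible rating difference (assumed 1-5 star scale)
    """
    # Accumulate the rating sum and count per item.
    totals = defaultdict(lambda: [0.0, 0])
    for _, item, rating in reviews:
        totals[item][0] += rating
        totals[item][1] += 1

    # Deviation of each rating from its item's average, scaled to [0, 1].
    return [
        abs(rating - totals[item][0] / totals[item][1]) / max_gap
        for _, item, rating in reviews
    ]


if __name__ == "__main__":
    sample = [("u1", "p1", 5), ("u2", "p1", 5), ("u3", "p1", 2)]
    print(rate_deviation(sample))
```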

Graph-based Methods Studies in this group build a graph of users, reviews and items, and use the connections in the graph together with network-based algorithms to rank or label reviews (as spam or genuine) and users (as spammer or honest). Akoglu et al. in [11] use a network-based algorithm known as LBP (Loopy Belief Propagation), whose iterations scale linearly with the number of edges, to find final probabilities for the different components in the network. Fei et al. in [7] also use the same algorithm (LBP), and use the burstiness of each review to find spammers and spam reviews on Amazon. Li et al. in [10] build a graph of users, reviews and users’ IPs, and indicate that users with the same IP have the same label; for example, if a user with multiple accounts and the same IP writes several reviews, those reviews are supposed to have the same label. Wang et al.

Conclusion

This study introduces a novel spam detection framework, NetSpam, based on a metapath concept, as well as a new graph-based method that labels reviews using a rank-based labeling approach. The performance of the proposed framework is evaluated on two real-world labeled datasets from the Yelp and Amazon websites. Our observations show that the weights calculated using this metapath concept can be very effective in identifying spam reviews and lead to better performance. In addition, we found that even without a training set, NetSpam can calculate the importance of each feature, that it yields better performance in the feature addition process, and that it performs better than previous works with only a small number of features. Moreover, after defining four main categories for features, our observations show that the review-behavioral category performs better than the other categories in terms of AP and AUC as well as in the calculated weights. The results also confirm that using different levels of supervision, as in the semi-supervised method, has no noticeable effect on determining most of the weighted features, just as across different datasets.
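
The two metrics mentioned above can be computed from a ranked list of reviews. The sketch below assumes that AP denotes average precision and AUC the area under the ROC curve, and that each review has a spam score and a 0/1 ground-truth label. The function names are illustrative, and score ties count as half a win in the AUC.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the rank of each spam review."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def auc(scores, labels):
    """Probability that a random spam review outscores a random genuine one."""
    spam = [s for s, l in zip(scores, labels) if l == 1]
    genuine = [s for s, l in zip(scores, labels) if l == 0]
    if not spam or not genuine:
        return 0.0
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in spam for n in genuine
    )
    return wins / (len(spam) * len(genuine))


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.1]
    labels = [1, 0, 1, 0]
    print(average_precision(scores, labels), auc(scores, labels))
```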
