
NetSpam: a Network-based Spam Detection Framework for Reviews in Online Social Media
Abstract
Introduction
Online social media portals play an influential role in information propagation, which makes them an important resource for producers in their advertising campaigns as well as for customers in selecting products and services. In recent years, people have relied heavily on written reviews in their decision-making, with positive and negative reviews encouraging or discouraging them in their selection of products and services. Written reviews also help service providers enhance the quality of their products and services. Reviews have thus become an important factor in the success of a business: while positive reviews can bring benefits to a company, negative reviews can damage its credibility and cause economic losses. The fact that anyone, under any identity, can leave a review provides a tempting opportunity for spammers to write fake reviews designed to mislead users’ opinions. These misleading reviews are then multiplied by the sharing functions of social media and by propagation over the web. Reviews written to change users’ perception of how good a product or service is are considered spam, and are often written in exchange for money.
20% of the reviews on the Yelp website are actually spam reviews. Meanwhile, a considerable amount of literature has been published on techniques for identifying spam and spammers, as well as on different types of analysis of this topic. These techniques can be classified into different categories: some use linguistic patterns in text, mostly based on unigrams and bigrams; others are based on behavioral patterns and rely on features extracted from users’ behavior, which are mostly metadata-based; and some use graphs and graph-based algorithms and classifiers. Despite this great deal of effort, many aspects have been missed or remain unsolved. One of them is a classifier that can calculate feature weights showing each feature’s level of importance in determining spam reviews.
The general concept of our proposed framework is to model a given review dataset as a Heterogeneous Information Network (HIN) and to map the problem of spam detection onto a HIN classification problem. In particular, we model the review dataset as a HIN in which reviews are connected through different node types (such as features and users). A weighting algorithm is then employed to calculate each feature’s importance (or weight). These weights are used to calculate the final labels for reviews using both unsupervised and supervised approaches. To evaluate the proposed solution, we used two sample review datasets from the Yelp and Amazon websites. Based on our observations, when features are defined along two views (review-user and behavioral-linguistic), the features classified as review-behavioral receive higher weights and yield better performance in spotting spam reviews in both the semi-supervised and unsupervised approaches. In addition, we demonstrate that using different levels of supervision, such as 1%, 2.5% and 5%, or using an unsupervised approach, makes no noticeable difference to the performance of our approach. We observed that weighted features can be added or removed for labeling, so time complexity can be scaled for a specific level of accuracy. As a result of this weighting step, we can use fewer features with higher weights to obtain better accuracy with less time complexity. In addition, categorizing features into four major categories (review-behavioral, user-behavioral, review-linguistic, user-linguistic) helps us understand how much each category of features contributes to spam detection. In summary, our main contributions are as follows:
- We propose the NetSpam framework, a novel network-based approach that models review networks as heterogeneous information networks. The classification step uses different metapath types, which are new to the spam detection domain.
- A new weighting method for spam features is proposed to determine the relative importance of each feature and to show how effective each feature is in distinguishing spam from normal reviews. Previous works [12], [20] also aimed to address the importance of features, mainly in terms of the accuracy obtained, but not as a built-in function of their framework (i.e., their approach depends on ground truth to determine each feature’s importance). As we explain in our unsupervised approach, NetSpam is able to find feature importance even without ground truth, relying only on the metapath definition and on the values calculated for each review.
- NetSpam improves on the state of the art in both accuracy and time complexity. Time complexity depends heavily on the number of features used to identify a spam review; hence, using features with higher weights makes fake reviews easier to detect with less time complexity.
Existing System
- Existing techniques can be classified into different categories: some use linguistic patterns in text, mostly based on unigrams and bigrams; others are based on behavioral patterns and rely on features extracted from users’ behavior, which are mostly metadata-based; and some use graphs and graph-based algorithms and classifiers.
- Existing systems can be summarized in three categories: Linguistic-based Methods, Behavior-based Methods and Graph-based Methods.
- Feng et al. use unigrams, bigrams and their composition. Other studies use other features, such as pairwise features (features between two reviews, e.g. content similarity) and the percentage of CAPITAL words in a review, to find spam reviews.
- Lai et al. use probabilistic language modeling to spot spam. This study demonstrates that 2% of reviews written on business websites are actually spam.
- A deeper analysis of the literature shows that behavioral features work better than linguistic ones in terms of the accuracy they yield.
Disadvantages
- The fact that anyone, under any identity, can leave a review provides a tempting opportunity for spammers to write fake reviews designed to mislead users’ opinions. These misleading reviews are then multiplied by the sharing functions of social media and by propagation over the web.
- Many aspects have been missed or remain unsolved.
- Previous works also aimed to address the importance of features, mainly in terms of the accuracy obtained, but not as a built-in function of their framework (i.e., their approach depends on ground truth to determine each feature’s importance).
Proposed System
- The general concept of our proposed framework is to model a given review dataset as a Heterogeneous Information Network (HIN) and to map the problem of spam detection onto a HIN classification problem.
- In particular, we model the review dataset as a HIN in which reviews are connected through different node types (such as features and users). A weighting algorithm is then employed to calculate each feature’s importance (or weight). These weights are used to calculate the final labels for reviews using both unsupervised and supervised approaches.
- We propose the NetSpam framework, a novel network-based approach that models review networks as heterogeneous information networks. The classification step uses different metapath types, which are new to the spam detection domain.
- A new weighting method for spam features is proposed to determine the relative importance of each feature and to show how effective each feature is in distinguishing spam from normal reviews.
- NetSpam improves on the state of the art in both accuracy and time complexity. Time complexity depends heavily on the number of features used to identify a spam review; hence, using features with higher weights makes fake reviews easier to detect with less time complexity.
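The modeling and weighting steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the framework's actual implementation: the discretization of feature values into levels, the rule that two reviews are linked through a feature when they fall into the same level, and the weight formula (the share of a feature's link mass that joins two known-spam reviews) are all assumed for the example, as are the names `level`, `metapath_value`, `feature_weights` and the toy data.

```python
import math
from itertools import combinations


def level(value, steps):
    """Discretize a feature value in [0, 1] into one of `steps` levels (assumed scheme)."""
    return min(math.floor(value * steps), steps - 1) / steps


def metapath_value(x_u, x_v, steps=4):
    """Assumed link rule: two reviews are connected through a feature only
    when their values fall into the same level; the level is the link strength."""
    l_u, l_v = level(x_u, steps), level(x_v, steps)
    return l_u if l_u == l_v else 0.0


def feature_weights(features, labels, steps=4):
    """features: {review_id: {feature_name: value in [0, 1]}}
    labels:   {review_id: 1 for spam, 0 for genuine} for the labeled reviews.
    Returns, per feature, the fraction of link mass joining two spam reviews."""
    names = sorted({name for values in features.values() for name in values})
    weights = {}
    for name in names:
        linked = 0.0
        linked_spam = 0.0
        for u, v in combinations(sorted(labels), 2):
            mp = metapath_value(features[u][name], features[v][name], steps)
            linked += mp
            linked_spam += mp * labels[u] * labels[v]
        weights[name] = linked_spam / linked if linked else 0.0
    return weights


# Toy data (hypothetical): four reviews, two features scaled to [0, 1].
features = {
    "r1": {"rating_deviation": 0.90, "capital_ratio": 0.30},
    "r2": {"rating_deviation": 0.80, "capital_ratio": 0.60},
    "r3": {"rating_deviation": 0.30, "capital_ratio": 0.35},
    "r4": {"rating_deviation": 0.85, "capital_ratio": 0.10},
}
labels = {"r1": 1, "r2": 1, "r3": 0, "r4": 1}

weights = feature_weights(features, labels)
print(weights)
```

In this toy data every link created by `rating_deviation` joins two spam reviews, while the only link created by `capital_ratio` joins a spam review and a genuine one, so the first feature ends up with the higher weight.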
Advantages
- Improved Accuracy
- Easier detection of fake reviews
- Lower time complexity
- As we explain in our unsupervised approach, NetSpam is able to find feature importance even without ground truth, relying only on the metapath definition and on the values calculated for each review.
- No previous method uses the importance of features (known as weights in our proposed framework, NetSpam) in the classification step. By using these weights, on the one hand we involve feature importance in calculating the final labels, and hence the accuracy of NetSpam gradually increases.
- On the other hand, we can determine which features provide better performance in terms of their involvement in connecting spam reviews (in the proposed network).
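One way to read the points above is that a feature's weight scales how much a shared feature level contributes to a review's final spam score, and that low-weight features can be dropped to save time. The sketch below illustrates that idea under assumptions that are not taken from the text: the weights are given as input, links are combined with a noisy-OR rule, a review's score is the average over all other reviews, and every name (`shared_level`, `spam_scores`, `top_features`) and value is hypothetical.

```python
import math


def shared_level(x_u, x_v, steps=4):
    """Assumed link rule: strength is the common level, or 0 if levels differ."""
    k_u = min(math.floor(x_u * steps), steps - 1)
    k_v = min(math.floor(x_v * steps), steps - 1)
    return k_u / steps if k_u == k_v else 0.0


def spam_scores(features, weights, steps=4):
    """Score each review by averaging, over all other reviews, a noisy-OR
    combination of weighted links (an assumed combination rule)."""
    scores = {}
    for u in features:
        pair_probs = []
        for v in features:
            if v == u:
                continue
            not_linked = 1.0
            for name, weight in weights.items():
                mp = shared_level(features[u][name], features[v][name], steps)
                not_linked *= 1.0 - mp * weight
            pair_probs.append(1.0 - not_linked)
        scores[u] = sum(pair_probs) / len(pair_probs) if pair_probs else 0.0
    return scores


def top_features(weights, k):
    """Keep only the k highest-weighted features to reduce computation."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


# Toy data and weights (hypothetical values).
features = {
    "r1": {"rating_deviation": 0.90, "capital_ratio": 0.30},
    "r2": {"rating_deviation": 0.80, "capital_ratio": 0.60},
    "r3": {"rating_deviation": 0.30, "capital_ratio": 0.35},
    "r4": {"rating_deviation": 0.85, "capital_ratio": 0.10},
}
weights = {"rating_deviation": 0.8, "capital_ratio": 0.1}

scores = spam_scores(features, weights)
reduced = spam_scores(features, top_features(weights, 1))
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Reviews are then labeled by rank: those at the top of `ranking` are the most likely spam. Comparing `scores` with `reduced` shows how the ordering changes when only the highest-weighted feature is kept.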

Related Work
In the last decade, a great number of research studies have focused on the problem of spotting spammers and spam reviews. However, since the problem is non-trivial and challenging, it remains far from fully solved. We can summarize our discussion of previous studies in the following three categories.
Linguistic-based Methods. This approach extracts linguistic features to find spam reviews. Feng et al. use unigrams, bigrams and their composition. Other studies use other features, such as pairwise features (features between two reviews, e.g. content similarity) and the percentage of CAPITAL words in a review, to find spam reviews. Lai et al. use probabilistic language modeling to spot spam. This study demonstrates that 2% of reviews written on business websites are actually spam.
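The linguistic features named in this paragraph are simple to compute. The following sketch shows unigram and bigram extraction, the share of fully capitalized words, and a pairwise content similarity; the tokenizer, the choice of cosine similarity over word counts, and all function names are assumptions made for illustration rather than the definitions used in the cited studies.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Split text into alphabetic word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[A-Za-z']+", text)


def ngrams(words, n):
    """Return the list of n-grams (n=1 unigrams, n=2 bigrams) as strings."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def capital_word_ratio(text):
    """Fraction of words written entirely in capital letters."""
    words = tokens(text)
    if not words:
        return 0.0
    capitals = [w for w in words if len(w) > 1 and w.isupper()]
    return len(capitals) / len(words)


def content_similarity(text_a, text_b):
    """Cosine similarity between the unigram count vectors of two reviews."""
    a = Counter(w.lower() for w in tokens(text_a))
    b = Counter(w.lower() for w in tokens(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


review_a = "BEST pizza EVER, best service"
review_b = "best pizza ever"

words_a = [w.lower() for w in tokens(review_a)]
unigrams = ngrams(words_a, 1)
bigrams = ngrams(words_a, 2)
caps = capital_word_ratio(review_a)
similarity = content_similarity(review_a, review_b)
print(bigrams, caps, round(similarity, 3))
```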
Behavior-based Methods. Approaches in this group mostly use review metadata to extract features that describe patterns in a reviewer’s behavior. Feng et al. in [21] focus on the distribution of spammers’ ratings on different products and trace them. In [34], Jindal et al. extract 36 behavioral features and use a supervised method to find spammers on Amazon, and [14] indicates that behavioral features reveal spammers’ identity better than linguistic ones. Xue et al. in [32] use the rate deviation of a specific user together with a trust-aware model that finds the relationships between users to calculate a final spamicity score. Minnich et al. in [8] use temporal and location features of users to find unusual behavior of spammers. Li et al. in [10] use some basic features (e.g. polarity of reviews) and then run an HNC (Heterogeneous Network Classifier) to find final labels on Dianping’s dataset. Mukherjee et al. in [16] mostly use behavioral features such as rate deviation and extremity. Xie et al. in [17] also use a temporal pattern (time window) to find singleton reviews (reviews written just once) on Amazon. Luca et al. in [26] use behavioral features to show that increasing competition between companies leads to a very large expansion of spam reviews on products. Crawford et al. in [28] indicate that different classification approaches need different numbers of features to attain a desired performance, and they propose, and hence recommend, approaches that use fewer features to attain that performance.
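Several of the behavioral features mentioned here can be derived directly from review metadata. The sketch below computes a rate (rating) deviation per review, the set of users who wrote only one review, and the largest number of reviews a user posted within a short time window. The record layout, the 1 to 5 rating scale, the 3-day window and all names are assumptions for the example and are not taken from the cited works.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical review metadata records.
reviews = [
    {"id": "r1", "user": "u1", "item": "p1", "rating": 5, "day": date(2024, 1, 1)},
    {"id": "r2", "user": "u1", "item": "p2", "rating": 5, "day": date(2024, 1, 2)},
    {"id": "r3", "user": "u1", "item": "p3", "rating": 5, "day": date(2024, 1, 2)},
    {"id": "r4", "user": "u2", "item": "p1", "rating": 2, "day": date(2024, 1, 10)},
    {"id": "r5", "user": "u3", "item": "p1", "rating": 2, "day": date(2024, 2, 1)},
]


def rating_deviation(reviews, max_gap=4):
    """Absolute gap between a review's rating and its item's average rating,
    scaled to [0, 1] for a 1-5 rating scale (max_gap = 5 - 1)."""
    by_item = defaultdict(list)
    for r in reviews:
        by_item[r["item"]].append(r["rating"])
    average = {item: sum(vals) / len(vals) for item, vals in by_item.items()}
    return {r["id"]: abs(r["rating"] - average[r["item"]]) / max_gap for r in reviews}


def singleton_users(reviews):
    """Users who wrote exactly one review in the dataset."""
    counts = Counter(r["user"] for r in reviews)
    return {user for user, n in counts.items() if n == 1}


def max_reviews_in_window(reviews, window_days=3):
    """For each user, the largest number of reviews posted within any
    span of `window_days` consecutive days (sliding window over sorted dates)."""
    by_user = defaultdict(list)
    for r in reviews:
        by_user[r["user"]].append(r["day"])
    result = {}
    for user, days in by_user.items():
        days.sort()
        best, start = 0, 0
        for end in range(len(days)):
            while (days[end] - days[start]).days >= window_days:
                start += 1
            best = max(best, end - start + 1)
        result[user] = best
    return result


deviation = rating_deviation(reviews)
singletons = singleton_users(reviews)
bursts = max_reviews_in_window(reviews)
print(deviation, singletons, bursts)
```

In this toy data, user `u1` posts three top ratings within two days and review `r1` deviates most from its item's average, which is the kind of pattern these behavioral features are meant to surface.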
Graph-based Methods. Studies in this group build a graph of users, reviews and items, and use the connections in the graph together with network-based algorithms to rank or label reviews (as spam or genuine) and users (as spammer or honest). Akoglu et al. in [11] use a network-based algorithm known as LBP (Loopy Belief Propagation), whose iterations scale linearly with the number of edges, to find final probabilities for the different components in the network. Fei et al. in [7] also use the same algorithm (LBP), and utilize the burstiness of each review to find spammers and spam reviews on Amazon. Li et al. in [10] build a graph of users, reviews and users’ IPs, and indicate that users with the same IP have the same labels; for example, if a user with multiple accounts and the same IP writes several reviews, those reviews are supposed to have the same label. Wang et al.
Conclusion
This study introduces a novel spam detection framework, NetSpam, based on a metapath concept, as well as a new graph-based method that labels reviews using a rank-based labeling approach. The performance of the proposed framework is evaluated on two real-world labeled datasets from the Yelp and Amazon websites. Our observations show that the weights calculated using this metapath concept can be very effective in identifying spam reviews and lead to better performance. In addition, we found that even without a training set, NetSpam can calculate the importance of each feature; it yields better performance as features are added, and it outperforms previous works with only a small number of features. Moreover, after defining four main categories of features, our observations show that the review-behavioral category performs better than the other categories in terms of AP and AUC as well as in the calculated weights. The results also confirm that using different levels of supervision, as in the semi-supervised method, has no noticeable effect on determining most of the weighted features, and the same holds across different datasets.