Cross-Domain Sentiment Classification Using Sentiment Sensitive Embeddings

Abstract

Unsupervised cross-domain sentiment classification is the task of adapting a sentiment classifier trained on a particular domain (the source domain) to a different domain (the target domain) without requiring any labeled data for the target domain. By adapting an existing sentiment classifier to previously unseen target domains, we can avoid the cost of manual data annotation for the target domain. We model this problem as embedding learning, and construct three objective functions that capture:

  1. distributional properties of pivots (i.e. common features that appear in both source and target domains),
  2. label constraints in the source domain documents, and,
  3. geometric properties in the unlabeled documents in both source and target domains.

Unlike prior proposals that first learn a lower-dimensional embedding independently of the source domain sentiment labels, and then train a sentiment classifier in this embedding, our joint optimisation method learns embeddings that are sensitive to sentiment classification. Experimental results on a benchmark dataset show that jointly optimising the three objectives gives better performance than optimising each objective function separately, thereby demonstrating the importance of task-specific embedding learning for cross-domain sentiment classification. Among the individual objective functions, the best performance is obtained by the third (geometric properties of the unlabeled documents). Moreover, the proposed method reports cross-domain sentiment classification accuracies that are statistically comparable to the current state-of-the-art embedding learning methods for cross-domain sentiment classification.

Introduction

Most work on sentiment analysis focuses on training and testing a sentiment classifier on data from the same domain. For example, a classifier is trained on tweets and tested on tweets. However, in real-world scenarios the data might originate from different sources and domains, and sentiment analysis often has to be performed on a domain for which no training data is available. Instead of investing large amounts of money to create such a corpus, it would make more sense to reuse an existing classifier. However, it is not always clear how well the existing classifier generalizes to the target domain. Although performance can be expected to suffer, the magnitude of the drop is not known. This missing information is useful for assessing whether it is worth building a new classifier for a given domain, which is very costly.

Thus, our work is driven by the question of how useful sentiment classifiers are when they are evaluated on datasets from unseen domains, and whether combining data from different domains can help to overcome the recurring problem of having too little data. Furthermore, we assess the usefulness of large weakly supervised corpora, where the labels are inferred from properties of the text, e.g. the smileys in the text or the rating of a review, and we quantify how much gain one can expect from leveraging such corpora. Cross-domain sentiment analysis usually performs poorly because of the vocabulary mismatch between domains (Pan et al., 2010). We therefore assess the impact of word embeddings trained on large amounts of data, which guarantees a large coverage of the vocabulary. We then assess how word embeddings trained on different types of data (e.g. news, Twitter) affect the performance of the system. For this, we train a convolutional neural network (CNN) based on Deriu et al. (2016) on data from different combinations of domains and evaluate its performance on foreign domains.

Our contributions in this work can be summarized as follows.

  • We propose a fully automatic method to create a thesaurus that is sensitive to the sentiment of words expressed in different domains. We utilize both labeled and unlabeled data available for the source domains and unlabeled data from the target domain.
  • We propose a method to use the created thesaurus to expand feature vectors at train and test times in a binary classifier.
  • We compare the sentiment classification accuracy of our proposed method against numerous baselines and previously proposed cross-domain sentiment classification methods for both single source and multi-source adaptation settings.
  • We conduct a series of experiments to evaluate the potential applicability of the proposed method in real-world domain adaptation settings. The performance of the proposed method directly depends on the sentiment sensitive thesaurus we use for feature expansion.

RELATED WORKS

Supervised sentiment classification determines the sentiment of a text according to the opinion and attitude it expresses towards a given entity. It has attracted growing attention because of its many applications. However, supervised sentiment classification requires the labeled and unlabeled data to follow the same distribution, so that a classifier built from the labeled data can be applied to the unlabeled data. In cross-domain classification, the labeled and unlabeled data come from different domains and often have different distributions. Bollegala addresses this with sentiment sensitive embeddings and with a thesaurus created for classification. In the embedding technique, an embedding is learnt in the training phase from the distributional properties of pivots, the label constraints in the source domain, and the geometric properties of the unlabeled source and target documents. In the thesaurus technique, a thesaurus that is sensitive to the sentiment of words in different domains is created automatically, and the created thesaurus is used to expand the feature vectors when training and testing a binary classifier.

Another concept used for classification is part-of-speech (POS) tagging. Xia uses a POS-based ensemble model to integrate features with different types of POS tags (adjectives, adverbs, nouns, etc.) and so improve classification performance. POS information is assumed to be a significant indicator of sentiment expression: across domains, adjectives and adverbs are found to be significant, while nouns become less important. Two classifiers are employed to select informative samples with the Query by Committee (QBC) selection strategy. Finally, the two classifiers are combined to make the classification decision. Importantly, the two classifiers are trained on the unlabeled data in the target domain using the Label Propagation (LP) algorithm.

Subsequently, the Spectral Feature Alignment (SFA) algorithm was proposed to align domain-specific words from the source and target domains into meaningful clusters. The algorithm reduces the gap between the two domains by aligning domain-specific words from different domains into unified clusters, using domain-independent words as a bridge. Word polarity is informative for text classification, but the polarity of some words cannot be identified without knowledge of the domain, so detecting word polarity across multiple domains is challenging. To overcome the disadvantages of transfer learning techniques, Yoshida proposed a novel Bayesian probabilistic model that handles multiple source and multiple target domains. In this model, each word is associated with three factors: domain label, domain dependence/independence, and word polarity. Transfer learning utilizes the results learned in a source domain to solve a similar problem in a target domain.

System Configuration:

H/W System Configuration:

Processor            : Pentium IV
Speed                : 1 GHz
RAM                  : 512 MB (min)
Hard Disk            : 20 GB
Keyboard             : Standard Keyboard
Mouse                : Two or Three Button Mouse
Monitor              : LCD/LED Monitor

S/W System Configuration:

Operating System     : Windows XP/7
Programming Language : Java/J2EE
Software Version     : JDK 1.7 or above
Database             : MySQL

ENHANCED SENTIMENT SENSITIVE THESAURUS

In this technique, an enhanced sentiment sensitive thesaurus (ESST) is created, which aligns semantically similar features from different domains and also includes sentiment features from Wiktionary. The focus of this method is to extract additional features from Wiktionary with the help of the Java Wiktionary Library (JWKTL); these are then appended to the ESST to improve performance. First, the sentences are split into parts, part-of-speech (POS) tagging is performed, and the words are lemmatized using RAS. Lemmatization converts singular and plural word forms to their base form, and unwanted words are eliminated. With the help of the POS tags, unigrams and bigrams are extracted from the reviews. Next, sentiment features are created by appending the label of the review to each feature extracted from the labeled reviews of each source domain. The notation *P indicates positive features and *N indicates negative features.
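The creation of sentiment features from a labeled review can be sketched in Java as follows. This is only a minimal illustration, assuming the review has already been POS-filtered and lemmatized into a list of tokens. The class name `SentimentFeatureSketch`, the method name, and the `+` used to join the two words of a bigram are hypothetical choices, not part of the described system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SentimentFeatureSketch {

    /**
     * Returns the unigrams and bigrams of an already lemmatized review,
     * each suffixed with *P (positive review) or *N (negative review).
     * The "+" bigram separator is an assumption made for this sketch.
     */
    static List<String> sentimentFeatures(List<String> lemmas, boolean positiveReview) {
        String suffix = positiveReview ? "*P" : "*N";
        List<String> features = new ArrayList<String>();
        for (int i = 0; i < lemmas.size(); i++) {
            // Unigram feature.
            features.add(lemmas.get(i) + suffix);
            // Bigram feature built from this lemma and the next one, if any.
            if (i + 1 < lemmas.size()) {
                features.add(lemmas.get(i) + "+" + lemmas.get(i + 1) + suffix);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Prints [excellent*P, excellent+read*P, read*P]
        System.out.println(sentimentFeatures(Arrays.asList("excellent", "read"), true));
    }
}
```

Only labeled source-domain reviews carry a label, so unlabeled reviews would contribute plain unigrams and bigrams without a suffix.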

Domain-independent (DI) features are then extracted by computing the mutual information between each feature and the domain. A feature with high mutual information with the domain is considered domain-specific, whereas a feature with low mutual information is considered domain-independent. Using the domain-independent and domain-specific features, a co-occurrence matrix is created, and the semantics of each word is estimated from the words it co-occurs with in the documents. The entries of the co-occurrence matrix are then weighted using the pointwise mutual information (PMI) equation. After the PMI values are computed, domain-specific features from the various domains are aligned with the help of the DI features.
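The split into domain-specific and domain-independent features can be illustrated with a short Java sketch. It assumes that mutual information is estimated from document counts (how many documents of each domain contain the feature) and measured in bits. The class name, the method names, and the `threshold` parameter are hypothetical, and the text does not state which estimator or cut-off the system actually uses.

```java
final class DomainMutualInformationSketch {

    /**
     * Mutual information (in bits) between "feature occurs in a document"
     * and the domain of that document, estimated from document counts.
     * docsWithFeature[d] = documents of domain d containing the feature,
     * docsPerDomain[d]   = all documents of domain d.
     */
    static double mutualInformation(int[] docsWithFeature, int[] docsPerDomain) {
        double total = 0.0;
        double withFeature = 0.0;
        for (int d = 0; d < docsPerDomain.length; d++) {
            total += docsPerDomain[d];
            withFeature += docsWithFeature[d];
        }
        // Marginal probabilities of the feature being present / absent.
        double[] marginal = { withFeature / total, (total - withFeature) / total };
        double mi = 0.0;
        for (int d = 0; d < docsPerDomain.length; d++) {
            double pDomain = docsPerDomain[d] / total;
            // Joint probabilities of (present, domain d) and (absent, domain d).
            double[] joint = {
                docsWithFeature[d] / total,
                (docsPerDomain[d] - docsWithFeature[d]) / total
            };
            for (int x = 0; x < 2; x++) {
                if (joint[x] > 0.0) { // terms with zero probability contribute nothing
                    mi += joint[x] * Math.log(joint[x] / (marginal[x] * pDomain)) / Math.log(2.0);
                }
            }
        }
        return mi;
    }

    /** Low mutual information with the domain means domain-independent (threshold is an assumption). */
    static boolean isDomainIndependent(int[] docsWithFeature, int[] docsPerDomain, double threshold) {
        return mutualInformation(docsWithFeature, docsPerDomain) < threshold;
    }

    public static void main(String[] args) {
        int[] docsPerDomain = { 100, 100 };
        System.out.println(mutualInformation(new int[] { 50, 50 }, docsPerDomain));
        System.out.println(mutualInformation(new int[] { 100, 0 }, docsPerDomain));
    }
}
```

With two equally sized domains, a feature that occurs equally often in both has zero mutual information with the domain, while a feature that occurs in every document of one domain and in none of the other reaches one bit.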

Semantically similar domain-specific features from the various domains are aligned by computing a similarity score between each domain-specific feature and every other domain-specific feature in the PMI-weighted matrix; the same procedure is followed to align the domain-independent features. For every domain-specific feature, the ESST lists the other domain-specific features in descending order of similarity score. The ESST also collects further semantically similar features from Wiktionary using JWKTL: seed adjectives extracted from the reviews are looked up in Wiktionary to obtain the corresponding glossaries, and the unigrams and bigrams obtained from the glossaries of each adjective are added to the ESST. The original features of the source domain are then augmented with suitable domain-specific features found in the ESST, which yields a new feature vector representation for cross-domain sentiment classification. This new representation is used to train a sentiment classifier that predicts the labels of target-domain reviews.
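A rough Java sketch of the ranking and augmentation steps is given below. The similarity measure of the system is not reproduced in this text, so cosine similarity between PMI-weighted rows is used here purely as a stand-in assumption. The class and method names, the parameter `k` (number of thesaurus entries added per feature), and the unweighted way of adding features are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ThesaurusExpansionSketch {

    /** Cosine similarity between two PMI-weighted rows (a stand-in for the similarity measure). */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an all-zero row is similar to nothing
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Indices of the k rows most similar to the target row, in descending order of similarity. */
    static List<Integer> mostSimilar(double[][] pmiRows, int target, int k) {
        final double[] score = new double[pmiRows.length];
        List<Integer> candidates = new ArrayList<Integer>();
        for (int i = 0; i < pmiRows.length; i++) {
            if (i != target) {
                score[i] = cosine(pmiRows[target], pmiRows[i]);
                candidates.add(i);
            }
        }
        Collections.sort(candidates, new Comparator<Integer>() {
            @Override
            public int compare(Integer a, Integer b) {
                return Double.compare(score[b], score[a]); // higher score first
            }
        });
        return new ArrayList<Integer>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    /** Augments a document's features with the top-k thesaurus entries of each known feature. */
    static Set<String> expand(List<String> docFeatures, List<String> vocabulary,
                              double[][] pmiRows, int k) {
        Set<String> expanded = new LinkedHashSet<String>(docFeatures);
        for (String feature : docFeatures) {
            int index = vocabulary.indexOf(feature);
            if (index < 0) {
                continue; // feature not covered by the thesaurus
            }
            for (int neighbour : mostSimilar(pmiRows, index, k)) {
                expanded.add(vocabulary.get(neighbour));
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("boring", "dull", "sharp");
        double[][] pmiRows = { { 1.0, 0.0 }, { 0.9, 0.1 }, { 0.0, 1.0 } };
        // Prints [boring, dull]
        System.out.println(expand(Arrays.asList("boring"), vocabulary, pmiRows, 1));
    }
}
```

In this toy example, "boring" and "dull" have nearly the same PMI row, so a review containing only "boring" also receives the feature "dull". This is how expansion lets a classifier trained on one domain recognise words that are typical of another.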

  1. Extract the domain-independent (DI) features and the domain-specific features from the given reviews.
  2. Create a co-occurrence matrix between the domain-independent features and the domain-specific features.
  3. Compute the pointwise mutual information (PMI) for each feature using equation (1).
  4. Create the enhanced sentiment sensitive thesaurus (ESST) by aligning the domain-specific features: compute the similarity measure of equation (2) between domain-specific features on the PMI-weighted matrix. Similarly, align the domain-independent features by computing the similarity measure between DI features.
  5. Extract the glossaries of each adjective from Wiktionary using the Java Wiktionary Library (JWKTL), giving seed adjectives from the reviews as input. Generate unigrams and bigrams from the glossaries and append them to the ESST.
  6. Find a new feature vector representation by feature augmentation when training the classifier.
  7. Test the classifier on the target domain.
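Steps 2 and 3 can be sketched in Java as follows. The equation used by the system is not reproduced in this text, so the sketch assumes the standard definition of pointwise mutual information, PMI(u, w) = log( p(u, w) / (p(u) p(w)) ), estimated from raw co-occurrence counts with the natural logarithm. The class name, the row/column layout, and the choice to leave zero counts at 0 are assumptions.

```java
final class PmiWeightingSketch {

    /**
     * Replaces each co-occurrence count c(u, w) by
     * log( c(u, w) * N / (c(u) * c(w)) ), where c(u) is the row sum,
     * c(w) the column sum and N the sum of all counts.
     * Rows are assumed to be domain-specific features and columns
     * domain-independent features; cells with a zero count stay 0.
     */
    static double[][] pmiWeight(double[][] cooccurrence) {
        int rows = cooccurrence.length;
        int cols = cooccurrence[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSum[i] += cooccurrence[i][j];
                colSum[j] += cooccurrence[i][j];
                total += cooccurrence[i][j];
            }
        }
        double[][] pmi = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                if (cooccurrence[i][j] > 0.0) {
                    pmi[i][j] = Math.log((cooccurrence[i][j] * total) / (rowSum[i] * colSum[j]));
                }
            }
        }
        return pmi;
    }

    public static void main(String[] args) {
        double[][] weighted = pmiWeight(new double[][] { { 2, 0 }, { 0, 2 } });
        System.out.println(weighted[0][0]); // equals Math.log(2.0) for this matrix
    }
}
```

A pair that co-occurs exactly as often as chance would predict gets a weight of 0, and pairs that co-occur more often than chance get a positive weight. The rows of the resulting matrix are the vectors on which the similarity measure of step 4 operates.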

SENTIMENT SENSITIVE EMBEDDINGS

Projecting the source and target features into the same lower-dimensional embedding, and subsequently learning a sentiment classifier on this embedded feature space, is a popular approach to cross-domain sentiment classification. It is particularly useful when there is little overlap between the original source and target feature spaces. A limitation of this two-step approach, which decouples embedding learning from sentiment classifier training, is that the embeddings learnt in the first step are agnostic to the sentiment of the documents, which is the ultimate goal in cross-domain sentiment classification. In the proposed technique, spectral embeddings are used to project words and documents into the same lower-dimensional space, and the objective functions are optimised jointly rather than separately.

Further work

Cross-domain sentiment classification is the task of classifying the sentiment of documents in a target domain using labeled data from a different domain. A major challenge in cross-domain sentiment classification is that sentiment is expressed using different words in different domains. The proposed system develops a cross-domain sentiment classifier using automatically extracted sentiment lexicons. To overcome the feature mismatch problem in cross-domain sentiment classification, it uses labeled data from multiple source domains and unlabeled data from the source and target domains to compute an improved relatedness between features and construct a sentiment corpus. The proposed system extends the feature vectors using the created corpus. In future, it can be extended to classify reviews as positive, negative, or neutral. It can also be extended to handle word polysemy across domains.

CONCLUSION

In this paper, three different techniques for cross-domain sentiment classification are studied. The spectral feature alignment technique uses the spectral feature alignment algorithm, which is based on spectral graphs. The enhanced sentiment sensitive thesaurus technique uses the Java Wiktionary Library to align sentiment features and improve performance. Lastly, the sentiment sensitive embeddings technique uses spectral embeddings to project words and documents into the same lower-dimensional embedding.