Understanding Short Texts Through Semantic Enrichment and Hashing

This is a management report in data mining. Clustering short texts (such as news titles) by their meaning is a challenging task. The semantic hashing approach encodes the meaning of a text into a compact binary code, so to tell whether two texts have similar meanings we only need to check whether they have similar codes. The encoding is created by a deep neural network, which is trained on texts represented as word-count vectors (the bag-of-words representation). Unfortunately, for short texts such as search queries, tweets, or news titles, such representations are insufficient to capture the underlying semantics.

To cluster short texts by their meanings, we propose to add more semantic signals to short texts. Specifically, for each term in a short text, we obtain its concepts and co-occurring terms from a probabilistic knowledge base to enrich the short text. Furthermore, we introduce a simplified deep learning network consisting of three stacked auto-encoders for semantic hashing. Comprehensive experiments show that, with more semantic signals, our simplified deep learning model is able to capture the semantics of short texts, which enables a variety of applications including short text retrieval, classification, and general-purpose text processing.


The widespread adoption of social media is based on tapping into the social nature of human interactions, by making it possible for people to voice their opinion, become part of a virtual community, and collaborate remotely. Taking micro-blogging as an example, Twitter has 100 million active users, posting over 230 million tweets a day. Engaging actively with such high-value, high-volume, brief life-span media streams has now become a daily challenge for both organisations and ordinary people. Automating this process through intelligent, semantic-based information access methods is therefore increasingly needed. This is an emerging research area, combining methods from many fields in addition to semantic technologies, e.g. speech and language processing, social science, machine learning, personalisation, and information retrieval. Traditional search methods are no longer able to address the more complex information seeking behaviour in social media, which has evolved towards sense making, learning and investigation, and social search.

Semantic technologies have the potential to help people cope better with social media-induced information overload. Automatic semantic-based methods that adapt to an individual's information-seeking goals and briefly summarise the relevant social media could ultimately support information interpretation and decision making over large-scale, dynamic media streams. Unlike carefully authored news and other textual web content, social media streams pose a number of new challenges for semantic technologies, due to their large-scale, noisy, irregular, and social nature. In this paper we discuss the following key research questions, examined through a survey of state-of-the-art approaches:

  1. What ontologies and Web of Data resources can be used to represent and reason about the semantics of social media streams?
  2. How can semantic annotation methods capture the rich semantics implicit in social media?
  3. How can we extract reliable information from these noisy, dynamic content streams?
  4. How can we model the users’ digital identity and social media activities?
  5. What semantic-based information access methods can help address the complex information seeking behaviour in social media?

To the best of our knowledge, this is the first comprehensive meta-review of semantic technology for mining and intelligent information access, where the focus is on current limitations and outstanding challenges, specifically arising in the context of social media streams.


Existing approaches mostly aim to classify queries into certain target categories. The short query classification problem is not as well-formed as other classification problems such as text classification; the difficulties include short and ambiguous queries and the lack of training data. Query enrichment takes a short query and maps it to intermediate objects; based on the collected intermediate objects, the query is then mapped to the target categories. As short texts do not provide sufficient term co-occurrence information, traditional text representation methods have several limitations when applied directly to short text tasks. Measuring the semantic similarity between two texts has been studied extensively in the IR and NLP communities.

However, the problem of assessing the similarity between two short text segments poses new challenges. Text segments commonly found in these tasks range from a single word to a dozen words. Because of the short length, the text segments do not provide enough context for surface matching methods, such as computing the cosine score of the two text segments, to be effective. Salakhutdinov and Hinton proposed a semantic hashing model based on Restricted Boltzmann Machines (RBMs) for long documents, and the experiments showed that their model achieved comparable accuracy with traditional methods, including Latent Semantic Analysis (LSA) and TF-IDF. Context is another problem when measuring the similarity between two short segments of text: while a document provides a reasonable amount of text to infer the contextual meaning of a term, a short segment of text only provides a limited context.
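The limitation of surface matching described above can be seen in a small sketch: a plain bag-of-words cosine score treats two short titles with the same meaning but no shared terms as unrelated (the titles below are made-up examples, not taken from any dataset):

```python
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two titles with the same meaning share no surface terms, so plain
# cosine matching scores them as completely unrelated.
print(cosine("apple unveils new tablet", "ipad launch announced"))  # 0.0
```

Semantic enrichment aims to fix exactly this: once both titles are expanded with shared concepts, their vectors overlap and the score becomes meaningful.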

System Configuration:

H/W System Configuration:

Processor            : Pentium IV

Speed                : 1 GHz

RAM                  : 512 MB (min)

Hard Disk            : 20 GB

Keyboard             : Standard Keyboard

Mouse                : Two- or Three-Button Mouse

Monitor              : LCD/LED Monitor

S/W System Configuration:

Operating System     : Windows XP/7

Programming Language : Java/J2EE

Software Version     : JDK 1.7 or above

Database             : MySQL


  • Many approaches have been proposed to facilitate short text understanding by enriching the short text.
  • More effectively, a short text can be enriched with explicit semantic information derived from external resources such as WordNet, Wikipedia, the Open Directory Project (ODP), etc.
  • Salakhutdinov and Hinton proposed a semantic hashing model based on Restricted Boltzmann Machines (RBMs) for long documents, and the experiments showed that their model achieved comparable accuracy with the traditional methods, including Latent Semantic Analysis (LSA) and TF-IDF.


  • Search-based approaches may work well for so-called head queries, but for tail or unpopular queries, it is very likely that some of the top search results are irrelevant, which means the enriched short text is likely to contain a lot of noise.
  • On the other hand, methods based on external resources are constrained by the coverage of those resources. Take WordNet, for example: WordNet does not contain information about proper nouns, which prevents it from understanding entities such as “USA” or “IBM.”
  • For an ordinary word such as “cat”, WordNet contains detailed information about its various senses. However, much of this knowledge is of linguistic value and is rarely evoked in daily usage; for example, the sense of “cat” as gossip or as a woman is rarely encountered.
  • Unfortunately, WordNet does not weight senses based on their usage, and these rarely used senses often give rise to misinterpretation of short texts. In summary, without knowing the distribution of the senses, it is difficult to build an inferencing mechanism to choose appropriate senses for a word in a context.


  • In this paper, we propose a novel approach for understanding short texts.
  • Our approach is a semantic-network-based approach for enriching a short text.
  • We present a novel mechanism to semantically enrich short texts with both concepts and co-occurring terms; this external knowledge is inferred from a large-scale probabilistic knowledge base using our proposed methods.
  • For each autoencoder we design a specific and effective learning strategy to capture useful features from input data.
  • We provide a way to combine knowledge information and deep neural network for text analysis, so that it helps machines better understand short texts.


  • We carry out extensive experiments on tasks including information retrieval and classification for short texts.
  • We show significant improvements over existing approaches, which confirm that:
  • concepts and co-occurring terms effectively enrich short texts and enable a better understanding of them;
  • our auto-encoder-based DNN model is able to capture abstract features and complex correlations from the input text, such that the learned compact binary codes can be used to represent the meaning of that text.
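Once each text has a compact binary code, similarity search reduces to comparing codes by Hamming distance. A minimal sketch of that lookup step, with hand-made 8-bit codes standing in for the output of a trained model (the codes and titles are illustrative, not results from our experiments):

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def nearest(query_code: int, index: dict) -> str:
    """Return the indexed text whose code is closest in Hamming distance."""
    return min(index, key=lambda text: hamming(index[text], query_code))

# Toy 8-bit codes standing in for the learned semantic hashes.
index = {
    "ipad launch announced":    0b11010010,
    "stock markets fall again": 0b00101101,
}
print(nearest(0b11010110, index))  # "ipad launch announced" (1 bit away)
```

Because the codes are short integers, this comparison is far cheaper than computing similarities over full term-count vectors, which is the main appeal of semantic hashing.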

Enriching short texts:

Given a short text, we first identify the terms that the semantic field can recognize; then, for each term, we perform conceptualization to get its appropriate concepts and further infer its co-occurring terms. The semantic field text file contains the semantic words for a particular word; these semantic words are collected from WordNet or Wikipedia. For example, given “China, India”, the concepts can be developing countries or Asian countries; similarly, in the electronics domain, the word “TV” suggests different TV sets or different TV companies. We denote this two-stage enrichment mechanism as CACT (concepts and co-occurring terms). After that, a short text can be represented by a vector of term counts and fed to our RF model for semantic hashing.
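The enrichment step can be sketched as follows; the CONCEPTS and CO_TERMS tables here are hypothetical stand-ins for the probabilistic knowledge base, which in reality would attach probabilities to each entry:

```python
# Hypothetical knowledge-base lookups standing in for the probabilistic
# knowledge base; real entries would carry typicality probabilities.
CONCEPTS = {"china": ["developing country", "asian country"],
            "india": ["developing country", "asian country"],
            "tv":    ["electronics", "appliance"]}
CO_TERMS = {"tv": ["monitor", "screen"]}

def enrich(short_text: str) -> list:
    """CACT enrichment: append concepts and co-occurring terms to the text."""
    terms = short_text.lower().split()
    enriched = list(terms)
    for t in terms:
        enriched += CONCEPTS.get(t, [])
        enriched += CO_TERMS.get(t, [])
    return enriched

print(enrich("china india"))
# the two country names now share the concepts "developing country"
# and "asian country", giving the hashing model a semantic signal
```

The enriched term list is then turned into a term-count vector in the usual way before being handed to the hashing model.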


Pre-processing plays a very important role in text mining techniques and applications; it is the first step in the text mining process. Extraction is the method used to tokenize the search content into individual words. Stop words are removed from the documents because such words are not considered keywords. Stop words are removed by the classical method; in pre-processing, we treat the semantic field as a dictionary of terms.
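A minimal sketch of this pre-processing step, assuming a small hand-made stop-word list (a real system would use a much larger list, e.g. one derived from the semantic-field dictionary):

```python
# A tiny illustrative stop-word list; real systems use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}

def preprocess(text: str) -> list:
    """Tokenize the text and drop stop words, keeping candidate keywords."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The rise of TV in India"))  # ['rise', 'tv', 'india']
```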


Considering that the true sense of a term is heavily affected by its neighbours, especially for ambiguous terms, we propose a multi-stage mechanism to infer the most appropriate sense for each term in a short text. The most important mechanism we take advantage of is context-dependent conceptualization, which uses a probabilistic topic model. Through this method, we first obtain the topic distribution of a short text s. Let ~s be the sequence of term indices of s and ~z be the topic assignment vector of ~s. We then compute the probability of concept c given a term w of s based on the topic distribution, roughly as

    p(c | w, s) ∝ P(c | w) · Σ_k P(w | z = k) · P(z = k | s),

where w and c are the indices of the instance term and concept respectively, P(c | w) is the typicality of concept c given term w, and P(w | z = k) is the probability of term w given topic k.
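This scoring can be sketched as follows, assuming hypothetical values for the typicality P(c | w) and the per-topic term probabilities P(w | z = k); a real system learns these from the knowledge base and a topic model:

```python
# Hypothetical typicality table P(c | w) and per-topic term
# probabilities P(w | z = k); real values come from the knowledge
# base and a trained topic model.
TYPICALITY = {("tv", "electronics"): 0.7, ("tv", "company"): 0.3}
P_W_GIVEN_K = {("tv", 0): 0.05, ("tv", 1): 0.01}

def concept_score(concept: str, term: str, topic_dist: dict) -> float:
    """Score p(c | w, s): typicality P(c|w) weighted by how well
    term w fits the topic distribution of the short text s."""
    topical = sum(P_W_GIVEN_K.get((term, k), 0.0) * p
                  for k, p in topic_dist.items())
    return TYPICALITY.get((term, concept), 0.0) * topical

# Topic distribution of the short text, mostly topic 0 ("electronics").
topics = {0: 0.9, 1: 0.1}
print(concept_score("electronics", "tv", topics) >
      concept_score("company", "tv", topics))  # True
```

In a text dominated by an electronics topic, the "electronics" reading of "TV" outscores the "company" reading, which is exactly the disambiguation effect the mechanism is after.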

Co-occurring terms:

We define the co-occurrence score to measure the probability that a term o co-occurs with a target term t in a short text s. The score combines the plain co-occurrence probability, which is pre-defined in the semantic text file, with the semantic similarity between o and t under the text s: since we already know the concept c of term t, we want the co-occurring term o to be semantically consistent with c, where ci represents a concept of term o and a weight parameter balances the two components. For example, consider “TV” in the short text “TV, video projector, monitor”: different companies would have high plain co-occurrence probabilities with “TV”, but only “monitor” is an appropriate co-occurring term once we take into consideration the semantics (concept) of “TV” in that text.
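A sketch of this co-occurrence score, assuming a hypothetical prior table and concept sets, with a weight parameter alpha balancing the two components:

```python
# Hypothetical prior co-occurrence probabilities and concept sets;
# alpha is the weight parameter balancing prior vs. semantic fit.
PRIOR = {("tv", "monitor"): 0.6, ("tv", "cat"): 0.5}
CONCEPT_SETS = {"tv": {"electronics"}, "monitor": {"electronics"},
                "cat": {"animal"}}

def cooccur_score(target: str, cand: str, alpha: float = 0.5) -> float:
    """Combine the plain co-occurrence prior with a semantic similarity
    term based on the concepts shared by the two words."""
    prior = PRIOR.get((target, cand), 0.0)
    shared = CONCEPT_SETS.get(target, set()) & CONCEPT_SETS.get(cand, set())
    sim = len(shared) / max(len(CONCEPT_SETS.get(cand, set())), 1)
    return alpha * prior + (1 - alpha) * sim

print(cooccur_score("tv", "monitor") > cooccur_score("tv", "cat"))  # True
```

Even though "cat" has a sizeable prior here, the concept-consistency term pushes "monitor" ahead, mirroring the "TV / monitor" example above.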

Experiment setup:

To evaluate the effectiveness of the random forest algorithm on query enrichment, we follow several steps. The term “electronics” has been chosen for this project. The electronics term has different categories: different companies, different products, different prices, etc. The data is stored in two formats, images and files, and data of some magnitude is derived for each category. In the experiment, for every noun term in a query, we enrich it with the top possible concepts and co-occurring terms. As a result, the representation of a query is greatly enriched.

Result analysis:

Different experiments are carried out on tasks including information retrieval and classification for short texts. Significant improvements over existing approaches are observed, which confirm that concepts and co-occurring terms effectively enrich short texts and enable a better understanding of them. The RF-based model is able to capture the abstract features and complex correlations from the input text, such that the learned compact binary codes can be used to represent the meaning of the text. The duration comparison between the traditional TF-IDF approach and the random forest is shown for a particular search query.

Future scope:

There are a number of interesting extensions of this work. As this work targets aggregated search, the efficiency of the whole framework should be optimized for real applications. Moreover, a ranking procedure can be implemented within the framework to combine the random forest algorithm with the traditional approaches.


Query classification is an important as well as a difficult problem in the field of information retrieval. Once the category information for a query is known, a search engine can be more effective and can return more representative Web pages to the users. To solve the query classification problem, we designed an approach based on query enrichment, which maps queries to intermediate objects. We propose a novel approach for understanding short texts.

First, we introduce a mechanism to enrich short texts with concepts and co-occurring terms extracted from a probabilistic semantic network. Then we apply a random forest algorithm to do semantic hashing. We carry out comprehensive experiments on short-text-centred tasks including information retrieval and classification. The significant improvements show that information retrieval takes less time than traditional approaches such as TF-IDF.