
Inference Attack on Browsing History of Twitter Users Using Public Click Analytics and Twitter Metadata
Abstract
Introduction
Twitter is a popular online social network and microblogging service for exchanging messages (also known as tweets) among people, supported by a huge ecosystem. Twitter reports that it has over 140 million active users creating more than 340 million messages every day [26] and over one million registered applications built by more than 750,000 developers [25]. The third-party applications include client applications for various platforms, such as Windows, Mac, iOS, and Android, and web-based applications such as URL shortening services, image-sharing services, and news feeds. Among the third-party services, URL shortening services, which provide a short alias for a long URL, are essential for Twitter users who want to share long URLs in length-restricted tweets. Twitter allows users to post tweets of up to 140 characters containing only text. Therefore, when users want to share complicated information (e.g., news and multimedia), they must include in a tweet the URL of a web page containing that information. Since the URL and its accompanying text may together exceed 140 characters, Twitter users rely on URL shortening services to reduce the length. Some URL shortening services (e.g., bit.ly and goo.gl) also provide public click analytics for shortened URLs, consisting of the number of clicks and the countries, browsers, and referrers of visitors. Although anyone can access these data to analyze visitor statistics, no one should be able to extract specific information about individual visitors from them, because URL shortening services provide the data in aggregated form to protect the privacy of visitors from attackers.
However, we identify a simple inference attack that can single out individual visitors from the aggregated, public click analytics by using public metadata provided by Twitter. First, we examine the client application and location metadata because they can be correlated with the public click analytics. For instance, if a user, Alice, posts her messages using the official Twitter client application for iPhone, “Twitter for iPhone” will be included in the source field of the corresponding metadata. Moreover, Alice may disclose on her profile page that she lives in the USA, or activate the location service of a Twitter client application to automatically fill the location field in the metadata. Using this information, we can determine that Alice is an iPhone user who lives in the USA. Next, we perform the simple inference attack on behalf of Alice’s boyfriend, Bob, as follows. Bob first posts a tweet with a URL shortened by goo.gl. If Alice clicks on the shortened URL, goo.gl records {“country”: “US”, “platform”: “iPhone”, “referrer”: “twitter.com”, “browser”: “Mobile”} in the click analytics of the shortened URL (details are in Sections 2 and 3). Otherwise, goo.gl records no information. Later, Bob retrieves the click analytics of the shortened URL to learn whether Alice clicked on his URL. If the click analytics are unchanged, or if the changes do not include the USA, iPhone, and twitter.com, he infers that Alice did not click on his URL. Otherwise, he infers that Alice clicked on his URL.
The main advantage of this inference attack over conventional browser history stealing attacks is that it demands only public information. Conventional browser history stealing attacks rely on private information, such as cascading style sheet (CSS) visited styles, browser cache, DNS cache, and latency. To collect such information, attackers must (i) prepare attack pages containing scripts or malware and lure target users to them in order to extract the information from their web browsers, or (ii) monitor DNS requests to measure the DNS lookup time of a target host. In other words, attackers must deceive or compromise target users or their networks to obtain the browsing history, which is a strong assumption. In contrast, anyone can access the metadata of Twitter and the public click analytics of URL shortening services, so passive monitoring is enough to perform our attack.
In this paper, we propose novel attack methods for inferring whether a specific user clicked on certain shortened URLs on Twitter. As shown in the preceding simple inference attack, our attacks rely on a combination of publicly available information: click analytics from URL shortening services and metadata from Twitter. The goal of the attacks is to know which URLs are clicked on by target users.
We introduce two different attack methods: (i) an attack to learn who clicks on the URLs posted by target users and (ii) an attack to learn which URLs are clicked on by target users. To perform the first attack, we find a number of Twitter users who frequently distribute shortened URLs, and investigate the click analytics of those shortened URLs and the metadata of the followers of these users. To perform the second attack, we create monitoring accounts that monitor messages from all followings of target users to collect all shortened URLs that the target users may click on. We then monitor the click analytics of those shortened URLs and compare them with the metadata of the target user. Furthermore, we propose an advanced attack method that reduces attack overhead while increasing inference accuracy by using the time model of target users, which represents when the target users frequently use Twitter. Evaluation results show that our attacks can successfully infer the click information with high accuracy and low overhead. We summarize the main contributions of this paper as follows:
- We propose novel attack techniques to determine whether a specific user clicks on certain shortened URLs on Twitter. To the best of our knowledge, this is the first study that infers URL visiting history on Twitter.
- We only use public information provided by URL shortening services and Twitter (i.e., click analytics and Twitter metadata). We determine whether a target user visits a shortened URL by correlating the publicly available information. Our approach does not need complicated techniques or assumptions such as script injection, phishing, malware intrusion, or DNS monitoring. All we need is publicly available information.
- We further decrease attack overhead while increasing accuracy by considering target users’ time models. This increases the practicality of our attacks, which is why we call for immediate countermeasures.
Periodic Monitoring and Matching
We periodically monitor the click analytics of shortened URLs to observe the instant changes made by a new visitor. Whenever we notice a new visitor, we match his or her information against each of our target users to determine whether the new visitor is one of them. We can estimate information about visitors by checking the differences between the new and the old click analytics. Fig. 1 shows an example of the process for obtaining information about the visitor who clicks on a goo.gl URL. In this figure, we easily infer that the new visitor is an iPhone user who lives in the USA because the numbers of clicks for “USA” and “iPhone” increase simultaneously. In periodic monitoring, determining the optimal query interval is important, and it depends on the variety of the characteristics of followers. When several characteristics must be observed at the same time and their values change rapidly, the query interval should be short enough to catch a small change. When there are many followers, the slope of change becomes steep, so a short interval is required. In general, the interval should be decided by considering the slope of change in the overall characteristics. However, periodic monitoring and matching has a limitation because Twitter does not officially provide personal information about users such as country, browsers, and platforms. We need to infer this information about target users by investigating their timelines and profile pages.
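The differencing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the snapshot layout (a dictionary mapping a field name to per-value click counts) and the field names `countries` and `platforms` are assumptions made for the example, and the code does not model the actual goo.gl response format.

```python
from collections import Counter

def analytics_delta(old, new):
    """Return, per field, the values whose click counts grew between two snapshots.

    Both snapshots are assumed (hypothetically) to look like
    {"countries": {"US": 10}, "platforms": {"iPhone": 4}}.
    """
    delta = {}
    for field, new_counts in new.items():
        old_counts = Counter(old.get(field, {}))  # missing values count as 0
        grown = {value: count - old_counts[value]
                 for value, count in new_counts.items()
                 if count > old_counts[value]}
        if grown:
            delta[field] = grown
    return delta

def is_single_visitor(delta):
    """True when every changed field grew by exactly one click in total."""
    return bool(delta) and all(sum(c.values()) == 1 for c in delta.values())

def visitor_fingerprint(delta):
    """Map each field to the single value that changed, or None if ambiguous."""
    if not is_single_visitor(delta):
        return None
    return {field: next(iter(counts)) for field, counts in delta.items()}

old = {"countries": {"US": 10, "KR": 3}, "platforms": {"iPhone": 4, "Windows": 9}}
new = {"countries": {"US": 11, "KR": 3}, "platforms": {"iPhone": 5, "Windows": 9}}
print(visitor_fingerprint(analytics_delta(old, new)))
```

When more than one visitor arrives between two queries, the per-field increases no longer identify a single visitor and `visitor_fingerprint` returns `None`, which is one way to see why the query interval has to shrink as the slope of change grows.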
Referrers
We determine whether a new visitor comes from Twitter by using the changed referrer information in the public click analytics. The click analytics of goo.gl records only the hostname of the referrer site. If a visitor comes from Twitter, “t.co” or “twitter.com” is recorded in the Referrers field. In most cases, “t.co” is recorded because all links shared on Twitter are automatically shortened to t.co links. t.co handles redirections according to context and user agent, so the Referrer information varies with the source of a click [27]. In some cases, “twitter.com” is recorded because some Twitter applications directly use original links instead of t.co links. Consequently, if the Referrers information of the visitor is “t.co” or “twitter.com”, we regard the visitor as coming from Twitter. In the case of bit.ly, we can analyze a shortened URL in more detail because bit.ly records the entire URL of the referrer site in its click analytics (Section 2.2 and Table 1). When a target user clicks on a bit.ly URL that has been converted into a t.co URL, bit.ly records the entire t.co URL in the Referrers field. Referrer matching is considered successful when the t.co URL recorded in the click analytics is the same as the t.co URL of the target shortened URL.
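The two referrer checks above can be illustrated with a short sketch. The function names are hypothetical, and the normalization rule (ignore the scheme and a trailing slash, keep the path case-sensitive) is an assumption of this example rather than something the text specifies.

```python
from urllib.parse import urlparse

# Hostnames that the text treats as "coming from Twitter".
TWITTER_HOSTS = {"t.co", "twitter.com"}

def _parse(referrer):
    # Accept both bare hostnames ("t.co") and full URLs ("https://t.co/x").
    return urlparse(referrer if "//" in referrer else "//" + referrer)

def came_from_twitter(referrer):
    """goo.gl-style check: only the referrer hostname is available."""
    host = (_parse(referrer).hostname or "").lower()
    return host in TWITTER_HOSTS

def normalize_url(url):
    """Drop the scheme and any trailing slash; keep the path case-sensitive."""
    parsed = _parse(url)
    return (parsed.hostname or "").lower() + parsed.path.rstrip("/")

def bitly_referrer_matches(recorded_referrer, target_tco_url):
    """bit.ly-style check: the full referrer URL must equal the target t.co URL."""
    return normalize_url(recorded_referrer) == normalize_url(target_tco_url)

print(came_from_twitter("t.co"))                                     # hostname only
print(bitly_referrer_matches("http://t.co/AbC123", "https://t.co/AbC123"))
```

The hostname check can only say that a click came from somewhere on Twitter, whereas the full-URL check ties the click to one specific t.co link, which is why the bit.ly case allows a more detailed analysis.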
Country
We infer the country of target users from the location field of their profile pages and compare it with the changed country information in the public click analytics. In many cases, Twitter users fill in the location field with their city or place name. We can determine a user’s country by searching GeoNames with the contents of the location field of the user’s Twitter profile. GeoNames returns the country code that corresponds to the search keywords. The country information provided by the click analytics is also a country code, so we have a successful country match if both country codes are the same. However, country matching has a limitation: it does not work when Twitter users leave the location field empty or fill it with meaningless information (e.g., “earth” or “in your heart”). According to Hecht et al. [13], approximately 34 percent of Twitter users do not provide real location information. In the later experiments, we avoid this problem by selecting only target users who entered valid location names in the location field. Even without location information, however, our attacks remain possible using the other information, because the country is not a major feature for conducting the inference. Additionally, our attacks can draw on recent studies that infer the location of Twitter users from their posts.
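A sketch of the country-matching rule follows. To keep it self-contained, the place-name lookup is an in-memory stand-in (`TOY_GAZETTEER`, a hypothetical table) rather than a real GeoNames query; an actual attack would pass a lookup function backed by a gazetteer service. Returning `None` for unusable locations is a design choice of this example, so that a missing location is not counted as a mismatch.

```python
from typing import Callable, Optional

# Hypothetical stand-in for a gazetteer lookup; not real GeoNames data access.
TOY_GAZETTEER = {"seoul": "KR", "new york": "US", "london": "GB"}

def toy_lookup(place: str) -> Optional[str]:
    """Return a two-letter country code for a place name, or None if unknown."""
    return TOY_GAZETTEER.get(place.strip().lower())

def country_matches(profile_location: Optional[str],
                    clicked_country: str,
                    lookup: Callable[[str], Optional[str]] = toy_lookup
                    ) -> Optional[bool]:
    """Compare the profile's inferred country with the click analytics country.

    Returns True or False when the location resolves to a country code, and
    None when the field is empty or meaningless (e.g. "in your heart").
    """
    if not profile_location or not profile_location.strip():
        return None
    code = lookup(profile_location)
    if code is None:
        return None
    return code.upper() == clicked_country.upper()

print(country_matches("New York", "US"))
print(country_matches("in your heart", "US"))
```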
Browsers and Platforms
When our target users click on a shortened URL provided by goo.gl, we use information about the user’s browser and platform to increase the inference accuracy. Although Twitter does not provide information of this nature about its users, it does record the name of the application that was used to post a tweet. For example, when someone posts a tweet using the official Twitter client application for the iPhone, “Twitter for iPhone” is recorded in the source field of the tweet, which lets us infer the browser and platform that were used. Table 2 shows examples of source values corresponding to different browsers and platforms. We must also consider applications that support multiple platforms, such as TweetDeck, which is available for the iOS, Android, Windows, and Mac OS X operating systems. A target user who uses a multi-platform application should be regarded as using all the platforms that the application supports.
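The source-to-platform rule, including the treatment of multi-platform applications, can be sketched as below. The mapping is illustrative only: it contains the two applications named in the text plus one extra entry added for the example, it is not a reproduction of Table 2, and the platform labels are assumed to match those used by the click analytics.

```python
# Illustrative mapping from a tweet's source field to candidate platforms.
# Not the paper's Table 2; entries and labels are assumptions for this sketch.
SOURCE_TO_PLATFORMS = {
    "Twitter for iPhone": {"iPhone"},
    "Twitter for Android": {"Android"},  # extra entry, added for illustration
    "TweetDeck": {"iOS", "Android", "Windows", "Mac OS X"},  # multi-platform
}

def candidate_platforms(sources):
    """Union of all platforms implied by the sources seen in a user's tweets."""
    platforms = set()
    for source in sources:
        platforms |= SOURCE_TO_PLATFORMS.get(source, set())
    return platforms

def platform_matches(sources, clicked_platform):
    """True when the platform of a new click is one the target may be using."""
    return clicked_platform in candidate_platforms(sources)

print(sorted(candidate_platforms(["Twitter for iPhone", "TweetDeck"])))
print(platform_matches(["TweetDeck"], "Windows"))
```

Treating a multi-platform application as the union of its supported platforms keeps the match conservative: it avoids ruling out a target by mistake, at the cost of a less specific fingerprint.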
Conclusion
In this paper, we proposed inference attacks to deduce which shortened URLs are clicked on by a target user. All the information required for our attacks is public: the click analytics of URL shortening services and Twitter metadata. To evaluate our attacks, we crawled and monitored the click analytics of URL shortening services and Twitter data. Throughout the experiments, we have shown that our attacks can infer the candidates in most cases. We propose algorithms to apply our inference attack in general situations. We first define user and data models.