Last week, Haystack passed 200 installs on Google Play! We want to take this opportunity to thank our users: your collaboration is extremely valuable to our research efforts to illuminate the fast-evolving mobile ecosystem.
If you haven’t tried Haystack yet, we would like to invite you to install it. Haystack is a tool developed by a group of academic researchers at the International Computer Science Institute and UC Berkeley in collaboration with Stony Brook University. Haystack analyzes the traffic generated by the apps running on your phone so you can understand how they communicate with online services, and whether they leak any sensitive information about you or about your device. Haystack is available for free on Google Play.
How does Haystack work?
Your phone hosts a rich array of information about you and your activities. This includes a range of identifiers that can enable sites to track you, as well as data about your device, your installed apps, your accounts, and your location.
Sometimes apps need this information to provide useful functions or to adapt content to your device. For example, a maps application needs to know your exact location. In other cases, however, apps may collect and upload privacy-sensitive information for advertising, analytics, and tracking purposes. We consider these to be privacy leaks, but it is up to you to decide whether you want to continue using the app.
Android uses a permission model to control how applications access sensitive resources, but it does not really help you to know which organizations collect data about you or your device. Haystack aims to fill this gap.
The ICSI Haystack app helps you identify which apps leak information about you, where your apps connect to, and which protocols they use. It also tells you which organizations collect this information, even when the apps use encryption such as TLS/SSL (i.e., the technology providing the “S” for Secure in HTTPS). We accomplish this with a technique known as TLS interception. You can disable or enable TLS interception at any time in the app settings.
Haystack requires a number of Android permissions to capture your apps’ traffic and to identify privacy leaks. If you want to know more specific details, or if you want to know how Haystack will affect your battery life or performance, you can read our paper. We tried our best to minimize Haystack’s overhead and we will continue investigating new techniques to improve its performance.
A note about Samsung devices: We have noticed that Haystack cannot intercept app traffic on many Samsung devices. In some cases, it only intercepts DNS traffic, and in others it only works when connected over WiFi. Please accept our apologies. We are investigating the underlying problem in order to come up with a solution, but it is possible that this is a device or firmware limitation. Users of other Android VPN clients available on Google Play have reported similar problems. To minimize user frustration, we have not included Samsung devices in the list of supported devices in our Google Play listing. Nevertheless, you can install and try Haystack at your own risk using this direct link to the APK.
Haystack is also a tool to research the mobile ecosystem at scale. The mobile ecosystem is fast-evolving, and it is difficult to identify new players and services and how app developers integrate them. It is also important to know how secure data communications are across all apps and whether developers are implementing countermeasures against potential attacks.
Your collaboration is very important for achieving our research goals. To perform our studies, we need to collect certain pieces of information about how your apps behave. We do this without compromising your privacy by aggregating and anonymizing our traces. For instance, we want to know that an application leaked a unique identifier like the IMEI to a given server using TLS while the screen was off, but we do not want to know who you are, which apps you run, the value of the IMEI, your location, or your IP address. All your personal data remains on your phone!
Because of this data sanitization process, ICSI-UC Berkeley’s Committee for Protection of Human Subjects (IRB) has not considered this study to be a human-subject study. If you have any questions about our data collection process, we strongly encourage you to email us at haystack-help“@”icsi.berkeley.edu.
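To make the idea concrete, here is a minimal sketch of what this kind of sanitization and aggregation could look like. It is not Haystack’s actual code: the record fields, class names, and functions are hypothetical, and we assume each observation is reduced to "which app sent which type of data to which server, and under what conditions" before it leaves the phone.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class RawObservation(NamedTuple):
    """What the phone sees locally (hypothetical fields)."""
    user_id: str
    client_ip: str
    app_id: str
    leak_type: str      # e.g. "IMEI"
    leak_value: str     # the actual identifier; must never be uploaded
    destination: str
    used_tls: bool
    screen_on: bool


class SanitizedObservation(NamedTuple):
    """What could be reported: no user, no IP, no identifier value."""
    app_id: str
    leak_type: str
    destination: str
    used_tls: bool
    screen_on: bool


def sanitize(obs: RawObservation) -> SanitizedObservation:
    # Keep only the fields that describe app behavior, not the person.
    return SanitizedObservation(
        obs.app_id, obs.leak_type, obs.destination, obs.used_tls, obs.screen_on
    )


def aggregate(observations: Iterable[RawObservation]) -> Counter:
    # Count identical sanitized observations; individual reports are not kept.
    return Counter(sanitize(o) for o in observations)
```

In this sketch, two users whose phones see the same app leak the same type of value to the same server collapse into a single counter entry, so the aggregate says nothing about who they are.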
Thanks to our first 200 users, we have identified 423 apps leaking information to 2,995 unique domains (many of them owned by the same organization, as in the case of Google Services). The figure below shows the most popular online services among our monitored apps: 9 of the 12 belong to Google.
The interactive figure below illustrates the network of apps leaking sensitive information and the organizations collecting it, as seen by Haystack. The color of each edge represents whether the developer uploads the data securely with TLS by default (blue) or whether we have at least one record of an insecure flow (red). Nearly 72% of the privacy leaks happened over HTTPS/TLS. This stresses the need to intercept TLS traffic locally in order to identify those leaks. For certain apps like Mega, Uber, and Twitter, we cannot access the flows, either completely or partially, because they implement effective countermeasures against TLS interception.
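The edge-coloring rule can be sketched in a few lines. This is only an illustration under our own assumptions: flows are represented as hypothetical (app, service, used_tls) tuples, and the function names are made up for this example.

```python
from typing import Dict, Iterable, Tuple

Flow = Tuple[str, str, bool]  # (app_id, service, used_tls); hypothetical layout


def edge_colors(flows: Iterable[Flow]) -> Dict[Tuple[str, str], str]:
    """Blue if every recorded flow on an edge used TLS, red otherwise."""
    colors: Dict[Tuple[str, str], str] = {}
    for app, service, used_tls in flows:
        edge = (app, service)
        if not used_tls:
            colors[edge] = "red"            # one insecure flow is enough
        else:
            colors.setdefault(edge, "blue")  # never overrides an earlier red
    return colors


def tls_share(flows: Iterable[Flow]) -> float:
    """Fraction of flows that were sent over TLS."""
    flows = list(flows)
    if not flows:
        return 0.0
    return sum(1 for _, _, used_tls in flows if used_tls) / len(flows)
```

Note that a single unencrypted record turns an edge red for good, regardless of how many encrypted flows were seen before or after it.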
Because of rendering and performance issues, we pruned the network. In particular, we excluded leaks caused by browser activity (e.g., Mozilla’s Firefox or Google Chrome), those caused by pre-installed Android processes, and unpopular online services reached by fewer than three applications (mainly apps reporting data to their own online infrastructure). To reduce the number of nodes, we also grouped sub-domains together.
The figure shows how applications cluster around popular tracking, analytics, and advertising services, but also around social media integration, as in the case of Facebook’s Graph API. There is a clear power-law distribution in the number of apps using those services: 53 of the services have been reached by at least 5 apps. The most popular services in our dataset are:
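The pruning steps above could be sketched as follows. This is a simplified illustration, not the code used to produce the figure: the edge format and function names are hypothetical, and sub-domains are grouped naively by keeping the last two labels of the host name (a real implementation would more likely rely on the Public Suffix List, since this shortcut mishandles domains such as example.co.uk).

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Package names of the two browsers mentioned in the text.
BROWSER_APPS = frozenset({"org.mozilla.firefox", "com.android.chrome"})


def base_domain(host: str) -> str:
    """Naive sub-domain grouping: keep only the last two labels."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def prune(
    edges: Iterable[Tuple[str, str]],
    excluded_apps: frozenset = BROWSER_APPS,
    min_apps: int = 3,
) -> Dict[str, Set[str]]:
    """Map each surviving service to the set of apps that reach it."""
    apps_per_service: Dict[str, Set[str]] = defaultdict(set)
    for app, host in edges:
        if app in excluded_apps:  # browsers, pre-installed processes, ...
            continue
        apps_per_service[base_domain(host)].add(app)
    # Drop services reached by fewer than `min_apps` distinct apps.
    return {
        service: apps
        for service, apps in apps_per_service.items()
        if len(apps) >= min_apps
    }
```

Pre-installed Android processes can be filtered the same way by adding their identifiers to `excluded_apps`.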
- Facebook Graph: Facebook’s Graph API allows app developers to integrate their apps with Facebook’s social graph and platform. However, it also allows developers to monetize their apps through advertising and to perform app analytics. Facebook’s services are always delivered over HTTPS and they do not seem to use specific domains for each service: all the services seem to be delivered by the domain graph.facebook.com. This characteristic makes it harder for traditional ad blockers to block this traffic, as they need to perform TLS interception and analyze the traffic to identify the content of a given flow. We have identified more than 15 types of sensitive data leaks, with significant variation between apps, probably associated with developers’ needs. Nevertheless, Facebook’s official apps (e.g., Instagram, Facebook’s Pages, Facebook’s Messenger, and the Facebook app) upload the bulk of the sensitive information. The applications that integrate Facebook’s Graph API in their services generally report information like the OS build ID (which can be seen as a cookie), operator name, and device model and brand, with the exception of a few applications that report unique identifiers like the IMEI or even the private IP address. We have manually checked that applications using this service can operate even on devices without Facebook’s official apps installed.
- Google Services: This includes more than 20 domains, such as 1) googleapis.com (for apps interacting with Google’s API, including authentication); 2) gstatic.com, googlesyndication.com, and doubleclick.com for online advertisement; and 3) Google Analytics for Android (analytics-google.com). The native processes Android’s Backup Service and Google Play leak most of the device- and user-related information to Google’s services. The rest of the apps leak information like device model, brand, and build ID.
- Flurry: Flurry is a popular analytics service owned by Yahoo. So far, we have identified more than 10 different types of data leaks associated with it, mostly device information (brand, model, and build ID) and ISP information (MCC/MNC codes).
- Crashlytics: Crashlytics, now owned by Twitter, builds crash-reporting tools for iOS and Android. 27 of the apps on devices running Haystack use this service.
We will investigate the content of the protocols used by each of these services in detail, using instrumented phones under our control so as not to cause any privacy violation to our users. We will release our results on this blog. Stay tuned!
Case Study: Unique Identifiers
Mobile devices contain many types of unique identifiers, that is, values guaranteed to be unique among all those assigned to a given kind of object, user, device, or resource. Unique identifiers are extremely useful for mobile advertising and tracking services, as they allow them to link data to users across all the apps using their services. They play a role similar to that of cookies in the browser, with the added advantage (for the tracker) of being immutable. Below, we describe some of the unique identifiers stored on mobile phones:
- The International Mobile Station Equipment Identity or IMEI is a unique 15-digit number that identifies every mobile phone, GSM modem, or device with a built-in phone or modem. Based on this value, it is also possible to obtain some additional information about the device brand or model.
- The International Mobile Subscriber Identity or IMSI is a 64-bit value that uniquely identifies the user of a cellular network globally. The subscriber’s phone number is another value falling in this category.
- The Media Access Control Address or MAC Address is a unique identifier assigned to network interfaces for communications on the physical network segment (e.g., WiFi or Bluetooth).
- Like any product, mobile phones also contain unique identifiers, such as serial numbers, that are assigned incrementally or sequentially by the manufacturer. Any app can request this information programmatically.
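As an aside on the IMEI entry above, the sketch below shows why brand and model information can be derived from the value. It relies on two facts about the standard IMEI layout that are not spelled out in this post: the first 8 digits form the Type Allocation Code (TAC), which is assigned per device model, and the last digit is a Luhn check digit. The function names are our own, and this is not part of Haystack.

```python
from typing import NamedTuple, Optional


class ImeiParts(NamedTuple):
    tac: str          # Type Allocation Code: identifies the device model
    serial: str       # per-device serial within that model
    check_digit: str  # Luhn check digit


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def split_imei(candidate: str) -> Optional[ImeiParts]:
    """Split a 15-digit IMEI into its parts, or return None if invalid."""
    if len(candidate) != 15 or not luhn_valid(candidate):
        return None
    return ImeiParts(candidate[:8], candidate[8:14], candidate[14])
```

Looking up the TAC in a device database is what yields the brand and model; that lookup is outside the scope of this sketch.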
An app developer must request the READ_PHONE_STATE permission, which Android’s documentation lists among the dangerous permissions, to read device identifiers such as the IMEI, phone number, and IMSI. However, no permission is needed to access other unique identifiers like the serial number and the MAC address. In this case, the app developer just needs to read the output of the getprop command, which exposes a vast number of system properties and configurations. This is an anonymized version of a real hexadecimal serial number as reported by getprop: [ro.serialno]: [0X8X9XX214084221]. In practical terms, the information exposed this way has the same impact on the user’s privacy as the other values. Haystack reads and parses this output and searches for its content in the user’s traffic. This allows us to investigate how applications access and leak unique identifiers, which organizations collect them, and whether encryption is used to upload them. The histogram below shows the number of apps leaking those values and whether they use encryption to upload this sensitive data.
In addition to unique identifiers, applications can access other sensitive information, such as the device hostname or even the WiFi SSID, through the getprop command. This can be used to geo-locate the user without requiring any further permission. As in the case of unique identifiers, the following table lists the apps leaking this information and their destinations as seen by Haystack. Only Adobe’s analytics service does not use encryption by default.
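The parse-and-search idea can be illustrated with a short sketch. It assumes the `[key]: [value]` line format shown in the serial number example, and it only looks for exact, plain-text matches in a payload. The function names are hypothetical and this is not Haystack’s implementation; a real detector would presumably also have to consider encoded or hashed forms of the values.

```python
import re
from typing import Dict, Iterable, List

# Matches lines such as: [ro.serialno]: [0X8X9XX214084221]
PROP_LINE = re.compile(r"^\[(?P<key>[^\]]+)\]: \[(?P<value>.*)\]$")


def parse_getprop(output: str) -> Dict[str, str]:
    """Turn getprop-style text into a {property: value} dictionary."""
    props: Dict[str, str] = {}
    for line in output.splitlines():
        match = PROP_LINE.match(line.strip())
        if match:
            props[match.group("key")] = match.group("value")
    return props


def find_leaks(
    payload: bytes, props: Dict[str, str], watched_keys: Iterable[str]
) -> List[str]:
    """Return the watched properties whose value appears verbatim in payload."""
    leaked = []
    for key in watched_keys:
        value = props.get(key, "")
        # Skip empty values: an empty byte string matches every payload.
        if value and value.encode("utf-8") in payload:
            leaked.append(key)
    return leaked
```

For example, parsing the anonymized serial number line above and scanning a request whose query string contains that serial number would report `ro.serialno` as leaked.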
Identifying apps that upload those values without encryption is very important, especially under oppressive regimes. The presence of this metadata in users’ traffic makes them trackable by in-path middleboxes (some of which, such as WiFi APs, can use it for advertising purposes) and by any surveillance agency. Nonetheless, as a result of the popularity of ad blockers working at the network level and increasing user concern about mass surveillance, the number of online tracking, advertising, and analytics services using TLS is increasing. In total, 69 apps leak some type of unique identifier, and these identifiers are collected by more than 150 domains. Those values allow online analytics and advertising services to identify users across the metadata provided by all the apps using their services. The figure below shows which apps leak any unique identifier or network-related information and the organizations (domains) collecting them. As in the interactive graph, we highlight in red the applications leaking any unique identifier without encryption. An interesting case is PayTM, an e-commerce app for online payments that leaks the IMEI value over plain HTTP.
An interesting observation that benefits from Haystack’s crowd-sourcing nature is that application developers use third-party libraries in different ways: not all apps leak the same type of information to those online services. Finally, many apps upload this information solely to their own servers.
The use of this information can be legitimate and very useful for preventing device theft, identity theft, and fraud. That could apply to applications like Cerberus anti-theft. However, even when the use is legitimate, it is not good practice to obtain this information through the getprop command without the user’s awareness.
You can access the Google Play profile of each app by replacing APP_ID in the following URL with the corresponding app ID (listed on the Y-axis of the figures): https://play.google.com/store/apps/details?id=APP_ID.
Apps leaking the IMEI:
Apps leaking the IMSI number:
Apps leaking the serial number:
Apps leaking the MAC address:
Future Work and Improvements
Haystack is still in its early stages and the results that we have presented in this post are just a first analysis of the data we have collected so far. We hope that you’ve found it interesting.
We are confident that as more users try the tool, we will get a better idea of how the different stakeholders of the mobile ecosystem leverage users’ metadata, and we will also be able to increase the range of leaks that we currently detect.
As we are able to allocate resources to the project, we also plan to make our dataset publicly available so that any user can search for information about a given application or online service. We also want to extend the tool to become a platform for general mobile measurement, from performance measurements to security measurements. Nevertheless, our priority is developing and maintaining a tool for mobile users so they can stay in control of their apps, their traffic, and their privacy.
Once more, we would like to invite you to try the tool if you haven’t done so yet, and to thank those who have already installed the app. As usual, we love to hear your feedback, as this is the best way to improve the tool. You can send us your comments by email at haystack-help “@” icsi.berkeley.edu or through our Twitter account. Likewise, we will be happy to answer any question or concern that you may have about the app or the data collection process.
The ICSI Haystack Team