An automatic text labelling framework to promote digital privacy in the Android ecosystem
by Jesús Alonso, Francesca Sallicati, and Cristian Robledo (Tree Technology)
Natural language texts are still a relatively unexplored data source in the quest for digital privacy. To take advantage of this source, we present a framework for the automatic labelling of texts related to Android apps. The framework relies on state-of-the-art Natural Language Processing (NLP) and deep learning techniques and will be accessible from a user-friendly dashboard.
Analysing text to promote digital privacy
Privacy policies for humans: disentangling their intricate language
In addition to serving as an aid to better understand privacy policies, this NLP-based inspection of contractual documents will help identify inconsistencies between user privacy expectations and the actual software behaviour inferred from static and dynamic analysis.
Assessing description-to-permission fidelity
To meet user expectations and to set adequate boundaries on app behaviour, Android protects privacy-critical device functionality. Whenever an app requires access to the camera, microphone or other sensitive features, users are asked to grant the corresponding permission. From a privacy and security point of view, if the functionality of an app is detailed in its description, the request for the corresponding permissions is easy to understand; this is known as description-to-permission fidelity. Checking this fidelity also helps to expose malware and privacy-invasive apps that claim more permissions than their described functionality warrants. To address this issue, the TRUSTaWARE NLP framework will also be able to infer the set of dangerous permissions required by the functionality described in an app description. These can then be contrasted with the permissions the app actually requests; a mismatch raises the suspicion that the app could be malicious.
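The comparison step described above can be sketched in a few lines of Python. This is only an illustration, not the TRUSTaWARE implementation: the function names `fidelity_report` and `is_suspicious` are hypothetical, the set of dangerous permissions is a small excerpt rather than the full Android list, and the inferred permissions are assumed to come from a description classifier that is not shown here.

```python
# Small excerpt of Android permissions with the "dangerous" protection level.
# Assumption: a real tool would load the complete list from the Android docs.
DANGEROUS_PERMISSIONS = {
    "android.permission.CAMERA",
    "android.permission.RECORD_AUDIO",
    "android.permission.ACCESS_FINE_LOCATION",
    "android.permission.READ_CONTACTS",
    "android.permission.READ_SMS",
}


def fidelity_report(inferred, requested):
    """Contrast permissions inferred from the description with those requested.

    inferred:  dangerous permissions predicted from the app description
               (output of a hypothetical description classifier).
    requested: permissions the app actually declares.
    """
    inferred = set(inferred) & DANGEROUS_PERMISSIONS
    # Only dangerous permissions matter here; normal ones are ignored.
    requested = set(requested) & DANGEROUS_PERMISSIONS
    return {
        # Requested and justified by the description.
        "explained": sorted(requested & inferred),
        # Requested but not justified by the description: the red flag.
        "unexplained": sorted(requested - inferred),
        # Suggested by the description but never requested.
        "missing": sorted(inferred - requested),
    }


def is_suspicious(report):
    """An app is flagged when it requests dangerous permissions
    that its description does not account for."""
    return bool(report["unexplained"])


# Usage: a flashlight app whose description only justifies camera access.
report = fidelity_report(
    inferred={"android.permission.CAMERA"},
    requested=[
        "android.permission.CAMERA",
        "android.permission.READ_CONTACTS",
        "android.permission.INTERNET",  # normal permission, ignored
    ],
)
print(report, is_suspicious(report))
```

A flag raised this way is a reason for closer inspection, not proof of malicious intent, since a description may simply be incomplete.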
The methodology used to build this second detector is similar to the one used for the analysis of privacy policies. In contrast to privacy policies, however, app descriptions are normally written in simpler language that tries to hold the reader’s attention: they are direct, clearly explained and not excessively long. Consequently, some modifications and new techniques are needed to obtain the best-performing model. To begin with, new FastText embeddings have been trained on a large corpus of unlabelled descriptions. Moreover, this time the final labelling model is a recurrent neural network with gated recurrent units and an attention layer.
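To give an intuition of what an attention layer adds on top of a recurrent network, the following pure-Python sketch pools a sequence of hidden states into a single vector and turns it into one probability per permission. It rests on several assumptions: the article does not specify the attention variant, so a simple dot-product against a learned context vector is used here; the hidden states stand in for the outputs of the gated recurrent units; and all numbers, as well as the names `attention_pool` and `label_probabilities`, are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Weight each hidden state by its similarity to a context vector.

    hidden_states: one vector per token (stand-in for GRU outputs).
    context:       learned vector of the same size (assumed, toy values).
    Returns the weighted sum of the states and the attention weights.
    """
    scores = [sum(h * c for h, c in zip(state, context)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return pooled, weights


def label_probabilities(pooled, label_weights):
    """Multi-label output: an independent sigmoid per permission."""
    return {
        label: 1.0 / (1.0 + math.exp(-sum(w * p for w, p in zip(ws, pooled))))
        for label, ws in label_weights.items()
    }


# Toy hidden states for the tokens of "scan", "barcode", "easily".
states = [[0.1, 0.0], [0.9, 0.8], [0.0, 0.1]]
pooled, weights = attention_pool(states, context=[1.0, 1.0])
probs = label_probabilities(pooled, {"android.permission.CAMERA": [2.0, 2.0]})
print(weights, probs)
```

The attention weights show which tokens contributed most to the prediction, which also makes the labelling easier to explain to end users.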
We have successfully built models to analyse privacy policies and app descriptions in search of the information described in the previous sections. One of the goals of this framework is to help end users better understand the information they are given when Android apps are installed and used, and to assess whether the permissions requested by the apps are really necessary for their functionality. To meet this objective effectively, the models should be accessible from a user-friendly tool. An initial design of the dashboard will be implemented in the coming months.
What is next?
In this post, we have described a first set of services and tools for the analysis of privacy clauses in natural language texts. We will continue developing, improving, expanding and testing new ideas on the TRUSTaWARE NLP framework during the remaining two years of the project, including the exploration of other software-related texts such as user reviews. Stay tuned!
McDonald, A. M., & Cranor, L. F. (2008). The cost of reading privacy policies. ISJLP, 4, 543.
Glavaš, G., Nanni, F., & Ponzetto, S. P. (2016). Unsupervised text segmentation using semantic relatedness graphs. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics (pp. 125-130). Association for Computational Linguistics.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.
LeCun, Y., Haffner, P., & Bottou, L. (1999). Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision (Lecture Notes in Computer Science, Vol. 1681, p. 319). Springer.
Alecakir, H., Can, B., & Sen, S. (2021). Attention: There is an inconsistency between Android permissions and application metadata! International Journal of Information Security, 20, 797-815.
App Permission Levels Declared by Google [Online]. Available: https://developer.android.com/guide/topics/permissions/overview#normaldangerous