Using BiLSTM networks for context-aware deep sensitivity labelling on conversational data

Pogiatzis, Antreas (ORCID: https://orcid.org/0000-0001-8887-0139) and Samakovitis, Georgios (ORCID: https://orcid.org/0000-0002-0076-8082) (2020) Using BiLSTM networks for context-aware deep sensitivity labelling on conversational data. Applied Sciences, 10 (24):8924. ISSN 2076-3417 (Online) (doi:10.3390/app10248924)

PDF (Open Access Article): 30903 SAMAKOVITIS_Using_BiLSTM_Networks_Context-aware_Deep_Sensitivity_Labelling_Conversational_Data_(OA)_2020.pdf - Published Version
Available under License Creative Commons Attribution.

Abstract

Information privacy is a critical design feature for any exchange system, and privacy-preserving applications typically require the identification and labelling of sensitive information. However, privacy and the concept of “sensitive information” are extremely elusive terms, as they are heavily dependent upon the context in which they are conveyed. To accommodate such specificity, we first introduce a taxonomy of four context classes to categorise relationships of terms with their textual surroundings by meaning, interaction, precedence, and preference. We then propose a predictive context-aware model based on a Bidirectional Long Short-Term Memory network with Conditional Random Fields (BiLSTM + CRF) to identify and label sensitive information in conversational data (multi-class sensitivity labelling). We train our model on a synthetic annotated dataset of real-world conversational data categorised into 13 sensitivity classes that we derive from the P3P standard. We parameterise and run a series of experiments featuring word and character embeddings and introduce a set of auxiliary features to improve model performance. Our results demonstrate that the BiLSTM + CRF model architecture with BERT embeddings and WordShape features is the most effective (F1 score of 96.73%). Evaluation of the model is conducted under both temporal and semantic contexts, achieving a 76.33% F1 score on unseen data and outperforming Google’s Data Loss Prevention (DLP) system on sensitivity labelling tasks.
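
The abstract describes the architecture only at a high level. As an illustration of the general technique it names, the sketch below wires up a BiLSTM + CRF sequence labeller over pre-computed contextual embeddings (such as BERT vectors) in PyTorch, alongside a standard WordShape feature function of the kind commonly used as an auxiliary NLP feature. This is a minimal sketch, not the authors' implementation: the `pytorch-crf` dependency, all class and function names, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of a BiLSTM + CRF sequence labeller.
# Assumes the third-party `pytorch-crf` package (pip install pytorch-crf);
# hyperparameters and names are illustrative, not the paper's configuration.
import torch
import torch.nn as nn
from torchcrf import CRF  # assumption: CRF layer from pytorch-crf


def word_shape(token: str) -> str:
    """Map a token to its WordShape pattern, e.g. 'John' -> 'Xxxx',
    'Ab-3' -> 'Xx-d' (a common auxiliary feature for sequence labelling)."""
    return "".join(
        "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
        for c in token
    )


class BiLSTMCRFTagger(nn.Module):
    def __init__(self, embed_dim: int, hidden_dim: int, num_tags: int):
        super().__init__()
        # Contextual embeddings (e.g. from BERT) are fed in directly,
        # so no embedding lookup layer is defined here.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, embeddings, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        out, _ = self.bilstm(embeddings)
        return -self.crf(self.emissions(out), tags, mask=mask)

    def decode(self, embeddings, mask):
        # Viterbi decoding of the most likely tag sequence per sentence.
        out, _ = self.bilstm(embeddings)
        return self.crf.decode(self.emissions(out), mask=mask)


# Illustrative usage: 768-dim embeddings (BERT-base size) and 13 tags,
# matching the number of sensitivity classes mentioned in the abstract.
model = BiLSTMCRFTagger(embed_dim=768, hidden_dim=256, num_tags=13)
x = torch.randn(2, 10, 768)                       # batch of 2 sentences
mask = torch.ones(2, 10, dtype=torch.bool)        # no padding in this toy batch
tags = torch.randint(0, 13, (2, 10))
print(model.loss(x, tags, mask))                  # scalar training loss
print(model.decode(x, mask))                      # predicted tag sequences
```

The CRF layer sits on top of the BiLSTM emissions so that tag transitions are scored jointly across the sentence rather than token by token, which is the usual motivation for pairing the two in sensitivity or entity labelling.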

Item Type: Article
Uncontrolled Keywords: BiLSTM, BERT, NLP, context-aware
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Faculty / School / Research Centre / Research Group: Faculty of Engineering & Science
Faculty of Engineering & Science > School of Computing & Mathematical Sciences (CMS)
Last Modified: 23 May 2022 10:57
URI: http://gala.gre.ac.uk/id/eprint/30903
