Using BiLSTM networks for context-aware deep sensitivity labelling on conversational data
Pogiatzis, Antreas ORCID: https://orcid.org/0000-0001-8887-0139 and Samakovitis, Georgios ORCID: https://orcid.org/0000-0002-0076-8082 (2020) Using BiLSTM networks for context-aware deep sensitivity labelling on conversational data. Applied Sciences, 10 (24):8924. ISSN 2076-3417 (Online) (doi:10.3390/app10248924)
PDF (Open Access Article): 30903 SAMAKOVITIS_Using_BiLSTM_Networks_Context-aware_Deep_Sensitivity_Labelling_Conversational_Data_(OA)_2020.pdf (Published Version, 1MB). Available under a Creative Commons Attribution licence.
Abstract
Information privacy is a critical design feature for any exchange system, with privacy-preserving applications requiring, most of the time, the identification and labelling of sensitive information. However, privacy and the concept of “sensitive information” are extremely elusive terms, as they are heavily dependent upon the context in which they are conveyed. To accommodate such specificity, we first introduce a taxonomy of four context classes to categorise relationships of terms with their textual surroundings by meaning, interaction, precedence, and preference. We then propose a predictive context-aware model based on a Bidirectional Long Short-Term Memory network with Conditional Random Fields (BiLSTM + CRF) to identify and label sensitive information in conversational data (multi-class sensitivity labelling). We train our model on a synthetic annotated dataset of real-world conversational data categorised into 13 sensitivity classes that we derive from the P3P standard. We parameterise and run a series of experiments featuring word and character embeddings and introduce a set of auxiliary features to improve model performance. Our results demonstrate that the BiLSTM + CRF model architecture with BERT embeddings and WordShape features is the most effective (F1 score 96.73%). Evaluation of the model is conducted under both temporal and semantic contexts, achieving a 76.33% F1 score on unseen data and outperforming Google’s Data Loss Prevention (DLP) system on sensitivity labelling tasks.
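For orientation, the sketch below shows the BiLSTM tagging core of the architecture named in the abstract, in PyTorch. This is not the authors' code: the vocabulary size, layer dimensions, and the extra "O" (non-sensitive) tag are illustrative assumptions, and the paper's BERT embeddings, WordShape features, and CRF decoding layer (available, for example, via the third-party pytorch-crf package) are indicated in comments rather than implemented.

```python
# Minimal sketch (not the authors' code) of a BiLSTM sequence tagger for
# multi-class sensitivity labelling, under illustrative assumptions.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_tags=14):
        super().__init__()
        # In the paper, BERT embeddings concatenated with WordShape features
        # would replace this trainable embedding table.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # hidden_dim // 2 per direction so the concatenated output is hidden_dim.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)
        # Per-token emission scores; the paper decodes these jointly with a
        # CRF layer (e.g. the pytorch-crf package) instead of per-token argmax.
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        h, _ = self.bilstm(x)       # (batch, seq_len, hidden_dim)
        return self.out(h)          # (batch, seq_len, num_tags) tag logits

# Toy usage: 13 sensitivity classes plus an assumed "O" (non-sensitive) tag.
model = BiLSTMTagger(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (2, 16)))  # batch of 2 sentences
tags = logits.argmax(dim=-1)                       # greedy labels (no CRF)
print(tags.shape)                                  # torch.Size([2, 16])
```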
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | BiLSTM, BERT, NLP, context-aware |
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
| Faculty / School / Research Centre / Research Group: | Faculty of Engineering & Science; Faculty of Engineering & Science > School of Computing & Mathematical Sciences (CMS) |
| Last Modified: | 23 May 2022 10:57 |
| URI: | http://gala.gre.ac.uk/id/eprint/30903 |