Investigating the Effectiveness of Representations Based on Word-Embeddings in Active Learning for Labelling Text Datasets
Date
2021-11-23
Files
main article
Adobe PDF, 1.34 MB
Citation
Lu, Jinghui, Maeve Henchion, and Brian Mac Namee. "Investigating the Effectiveness of Representations Based on Word-Embeddings in Active Learning for Labelling Text Datasets." arXiv preprint arXiv:1910.03505 (2019).
Abstract
Manually labelling large collections of text data is a time-consuming and expensive task, but one that is necessary to support machine learning based on text datasets. Active learning has been shown to be an effective way to alleviate some of the effort required in utilising large collections of unlabelled data for machine learning tasks without needing to fully label them. The representation mechanism used to represent text documents when performing active learning, however, has a significant influence on how effective the process will be. While simple vector representations such as bag-of-words have been shown to be an effective way to represent documents during active learning, the emergence of representation mechanisms based on the word embeddings prevalent in neural network research (e.g. word2vec and transformer-based models like BERT) offers a promising, and as yet not fully explored, alternative. This paper describes a large-scale evaluation of the effectiveness of different text representation mechanisms for active learning across 8 datasets from varied domains. The evaluation shows that using representations based on modern word embeddings, especially BERT, which have not yet been widely used in active learning, achieves a significant improvement over more commonly used vector representations like bag-of-words.
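To make the setting concrete, the following is a minimal sketch of pool-based active learning with uncertainty sampling, using a bag-of-words representation (the baseline the abstract refers to). The toy corpus, classifier, and query budget are illustrative assumptions, not the paper's experimental setup; swapping in an embedding-based representation (e.g. averaged word2vec vectors or BERT features) would only change how the feature matrix `X` is built.

```python
# Pool-based active learning sketch: uncertainty sampling over a
# bag-of-words feature matrix. All data and settings are illustrative.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "the team won the match", "a great goal in the final",
    "the player scored twice", "fans cheered the striker",
    "the coach praised the defence", "a late penalty decided the game",
    "the new cpu is fast", "install the software update",
    "the laptop battery drains quickly", "a bug in the compiler",
    "the network latency dropped", "encrypt the database backup",
]
labels = np.array([0] * 6 + [1] * 6)  # 0 = sport, 1 = tech

# Bag-of-words representation; an embedding-based variant would
# replace this step with document vectors from word2vec or BERT.
X = CountVectorizer().fit_transform(docs)

labelled = [0, 6]  # seed set: one labelled example per class
pool = [i for i in range(len(docs)) if i not in labelled]

for _ in range(4):  # four labelling rounds
    clf = LogisticRegression().fit(X[labelled], labels[labelled])
    probs = clf.predict_proba(X[pool])
    # Uncertainty sampling: query the pool document whose highest
    # class probability is lowest (least confident prediction).
    query = pool[int(np.argmin(probs.max(axis=1)))]
    labelled.append(query)  # the human oracle supplies the label
    pool.remove(query)

accuracy = clf.score(X, labels)
```

The representation-comparison question the paper studies then amounts to running this same loop with different constructions of `X` and comparing how quickly accuracy improves per label acquired.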
