Development of a Text Classification Framework using Transformer-based Embeddings

Sumona Yeasmin, Nazia Afrin, Kashfia Saif, Mohammad Huq

2022

Abstract

Traditional text document classification methods represent documents with non-contextualized word embeddings and vector space models. Recent techniques for text classification often rely on word embeddings as a transfer learning component. We first explored the existing text document classification methodologies and evaluated their strengths and limitations, starting with models based on Bag-of-Words and moving towards transformer-based architectures. We conclude that transformer-based embeddings are necessary to capture contextual meaning. BERT, one of the transformer-based embedding architectures, produces robust word embeddings by analyzing text both from left to right and from right to left, thereby capturing the proper context. This research introduces a novel text classification framework based on BERT embeddings of text documents. Several classification algorithms have been applied to the word embeddings of the pre-trained state-of-the-art BERT model. Experiments show that the random forest classifier obtains higher accuracy than the decision tree and k-nearest neighbor (KNN) algorithms. Furthermore, the obtained results have been compared with existing work and show up to 50% improvement in accuracy. In the future, this work can be extended by building a hybrid recommender system that combines content-based matching of documents with similar features and user-centric interests. This study shows promising results and validates the proposed methodology as viable for text classification.
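The classifier comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn and NumPy are available, and it uses randomly generated stand-in vectors in place of real BERT embeddings so that it runs without downloading a model. The 768-dimensional size (that of BERT-base), the helper names `make_stand_in_embeddings` and `compare_classifiers`, and all hyperparameters are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def make_stand_in_embeddings(n_docs=300, dim=768, n_classes=3, seed=0):
    """Return (X, y): random vectors shaped like BERT-base document embeddings.

    Placeholder only (assumption): in a real pipeline X would hold one
    embedding per document, produced by a pre-trained BERT model.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_classes, size=n_docs)
    # One random center per class, plus noise around it for each document.
    centers = rng.normal(0.0, 1.0, size=(n_classes, dim))
    X = centers[y] + rng.normal(0.0, 2.0, size=(n_docs, dim))
    return X, y


def compare_classifiers(X, y, seed=0):
    """Fit the three classifiers on one split; return test accuracy per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    # Hyperparameters are illustrative defaults, not values from the paper.
    models = {
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "knn": KNeighborsClassifier(n_neighbors=5),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


if __name__ == "__main__":
    X, y = make_stand_in_embeddings()
    for name, acc in compare_classifiers(X, y).items():
        print(f"{name}: {acc:.3f}")
```

Accuracies obtained on the synthetic vectors say nothing about the paper's results; to reproduce the comparison, `X` would have to be replaced with embeddings extracted from the actual document collection.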


Paper Citation


in Harvard Style

Yeasmin S., Afrin N., Saif K. and Huq M. (2022). Development of a Text Classification Framework using Transformer-based Embeddings. In Proceedings of the 11th International Conference on Data Science, Technology and Applications - Volume 1: DATA, ISBN 978-989-758-583-8, pages 74-82. DOI: 10.5220/0011268000003269


in Bibtex Style

@conference{data22,
author={Sumona Yeasmin and Nazia Afrin and Kashfia Saif and Mohammad Huq},
title={Development of a Text Classification Framework using Transformer-based Embeddings},
booktitle={Proceedings of the 11th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2022},
pages={74--82},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011268000003269},
isbn={978-989-758-583-8},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 11th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Development of a Text Classification Framework using Transformer-based Embeddings
SN - 978-989-758-583-8
AU - Yeasmin S.
AU - Afrin N.
AU - Saif K.
AU - Huq M.
PY - 2022
SP - 74
EP - 82
DO - 10.5220/0011268000003269