Abstract
Natural language processing (NLP) is a core area of artificial intelligence, yet many low-resource languages, including Uzbek, still lack sufficient datasets and optimized models. This paper presents a framework and a preliminary baseline experiment for Uzbek text classification. The framework covers four stages: text preprocessing, feature extraction, model selection, and evaluation. Two baseline models are compared: TF-IDF with Logistic Regression and TF-IDF with a Support Vector Machine. Both models are evaluated using accuracy, precision, recall, and F1-score. The framework is intended to support future Uzbek NLP applications in education, media, document classification, and automated text processing.
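To make the two baselines concrete, a minimal sketch using scikit-learn is given here. It is illustrative only: the toy corpus, the class labels, the apostrophe normalization in `preprocess`, the token pattern, and the macro averaging of the metrics are all assumptions, not the dataset or settings of this study.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (NOT the study's data): two invented classes.
train_texts = [
    "futbol jamoa gol o'yin",
    "jamoa futbol chempionat gol",
    "o'yin gol murabbiy futbol",
    "bank kredit foiz soliq",
    "soliq bank byudjet kredit",
    "foiz kredit bozor bank",
]
train_labels = ["sport", "sport", "sport", "iqtisod", "iqtisod", "iqtisod"]
test_texts = ["futbol gol jamoa", "bank kredit foiz"]
test_labels = ["sport", "iqtisod"]

# Assumed preprocessing: Uzbek Latin text is typed with several apostrophe
# variants, so they are mapped to a single plain apostrophe.
APOSTROPHE_VARIANTS = "\u02bb\u02bc\u2018\u2019`"


def preprocess(text):
    """Lowercase, unify apostrophes, and collapse whitespace."""
    text = text.lower()
    for ch in APOSTROPHE_VARIANTS:
        text = text.replace(ch, "'")
    return re.sub(r"\s+", " ", text).strip()


def evaluate(classifier):
    """Fit TF-IDF + classifier on the toy data and return the four metrics."""
    model = make_pipeline(
        TfidfVectorizer(
            preprocessor=preprocess,
            # Letters, optionally joined by apostrophes (keeps "o'yin" whole).
            token_pattern=r"[^\W\d_]+(?:'[^\W\d_]+)*",
        ),
        classifier,
    )
    model.fit(train_texts, train_labels)
    predicted = model.predict(test_texts)
    precision, recall, f1, _ = precision_recall_fscore_support(
        test_labels, predicted, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(test_labels, predicted),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


results = {
    "logreg": evaluate(LogisticRegression(max_iter=1000)),
    "svm": evaluate(LinearSVC()),
}
for name, scores in results.items():
    print(name, scores)
```

A linear SVM (`LinearSVC`) stands in for the Support Vector Machine here because linear kernels are a common choice for sparse TF-IDF features; the kernel used in the experiment is not specified in this abstract.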
References
[1] A. Vaswani et al., “Attention Is All You Need,” in Advances in Neural Information Processing Systems, 2017.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of NAACL-HLT, 2019.
[3] A. Conneau et al., “Unsupervised Cross-lingual Representation Learning at Scale,” in Proceedings of ACL, 2020.
[4] E. Kuriyozov, U. Salaev, S. Matlatipov, and G. Matlatipov, “Text Classification Dataset and Analysis for Uzbek Language,” arXiv preprint arXiv:2302.14494, 2023.
[5] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv preprint arXiv:1301.3781, 2013.
[6] Y. Goldberg, Neural Network Methods for Natural Language Processing. Morgan & Claypool Publishers, 2017.
