THE PROCESS OF COLLECTING A DATABASE, ANNOTATING SENTENCES, AND TOKENIZING IN THE CREATION OF A DEPENDENCY PARSING OF THE UZBEK LANGUAGE

Authors

DOI:

https://doi.org/10.56292/SJFSU/vol31_iss3/a120

Keywords:

Annotation, steps, tokenization, lemmatization, text selection, documentation, guideline, result, process

Abstract

This article gives a detailed overview of the stages involved in building a dependency treebank for the Uzbek language. It outlines five simplified but essential stages commonly adopted in international practice when constructing a hierarchically annotated corpus for any language: text selection; pre-processing, including the choice of tools and resources; annotation; documentation of language-specific guidelines and the treatment of non-universal linguistic features; and, finally, transliteration.
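As a rough illustration of how the pre-processing stage can hand material to the annotation stage, the sketch below tokenizes one Uzbek sentence and writes it as an unannotated record. It assumes the treebank uses the CoNLL-U format of Universal Dependencies, which the abstract does not state. The regular expression and the names `TOKEN_RE` and `to_conllu` are hypothetical and are not the authors' tooling. The tokenizer is deliberately naive: it only keeps the apostrophe-like marks of Uzbek Latin script (as in o‘ and g‘) inside words.

```python
import re

# Hypothetical, naive tokenizer for Uzbek Latin script (an assumption,
# not the tool used in the article). A token is either:
#   - a word, optionally containing apostrophes or hyphens between
#     word characters, and optionally ending in the mark ‘ (as in "tog‘"), or
#   - a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:['‘’-]\w+)*‘?|[^\w\s]")


def to_conllu(sentence: str, sent_id: str) -> str:
    """Return an unannotated CoNLL-U record for one sentence.

    Only ID and FORM are filled in; the remaining eight columns
    (LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) hold the
    placeholder "_" and are left for the annotation stage.
    """
    tokens = TOKEN_RE.findall(sentence)
    lines = [f"# sent_id = {sent_id}", f"# text = {sentence}"]
    for index, form in enumerate(tokens, start=1):
        lines.append("\t".join([str(index), form] + ["_"] * 8))
    # A CoNLL-U sentence is terminated by a blank line.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    print(to_conllu("O‘zbek tili turkiy tillar oilasiga kiradi.", "demo-1"))
```

A record produced this way would still need lemmas, part-of-speech tags, and dependency relations added by annotators or a parser. Closing quotation marks attached to a word and multiword tokens are not handled.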

About the authors

  • , National University of Uzbekistan

    Doctoral student at the National University of Uzbekistan

  • , Urgench State University

    Student at Urgench State University

Published

2025-06-25

How to cite

THE PROCESS OF COLLECTING A DATABASE, ANNOTATING SENTENCES, AND TOKENIZING IN THE CREATION OF A DEPENDENCY PARSING OF THE UZBEK LANGUAGE. (2025). Scientific Journal of the Fergana State University, 31(3), 120. https://doi.org/10.56292/SJFSU/vol31_iss3/a120