This paper addresses the pressing challenge of data scarcity in low‐resource languages,
focusing on a practical methodology for building parallel corpora for the Kyrgyz–Kazakh language
pair. The lack of extensive, high‐quality bilingual datasets remains a critical bottleneck
in developing neural machine translation (NMT) systems for such languages. To address this
issue, the study proposes and evaluates a structured approach to generating parallel data
using artificial intelligence tools. The methodology includes selecting an optimal AI‐based
translation system based on accessibility (free availability), translation accuracy, and processing
efficiency. Using a test dataset of 1,000 sentences, the most effective system was identified and
subsequently employed to construct a large‐scale Kyrgyz–Kazakh parallel corpus. A manual
error analysis revealed that approximately 0.5% of the translations contained inaccuracies,
highlighting the need for additional post‐editing and refinement. The findings contribute
to the broader development of linguistic resources for low‐resource language pairs and provide
practical insights into the effective application of modern AI systems for parallel data creation.
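
The selection step summarized above (free availability as a hard requirement, then accuracy and processing efficiency measured on the test set) can be illustrated with a minimal sketch. It is not the procedure used in the study: the `Candidate` fields, the system names, the sample figures, and the 0.7 accuracy weight are all hypothetical, chosen only to show how the three criteria might be combined and how a manual error rate such as the reported 0.5% would be computed.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str              # hypothetical system label
    is_free: bool          # accessibility criterion (free availability)
    accuracy: float        # share of test sentences judged correct, in [0, 1]
    seconds_per_1k: float  # time to translate the 1,000-sentence test set (> 0)


def select_system(candidates, accuracy_weight=0.7):
    """Pick the best freely available system by a weighted accuracy/speed score.

    The weighting scheme is an assumption made for illustration only.
    """
    free = [c for c in candidates if c.is_free]
    if not free:
        raise ValueError("no freely available system among the candidates")
    fastest = min(c.seconds_per_1k for c in free)

    def score(c):
        speed = fastest / c.seconds_per_1k  # 1.0 for the fastest free system
        return accuracy_weight * c.accuracy + (1 - accuracy_weight) * speed

    return max(free, key=score)


def error_rate(flags):
    """Fraction of manually reviewed translations flagged as inaccurate."""
    return sum(flags) / len(flags)


# Made-up figures for three hypothetical systems.
candidates = [
    Candidate("system_a", True, 0.95, 120.0),
    Candidate("system_b", True, 0.90, 60.0),
    Candidate("system_c", False, 0.99, 30.0),  # excluded: not freely available
]
best = select_system(candidates)

# 5 flagged translations out of 1,000 reviewed corresponds to 0.5%.
sample_rate = error_rate([True] * 5 + [False] * 995)
```

Treating free availability as a filter rather than a weighted term reflects its role as an accessibility requirement. How accuracy and speed should be traded off is a design choice that the abstract does not specify.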