Methodology for creating datasets of parallel sentences in low-resource languages by using AI
Journal article
MNiSW points: 20 (2024 list)
| Status: | |
| Authors: | Abduali Balzhan, Miłosz Marek, Tukeyev Ualsher, Karibayeva Aidana |
| Disciplines: | |
| Year of publication: | 2025 |
| Document version: | Print | Electronic |
| Language: | English |
| Journal issue: | 9 |
| Volume: | 8 |
| Pages: | 13 - 23 |
| Result of statutory research: | No |
| Conference material: | No |
| OA publication: | Yes |
| License: | |
| Access method: | Open journal |
| Text version: | Final published version |
| Time of release: | At the time of publication |
| OA publication date: | 10 October 2025 |
| Abstracts: | English |
| This study addresses the crucial problem of data scarcity for low-resource languages, with a particular focus on a methodology for creating parallel corpora in two low-resource languages. The lack of large-scale, high-quality bilingual datasets significantly hinders the development of neural machine translation systems for such languages. This study proposes and validates a methodology for creating such datasets. The methodology involves selecting an AI system to generate a parallel corpus according to criteria of accessibility (free access), translation quality, and efficiency, evaluated on a test dataset of 1000 sentences. Subsequently, a substantial Kyrgyz-Kazakh parallel corpus was created using the selected AI system. However, manual error analysis revealed that approximately 0.5% of the translations contained inaccuracies, indicating the need for further post-editing and model refinement. This study contributes to the development of resources for low-resource language pairs and provides practical guidance on the effective creation of parallel corpora using modern AI systems. |
