Disinformation, Fakes and Propaganda Identifying Methods in Online Messages Based on NLP and Machine Learning Methods
Journal article
MNiSW
20
2024 list
Status: | |
Authors: | Vysotska Victoria, Przystupa Krzysztof, Chyrun Lyubomyr, Vladov Serhii, Ushenko Yuriy A., Uhryn Dmytro, Hu Zhengbing |
Disciplines: | |
Year of publication: | 2024 |
Document version: | Print | Electronic |
Language: | English |
Issue number: | 5 |
Volume: | 16 |
Pages: | 57 - 85 |
Scopus® citations: | 0 |
Databases: | Scopus |
Result of statutory research | NO |
Conference material: | NO |
OA publication: | YES |
Licence: | |
Access method: | Publisher's website |
Text version: | Final published version |
Time of release: | At the time of publication |
OA publication date: | 8 October 2024 |
Abstracts: | English |
A new method of propaganda analysis is proposed to identify signs of, and changes in, the behavioural dynamics of coordinated groups, based on machine learning at the disinformation-processing stages. In the course of the work, two models were implemented to recognise propaganda in textual data: one at the message level and one at the phrase level. In analysing and recognising text data, in particular fake news on the Internet, an important component of NLP (natural language processing) technology is the classification of words in text data. In this context, classification is the assignment of textual data to one or more predefined categories or classes. For this purpose, the task of binary text classification was solved. Both models are built on logistic regression. Data preparation and feature extraction use TF-IDF (Term Frequency – Inverse Document Frequency) vectorisation, the BOW (Bag-of-Words) model, POS (Part-Of-Speech) tagging, word embedding using the Word2Vec two-layer neural network, and manual feature extraction aimed at identifying specific techniques of political propaganda in texts. Analogues of the project under development are analysed, and the subject area (the propaganda used in the media and the basis of its production methods) is studied. The software is implemented in Python using the seaborn, matplotlib, gensim, spaCy, NLTK (Natural Language Toolkit), NumPy, pandas and scikit-learn libraries. The model's score for propaganda recognition was 0.74 at the phrase level and 0.99 at the message level. Implementing the results will significantly reduce the time required to decide on the most appropriate counter-disinformation measures concerning the identified coordinated groups that generate disinformation, fake news and propaganda.
Several classification algorithms were used to detect fake and non-fake news from Internet resources and social mass media. Their identification accuracies (non-fakes/fakes) are: decision tree (0.98/0.9903), k-nearest neighbours (0.83/0.999), random forest (0.991/0.933), multilayer perceptron (0.9979/0.9945), logistic regression (0.9965/0.9988) and the Bayes classifier (0.998/0.913). Logistic regression (0.9965), the multilayer perceptron (0.9979) and the Bayes classifier (0.998) perform best at identifying non-fake news. Logistic regression (0.9988), the multilayer perceptron (0.9945) and k-nearest neighbours (0.999) perform best at identifying fake news. |
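
The message-level approach named in the abstract (TF-IDF features fed to a logistic regression for binary text classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the example messages, labels, function names and hyperparameters below are all hypothetical, and the code uses only the Python standard library so that it runs without dependencies. In practice scikit-learn's `TfidfVectorizer` and `LogisticRegression` would normally replace the hand-written helpers.

```python
import math
from collections import Counter


def tokenize(text):
    # Lower-case whitespace tokenisation; a real pipeline would use NLTK or spaCy.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency for every term seen in the corpus.
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(text, idf):
    # Sparse, L2-normalised TF-IDF vector (a dict); terms not in idf are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(vectors, labels, epochs=300, lr=0.5):
    # Stochastic gradient descent on the log-loss, without regularisation.
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            z = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
            error = sigmoid(z) - y
            bias -= lr * error
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * error * v
    return weights, bias


def predict_proba(text, idf, weights, bias):
    # Estimated probability that the message belongs to class 1.
    vec = tfidf(text, idf)
    return sigmoid(bias + sum(weights.get(t, 0.0) * v for t, v in vec.items()))


# Hypothetical toy corpus: 1 = propaganda-like, 0 = neutral.
texts = [
    "the glorious leader always defeats the evil enemy",
    "traitors and enemies want to destroy our great nation",
    "the council meeting is scheduled for monday morning",
    "the museum opens a new exhibition next week",
]
labels = [1, 1, 0, 0]

idf = fit_idf(texts)
weights, bias = train_logreg([tfidf(t, idf) for t in texts], labels)

for text in texts:
    print(round(predict_proba(text, idf, weights, bias), 3), text)
```

A message is assigned to the positive class when its predicted probability exceeds 0.5. The scores reported in the abstract come from the authors' own data and feature set, which this toy corpus does not reproduce.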