|
Recognizing words in Slavic languages is difficult because of the complicated declension of
words and the variety of diacritical marks. Polish is a representative of the West Slavic languages, which are written
in the Latin script. Automatic handwritten word recognition in these languages is further hampered by the poor
recognition rate of letters with diacritical marks and by the lack of good handwritten text corpora for languages with
declension. The main aim of this research is to investigate the possibility of correcting misspellings in the final
phase of recognizing Polish words. The method developed is based on letter recognition by means of convolutional
neural networks (CNNs) and text matching algorithms applied to the resulting words. In the first stage, we use a
convolutional neural network of our own design for character recognition. In the second stage, after combining the letters
into words, we apply a post-processing error correction method that improves the recognition efficiency
for misspelled words. We evaluated the efficiency of word matching for several measures of word
similarity, namely edit distance (Damerau-Levenshtein), string matching (Sørensen-Dice), and a list of candidates.
In addition, we examine how word length and the number of misplaced letters affect the behaviour of the
algorithms used. The analysis is carried out for bigram and trigram methods. Combining different
word similarity measures yields a better selection of the lists of proposed words. The article
proposes an innovative post-processing method for correcting errors in the recognition of Polish words, with the
rate of correct word matching ranging from 76% to 99%, depending on the similarity measure used and the word
length.
|
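
The similarity measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`osa_distance`, `dice`, `rank_candidates`) are hypothetical, the edit distance is the restricted "optimal string alignment" variant of Damerau-Levenshtein rather than the unrestricted one, and ranking candidates by edit distance with the Sørensen-Dice score as a tie-breaker is only an assumed way of combining the measures.

```python
from collections import Counter


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def ngrams(word: str, n: int) -> Counter:
    """Multiset of character n-grams (n=2 for bigrams, n=3 for trigrams)."""
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))


def dice(a: str, b: str, n: int = 2) -> float:
    """Sørensen-Dice coefficient over character n-grams, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:  # both words are shorter than n characters
        return 1.0 if a == b else 0.0
    overlap = sum((ga & gb).values())  # multiset intersection
    return 2 * overlap / total


def rank_candidates(word: str, lexicon, n: int = 2, top: int = 3) -> list:
    """Assumed combination: sort by edit distance, break ties by Dice score."""
    key = lambda w: (osa_distance(word, w), -dice(word, w, n))
    return sorted(lexicon, key=key)[:top]
```

For example, `rank_candidates("kto", ["kot", "kat", "dom"], top=1)` puts `"kot"` first, because a single adjacent transposition turns `"kto"` into `"kot"`. Passing `n=3` switches the Sørensen-Dice score from bigrams to trigrams.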