N-gram Retrieval for Word Spotting in Historical Handwritten Collections

doi:10.14273/unisa-5365

Collections of handwritten documents of historical interest are often small, ranging from a few dozen to a few hundred pages, but they may have features specific to the collection that make them interesting to scholars. Retrieving words from them can be difficult, as handwriting recognition techniques applied to images of such documents can produce unsatisfactory results. To circumvent this, KeyWord Spotting (KWS) techniques promise to retrieve words without performing recognition explicitly. KWS systems often require a reference dictionary of the words that the system can search for in the document, usually built by transcribing a small portion of the collection by hand. This limits the search to words in the dictionary (in-vocabulary, InVoc) and introduces the problem of out-of-vocabulary (OOV) words, which cannot be searched for. Enlarging the reference dictionary by manually transcribing new word examples may seem an immediate way of limiting the OOV problem, but manually labelling pages is expensive and time-consuming. For small collections, the time required to transcribe and label even a few pages can be a non-negligible obstacle, calling into question the usefulness of automated word retrieval systems. In this thesis, we focus on a KWS system that can adapt to the lack of data. First, we define a mathematical model to show analytically how the different components of the system affect the time needed to use it, and we estimate the time gain that the system brings to the transcription of a small collection of handwritten historical documents. After highlighting the importance of speeding up the manual annotation process, we then propose a semiautomatic method for image annotation.
In particular, we present a learning-free end-to-end approach that includes a line segmentation algorithm and an algorithm for aligning transcripts to images of handwritten text. The former can extract lines of text with a curved baseline, while the latter makes it easy to obtain their transcripts. Finally, we propose a KWS system that bases the search on recognizing sequences of characters (N-grams) rather than trying to find whole words directly. Studies on motor behaviour have shown that writing is the result of very fast and precise motor actions that can be automated. In the learning phase, an individual tends to develop motor programs associated with simple actions that are executed with high frequency. It is plausible to assume that motor programs for writing develop in relation to sequences of a few characters. This would mean that each time a subject writes an N-gram associated with a motor program, he or she produces an ink trace similar to all the other traces of that N-gram. This repeated similarity in the execution of the same movements could make N-grams recognizable, making them ideal candidates for handwritten cursive recognition. The results of the experiments show that a KWS system can effectively reduce the time required to transcribe a document collection. Moreover, they show that labelling data and creating reference dictionaries is indeed an extremely costly operation and that methods that accelerate such processes are crucial. The experiments with the proposed KWS system show that focusing the search on the N-gram space enables InVoc and OOV words to be retrieved equally well, with similar retrieval rates for both groups of words. [edited by Author]
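
The intuition that an N-gram search space can reach OOV words may be illustrated with a minimal text-only sketch. It is not the system described in the thesis, which works on ink traces in page images; it only assumes, for illustration, that the reference dictionary is a list of transcribed words and that N-grams are character bigrams and trigrams. The names `char_ngrams`, `segment`, and the example words are hypothetical.

```python
from typing import Iterable, List, Optional, Set


def char_ngrams(word: str, sizes: Iterable[int] = (2, 3)) -> Set[str]:
    """Return every contiguous character N-gram of the given sizes."""
    return {
        word[i:i + n]
        for n in sizes
        for i in range(len(word) - n + 1)
    }


def segment(word: str, inventory: Set[str]) -> Optional[List[str]]:
    """Split `word` into consecutive N-grams drawn from `inventory`.

    Returns one valid segmentation, or None if no segmentation exists.
    """
    # best[i] holds a segmentation of word[:i], or None if unreachable.
    best: List[Optional[List[str]]] = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            prev = best[start]
            if prev is not None and word[start:end] in inventory:
                best[end] = prev + [word[start:end]]
                break
    return best[len(word)]


# Hypothetical reference dictionary built from a few transcribed lines.
in_vocabulary = ["letter", "winter", "castle"]

# Assumption: every N-gram seen in the dictionary is available as a reference.
inventory: Set[str] = set()
for w in in_vocabulary:
    inventory |= char_ngrams(w)

# "caster" is OOV as a whole word, yet its N-grams occur in the dictionary.
print(segment("caster", inventory))  # ['cas', 'ter']
# "queen" contains N-grams never seen in the dictionary.
print(segment("queen", inventory))   # None
```

Under these assumptions, a query word is searchable whenever it can be tiled with N-grams that already have reference examples, whether or not the word itself was ever transcribed.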

N-gram Retrieval for Word Spotting in Historical Handwritten Collections, 21 March 2023, Academic Year 2021-2022. [10.14273/unisa-5365]