Assisted transcription of historical documents by keyword spotting: a performance model

Santoro, Adolfo; De Stefano, Claudio; Marcelli, Angelo

doi:10.1109/ICDAR.2017.162

We propose a model for estimating the time needed to transcribe a large collection of historical handwritten documents when the transcription is assisted by a keyword spotting system following the query-by-string approach. The model assumes that the system is segmentation-based and that, for each item, it outputs either a transcription (which may be right or wrong) or a reject. We also assume that any other information the system may need is obtained from the training set. The model has been validated by comparing its estimates with the actual time required for the manual transcription of pages from the Bentham dataset. Finally, we discuss possible ways of extending the model to other kinds of keyword spotting systems, such as those that output a ranked list of alternatives and/or adopt the query-by-example approach.
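The abstract does not give the model's equations, so the following is only an illustrative sketch of how a time estimate of this kind could be formed, not the authors' formulation. It assumes that each word image ends up in exactly one of the three outcomes named above (right, wrong, reject) and that each outcome has a fixed average handling time for the user. The names `SpotterRates`, `t_validate`, `t_correct` and `t_manual`, and the numbers in the usage example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpotterRates:
    """Fractions of word images per outcome (hypothetical parameters)."""
    correct: float  # system output is the right transcription
    wrong: float    # system output is a wrong transcription
    reject: float   # system gives no transcription


def estimate_assisted_time(n_words: int, rates: SpotterRates,
                           t_validate: float, t_correct: float,
                           t_manual: float) -> float:
    """Expected time (seconds) to transcribe n_words with system assistance.

    Assumption: the user confirms right outputs (t_validate), fixes wrong
    outputs (t_correct), and types rejected words from scratch (t_manual).
    """
    total = rates.correct + rates.wrong + rates.reject
    if abs(total - 1.0) > 1e-9:
        raise ValueError("outcome rates must sum to 1")
    per_word = (rates.correct * t_validate
                + rates.wrong * t_correct
                + rates.reject * t_manual)
    return n_words * per_word


def estimate_manual_time(n_words: int, t_manual: float) -> float:
    """Baseline: every word is transcribed by hand."""
    return n_words * t_manual


if __name__ == "__main__":
    # Made-up rates and timings, for illustration only.
    rates = SpotterRates(correct=0.5, wrong=0.25, reject=0.25)
    assisted = estimate_assisted_time(100, rates, t_validate=1.0,
                                      t_correct=4.0, t_manual=4.0)
    manual = estimate_manual_time(100, t_manual=4.0)
    print(f"assisted: {assisted:.1f} s, manual: {manual:.1f} s")
```

Under these assumptions, assistance saves time only when confirming or fixing system output is cheaper on average than typing a word from scratch, which is the kind of trade-off a performance model of this type is meant to quantify.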