SNIC SUPR
Word spotting in Historical Manuscripts
Dnr: SNIC 2018/3-232
Type: SNAC Medium
Principal Investigator: Anders Hast
Affiliation: Uppsala universitet
Start Date: 2018-05-05
End Date: 2019-06-01
Primary Classification: 10299: Other Computer and Information Science

Abstract

Word spotting in historical manuscripts is a computationally heavy task. Finding one word in a document takes up to a couple of minutes per page, so finding hundreds of words in hundreds of pages takes weeks. HPC resources are therefore necessary to test new algorithms and parameter settings. The overall goal of the project is to find ways to do semi-automatic transcription of historical manuscripts. We will investigate a new way to do query expansion that improves overall average precision to a level acceptable for automatic transcription. In the end we want a human in the loop, and we will investigate ways to do semi-automatic transcription in the final part of the project. First, however, we need to improve all the algorithms involved and run them on several collections of manuscripts. A full evaluation of a single word on a 100-page collection takes about 1000 hours on a single core, and we need to perform this for several dozen words to compare with the state of the art.
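
The scale of the request can be illustrated with a rough calculation. The sketch below takes the figure of about 1000 core-hours per word from the abstract; the word count (three dozen, as one reading of "several dozen") and the number of cores are illustrative assumptions, not figures from the project, and perfect parallel scaling is assumed for simplicity.

```python
# Back-of-the-envelope estimate of the compute budget described in the abstract.
# The per-word cost (about 1000 core-hours on a 100-page collection) comes from
# the abstract; the word count and core count below are illustrative assumptions.


def total_core_hours(words: int, core_hours_per_word: float = 1000.0) -> float:
    """Core-hours needed to fully evaluate `words` query words on one collection."""
    return words * core_hours_per_word


def wall_clock_days(core_hours: float, cores: int) -> float:
    """Wall-clock days if the work parallelises perfectly over `cores` cores."""
    return core_hours / cores / 24.0


if __name__ == "__main__":
    words = 36    # assumption: "several dozen" read as three dozen
    cores = 500   # assumption: hypothetical number of cores used at once
    budget = total_core_hours(words)
    print(f"{budget:.0f} core-hours in total")
    print(f"{wall_clock_days(budget, 1):.0f} days on a single core")
    print(f"{wall_clock_days(budget, cores):.1f} days on {cores} cores")
```

Under these assumptions a single core would need several years for one collection, which is why the evaluation is only practical on a cluster.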