Multilingual Grammar Extraction from OCRed Grammars


SNIC 2017/1-190


SNAC Medium

Principal Investigator:

Harald Hammarström


Uppsala universitet

Start Date:


End Date:


Primary Classification:

60201: Comparative Language Studies and Linguistics

Secondary Classification:

10208: Language Technology (Computational Linguistics)




The diversity of the world's 6,500 languages embodies a wealth of information on the communication machinery inside our heads as well as on the history of populations. Traditionally, language comparison has been done manually by humans reading grammatical descriptions, but the number of languages and books is now far beyond human capacity. In the present project we propose to exploit an existing collection of over 9,000 digitized grammatical descriptions spanning thousands of languages in order to greatly empower language comparison. The central research challenge is to develop the notion of a language profile -- a representation of the grammar of a language (a complex system) suitable for cross-comparison. The task of specifying such a profile is novel in this form: the profile must be parametrized for granularity and must be practical to extract automatically from raw descriptive text. Toward this goal, the applicant team combines proficiency in state-of-the-art Natural Language Processing techniques with expertise in linguistic diversity, aided in the latter by a panel of international experts. Two concrete questions, one concerning population history and one concerning language processing, will be targeted within the scope of the project. More broadly, the research opens up possibilities for addressing a variety of questions in linguistic diversity and even beyond linguistics.