SNIC SUPR
Analysis of Google Ngram data
Dnr: SNIC 2019/6-82
Type: SNIC Small Compute
Principal Investigator: Sverker Sikström
Affiliation: Lunds universitet
Start Date: 2019-11-07
End Date: 2020-12-01
Primary Classification: 50101: Psychology (excluding Applied Psychology)

Allocation

Abstract

The Google Ngram data is a large dataset of n-grams. Each record consists of five consecutive words followed by how often that sequence occurred in a given year, e.g. "lisa has gone to school 234 1978". There are approximately ten languages, each with about 700 n-gram files that may expand to 10 GB each. I would like to extract the n-grams that contain pronouns (e.g. he, she) and measure their valence. This makes it possible to measure how groups are evaluated in different languages, contexts, and times.
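
The extraction step described in the abstract could look roughly like the following. This is a minimal sketch, not the project's actual pipeline. It assumes each line has the layout of the example in the abstract (five words, then a count, then a year, separated by whitespace). The pronoun set, the `VALENCE` lexicon, and the function names are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict

# Assumption: illustrative pronoun set; the project may use a longer list.
PRONOUNS = {"he", "she"}

# Hypothetical toy valence lexicon (word -> score in [-1, 1]).
# A real analysis would substitute a validated word list or model.
VALENCE = {"happy": 1.0, "kind": 0.8, "sad": -0.8, "angry": -1.0}


def parse_line(line):
    """Parse 'w1 w2 w3 w4 w5 count year'; return None for malformed lines."""
    parts = line.split()
    if len(parts) != 7:
        return None
    *words, count, year = parts
    try:
        return [w.lower() for w in words], int(count), int(year)
    except ValueError:
        return None


def pronoun_valence(lines):
    """Return {(pronoun, year): count-weighted mean valence of the n-grams
    that contain the pronoun and at least one word from the lexicon}."""
    totals = defaultdict(lambda: [0.0, 0])  # (pronoun, year) -> [weighted sum, count]
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        words, count, year = parsed
        scored = [VALENCE[w] for w in words if w in VALENCE]
        if not scored:
            continue  # no scorable word in this n-gram
        mean_val = sum(scored) / len(scored)
        for pronoun in PRONOUNS.intersection(words):
            acc = totals[(pronoun, year)]
            acc[0] += mean_val * count
            acc[1] += count
    return {key: s / n for key, (s, n) in totals.items() if n > 0}
```

Because `pronoun_valence` accepts any iterable of lines, an open file object can be passed to it directly, so a file that expands to 10 GB is processed line by line instead of being loaded into memory.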