MGrams: A corpus of large n-grams from music lyrics
The corpus consists of music lyrics which, to avoid copyright infringement, are split into larger chunks. Each chunk is marked as a document and, when the information is available, is assigned an artist, an approximate date, and a genre.
Each document consists of verses, which are marked in the corpus with information similar to that on the doc nodes.
The corpus is loaded into the NoSKE corpus management system and can be queried online, for instance to see which artists are most frequently associated with the extracted chunks.
Please get in touch directly if you want a dump of the corpus.
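An artist-frequency count of this kind could also be approximated offline on a dump. The following is a minimal sketch, assuming the dump uses the vertical format common to NoSKE corpora (one token per line, structures as XML-like tags). The tag names `doc` and `verse`, the attribute names `artist`, `date` and `genre`, and the sample data are all hypothetical and may not match the actual corpus.

```python
import re
from collections import Counter

# Hypothetical sample in vertical format. The structure and attribute
# names below are assumptions for illustration, not the real corpus schema.
SAMPLE = """\
<doc artist="Artist A" date="1995" genre="rock">
<verse>
hello
world
</verse>
</doc>
<doc artist="Artist B" date="2003" genre="pop">
<verse>
la
la
</verse>
</doc>
<doc artist="Artist A" date="1997" genre="rock">
<verse>
again
</verse>
</doc>
"""

# Matches an opening <doc ...> structure line and captures its attributes.
DOC_TAG = re.compile(r'^<doc\b(.*)>$')
# Matches individual name="value" attribute pairs.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def artist_frequencies(vertical_text):
    """Count how many doc structures are attributed to each artist."""
    counts = Counter()
    for line in vertical_text.splitlines():
        match = DOC_TAG.match(line.strip())
        if match:
            attrs = dict(ATTR.findall(match.group(1)))
            # Documents without an artist attribute are grouped together.
            counts[attrs.get("artist", "unknown")] += 1
    return counts


if __name__ == "__main__":
    for artist, n in artist_frequencies(SAMPLE).most_common():
        print(artist, n)
```

This counts documents (chunks) per artist rather than tokens. Counting tokens instead would mean tracking the current document's artist and incrementing on every non-tag line.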
This page last edited on 23 October 2025.
