For most of my professional and hobby experience in data science, I haven’t ventured too far into text analytics. Having some more time on my hands with the end of the election, I decided to start learning more about this area of the the machine learning field.
To play around with text mining, I will conduct topic modeling, specially Latent Dirichlet allocation, on this Kaggle dataset, a collection of over 380,000 song lyrics collected from the music site MetroLyrics.
Topic models are a group of statistical methods for classifying documents by the contents of their text. The general intuition behind these methods is that documents, or in this case song lyrics, can be classified based on the occurence and frequency of the words contained within them.
For example, documents about cancer will contain words like health, lungs, chemotherapy whereas documents about web design would likely not and more likely have words such as HTML, markup, clicks.
Latent Dirichlet allocation, or LDA for short, is an unsupervised method that takes in a large corpus of text and attempts to discover these groupings of the documents based on all the words given in the corpus. It assumes that topics are correlated to a mixture of words specific to that topic. By analyzing the frequency of different words in the corpus, LDA back-tracks find these groupings. This outputs a model that then allows us to compute probability distributions that a specific document will contain references to these caculated topics.
LDA can, and is, used for more than just text mining because it is a generative probabilistic method that “allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar”. But its more common usage is for text analysis, and is what I will be using here.
I wanted to explore whether or not genres of music could be differentiated based on the contents of their song lyrics. Although music genres can be characterized by multiple features, most notably their sonic components, my theory was that some genres may feature words more specific to their genre than others.
The dataset contains over 380,000 songs with the features: song title, artist, year of release, genre, and lyrics.
I used Gensim, which is a Python framework for text analysis and topic modeling, that had a great implementation of LDA. Other libraries used were: Pandas, for data manipulation, Seaborn and Matplotlib, for visualization, and NLTK, for natural language processing.
As we can see in this graph, a large portion of the songs uploaded to MetroLyrics are in the Rock genre. This can be attributed to the long history and popularity of rock since 1970, as compared to younger genres such as hiphop, metal or electronic music.
We can also see that there are some songs that were not classified and were labelled as “Not Available” or “Other”. They may be useful to exclude from our analysis.
Here we can see that most of the songs uploaded to the site were from after 1970. Again, this may be useful to exclude outliers.
I made my own feature called song_length just to compare genres by their lyrical character length. Interestingly, hiphop was much more verbose than any other genre, which all generally fall into similar distributions.
To clean the data for topic modeling, I removed all songs that had lyrics of zero length, that did not have a verifiable genre, or that were released before 1970.
I ran the model using this data but could find no clearly definable topics. I figured this might be because the corpus contains many non-English songs. So I used a Python port of Google’s language detection library called langdetect. This way I could classify each song lyric by language and then exclude anything that was not in English.
NOTE: I did try to use Google’s official language detection neural network, CLD3, but couldn’t get the Python bindings to work so I went with the less elegant solution.
I ran the model roughly following this guide by John Barber. The model went through the whole corpus of song lyrics to classify them under two different topics. (I could not find anything significant when I ran the analysis when I increased to N>2 topics).
With the model saved to file and loaded into memory, I then computed the probability for each song belonging to the two topics. I reshaped the data to be displayed in the boxplot below.
What you see here is that through topic modeling we can determine that hiphop contains lyrics that are significantly, and meaninfully distinct from other genres of music. All other genres of music use a similar lexicon (categorized by
topic_2), while hiphop incorporates words that are not commonly found in other types of music (categorized by
I think this is extremely fascinating. Part of this could be hiphop’s large usage of slang, but I also believe this may be due to significant thematic differences in the type of content that hiphop artists discuss in their music.
Some things that I (or others) should look into further:
- Has the usage of language changed over time in each genre’s history?
- Can we use machine learning and statistics to find more meaningful distinctions between genres?
- By extension, can we incorporate other features of music, such as its sonic characteristics, to do so?
Thank you for reading my first forway into text mining! You can find my code in my Blog Projects repo under
This post was inspired by this Medium post, which was written in R. As a big Python fan I wanted to run a similar analysis using the tools I was used to to help myself learn more about text analysis.