
Text mining Kanye West (Part 1)

Motivation

I became interested in text mining after attending a talk by Julia Silge at the Portland R User meetup group. She showed off some of the concepts from her new book Text Mining with R: A Tidy Approach, co-authored with David Robinson. The book, which I highly recommend, covers some of the fundamental techniques of text mining using tidytext, an R package created by David and Julia.

Text mining is of personal interest to me because I often work with open-ended survey data in my day job as a market research analyst. The goal of this post is to practice some of those text mining techniques on a unique dataset.

All code and data for this project are available on GitHub.

The data

Survey data is my bread and butter, but even I can admit that it isn't always the most exciting thing to write about. For this analysis I wanted to look at lyrics, and whose lyrics are better to analyze than Kanye West's? More specifically, I collected the lyrics from all of the songs on Kanye's core discography: his seven solo studio albums plus the collaborative Watch the Throne, with no deluxe editions or b-sides. Given the relatively small size of the dataset, I collected the lyrics manually. My source was Genius, from which I pulled each track's album, song title, year released, run-time, featured artist(s), and writer(s), and labeled each track as a song or a skit. I organized the data in Excel and then exported it as a CSV for easy ingestion into R.

Packages

The packages used in this analysis are:

library(tidyverse)
library(scales)
library(tidytext)
library(gridExtra)

Importing and cleaning the data

The dataset was assembled manually and kept fairly simple, so very little cleaning is needed. The first thing I did was remove the text between [ and ] in the Lyrics column. Genius uses text surrounded by square brackets to indicate parts of songs (e.g., verse, chorus, bridge). While these annotations may be useful for more advanced analyses, I chose to remove them for simplicity here. I also re-ordered the factor levels of the Album variable so that they follow the order of album release.

lyrics <- read_csv("Data/Kanye Lyrics (beta2).csv", 
    col_types = cols(Album = readr::col_factor(levels = c("The College Dropout", 
        "Late Registration", "Graduation", 
        "808s & Heartbreak", "My Beautiful Dark Twisted Fantasy", 
        "Watch the Throne", "Yeezus", "The Life of Pablo")))) %>%
  mutate(Lyrics = str_replace_all(Lyrics, "\\[[^]]*]", ""))

glimpse(lyrics)
## Observations: 121
## Variables: 8
## $ Album                <fctr> The College Dropout, The College Dropout...
## $ Song                 <chr> "Intro (Skit)", "We Don't Care", "Graduat...
## $ Year                 <int> 2004, 2004, 2004, 2004, 2004, 2004, 2004,...
## $ `Run-time`           <time> 00:19:00, 03:59:00, 01:22:00, 03:43:00, ...
## $ `Featured artist(s)` <chr> NA, NA, NA, "Syleena Johnson", NA, "GLC, ...
## $ `Writer(s)`          <chr> "Kanye West", "Kanye West, Miri Ben-Ari, ...
## $ Skit                 <chr> "Y", "N", "N", "N", "N", "N", "N", "N", "...
## $ Lyrics               <chr> "Kanye, can I talk to you for a minute? M...

Term frequencies

A common place to start a text mining analysis is to look at term frequencies. This gives us an idea of the words commonly used in our documents of interest. We can examine term frequencies with a few simple commands. The first step is to tokenize the text, in this case into single words (unigrams), using the unnest_tokens() function. unnest_tokens() also does some basic cleaning, like converting text to lowercase and stripping punctuation.

lyrics %>%
  unnest_tokens(word, Lyrics) %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  ggplot(aes(reorder(word, -n), n)) +
  geom_col(show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = -60, hjust = 0, size = 12)) +
  labs(x = "Word", y = "Occurances", title = "Most common words in Kanye's lyrics")

Upon examination we see that the most common words found across Kanye West's lyrics are not that interesting. This is true of almost any corpus of natural language. These words fall into a category of text called stop words, and tidytext provides a quick way to remove them. But before we remove the stop words, we should take note that four of the top 20 most common words in Kanye's lyrics are direct references to himself (I, my, me, I'm). To those familiar with Kanye's antics this will come as no surprise.
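As a quick check on that claim, the snippet below (a minimal sketch; it assumes the default tokenizer keeps contractions like "i'm" intact) pulls those self-referencing tokens out of the same top-20 list.

lyrics %>%
  unnest_tokens(word, Lyrics) %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  # keep only the lowercased self-references noted above
  filter(word %in% c("i", "my", "me", "i'm"))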

lyrics %>%
  unnest_tokens(word, Lyrics) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  ggplot(aes(reorder(word, -n), n)) +
  geom_col(show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = -60, hjust = 0, size = 12)) +
  labs(x = "Word", y = "Occurances", title = "Most common words in Kanye's lyrics")

With the addition of a single anti_join we have removed the stop words and are left with words that carry more information. Somewhat interestingly, "love" comes out on top, a topic that isn't intuitively top-of-mind when thinking about Kanye's work. Let's look at the occurrence of "love" across his albums.

album_words <- lyrics %>%
  unnest_tokens(word, Lyrics) %>%
  count(Album, word, sort = TRUE) %>%
  ungroup()
  
album_totals <- album_words %>%
  group_by(Album) %>%
  summarize(total = sum(n))

album_words <- album_words %>%
  inner_join(album_totals) %>%
  mutate(freq = n / total)

album_words %>%
  anti_join(stop_words) %>%
  filter(word == "love") %>%
  ggplot(aes(factor(Album, levels = rev(levels(Album))), freq, fill = Album)) +
  geom_col(show.legend = FALSE) +
  scale_y_continuous(label = percent) +
  coord_flip() +
  labs(x = "Album", y = "Frequency of love", title = "Frequency of usage of the word love across Kanye's albums")

When we rank Kanye's albums by frequency of occurrence of the word "love", 808s & Heartbreak rises to the top. If you've read Kanye's Wikipedia page you'll know that he made 808s soon after his mother passed away and his then-fiancée ended their engagement. We also see that 808s marks a key point in his career: up until then he rarely mentioned love, but afterwards there is a roughly four-fold increase in its frequency of use. 808s is a somewhat controversial album in Kanye's discography; many fans label it his worst, and for some it marks a critical transition point in his career. In this plot that transition is pretty apparent. There also appears to be a relationship between the frequency of his use of "love" and career tenure. We can explore this relationship by plotting the frequency of "love" as a function of the year of album release.

year_totals <- lyrics %>%
  unnest_tokens(word, Lyrics) %>%
  anti_join(stop_words) %>%
  group_by(Year) %>%
  count(word) %>%
  summarise(total = sum(n))

lyrics %>%
  unnest_tokens(word, Lyrics) %>%
  anti_join(stop_words) %>%
  filter(word == "love") %>%
  count(word, Year) %>%
  inner_join(year_totals) %>%
  mutate(freq = n/total) %>%
  ggplot(aes(Year, freq)) +
  geom_point(show.legend = FALSE) +
  geom_smooth(method = "lm", se = TRUE)

The relationship becomes even more obvious when we remove 808s & Heartbreak as an outlier.

lyrics %>%
  unnest_tokens(word, Lyrics) %>%
  anti_join(stop_words) %>%
  filter(word == "love" & Year != 2008) %>%
  count(word, Year) %>%
  inner_join(year_totals) %>%
  mutate(freq = n/total) %>%
  ggplot(aes(Year, freq)) +
  geom_point(show.legend = FALSE) +
  geom_smooth(method = "lm", se = TRUE)

Next, let's look at the frequent words for each album in Kanye's discography.
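One way to do this, sketched below, is to reuse the album_words data frame from above, drop the standard stop words, and plot each album's five most frequent remaining words (the cut-off of five is an arbitrary choice for illustration).

album_words %>%
  anti_join(stop_words) %>%
  group_by(Album) %>%
  # five most frequent non-stop words per album
  top_n(5, n) %>%
  ungroup() %>%
  ggplot(aes(reorder(word, n), n, fill = Album)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ Album, scales = "free_y", ncol = 3) +
  labs(x = "Word", y = "Occurrences", title = "Most frequent words on each of Kanye's albums")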

One thing that quickly becomes apparent when looking at frequent words per album is that many of them are hip-hop-specific stop words. Words like "gonna" or "y'all" are slang contractions of two words that would otherwise be considered stop words. In a similar vein, many of Kanye's signature ad-libs show up as frequent terms (e.g., uh, la, ya). One way to tackle this problem would be to create a custom, domain-specific list of stop words, but that would be very labor intensive. Alternatively, if we assume that many of these hip-hop stop words occur commonly across Kanye's albums, we can take a different approach and examine words that appear frequently on a given album but not across multiple albums. Term frequency-inverse document frequency (tf-idf) is a statistic that does just that, and tidytext makes it easy to calculate using bind_tf_idf(). While using tf-idf this way does not accomplish exactly the same thing as a custom stop-word list, it does give a picture of the uniquely frequent words on each album.

album_words %>%
  bind_tf_idf(word, Album, n) %>%
  group_by(Album) %>%
  top_n(5, tf_idf) %>%
  ggplot(aes(reorder(word, tf_idf), tf_idf, fill = Album)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ Album, scales = "free_y", ncol = 3) +
  labs(x = "Word", y = "tf-idf", title = "Uniquely frequent terms among Kanye's albums")