Data Critique – Gender in Popular Music

Our Dataset: “Evolution of Popular Music: USA 1960-2010”

We base our current claims on a study of the “Evolution of Popular Music: USA 1960–2010” dataset, collected by Matthias Mauch, Robert M. MacCallum, Mark Levy and Armand M. Leroi, with a focus on gendered words used in songs throughout the time period.

How the dataset was generated

This data was generated from the US Billboard Hot 100 list between the years of 1960 and 2010 using text-mining techniques and audio file analysis. The US Billboard Hot 100 is currently the industry standard for measuring song popularity, using a point system based on sales and playtime. According to the report, 86% of the songs listed in the Hot 100 list were used in the study. Among the 14% missing from this data, there were more frequent gaps in earlier years as some songs were either missing or not yet digitized.

All 17,094 songs used were available online as digital audio files. The authors of this dataset used text-mining techniques to gather surface-level metadata for each song: artist name, unique audio ID, track name, date of entry, decade, and musical era.
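To get a feel for the size of the gap, the two figures quoted so far (17,094 songs and 86% coverage) can be combined in a quick back-of-the-envelope calculation. This is only a rough sketch: the 86% figure is rounded, so the results are estimates rather than counts reported by the authors, and the variable names are our own.

```python
# Figures quoted in the text: songs in the dataset and share of the Hot 100 covered.
songs_in_dataset = 17_094
coverage = 0.86  # rounded percentage, so everything derived from it is approximate

# Implied number of Hot 100 entries between 1960 and 2010 (an estimate).
estimated_total = round(songs_in_dataset / coverage)

# Implied number of charting songs that are absent from the dataset.
estimated_missing = estimated_total - songs_in_dataset

print(f"Estimated Hot 100 entries: {estimated_total}")
print(f"Estimated missing songs:   {estimated_missing}")
```

Under these assumptions, the list would hold a little under 20,000 entries, with somewhat under 3,000 songs unaccounted for.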

More specifically, the researchers were interested in the sound qualities of these popular songs. To assign numerical values to musical traits, they started with 30-second audio segments of each song. From there, they used audio analysis software to measure tonal content (certain arrangements of pitches and chords) and timbre (the attributes that make one sound differ from another despite identical pitch and volume). These measurements were organized according to chord changes and timbre clusters and assigned semantic labels. They translated the labels into 16 topics using latent Dirichlet allocation (LDA), in which the data is measured relative to other labels. At the end of this process, they were able to numerically represent each song as 8 H-topics (harmonic topics) that measured chord changes and 8 T-topics (timbral topics) that captured aspects like female voice or drums (Mauch et al.).
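One way to picture a single row of the resulting spreadsheet is as a record holding the metadata fields plus sixteen topic values. The sketch below is illustrative only: the attribute names, the example values, and the helper method are hypothetical and are not taken from the authors' files.

```python
from dataclasses import dataclass
from typing import List, Tuple

N_TOPICS = 8  # the text describes 8 H-topics and 8 T-topics per song


@dataclass
class SongRecord:
    # Metadata fields named in the text; the attribute names are hypothetical.
    artist_name: str
    audio_id: str
    track_name: str
    date_of_entry: str
    decade: int
    era: str
    h_topics: List[float]  # harmonic topics (chord changes)
    t_topics: List[float]  # timbral topics (e.g. female voice, drums)

    def __post_init__(self) -> None:
        # Reject records that do not carry exactly 8 values of each kind.
        if len(self.h_topics) != N_TOPICS or len(self.t_topics) != N_TOPICS:
            raise ValueError("expected 8 H-topic and 8 T-topic values")

    def dominant_topics(self) -> Tuple[int, int]:
        """Return the 0-based positions of the strongest H-topic and T-topic."""
        h = max(range(N_TOPICS), key=lambda i: self.h_topics[i])
        t = max(range(N_TOPICS), key=lambda i: self.t_topics[i])
        return h, t


# Made-up values, for illustration only.
example = SongRecord(
    artist_name="Example Artist",
    audio_id="track-0001",
    track_name="Example Song",
    date_of_entry="1985-06-01",
    decade=1980,
    era="1980s",
    h_topics=[0.05, 0.10, 0.40, 0.05, 0.10, 0.10, 0.10, 0.10],
    t_topics=[0.30, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10],
)
print(example.dominant_topics())
```

Note that nothing in such a record describes the artist or the lyrics; every value beyond the basic metadata is a measurement of sound.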

What was the original source?

The dataset stems from a desire to account for the organic and cultural evolution of pop music. Philosophers, sociologists, journalists, and other social scientists have long debated the history of music and the social reasons why certain genres emerge as superior. The scientific, data-driven approach taken here was made possible by the emergence of large collections of audio recordings, musical scores, and lyrics. The data consists of 30-second segments from 17,094 songs, covering approximately 86% of the Hot 100 list. The focus of this data is on popular music during the 1960–2010 period, so a fully representative sample of all music was not necessary; instead, only songs that were commercially successful at the time were used.

What are the silences?

Because of the scope of the data set, some information is left out of the spreadsheet. First, the data on the spreadsheet is centered on the music and its quality, rather than on the artists. We do not have any information about their age, gender or ethnicity, for example, so we cannot see who the artists are or how this influences their style of music. It is also important to notice that this data set is focused on popular music, so it only includes songs that reached the Hot 100. It therefore does not represent all the music that was created between 1960 and 2010, and it may leave out specific genres that were not popular enough to make the list. Finally, the spreadsheet is focused only on the style, tone and timbre of the music and not on the words of the songs, which could give us different insights into the ideas that people were interested in during this period.

If the data set were the only source, a few pieces of information that users may find relevant would be left out. The dataset only includes Hot 100 hits from the years 1960-2010. Those more interested in modern music may be disappointed to find that the top hits from the past 8 years are missing. Additionally, 14% of the Hot 100 list from the years covered is missing, creating gaps in the data. Digging further into the metadata of the songs, information such as the writer of the song, the producer of the song, copyright information, and genre is missing. These are categories that users of the data set may wish to see.

In addition, the data set would have been stronger if it had included commercial and critical measures of success: whether or not the record went platinum, how many records were sold, and whether the artist received any prestigious awards for the work. Access to this type of data might have opened doors for future comparison studies in the evolution of pop music.