Data Mining/Visualization for Fun, but not Profit

Just published on LinkedIn


I thought I’d share a personal music-related data mining project I like to do at the beginning of every year: figuring out what I listened to over the last 12 months.

Since 2007, I’ve been sending tracks I play to the Last.FM website through an automated process they call “scrobbling”. While it isn’t a complete set of data, it gives a pretty good picture of my listening habits, and at this point I’ve uploaded (or “scrobbled”) nearly 240,000 tracks over the last 12 years.

The first step was to download my tracks from Last.FM and import them into Excel. I used formulas to extract the year, date and time from the date/time field, and converted the UTC time to Central. This would allow me to filter by year, as well as by time of day; I don’t count tracks played from 11:30 PM to 6:30 AM, because it’s highly likely I’m not actually listening to those tracks. I created a column that calculated a simple “Yes” or “No” based on the time.
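If you want to try something similar, the formulas might look roughly like this (the column references are purely illustrative, and the fixed six-hour UTC offset ignores daylight saving time):

In C2 (Central time): =B2-TIME(6,0,0)
In D2 (year): =YEAR(C2)
In E2 (count it?): =IF(AND(MOD(C2,1)>=TIME(6,30,0),MOD(C2,1)<=TIME(23,30,0)),"Yes","No")

MOD(C2,1) strips the date and leaves just the time of day, so the IF only returns “Yes” for plays between 6:30 AM and 11:30 PM.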

Because tracks come from a variety of sources, including my personal library, Spotify, Tidal and even YouTube, there’s plenty of room for dirty data. Normalizing the artist names became the next step, and the first heavy lift.

Using data from a previous effort to jump-start the process, I created a table mapping all artist names to a normalized artist name to correct misspellings (Baetles to Beatles) and alternative names (Davis, Miles or Miles Davis Quintet to Miles Davis). Then I used VLOOKUP against this new table to create a “Normalized Artist” column. Once this was complete, I used a pivot table, filtering by year and the “Yes” time-of-day flag, and filtering out artists associated with my daughter’s listening. Copying the resulting data to a new tab, I could filter to the most-listened artists and do some additional data scrubbing. Finally, I used a formula to build a list I could drop into Wordle, the app that creates the word clouds I use to display my listening history.
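As a rough sketch (the cell and range names here are made up for the example), the lookup might look like:

=IFERROR(VLOOKUP(F2,ArtistMap,2,FALSE),F2)

where F2 holds the raw artist name and ArtistMap is the two-column mapping table; the IFERROR falls back to the raw name when an artist hasn’t been mapped yet. For the Wordle list, a simple concatenation of the name and its play count on the summary tab, something like =A2&":"&B2, works if you feed Wordle its weighted-word input.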

This year, I decided to create word clouds for my most-listened-to albums and tracks as well, which meant I had to normalize that data too. Focusing on just the top albums and tracks for 2018 helped keep the project moving so I wouldn’t get too bogged down. I created columns of concatenated Album/Artist and Track/Artist data, which I used to fill the normalization tables. As with the artists, I created lookup tables and used VLOOKUP to pull the normalized data, though I did move the normalized lists to a separate spreadsheet to keep the size of the main spreadsheet down.
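Again roughly, with the workbook, sheet and cell references standing in as placeholders:

Album/Artist key: =G2&" - "&F2
Normalized album: =IFERROR(VLOOKUP(H2,[Normalized.xlsx]Albums!$A:$B,2,FALSE),H2)

Here the album title sits in G2, the artist in F2, and the concatenated key in H2; the second formula points at a lookup table kept in a separate Normalized.xlsx workbook, which is what keeps the main spreadsheet lean.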


I did some manual scrubbing of the final lists to keep the process moving along, but I’ll eventually go back and do the work to create the Album and Track graphics for 2007–2017. Eventually I’ll also add genre information; that data already lives in JRiver Media Center, my media player, so it should be a reasonably easy task to complete using another table and VLOOKUP.
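That lookup would probably look much like the others; as a rough sketch (GenreMap is a hypothetical range built from a JRiver library export, with the artist in its first column and the genre in the second, and J2 holding the normalized artist name):

=IFERROR(VLOOKUP(J2,GenreMap,2,FALSE),"Unknown")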

This is a fun project, and I continue to learn new functions and tricks as I try to simplify the process and make it more efficient. If I can just figure out how to sort the pivot table by the data counts…

The data across the three graphics is as expected, though one interesting takeaway is that intense listening to a single album, Jackie Venson’s “The Love Suite,” for example, can leave that artist over-represented on the Tracks graphic. The Three Sounds had 602 total scrobbles across 27 albums, compared to Ms. Venson’s 188 from 3, yet she has 8 of my top tracks versus just 4 for The Three Sounds. And, yes, it’s true. I do love Pat Metheny!
