Four Years of Spotify Data I Forgot I Had
For nearly four years, a script has been logging every song I listen to on Spotify into Google Sheets. I forgot about it. Then I needed it.
Every music streaming service has the same fundamental problem. You don't own anything. Songs get pulled, artists disappear, niche tracks from years ago vanish without warning. If you were deep into underground music at any point, you know exactly what I'm talking about. Half the songs that shaped my taste back then aren't on Spotify anymore.
I've been on a crusade recently to preserve my favorite music. Downloading everything I care about in lossless FLAC, building out my local library on my desktop through my own custom build of Dopamine (forked from digimezzo's original) and on my phone through Sonare. The streaming services can do whatever they want. My files are my files.
The problem was figuring out what to download. Back in 2021 through 2023, I almost exclusively listened to very underground, niche music. I didn't use playlists. Spotify's algorithm just knew me well enough to constantly surface songs I liked, and I'd listen to whatever it threw at me. I doubt the algorithm is that good anymore, but it's the only thing Spotify is still good for. So when I decided to go back and grab all those old tracks, I had no clean list to work from. I was slogging through similar-song radios trying to hit each song individually, thinking to myself: damn, I really wish I'd saved every song I've ever listened to onto a sheet or something.
Then I remembered. I literally did that.
The Sheets
Nearly four years ago, I set up a script that logs every single song I listen to on Spotify into a Google Sheet. The reason was simple: a friend of mine is a big fan of using Google Sheets for data, and he wanted to see what I was listening to. I set it up, he checked it out, and I completely forgot about the whole thing for years.
The script has been running this entire time. Every time a sheet hits 2,000 rows, it creates a new one. We're on sheet 70 now. That's roughly 140,000 rows of listening data spanning from late 2022 to today.
I found the sheets a couple days ago and immediately realized I'd been doing all this individual sorting work for nothing. The data was already there. Every song, every timestamp, every track ID. All I had to do was parse it.
Parsing the Data
I downloaded all 70 sheets as CSVs. Each row has five columns with no header. Timestamp, track name, artist, Spotify track ID, and URL. I wrote a Python script to parse and deduplicate everything.
The parser loads any existing JSON output, iterates over every CSV, and deduplicates by Spotify track ID so only the first occurrence of each song is kept:
def main():
    tracks, seen_ids = load_existing_tracks()
    csv_files = sorted(CSV_DIR.glob("*.csv"))
    listen_counts = count_listens_per_track(csv_files)

    for path in csv_files:
        for track in parse_csv(path):
            tid = track["track_id"]
            if tid not in seen_ids:
                seen_ids.add(tid)
                tracks.append(track)

    for track in tracks:
        track["listen_count"] = listen_counts.get(track["track_id"], 0)
After deduplication: 5,887 unique tracks from 70 CSVs. Each entry gets a listen_count tallied from every appearance across all sheets. Some songs I've listened to over a thousand times.
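The helper functions aren't shown above; here's a minimal sketch of what `parse_csv` and `count_listens_per_track` might look like, assuming the five-column, headerless layout described earlier (timestamp, track name, artist, track ID, URL):

```python
import csv
from collections import Counter
from pathlib import Path


def parse_csv(path: Path):
    """Yield one dict per row of a headerless five-column sheet export."""
    with path.open(newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 5:
                continue  # skip malformed or empty rows
            timestamp, track_name, artist, track_id, url = row[:5]
            yield {
                "timestamp": timestamp,
                "track_name": track_name,
                "artist": artist,
                "track_id": track_id,
                "url": url,
            }


def count_listens_per_track(csv_files):
    """Tally every appearance of each track ID across all sheets."""
    counts = Counter()
    for path in csv_files:
        for track in parse_csv(path):
            counts[track["track_id"]] += 1
    return dict(counts)
```

Counting happens over every row, before deduplication, which is how a track that survives as a single JSON entry can still carry a listen count in the hundreds.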
The Utility Script
Having the JSON is nice, but I needed to actually use it. I have over 2,000 audio files sitting in my music directory already, and I needed to know which of those 5,887 tracks I still hadn't downloaded. So I wrote a utility script that cross-references the JSON against my local library.
The matching logic was the tricky part. Local files follow an Artist - Track Name.ext naming convention, but track names between Spotify and my files don't always match perfectly. Featuring tags get formatted differently, capitalization varies, special characters get dropped. The script normalizes everything, strips accents, collapses whitespace, and removes (feat. ...) tags before comparing. It matches by track name first and only uses the artist to disambiguate when multiple local files share the same track name:
def is_downloaded(track, track_to_artists, all_track_names):
    na = normalize(track["artist"])
    nt = normalize(track["track_name"])
    nt_stripped = normalize(strip_feat(track["track_name"]))

    for name in (nt, nt_stripped):
        # Exact name match; use the artist only to disambiguate duplicates
        if name in all_track_names:
            artists_with_track = track_to_artists.get(name, set())
            if len(artists_with_track) <= 1 or na in artists_with_track:
                return True
        # Substring matching for partial name differences
        for local_name, artists in track_to_artists.items():
            if name in local_name or local_name in name:
                if len(artists) <= 1 or na in artists:
                    return True
    return False
The CLI has three commands: missing, downloaded, and stats, with filters for date ranges, listen counts, artist name, and sorting. Some examples:
python spotify_utils.py missing --before 2023 --min-listens 10
python spotify_utils.py downloaded --all --limit 20
python spotify_utils.py stats
The --all flag shows every track with + for downloaded and - for missing, so you can see your full catalog status at a glance.
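The command layout can be sketched with argparse; the subcommand names and the flags shown in the examples come from the post, but any defaults and help text here are assumptions:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="spotify_utils.py")
    sub = parser.add_subparsers(dest="command", required=True)

    for name in ("missing", "downloaded"):
        cmd = sub.add_parser(name)
        cmd.add_argument("--before", type=int,
                         help="only tracks first logged before this year")
        cmd.add_argument("--min-listens", type=int, default=0)
        cmd.add_argument("--artist", help="filter by artist name")
        cmd.add_argument("--all", action="store_true",
                         help="show +/- status for every track")
        cmd.add_argument("--limit", type=int)

    sub.add_parser("stats")
    return parser
```

Sharing the filter flags between `missing` and `downloaded` keeps the two commands symmetric: they run the same query and differ only in which side of the match they print.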
The Numbers
==================================================
JSON tracks (after filters): 5887
Downloaded: 1974 (33.5%)
Missing: 3913 (66.5%)
Local audio files: 2287
==================================================
Missing tracks avg listens: 11.2
Downloaded tracks avg listens: 48.6
Unique artists (filtered): 1566
Artists fully downloaded: 215
1,974 out of 5,887 tracks downloaded. About a third of every unique song I've listened to on Spotify in the last four years, sitting on my hard drive.
The Download Process
Once I had my missing list, I started working through it. Monochrome and tidal-ui handled the bulk of the lossless downloads from TIDAL. For songs that had been pulled from streaming entirely, I had to get creative. Some I found on SoundCloud or YouTube. Some I tracked down on random third-party sites. I tried Soulseek through Nicotine+ for the really obscure stuff, with mixed results. And some songs I just couldn't find anywhere.
It's insane to me that in 2026, media can still be lost. A song exists, people listen to it, and then one day it's just gone. No archive, no fallback, no way to get it back unless someone happened to save a copy. That's the whole reason I started building a local library in the first place.
The Accidental Archive
The funniest part of all this is that the most valuable tool in the entire process was something I set up on a whim four years ago for a completely unrelated reason. I wasn't thinking about data preservation or music archiving. I just wanted to let a friend see what I was listening to. The script ran silently in the background for years while I forgot it existed, quietly building the exact dataset I'd end up needing.