
I Recreated Shazam's Algorithm from Scratch because no one is hiring jnr devs

By Chigozirim

Summary

Topics Covered

  • Shazam fingerprints songs via peak hashes
  • Time coherence trumps match quantity
  • Hamming windows fix spectral leakage
  • Log bands mimic ear for peak extraction
  • Anchor-target pairs encode peak relations

Full Transcript

So I decided to recreate Shazam's song recognition algorithm out of curiosity. Okay, fine, it was mostly out of sheer desperation, because let's be real: landing a junior dev job these days feels harder than breaking into a bank. If you're a developer with less than three years of experience, you get what I mean, and if you don't, well, I genuinely envy you.

But Shazam has always blown my mind. You hold up your phone, it listens for a few seconds, and then boom, it tells you exactly what song is playing in the background. Magic, right? Well, tech magic. And I thought, if I can reverse engineer this sorcery, maybe hiring managers will finally see that I'm serious about my engineering career.

But how do you even start building something like this? I was as clueless as anyone at the beginning, so I did what any good engineer would do: I researched. Or rather, I Googled. I went down a rabbit hole reading every article, blog post, and research paper I could find about Shazam's technology, and here's the simplified version of what I learned.

The magic of Shazam lies in something called audio fingerprinting. It's like creating a unique DNA profile for every song, and this fingerprint lets Shazam recognize a song based on a small snippet of audio.

First, the song is converted into something called a spectrogram, which captures the frequency content of the audio over time, showing which frequencies are present at each moment and how intense they are. From the spectrogram, Shazam pinpoints the frequencies with high intensity, the ones representing the notes or beats that stand out in the song. These frequencies, known as peaks, are like the song's unique features, the things that make it recognizable.

Shazam encodes the relationship between these peaks into unique hashes and links them to the song's metadata, like its title and artist, and the exact time each peak occurred in the spectrogram. Thousands of these hashes come together to form the song's fingerprint, which is then stored in a database where each hash serves as a quick reference point for searches.

When you use Shazam, the app records a short audio snippet and runs it through the same fingerprinting process. This time, though, the hashes are not saved to the database. Instead, they're used to query the database for matches, and the results are grouped by song ID, isolating potential candidates.
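To make that structure concrete, here's a minimal sketch in Go (the language I ended up using) of what such a fingerprint index might look like: each hash maps to every song/time pair that produced it, and a query's hits get grouped by song ID to isolate candidates. The names and types here are illustrative, not taken from the actual implementation.

```go
package main

import "fmt"

// couple records one occurrence of a hash: which song produced it,
// and the time at which its anchor peak occurred in that song.
type couple struct {
	SongID     uint32
	AnchorTime uint32
}

// index maps a 32-bit fingerprint hash to every song/time pair that produced it.
type index map[uint32][]couple

// groupBySong takes the couples returned for a snippet's hashes and
// isolates candidate songs by collecting the anchor times per song ID.
func groupBySong(hits []couple) map[uint32][]uint32 {
	bySong := make(map[uint32][]uint32)
	for _, c := range hits {
		bySong[c.SongID] = append(bySong[c.SongID], c.AnchorTime)
	}
	return bySong
}

func main() {
	idx := index{
		0xDEADBEEF: {
			{SongID: 1, AnchorTime: 120},
			{SongID: 2, AnchorTime: 98},
		},
	}
	candidates := groupBySong(idx[0xDEADBEEF])
	fmt.Println(len(candidates)) // two candidate songs share this hash
}
```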

At first glance, it might seem logical to assume the song with the most matches is the correct one, but that's not quite the case. Instead, Shazam analyzes the time coherence of the candidates, evaluating how well the snippet's timestamps align with those of each candidate. The song with the highest time coherence is ultimately identified as the correct match. And voila, that's basically how Shazam works. Pretty straightforward, right?

Once I wrapped my head around the theory, it was time to get coding. I used Golang to code the algorithm from scratch with no libraries, because why take the easy route when you can suffer for the sake of learning?

But for some of the heavy lifting, like handling audio files, I had to call in reinforcements. If you'd like to check out my implementation, I've added the GitHub link to the project in the description below, along with a link to a demo where you can see the algorithm in action.

The first challenge was transforming raw audio into a recognizable fingerprint. To tackle this, I wrote three key functions that the raw audio data has to go through sequentially to generate the fingerprint.

The first function takes the raw audio data as input and converts it to a spectrogram. This meant transforming the audio from the time domain into the frequency domain so I could visualize its frequency components over time. To achieve this, I split the audio into overlapping segments using the sliding window technique, so I could analyze and process smaller parts without missing any key features. But this slicing introduces artificial jumps at the edges of each segment, a problem known as spectral leakage. It's like trying to cut bread with a dull knife: it gets messy at the edges. To smooth out these jumps, I apply the Hamming window function, which tapers the signal off at the edges, creating a smooth transition and reducing distortions. Finally, I apply the fast Fourier transform (FFT) to each segment. The FFT is the essential tool that transforms the audio from the time domain into the frequency domain, unlocking its frequency components. This process produces the spectrogram: a two-dimensional matrix of complex numbers where rows represent frequencies, columns represent time intervals, and the values indicate intensity.

But here's the thing: spectrograms can be massive, especially for long audio samples. To keep things efficient, right before I split the audio and run the FFT, I downsampled the audio and limited the frequency range to the part where most musical information lies, around 20 Hz to 5 kHz. This reduced processing overhead and improved the signal-to-noise ratio.

All of this made it easier to extract meaningful features later.

The next step was to extract the standout frequencies, or peaks, that make the audio uniquely identifiable from the spectrogram. This is where the second function in the process comes into play. To extract these peaks effectively, I applied a filter of six logarithmic frequency bands. These bands are designed to mimic the human ear's sensitivity to sound: human ears struggle to perceive low frequencies (less than 500 Hz) but are naturally more attuned to mid-to-high frequencies (500 to 2,000 Hz), and as a result, low frequencies are often amplified in music during production. So, for each segment of the spectrogram, I identified the loudest frequencies within each band. This approach ensures a balanced representation across the spectrum and prevents low-frequency content from dominating.

Next, I calculated the average magnitude of these six strongest frequencies and used it as a dynamic threshold. Frequencies below this threshold were discarded, ensuring that weak frequencies from any band were not retained simply because they were the strongest within their band. Doing this not only preserves the essential peaks but also reduces data size, optimizing the process for efficient matching down the line.
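A simplified sketch of this band-and-threshold idea for a single spectrogram frame; the band edges below are made-up illustrative values, not the ones from my repo:

```go
package main

import "fmt"

// peak marks a standout frequency bin at a given spectrogram frame.
type peak struct {
	Time int // spectrogram column (frame index)
	Freq int // spectrogram row (frequency bin)
}

// Logarithmic band edges over the frequency bins of one frame. With a
// 512-bin frame these roughly mimic the ear's sensitivity: narrow bands
// for the lows, wide bands for the highs. The exact edges are illustrative.
var bandEdges = []int{0, 10, 20, 40, 80, 160, 512}

// extractPeaks picks, per frame, the loudest bin in each of the six
// bands, then keeps only those whose magnitude beats the average of the
// six: a dynamic threshold that stops a weak band from contributing a
// "peak" merely because it was the loudest in its band.
func extractPeaks(frame []float64, t int) []peak {
	type cand struct {
		bin int
		mag float64
	}
	cands := make([]cand, 0, 6)
	sum := 0.0
	for b := 0; b+1 < len(bandEdges); b++ {
		lo, hi := bandEdges[b], bandEdges[b+1]
		if hi > len(frame) {
			hi = len(frame)
		}
		best := cand{bin: lo, mag: frame[lo]}
		for bin := lo; bin < hi; bin++ {
			if frame[bin] > best.mag {
				best = cand{bin: bin, mag: frame[bin]}
			}
		}
		cands = append(cands, best)
		sum += best.mag
	}
	avg := sum / float64(len(cands))
	peaks := []peak{}
	for _, c := range cands {
		if c.mag >= avg {
			peaks = append(peaks, peak{Time: t, Freq: c.bin})
		}
	}
	return peaks
}

func main() {
	frame := make([]float64, 512)
	frame[15] = 9.0  // strong peak in the second band
	frame[100] = 8.0 // strong peak in the fifth band
	// Only the two strong bins survive the dynamic threshold.
	fmt.Println(extractPeaks(frame, 0))
}
```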

With the peaks extracted from the spectrogram, we now have a collection of data points highlighting the audio's most distinctive features. But these peaks on their own aren't enough for song identification. To transform them into something meaningful and searchable, we need to encode the relationships between them in a compact, queryable format. This brings us to the third function: the fingerprinting process.

Here, each peak, called the anchor, is treated as a reference point. For every anchor, I identify five nearby peaks, or targets, within a fixed range, forming what's known as the target zone. This relationship forms the basis of the fingerprint. For each anchor-target pair, I create a unique hash by encoding the frequencies of the anchor and target peaks, along with their time difference, into a compact 32-bit integer. This hash is small enough to store efficiently but still packs all the information needed to uniquely identify the audio.

This data is temporarily stored in a hashmap where the hash serves as the key and the value is an array containing the anchor's timestamp and the song ID. Since different songs may share similar features, the same hash can be generated for multiple songs; when this happens, their details are appended to the array under that hash. Thousands of these hashes are generated to form a complete fingerprint, which is then saved to the database.

Once the fingerprinting process was up and running, the next step was to upload songs and find matches. That's when I shifted gears and started building the front end. Now, I could have used plain old vanilla JS, but after wrestling with the fingerprinting process for so long, I decided I deserved a break, so I went with React instead. For communication between the front end and the server, I used WebSockets.

Since the main goal of this project was to implement and explore the algorithm rather than build a commercial app, I kept things simple. Instead of auto-scanning the world for songs, I had users upload the songs they wanted to match. To make this process seamless, I integrated Spotify, allowing users to paste a link to a song, album, or playlist. Since Spotify doesn't allow direct downloads, I use their API to fetch relevant data about the song, such as the title, artist, and album name. I then search for the song on YouTube using this data, and if a match is found, I download the audio from the YouTube video. Afterward, I assign a unique ID to the song, and all relevant data about the song, including the YouTube video ID, is saved to a song database. Next, the song's audio data goes through the fingerprinting process, and the resulting fingerprints are stored in the fingerprint database.

Once the song uploading process was working okay, I was ready to work on the real challenge: matching songs from short audio snippets.

This is where the magic happens. The process kicks off when a user records a short audio clip, either from their microphone or directly from their computer. The snippet is encoded in Base64 format and sent to the server for processing. On the server side, the raw audio is decoded and processed to generate its fingerprint. The hashes in the fingerprint are extracted and used to query the fingerprint database for matches, and I then organize the retrieved results, mapping each song ID to its associated anchor times.

To find the best match, I evaluate how well the audio snippet aligns with the candidate songs. For each candidate, I calculate the absolute time difference between successive anchor pairs in both the snippet and the candidate. Next, I calculate the absolute difference between these two values. If this difference falls within a tolerance of 100 milliseconds, the pair is marked as consistent, and the candidate's time coherence score is incremented by one.

After calculating the scores, I fetch metadata for the top-scoring candidates from the song database. For each match, I then create an object containing the song's metadata, the earliest anchor time, and the coherence score. Finally, I sort the matches in descending order by their scores and send them to the front end. There, I use YouTube's embedded iframe player to display the corresponding video for each match, starting at the exact timestamp where the match was found.
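The scoring step described above can be sketched like this; the function name and the millisecond timestamps are illustrative, but the 100 ms tolerance matches what I described:

```go
package main

import "fmt"

// coherenceScore compares the gaps between successive anchor times in
// the snippet against the corresponding gaps in a candidate song. If
// the two gaps differ by no more than the tolerance, that pair counts
// as consistent and the score goes up. Times are in milliseconds.
func coherenceScore(snippetTimes, candidateTimes []float64, tolerance float64) int {
	n := len(snippetTimes)
	if len(candidateTimes) < n {
		n = len(candidateTimes)
	}
	score := 0
	for i := 1; i < n; i++ {
		snippetGap := abs(snippetTimes[i] - snippetTimes[i-1])
		candidateGap := abs(candidateTimes[i] - candidateTimes[i-1])
		if abs(snippetGap-candidateGap) <= tolerance {
			score++
		}
	}
	return score
}

func abs(x float64) float64 {
	if x < 0 {
		return -x
	}
	return x
}

func main() {
	snippet := []float64{0, 500, 1000, 1500}
	good := []float64{30000, 30510, 31010, 31490} // gaps line up within 100 ms
	bad := []float64{1000, 4000, 4100, 9000}      // gaps don't line up at all
	fmt.Println(coherenceScore(snippet, good, 100), coherenceScore(snippet, bad, 100)) // prints 3 0
}
```

Note that the candidate with many raw hash matches but scattered gaps scores zero here, while the time-coherent one wins, which is exactly why coherence beats match count.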

This has definitely been the most challenging project I've worked on so far. Whether it helps me land a job or not, I'm glad I took on the challenge. Unfortunately, I can't do a proper demo here because copyright laws exist, but I've put together a demo video that showcases everything in action. You can find the link in the description, along with the GitHub repo, if you'd like to check out the code.
