Introduction
At Tarka Labs, we’ve had the privilege of working closely with a diverse range of music clients, from fast-scaling distribution platforms to emerging startups. Music distribution involves releasing songs and albums across major DSPs (Digital Service Providers) globally. Once the music goes live, users stream it across different regions, and artists and rights holders are eager to see streaming metrics like listener demographics, location, and engagement trends.
What Really Happens Behind the Scenes in Royalty Payouts
Let’s look at what goes on under the hood in royalty payouts. The goal is simple: make sure the royalties that DSPs pay reach users in a timely manner. But once you get into the details, it gets a lot more complicated.
An artist might want to split their royalties with collaborators. They might have taken an advance from the distributor that needs to be recovered first. Earnings could have tax components to be deducted. There could be charges for promotions like Spotify Discovery Mode or manual adjustments (both positive and negative) that impact what the user should finally get.
And that’s not all. DSPs usually send massive CSV files containing data like UPCs and ISRCs, and these need to be correctly mapped back to the right users. On top of that, DSPs, users, and distributor bank accounts may each operate in a different currency, which brings exchange rates into play.
We’ll walk through step by step how Tarka Labs approached these challenges, and how we built systems that helped pay out millions of dollars to users quickly, accurately, and at scale.
Figuring Out Who Gets Paid — The Hidden Challenge
The first step to solving a scaling problem is understanding it deeply. There’s no point in rushing to implement solutions without first identifying the real bottleneck. In our case, after carefully analysing the process, it was clear that database lookups and Ruby’s single-threaded performance were the major culprits.
Since the process could be easily parallelised, Golang was the obvious choice to rework the pipeline. Its native support for concurrency made it a perfect fit. However, even with Go, the database still posed a problem. Running millions of individual queries, or even trying large IN queries, wasn’t efficient enough. Loading the entire database into memory wasn’t practical either; the memory requirements were simply too high.
The next logical step? Caching.
Why not Redis?
Naturally, Redis came to mind. We tried it, loading all the songs and albums into a Redis cache, refreshed weekly via a cron job. But Redis quickly showed its limitations for this use case:
- The cache size ballooned beyond 10 GB.
- Fetching from Redis still involved a network trip, adding unnecessary latency.
- Plus, using AWS ElastiCache at this scale added significant costs.
Clearly, Redis wasn’t the right tool for this particular problem.
PebbleDB as a Cache
We needed something lightweight, fast, and local. A solution that could handle millions of entries without hogging memory or network bandwidth. That’s when we discovered Pebble, an embedded key-value store inspired by LevelDB and RocksDB, designed for high performance.
Pebble turned out to be the perfect fit:
- It compressed the data extremely well, bringing 10+ GB of data down to less than 1.5 GB.
- Reads and writes could be easily parallelised.
- Being embedded meant there were no network calls; everything happened locally, and lightning fast.
By combining PebbleDB as our lightweight datastore with Go’s powerful concurrency model, we were able to process records in parallel at scale, dramatically speeding up the mapping process.
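To make this concrete, here’s a minimal sketch of what using Pebble as a local cache looks like in Go. The ISRC and song ID values are made up for illustration, and the real pipeline stores richer, serialised records:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Open (or create) a local Pebble store on disk; lookups never leave the machine.
	db, err := pebble.Open("isrc-cache", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical example: cache an ISRC -> song ID mapping.
	if err := db.Set([]byte("USRC17607839"), []byte("song:42"), pebble.NoSync); err != nil {
		log.Fatal(err)
	}

	// Look the ISRC up while processing a DSP report.
	value, closer, err := db.Get([]byte("USRC17607839"))
	if err == pebble.ErrNotFound {
		fmt.Println("ISRC not found in cache")
		return
	}
	if err != nil {
		log.Fatal(err)
	}
	defer closer.Close()
	fmt.Printf("matched song ID: %s\n", value)
}
```

Since the cache can always be rebuilt from the primary database, skipping fsync on writes (pebble.NoSync) is a reasonable trade-off for extra write throughput.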
Handling Duplicate ISRCs: Overcoming Key-Value Store Limitations
Even after optimising the lookup process with PebbleDB, one tricky problem remained: Duplicate ISRCs.
Both Redis and PebbleDB are simple key-value stores. Our design used UPC and ISRC as keys to retrieve the associated song or album ID. But key-value stores typically allow only a single value per key. PebbleDB, in particular, is optimised for storing raw byte arrays for maximum compression. It doesn’t natively support storing arrays or lists.
To handle cases where the same ISRC pointed to multiple songs (which happens more often than you might think), we customised our insert logic for PebbleDB.
Here’s how we solved it with a read-modify-write pattern (sketched in code after the list):
- Before writing a new record, we first performed a GET for the given ISRC.
- If a record already existed, we appended the new song details to the existing entry.
- We then saved the combined result back as a serialised array.
This manual management allowed us to efficiently handle multiple songs mapped to the same ISRC, while still benefiting from PebbleDB’s fast reads, compression, and parallel access.
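Below is a minimal sketch of that insert path, assuming JSON as the serialisation format for readability (the actual encoding is an implementation detail) and a hypothetical helper name:

```go
package cache

import (
	"encoding/json"

	"github.com/cockroachdb/pebble"
)

// appendSongForISRC implements the read-modify-write pattern described above:
// fetch any existing entry for the ISRC, append the new song ID, and write the
// combined list back as a serialised array.
func appendSongForISRC(db *pebble.DB, isrc, songID string) error {
	var songIDs []string

	// Read: check whether this ISRC already has one or more songs attached.
	existing, closer, err := db.Get([]byte(isrc))
	switch {
	case err == nil:
		unmarshalErr := json.Unmarshal(existing, &songIDs)
		closer.Close()
		if unmarshalErr != nil {
			return unmarshalErr
		}
	case err != pebble.ErrNotFound:
		return err
	}

	// Modify: append the new song to whatever was already stored.
	songIDs = append(songIDs, songID)

	// Write: store the combined result back under the same ISRC key.
	combined, err := json.Marshal(songIDs)
	if err != nil {
		return err
	}
	return db.Set([]byte(isrc), combined, pebble.NoSync)
}
```

Being a read-modify-write, concurrent inserts for the same key would need coordination, for example a per-key lock or routing all writes for a given ISRC through a single goroutine.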
Anyone who has worked in the music distribution space would immediately recognise that duplicate ISRCs are a major headache. Ideally, an ISRC (International Standard Recording Code) should uniquely identify a recording, but in practice, that’s often not the case.
A few common scenarios where duplicate ISRCs create chaos:
- When a single becomes a hit, artists or labels sometimes re-release the same recording under multiple albums or compilations, but reuse the original ISRC.
- Independent artists or smaller labels may mistakenly assign the same ISRC across different versions or releases, due to poor catalog management.
- Metadata errors during ingestion at DSPs can accidentally duplicate ISRCs across unrelated tracks.
All of these scenarios create massive challenges when trying to match royalty payments accurately, especially at scale. If duplicates aren’t handled correctly, users may be underpaid, overpaid, or assigned the wrong payouts, resulting in serious trust and operational problems.
Our PebbleDB customisation ensured that even in these messy real-world cases, payouts could still be processed accurately and efficiently.
Scaling the Processing: From Hours to Minutes
Since PebbleDB allowed efficient concurrent reads and writes, we further optimised the processing by breaking it down into smaller, manageable batches.
For example, platforms like YouTube were sending us over 250 million records every month. To handle this massive load efficiently, we divided the work into batches of 50,000 records each and processed them in parallel. We ran this entire setup on a modest 4-core, 16 GB RAM machine.
The approach looked like this:
- Load a batch of 50,000 records.
- Parallelise the lookup and matching using goroutines.
- For each batch, collect the matched results and append them to a CSV file.
Why CSV? Because the downstream systems that picked up this data for payout calculations expected a single consolidated CSV input.
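The sketch below shows the shape of that batch loop, with hypothetical record and match types and a lookupSong stub standing in for the real Pebble lookup:

```go
package main

import (
	"encoding/csv"
	"log"
	"os"
	"sync"
)

// record and match are simplified stand-ins for a DSP report row and its
// resolved song mapping; the real pipeline carries many more fields.
type record struct{ ISRC, Units string }
type match struct{ ISRC, SongID, Units string }

// lookupSong stands in for the PebbleDB lookup described earlier.
func lookupSong(r record) (match, bool) {
	return match{ISRC: r.ISRC, SongID: "song:42", Units: r.Units}, true
}

func processBatch(batch []record, w *csv.Writer, workers int) error {
	jobs := make(chan record)
	results := make(chan match)

	// Fan out: a fixed pool of goroutines performs the lookups in parallel.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range jobs {
				if m, ok := lookupSong(r); ok {
					results <- m
				}
			}
		}()
	}

	// Feed the batch to the workers, then close the results channel once done.
	go func() {
		for _, r := range batch {
			jobs <- r
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	// Fan in: a single writer appends matched rows to the consolidated CSV.
	for m := range results {
		if err := w.Write([]string{m.ISRC, m.SongID, m.Units}); err != nil {
			return err
		}
	}
	w.Flush()
	return w.Error()
}

func main() {
	out, err := os.Create("matched.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	w := csv.NewWriter(out)
	batch := []record{{ISRC: "USRC17607839", Units: "1200"}}
	if err := processBatch(batch, w, 8); err != nil {
		log.Fatal(err)
	}
}
```

Keeping a single CSV writer per file avoids interleaved rows, while the goroutine pool keeps all the cores busy on the lookups themselves.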
The results were dramatic:
- Processing 250 million records now took just around 50 minutes.
- Earlier, the same processing would take 8 to 10 hours — and would get even worse as the dataset grew month over month.
- This new scalable approach not only made the system significantly faster but also made it much more predictable and resilient to future growth.
That’s all for this article. In the next part, we’ll dive into how the downstream system handled reading through the massive CSV file generated by this processing engine, and how we optimised that step for even better performance and reliability.
If you’re working on similar large-scale music distribution, royalty processing, or high-volume data handling challenges, feel free to reach out. We’d love to connect and share ideas.