Improving Acoustid accuracy

Acoustid is a great audio fingerprinting service: open, gratis for non-commercial users, and backed by an enormous database of audio fingerprints provided by a large userbase. This makes it easy for app developers to get up and running by coding against the Acoustid web service, and the result is a useful means of identifying music.

Inaccurate data does exist though, so read on to see how to use the Acoustid API to filter out errant entries....

Why use Acoustid in the first place? Back in the dark days of musical metadata, you could find information about musical releases by a limited number of means. You could query using the titles of releases, the artist names, maybe the CD TOC if you happened to have that information. This information was used as a set of "signposts" to identify music in databases, so that more information associated with that music could then be gleaned.

There were a number of problems with these approaches. Chiefly, they relied on the accuracy of the existing titles, not to mention their presence. If you had a bunch of untagged or unnamed files, you were stuck.

In the early 00s a new way of identifying music began to emerge: audio fingerprinting. In audio fingerprinting, a sample of the actual audio is taken and a "fingerprint" is generated to represent that audio. A number of solutions arose for this, and the existing tags or filenames applied to music files and streams became less important. The trouble was that most fingerprinting algorithms and services were proprietary and cost a lot of money to implement, or had other restrictions on their use. This made it difficult for app developers to use fingerprinting (unless they had a large budget).

Eventually, though, the open source movement got its teeth into the problem, and a number of open source audio fingerprinting solutions have since been introduced. Probably the most open of these, and the most successful, is Acoustid. The source code to generate a fingerprint is freely available, and services like OneMusicAPI, as well as Acoustid.org itself, accept the fingerprints and return associated metadata.

And the catch is...

Acoustid is a crowd-sourced fingerprint database. That gives one enormous advantage: fingerprints are submitted quickly from all around the world. The common disadvantage of crowd-sourced databases rears its head though: sometimes errors creep in, and fingerprints are assigned incorrect metadata.

Here's an example. A query for Gimme Shelter, by The Rolling Stones:

http://api.acoustid.org/v2/lookup?meta=releases tracks&batch=1&duration.1=270&fingerprint.1=AQABz_kTldCRH00[...]

(Various unrelated parameters are omitted, and the fingerprint itself is truncated, for readability.)
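If you are building this query in code, a minimal Python sketch might look like the following. The function name build_lookup_url is made up for illustration, and the client value is a placeholder for your own application key (one of the parameters omitted from the URL above); the other parameter names are taken from the query as shown.

```python
from urllib.parse import urlencode

API_ROOT = "http://api.acoustid.org/v2/lookup"


def build_lookup_url(fingerprint, duration, meta=("releases", "tracks"),
                     client="YOUR_API_KEY"):
    """Build a batch lookup URL for a single fingerprint.

    `client` is a placeholder: substitute your own application key.
    """
    params = {
        "client": client,
        # meta is a space-separated list; urlencode turns the spaces into '+'
        "meta": " ".join(meta),
        "batch": 1,
        "duration.1": int(duration),
        "fingerprint.1": fingerprint,
    }
    return API_ROOT + "?" + urlencode(params)


# The fingerprint here is the truncated prefix from the example above,
# so this URL is for illustration only and will not return a real match.
url = build_lookup_url("AQABz_kTldCRH00", 270)
```

Letting urlencode do the escaping avoids sending a raw space in the meta parameter.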

As you would expect for such a seminal recording, a large amount of data is returned. Hidden amongst this data, however, are some errant entries, ready to trip you up! Just take a look at the user-facing summary for this fingerprint: http://acoustid.org/track/bda45d4a-ae0c-43f8-a54b-77e62cab42ce (note: the incorrect entries have now been removed!).

If you receive inaccurate data, your app will not work properly and your users will become dissatisfied. So how can we deal with that? One way would be to develop an algorithm that builds some form of quoracy into the returned data: results that agree on "Gimme Shelter" are preferred only if an overwhelming number of "votes" have been cast for that recording.
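The naive form of such a vote might look like the sketch below. It assumes the titles have already been pulled out of the response, and both the function name majority_title and the 80% threshold are arbitrary choices for illustration. It also ignores spelling variants, live versions and remixes, which is where the real effort goes.

```python
from collections import Counter


def majority_title(titles, min_share=0.8):
    """Return the winning title if it holds at least `min_share` of the votes.

    Titles are compared after trimming and case-folding, so the winner is
    returned in its normalised (lower-case) form. Returns None when no
    title is dominant enough, or when there are no titles at all.
    """
    normalised = [t.strip().casefold() for t in titles if t and t.strip()]
    if not normalised:
        return None
    title, votes = Counter(normalised).most_common(1)[0]
    return title if votes / len(normalised) >= min_share else None


# Nine results agree, one does not: 90% of the votes clears the threshold.
winner = majority_title(["Gimme Shelter"] * 9 + ["Some Other Song"])
```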

The trouble is the complexity of developing such a solution. Luckily there's an easy-to-use alternative baked into Acoustid: source counts!

Using source counts to identify incorrect Acoustid assignments

What better way to solve inaccuracy in a crowd-sourced database than using metadata about the crowd to identify possibly incorrect submissions? Acoustid keeps a record of the number of people who have submitted a given fingerprint. We can use this as an indicator of correctness.

OK, so this isn't a catch-all solution. It only works with popular recordings like Gimme Shelter, where enough submissions have been made to give some sort of quoracy. But still, this is useful.

Change the query to:

http://api.acoustid.org/v2/lookup?meta=sources recordings releases tracks&batch=1&duration.1=270&fingerprint.1=AQABz_kTldCRH00[...]

The addition of sources and recordings changes the structure of the returned JSON. Each object in the results array now contains a list of recordings, and each recording object in turn carries its sources count along with references to the metadata related to that recording.

These "recordings" relate to MusicBrainz's definition of a recording. Thus, there will be different recordings for different entries, including distinct recording objects for the incorrectly assigned recordings. This means each distinct entry has a valid sources attribute, and you can thus interrogate this attribute to decide which recordings may be errant.
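As a rough sketch of interrogating that attribute in Python: the sample response below is invented for illustration (the recording ids, score and source counts are hypothetical), and the field names follow the structure described above. The exact layout of a real response, especially a batch one, may differ, so treat this as a starting point.

```python
# Hypothetical response, shaped as described above. Only the track id is
# taken from this post; every other value is made up for illustration.
sample_response = {
    "status": "ok",
    "results": [
        {
            "id": "bda45d4a-ae0c-43f8-a54b-77e62cab42ce",
            "score": 0.98,
            "recordings": [
                {"id": "hypothetical-recording-1",
                 "title": "Some Other Song", "sources": 2},
                {"id": "hypothetical-recording-2",
                 "title": "Gimme Shelter", "sources": 150},
            ],
        }
    ],
}


def recording_sources(response):
    """Return (title, sources) pairs, most-submitted recording first."""
    pairs = []
    for result in response.get("results", []):
        for recording in result.get("recordings", []):
            # Treat a missing sources attribute as zero submissions.
            pairs.append((recording.get("title", ""),
                          recording.get("sources", 0)))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


pairs = recording_sources(sample_response)
```

Sorting by source count puts the likely-correct recordings at the top, which makes the outliers easy to spot by eye before you automate anything.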

It's down to you what you do with the data. In OneMusicAPI, we take the title of each recording and work out a distance between the titles. We look for the most tightly clustered title, total up its source count, then remove recordings whose source counts are not within an order of magnitude, or thereabouts, of that total.
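To give a feel for that approach, here is one possible reading of it in Python. This is a sketch, not OneMusicAPI's actual code: the function names, the use of difflib for the title distance, the 0.8 similarity threshold and the factor of 10 are all assumptions made for illustration, and the input is assumed to be a list of dicts with title and sources keys.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """True if two titles are close enough to count as the same title."""
    ratio = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
    return ratio >= threshold


def cluster_by_title(recordings, threshold=0.8):
    """Greedily group recordings whose titles resemble a cluster's first entry."""
    clusters = []
    for rec in recordings:
        for cluster in clusters:
            if similar(rec["title"], cluster[0]["title"], threshold):
                cluster.append(rec)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([rec])
    return clusters


def filter_by_sources(recordings, factor=10):
    """Keep clusters whose total sources are within `factor` of the largest."""
    clusters = cluster_by_title(recordings)
    if not clusters:
        return []
    totals = [sum(r.get("sources", 0) for r in c) for c in clusters]
    best = max(totals)
    kept = []
    for cluster, total in zip(clusters, totals):
        if total * factor >= best:
            kept.extend(cluster)
    return kept


# Invented source counts, for illustration only.
recordings = [
    {"title": "Gimme Shelter", "sources": 150},
    {"title": "Gimmie Shelter", "sources": 40},
    {"title": "Some Other Song", "sources": 2},
]
kept = filter_by_sources(recordings)
```

With these made-up numbers the two Gimme Shelter spellings cluster together with 190 sources between them, and the stray recording with 2 sources falls well outside the order-of-magnitude band, so it is dropped.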

That appears to get rid of the inaccuracies!

Thanks to luckyfish who made the image above available for sharing.