A music metadata lookup pattern: "lookup chaining"

Any software, hardware or website that needs to show music metadata needs access to a music metadata lookup service. There are free data sources for such information but they are not as large as the commercial databases. By combining the databases it is possible to improve the coverage and extract even more metadata; in this blog post I introduce a technique for this called "lookup chaining".

Many types of developers need access to music metadata. Writers of music software, whether music players or music organisers, are one such type. Hardware music players (software underlies these, of course) need the same metadata to display information about the music being played. Then there are website writers who want to include extra structured or unstructured information about music on their sites.

Data sources

So the question is, where to get this metadata from? There are many online music databases that provide access to such information. The larger ones, such as AllMusic and Gracenote, charge large fees for access to their data, so technology startups of more modest means tend to be pushed toward the large free databases: MusicBrainz, Discogs and the like.

The good news is that these free databases are already large in size and getting larger. They have enthusiastic communities dedicated to growing the database, evolving their schemas and supporting their users. Discogs on its own now boasts over four million releases on its database.

On the downside, none of the free databases is as large as the commercial ones. Discogs is the largest, but it is half the size of Gracenote. There is a partial solution to this problem, implied by the fact that the free databases are not subsets of each other. In other words, some data featured in one database is not included in another. This means we can improve the coverage of the free databases by getting them to work together.

Querying problems

There's a further problem on top of the free databases being less heavily populated: because of their nature, actually querying them can be challenging.

This is especially so when the data passed as the query is ambiguous. An example is an attempt to query for a release's metadata using the title of the release. Classical releases are fertile ground for this type of problem because the release titles follow inconsistent syntaxes and the names of the artists that should be associated with the releases are often non-obvious.

Compare the same release on MusicBrainz and Discogs. A title query may find data on one of these databases, but that result may not contain all the data you were seeking (for example, at the time of writing, there's no barcode data on the Discogs version).

This is just one example of a general problem: the record you can reach on one database may not contain all the information you need, even though that information is present on another database where your query methods cannot reach it.

A more cut-and-dried example of this is FreeDB. This is an enormous database and most officially released CDs can be recognised by it. A FreeDB lookup for a given CD is likely to return some information. But it won't return cover art, detailed information about the performers, the country of release and more. That type of data may be in (amongst others) Discogs, but with no way of querying Discogs using a FreeDB disc ID there's no way of getting it.

After writing and maintaining a lot of software which queries these databases I hit on a pattern to improve lookups and return more data. I call it lookup chaining.

Lookup chaining

The basic idea of lookup chaining is that the results you glean from querying one datasource should be used to query the others.

Here's an example. I cited two entries for the same release of Philip Glass's Heroes Symphony above. Imagine I was querying with only release titles, and the release title I had on record was "Heroes / The Light". This is much more likely to match the MusicBrainz version of the metadata, due to the similarity of the title in the database. So first up, we have the data from MusicBrainz.

That's great, but there's no genre information in the MusicBrainz entry. So, I'll extract data from the MusicBrainz entry to use in a query on Discogs. The data I have is:

The release and artist names
Track titles
A barcode
A catalogue number

Any of these can be used to requery Discogs and find the data I need. In this case, the catalogue number is probably the easiest way to go. The catalogue number is on both entries and so it is easy to correlate the two releases. It may seem curious to use the release/artist names in a new query to Discogs given a title query didn't work the first time, but sometimes small differences in a title that does match the first query can "bridge the gap" to the title in the second database.
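As a rough sketch of the steps above, and not OneMusicAPI's actual implementation, here is the chain in Python. The two lists are hypothetical in-memory stand-ins for MusicBrainz and Discogs; the field names, the catalogue number, the barcode and the genre are invented placeholders, and the helper names find_by and chained_lookup are my own. A real version would call each database's web service instead.

```python
# Hypothetical stand-ins for two databases. All values are placeholders.
MUSICBRAINZ = [
    {
        "title": "Heroes / The Light",
        "artist": "Philip Glass",
        "catalogue_number": "CAT-001",
        "barcode": "0000000000001",
    },
]
DISCOGS = [
    {
        "title": "Heroes Symphony / The Light",
        "artist": "Philip Glass",
        "catalogue_number": "CAT-001",
        "genre": "Classical",
    },
]

# Fields tried, in order, when requerying the second database.
CHAIN_KEYS = ("catalogue_number", "barcode", "title")


def find_by(source, field, value):
    """Return the first record in source whose field equals value, else None."""
    if value is None:
        return None
    for record in source:
        if record.get(field) == value:
            return record
    return None


def chained_lookup(title, first, second):
    """Query `first` by title, then use the result to requery `second`."""
    primary = find_by(first, "title", title)
    if primary is None:
        return None
    merged = dict(primary)
    for key in CHAIN_KEYS:
        match = find_by(second, key, primary.get(key))
        if match is not None:
            # Keep the first database's values; only fill in missing fields.
            for field, value in match.items():
                merged.setdefault(field, value)
            break
    return merged


result = chained_lookup("Heroes / The Light", MUSICBRAINZ, DISCOGS)
```

In this sketch a direct title query against the second list finds nothing, but the catalogue number taken from the first result does match, so the merged record picks up the genre while keeping the barcode from the first database.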

Lookup chaining in OneMusicAPI

You won't be surprised to hear lookup chaining is used in OneMusicAPI to increase the number of matches your queries return. It's especially useful in MusicBrainz disc ID and FreeDB queries.

I think it will remain a useful pattern with the advent of the Aggregated Database (ADB). You might think that the fact I am only querying OneMusicAPI's own database means there's no need for chaining; on the contrary, initial releases of ADB will contain source databases' data pretty much verbatim and so some requerying of ADB will probably still be required.

Thanks to Desmond Kavanagh who made the image above available for sharing.