Five ways to improve your music metadata queries

Developers of music apps, or of any product that needs information about music, have to look up music metadata somehow. Normally this is done by calling an online database through its API. The code sends a request that identifies the entity in question (e.g. to find information about a release, it sends the release name).

The quality of metadata lookups can be measured in two chief ways. The first is coverage, i.e. how likely a given query is to find any information. The second is accuracy, i.e. how likely the returned data is to be the information sought. Achieving high coverage and accuracy tends to depend on the input data used to perform the query. Some data types are less ambiguous than others, and will provide better results. Other types are more ambiguous and require more work to make the queries accurate.

I've been writing music software to discover music metadata for several years now, and I've learned a lot of tricks. Here are five ways to make your metadata discovery both more accurate and more likely to find matches.

1. Try track-based searches

A lot of programmers start looking for information about a given musical release by querying using the title and artist of the release. Most release titles are fairly unique and those that aren't (Greatest Hits) are disambiguated by the artist name.

Where this falls down is that every album an artist produces gets multiple releases worldwide, plus subsequent re-releases for successful albums (think digital remasters etc). These different releases have different metadata, so a query with a simple title/artist pair may return results for a release you weren't expecting: for example, the wrong year when a release has been re-issued.

A better way is often to use track names to query the API, if you have those names and the API supports it. Track names at least help to rule out releases from other countries that contain additional tracks, and likewise re-releases with bonus material. In addition, for complicated or long release titles which may include unexpected artifacts, track-based searching can be the only way of finding a particular release.
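As a rough illustration, here is a minimal sketch of picking between candidate releases by comparing track lists. It assumes the API has already returned candidates, and the dict keys "title", "year" and "tracks" are hypothetical; adapt them to whatever your API actually returns.

```python
def normalize(title):
    """Lower-case and collapse whitespace so trivial differences don't matter."""
    return " ".join(title.lower().split())


def track_overlap(local_tracks, candidate_tracks):
    """Return the Jaccard overlap (0.0 to 1.0) between two track lists."""
    local = {normalize(t) for t in local_tracks}
    candidate = {normalize(t) for t in candidate_tracks}
    if not local and not candidate:
        return 0.0
    return len(local & candidate) / len(local | candidate)


def best_release(local_tracks, candidates):
    """Pick the candidate whose track list overlaps most with ours.

    `candidates` is a list of dicts with hypothetical keys "title",
    "year" and "tracks". Returns None if there are no candidates.
    """
    return max(
        candidates,
        key=lambda c: track_overlap(local_tracks, c["tracks"]),
        default=None,
    )
```

A re-release with a bonus track scores lower than the original because the extra track enlarges the union without enlarging the intersection, which is exactly the distinction a title/artist query can't make.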

2. Remove disc number and year artifacts

When querying releases by title, the title sometimes includes "media artifacts" such as an identification of the CD within the release ("CD 1", "Disc II" etc). In most cases these "disc number artifacts" are of little use for querying online databases. It is best to remove these strings before querying. I've a regex for this; perhaps I'll write that up for a future blog post.

Another common "artifact" polluting release titles is the year, sometimes within parentheses. Remove it too! But remember that such extra information may be useful for refining the query or filtering results.
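This isn't the regex mentioned above, just a minimal sketch of the idea. It assumes discs are labelled "CD", "Disc" or "Disk" followed by an Arabic or Roman numeral, and it only strips years that appear inside brackets, so that titles which are themselves a year are left alone. The removed values are returned rather than thrown away so they can still be used for filtering.

```python
import re

# Matches "CD 1", "Disc II", "disk 2" etc., optionally wrapped in brackets
# and optionally preceded by a separator such as "-" or ",".
DISC_RE = re.compile(
    r"[\s,;:\-]*[\(\[]?\b(?:cd|dis[ck])\s*(\d+|[ivx]+)\b[\)\]]?",
    re.IGNORECASE,
)

# Matches a four-digit year in round or square brackets, e.g. "(1997)".
YEAR_RE = re.compile(r"\s*[\(\[]((?:19|20)\d{2})[\)\]]")


def clean_title(title):
    """Return (cleaned_title, disc, year); disc and year are None if absent."""
    disc = year = None
    match = DISC_RE.search(title)
    if match:
        disc = match.group(1)
        title = title[:match.start()] + title[match.end():]
    match = YEAR_RE.search(title)
    if match:
        year = int(match.group(1))
        title = title[:match.start()] + title[match.end():]
    return title.strip(), disc, year
```

For example, `clean_title("Greatest Hits (CD 1)")` gives a title of "Greatest Hits" and a disc of "1", while a title such as "Discovery" passes through untouched because the word boundary checks stop "Disc" matching inside a longer word.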

3. Split strings around delimiters

Long, complicated release titles are common with classical releases. The longer the release title, the more scope there is for variance, both within the title and between databases. Titles for the same release may include performer and conductor names, sub-titles, movement titles and more.

In many cases the different parts to these titles are separated by a delimiter. I often see "-" or "/" used. Here's one I've seen (not from my own library I hasten to add):

Shepard, Vonda and Friends - Ally Mcbeal - For Once In My Life

As humans we can see the actual release title is that final part, "For Once In My Life". However, code doesn't benefit from such thought processes, so you will have to derive an algorithmic approach. If you split by delimiter ("-" in this case) and make a separate query for each part, you should eventually find your result.
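A minimal sketch of that splitting step follows. It assumes the delimiter is surrounded by whitespace (so hyphenated words survive), and trying the last part first is just a heuristic based on the example above, not a rule.

```python
import re

# Split on "-" or "/" only when surrounded by whitespace, so hyphenated
# words such as "Re-Issue" are left intact.
DELIMITER_RE = re.compile(r"\s+[-/]\s+")


def candidate_titles(raw_title):
    """Return titles to query, in the order they should be tried."""
    parts = [p.strip() for p in DELIMITER_RE.split(raw_title) if p.strip()]
    # Heuristic: in the example above the real title is the final part,
    # so try the parts in reverse order.
    candidates = list(reversed(parts))
    # Fall back to the full, unsplit string last.
    if len(parts) > 1:
        candidates.append(raw_title.strip())
    return candidates
```

Each candidate can then be sent as its own query, stopping at the first one that returns an acceptable match.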

4. Attempt general queries, then filter on client side

One strategy for improving coverage is to not be too strict in the queries you send to online databases. Pass a general query, and then check the results with your own matching algorithms to make sure it is the data you expect. This can be helpful when the release title you are querying is a little inaccurate.

There are several ways of doing this. One is to use a common string difference algorithm such as Levenshtein distance to compare the expected and actual release/artist names, and track names and positions too if possible.
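Here is a minimal sketch of client-side filtering with Levenshtein distance. The "title" and "artist" keys on each result and the 0.8 threshold are assumptions for illustration; you would tune the threshold against your own data.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalise edit distance to a 0.0-1.0 score, ignoring case."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def filter_results(expected_title, expected_artist, results, threshold=0.8):
    """Keep results whose title and artist both score at or above threshold.

    `results` is a list of dicts with hypothetical "title" and "artist" keys.
    """
    return [
        r for r in results
        if similarity(expected_title, r["title"]) >= threshold
        and similarity(expected_artist, r["artist"]) >= threshold
    ]
```

Normalising by the longer string's length keeps the threshold meaningful for both short and long titles; a raw distance of 3 is a near match for a long classical title but a complete miss for a four-letter album name.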

5. Learn the query syntax

Finally, know your API. Different APIs have different syntaxes for querying them. For example, Discogs offers Lucene syntax for queries which is not obvious from their documentation.
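If an API does accept Lucene-style queries, a sketch like the following can help build them safely. The escaping covers the characters that are special in the classic Lucene query parser; the field names used in the example (`release`, `artist`) are hypothetical, so check which fields your API actually exposes.

```python
import re

# Characters with special meaning in the classic Lucene query parser.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def lucene_escape(value):
    """Backslash-escape Lucene special characters in a search value."""
    return _LUCENE_SPECIAL.sub(r"\\\1", value)


def fielded_query(**fields):
    """Build a query such as: release:"..." AND artist:"...".

    Field names are passed as keyword arguments and are NOT validated;
    they are placeholders for whatever your API supports.
    """
    clauses = [
        f'{name}:"{lucene_escape(value)}"'
        for name, value in fields.items()
    ]
    return " AND ".join(clauses)
```

Escaping matters more than it might seem for music data, since artist and release names regularly contain characters such as "/", "!" and "?" that would otherwise be interpreted as query operators.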

By joining the API's developer community you can learn these tips and tricks, and members can help one another achieve better results.

I hope this list of tips has been helpful. If you have more tips, please list them below!

Thanks to Don Moyer, who made the image above available for sharing.