A while ago I wrote an article on the elsten software blog about best practices for searching Discogs. Since then there have been a few updates to the Discogs API, so I thought I'd repost the article with updated information and tips.
First of all, the raison d'etre for OneMusicAPI is, of course, to help you avoid problems caused by API updates. When updates occur in Discogs, or MusicBrainz, or any of the aggregated music APIs that OneMusicAPI calls, we fix it on our servers so you don't have to. Consider using OneMusicAPI for a more maintainable, lower cost music app!
But if you really want to write your own integration to Discogs and want some tips, read on!
I've been writing apps against the Discogs API for four years now. The coverage of the Discogs database is amongst the most impressive of the free online databases. The API itself provides search features to enable you to search music information across this vast database. But blindly passing in artist and release names does not guarantee that results are returned, or that the results are accurate. Indeed, tuning searches for music releases against music databases is a constant balancing act between accuracy and quantity. Make the query too general and you get lots of matches which aren't necessarily the correct release. Make the query very specific and the number of matches falls. Neither situation is desirable.
I've written up ten tips for improving your Discogs searches. Please note: where I've shown URLs in sample code below I have not URL encoded them to make it easier to read what's going on. You'll have to do the URL encoding yourself. The examples concentrate on release queries, rather than queries for other entities. However, many of the tips and best practices are transferable.
The basic Discogs search query was updated a little while ago to remove the requirement to specify the query in Lucene search syntax. Lucene syntax can still be used (and, indeed, it's a useful tool for power users... see below) but now we have the rather easier to understand use of query strings to specify parameters for your search.
For example, a search for a release:
http://api.discogs.com/database/search?type=release&release_title=untrue&artist=burial
A few points on basic queries before we get into optimisations...
I purposely queried release_title and not title. The latter is a legacy parameter to support old queries that relied on Discogs' definition of a release title being a title in the form "[artist name] - [release name]". So it's best to query just the actual release name.
As with all HTTP based APIs, you must URL encode the values assigned to the parameters in the query string. This isn't so obvious in the above case, where there are no spaces or reserved characters.
This query does not look for all releases which exactly match the parameters; some leniency is afforded.
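Here's a minimal Python sketch of building the basic query with the values URL encoded. The helper name build_search_url is my own invention for illustration; the endpoint and parameter names are the ones used in the example above.

```python
from urllib.parse import urlencode

# Endpoint as used in the examples in this article.
SEARCH_ENDPOINT = "http://api.discogs.com/database/search"


def build_search_url(release_title, artist):
    """Build a release search URL, URL encoding every parameter value."""
    params = {
        "type": "release",
        "release_title": release_title,
        "artist": artist,
    }
    # urlencode escapes reserved characters and turns spaces into '+'.
    return SEARCH_ENDPOINT + "?" + urlencode(params)
```

Calling build_search_url("untrue", "burial") reproduces the URL above unchanged, while a title containing spaces, such as "The Best of Nina Simone", comes out with the spaces encoded.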
Our first tip. It's worth sanitising the values you send for release_title and artist. This increases the chances of getting a correct match.
For release_title, check you aren't sending disc number artifacts. By this, I mean text such as "Disc 1" or "Disk B" or similar. Such titles are common in digital music tagged via FreeDB, which tends to include such artifacts in release names.
For both release_title and artist, remove "&" and "and" strings from queries. I've found this slightly improves the chance of a match.
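As a rough sketch of this kind of sanitising in Python: the regular expressions below are my own guesses at what counts as a disc number artifact and are certainly not exhaustive, and the function names are hypothetical.

```python
import re

# Trailing disc artifacts such as "(Disc 1)", "Disk B" or "- CD 2".
# Whitespace is required before the number or letter so that a title
# like "Italo Disco" is left alone. Illustrative pattern only.
DISC_ARTIFACT = re.compile(
    r"[\s\-:,(\[]*\b(?:disc|disk|cd)\s+(?:\d+|[a-z])\b[\s)\]]*$",
    re.IGNORECASE,
)

# A literal "&" or the standalone word "and".
AND_WORDS = re.compile(r"&|\band\b", re.IGNORECASE)


def remove_and(text):
    """Drop '&' and 'and', then collapse the leftover whitespace."""
    return " ".join(AND_WORDS.sub(" ", text).split())


def sanitise_title(title):
    """Strip a trailing disc number artifact, then remove '&' and 'and'."""
    return remove_and(DISC_ARTIFACT.sub("", title))


def sanitise_artist(artist):
    """Artists only need the '&' and 'and' treatment."""
    return remove_and(artist)
```

With these definitions, sanitise_title("Untrue (Disc 1)") gives "Untrue" and sanitise_artist("Simon & Garfunkel") gives "Simon Garfunkel". A title that consists of nothing but a disc artifact would come back empty, so check for that before querying.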
File this one under 'accuracy'. When you're firing queries against Discogs for albums, you may get a lot of results back. Don't blindly accept the first one that's returned; check that the album looks like the one you are asking for. Discogs' database is enormous, and for any one album there can be many, many entries for different releases of the same album. For instance, check out these different releases for Is This It. You'll also notice album art differences between different releases. Any one of these could've been returned by your query, and they could arrive in any order.
There are a few ways of sanity checking the results. One useful way is to check the tracks for a release are the same as what you expect. This appears to be a good way of improving accuracy where you have common album names for compilations (there appear to be many different releases of The Best of Nina Simone, for example, with different art and track listings).
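One way the track check could look in Python is sketched below. It assumes you have already fetched each candidate's track titles into a plain list under a "tracks" key; that dictionary shape, the function names and the 0.8 threshold are all my own assumptions, not anything Discogs defines.

```python
import re


def normalise(title):
    """Lower-case and drop punctuation so small tagging differences don't matter."""
    return " ".join(re.sub(r"[^\w\s]+", " ", title.lower()).split())


def track_overlap(expected_titles, candidate_titles):
    """Fraction of the expected track titles found in the candidate release."""
    if not expected_titles:
        return 0.0
    candidate = {normalise(t) for t in candidate_titles}
    hits = sum(1 for t in expected_titles if normalise(t) in candidate)
    return hits / len(expected_titles)


def best_match(expected_titles, candidates, threshold=0.8):
    """Return the candidate with the highest overlap, or None if none reaches the threshold."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = track_overlap(expected_titles, candidate["tracks"])
        if score >= threshold and score > best_score:
            best, best_score = candidate, score
    return best
```

Returning None when nothing reaches the threshold is deliberate: a missing match is usually easier to deal with than a wrong one.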
One for release and master queries. It may not be immediately obvious, but it's perfectly possible to query Discogs for a release specifying only the track names. This might be useful where the release name is difficult to get right. Classical releases are a common case of this.
This is an example of where Lucene syntax can be used. We'll see some more examples later, but note the use of the [field]:"[value]" syntax within the q parameter.
http://api.discogs.com/database/search?type=release&q=track:"she said she said" track:"doctor robert"
Using this type of search, Discogs looks inside the track titles and returns all the releases with matching track names.
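A small Python sketch of assembling such a track query follows. The function names are hypothetical, and stripping double quotes out of track names is simply my way of keeping each quoted phrase intact.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://api.discogs.com/database/search"


def track_query(track_names):
    """Join one track:"..." clause per track name, separated by spaces."""
    clauses = []
    for name in track_names:
        # Drop double quotes that would otherwise end the phrase early.
        cleaned = name.replace('"', "").strip()
        if cleaned:
            clauses.append('track:"{}"'.format(cleaned))
    return " ".join(clauses)


def track_search_url(track_names):
    """Build the full, URL encoded search URL for the given track names."""
    params = {"type": "release", "q": track_query(track_names)}
    return SEARCH_ENDPOINT + "?" + urlencode(params)
```

track_query(["she said she said", "doctor robert"]) produces the same q value as the example URL above, and track_search_url takes care of the URL encoding.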
Under the covers, Discogs uses Lucene, via Solr, to index the Discogs database and provide free text searching. The advantage of this is that search becomes extremely powerful. The downside is that you need to learn Lucene syntax to make use of it.
The use of Lucene isn't explained explicitly in the API docs. Instead, mention is made under the search function to simply insert the query as the 'q' query string parameter. An example is:
http://api.discogs.com/database/search?type=release&q=thriller
This simply searches for all occurrences of 'thriller' in all of Discogs' enormous database. This brings back labels, artists and a little-known album by a chap called Michael Jackson. What if it's this album you wanted to find?
Well, this 'q' can be any query made using the advanced Discogs search syntax which is actually Lucene search syntax. So, try prefixing the 'thriller' text with the field name 'title':
http://api.discogs.com/database/search?type=release&q=title%3Athriller
This brings back only releases with the word 'thriller' in the title. To make the query more accurate, you need to query extra fields. Generally speaking, any of the names the Discogs API documentation lists as query string parameters can also be used as a field name inside q:
http://api.discogs.com/database/search?type=release&q=title:thriller AND genre:"Funk / Soul"
Returns all albums with "thriller" in the title which also have "Funk / Soul" in the genre field.
And now we go mad. Lucene supports fuzzy searching with the '~' (tilde) operator.
http://api.discogs.com/database/search?type=release&q=title:thriller~
... gives us albums called "Thrillzz" as well as "Thriller". This can be useful if you do not 100% trust the exact accuracy of the album name or artist name you are using to search Discogs.
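To avoid hand-assembling these strings, here is a minimal Python sketch of a q builder. The helper names term and all_of are hypothetical, and the quoting rule (quote anything that isn't a single alphanumeric word) is an assumption of mine rather than anything the Discogs docs specify.

```python
def term(field, value, fuzzy=False):
    """Build a single field:value clause, quoting multiword values."""
    value = value.replace('"', "").strip()
    single_word = value.isalnum()
    if fuzzy:
        # In Lucene syntax a tilde after a quoted phrase means proximity,
        # not fuzziness, so this sketch only allows fuzzy single words.
        if not single_word:
            raise ValueError("fuzzy matching here is limited to single-word values")
        return "{}:{}~".format(field, value)
    text = value if single_word else '"{}"'.format(value)
    return "{}:{}".format(field, text)


def all_of(*terms):
    """Require every clause by joining them with AND."""
    return " AND ".join(terms)
```

For example, all_of(term("title", "thriller"), term("genre", "Funk / Soul")) gives title:thriller AND genre:"Funk / Soul", and term("title", "thriller", fuzzy=True) gives title:thriller~. Remember to URL encode the result before putting it in the query string.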
And now the warning. The fact that Lucene is not mentioned in the Discogs docs may make it a bad idea to use it. The absence of this detail from the docs could be interpreted as meaning that Lucene is 'unsupported'. The same goes for the use of fields in queries other than those explicitly mentioned. If Discogs change their search server, that may well break your queries. Don't say I didn't warn you. But then, you should be using OneMusicAPI anyway ;-)
Discogs is so exhaustive it includes bootlegs, promo releases, cassette releases and more. If you know your audience, maybe they are unlikely to have these. In this case, it is possible to exclude such releases like so:
http://api.discogs.com/database/search?type=release&q=-format:"promo"+format:"album"+format:"CD"+title:"thriller"
This looks for all releases entitled "thriller" which are CD albums and NOT promos. This, again, is another use of Lucene syntax allowing us to specify negative searches.
Time to increase the number of matches! One approach I've found useful is to try synonyms. I was searching, for example, for "7 Drunken Nights" by The Dubliners. It turns out this is stored in Discogs as "Seven Drunken Nights", probably correctly. By swapping "7" for "Seven" I got a hit. The point is that sometimes neither your source data nor Discogs' data can be 100% trusted for such amorphous concepts as 'album titles'. For instance, Sufjan Stevens' "Illinoise" album seems to have many different canonical titles.
Of course, results from such a strategy should be sanity tested to make sure they are retaining accuracy.
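The number swap is easy to automate. Below is a minimal Python sketch that generates alternative titles to try; it only covers one to ten, the function name is hypothetical, and real-world synonyms obviously go well beyond digits.

```python
import re

NUMBER_WORDS = {
    "1": "One", "2": "Two", "3": "Three", "4": "Four", "5": "Five",
    "6": "Six", "7": "Seven", "8": "Eight", "9": "Nine", "10": "Ten",
}
WORD_NUMBERS = {word.lower(): digits for digits, word in NUMBER_WORDS.items()}


def title_variants(title):
    """Return the original title followed by any digit/word swapped variants."""
    def digits_to_word(match):
        return NUMBER_WORDS.get(match.group(0), match.group(0))

    def word_to_digits(match):
        return WORD_NUMBERS.get(match.group(0).lower(), match.group(0))

    variants = [title]
    for candidate in (
        re.sub(r"\b\d+\b", digits_to_word, title),
        re.sub(r"\b[A-Za-z]+\b", word_to_digits, title),
    ):
        if candidate not in variants:
            variants.append(candidate)
    return variants
```

title_variants("7 Drunken Nights") returns the original plus "Seven Drunken Nights". Query the original first and only fall back to the variants when it finds nothing.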
It's common to experience artifacts within an album title separating what may be titles, subtitles and sometimes artist names. This happens where the provider of the title is not 100% sure on the canonical name of the release, and is quite common in online tagging databases. For instance, I tried to query for one album with the title "After Hours: Northern Soul Masters", where the title had been provided by FreeDB, but this is recorded in Discogs as simply "After Hours".
By choosing common delimiters to split titles around (I chose colons, forward slashes and hyphens) and then querying using the separate parts it can be possible to improve the number of matches. Again, sanity check to make sure the results are accurate.
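A sketch of the splitting step in Python is below. One assumption of mine: hyphens only count as delimiters when they have whitespace on both sides, so that hyphenated words stay in one piece. The function name is hypothetical.

```python
import re

# Colons and forward slashes anywhere; hyphens only when surrounded by spaces.
DELIMITERS = re.compile(r"\s*[:/]\s*|\s+-\s+")


def title_parts(title):
    """Return the full title followed by its delimiter-separated parts."""
    title = title.strip()
    parts = [part.strip() for part in DELIMITERS.split(title)]
    results = [title]
    for part in parts:
        if part and part not in results:
            results.append(part)
    return results
```

For "After Hours: Northern Soul Masters" this yields the full title, then "After Hours", then "Northern Soul Masters", which gives you three queries to try in order.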
It turns out that a multiword Lucene query within quote marks, let's say "Ally McBeal", is case sensitive. For instance,
http://api.discogs.com/database/search?type=release&q=title:"ally mcbeal"
Gives zero results, while:
http://api.discogs.com/database/search?type=release&q=title:"ally mcBeal"
Gives the lot. Note that http://api.discogs.com/database/search?type=release&q=title:"Ally mcbeal" also gives no results, which suggests the case sensitivity only applies in the middle of strings, not the start. So, if you see a similar title, try preserving the case in the middle of strings.
And in case you were wondering what an Ally McBeal album was doing in my collection... it's my wife's, right?
The Discogs documentation specifies that you must provide a User-Agent header in requests to the API. Even if your app works initially with a copy and pasted or made-up header, there's no guarantee this will continue to be the case. The Discogs forums have a long history of developers' apps being blocked because they failed to specify a User-Agent, or they just re-used another one.
Make sure you decide on the format of your User-Agent early, and use it from the start.
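In Python, for instance, you might set the header once and build every request through one helper, roughly as sketched here. The application name, version and contact URL are placeholders I made up; substitute your own.

```python
from urllib.request import Request

# Hypothetical application name, version and contact URL: replace with your own.
USER_AGENT = "MyMusicApp/1.0 +http://example.com/mymusicapp"


def discogs_request(url):
    """Create a request object that always carries our own User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

The returned object can be passed straight to urllib.request.urlopen. Routing every call through one helper means no request can slip out without the header.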
The use of the Discogs API is rate limited and those limits are strictly observed. Make sure you add in a request throttler to request, at most, once every second. In my own Java code I use the DelayQueue core API class.
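If you aren't on Java, the same idea is only a few lines in most languages. Here is a minimal single-threaded Python sketch; it is not a port of my DelayQueue code, the class name is hypothetical, and the clock and sleep functions are injectable purely to make it easy to test.

```python
import time


class Throttle:
    """Block so that successive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one Throttle for the whole app and call wait() immediately before each request to Discogs. This version is not thread-safe; if several threads share it, guard wait() with a lock.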
In addition, you can only download 1000 images per day from the Discogs servers.
Once you sign up to the Discogs premium API, as OneMusicAPI is, you can query faster and also download more images.
I hope these nuggets have helped you improve your Discogs searching!