While preparing a data extract I recently discovered a set of Discogs data which I've begun calling "special entities". These are artists or labels which aren't actually artists or labels, but are intended to work as a meta-entity for grouping or classification purposes.
Why is it worth calling these out? Well, depending on how you use the Discogs database or API, linking to these artists or labels can cause performance issues for little benefit; some of them link to a huge number of entities.
There is a set of data entries filed as "artists" which aren't really musical artists.
Folk and Traditional are both "artists" which don't refer to any one artist, but are intended as catch-all references for musical compositions, arrangements etc. that have been passed down the generations.
Perhaps most annoyingly, an actual artist called Public Domain has been re-used to designate a lot of public domain recordings and releases. This mixed-use means that it's difficult to separate the genuine music by Public Domain from the other releases.
Finally there are catch-all artists, such as Anonymous and localised variants. Fortunately, No Artist and Unknown Artist are at least placeholders, with no real links in the database to cause problems.
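If you want to skip these artists when processing an extract, a simple name filter is one way to do it. This is only a rough sketch: the names are the ones mentioned in this post, the `"name"` key on each record is a hypothetical shape for your own parsed data rather than anything Discogs defines, and matching on the name string (rather than an ID) is my assumption. Localised variants of Anonymous would need adding by hand.

```python
# Names taken from this post. Matching on the display name is an assumption;
# extend the set with any localised variants you come across.
SPECIAL_ARTIST_NAMES = frozenset({
    "Folk",
    "Traditional",
    "Public Domain",
    "Anonymous",
    "No Artist",
    "Unknown Artist",
})

# Pre-compute case-folded names so the lookup is case-insensitive.
_SPECIAL_FOLDED = frozenset(name.casefold() for name in SPECIAL_ARTIST_NAMES)


def is_special_artist(name: str) -> bool:
    """Return True if the name is one of the known meta-entity artists."""
    return name.strip().casefold() in _SPECIAL_FOLDED


def drop_special_artists(artists):
    """Filter a list of artist records (dicts with a hypothetical "name" key)."""
    return [a for a in artists if not is_special_artist(a["name"])]
```

Bear in mind that filtering on Public Domain this way also drops the genuine artist of that name, since the two uses share one entry.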
Perhaps the biggest performance issue I encountered when generating Discogs dumps was with Not On Label. This links to an enormous number of releases, all of which are, of course, not designated as having been released by a record label.
It goes further than that; a number of popular artists also have special labels named with "Not On Label", for example Not On Label (Depeche Mode) which lists all of Depeche Mode's releases not on a record label.
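Both the plain Not On Label entry and the per-artist variants can be spotted from the label name. The sketch below assumes the variants always follow the "Not On Label (Artist Name)" pattern shown above; the function name and the return shape are my own invention for illustration.

```python
import re

# Matches "Not On Label" on its own, or followed by an artist in parentheses.
# The naming pattern is inferred from the example in this post.
NOT_ON_LABEL_RE = re.compile(r"^Not On Label(?: \((?P<artist>.+)\))?$")


def parse_not_on_label(label_name: str):
    """Return (is_special, artist) for a label name.

    artist is None for the plain "Not On Label" entry and for real labels.
    """
    match = NOT_ON_LABEL_RE.match(label_name.strip())
    if match is None:
        return (False, None)
    return (True, match.group("artist"))
```

For example, `parse_not_on_label("Not On Label (Depeche Mode)")` gives `(True, "Depeche Mode")`, so you can skip the label link or treat it as a self-release by that artist, whichever suits your extract.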
What this all means is you have to be careful when blindly working with the Discogs data set! Any more of these? Let me know in the comments and I'll update the post.
Thanks to Exile on Ontario St who made the image above available for sharing.