The ambiguities of parsing FreeDB titles

FreeDB is a database of audio CDs, holding each CD's title, overall artist, year of release and details about each individual track. It can be queried in a few ways, most commonly using a CD's TOC (table of contents), which returns a set of matching CDs with their FreeDB IDs. An ID can then be used to look up the information for an individual CD.

There's no artwork in FreeDB, but it is possible to find cover art using a FreeDB ID via OneMusicAPI.

FreeDB's dataset remains the largest of the free databases. For that reason it is used by software and hardware that rips CDs, or that simply wants a fallback source of music metadata. Unfortunately, its history means it has grown up without the data format strictness common in the alternative databases. One example is title strings, and the ambiguities that arise when parsing them.

FreeDB data format

Querying FreeDB for the data on an individual CD returns output like this [abridged]:

...
#
DISCID=1c029503
DTITLE=Sugababes / Freak like me
DYEAR=2003
DGENRE=Pop/Funk
TTITLE0=Freak like me [radio edit]
TTITLE1=Freak like me [we dont give a damn mix]
TTITLE2=Breathe easy
EXTD=
...

FreeDB output is divided into a set of common fields that are repeated for each CD: DISCID, DTITLE, TTITLE0 and others. Data about the disc is prefixed with a 'D'; data about individual tracks is prefixed with a 'T' and suffixed with the track position (zero-based).

So we can see there are two types of titles: one for the disc and those for the tracks, DTITLE and TTITLEn respectively.
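To make the field layout concrete, here is a minimal Python sketch that reads a record like the sample above into disc-level fields and an ordered list of track titles. The function name parse_record is made up for illustration and is not part of any FreeDB client library. The sketch also assumes that a repeated key continues the previous value, so it concatenates them.

```python
import re

SAMPLE = """\
#
DISCID=1c029503
DTITLE=Sugababes / Freak like me
DYEAR=2003
DGENRE=Pop/Funk
TTITLE0=Freak like me [radio edit]
TTITLE1=Freak like me [we dont give a damn mix]
TTITLE2=Breathe easy
EXTD=
"""


def parse_record(text):
    """Return (disc_fields, track_titles) for one FreeDB record.

    Hypothetical helper: comment lines and lines without '=' are skipped.
    Assumption: a key that appears more than once continues the earlier
    value, so repeated keys are concatenated.
    """
    disc = {}
    tracks = {}
    for line in text.splitlines():
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        match = re.fullmatch(r"TTITLE(\d+)", key)
        if match:
            # Track fields carry a zero-based position as a suffix.
            index = int(match.group(1))
            tracks[index] = tracks.get(index, "") + value
        else:
            disc[key] = disc.get(key, "") + value
    return disc, [tracks[i] for i in sorted(tracks)]


disc, tracks = parse_record(SAMPLE)
print(disc["DTITLE"])
print(tracks)
```

Note that at this stage DTITLE is still one raw string; pulling the artist and title apart is a separate step.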

Titles and slashes

It's when we inspect the contents of the title fields that we begin to see how ambiguities can be introduced. On the subject of DTITLE, the FreeDB HOWTO states:

Technically, this may consist of any data, but by convention contains the artist and disc title (in that order) separated by a "/" with a single space on either side to separate it from the text. There may be other "/" characters in the DTITLE, but not with space on both sides, as that character sequence is exclusively reserved as delimiter of artist and disc title!

So in theory, we can parse the artist and release title by splitting the string on " / " and assigning the two parts to the artist and the title respectively.
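As a sketch of that theory in Python: the function below splits on the first " / " only, so bare slashes inside either part are left alone. The name split_dtitle is hypothetical, and returning None for the artist when no delimiter is present is just one possible choice made for this example.

```python
def split_dtitle(dtitle):
    """Split a DTITLE value into (artist, title) on the first ' / '.

    Hypothetical helper. When the delimiter is absent, this sketch
    returns None for the artist and the whole string as the title.
    """
    artist, sep, title = dtitle.partition(" / ")
    if not sep:
        return None, dtitle.strip()
    return artist.strip(), title.strip()


print(split_dtitle("Sugababes / Freak like me"))
```

This works for any entry that follows the HOWTO to the letter.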

Ambiguities

So much for theory! In practice, as a community-maintained database, the dataset is far from perfect. In the wild we see several variations on this format.

DTITLE=Sugababes/Freak like me

This first one is easy to deal with. There are no spaces around the slash between the artist and the title, so our parser just needs to be a little more lenient.

TTITLE1=Sugababes / Freak like me/we dont give a damn mix

A slightly different variation, this time in track titles. Again we can add lenience: because there are spaces around the first slash, we know that what precedes it is the artist and what follows it is the track title, Freak like me/we dont give a damn mix.
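One way to sketch that lenience in Python is to prefer the spaced delimiter and only fall back to a bare slash when no spaced one exists. The name split_lenient is made up for this example, and the fallback is a heuristic, not something the HOWTO sanctions.

```python
def split_lenient(value):
    """Split into (artist, title), preferring ' / ' over a bare '/'.

    Hypothetical helper. The bare-slash fallback is a heuristic: it
    will wrongly split a lone title or artist that contains a slash.
    """
    for delimiter in (" / ", "/"):
        left, sep, right = value.partition(delimiter)
        if sep:
            return left.strip(), right.strip()
    return None, value.strip()


print(split_lenient("Sugababes/Freak like me"))
print(split_lenient("Sugababes / Freak like me/we dont give a damn mix"))
```

The fallback comes at a price. A value such as an artist name containing a slash, with no title after it, would be split in the wrong place, so it may be worth applying the bare-slash rule only to fields where an artist is expected.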

It's probably obvious where this is heading...

TTITLE1=Sugababes / Freak like me / we dont give a damn mix

What to do here? The spaced forward-slash delimiter now appears twice, in effect creating three fields, which is illegal according to the HOWTO. The trouble is that it happens, albeit rarely, and so we have to deal with it.

With no formal way of automating the decision, this is a case where the user must be asked how to assign the parts of the string to the artist name and the title. It's a shame, but automation gone wrong is often worse than no automation at all.
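What software can do is detect the ambiguity and line up the choices for the user. The Python sketch below lists every way of cutting the string at a " / " delimiter; more than one candidate means the decision should be handed over. Both function names are hypothetical.

```python
def candidate_splits(value):
    """List every (artist, title) pair obtainable by cutting at one ' / '.

    Hypothetical helper: returns an empty list when there is no
    delimiter, one pair for a well-formed value, and several pairs
    when the delimiter appears more than once.
    """
    parts = value.split(" / ")
    return [
        (" / ".join(parts[:i]), " / ".join(parts[i:]))
        for i in range(1, len(parts))
    ]


def is_ambiguous(value):
    """True when more than one artist/title split is possible."""
    return len(candidate_splits(value)) > 1


title = "Sugababes / Freak like me / we dont give a damn mix"
if is_ambiguous(title):
    # Show the options to the user instead of guessing.
    for number, (artist, name) in enumerate(candidate_splits(title), start=1):
        print(f"{number}. artist={artist!r} title={name!r}")
```

How the options are shown and how the answer is stored is left to the application.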

Thanks to wise.adam, who made the image above available for sharing.