/Everybody stand back/

A regular expression is a textual expression that can be used to match other strings, or portions of strings. Regular expressions ("regexes" as they are also commonly known) are very powerful and are capable of describing and matching strings in multiple ways.

I've written before about how you should strip disc number artifacts from release titles before querying. So, given a release title, what's the best way of recognising these disc number artifacts ("disc A", "disk II" etc.) and removing them before querying? Using regexes is one way. Within OneMusicAPI and bliss I have been using and refining a regular expression to strip disc number artifacts for some years.

The base regex I have developed is:

(?i)(( )?(-|by)?(( )|_|\(|\[))+(diskLabelType[_\. ]*(([0-9]+|One|Two|Three|Four|Five|Six|Seven|(IX|IV|V|I{1,3}))(/([0-9]+|One|Two|Three|Four|Five|Six|Seven|(IX|IV|V|I{1,3})))?)\)?\]?)

So, a brief explanation of this regex...

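First, it should be noted that this is a regex developed and used against the standard Java regex API. It may require a little tweaking to get it working in other environments such as POSIX-compliant tools, e.g. grep. For example, the inline (?i) flag is not part of POSIX regex syntax, so with grep you would drop it and pass the -i option instead.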

The regex makes use of grouping using parentheses. This is useful because it allows sections of the matched strings to be demarcated and used for other purposes.
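To illustrate what grouping gives you, here is a minimal sketch using a deliberately simplified pattern (not the full regex above). The class name GroupDemo and the method labelAndNumber are made up for this example; group 1 captures the label and group 2 the number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // Simplified pattern for illustration only, not the full regex from this post.
    // Group 1 is the label, group 2 is the number.
    static final Pattern SIMPLE = Pattern.compile("(?i)(CD|Disc|Disk)[_\\. ]*([0-9]+)");

    /** Returns {label, number} for the first artifact found, or null if there is none. */
    static String[] labelAndNumber(String title) {
        Matcher m = SIMPLE.matcher(title);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = labelAndNumber("The Wall (Disc 2)");
        System.out.println(parts[0] + " / " + parts[1]); // prints "Disc / 2"
    }
}
```

In the full regex the groups are numbered by the position of their opening parenthesis, so adding or removing a group shifts the numbers of everything after it.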

(?i) states that the regex is case insensitive. This is important because in the world of crowd sourced tagging and music databases, you never know where some weird capitalisation may be used.
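As a small sketch of what the flag does in Java, the inline (?i) form behaves the same as passing Pattern.CASE_INSENSITIVE when compiling. The pattern here is a cut-down stand-in, and the class and field names are hypothetical.

```java
import java.util.regex.Pattern;

final class CaseInsensitiveDemo {
    // No flag: only an exact-case "disc" matches.
    static final Pattern SENSITIVE = Pattern.compile("disc [0-9]+");
    // Inline flag, in the same style as the regex in this post.
    static final Pattern INLINE = Pattern.compile("(?i)disc [0-9]+");
    // Equivalent compile-time flag instead of the inline (?i).
    static final Pattern FLAGGED = Pattern.compile("disc [0-9]+", Pattern.CASE_INSENSITIVE);

    static boolean containsDisc(Pattern p, String title) {
        return p.matcher(title).find();
    }

    public static void main(String[] args) {
        for (String title : new String[] { "Album disc 2", "Album DISC 2", "Album DiSc 2" }) {
            System.out.println(title
                + " sensitive=" + containsDisc(SENSITIVE, title)
                + " inline=" + containsDisc(INLINE, title)
                + " flagged=" + containsDisc(FLAGGED, title));
        }
    }
}
```

By default Java's case-insensitive matching only covers ASCII letters, which is enough for the labels used here.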

(( )?(-|by)?(( )|_|\(|\[))+ allows for a preamble before the disc number artifact. If you are just matching the string, looking for any disc number artifacts to extract them and use them elsewhere, this is not necessary. However, if you intend to strip the disc number artifact out of the release title so the "cleaned" title can be used in a query, this is useful. For example, consider the release title "All Things Must Pass (Disc I)". Simply stripping the media type and number would give you "All Things Must Pass ()". The regex portion above removes the leading space and the opening parenthesis or bracket; the matching closing character is consumed by the \)?\]? at the very end of the regex.
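The stripping described above can be sketched in Java roughly as follows. This assumes diskLabelType has been replaced by the single label "Disc" to keep the demo short; the class name DiscArtifactStripper and the strip method are hypothetical, not OneMusicAPI code. Note the doubled backslashes needed inside a Java string literal.

```java
import java.util.regex.Pattern;

final class DiscArtifactStripper {
    // The base regex from this post, split over several lines for readability.
    // Assumption for this demo: diskLabelType is replaced by "Disc" only.
    static final Pattern DISC_ARTIFACT = Pattern.compile(
        "(?i)(( )?(-|by)?(( )|_|\\(|\\[))+"          // preamble: space, "-"/"by", opening bracket
        + "(Disc[_\\. ]*"                            // label and optional separators
        + "(([0-9]+|One|Two|Three|Four|Five|Six|Seven|(IX|IV|V|I{1,3}))"        // disc number
        + "(/([0-9]+|One|Two|Three|Four|Five|Six|Seven|(IX|IV|V|I{1,3})))?)"    // optional "/total"
        + "\\)?\\]?)");                              // optional closing bracket

    /** Removes every disc number artifact, including its preamble, from the title. */
    static String strip(String title) {
        return DISC_ARTIFACT.matcher(title).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("All Things Must Pass (Disc I)")); // prints "All Things Must Pass"
        System.out.println(strip("Abbey Road [disc 2/2]"));         // prints "Abbey Road"
    }
}
```

Titles without an artifact pass through unchanged, so it should be safe to run every title through strip before querying.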

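diskLabelType[_\. ]* is the meat of the regex. diskLabelType should be replaced with any of the media or disc labels that you may encounter; the character class that follows allows any run of underscores, full stops or spaces between the label and the number, so "CD1", "Disc_2" and "Vol. 3" are all covered. I use: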

  • CD
  • Disc
  • Disk
  • Vol
  • Volume

The last two are controversial: in some cases, a volume number may be a legitimate part of the release title.
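One way to plug the labels in, sketched here as an assumption rather than as how OneMusicAPI does it, is to join them into a non-capturing alternation and substitute that for diskLabelType. The class DiscLabelPattern and its methods are hypothetical, and the labels are assumed to be plain letters with no regex metacharacters. Building the pattern from a list makes it easy to leave out "Vol" and "Volume" where they cause trouble.

```java
import java.util.List;
import java.util.regex.Pattern;

final class DiscLabelPattern {
    // The number sub-pattern from the base regex.
    static final String NUMBER =
        "([0-9]+|One|Two|Three|Four|Five|Six|Seven|(IX|IV|V|I{1,3}))";

    /** Builds the base regex with diskLabelType replaced by (?:label1|label2|...). */
    static Pattern forLabels(List<String> labels) {
        // Assumes labels contain only plain letters, so no escaping is applied.
        String alternation = "(?:" + String.join("|", labels) + ")";
        return Pattern.compile(
            "(?i)(( )?(-|by)?(( )|_|\\(|\\[))+"
            + "(" + alternation + "[_\\. ]*"
            + "(" + NUMBER + "(/" + NUMBER + ")?)"
            + "\\)?\\]?)");
    }

    static String strip(List<String> labels, String title) {
        return forLabels(labels).matcher(title).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> all = List.of("CD", "Disc", "Disk", "Vol", "Volume");
        List<String> discsOnly = List.of("CD", "Disc", "Disk");
        System.out.println(strip(all, "Greatest Hits Volume 2"));       // prints "Greatest Hits"
        System.out.println(strip(discsOnly, "Greatest Hits Volume 2")); // prints "Greatest Hits Volume 2"
    }
}
```

In real use you would compile each pattern once and reuse it, rather than rebuilding it for every title as this demo does.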

([0-9]+|One|Two|Three|Four|Five|Six|Seven|(IX|IV|V|I{1,3})) matches the number itself, which is either one or more digits, the textual form of the number in English (One to Seven) or a Roman numeral (I to V, plus IX, as written). The same sub-pattern is then optionally repeated after a "/" delimiter, to catch cases where disc number artifacts include a total number, e.g. "Disc 1/3".
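If you want the disc number and total as integers rather than just removing them, something along these lines might work. It is a sketch only: it uses the number sub-pattern with the single label "Disc", and the named groups "number" and "total", the class DiscNumberParser and the lookup table are all additions for this example, not part of the base regex.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DiscNumberParser {
    // Same alternatives as the number sub-pattern in the base regex.
    static final String NUMBER =
        "[0-9]+|One|Two|Three|Four|Five|Six|Seven|(?:IX|IV|V|I{1,3})";

    // Named groups are an addition for this example.
    static final Pattern DISC_NUMBER = Pattern.compile(
        "(?i)Disc[_\\. ]*(?<number>" + NUMBER + ")(?:/(?<total>" + NUMBER + "))?");

    // Values for the English words and Roman numerals the pattern can match.
    static final Map<String, Integer> NAMED = Map.ofEntries(
        Map.entry("ONE", 1), Map.entry("TWO", 2), Map.entry("THREE", 3),
        Map.entry("FOUR", 4), Map.entry("FIVE", 5), Map.entry("SIX", 6),
        Map.entry("SEVEN", 7),
        Map.entry("I", 1), Map.entry("II", 2), Map.entry("III", 3),
        Map.entry("IV", 4), Map.entry("V", 5), Map.entry("IX", 9));

    static int toInt(String token) {
        Integer named = NAMED.get(token.toUpperCase(Locale.ROOT));
        return named != null ? named : Integer.parseInt(token);
    }

    /** Returns {number, total}, with total 0 when absent, or null when nothing matches. */
    static int[] parse(String title) {
        Matcher m = DISC_NUMBER.matcher(title);
        if (!m.find()) {
            return null;
        }
        String total = m.group("total");
        return new int[] { toInt(m.group("number")), total == null ? 0 : toInt(total) };
    }

    public static void main(String[] args) {
        int[] parsed = parse("Live (Disc 1/3)");
        System.out.println(parsed[0] + " of " + parsed[1]); // prints "1 of 3"
    }
}
```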

So that covers the regex. This regex is used within OneMusicAPI in case you send a query that includes a disc number artifact. That means OneMusicAPI users can send release titles with or without the disc number and results will be obtained. I hope that helps in your music metadata queries!

Thanks to Lasse Havelund who made the image above available for sharing.
© 2017 elsten software limited, Unit 4934, PO Box 6945, London, W1A 6US, UK