We've recently been doing some work on reducing the time it takes to perform an import from MusicBrainz, Discogs and our other sources into ADB.
Before, the import would take a long time. The Discogs import on its own, for example, would take over 24 hours! It was easy to see why: the import process was highly network-bound. Despite optimising the way we write to SimpleDB and CloudSearch, much of the time during the import was spent waiting for uploads to be made to these services.
A year ago, one of our developers, Vitaliy, began using the Apache Spark framework for new imports; this was used for the artist and track imports. The benefit of Spark for our case was that we were able to parallelise the uploads. This reduced the import time significantly.
We just finished porting the MusicBrainz and Discogs release imports to Spark and the difference is staggering. Our Discogs import now takes a couple of hours!
But parallelising the uploads wasn't the only aspect of the improvement. We profiled the import and discovered another significant performance hog.
The track numbering format used in Discogs data holds a lot of information; as well as the position of each track, it also holds the media or disc number, potentially the "side" in the case of two-sided media (vinyl etc.), sub-tracks, media formats (CD, DVD) and the total count of media in the release.
Some of this is explicit, and some of it is implied by the track numbering, which means it's important to analyse the track number in its entirety to extract all this data. The trouble was that we were using regular expressions to do this, and over the 7 million release records, the continual regular expression execution added up to a lot of time. It was specifically the backtracking of one of the expressions that caused the performance problems.
I realised that using a "proper" parser for the track position grammar should give faster results than trying to pattern match with a regular expression. Having used Scala Parser Combinators before, this was my first port of call, but I didn't want to use a regular-expression-based parser again, and I found the lower-level parsers a little confusing. At that point I discovered parboiled2, which appeared to be a more straightforward, and also performant, way of defining the grammar.
Cut to the chase, here's the code!
The code can be executed like so:

    val parser = new DiscogsTrackPositionParser(track.position)
    parser.mediumAndTrack.run() match {
      case Success(ParsedTrackPosition(Some(format), Some(mediumPos), Some(trackPos), Some(subTrack))) => {
        // Do stuff with the fully populated ParsedTrackPosition
      }
      // ... more match clauses
    }
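For anyone wondering what the other match clauses might look like, here is a rough sketch. It assumes `ParsedTrackPosition` is a case class with four optional fields in the order the match destructures them; the field names and the `String` types are my guesses for illustration, and `describe` is a hypothetical helper, not part of the parser.

```scala
import scala.util.{Failure, Success, Try}

object TrackPositionShape {
  // Assumed shape only: four optional fields, in the order used by the match.
  // Field names and types are illustrative guesses.
  final case class ParsedTrackPosition(
    format: Option[String],
    mediumPosition: Option[String],
    trackPosition: Option[String],
    subTrack: Option[String])

  // Hypothetical helper showing how partially populated results can be matched.
  def describe(result: Try[ParsedTrackPosition]): String = result match {
    case Success(ParsedTrackPosition(Some(f), Some(m), Some(t), Some(sub))) =>
      s"$f $m, track $t, sub-track $sub"
    case Success(ParsedTrackPosition(_, _, Some(t), _)) =>
      s"track $t"
    case Success(_) =>
      "no track position"
    case Failure(e) =>
      s"parse failed: ${e.getMessage}"
  }
}
```

Ordering the clauses from most to least populated means the most specific case wins first.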
The code does the basic parsing as defined in the Discogs guidelines... plus it also copes with data that has been entered in ways that contravene those guidelines.
For example, it's not uncommon for track positions to be separated from media positions using periods, '.'. According to the guidelines, periods should only be used for separating track positions from sub-tracks, but in reality that isn't always what happens. If you want as much data as possible from Discogs, you need to be tolerant of this. This is what the isMultiDisc parameter is for; you need to provide a hint to the parser so it can accept this. Otherwise, tracks separated by a period are assumed to be sub-tracks, as per the spec.
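To make the ambiguity concrete, here is a minimal sketch of how a hint like isMultiDisc changes the reading of a position such as "1.2". It is not the parboiled2 grammar; `PeriodHint`, `Position` and `interpret` are hypothetical names, and it only handles purely numeric positions.

```scala
object PeriodHint {
  // Hypothetical result type for this sketch only.
  final case class Position(medium: Option[Int], track: Int, subTrack: Option[Int])

  private def isNumber(s: String): Boolean = s.nonEmpty && s.forall(_.isDigit)

  // Interprets "x.y" as medium.track when isMultiDisc is set,
  // and as track.subTrack otherwise (the reading the guidelines intend).
  def interpret(position: String, isMultiDisc: Boolean): Option[Position] =
    position.split('.') match {
      case Array(a, b) if isNumber(a) && isNumber(b) =>
        if (isMultiDisc) Some(Position(Some(a.toInt), b.toInt, None))
        else Some(Position(None, a.toInt, Some(b.toInt)))
      case Array(a) if isNumber(a) =>
        Some(Position(None, a.toInt, None))
      case _ =>
        None // anything else is out of scope for this sketch
    }
}
```

The same input yields two different results depending on the hint, which is why the parser can't decide on its own.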
The parser handles auto-coupling, although you must tell the parser that this is an auto-coupled release. Auto-coupled releases are where the sides and media numbers are collated in a different order, specifically the A and D sides on medium (typically LP) number one, and the B and C sides on the second medium. Therein lies a limitation in this parser: only the first four sides are handled. I found some releases had auto-coupled sides, then a bonus LP where the sides are not auto-coupled! I didn't quite know how to handle this.
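As a rough illustration of the side-to-medium mapping described above (again, a sketch rather than the parser's code; `SideToMedium` and `mediumForSide` are hypothetical names):

```scala
object SideToMedium {
  // Maps a side letter to a 1-based medium number.
  // Without auto-coupling: A, B -> 1; C, D -> 2; E, F -> 3; and so on.
  // With auto-coupling: A, D -> 1; B, C -> 2; later sides are not handled,
  // mirroring the four-side limitation mentioned above.
  def mediumForSide(side: Char, autoCoupled: Boolean): Option[Int] = {
    val upper = side.toUpper
    val index = upper - 'A' // 0 for A, 1 for B, ...
    if (index < 0 || index > 25) None
    else if (!autoCoupled) Some(index / 2 + 1)
    else upper match {
      case 'A' | 'D' => Some(1)
      case 'B' | 'C' => Some(2)
      case _         => None
    }
  }
}
```

Returning None for sides beyond D keeps the unsupported case explicit instead of guessing at the bonus LP layout.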
I hope this is of use to someone!
Thanks to mikecogh who made the image above available for sharing.