It's been almost a month since I announced I had started work on OneMusicAPI's Aggregated Database (ADB is the rather unoriginal codename used internally... better suggestions welcome!). So, I thought: a good time to update you with work done so far, how the ADB is shaping up, and lessons learned!
The good news is that the first beta of OneMusicAPI using ADB is not far off. We're currently in the "tuning" stage, where we run large scale tests of music metadata searching and assess the coverage (whether any results are returned), the accuracy (whether the correct results are returned) and the quality (for cover art, the quality of the image). This means a first beta should hopefully be appearing in the next week or two.
So how is the ADB shaping up?
The choice to move to a two-tier API and database architecture opens up a number of options. The API layer will stay as before, but the way in which the database is incorporated, populated and searched dazzles with possibilities.
OneMusicAPI is currently deployed inside Amazon's Elastic Beanstalk PaaS. One option would be to spin up our own EC2 instance with a database, populated with the aggregated data and something like Lucene to provide powerful textual search possibilities. That's kinda how MusicBrainz works, for instance.
There are two obvious downsides to this, and they both impact on my most precious resource: time. The time to get set up is considerable: starting the new EC2 instance, installing a database, configuring Lucene and then inevitably, being a novice in these sysadmin-type tasks, realising I had made a mistake and re-doing the lot. And, of course, ongoing maintenance of the instance is costly: moving to a new instance when I need more hardware, improving horizontal scalability and more.
So I figured the cheapest solution to get me up and running straight away would be to combine two higher-level Amazon offerings: SimpleDB and CloudSearch. SimpleDB is a hosted NoSQL database. Given that the immediate requirement is only to search by tag names, I knew I wouldn't need the benefits of a relational database, so I went for a simpler store. CloudSearch performs the actual textual searches. I use CloudSearch to search for music metadata, then use a correlating datum in the CloudSearch result tuples (a 'foreign key', if you will) to look up the structured data in SimpleDB.
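To make the two-step lookup concrete, here's a minimal sketch. It doesn't talk to AWS at all: two plain Python dicts stand in for CloudSearch and SimpleDB, and the function names, keys and record fields are all made up for illustration. The only point is the shape of the flow: the text search returns keys, and the keys are then resolved against the structured store.

```python
from typing import Dict, List, Set

# Hypothetical stand-ins: plain dicts play the roles of CloudSearch and SimpleDB.
# "Search tier": maps a lower-cased word to the keys of the records containing it.
search_index: Dict[str, Set[str]] = {}
# "Store tier": maps a key to the structured record.
store: Dict[str, dict] = {}


def add_record(key: str, record: dict) -> None:
    """Put the structured record in the store and index its text fields."""
    store[key] = record
    for field in ("artist", "title"):
        for word in record.get(field, "").lower().split():
            search_index.setdefault(word, set()).add(key)


def search(text: str) -> List[dict]:
    """Two-step lookup: text search yields keys, keys yield structured records."""
    words = text.lower().split()
    if not words:
        return []
    # Step 1: the search tier returns only the correlating keys ("foreign keys").
    keys = set.intersection(*(search_index.get(w, set()) for w in words))
    # Step 2: each key is looked up in the structured store.
    return [store[k] for k in sorted(keys)]


add_record("dbpedia:Abbey_Road", {"artist": "The Beatles", "title": "Abbey Road"})
add_record("dbpedia:Let_It_Be", {"artist": "The Beatles", "title": "Let It Be"})

results = search("beatles abbey")
print([r["title"] for r in results])
```

Keeping the search tier down to keys plus searchable text means the structured record only has to live in one place.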
So that's the static, if you will, infrastructure. What about the moveable pieces? How is data moved from source database to ADB, and then served via OneMusicAPI?
Data is taken from each source database (so far, just DBPedia/Wikipedia) using an extractor and then transformed into a common JSON structure. Each source's extractor holds much of the smarts of OneMusicAPI: how to aggregate different data from different databases into one combined database. In the case of the DBPedia/Wikipedia data, the latest "live" DBPedia dump is downloaded, filtered to include only "MusicalWorks", and then queried using a SPARQL select statement to extract all the required data.
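As a rough illustration of the extract-and-transform step, here's a sketch. The SPARQL query is not the one I actually run; the class and property names in it (dbo:MusicalWork, dbo:artist, rdfs:label) are assumptions for the sake of the example, and the field names in the "common" record are hypothetical too. The sample input follows the standard SPARQL JSON results layout, where each binding maps a variable name to a type/value pair.

```python
import json
from typing import Optional

# Hypothetical query: the selected class and properties are assumptions.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?work ?title ?artist WHERE {
  ?work a dbo:MusicalWork ;
        rdfs:label ?title .
  OPTIONAL { ?work dbo:artist ?artist }
}
"""


def to_common(binding: dict, source: str = "dbpedia") -> dict:
    """Flatten one SPARQL JSON binding into a common record (field names are made up)."""
    def value(name: str) -> Optional[str]:
        cell = binding.get(name)
        return cell["value"] if cell else None

    return {
        "source": source,
        "id": value("work"),
        "title": value("title"),
        "artist": value("artist"),  # None when the OPTIONAL part did not match
    }


# Hand-written sample in the SPARQL JSON results format, standing in for real output.
sample = {"results": {"bindings": [
    {"work": {"type": "uri", "value": "http://dbpedia.org/resource/Abbey_Road"},
     "title": {"type": "literal", "xml:lang": "en", "value": "Abbey Road"},
     "artist": {"type": "uri", "value": "http://dbpedia.org/resource/The_Beatles"}},
    {"work": {"type": "uri", "value": "http://dbpedia.org/resource/Some_Work"},
     "title": {"type": "literal", "xml:lang": "en", "value": "Some Work"}},
]}}

records = [to_common(b) for b in sample["results"]["bindings"]]
print(json.dumps(records[0], indent=2))
```

An extractor for another source would produce records of the same shape, which is what lets the rest of the pipeline stay source-agnostic.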
Once an extract is complete, the result is a lot of JSON files. The importer then takes each JSON file and populates both SimpleDB and CloudSearch with the data. Once that's done, we have a queryable aggregated database!
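The import step can be sketched in the same spirit. Again, nothing here calls AWS: two dicts stand in for SimpleDB and CloudSearch, and the function name, record fields and file name are hypothetical. The sketch just shows each JSON file being read once and written to both places under the same key.

```python
import json
import tempfile
from pathlib import Path


def import_extract(extract_dir: Path, store: dict, index: dict) -> int:
    """Load every JSON file in extract_dir into both stand-in stores; return the count."""
    count = 0
    for path in sorted(extract_dir.glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        key = record["id"]
        # Structured copy (the role SimpleDB plays).
        store[key] = record
        # Searchable text under the same key (the role CloudSearch plays).
        index[key] = " ".join(
            str(record.get(field, "")) for field in ("artist", "title")
        ).lower()
        count += 1
    return count


# Demo: write one made-up extract file to a temporary directory and import it.
with tempfile.TemporaryDirectory() as tmp:
    extract_dir = Path(tmp)
    sample = {"id": "dbpedia:Abbey_Road", "artist": "The Beatles", "title": "Abbey Road"}
    (extract_dir / "abbey_road.json").write_text(json.dumps(sample), encoding="utf-8")

    store, index = {}, {}
    imported = import_extract(extract_dir, store, index)

print(imported, index)
```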
The final step, the one that makes it all worthwhile, is a new querier added to OneMusicAPI so that when a client performs a search, ADB is consulted as well as any still-active downstream APIs (remember I am implementing ADB one source at a time).
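Here's a sketch of what consulting ADB alongside the remaining downstream APIs could look like. The querier names and result fields are invented, and skipping a querier that throws is my assumption for the example rather than a description of how OneMusicAPI actually handles errors.

```python
from typing import Callable, Dict, List

# A querier takes a search term and returns a list of result dicts.
Querier = Callable[[str], List[dict]]


def search_all(term: str, queriers: Dict[str, Querier]) -> List[dict]:
    """Ask every registered querier and tag each hit with where it came from."""
    results: List[dict] = []
    for name, querier in queriers.items():
        try:
            hits = querier(term)
        except Exception:
            # Assumption: an unavailable downstream API should not sink the whole search.
            continue
        results.extend({**hit, "via": name} for hit in hits)
    return results


# Hypothetical queriers: one backed by ADB, two standing in for downstream APIs.
def adb(term: str) -> List[dict]:
    return [{"title": "Abbey Road", "artist": "The Beatles"}]


def flaky_api(term: str) -> List[dict]:
    raise ConnectionError("downstream API unavailable")


def other_api(term: str) -> List[dict]:
    return [{"title": "Abbey Road (Remastered)", "artist": "The Beatles"}]


queriers = {"adb": adb, "flaky_api": flaky_api, "other_api": other_api}
results = search_all("abbey road", queriers)
print([r["via"] for r in results])
```

Retiring a downstream API once its source has been moved into ADB then amounts to removing one entry from the dict.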
Most of the challenges have been around increasing my familiarity with various technologies. I've learnt about some Amazon Web Services products I hadn't used before, including some newer services whose developer tools aren't as mature as those for the likes of EC2. For example, SimpleDB doesn't have a Web UI, so I perform all tasks using either the command line or the REST API.
I've also further increased my understanding of SPARQL, a very interesting and powerful technology.
Another chief challenge is round-trip time. Testing tweaks and tuning for this project takes a while because the acceptability of many test results (e.g. cover art accuracy) is subjective and cannot be programmatically determined. Of course, I have unit tests where possible, but large-scale tests with representative data sets must still be run. These take time, even though the changes between tests can sometimes be minute.
I'm pleased with the progress made so far, if a little disappointed at the speed of implementation. I had hoped to have a beta running before now. Still, it appears we're on track for that first beta soon. Currently the long-running tests show very little difference between the results from OneMusicAPI running against ADB and those from running against DBPedia directly. And hopefully with far less downtime!
If you're interested in being involved in the new beta, get in touch! I offer free access to OneMusicAPI to active beta programme participants.
Once a given extract has been seen to work, I'll remove the downstream API queriers from the OneMusicAPI codebase. This will progressively make OneMusicAPI more reliable, more available and much, much faster.
Next step is a new data source; probably Discogs. A basic extractor has already been written, which needs to be tested and the resulting extract imported into ADB.
Thanks to 竜次 ryuuji who made the image above available for sharing.