Work in progress: an aggregated database

Larry Ellison and the Sun Oracle Database Machine

While OneMusicAPI is, and will continue to be, an aggregated music metadata API that draws on multiple databases, there are several ways of achieving this. The current way owes much to OneMusicAPI's ancestry as part of bliss, but I am now starting work on a more scalable approach which should improve availability and lower query latency.

The 'quo

OneMusicAPI is an aggregated music metadata API. What does aggregated mean in this case? It means that when you send a search query to OneMusicAPI, it in turn sends queries to the "downstream" APIs it aggregates, such as MusicBrainz and Discogs. By only querying OneMusicAPI you are isolated from changes in those downstream APIs.

There are problems with this approach, though. These problems are mainly "non-functional". That is, the problems affect the availability and performance of the API.

First, the quality and coverage of OneMusicAPI's results depend on its downstream APIs being available. If DBPedia becomes unavailable (not uncommon, unfortunately), then OneMusicAPI's results will not include results from DBPedia (which is populated with data from Wikipedia).
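To make the first problem concrete, here is a minimal sketch of query-time aggregation. It is not OneMusicAPI's actual code: the three `search_*` functions are hypothetical stand-ins for real downstream clients, and the DBPedia one simply simulates an outage.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for downstream API clients (assumptions for
# illustration only); real clients would make HTTP requests.
def search_musicbrainz(query: str) -> List[str]:
    return [f"musicbrainz:{query}"]

def search_discogs(query: str) -> List[str]:
    return [f"discogs:{query}"]

def search_dbpedia(query: str) -> List[str]:
    # Simulate the downstream API being unavailable.
    raise ConnectionError("DBPedia unavailable")

SOURCES: Dict[str, Callable[[str], List[str]]] = {
    "musicbrainz": search_musicbrainz,
    "discogs": search_discogs,
    "dbpedia": search_dbpedia,
}

def aggregate_at_query_time(
    query: str, sources: Dict[str, Callable[[str], List[str]]]
) -> List[str]:
    """Call every downstream source for this query and merge the results."""
    results: List[str] = []
    for name, search in sources.items():
        try:
            results.extend(search(query))
        except ConnectionError:
            # The source is down, so its results are simply missing.
            continue
    return results

print(aggregate_at_query_time("Abbey Road", SOURCES))
```

The query still succeeds, but coverage silently shrinks whenever a downstream source is unavailable.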

Second, the speed with which results can be delivered suffers under this approach. OneMusicAPI throttles the number of connections it makes to downstream APIs, both to obey those APIs' rate limit policies and to avoid exhausting the network bandwidth of the OneMusicAPI servers. This means that, as more and more users make more queries, requests queue up behind the throttle and OneMusicAPI can slow down to an unacceptable degree.
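The effect of throttling can be sketched with a semaphore that caps concurrent downstream calls. This is an illustration only: `run_throttled`, its parameters and the simulated latency are assumptions of mine, not the real OneMusicAPI implementation.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

def run_throttled(
    queries: List[str], limit: int, workers: int = 8, latency: float = 0.01
) -> Tuple[List[str], int]:
    """Run simulated downstream calls with at most `limit` in flight.

    Returns the results plus the peak number of concurrent calls observed.
    """
    slots = threading.BoundedSemaphore(limit)  # the throttle
    lock = threading.Lock()                    # protects the counters
    active = 0
    peak = 0

    def call_downstream(query: str) -> str:
        nonlocal active, peak
        with slots:  # blocks while `limit` calls are already running
            with lock:
                active += 1
                peak = max(peak, active)
            time.sleep(latency)  # simulated network round trip
            with lock:
                active -= 1
        return f"result:{query}"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(call_downstream, queries))
    return results, peak

results, peak = run_throttled([f"q{i}" for i in range(6)], limit=2)
print(len(results), peak)
```

With a fixed limit, each extra concurrent query waits for a free slot, so total latency grows with load even though the downstream APIs are never overloaded.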

An aggregated database

I want to address these problems, and the solution will be the gradual phasing in of a new "aggregated database". This will be an aggregation of the data found in the downstream APIs, which means we no longer need to perform the actual querying of the downstream APIs on demand. It will be gradual because I will slowly move data from each source in turn, testing and tweaking as I go to ensure quality and coverage. Also, once each data source has been moved into the aggregated database I will be able to turn off the downstream API calls which were causing problems.

What's the architecture for this new approach? OneMusicAPI is currently deployed as an Elastic Beanstalk application on Amazon Web Services (AWS). The early beta versions of the aggregated database will use AWS's SimpleDB to store structured data and AWS's CloudSearch to perform text searches on things like album and artist names. When a query is sent to OneMusicAPI it will, depending on the query, search a combination of those data stores and return the data found.

This moves OneMusicAPI's downstream integration points from query time (the downstream APIs are called when a OneMusicAPI client makes a query) to data import time. Data import will be run periodically (say, once a week) to update OneMusicAPI's database.
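The shift from query time to import time can be sketched as follows. This is a toy model under stated assumptions: the `AggregatedDatabase` class and its methods are hypothetical, and an in-memory dictionary stands in for SimpleDB and CloudSearch; no AWS calls are shown.

```python
from typing import Callable, Dict, Iterable, List

class AggregatedDatabase:
    """In-memory stand-in for the aggregated store (an assumption)."""

    def __init__(self) -> None:
        self._albums_by_source: Dict[str, List[str]] = {}

    def import_source(
        self, name: str, fetch_all: Callable[[], Iterable[str]]
    ) -> bool:
        """Periodic import: the only place a downstream source is contacted."""
        try:
            albums = list(fetch_all())
        except ConnectionError:
            # Import failed: keep serving the last good copy.
            return False
        self._albums_by_source[name] = albums
        return True

    def search(self, text: str) -> List[str]:
        """Query time: local lookup only, no downstream calls."""
        needle = text.lower()
        return [
            album
            for albums in self._albums_by_source.values()
            for album in albums
            if needle in album.lower()
        ]

def failing_fetch() -> Iterable[str]:
    raise ConnectionError("downstream unavailable")

db = AggregatedDatabase()
db.import_source("dbpedia", lambda: ["Abbey Road", "Revolver"])
db.import_source("dbpedia", failing_fetch)  # outage during a later import
print(db.search("abbey"))
```

In this model a downstream outage only delays the next refresh; queries keep being answered from the data already imported.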

The benefits of this approach should be better isolation from downstream API failures, which improves OneMusicAPI's availability, and lower latency in delivering query results.

The work on this has already begun, with DBPedia/Wikipedia being the first data source that will be transferred to the "aggregated database". Probably Discogs, then MusicBrainz, will follow.

Thanks to Oracle Corporate Communications, who ~~surprisingly~~ made the image above available for sharing.