Importing data into CouchDB – Java, Ruby and Erlang way
Working recently with CouchDB I become interested on how to import existing data into the database quickly. The CouchDB is written in Erlang so the first thing which came to my mind was to use that language for the job, but to find out if it runs ‘quickly’ or not, I had to compare it with some other languages/libraries. The natural choice was Java of course ( I consider it my ‘native’ language 🙂 ), on top of that I added another test with Ruby as I had my first steps with CouchDB using that language (that’s because of the ‘Seven Databases in Seven Weeks‘ book).
So, what I’m trying to do: Import the large xml file into the CouchDB as quickly as possible. I have used Jamendo data dump as in my previous tutorial about parsing the XML with Woodstox library.
As the Jamendo data dump XML file is quite large I’m using stream parsing in all three approaches (Java, Ruby and Erlang)
Java
This part was the easiest, I just had to combine my previous tutorial ‘Parsing large XML files with Woodstox library‘ with the Ektorp library I used in another tutorial ‘Quering CouchDB for scalar values with Map/Reduce View and Ektorp Library‘
The source for Java importer is available at github on ‘couchdb’ branch. This is the same code for the XML parsing tutorial, I have added dependency to Ektorp library in the pom.xml and changed the lines for handling prepared Artist record:
try {
HttpClient httpClient = new StdHttpClient.Builder()
.url("http://localhost:5984")
.build();
dbInstance = new StdCouchDbInstance(httpClient);
db = new StdCouchDbConnector("java_music", dbInstance);
db.createDatabaseIfNotExists();
} catch (MalformedURLException e) {
e.printStackTrace();
}
//For each Artist parsed execute 'processArtist' method
private void processArtist(Artist artist) {
db.create(artist);
count++;
}
Ruby
This script is a modified version of the one available within the ‘Seven Databases..’ book. I have added some code to measure the execution time and number of records inserted, nothing fancy.
require 'libxml'
require 'couchrest'
include LibXML
class JamendoCallbacks
include XML::SaxParser::Callbacks
def initialize()
@db = CouchRest.database("http://localhost:5984/ruby_music")
@startTime = Time.now
@count = 0
@max = nil
@stack = []
@artist = nil
@album = nil
@track = nil
@tag = nil
@buffer = nil
end
def on_start_element(element, attributes)
case element
when 'artist'
@artist = { :albums => []}
@stack.push @artist
when 'album'
@album = { :tracks => []}
@artist[:albums].push @album
@stack.push @album
when 'track'
@track = { :tags => []}
@album[:tracks].push @track
@stack.push @track
when 'tag'
@tag = {}
@track[:tags].push @tag
@stack.push @tag
when 'Artists', 'Albums', 'Tracks', 'Tags'
#ignore
else
@buffer = []
end
end
def on_characters(chars)
@buffer << chars unless @buffer.nil?
end
def on_end_element(element)
case element
when 'artist'
@stack.pop
@artist['_id'] = @artist['id']
@artist[:random] = rand
@db.save_doc(@artist, false, true)
@count += 1
if !@max.nil? && @count >= @max
on_end_document
end
when 'album', 'track', 'tag'
top = @stack.pop
top[:random] = rand
when 'Artists', 'Albums', 'Tracks', 'Tags'
#ignore
else
if @stack[-1] && @buffer
@stack[-1][element] = @buffer.join.force_encoding('utf-8')
@buffer = nil
end
end
end
def on_end_document()
puts "Time elapsed #{Time.now - @startTime} seconds"
puts "TOTAL #{@count} records inserted!"
exit(1)
end
end
class String
def force_encoding(enc)
self
end
end
parser = XML::SaxParser.io(ARGF)
parser.callbacks = JamendoCallbacks.new
parser.parse
The script is self explanatory (the most readable script from the three IMHO)
Erlang
The hardest part was to import the data using the native CouchDB language – Erlang, as I’m not an erlang programmer I had to learn a lot to create such a script. I’m kind of proud of it but its probably the most terrible piece of code I have ever written :). Inserting the records itself is dead easy, especially with the Hovercraft erlang library for interacting with the CouchDB, but parsing the XML with streaming was a nightmare, I don’t blame erlang here, just my erlang programming skills, anyway to create this script I had to learn a lot and it was fun, so most probably I won’t finish my erlang adventure here.
The Erlang script uses Erlsom library to parse the XML content using its sax method:
start() -> StartTime = now(),
run("xmlimporter/dbdump_artistalbumtrack.0.290905586176.xml",0),
Duration = timer:now_diff(now(), StartTime) / 1000000,
io:format("Finished in: ~p~n",[Duration]),
ok.
%% Read the file and execute the callback fun for each tag (start/end and character content)
run(File, Result) ->
delete_db(<<"erlang_music">>),
{ok, created} = create_db(<<"erlang_music">>),
case file:read_file(xml(File)) of
{ok, Bin} ->
{ok,_,_} = erlsom:parse_sax(Bin, [], fun callback/2);
Error ->
Error
end,
Result.
Example XML file looks like this (showing just a version with single artist record for used for development purposes and debugging:
<JamendoData epoch="1345766577" documentation="http://developer.jamendo.com/en/wiki/DatabaseDumps"
type="artistalbumtrack">
<Artists>
<artist>
<id>370255</id>
<name>""ATTIC""</name>
<url>http://www.jamendo.com/artist/ATTIC_(3)</url>
<mbgid></mbgid>
<Albums>
<album>
<id>94168</id>
<name>UP IN THE ATTIC !</name>
<url>http://www.jamendo.com/album/94168</url>
<releasedate>2011-07-05T01:05:52+01:00</releasedate>
<filename>ATTIC - UP IN THE ATTIC !</filename>
<mbgid></mbgid>
<license_artwork>http://creativecommons.org/licenses/by-nc-sa/3.0/</license_artwork>
<Tracks>
<track>
<id>804908</id>
<name>No No No</name>
<duration>213.0</duration>
<numalbum>1</numalbum>
<filename>01 - ATTIC - No No No</filename>
<Tags>
<tag>
<idstr>rock</idstr>
<weight>0.125</weight>
</tag>
</Tags>
<mbgid></mbgid>
<license>http://creativecommons.org/licenses/by-nc-sa/3.0/</license>
</track>
<track>
<id>804909</id>
<name>i'm a dog!!!</name>
<duration>229.0</duration>
<numalbum>2</numalbum>
<filename>02 - ATTIC - i'm a dog!!!</filename>
<Tags>
<tag>
<idstr>rock</idstr>
<weight>0.125</weight>
</tag>
</Tags>
<mbgid></mbgid>
<license>http://creativecommons.org/licenses/by-nc-sa/3.0/</license>
</track>
</Tracks>
</album>
</Albums>
</artist>
</Artists>
</JamendoData>
On each element (start tag, characters data, end tag) the callback method is called, my callback method takes two arguments, one with the element encountered and the other one playing a role of accumulator object which will hold our Artist structure as we go through the xml. Each parsed element is then converted to appropriate erlang type (list, tuple or binary string) and inserted into ‘Acc’ accumulator.
The accumulator structure looks like the following:
where:
- Path – location within the Artist element of the element we currently parse. The ‘Path’ property is changed dynamically depending on which tag we are currently parsing, e.g.: lets say we are at first album of the artist, parsing the property ‘filename’, our ‘Path’ would look like the following: [“filename”,”album”,”artist”]
- Album Number – As a single Artist can have many albums we need to track the currently parsed album number to modify its data while parsing this album properties
- Track Number – As a single Album can have many Tracks we track the currently parsed track number
- Tag Number – Single track can have zero or more tags, same as above
- Artist – the structure holding data for a single artist
For example when we find an ‘album’ start element we prepare and insert the structure for holding other album attributes in the form of a tuple: ‘{[]}’
{Key,Albums} = getByKeyFromArtist(Artist,<<"Albums">>),
{Art} = Artist,
processStartTag(<<"album">>,[Path,AlbumNumber+1,TrackNumber,TagNumber,{lists:keyreplace(Key,1,Art,{Key,Albums++[{[{<<"album">>,{[]}}]}]})}]);
The whole parser can be described with the following steps:
- (Start Tag) On start element prepare the appropriate structure inside ‘Artist’ within our accumulator ‘Acc’
- If the start element is ‘album’, ‘track’ or ‘tag’ increment appropriate counter
- Update the ‘Path’ structure to point to the currently parsed element
- (Character Data) When character data is encountered while parsing the doc, find the currently parsed element using the ‘Path’ attribute and AlbumNumber/TrackNumber/TagNumber attribute and set the character data value to this element
- (End Tag) When closing tag is encountered, for ‘Albums’,’Tracks’ and ‘Tags’ reset the counter, for closing ‘Artist’ tag – create the CouchDB record
After the Artist record from the XML is parsed as a whole we end up with the following erlang structure
{<<"name">>,<<"Münich Payloader">>},
{<<"url">>,<<"http://www.jamendo.com/artist/Munich_Payloader">>},
{<<"location">>,
{[{<<"country">>,<<"USA">>},
{<<"state">>,<<"FL">>},
{<<"city">>,<<"Lake Helen">>},
{<<"latitude">>,<<"28.9846">>},
{<<"longitude">>,<<"-81.2401">>}]}},
{<<"Albums">>,
[{[{<<"album">>,
{[{<<"id">>,<<"46282">>},
{<<"name">>,<<"Super Sampler">>},
{<<"url">>,<<"http://www.jamendo.com/album/46282">>},
{<<"releasedate">>,<<"2009-05-28T12:18:05+01:00">>},
{<<"filename">>,<<"Münich Payloader - Super Sampler">>},
{<<"license_artwork">>,
<<"http://creativecommons.org/licenses/by-nc-sa/3.0/">>},
{<<"Tracks">>,
[{[{<<"track">>,
{[{<<"id">>,<<"354333">>},
{<<"name">>,<<"music box">>},
{<<"duration">>,<<"102.0">>},
{<<"numalbum">>,<<"1">>},
{<<"filename">>,
<<"01 - Munich Payloader - music box">>},
{<<"Tags">>,
[{[{<<"tag">>,
{[{<<"idstr">>,<<"electronic">>},
{<<"weight">>,<<"0.1375">>}]}}]}]},
{<<"license">>,
<<"http://creativecommons.org/licenses/by-nc-sa/3.0/">>}]}}]},
{[{<<"track">>,
{[{<<"id">>,<<"354335">>},
{<<"name">>,<<"mechanical_world">>},
{<<"duration">>,<<"156.0">>},
{<<"numalbum">>,<<"2">>},
{<<"filename">>,
<<"02 - Munich Payloader - mechanical_world">>},
{<<"Tags">>,
[{[{<<"tag">>,
{[{<<"idstr">>,<<"electronic">>},
{<<"weight">>,<<"0.1375">>}]}}]}]},
{<<"license">>,
<<"http://creativecommons.org/licenses/by-nc-sa/3.0/">>}]}}]},
{[{<<"track">>,
{[{<<"id">>,<<"354334">>},
{<<"name">>,<<"Iced Tea and Family">>},
{<<"duration">>,<<"104.0">>},
{<<"numalbum">>,<<"3">>},
{<<"filename">>,
<<"03 - Munich Payloader - Iced Tea and Family">>},
{<<"Tags">>,
[{[{<<"tag">>,
{[{<<"idstr">>,<<"electronic">>},
{<<"weight">>,<<"0.1375">>}]}}]}]},
{<<"license">>,
<<"http://creativecommons.org/licenses/by-nc-sa/3.0/">>}]}}]}]}]}}]}]}]}
Saving the records
Once we have the erlang structure ready to be saved into our CouchDB instance things get much easier. For this purpose I have used some code snippets from the ‘Hovercraft‘ library which looks like its not developed since 2010 but the code is easy to follow and I was able to reuse some functionality I needed.
An easy direct Erlang CouchDB library.
Use this to abstract CouchDB behind a simple Erlang function call. Currently supports the database and document APIs, with views on the way.
The ‘lightning’ test works with the latest CouchDB version and thats more than enough for my needs, if it wouldn’t work, it was a good starting point to play with the CouchDB from Erlang anyway :).
Saving our erlang structure is as easy as a single method call:
%% Closing tags and resetting counters..
%
processEndTag(<<"artist">>,Acc) ->
[_,_,_,_,ArtistData] = Acc,
save_doc(<<"erlang_music">>,ArtistData),
[];
Whole part for interacting with the CouchDB from erlang looks like the following: (taken mostly from Hovercraft sources):
couch_db:open(DbName, [?ADMIN_USER_CTX]).
create_db(DbName) ->
create_db(DbName, []).
create_db(DbName, Options) ->
case couch_server:create(DbName, Options) of
{ok, Db} ->
couch_db:close(Db),
{ok, created};
Error ->
{error, Error}
end.
delete_db(DbName) ->
delete_db(DbName, [?ADMIN_USER_CTX]).
delete_db(DbName, Options) ->
case couch_server:delete(DbName, Options) of
ok ->
{ok, deleted};
Error ->
{error, Error}
end.
save_doc(#db{}=Db, Doc) ->
CouchDoc = ejson_to_couch_doc(Doc),
{ok, Rev} = couch_db:update_doc(Db, CouchDoc, []),
{ok, {[{id, CouchDoc#doc.id}, {rev, couch_doc:rev_to_str(Rev)}]}};
save_doc(DbName, Docs) ->
{ok, Db} = open_db(DbName),
save_doc(Db, Docs).
ejson_to_couch_doc({DocProps}) ->
Doc = case proplists:get_value(<<"_id">>, DocProps) of
undefined ->
DocId = couch_uuids:new(),
{[{<<"_id">>, DocId}|DocProps]};
_DocId ->
{DocProps}
end,
couch_doc:from_json_obj(Doc).
Full source code for the erlang script as well as ruby is available from GitHub, for Java version you can get it from ‘woodstoxex‘ example under the ‘couchdb’ branch.
The Test
Assuming you have downloaded the test data from Jamendo website, and modified the sources appropriately you do the following:
Java – this is really straightforward, just execute the main method from the ‘Main’ class to get the job done.
Ruby – run the script feeding it with the xml stream using cat command:
Erlang – for this to work you would have to download and install the CouchDB from source. CouchDB is available from GitHub repository. Follow the instructions there on how to install it, don’t forget to run ‘./bootstrap’ beforehand (not mentioned within the CouchDB installation instruction).
Our erlang script will be installed and used the same way a Hovercraft library was used, inside the CouchDB trunk, create a folder called ‘xmlimporter’ and place our erlang script in there (modifying the xml file location earlier).
Once you have a CouchDB build you can start it up in interactive mode with the following command:
Hit Enter when the CouchDB finishes launching and you should be inside the erlang console.
To start our script execute the following command:
where (1>) is the erlang shell prompt.
The Results
Putting the database inserts on the side, it has to be highlighted that every script described here takes a different amount of time to parase xml (without doing anything usefull with the records themselves). To compare the results between those 3 scripts I have to take that into account and first find out the time of parsing and then the time of parsing and inserting the records.
The results of parsing and inserting for each language/library of choice used is as follows:
Final Notes
Process of importing data could be done quicker if we would insert that data in batches instead of each record one by one. This would eat a bit more memory but will increase the speed significantly I believe. Another thing to note is that the scripts described here are by no means perfect (especially the erlang one) and could be improved for speed and efficiency. It was obvious from the beginning that the Java and Ruby versions would be most probably slower than the erlang version (no matter how bad the code is) as they are communicating with the database through the HTTP protocol where the Erlang script is using the CouchDB code directly.
Surprisingly the results differs just a bit. Where parsing in Java using Woodstox was the fastest one it had the slowest insertion times, probably because of the Ektorp performance. Ruby and Erlang parsing times look similar but Erlang was quicker while inserting the data as it didn’t had to go through the loopback interface.