Geospatial (location based) Searches in MongoDB – Part 1 – data acqusition and loading

March 29, 2012

This is the first post of a multi-part series on performing geospatial (location based) searches on large data sets in MongoDB.

In this part, we will focus on getting large sets of geo-data into MongoDB using the Mongo data drivers. This will be code focused approach; the programming language is Java (Mongo supports a wide variety of programming languages). I will be using the Mongo Java driver parse the geo data, format the data, and insert it into Mongo. You will need to download the Java driver (jar file) from mongodb.org.

First, we will need the some geo-data. One of the best sources for ‘reasonably’ well formatted and consistent data is at GeoNames.org  Here you can download, files from individual countries or the allCountries.zip file containing over 7 Million (~200 MB). To start working I recommend that you download a small, country-specific zip file as it will be much easier to work with (I used um.zip, it contains 230 records). Once you have things working, you can down larger countries or the allCountries files.

The follow code segments perform for basic steps: getting a connection to the Mongo db, getting the collection object (used to store the data in the db), reading the data from the country specific file, and writing the data to the database.

(1)Get a connection to Mongo database

System.out.println(“mongo”);
Mongo mongo = new Mongo(“localhost”, 27017);
System.out.println(“getting db geo1”);
DB db = mongo.getDB(“geo1”);

(2) Create your collection and data store object

// get a single collection
System.out.println(“collection UM”);
DBCollection collectionUM = db.getCollection(“um”);
DBObject dbObject = null;
String jsonObj = “”;

(3) Read the data

There is nothing really fancy here. The data is in a text file, one record per line. Just read the line, tokenize the string, and extract the data. The only tricky part is that the data is not consistently delimited so that you have to look for and find the lat/lng data fields. I use a reg expression to find the floating point data fields (token.matches(“-?\\d+(.\\d+)?”), they are the only floating point fields in the record. To keep things simple I only retained four pieces of data: the geonameID, the location information (the text info between the geonameID and the latitude data field), the latitude data, and the longitude data.
Note: As per good programming practices you do need to check the lat/lng data to insure that is is a floating point number between +/- 180. Also, make sure that you do not lose precision of the data. This should not be a problem in Java, but this sort of thing can be be a bit of a headache in PHP.

(4)_Writing the data to Mongo

We will write the data to the database using the DBCollection.insert() method. In set (2) you created collections object that uses the collection that you will write your documents to. We will us that method to write a JSON object to the collection.
Writing the data is fairly straight forward, the only tricky part is properly formatting the JSON document to include a location array that can be indexed and used in a geospatial search.  The ‘loc’ field is an array.  It stores the lat and long data that you will index and use to perform the location based searches (will be described in part 2 of this series)
The format for the ‘json’ string is:
-> jsonObj = “{geonameID:” + geonameID + “,geoInfo:” + geoInfo + “,loc: [ ” + lat + “, ” + lng + “] }” ;
Remember, the lat and lng fields must be placed inside an array element (the name loc is arbitrary).
The ‘json’ string is loaded into a DBObject:
-> dbObject = (DBObject)JSON.parse(jsonObj);
And the dbObject is written to the database:
-> collectionUM.insert(dbObject);

Using this approach, you can write 100s or millions of records into the data store.

In the next part of this series, I will cover how to perform the geospatial (location based) radial and polygon searches of the geo-coded documents.

Advertisements