Geospatial (location based) Searches in MongoDB – Part 2 – simple searching

April 2, 2012

This is the 2nd post of a multi-part series on performing geospatial (location based) searches on large data sets in MongoDB.

In this part we will focus on using simple queries to perform geo searches on the location-tagged data that we loaded into MongoDB (see part 1 for details).

MongoDB supports two-dimensional geospatial indexes. They are designed with location-based queries in mind, such as “find me the closest N items to my location.” They can also efficiently handle bounded searches, such as “find the items that are within some distance of the centroid of my search”.

Assuming that the data has been properly loaded, the first step is to index the data so that Mongo can run geo searches against it. This can be done from Java, but for this exercise we will set it up from the command line:
db.name_of_collection.ensureIndex( { loc : "2d" } )

Once you have created the index, it is best to check and make sure that it was created properly:
db.name_of_collection.getIndexes()
You should see something like:
{ "v" : 1,
  "key" : { "loc" : "2d" },
  "ns" : "geo1.um",
  "name" : "loc_"
}
Note: The key part is that “loc” should be “2d”

Now for the ‘fun part’: building and running the geospatial queries in Java. We will start simple and build more complex queries, using more advanced Java concepts (e.g. classes) as we go along. If you are not completely comfortable with building and running queries with the BasicDBObject, then please review that before continuing.

A simple geo search from the command line would look like: db.um.find( { loc : { $near : [15,150] } } )
In Java it would be:

Double locationLongitude = Double.valueOf(15);
Double locationLatitude = Double.valueOf(150);
// build { loc : { $near : [ lng, lat ] } }
BasicDBObject locQuery = new BasicDBObject();
locQuery.put("loc", new BasicDBObject("$near", new Double[]{locationLongitude, locationLatitude}));
// run the query
DBCursor locCursor = collectionUM.find( locQuery );
// use cursor to view results

Note: Use a Double array in this approach; otherwise it can lead to precision issues.
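The note does not say where the precision loss comes from. One likely cause (an assumption, not something stated above) is a coordinate passing through a float on its way to the query. The sketch below only illustrates that effect; the class and method names are made up for the example.

```java
class PrecisionDemo {
    // Returns how far a value drifts when it is squeezed through a float
    // and widened back to a double. Zero means the float held it exactly.
    static double driftFromFloat(double value) {
        float narrowed = (float) value;   // keeps only about 7 significant digits
        double widened = narrowed;        // widening cannot restore the lost digits
        return widened - value;
    }

    public static void main(String[] args) {
        System.out.println("drift for 15.0:   " + driftFromFloat(15.0));
        System.out.println("drift for 150.11: " + driftFromFloat(150.11));
    }
}
```

A whole number such as 15.0 survives the round trip, while a typical coordinate such as 150.11 does not, which is why keeping the values as Double from start to finish is the safer choice.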

A more complex (useful) search from the command line could be: db.um.find( { loc : { $near : [15.,-150.11] , $maxDistance : 40 } } ).limit(10)
In Java, using a JSON document:

String sLng = "15.5"; Double dLng = Double.valueOf(sLng);
String sLat = "150.11"; Double dLat = Double.valueOf(sLat);
String sDistance = "40";
// build the $near / $maxDistance clause as JSON text and let the driver parse it
DBCursor cur = collectionUM.find(new BasicDBObject("loc", JSON.parse("{$near : [ " + dLng + "," + dLat + " ] , $maxDistance : " + sDistance + "}"))).limit(10);

Note: Even with this relatively simple query, it does not appear that you can easily wrap the geo query parameters in a Java BasicDBObject. There have been a number of postings on Stack Overflow and the Mongo Google Groups on this issue. However, I have not yet seen an example of a BasicDBObject implementation that does not have to resort to using (parsing) a JSON document. Also, there is some mention of implementing this query with the BasicDBObjectBuilder. I am currently looking into that.
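One detail of the $maxDistance query above is easy to miss: with a flat 2d index, distances are measured in the same units as the stored coordinates, so with lat/lng data a $maxDistance of 40 means 40 degrees, not 40 km or 40 miles. The sketch below is only an illustration: it converts a radius in kilometres to degrees using the rough rule of thumb of about 111.12 km per degree, and builds the same text that is handed to JSON.parse. The names NearClause, kmToDegrees and build are hypothetical, and the conversion ignores the fact that a degree of longitude gets shorter away from the equator.

```java
class NearClause {
    // Assumption: a rule-of-thumb length for one degree of arc, good enough
    // for rough radius searches but not for precise distance work.
    static final double KM_PER_DEGREE = 111.12;

    // Converts a search radius in kilometres to coordinate units (degrees).
    static double kmToDegrees(double km) {
        if (km < 0) {
            throw new IllegalArgumentException("distance must be non-negative: " + km);
        }
        return km / KM_PER_DEGREE;
    }

    // Builds the value for the "loc" field in the same shape as the query above.
    static String build(double lng, double lat, double maxKm) {
        return "{$near : [ " + lng + "," + lat + " ] , $maxDistance : "
                + kmToDegrees(maxKm) + "}";
    }

    public static void main(String[] args) {
        // A 40 km radius around the point used in the post.
        System.out.println(build(15.5, 150.11, 40.0));
    }
}
```

The returned string can be passed to JSON.parse in place of the hand-concatenated one.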

In the next installment I plan to use more advanced Java concepts (e.g. classes) to implement the searches.


Geospatial (location based) Searches in MongoDB – Part 1 – data acquisition and loading

March 29, 2012

This is the first post of a multi-part series on performing geospatial (location based) searches on large data sets in MongoDB.

In this part, we will focus on getting large sets of geo-data into MongoDB using the Mongo data drivers. This will be a code-focused approach; the programming language is Java (Mongo supports a wide variety of programming languages). I will be using the Mongo Java driver to parse the geo data, format the data, and insert it into Mongo. You will need to download the Java driver (jar file) from mongodb.org.

First, we will need some geo-data. One of the best sources for ‘reasonably’ well formatted and consistent data is GeoNames.org. Here you can download files for individual countries or the allCountries.zip file containing over 7 million records (~200 MB). To start working, I recommend that you download a small, country-specific zip file as it will be much easier to work with (I used um.zip; it contains 230 records). Once you have things working, you can download larger countries or the allCountries file.

The following code segments perform four basic steps: getting a connection to the Mongo db, getting the collection object (used to store the data in the db), reading the data from the country-specific file, and writing the data to the database.

(1) Get a connection to the Mongo database

System.out.println("mongo");
Mongo mongo = new Mongo("localhost", 27017);
System.out.println("getting db geo1");
DB db = mongo.getDB("geo1");

(2) Create your collection and data store object

// get a single collection
System.out.println("collection UM");
DBCollection collectionUM = db.getCollection("um");
DBObject dbObject = null;
String jsonObj = "";

(3) Read the data

There is nothing really fancy here. The data is in a text file, one record per line. Just read the line, tokenize the string, and extract the data. The only tricky part is that the data is not consistently delimited, so you have to look for the lat/lng data fields. I use a regular expression to find the floating point data fields (token.matches("-?\\d+(\\.\\d+)?")); they are the only floating point fields in the record. To keep things simple I only retained four pieces of data: the geonameID, the location information (the text info between the geonameID and the latitude data field), the latitude data, and the longitude data.
Note: As per good programming practice, you do need to check the lat/lng data to ensure that each value is a floating point number in the valid range (latitude within +/- 90, longitude within +/- 180). Also, make sure that you do not lose precision in the data. This should not be a problem in Java, but this sort of thing can be a bit of a headache in PHP.
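The read step can be sketched as follows. This is a minimal illustration, not the code used for the load: the class GeoLineParser, the GeoRecord holder, and the sample line in main are all made up for the example. It also uses a slightly stricter pattern than the one quoted above (the decimal point is required), so that an integer field such as the geonameID is never mistaken for a coordinate; the trade-off is that a coordinate written without a decimal part would be skipped.

```java
import java.util.Arrays;

class GeoLineParser {
    // Assumption: coordinates always carry a decimal part, e.g. 16.5 or -169.25.
    static final String DECIMAL = "-?\\d+\\.\\d+";

    /** The four fields retained from each record. */
    static final class GeoRecord {
        final String geonameId;
        final String geoInfo;
        final double lat;
        final double lng;

        GeoRecord(String geonameId, String geoInfo, double lat, double lng) {
            this.geonameId = geonameId;
            this.geoInfo = geoInfo;
            this.lat = lat;
            this.lng = lng;
        }
    }

    /** Returns the parsed record, or null if the line has no valid lat/lng pair. */
    static GeoRecord parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        // Token 0 is the geonameID; look for the first two adjacent decimal tokens.
        for (int i = 1; i + 1 < tokens.length; i++) {
            if (tokens[i].matches(DECIMAL) && tokens[i + 1].matches(DECIMAL)) {
                double lat = Double.parseDouble(tokens[i]);
                double lng = Double.parseDouble(tokens[i + 1]);
                if (Math.abs(lat) > 90.0 || Math.abs(lng) > 180.0) {
                    return null; // out-of-range coordinates are rejected
                }
                // Everything between the ID and the latitude is the location info.
                String geoInfo = String.join(" ", Arrays.copyOfRange(tokens, 1, i));
                return new GeoRecord(tokens[0], geoInfo, lat, lng);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample line, not a real GeoNames record.
        GeoRecord r = parse("1234567\tExample Atoll\tExample Atoll\t16.5\t-169.25\tT\tATOL");
        System.out.println(r.geonameId + " | " + r.geoInfo + " | " + r.lat + ", " + r.lng);
    }
}
```

Returning null for a bad line keeps the load loop simple: skip the record, log it, and move on.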

(4) Write the data to Mongo

We will write the data to the database using the DBCollection.insert() method. In step (2) you created a collection object for the collection that you will write your documents to. We will use its insert() method to write a JSON object to the collection.
Writing the data is fairly straightforward. The only tricky part is properly formatting the JSON document to include a location array that can be indexed and used in a geospatial search. The ‘loc’ field is an array. It stores the lat and long data that you will index and use to perform the location based searches (described in part 2 of this series).
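The format for the ‘json’ string is (note the quotes around geoInfo, which is free text):
-> jsonObj = "{geonameID:" + geonameID + ",geoInfo:\"" + geoInfo + "\",loc: [ " + lat + ", " + lng + "] }";
Remember, the lat and lng fields must be placed inside an array element (the name loc is arbitrary), and whichever order you store them in must be used consistently in your queries.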
The ‘json’ string is loaded into a DBObject:
-> dbObject = (DBObject)JSON.parse(jsonObj);
And the dbObject is written to the database:
-> collectionUM.insert(dbObject);

Using this approach, you can write 100s or millions of records into the data store.
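Because geoInfo is free text, the ‘json’ string from step (4) needs its value quoted, and any quotes or backslashes inside it escaped, before it is handed to JSON.parse. The helper below is a sketch of that formatting step only; the class name GeoJsonBuilder and its methods are hypothetical, and it does not handle control characters. An alternative is to skip the string entirely and populate a BasicDBObject with put().

```java
class GeoJsonBuilder {
    // Escapes backslashes and double quotes so the text is safe inside a
    // double-quoted JSON string. Control characters are not handled here.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '\\' || c == '"') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Produces the same layout as the format shown in step (4).
    static String toJson(long geonameId, String geoInfo, double lat, double lng) {
        return "{geonameID:" + geonameId
                + ",geoInfo:\"" + escape(geoInfo) + "\""
                + ",loc: [ " + lat + ", " + lng + "] }";
    }

    public static void main(String[] args) {
        // Made-up values, not a real GeoNames record.
        System.out.println(toJson(1234567L, "Example Atoll", 16.5, -169.25));
    }
}
```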

In the next part of this series, I will cover how to perform the geospatial (location based) radial and polygon searches of the geo-coded documents.


To SQL (relational) or not to SQL (NoSQL) that is the question

March 19, 2012

This document is a work in progress. I have been asked to review this issue and I am providing this info in draft form.
This post focuses on MongoDB; you will find these issues with other NoSQL DBs, but the particulars may be different.

As always, the answer to the question “do we go relational DBMS or NoSql?” is: ‘it depends’. And it depends on a number of issues. I will try to enumerate and address those issues in the posting. I don’t assume that I have covered all issues; you should not either.

Before we begin, I am going to ask a question of the reader: “do you have a data description document?” That is, something in writing (written down, not just in your head) that describes your data requirements and how you will be using your data. I know that some people are thinking that I want you to have an E-R diagram fully developed; that is not the case. I just want the reader to have some sense of what data they will be working with, how that data is related (or not), what they need to do with the data, etc. If you know those things then you will be able to determine which issues are relevant to you and how important they are.

Schema
All data is related. If your data is not related then you may or may not need a DB to store it. Generally, in all non-trivial, large-scale, production deployments of data stores, there is some piece of data (entity or attribute) that is related to another. Therefore, the idea that NoSQL databases are schema-less may not be the best way of thinking about your data. I believe that you will want/need to develop some sort of schema: a description of the data, how it is organized, how it is related and how it will be used.
Note: I do not suggest creating an E-R diagram as it will drive you down the ‘relational’ path. However, you will want that description to define your collections and documents if you decide to go down the ‘NoSQL’ path.

Relationships Between Entities
This is a very tricky question, because the answer depends on what data entities you have and how they are related. Again, having a schema makes addressing this issue easier. If you find that your data (read entities) have a large number of relationships, then NoSQL may not be the best solution.
I suggest that you create a very high-level E-R diagram and then take the entities and see if you can ‘easily’ refactor the schema into a MongoDB schema, i.e. how efficiently and effectively you can embed the related entities into objects and arrays inside a BSON document. Also, this ‘embedding’ will be supported by client side linking … more on that later.

Atomicity
While Mongo does support some [built in] atomic operations (e.g. findAndModify), it currently does not guarantee atomic operations across multiple documents. If you have a schema where a number of entities need to be atomically updated at the same time (read transactions) then Mongo is not right for you.

Consistency – Single DB
RDBMS are strongly consistent by design; most allow table and/or row level locking.
MongoDB (current version) is strongly consistent because it has one global lock for read/write operations. There is some talk of replacing the global lock with collection-level locking. Bottom line: if your DB is going to be subject to a significant number of concurrent read/write operations, then Mongo may not be the best solution for you.

RYOW (Read-Your-Own-Writes) Consistency (single server)
One proposed work around is to leverage MongoDB’s atomic update semantics with an optimistic concurrency solution. This comprises four basic steps: 1. read a document; 2. modify the document (i.e. present it to the user in a web form); 3. validate that the document hasn’t changed; 4. commit or abandon the user’s update.
Note: There are a number of posts regarding read-your-own-write consistency that would be good to review if this is a large issue for you.
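The four optimistic-concurrency steps can be sketched without a database at all. The example below is an in-memory stand-in, not MongoDB code: the class OptimisticStore and its methods are hypothetical, and a ConcurrentMap plays the role of the collection. In MongoDB the validate-and-commit step would typically be a single update whose query matches both the document ID and the version that was read; that mapping is an assumption about how you would apply the pattern, not something taken from the driver.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class OptimisticStore {
    /** An immutable snapshot of a document plus the version it was read at. */
    static final class Versioned {
        final int version;
        final String body;

        Versioned(int version, String body) {
            this.version = version;
            this.body = body;
        }
    }

    private final ConcurrentMap<String, Versioned> docs = new ConcurrentHashMap<>();

    void insert(String id, String body) {
        docs.put(id, new Versioned(1, body));
    }

    // Step 1: read the document (the caller keeps this snapshot).
    Versioned read(String id) {
        return docs.get(id);
    }

    // Steps 3 and 4: commit only if the stored document is still the exact
    // snapshot the caller read; otherwise report failure so the caller can
    // re-read and retry, or abandon the update.
    boolean commit(String id, Versioned snapshot, String newBody) {
        Versioned next = new Versioned(snapshot.version + 1, newBody);
        return docs.replace(id, snapshot, next);
    }
}
```

If two users read the same snapshot and both edit it (step 2), the first commit succeeds and the second returns false, because the stored document is no longer the snapshot that the second user read.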

Consistency – Distributed DBMS
For ‘industrial strength’ DBMS this is a solved problem. For example, Oracle has RAC. If you really, really need it then it may be worth the money, but be very sure you need it as it is a very expensive solution.
MongoDB does not offer master-master replication or multi-version concurrency. In other words, writes always go to the same server in a replica set. By default, even reads from secondaries are disabled so the default behavior is that you communicate only with one server at a time.   This may need to be its own posting as this is a complicated issue.  More on this later.

Querying
Many RDBMS have standardized on SQL (ANSI) and are generally consistent. However, your stored procedures are not portable.
MongoDB has a relatively rich set of data access and manipulation commands. The find (select) command returns cursors. However, the language is particular to Mongo.

Indexing
Both RDBMS and MongoDB support the declaration and use of indexes.

Scalability
With MongoDB, scale-out is relatively easy. Scale reads by using replica sets. Scale writes by using sharding (auto balancing). There are issues with sharding that need to be understood. More on that later.

Cost
MongoDB is ‘free’. However, there are a number of ‘free’ RDBMS. But, as always, you need to factor in the costs for development and production support – which are non-trivial.

Maintainability
This is a challenge for a new player like MongoDB. The administrative tools are pretty immature when compared with a product like MySQL.