Geospatial (location based) Searches in MongoDB – Part 2 – simple searching

April 2, 2012

This is the 2nd post of a multi-part series on performing geospatial (location based) searches on large data sets in MongoDB.

In this part we will focus on using simple queries to perform geo searches on the location-tagged data that we loaded into MongoDB (see part 1 for details).

MongoDB supports two-dimensional geospatial indexes. It is designed with location-based queries in mind, such as “find me the closest N items to my location.” It can also efficiently filter on additional criteria, such as “find the items that are within some distance of the centroid of my search”.

Assuming that the data has been properly loaded, the first step is to index the data so that it can be geo searched by MongoDB. This can be done from Java, but for this exercise we will set it up from the command line:
db.name_of_collection.ensureIndex( { loc : "2d" } )

Once you have created the index, it is best to check and make sure that it was created properly:
db.name_of_collection.getIndexes()
You should see something like:
{ "v" : 1, "key" :
{ "loc" : "2d" },
"ns" : "geo1.um",
"name" : "loc_"
}
Note: The key part is that "loc" should be "2d".
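
If you would rather create the index from the Java driver than from the shell, a minimal sketch (assuming the same collectionUM collection object that is set up in part 1) would be:

// create the 2d geospatial index on the "loc" field from Java
collectionUM.ensureIndex(new BasicDBObject("loc", "2d"));
// optionally list the indexes on the collection to verify it was created
for (DBObject indexInfo : collectionUM.getIndexInfo()) {
    System.out.println(indexInfo);
}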

Now for the ‘fun part’: building and running the geospatial queries in Java. We will start simple and build more complex queries, using more advanced Java concepts (i.e. classes) as we go along. If you are not completely comfortable with building/running queries with BasicDBObject, please review that material before continuing.

A simple geo search from the command line would look like: db.um.find( { loc : { $near : [15,150] }})
In Java it would be:

Double locationLongitude = new Double(15);
Double locationLatitude = new Double(150);
BasicDBObject locQuery = new BasicDBObject();
locQuery.put("loc", new BasicDBObject("$near", new Double[]{locationLongitude, locationLatitude}));
// run the query
DBCursor locCursor = collectionUM.find(locQuery);
// use the cursor to view the results

Note: Use a Double array in this approach; otherwise it can lead to precision issues.

A more complex (useful) search from the command line could be: db.um.find( { loc : { $near : [15.5, 150.11] , $maxDistance : 40 } } ).limit(10)
In Java, using a JSON document:

String sLng = "15.5"; Double dLng = new Double(sLng);
String sLat = "150.11"; Double dLat = new Double(sLat);
String sDistance = "40";
DBCursor cur = collectionUM.find(new BasicDBObject("loc", JSON.parse("{$near : [ " + dLng + "," + dLat + " ] , $maxDistance : " + sDistance + "}"))).limit(10);

Note: Even with this relatively simple query, it does not appear that you can easily wrap the geo query parameters in a Java BasicDBObject. There have been a number of postings on Stack Overflow and the Mongo Google Groups on this issue; however, I have not yet seen an example of a BasicDBObject implementation that does not resort to parsing a JSON document. There is also some mention of implementing this query with the BasicDBObjectBuilder. I am currently looking into that.
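
For what it is worth, here is a sketch of the same query built entirely from nested BasicDBObjects (no JSON parsing), reusing dLng, dLat and sDistance from the snippet above. I have not verified it against the data set, so treat it as a starting point rather than a working solution:

// build {$near : [dLng, dLat], $maxDistance : 40} as a nested BasicDBObject
BasicDBObject nearParams = new BasicDBObject("$near", new Double[]{dLng, dLat});
nearParams.append("$maxDistance", Double.valueOf(sDistance));
// wrap it under the "loc" key and run the query
DBCursor cur2 = collectionUM.find(new BasicDBObject("loc", nearParams)).limit(10);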

In the next installment I plan to use more advanced Java concepts (i.e. classes) to implement the searches.


Geospatial (location based) Searches in MongoDB – Part 1 – data acquisition and loading

March 29, 2012

This is the first post of a multi-part series on performing geospatial (location based) searches on large data sets in MongoDB.

In this part, we will focus on getting large sets of geo-data into MongoDB using the Mongo data drivers. This will be a code-focused approach; the programming language is Java (Mongo supports a wide variety of programming languages). I will be using the Mongo Java driver to parse the geo data, format the data, and insert it into Mongo. You will need to download the Java driver (jar file) from mongodb.org.

First, we will need some geo-data. One of the best sources for ‘reasonably’ well formatted and consistent data is GeoNames.org. Here you can download files for individual countries, or the allCountries.zip file containing over 7 million records (~200 MB). To start working I recommend that you download a small, country-specific zip file as it will be much easier to work with (I used um.zip, which contains 230 records). Once you have things working, you can download larger countries or the allCountries file.

The following code segments perform four basic steps: getting a connection to the Mongo db, getting the collection object (used to store the data in the db), reading the data from the country-specific file, and writing the data to the database.

(1) Get a connection to the Mongo database

System.out.println("mongo");
Mongo mongo = new Mongo("localhost", 27017);
System.out.println("getting db geo1");
DB db = mongo.getDB("geo1");

(2) Create your collection and data store object

// get a single collection
System.out.println("collection UM");
DBCollection collectionUM = db.getCollection("um");
DBObject dbObject = null;
String jsonObj = "";

(3) Read the data

There is nothing really fancy here. The data is in a text file, one record per line. Just read the line, tokenize the string, and extract the data. The only tricky part is that the data is not consistently delimited, so you have to look for the lat/lng data fields. I use a regular expression to find the floating point data fields (token.matches("-?\\d+(\\.\\d+)?")); they are the only floating point fields in the record. To keep things simple I only retained four pieces of data: the geonameID, the location information (the text between the geonameID and the latitude data field), the latitude, and the longitude.
Note: As per good programming practice, you do need to check the lat/lng data to ensure that each value is a floating point number within range (±90 for latitude, ±180 for longitude). Also, make sure that you do not lose precision in the data. This should not be a problem in Java, but this sort of thing can be a bit of a headache in PHP.
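
As a rough sketch of that read/parse loop (exception handling omitted, and the field handling is simplified; the GeoNames country files are tab-delimited, so adjust the delimiter and field handling to match your file):

// read the country file line by line and pull out the four fields we keep
BufferedReader reader = new BufferedReader(new FileReader("UM.txt"));
String line;
while ((line = reader.readLine()) != null) {
    String[] tokens = line.split("\t");
    String geonameID = tokens[0];
    StringBuilder geoInfo = new StringBuilder();
    Double lat = null;
    Double lng = null;
    for (int i = 1; i < tokens.length && lng == null; i++) {
        String token = tokens[i];
        if (token.matches("-?\\d+(\\.\\d+)?")) {   // the first two floating point fields are lat, lng
            if (lat == null) {
                lat = Double.parseDouble(token);
            } else {
                lng = Double.parseDouble(token);
            }
        } else if (lat == null) {
            geoInfo.append(token).append(" ");     // text between the geonameID and the latitude
        }
    }
    // sanity check the values before building the document (see step 4)
    if (lat != null && lng != null && Math.abs(lat) <= 90 && Math.abs(lng) <= 180) {
        // build the JSON document and insert it here
    }
}
reader.close();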

(4) Writing the data to Mongo

We will write the data to the database using the DBCollection.insert() method. In step (2) you created the collection object that references the collection you will write your documents to. We will use that object to write a JSON document to the collection.
Writing the data is fairly straightforward; the only tricky part is properly formatting the JSON document to include a location array that can be indexed and used in a geospatial search. The ‘loc’ field is an array. It stores the lat and lng data that you will index and use to perform the location based searches (described in part 2 of this series).
The format for the ‘json’ string is (note that the geoInfo text must be wrapped in quotes, and any embedded quotes escaped, or the parse will fail):
-> jsonObj = "{geonameID:" + geonameID + ",geoInfo:\"" + geoInfo + "\",loc: [ " + lat + ", " + lng + "] }" ;
Remember, the lat and lng fields must be placed inside an array element (the name loc is arbitrary).
The ‘json’ string is loaded into a DBObject:
-> dbObject = (DBObject)JSON.parse(jsonObj);
And the dbObject is written to the database:
-> collectionUM.insert(dbObject);
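
If the string concatenation becomes awkward (because of the quoting and escaping of the geoInfo text), an alternative sketch is to build the document directly with BasicDBObject and let the driver handle the string values:

// build the document directly instead of parsing a JSON string
BasicDBObject doc = new BasicDBObject();
doc.put("geonameID", geonameID);
doc.put("geoInfo", geoInfo);
doc.put("loc", new Double[]{lat, lng});   // same [lat, lng] array as the JSON version
collectionUM.insert(doc);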

Using this approach, you can write hundreds of thousands or millions of records into the data store.

In the next part of this series, I will cover how to perform the geospatial (location based) radial and polygon searches of the geo-coded documents.


To SQL (relational) or not to SQL (NoSQL) that is the question

March 19, 2012

This document is a work in progress. I have been asked to review this issue and I am providing this info in draft form.
This post focuses on MongoDB; you will find similar issues with other NoSQL DBs, but the particulars may be different.

As always, the answer to the question “do we go relational DBMS or NoSQL?” is: ‘it depends’. And it depends on a number of issues. I will try to enumerate and address those issues in this posting. I don’t assume that I have covered all the issues; you should not either.

Before we begin, I am going to ask a question of the reader: “do you have a data description document?” Something in writing (written down, not just in your head) that describes your data requirements and how you will be using your data. I know that some people are thinking that I want you to have an E-R diagram fully developed; that is not the case. I just want the reader to have some sense of what data they will be working with, how that data is related (or not), what they need to do with the data, etc. If you know those things then you will be able to determine which issues are relevant to you and how important they are.

Schema
All data is related. If your data is not related then you may or may not need a DB to store it. Generally, in all non-trivial, large-scale, production deployments of data stores, there is some piece of data (entity or attribute) that is related to another. Therefore, the idea that NoSQL databases are schema-less may not be the best way of thinking about your data. I believe that you will want/need to develop some sort of schema – a description of the data, how it is organized, how it is related, and how it will be used.
Note: I do not suggest creating an E-R diagram as it will drive you down the ‘relational’ path. However, you will want that description to define your collections and documents if you decide to go down the ‘NoSQL’ path.

Relationships Between Entities
This is a very tricky question, because the answer depends on what data entities you have and how they are related. Again, having a schema makes addressing this issue easier. If you find that your data (read: entities) has a large number of relationships, then NoSQL may not be the best solution.
I suggest that you create a very high-level E-R diagram and then take the entities and see if you can ‘easily’ refactor the schema into a MongoDB schema – how efficiently and effectively can you embed the related entities into objects and arrays inside a BSON document? Also, this ‘embedding’ will be supported by client-side linking … more on that later.

Atomicity
While Mongo does support some built-in atomic operations on a single document (e.g. findAndModify), it does not guarantee atomic operations across multiple documents. If you have a schema where a number of entities need to be atomically updated at the same time (read: transactions) then Mongo is not right for you.
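
For example, a single-document atomic update with findAndModify might look roughly like this in the Java driver (the inventory collection and field names are hypothetical):

// atomically decrement a quantity and return the affected document in one operation;
// no other writer can see the document in a half-updated state
DBObject query = new BasicDBObject("sku", "ABC-123").append("qty", new BasicDBObject("$gt", 0));
DBObject update = new BasicDBObject("$inc", new BasicDBObject("qty", -1));
DBObject updatedDoc = inventory.findAndModify(query, update);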

Consistency – Single DB
RDBMS are strongly consistent by design; most allow table and/or row level locking.
MongoDB (as of the current version) is strongly consistent because it has one global lock for read/write operations. There is some talk of moving from the global lock to collection-level locking. Bottom line: if your DB is going to be subject to a significant number of concurrent read/write operations, then Mongo may not be the best solution for you.

WYOW Consistency (single server)
One proposed workaround is to leverage MongoDB’s atomic update semantics with an optimistic concurrency solution. This comprises four basic steps: 1. read a document; 2. modify the document (e.g. present it to the user in a web form); 3. validate that the document hasn’t changed; 4. commit or abandon the user’s update.
Note: There are a number of posts regarding read-your-own-write consistency that would be good to review if this is a significant issue for you.
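
A rough Java sketch of that optimistic pattern, using a hypothetical version field on the document to detect concurrent modification (the collection object, docId, and field names are my own assumptions, not from the posts referenced above):

// 1. read the document and remember the version you saw
DBObject doc = collection.findOne(new BasicDBObject("_id", docId));
Integer version = (Integer) doc.get("version");

// 2. apply the user's changes locally
doc.put("status", "reviewed");
doc.put("version", version + 1);

// 3 & 4. write back only if the version is unchanged; n == 0 means another writer got there first
WriteResult result = collection.update(new BasicDBObject("_id", docId).append("version", version), doc);
if (result.getN() == 0) {
    // conflict: re-read the document and retry, or abandon the user's update
}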

Consistency – Distributed DBMS
For ‘industrial strength’ DBMS this is a solved problem. For example, Oracle has RAC. If you really, really need it then it may be worth the money, but be very sure you need it as it is a very expensive solution.
MongoDB does not offer master-master replication or multi-version concurrency. In other words, writes always go to the same server in a replica set. By default, even reads from secondaries are disabled so the default behavior is that you communicate only with one server at a time.   This may need to be its own posting as this is a complicated issue.  More on this later.

Querying
Many RDBMS have standardized on SQL (ANSI) and are generally consistent. However, your stored procedures are not portable.
MongoDB has a relatively rich set of data access and manipulation commands. The find (select) command returns cursors. However, the language is particular to Mongo.
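
As a small illustration of the difference (the collection and field names are made up), the SQL query SELECT * FROM users WHERE age > 21 ORDER BY age would be expressed through the Java driver roughly as:

// equivalent of: SELECT * FROM users WHERE age > 21 ORDER BY age
DBCursor cursor = db.getCollection("users")
        .find(new BasicDBObject("age", new BasicDBObject("$gt", 21)))
        .sort(new BasicDBObject("age", 1));
while (cursor.hasNext()) {
    System.out.println(cursor.next());
}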

Indexing
Both RDBMS and MongoDB support the declaration and use of indexes.

Scalability
Scale-out is relatively easy. Scale reads by using replica sets. Scale writes by using sharding (auto balancing). There are issues with Sharding that need to be understood. More on that later.

Cost
MongoDB is ‘free’. However, there are a number of ‘free’ RDBMS. But, as always, you need to factor in the costs for development and production support – which are non-trivial.

Maintainability
This is a challenge for a new player like MongoDB. The administrative tools are pretty immature when compared with a product like MySQL.


AT&T launches Synaptic Compute with support for hybrid clouds

March 7, 2012

AT&T launches Synaptic Compute with support for hybrid clouds. AT&T says the system supports bursting, disaster recovery and mobile application development and deployment.

AT&T’s goal is to ease the transition for enterprises upgrading from a private cloud to a hybrid cloud system, using AT&T’s network to provide additional storage and compute power. The service targets VMware customers that are using the vCloud Datacenter.

AT&T claims that the system supports bursting, data center extensions, disaster recovery, and mobile application development and deployment. New features include virtual machine cloning, scalable computing and memory resources, multiple user interfaces, a multi-layer firewall and open standard software.


Selection of Hybrid Cloud Vendors

February 22, 2012

How do you evaluate and select a hybrid cloud vendor? It really depends on the problem(s) that you need to address and solve. If data transfer and storage are critical, then the most important issues are those of bandwidth and data transfer. If the system needs to support bursting, or spikes in web traffic and/or computation loads, then price may be more important.

What follows is a description of our evaluation and comparison of vendors based on our needs for a hybrid cloud infrastructure that provides a private cloud (more like managed hosting) and a public cloud (that provides elastic/on-demand computing resources).
Please bear in mind that your needs are likely to be different.

Cloud computing vendors were evaluated using the following criteria:
Completeness of Hybrid Offering – By vendor or in combination with 3rd party.
Maturity of Offering(s) – Relative length of time vendor has been providing hybrid offerings.
Cost – Total costs.
Reliability – SLAs for the private and public clouds
Bandwidth and Data Transfer – Maximum bandwidths for data transfers between clouds.
Self-service Support – Ability to provision and manage resources without vendor involvement.
Developer Support – How ‘developer friendly’ is the infrastructure (and vendor).
Portability of Deployments – How easy is it to move deployment from one vendor to another.
Integration Support – Support for open standards or public APIs for integration.
Security – Tools and capabilities.
Management – Management tools for both public and private clouds.

Notes
1) As portability of deployment is a critical requirement, no PaaS solutions were considered as those solutions, by their design and implementation, are not portable.
2) Computing and storage costs are highly dependent on configurations.
3) Portability of IaaS cloud implementation can be heavily dependent on how the systems are configured and deployed.

The following vendors were considered as we believe that their current offerings could address most of our evaluation criteria: AWS, AT&T, Datapipe, Go Grid, IBM, RackSpace, Terremark.

Potential candidates
Datapipe – Has strong managed hosting and the ability to hybridize Amazon’s solutions with its own. Claims seamless integration between AWS and Datapipe environments, high I/O performance, and integrated support and management.
Go Grid – Smaller, independent provider of public and private clouds. Very high SLAs. Competitive pricing. All APIs are proprietary, so portability may be an issue.
RackSpace – Strong managed hosting. Open source development via OpenStack project. Offers some hybrid configuration. Is moving quickly to provide fully featured, hybrid offerings.

Vendors that are lacking in one/more critical areas
AT&T – Very strong in managed hosting, and Synaptic Compute is an ambitious offering. However, the service appears to still be in beta (not fully released).
AWS – Amazon does not provide a native hybrid cloud offering, and they do not provide non-virtualized servers. They do provide a hybrid offering in partnership with 3rd party vendors (e.g. Equinix) via Direct Connect. However, that would require us to provision two separate clouds with two different vendors.
CSC – Nascent hybrid solutions.
IBM – Strong managed offerings. Complex contracts and pricing structures. Focused on large enterprises. Level of commitment to full set of hybrid offerings is unclear at this time.
Terremark – Moving quickly into the hybrid cloud space with the acquisition of CloudSwitch. However, their hybrid offerings are relatively new.


The Future of Hybrid Cloud Computing

January 25, 2012

Currently there are few standards for interoperability between public and private clouds.

There are a number of ‘forces’ that are shaping the future of cloud computing

Amazon – AWS is still the dominant force in cloud computing. They are the largest and at the same time the most innovative vendor in the space – no mean feat. Amazon is working to support interoperability between its offerings and private enterprise clouds via a published API.

Rackspace – OpenStack – The community has over 150 members that are dedicated to creating an interoperability model for a variety of cloud configurations. However, OpenStack is still nascent; its future as an industry standard is not certain.

VMware – VMware claims 80% market share in private cloud deployments. vCloud – The service is based on VMware’s vSphere and vCloud Director (vCD), and exposes the vCloud API. vCD is a key part of VMware’s strategy for driving adoption of hybrid clouds. It provides interoperability between VMware-virtualized infrastructures and 3rd party service providers. These service providers are part of VMware’s service provider partner program. Note: It can be challenging to integrate public cloud services from vendors that are not VMware based.

The efforts of the major players are shaping the future of hybrid cloud computing:

– AWS Direct Connect to private cloud vendors such as Equinix
– AT&T’s Synaptic Compute as a Service makes the company’s IaaS public cloud compatible with VMware’s vCloud Datacenter offering.
– CSC Cloud Services
– Datapipe
– Go Grid
– IBM enhanced its Smart Cloud offering by the acquisition of Cast Iron
– Rackspace’s commitment to OpenStack
– Terremark


How Cloud Computing Will Transform (and Already Has Transformed) Enterprise Computing

October 25, 2010

There is no shortage of definitions of cloud computing.  See the article in Cloud Computing Journal 21 Experts Define Cloud Computing.  And yes, there are 21 different definitions and many of them have significant differences.
Needless to say, the definition is subject to a variety of interpretations.  The latest Gartner report on Cloud Computing Systems did not include Google (their app engine was seen as an application infrastructure) or Microsoft (Azure was seen as a services platform).  You have to take these things with a ‘grain of salt’ – Gartner’s report did not have Amazon in the ‘leader’s quadrant’.
One general description that I like is that cloud computing involves the delivery of hosted services over the Internet that are sold on demand (by time, amount of service and/or amount of resources), that are elastic (users can have as much or as little of a service or resource as they need), and that are managed by the [service] provider.

I attended a recent TAG Enterprise 2.0 Society meeting (un-conference).  During the discussions one of the participants asked, “how do we go about starting to use cloud computing?”  The first thought that came to mind was ‘you already are’.  If you socialize on Facebook or LinkedIn, if you collaborate/network using Ning or Google Groups, if you use Twitter, if you get your email via Gmail or Hotmail, or if you use Salesforce.com, then you are already using cloud computing – using applications/services that, in some form, run in the cloud.

A recent Gartner Research newsroom release predicted that by 2012 (just two or three years hence), cloud computing will become so pervasive that “20 percent of businesses will own no IT assets”. No matter how you slice it, that is a pretty bold statement to make (even for Gartner).
I don’t know if I believe that 20 percent of businesses will have no IT assets (by 2012).  I believe that there are significant issues that will preclude businesses from putting 100% of their IT assets in the cloud.  These include security of data (that is stored in the cloud), control and management of resources, and the risks of lock-in to cloud platform vendors.
What seems more plausible are reports by ZDNet and Datamonitor which predict that within the next few years up to 80% of Fortune 500 companies will utilize cloud computing application services (i.e. SaaS applications), and up to 30% will purchase cloud computing system infrastructure services.
In the near term, I see cloud computing as more of an implementation strategy.  Enterprise computing assets and resources (including social computing software and social media) that are currently implemented within enterprise datacenters will migrate into the cloud.
The shift toward cloud services hosted outside the enterprise’s firewall will cause a major shift in how enterprises develop and implement their overall IT strategies and, in particular, their Enterprise Social Computing strategies.
This shift toward, and the eventual widespread adoption of, cloud computing by the enterprise will be driven by a number of factors:

Cost (computing resources)
Late last year (2009) Amazon, Google and Azure lowered their published pricing for reserved computing instances (computing cores).  Amazon’s rate for a single-CPU, continuously available cloud computing instance was as little as 4 cents an hour (effective hourly rate based on 7×24 usage) for customers that sign up for a three year contract.
Single year contract rates were about 20% higher.  Pricing for on-demand instances (no upfront payments or long term commitments) was about two and a half to three times the three year contract rates.
A rough calculation says that a cloud data center of 10 single-core servers (at three year contract rates) could be operated around the clock for under $0.50 an hour, or just under $3,500 a year (about $350 per server per year).  And that includes data center facilities, power, cooling, and basic operations.  Pretty impressive numbers!

Commoditization of Cloud Computing
And if the costs of cloud computing weren’t low enough, Amazon announced pricing for EC2 ‘spot instances’.  This pricing model will usher in the beginnings of a trading market for many types of cloud computing resources: support services, storage, computing power, and data management.
Under the old model you had to pay a fixed price that you negotiated with a bulk vendor or a private supplier.  Now, in the new spot market, you can look at the latest price of available cloud capacity and place a bid for it.  If your bid is the highest, then the capacity is yours. Currently this is available from Amazon’s EC2 Cloud Exchange.

Leveling the playing field for startups and SMBs
One of the most important aspects of cloud computing is that SMBs can afford to do things they could not have afforded to do before;  they can do new, exciting, innovative things – not just the same old things for less money.
In the past, when SMBs needed to build a new IT infrastructure (or significantly upgrade the current one) they often could not afford to buy large amounts of hardware and the latest/greatest enterprise software.
In the cloud you pay for the hardware and software that you need in bite-sized chunks. Now the SMBs can afford clustered, production-ready databases and application servers, and world class, enterprise software (via SaaS).  Having equivalent technology can help ‘level the playing field’ when competing against large enterprises.

New Products and Services
The availability of large amounts of computer processing power and data storage will allow innovative companies to create products and services that either weren’t possible before or were not economically feasible to deploy and scale.
In the past, business ideas that required prohibitive amounts of computing power and data storage may not have been implemented due to technical restrictions or cost-effectiveness.  Many of these ideas can now be realized in the cloud.

Reliability
Most cloud computing vendors offer three and a half nines of service level availability – an annual uptime percentage of 99.95% (or about 4½ hours of downtime per year).  If applications can be deployed to clusters of servers then downtimes will be greatly reduced.
Note:  ‘Five nines’ of SLA is said to be available from a few vendors.  However, upon closer reading of their offerings you may find wording such as “we are committed to using all commercially reasonable efforts to achieve at least 99.999 percent availability for each user every month.”
As always, read the SLAs very carefully.

Agility
Cloud computing enables two types of ‘agility’.  The first is time to realization: how fast you can see that an idea is working or not working.  Cloud computing supports the rapid acquisition, provisioning, and deployment of supporting resources (potentially much faster than in traditional IT environments).
The second type of agility is flexibility (aka elasticity) of computing and service resources.  Elasticity can reduce the need to over-provision.  The enterprise can start small, and then scale up when demand goes up.  And, if they have been prudent with their contractual obligations, they can scale down when resources are no longer needed.

Cloud Vendors – The New and the Old
The early leaders Amazon, Google and Microsoft have been joined by big names like HP, IBM, Dell, and Cisco; even Oracle has gotten into the game. They are utilizing existing strengths to create successful cloud computing products and services for their customers and partners.
There is a new generation of companies that are developing cloud offerings – see The Top 150 Players in Cloud Computing.  These new companies are likely to be more nimble and move more quickly than the current leaders.  We are already seeing a number of new, innovative approaches (technologies, business models, and openness) to cloud-based services.

It is not an exaggeration to say that ‘the IT industry landscape will be remade by cloud computing’.