Relational databases aren't right for the cloud
I attended the Cloud Connect 2010 conference in Santa Clara, California, one of the first major gatherings of the year on cloud computing. One of the main topics that came up was moving away from relational databases for data persistence. Known as the "NoSQL" movement, it is about leveraging databases that may handle larger data sets more efficiently. I have already written about the "big data" efforts emerging around the cloud, but this is a more fundamental movement to drive data back to more primitive, but perhaps more efficient, models and physical storage approaches.
NoSQL systems typically work with data in memory or load chunks of data from many disks in parallel. The issue is that "traditional" relational databases do not provide the same models and, thus, the same performance. While this was fine in the days of databases with a few gigabytes of data, many cloud computing databases are blowing past a terabyte, and we will see huge databases supporting cloud-based systems going forward. Relational databases are contraindicated for operations on large data sets, because SQL queries tend to consume many CPU cycles and thrash the disk as they process data.
If you think we have heard this song before, you are correct. Object and XML databases made some inroads back in the 1990s, but many enterprises kept their relational databases, such as Oracle, Sybase, and Informix, even though many nonrelational databases did indeed provide better performance. The cost and risk of moving off relational databases, as well as the relatively small sizes of the databases involved, kept it pretty much a relational world.
However, the cloud changes everything. The requirement to process huge amounts of data in the cloud is leading to new approaches to database processing, based on older models. MapReduce, the fundamental way Hadoop processes data, is based on the older "shared-nothing" database processing model, but now we have the processing power, the disk space, and the bandwidth to make it practical.
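For readers who have not seen the model, here is a minimal single-process sketch of the MapReduce idea in Python. It is an illustration only, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample data are made up for this example. Each partition is mapped without touching any shared state, which is what would let the map calls run on separate nodes.

```python
from collections import defaultdict


def map_phase(partition):
    """Map: emit (word, 1) pairs from one partition, using no shared state."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle: group every emitted value by its key across all partitions."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key independently of other keys."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(partitions):
    # In a real cluster each map_phase call could run on its own node;
    # here they simply run one after another in a single process.
    mapped = [map_phase(partition) for partition in partitions]
    return reduce_phase(shuffle(mapped))


# Hypothetical input: two partitions, as if stored on two different disks.
partitions = [
    ["the cloud changes everything"],
    ["the relational world", "cloud data"],
]
counts = word_count(partitions)
```

Because no map call depends on another, adding partitions (and machines to process them) is the main way such a job is meant to grow.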
I believe the movement to cloud computing will indeed reduce the use of relational databases. It is nothing we have not heard before, but this time we have a true need.
When this column was posted online, it attracted the following comment: "It's always been the same. You want to process high volumes of data with large CPU requirements? Don't use a database. You want resilience, recoverability, reliability, consistency and a language that allows those without a PhD in machine code to join data up in new ways - ways not intended by the original design? You need a database. Yes, there's an overhead, go figure. But if it won't scale, you're either using a duff relational database or a duff designer. Built right, a set of relational tables will take 10 times as long to process 10 times the amount of data in a join. That's scalability. Anything claiming more is likely to be wool pulling or smoke and mirrors. (Also applies to your article's talk about a new database that can do memory-caching of data and pull data from many disks at a time. Any relational database which cannot do that does not deserve to be called a database)."
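The commenter's point about joins scaling linearly can be illustrated with a hash join, one common join strategy. The sketch below is an assumption-laden toy in Python, not what any particular database product does: the hash_join function, the table contents, and the column names are hypothetical. It makes one pass over each input, so when the join keys on the indexed side are unique, the work grows roughly in proportion to the number of rows.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Join two lists of dicts on `key`: index `left`, then probe with `right`."""
    index = defaultdict(list)
    for row in left:  # one pass over the left table to build the index
        index[row[key]].append(row)

    joined = []
    for row in right:  # one pass over the right table to probe the index
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical tables for illustration only.
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
orders = [
    {"cust_id": 1, "total": 30},
    {"cust_id": 2, "total": 15},
    {"cust_id": 1, "total": 5},
]
rows = hash_join(customers, orders, "cust_id")
```

The sketch keeps the whole index in memory, which sidesteps the disk and memory pressure the column is worried about; it shows the shape of the algorithm, not how it behaves at a terabyte.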