NoSQL Experiments

by Matt Cholick

While working on my Master's project, I decided to try some experiments with NoSQL. I wanted experience with this type of technology, and my data seemed to fit the model well:
  • There are no real relationships; the different pieces of data can be stored in unrelated tables.
  • Transactions aren't important because the data is either read or written in single large operations spanning the entire dataset. It is not randomly accessed from multiple threads of execution.
  • Scale seemed important because I'm pulling data from Twitter's Firehose, and this can result in large volumes of data that I need to process quickly.

I moved my data from PostgreSQL to Cassandra. One attractive aspect of Cassandra is that it's a Java project; working in Groovy means that Java-focused tools are very easy to integrate. Cassandra was originally developed at Facebook and later became an Apache project, so I know it's software with a solid history as well as current backing. Finally, write-ups of the tool noted that its write performance is quite good.
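
Pulling a Java client into a Groovy script really is just a couple of lines. The sketch below shows the general idea using the DataStax Java driver and CQL; the driver version, keyspace, and table layout are illustrative stand-ins, not my actual project schema or the exact client I used:

    // Pull the DataStax Java driver into a Groovy script via Grape.
    @Grab('com.datastax.cassandra:cassandra-driver-core:3.11.0')
    import com.datastax.driver.core.Cluster

    // Connect to a local node; the contact point and names are illustrative.
    def cluster = Cluster.builder().addContactPoint('127.0.0.1').build()
    def session = cluster.connect()

    session.execute('''
        CREATE KEYSPACE IF NOT EXISTS tweets
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    ''')

    // A flat, relationship-free table: one row per tweet, no joins.
    session.execute('''
        CREATE TABLE IF NOT EXISTS tweets.raw_tweets (
            tweet_id   bigint PRIMARY KEY,
            user_name  text,
            body       text,
            created_at timestamp
        )
    ''')

    cluster.close()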

I added Cassandra and, as much as I possibly could, tried to keep my algorithms constant. I also made sure to include indexes and to slice the data into similar chunks for processing. I experimented with two of the longer-running pieces of my program: the algorithm that cleans up and post-processes tweets, and building a Bayes classifier from the cleaned data. The cleanup operation is very write-intensive, while the training operation only reads data.
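
To give a sense of that write-heavy shape: the cleanup pass is essentially a prepared insert executed once per tweet. The sketch below is only illustrative; the table layout and the cleanUp step are placeholders rather than my actual post-processing code:

    @Grab('com.datastax.cassandra:cassandra-driver-core:3.11.0')
    import com.datastax.driver.core.Cluster

    def cluster = Cluster.builder().addContactPoint('127.0.0.1').build()
    def session = cluster.connect('tweets')

    // Hypothetical table for the post-processed tweets.
    session.execute('''
        CREATE TABLE IF NOT EXISTS clean_tweets (
            tweet_id bigint PRIMARY KEY,
            tokens   list<text>
        )
    ''')

    // Prepare the insert once, then bind it per row: the cleanup pass is
    // dominated by this stream of single-row writes.
    def insertClean = session.prepare(
        'INSERT INTO clean_tweets (tweet_id, tokens) VALUES (?, ?)')

    // cleanUp() stands in for the real tokenising / post-processing step.
    def cleanUp = { String body -> body.toLowerCase().tokenize() }

    session.execute('SELECT tweet_id, body FROM raw_tweets').each { row ->
        def tokens = cleanUp(row.getString('body'))
        session.execute(insertClean.bind(row.getLong('tweet_id'), tokens))
    }

    cluster.close()

Here are the results.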

Operation    PostgreSQL    Cassandra
Clean             19:18        10:18
Bayes              1:18         2:11

Cassandra did show improved write performance: running the cleanup operation took roughly half the time. Read-heavy operations, on the other hand, did not perform as well. It's likely that I could have made changes to the algorithm to improve performance in the Bayes implementation, but the same could be said for the PostgreSQL version. This was not a scientific experiment; it was instead dropping a different back-end behind the same implementation, and the database operation involved in training is a simple series of reads pulling a slice of data. The improved write performance in the cleanup step is nice, but unfortunately that's an algorithm I rarely run. Training and modifying the classifier is where I've spent most of my development time.
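
For contrast with the cleanup sketch above, the training pass is just a paged read over the cleaned rows. Again, this is only an illustration of the shape of the work; trainOn() is a stand-in for updating the classifier, not my real training code:

    @Grab('com.datastax.cassandra:cassandra-driver-core:3.11.0')
    import com.datastax.driver.core.Cluster
    import com.datastax.driver.core.SimpleStatement

    def cluster = Cluster.builder().addContactPoint('127.0.0.1').build()
    def session = cluster.connect('tweets')

    // Page through the cleaned tweets; the driver fetches the next slice
    // transparently as iteration crosses each page boundary.
    def select = new SimpleStatement('SELECT tweet_id, tokens FROM clean_tweets')
            .setFetchSize(500)

    // trainOn() is a stand-in for accumulating the Bayes token counts.
    def trainOn = { List<String> tokens -> /* update word counts here */ }

    session.execute(select).each { row ->
        trainOn(row.getList('tokens', String))
    }

    cluster.close()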

These kinds of tools fill a real need, but they're simply not a drop-in database replacement. In this project, where my own time is the scarcest resource, a model that's familiar enough to implement quickly makes the most sense. The experimentation and exploration come from the machine learning, the recommender, and the large-scale Groovy implementation.

My biggest takeaway: this type of technology isn't something to just drop in as a database replacement. My mental models still need adjusting.