Glenn Engstrand

This story starts long ago with a Model Driven Software Development system that I wrote. The generated server side Java code used JDBC to access data in a database. Later on, I provided support for Cassandra, Amazon SimpleDB, and Solr by writing JDBC connectors for those NoSQL data stores.

It was nothing fancy. I subclassed Connection and PreparedStatement. I used Antlr to parse the SQL. It was a very simplified and stripped down version of SQL. The Cassandra connector would map tables to column families and fields to columns. The Hector API was used where a select would end up with an Indexed Slices Query and an insert or update would end up calling add insertion on a mutator. The only thing you could do with the where clause was filter by primary key field = value. Nothing to see here. Pretty straight forward.

That would be the end of this story except recently the DataStax folks released a new version of Cassandra where the preferred mode of communication is a DSL they call CQL. The Cassandra Query Language is very similar to the Structured Query Language. You can create schemas, change, and access data against narrow rows in a way very reminiscent of my JDBC connectors. CQL also works against wide row data including the time series use case that is Cassandra's best fit. The first part of the compound column key name becomes the primary key for each row in the CQL result set and the second part becomes the column name. Very clever.

Why did they do this? Well, it wouldn't be the first time that a non-relational db technology vendor adapted SQL for their own purposes. OLAP vendors have their own extension to SQL called MDX. HIVE is a way to access Hadoop File System data through commands that look like SQL. I think the real reason why vendors do this is familiarity. Developers who already know SQL may be more likely to use Cassandra if they believed that they didn't have to learn yet another API.

This is what I like about embedding a domain specific language in your technology. It can make your app more declarative. Separating out data access into a language that is more expressive, in the context of the application's domain, can improve your program's readability while at the same time increasing its flexibility. These are good things.

So DSL is good. What I don't like about this trend is picking SQL as the DSL. Don't get me wrong. I love SQL. It is a very expressive way to describe sets of data. The reason why I don't like embedding SQL into NoSQL is the very reason why vendors do, familiarity. There are a lot of parts to SQL that programmers may assume that they can use with a NoSQL technology but they either can't or shouldn't. What is missing in CQL are field expressions, grouping, joins, and nested queries. Even more dangerous is the parts to SQL that CQL does support but will not scale.

You should not explicitly sort in CQL. Instead, you should always denormalize with materialized views for each way that you will need to access the same data with a different ordering. You can build new indexes in CQL but performance will degrade significantly as the sort space increases. Something similar happens when you attempt to use complex filtering criteria in CQL. Just say no to anything but the most basic features of SQL in CQL.

I advocate strongly for the use of embedding domain specific programming languages into NoSQL data stores but exercise caution. Trying to make NoSQL look like SQL is an effort that, at first glance may seem glamourous and likely to attract attention, will in the end cause disappointment and dissatisfaction.