Jon Haddad

Learn Apache Cassandra by Example with CDM

21 Sep 2016

Two weeks ago marked another Cassandra Summit. As usual I submitted a handful of talks, and surprisingly they all got accepted. The first talk (video linked) I gave was an introduction to a tool I started back at DataStax called Dataset Manager for Apache Cassandra, referred to from here on as CDM. CDM started as a simple question: what can we do to help people learn how to use Apache Cassandra? How can new users avoid the headaches of incorrect data modeling, repeated production deployments, and costly schema migrations?

Of course, this is not a new problem. Introducing any new or foreign technology comes with biases and baggage. Everyone learns a different way. Some people prefer the academic approach of learning theory through notation then applying that theory in practice. Others, myself included, prefer to learn by example.

Learning by example is powerful. Seeing a solution, identifying a pattern, turning the pattern into an abstraction, then applying the abstraction to solve a similar problem is what programmers do all day when writing code.

Applying this learning technique isn’t new. Stack Overflow is built on the premise that a well-crafted question with a well-crafted answer will be of some use in the future, and it rewards its users for following that model.

How can we apply learning by example to a database? The idea of having example schemas and data isn’t new either. Both MS Access and SQL Server came with a database called Northwind which new users could play with in order to get familiar with the basics.

CDM is my attempt to help people learn by example, like they do with Northwind, combined with the power and ease of tools like apt and yum. Instead of shipping just one simple data model, we can leverage the collective knowledge of the Cassandra community to create many data models that solve a variety of problems.

There are currently three datasets that can be loaded into a cluster:

movielens - A popular project frequently used to learn collaborative filtering, from the GroupLens team. I also used this dataset as the basis for my second talk showing how to perform graph queries over Cassandra data using Apache Spark + GraphFrames.
killrvideo - A YouTube-like video site leveraging the search and analytics features of DataStax Enterprise. Microservice applications have been developed in C# and Node.js, with Java in the pipeline.
killrweather - A project showing how to store weather readings using time series data modeling. The original project for this dataset, KillrWeather, shows how to perform streaming aggregations using Apache Kafka, Akka, and Spark Streaming.

Want to try out CDM on your laptop? Assuming you already have a local Cassandra node up and running, it’s pretty straightforward. I’ve linked to the pre-release 0.11 version, which is currently published on GitHub. At the moment builds are only provided for OS X.

wget https://github.com/riptano/cdm-java/releases/download/0.11/cdm 
chmod 755 cdm
./cdm list
./cdm install movielens

Once it’s finished, you can open cqlsh as usual and take a look at what’s there. As the project matures and more datasets are added, I expect it’ll start to show up regularly in blog posts, presentations, and courses in the Apache Cassandra community.
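As a rough sketch of what "taking a look" in cqlsh might involve, the snippet below lists the keyspaces and then describes one of them. It assumes cqlsh is on your PATH and that the node is listening on the default local address. The keyspace name `movielens` is my guess at what the install creates, so check the output of the first command and adjust if it differs.

```shell
# Assumption: the dataset was installed into a keyspace named "movielens".
# Verify against the DESCRIBE KEYSPACES output and change this if needed.
KEYSPACE="movielens"

if command -v cqlsh >/dev/null 2>&1; then
    # List every keyspace on the local node to confirm the install worked.
    cqlsh -e "DESCRIBE KEYSPACES;"
    # Print the schema (tables, types, indexes) of the assumed keyspace.
    cqlsh -e "DESCRIBE KEYSPACE ${KEYSPACE};"
else
    echo "cqlsh not found on PATH"
fi
```

Reading the `DESCRIBE KEYSPACE` output is a quick way to see how the tables are laid out before running any queries against them.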

cassandra learning