Jon Haddad

Performing User Defined Compactions in Apache Cassandra

18 Oct 2016

In this post I’ll introduce you to an advanced option in Apache Cassandra called user defined compaction. As the name implies, this is a process by which we tell Cassandra to create a compaction task for one or more tables explicitly. This task is then handed off to the Cassandra runtime to be executed like any other compaction.

This is not an operation you’re likely to need on a daily basis. However, it is extremely useful in situations where you want to reclaim disk space immediately, and don’t want to wait for normal compaction to kick in.

Unless you are running Cassandra version 3.4 or later (see CASSANDRA-10660), you will need to use JMX to issue a user defined compaction. If you haven’t used JMX commands before, it can seem overwhelming at first. If you’re coming from a non-Java background, it’s likely going to be a completely foreign concept. Don’t let that scare you off! By the end of this post you’ll be able to perform a user defined compaction using a utility called jmxterm.
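For readers who are on Cassandra 3.4 or later, the nodetool route is shorter. Here is a minimal sketch, assuming the --user-defined flag on nodetool compact that CASSANDRA-10660 added. The function name compact_sstables and the NODETOOL variable are my own, for illustration only:

```shell
# Sketch: run a user defined compaction through nodetool (Cassandra 3.4+ only).
# compact_sstables and NODETOOL are hypothetical names, not part of Cassandra.
compact_sstables() {
  if [ "$#" -eq 0 ]; then
    echo "usage: compact_sstables <path-to-Data.db>..." >&2
    return 1
  fi
  # NODETOOL can point at a specific nodetool binary; defaults to the one on PATH.
  "${NODETOOL:-nodetool}" compact --user-defined "$@"
}

# Usage (needs a running node):
# compact_sstables /path/to/table-directory/mc-1-big-Data.db
```

The rest of this post sticks to the JMX route, which also works on older versions.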

To demonstrate the process, we will use a local install of Cassandra 3.0.9. I’ve loaded the movielens dataset (see the MovieLens project) using the CDM utility (to read more about CDM, see this TLP blog post) and then forced a flush to disk:

  cdm install movielens
  nodetool flush

Calling nodetool flush ensures our memtables have been written to disk. If we didn’t do this, our data would still be sitting in memory, and compaction only operates on SSTables that are already on disk.

I’ve taken note of the data files for each table in the movielens keyspace:

jhaddad@rustyrazorblade ~/dev/cassandra$ find data/data/movielens/  -name '*Data.db'
data/data/movielens//movies-6728183094d311e68b105dbb96ed2de2/mc-1-big-Data.db
data/data/movielens//ratings_by_movie-6c2408d094d311e68b105dbb96ed2de2/mc-1-big-Data.db
data/data/movielens//ratings_by_user-69a85a7094d311e68b105dbb96ed2de2/mc-1-big-Data.db
data/data/movielens//users-68668ba094d311e68b105dbb96ed2de2/mc-1-big-Data.db

You can see in the above output we have a “users” directory with a data file mc-1-big-Data.db. We’ll need the full path later.

Now that we have SSTables on disk, let’s use JMX to invoke a compaction. To do this, we’ll first need to grab jmxterm. This is probably the trickiest part of the whole process as the download links on the original jmxterm page are broken. Start jmxterm by using the following command from the directory in which it was downloaded:

  java -jar jmxterm-1.0-alpha-4-uber.jar

To see all the commands available to you, use the help command (output truncated):

$>help
#following commands are available to use:
about    - Display about page
bean     - Display or set current selected MBean.
beans    - List available beans under a domain or all domains
...
open     - Open JMX session or display current connection
option   - Set options for command session
quit     - Terminate console and exit
run      - Invoke an MBean operation
set      - Set value of an MBean attribute

The first thing we want to do is open a connection to our Cassandra process. Cassandra’s default JMX port is 7199; pass it to the open command along with a host:

$>open localhost:7199
#Connection to localhost:7199 is opened

With the connection open, we can use the beans command to list the MBeans we can access. MBeans are the managed objects through which the database exposes its attributes and operations over JMX. I’ve restricted the listing to the org.apache.cassandra.db domain and trimmed the output to make it easier to find what we’re looking for, the CompactionManager:

$>beans -d org.apache.cassandra.db
#domain = org.apache.cassandra.db:
org.apache.cassandra.db:columnfamily=IndexInfo,keyspace=system,type=ColumnFamilies
org.apache.cassandra.db:columnfamily=aggregates,keyspace=system_schema,type=ColumnFamilies
...
org.apache.cassandra.db:type=CompactionManager
...

Now that we know the name of the MBean, we can invoke the run command, passing the MBean operation name forceUserDefinedCompaction and, as its argument, the path to one data file (or a comma-separated list of several):

$>run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction data/data/movielens//users-68668ba094d311e68b105dbb96ed2de2/mc-1-big-Data.db
#calling operation forceUserDefinedCompaction of mbean org.apache.cassandra.db:type=CompactionManager
#operation returns:
null

Not very exciting output unfortunately. It’s only after we take a look at the directory that we see the file number has changed from mc-1 to mc-2:

jhaddad@rustyrazorblade ~/dev/cassandra$ ls data/data/movielens/users-68668ba094d311e68b105dbb96ed2de2/*Data.db
data/data/movielens/users-68668ba094d311e68b105dbb96ed2de2/mc-2-big-Data.db
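The number that changed is the SSTable generation. If you want to pull it out in a script, here is a small sketch. The helper name sstable_generation is hypothetical, and it assumes the 3.0-style file name layout seen above (version, generation, format, component, separated by dashes):

```shell
# Print the generation number encoded in an SSTable file name.
# sstable_generation is a hypothetical helper, not a Cassandra tool.
# Assumes names shaped like mc-2-big-Data.db: <version>-<generation>-<format>-<component>.
sstable_generation() {
  # Strip the directory part first, since table directories also contain dashes.
  basename "$1" | cut -d- -f2
}

# Usage:
# sstable_generation data/data/movielens/users-68668ba094d311e68b105dbb96ed2de2/mc-2-big-Data.db
# prints 2
```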

Compacting multiple files is just a matter of passing them to the MBean, comma separated.
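Typing a long comma-separated list by hand is error prone, so here is a minimal sketch that builds it from a table directory. The helper name sstable_list is hypothetical, and it assumes none of the paths contain commas, spaces, or newlines:

```shell
# Join every Data.db file under a directory into one comma-separated string,
# suitable as the single argument to forceUserDefinedCompaction.
# sstable_list is a hypothetical helper; paths must not contain commas or newlines.
sstable_list() {
  find "$1" -name '*Data.db' | sort | paste -sd, -
}

# Usage:
# sstable_list data/data/movielens/users-68668ba094d311e68b105dbb96ed2de2
```

The result can be pasted after forceUserDefinedCompaction in the run command shown earlier.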

At this point you should be comfortable initiating a user defined compaction through JMX using jmxterm. I recommend trying this out on your laptop to get familiar with the process and to explore the other MBeans that are available. If you prefer a visual tool to a command line one, check out jconsole, which comes with the Oracle JDK but is generally less useful in production.
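Once the interactive steps feel familiar, the same call can be scripted by piping a command into jmxterm. This is only a sketch: the helper name udc_run_line is hypothetical, and I’m assuming jmxterm’s -l (connection URL) and -n (non-interactive) options behave as described in its help output, so check them against your copy before relying on this:

```shell
# Print the jmxterm "run" line for a user defined compaction.
# udc_run_line is a hypothetical helper; $1 is one Data.db path or a
# comma-separated list of them.
udc_run_line() {
  printf 'run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction %s\n' "$1"
}

# Usage sketch (needs a running node and the jmxterm jar in the current directory;
# -l and -n are assumed to be jmxterm's URL and non-interactive flags):
# udc_run_line data/data/movielens/users-68668ba094d311e68b105dbb96ed2de2/mc-2-big-Data.db \
#   | java -jar jmxterm-1.0-alpha-4-uber.jar -l localhost:7199 -n
```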
