tl;dr

When using the Datastax driver, use long-lived sessions!

Symptom

We saw errors in our Java services coming from the Datastax driver trying (and retrying) to connect to Cassandra. The most frequent error was the following:

ERROR [2016-09-07 21:07:37,833] com.datastax.driver.core.Cluster: Unknown error during reconnection to /xx.x.x.xxx:9042, scheduling retry in 600000 milliseconds
java.lang.IllegalArgumentException: rpc_address is not a column defined in this metadata 
at com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273) ~[cassandra-driver-core-2.1.2.jar:na]
at com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279) ~[cassandra-driver-core-2.1.2.jar:na]

Our monitoring showed Cassandra was operating normally. Googling threw up this bug in Cassandra (version 2.1.3), which arises when Cassandra is under load, but our web services were receiving very few requests. Digging further into the service logs, the initial error showed the service failing to connect to Cassandra:

ERROR [2016-09-08 14:22:47,196] com.datastax.driver.core.Session: Error creating pool to /xx.xx.x.xx:9042
 java.net.ConnectException: Connection refused: /xx.xx.x.xx:9042

Problem

One of my colleagues mentioned an issue he had previously encountered when load testing one of his services. He kindly pointed me to the Datastax docs on the driver (always good to read!)...

"While the API of Session is centered around query execution, the Session does some heavy lifting behind the scenes as it manages the per-node connection pools. The Session instance is a long-lived object and it should not be used in a request/response short-lived fashion. Basically, you will want to share the same cluster and session instances across your application."

Sure enough, we were naively opening a session for each query to the Cassandra cluster:

public static <T> T doInCassandraSession(Function<Session, T> f, Cluster cluster) {
   // Opens a brand new session (and its per-node connection pools) for every single query...
   Session session = cluster.connect();
   try {
      return f.apply(session);
   } catch (DriverException e) {
      LOGGER.error("Failed to load data from cassandra", e);
      throw new RuntimeException("Failed to load data from cassandra", e);
   } finally {
      // ...and tears it all down again.
      session.close();
   }
}

Reproduce

So was our repeated opening of sessions, with their “heavy lifting”, overloading Cassandra? We recreated the errors in our test environment with a quick load test using the excellent Apache Bench tool:

ab -n 10000 -c 10 localhost:8080/xxxxx

We also saw in the Cassandra machine logs during the test that the firewall's SYN flood protection was tripped by the aggressive client and was blocking packets to the Cassandra port (9042)!

2016-09-08T14:10:15.136454+01:00 xx-xx-xx-xx kernel: [14684240.267321] TCP SYNFLOOD: IN=eth0 OUT= MAC=xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx SRC=xx.xx.xx.xxx DST=xx.xx.xx.xx LEN=52 TOS=0x00 PREC=0x00 TTL=63 ID=34028 DF PROTO=TCP SPT=40196 DPT=9042 WINDOW=29200 RES=0x00 SYN URGP=0

The cluster was definitely being hammered!

Fix

We fixed up the services to use long-lived sessions (created when the service starts up and shared throughout the service, as per the docs' recommendation). After deploying to the test environment and rerunning the same load test, we happily saw no errors, and the firewall logs were also clean.
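
As a rough illustration, here is a minimal sketch of the long-lived pattern. The class and field names are hypothetical rather than our actual service code; it just assumes the standard 2.1.x driver API (Cluster.builder(), cluster.connect(), session.execute()):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class CassandraClient {

   // Built once at service start-up; the driver manages the per-node connection pools inside.
   private final Cluster cluster;
   private final Session session;

   public CassandraClient(String contactPoint) {
      this.cluster = Cluster.builder().addContactPoint(contactPoint).build();
      this.session = cluster.connect();
   }

   // Every request reuses the same session instead of opening a new one.
   public ResultSet execute(String cql) {
      return session.execute(cql);
   }

   // Called once at service shutdown.
   public void close() {
      session.close();
      cluster.close();
   }
}

Creating the Cluster and Session once means the connection pools are opened a single time rather than on every request, which is exactly the heavy lifting the docs warn about.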