Cassandra Amazon EC2, Read Performance experiments


I need some help improving Cassandra read performance. I am concerned about degradation of read performance as the size of the column family increases. We have the following stats on single-node Cassandra.

Operating System: Linux - CentOS release 5.4 (Final)
Cassandra version: apache-cassandra-1.1.0
Java version: "1.6.0_14" Java(TM) SE Runtime Environment (build 1.6.0_14-b08) Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)

Cassandra Configuration: (cassandra.yaml)

  • rpc_server_type: hsha
  • disk_access_mode: mmap
  • concurrent_reads: 64
  • concurrent_writes: 32

Platform: Amazon EC2/RightScale m1.xlarge instance with 4 ephemeral disks in RAID 0 (15 GB total memory, 4 virtual cores with 2 ECU each, 8 ECU total).


Experiment configuration: I have run some experiments with GC settings.

Cassandra config:
10 GB of RAM is allocated to the Cassandra heap; the heap new size is 3500 MB.

JVM Config:
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=1000"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=0"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=40"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedOops"



Result stats from OpsCenter community 2.0:

Read Requests 208 to 240 per second
Write Requests 18 to 28 per second
OS Load 24.5 to 25.85
Write Request Latency 127 to 160 micros
Read Request Latency 82202 to 94612 micros
OS Sent Network Traffic 44646 KB avg per second
OS Received Network Traffic 4338 KB avg per second
OS Disk Queue Size 13 to 15 requests
Read Requests Pending 25 to 32

OS Disk latency 48 to 56 ms
OS Disk Read Throughput 4.6 Mb per second
Disk IOPs Reads 420 per second

IOWait 80 % CPU avg

Idle 13 % CPU avg

The row cache is disabled.


The Column Family
One of the column families, which I only read from, was created through the CLI:

create column family XColFam
with column_type = 'Standard'
and comparator = 'CompositeType(BytesType,IntegerType)';

Column family SSTable Size = 7.10 GB, SSTable Count = 2

The XColFam column family has an estimated 59,499,904 row keys (estimated through mx4j-tools; most are UTF-8 literals of varying length). The rows are thin, and the column values are currently 0 bytes.

Most rows should have a very small number of columns, maybe 1 to 10. The first component of the column name is approximately 20 to 30 bytes, and the second is an 8-byte integer. The second component is dynamic and could repeat, but the probability is low. The first component repeats across rows in several variants, but the number of columns per row can differ.

I have tried SnappyCompression to compress the column family but there was no change in size.

I have a scheduled service that runs for hours with 20 threads and makes random read requests for multiple keys (for now, 2 keys per request) against this column family. It reads full rows, with no column slices or the like.

I think it is not performing well now, because it is processing too few requests per minute. It worked better when the column family was smaller, at around 3 to 4 GB.

I am afraid that read performance degrades too fast as the column family grows.

I have also tweaked some GC and memory settings, because before that I was seeing a lot of GC activity and CPU usage. When the data size was smaller, there was only a little iowait, and it came in waves.


How can I improve Cassandra's read performance? Your suggestions will be appreciated.


There are 2 best solutions below

2
On

Cassandra is relatively I/O dependent, and EC2 instances have "insufficient" I/O by design (Xen virtualization). My first recommendation is to run Cassandra on real hardware, where you have control; e.g. you can use an SSD for the commit log. Look at the Cassandra hardware proposals.

However, switching to your own hardware is a rather radical option. To stay with Amazon, try EBS:

Amazon Elastic Block Store (EBS) provides block level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are network-attached, and persist independently from the life of an instance. Amazon EBS provides highly available, highly reliable, predictable storage volumes that can be attached to a running Amazon EC2 instance and exposed as a device within the instance. Amazon EBS is particularly suited for applications that require a database, file system, or access to raw block level storage.

Amazon EBS allows you to create storage volumes from 1 GB to 1 TB that can be mounted as devices by Amazon EC2 instances. Multiple volumes can be mounted to the same instance. Amazon EBS enables you to provision a specific level of I/O performance if desired, by choosing a Provisioned IOPS volume. This allows you to predictably scale to thousands of IOPS per Amazon EC2 instance.

Also check out Cassandra Performance Testing on EC2
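
To make the I/O-bound point concrete, here is a rough back-of-the-envelope check in Python using only the figures from the question. It is a sketch under stated assumptions, not a measurement: the function names are made up for illustration, all memory outside the JVM heap is assumed to be available to the OS page cache (optimistic), and each of the 20 client threads is assumed to keep exactly one request in flight.

```python
# Figures copied from the question; everything derived below is an estimate.
TOTAL_RAM_GB = 15.0        # m1.xlarge total memory
HEAP_GB = 10.0             # JVM heap given to Cassandra
SSTABLE_GB = 7.10          # on-disk size of XColFam
CLIENT_THREADS = 20        # reader threads in the scheduled service
READ_LATENCY_S = (0.082202, 0.094612)  # reported read latency range, in seconds


def page_cache_coverage(total_ram_gb, heap_gb, data_gb):
    """Upper bound on the fraction of the data the OS page cache can hold.

    Assumes all non-heap memory is usable for caching, which is optimistic.
    """
    free_gb = max(total_ram_gb - heap_gb, 0.0)
    return min(free_gb / data_gb, 1.0)


def closed_loop_throughput(threads, latency_s):
    """Little's law: requests/s = requests in flight / latency per request.

    Assumes each thread keeps exactly one request in flight.
    """
    return threads / latency_s


coverage_now = page_cache_coverage(TOTAL_RAM_GB, HEAP_GB, SSTABLE_GB)
coverage_before = page_cache_coverage(TOTAL_RAM_GB, HEAP_GB, 4.0)
rates = [closed_loop_throughput(CLIENT_THREADS, s) for s in READ_LATENCY_S]

print(f"page cache coverage at 7.10 GB: {coverage_now:.0%}")
print(f"page cache coverage at 4 GB:    {coverage_before:.0%}")
print(f"implied reads/s: {min(rates):.0f} to {max(rates):.0f}")
```

Under these assumptions, at most about 70% of the 7.10 GB column family fits in memory outside the heap, whereas at 4 GB all of it could. The implied rate of about 211 to 243 reads per second is close to the 208 to 240 reported by OpsCenter, which suggests throughput is capped by per-read latency rather than by the client.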

0
On

Short Answer: Row Cache and Key Caches.

If your data contains subsets that are read frequently, as in most systems, try using row caches and key caches.

The row cache is an in-memory cache that stores frequently read rows completely in memory. Please keep in mind that this may not have the desired effect if your reads are spread out across the data.

The key cache is generally a better fit, as it stores only the partition keys and their offsets on disk. A hit generally lets Cassandra skip a lookup (no need to consult the partition index and partition summary).

Try enabling the key cache for the keyspace and table, then check your performance.
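
A small simulation can illustrate why a row cache helps little when reads are spread out. This is a minimal sketch using a generic LRU cache, not Cassandra's actual cache implementation; the key space, cache size, and the "hot set" workload are all made-up numbers chosen for illustration.

```python
import random
from collections import OrderedDict


def lru_hit_rate(keys, cache_size):
    """Replay a sequence of key reads through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for key in keys:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(keys)


rng = random.Random(42)
NUM_KEYS = 10_000     # hypothetical key space, far smaller than the real one
CACHE_SIZE = 1_000    # cache holds 10% of the keys
NUM_READS = 100_000

# Uniform: every key is equally likely, like the random reads in the question.
uniform = [rng.randrange(NUM_KEYS) for _ in range(NUM_READS)]

# Skewed: 90% of reads go to a hot set of 500 keys (an assumed workload).
skewed = [rng.randrange(500) if rng.random() < 0.9 else rng.randrange(NUM_KEYS)
          for _ in range(NUM_READS)]

uniform_rate = lru_hit_rate(uniform, CACHE_SIZE)
skewed_rate = lru_hit_rate(skewed, CACHE_SIZE)
print(f"uniform hit rate: {uniform_rate:.2f}")
print(f"skewed hit rate:  {skewed_rate:.2f}")
```

With uniform reads the hit rate stays close to the ratio of cache size to key space, so a cache covering 10% of the rows saves roughly 10% of the reads. With a hot subset that fits in the cache, most reads are served from memory.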
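
As a rough way to reason about the benefit, the sketch below models the disk seeks per row read as a function of the key cache hit rate. The model is a simplifying assumption, not a measurement: a hit costs one seek (data file only), a miss costs two (index file, then data file), and it ignores the OS page cache and bloom filters. The function name and the `sstables_per_read` parameter are hypothetical.

```python
def expected_seeks_per_read(key_cache_hit_rate, sstables_per_read=1):
    """Rough number of disk seeks for one row read.

    Simplified model (an assumption, not a measurement):
      - key cache hit:  1 seek  (data file only)
      - key cache miss: 2 seeks (index file, then data file)
    and every SSTable consulted pays that cost independently.
    """
    if not 0.0 <= key_cache_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    seeks_per_sstable = 1.0 + (1.0 - key_cache_hit_rate)
    return sstables_per_read * seeks_per_sstable


for rate in (0.0, 0.5, 0.9):
    print(f"hit rate {rate:.0%}: {expected_seeks_per_read(rate):.1f} seeks per read")
```

In this model a perfect key cache halves the seeks per read at best, so on a disk-bound node it reduces I/O but does not remove the data file read itself.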