Huge performance drop in cassandra-orm after a million records

I'm using the cassandra-orm plugin (cassandra-orm:0.4.5) to migrate clicks from a Postgres DB to Cassandra. (I know I could use a raw data import, but I want to make use of the groupBy queries and explicit indexes maintained by the plugin.)

The migration procedure is simple: I select a batch of clicks from Postgres (via GORM) and then flush them to Cassandra. Every click is a new record, so a new object is created in Grails and saved to Cassandra. With 20 threads I was able to reach a throughput of 2000 clicks/sec. After importing 5 million clicks, the performance degraded dramatically, down to 50 clicks/sec.
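
For context, each worker loop looks roughly like this (a minimal sketch; PgClick, the field names, and the batch size are illustrative, not the actual code or plugin API):

int offset = 0
final int batchSize = 1000
while (true) {
    // page through the source table via GORM
    def batch = PgClick.list(offset: offset, max: batchSize, sort: 'id')
    if (!batch) break
    batch.each { src ->
        // every row becomes a new cassandra-orm managed object;
        // save() also maintains the groupBy and explicit indexes
        new Click(timestamp: src.timestamp, url: src.url).save()
    }
    offset += batchSize
}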

I did some profiling and found that 19 threads were waiting (parked) while one thread was performing a rehash on Groovy's AbstractConcurrentMapBase.

Stack trace for the waiting threads:

Name: pool-4-thread-2
State: WAITING on org.codehaus.groovy.util.ManagedConcurrentMap$Segment@5387f7af
Total blocked: 45,027  Total waited: 55,891

Stack trace: 
 sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
org.codehaus.groovy.util.LockableObject.lock(LockableObject.java:34)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:101)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:97)
org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:35)
org.codehaus.groovy.runtime.metaclass.ThreadManagedMetaBeanProperty$ThreadBoundGetter.invoke(ThreadManagedMetaBeanProperty.java:180)
groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:1604)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1140)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:3332)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1152)
com.nosql.Click.getProperty(Click.groovy)

Stack trace for the rehashing thread:

Name: pool-4-thread-11
State: RUNNABLE
Total blocked: 46,544  Total waited: 57,433

Stack trace: 
 org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment.rehash(AbstractConcurrentMapBase.java:217)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:105)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:97)
org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:35)
org.codehaus.groovy.runtime.metaclass.ThreadManagedMetaBeanProperty$ThreadBoundGetter.invoke(ThreadManagedMetaBeanProperty.java:180)
groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:1604)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1140)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:3332)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1152)
com.fma.nosql.Click.getProperty(Click.groovy)

After hours of debugging I found that the issue is the dynamic property "_cassandra_cluster_", which is added to all plugin-managed objects:

// cluster property (_cassandra_cluster_)
clazz.metaClass."${CLUSTER_PROP}" = null

This property is stored internally in the ThreadManagedMetaBeanProperty instance2Prop map. When the dynamic property is accessed (def cluster = click._cassandra_cluster_), the click instance is put into the instance2Prop map under a soft reference. So far so good: soft references can be garbage collected, right? However, there seems to be a bug in the ManagedConcurrentMap implementation which disregards the garbage-collected elements and keeps rehashing and expanding the map (described here and here).
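
The growth is easy to reproduce outside the plugin. A minimal sketch (this pokes at Groovy-internal classes, so it is version-dependent; the behavior is implied by the stack traces above):

class Click { }
Click.metaClass._cassandra_cluster_ = null

// touching the dynamic property on many distinct instances registers
// each instance in the property's instance2Prop map (via a soft reference)
(1..100000).each { new Click()._cassandra_cluster_ }

MetaClassImpl metaClass = GroovySystem.metaClassRegistry.getMetaClass(Click)
def instanceMap = metaClass.getMetaProperty('_cassandra_cluster_', false).instance2Prop
// fullSize() keeps counting collected entries, size() only live ones;
// the gap between the two is what keeps triggering rehash/expand
println "full: ${instanceMap.fullSize()}, live: ${instanceMap.size()}"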

Workaround

Since the map is held internally at the class level, the only working solution at first was to restart the server. Eventually I developed a dirty workaround that clears the zombie elements from the internal map. The following code runs in a separate thread:

public void rehashClickSegmentsIfNecessary() {
    ManagedConcurrentMap instanceMap = lookupInstanceMap(Click.class, "_cassandra_cluster_")
    // fullSize() counts entries whose soft references have already been
    // collected; size() counts only live ones
    if (instanceMap.fullSize() - instanceMap.size() > 50000) {
        // we have more than 50 000 zombie references in the map
        rehashSegments(instanceMap)
    }
}

private void rehashSegments(ManagedConcurrentMap instanceMap) {
    org.codehaus.groovy.util.ManagedConcurrentMap.Segment[] segments = instanceMap.segments
    for (int i = 0; i < segments.length; i++) {
        // rehash() drops the collected entries; it must run under the
        // segment lock, just like the internal put() path
        segments[i].lock()
        try {
            segments[i].rehash()
        } finally {
            segments[i].unlock()
        }
    }
}

private ManagedConcurrentMap lookupInstanceMap(Class clazz, String prop) {
    // reach into Groovy's metaclass registry to get the
    // ThreadManagedMetaBeanProperty backing the dynamic property,
    // then read its internal instance2Prop map
    MetaClassRegistry registry = GroovySystem.metaClassRegistry
    MetaClassImpl metaClass = registry.getMetaClass(clazz)
    return metaClass.getMetaProperty(prop, false).instance2Prop
}
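
For completeness, this is roughly how I run it in the background (a hypothetical scheduling snippet, not plugin code; the one-minute interval is arbitrary and should be tuned to your allocation rate):

import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// run the zombie check periodically on a single daemon thread so it
// never blocks application shutdown
def cleaner = Executors.newSingleThreadScheduledExecutor { r ->
    Thread t = new Thread(r, 'instance2Prop-cleaner')
    t.daemon = true
    return t
}
cleaner.scheduleWithFixedDelay(
        { rehashClickSegmentsIfNecessary() } as Runnable,
        1, 1, TimeUnit.MINUTES)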

Do you have any production experience with cassandra-orm or any other Grails plugin connecting to Cassandra?
