Uneven load distribution after data import to DSE Search cluster

I am experimenting with DataStax Enterprise Search. I have a two-node cluster and I am importing data using the Dataimport capability of the Solr console. I have virtual nodes disabled (num_tokens = 1 in cassandra.yaml), as per the "Configuring Solr" doc (http://www.datastax.com/docs/datastax_enterprise3.2/solutions/dse_search_schema#configuring-solr). My simplified schema is as follows:

<schema name="spatial" version="1.1">

<types>
    <fieldType name="string" class="solr.StrField" omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" omitNorms="true"/>
    <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> 
    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" omitNorms="true"/>
    <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"/>
    <fieldType name="binary" class="solr.BinaryField"/>

    <!-- A specialized field for geospatial search. If indexed, this fieldType must not be multivalued. -->
    <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
</types>

  <fields>
      <field name="id"  type="string" indexed="true"  stored="true"/>
      <field name="objectid" type="tint" indexed="true" stored="true" required="true" multiValued="false" />
      <field name="guwi" type="string" indexed="true" stored="true" required="false" multiValued="false" />
      <field name="country" type="string" indexed="true" stored="true" required="false" multiValued="false" />
      <field name="region" type="string" indexed="true" stored="true" required="false" multiValued="false" />
      <field name="latlong" type="location" indexed="true" stored="false"/>
  </fields>
  <defaultSearchField>objectid</defaultSearchField>
  <uniqueKey>id</uniqueKey>
</schema>

Data import succeeds. However, when I run "nodetool status" I can see that the load is not evenly distributed across my two nodes but is all concentrated on the node I used to perform the data import. I tried changing uniqueKey to a composite key, like (id,latlong), or even just latlong, but it does not seem to change the load distribution. Am I missing something?

Thanks, Leon

There are 1 best solutions below

Your problem, as seen in the nodetool output, is that the two nodes have tokens that are too close together. Because of this, node (10.30.161.137) is responsible for 94% of the token range.
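To see why closely spaced tokens produce this kind of skew, here is a rough sketch in Python of how ownership follows from token positions. It assumes the usual single-token ring model, where each node owns the range from the previous node's token up to its own token, wrapping around the ring. The function name ownership and the token values in the usage note are hypothetical and chosen for illustration; they are not taken from the actual nodetool output.

```python
def ownership(tokens, ring_size=2**64):
    """Return {token: fraction of the ring owned} for single-token nodes.

    Assumes each node owns the range (previous token, own token],
    wrapping around a ring with ring_size positions.
    """
    ordered = sorted(tokens)
    shares = {}
    for idx, token in enumerate(ordered):
        # idx - 1 is -1 for the first node, which wraps to the last token.
        previous = ordered[idx - 1]
        span = (token - previous) % ring_size
        if span == 0:
            # A single node owns the whole ring.
            span = ring_size
        shares[token] = span / ring_size
    return shares


if __name__ == "__main__":
    # Hypothetical ring of 100 positions with tokens only 6 apart.
    print(ownership([0, 6], ring_size=100))
    # Evenly spaced tokens on the same ring.
    print(ownership([0, 50], ring_size=100))
```

On a ring of 100 positions, tokens 0 and 6 give one node 94% of the range and the other 6%, while tokens 0 and 50 split it evenly.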

This is most likely because when you set num_tokens = 1 you did not also set initial_token. When initial_token isn't set, undesirable token values may be assigned.

initial_token (Default: disabled) Used in the single-node-per-token architecture, where a node owns exactly one contiguous range in the ring space. If you haven't specified num_tokens or have set it to the default value of 1, you should always specify this parameter when setting up a production cluster for the first time and when adding capacity. For more information, see this parameter in the Cassandra 1.1 Node and Cluster Configuration documentation.

Configuring Cassandra

A token calculator is available here: Token Generator
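If you would rather compute the values yourself, a minimal sketch of the usual even-spacing formula follows. It assumes the standard token ranges of the two common partitioners (Murmur3Partitioner: -2^63 to 2^63 - 1; RandomPartitioner: 0 to 2^127 - 1); check which partitioner your cluster actually uses before applying the result. The function name initial_tokens is hypothetical.

```python
def initial_tokens(node_count, partitioner="murmur3"):
    """Return evenly spaced initial_token values, one per node.

    Assumes Murmur3Partitioner tokens span -2**63 .. 2**63 - 1 and
    RandomPartitioner tokens span 0 .. 2**127 - 1.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if partitioner == "murmur3":
        step = 2**64 // node_count
        return [i * step - 2**63 for i in range(node_count)]
    if partitioner == "random":
        step = 2**127 // node_count
        return [i * step for i in range(node_count)]
    raise ValueError("unknown partitioner: " + partitioner)


if __name__ == "__main__":
    # One value per node; put each into that node's cassandra.yaml
    # as initial_token.
    for node, token in enumerate(initial_tokens(2)):
        print("node", node, "initial_token:", token)
```

For a two-node cluster this yields -9223372036854775808 and 0 with Murmur3Partitioner, or 0 and 85070591730234615865843651857942052864 with RandomPartitioner. Changing initial_token in cassandra.yaml does not move a node that has already joined the ring; an existing node has to be moved to its new token (for example with nodetool move).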