ElasticSearch IndexShardGatewayRecoveryException


ElasticSearch is using too much CPU. Today I found these logs. Can anyone help me with this?

[2015-06-24 16:16:52,309][WARN ][cluster.action.shard     ] [Bereet] [logstash-2015.06.24][0] received shard failed for [logstash-2015.06.24][0], node[ucXcuxuQQTSz_leAzWq6mQ], [P], s[INITIALIZING], indexUUID [ieIR8uWLQHycnEC_szsNZQ], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[logstash-2015.06.24][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchIllegalArgumentException[No version type match [99]]; ]]
[2015-06-24 16:16:52,332][WARN ][cluster.action.shard     ] [Bereet] [logstash-2015.06.24][0] received shard failed for [logstash-2015.06.24][0], node[ucXcuxuQQTSz_leAzWq6mQ], [P], s[INITIALIZING], indexUUID [ieIR8uWLQHycnEC_szsNZQ], reason [master [Bereet][ucXcuxuQQTSz_leAzWq6mQ][iZ23cth9hh5Z][inet[/10.162.41.162:9300]] marked shard as initializing, but shard is marked as failed, resend shard failure]
[2015-06-24 16:16:52,339][WARN ][index.engine             ] [Bereet] [logstash-2015.06.24][4] failed to sync translog
[2015-06-24 16:16:52,345][WARN ][indices.cluster          ] [Bereet] [[logstash-2015.06.24][4]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [logstash-2015.06.24][4] failed to recover shard
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
        at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
        ... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [116]
        at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
        at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:376)
        at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
        ... 5 more

There are 2 solutions below.

For me this happened after a system crash (the disk had enough space).

There's now an official way to fix a corrupted translog using the provided elasticsearch-translog tool, but you may lose unindexed data, so I suggest backing up the translog first (e.g. for compliance reasons; with enough effort someone may want to analyze it at some point). A copy command is sketched below, after the stop step.

First confirm the issue by running:

curl -XGET localhost:9200/_cluster/allocation/explain?pretty

An easier way to find affected shards:

curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED

Stop Elasticsearch.
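
If you want the backup suggested above, copying the shard's translog directory while the node is stopped is enough. This is only a sketch: it assumes the same example data path and shard used in the truncate step below, and the backup destination is just a placeholder.

mkdir -p /root/translog-backup
cp -a /var/lib/elasticsearch/nodes/0/indices/qABaulDIRJyT06G3rBFfrC/2/translog /root/translog-backup/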

Let's say shard number 2 of the index with UUID qABaulDIRJyT06G3rBFfrC is affected (your paths may vary); run:

/usr/share/elasticsearch/bin/elasticsearch-translog truncate -d /var/lib/elasticsearch/nodes/0/indices/qABaulDIRJyT06G3rBFfrC/2/translog

Make sure the newly created files belong to the proper user/group in case you ran the tool as root (run this from the shard directory, e.g. /var/lib/elasticsearch/nodes/0/indices/qABaulDIRJyT06G3rBFfrC/2/):

chown -R elasticsearch:elasticsearch translog*

Start Elasticsearch. Finally, if it has stopped attempting to reuse the shard, run the following command to force a retry of the failed allocation:

curl -XPOST localhost:9200/_cluster/reroute?retry_failed=true

The earlier command that lists unassigned shards should no longer return any results.


A translog appears to be corrupt: TranslogCorruptedException[translog corruption while reading from stream]

I believe that simply deleting the corrupt translog(s) (within the node's /indices/${index_name} sub-directories) should resolve this specific problem. Further problems might be revealed once the corrupt translog(s) have been removed or fixed.
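
A minimal sketch of that approach, assuming a default data path such as /var/lib/elasticsearch/nodes/0 (the layout differs across Elasticsearch versions, so ${index_name} and ${shard} below are placeholders, and the backup destination is just an example). Moving the files aside instead of deleting them outright lets you restore them later if needed:

# run only while the node is stopped
mkdir -p /root/translog-backup
mv /var/lib/elasticsearch/nodes/0/indices/${index_name}/${shard}/translog/* /root/translog-backup/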

Here is a potentially helpful link: http://unpunctualprogrammer.com/2014/05/13/corrupt-elasticsearch-translogs/