MongoDB SDK Failover not working


I have set up a replica set using three machines (192.168.122.21, 192.168.122.147 and 192.168.122.148) and I am interacting with the MongoDB Cluster using the Java SDK:

ArrayList<ServerAddress> addrs = new ArrayList<ServerAddress>();
addrs.add(new ServerAddress("192.168.122.21", 27017));
addrs.add(new ServerAddress("192.168.122.147", 27017));
addrs.add(new ServerAddress("192.168.122.148", 27017));
this.mongoClient = new MongoClient(addrs);
this.db = this.mongoClient.getDB(this.db_name);
this.collection = this.db.getCollection(this.collection_name);

After the connection is established I do multiple inserts of a simple test document:

    for (int i = 0; i < this.inserts; i++) {
        try {
            this.collection.insert(new BasicDBObject(String.valueOf(i), "test"));
        } catch (Exception e) {
            System.out.println("Error on inserting element: " + i);
            e.printStackTrace();
        }
    }

When simulating a node crash of the master server (power-off), the MongoDB cluster does a successful failover:

       19:08:03.907+0100 [rsHealthPoll] replSet info 192.168.122.21:27017 is down (or slow to respond): 
       19:08:03.907+0100 [rsHealthPoll] replSet member 192.168.122.21:27017 is now in state DOWN
       19:08:04.153+0100 [rsMgr] replSet info electSelf 1
       19:08:04.154+0100 [rsMgr] replSet couldn't elect self, only received -9999 votes
       19:08:05.648+0100 [conn15] replSet info voting yea for 192.168.122.148:27017 (2)
       19:08:10.681+0100 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
       19:08:10.910+0100 [rsHealthPoll] replset info 192.168.122.21:27017 heartbeat failed, retrying
       19:08:16.394+0100 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
       19:08:22.876+.
       19:08:22.912+0100 [rsHealthPoll] replset info 192.168.122.21:27017 heartbeat failed, retrying
       19:08:23.623+0100 [SyncSourceFeedbackThread] replset setting syncSourceFeedback to 192.168.122.148:27017
       19:08:23.917+0100 [rsHealthPoll] replSet member 192.168.122.148:27017 is now in state PRIMARY

This is also recognized by the MongoDB Driver on the Client Side:

       Dec 01, 2014 7:08:16 PM com.mongodb.ConnectionStatus$UpdatableNode update
       WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: Read timed out
       WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: couldn't connect to [/192.168.122.21:27017]  bc:java.net.SocketTimeoutException: connect timed out
       Dec 01, 2014 7:08:36 PM com.mongodb.DBTCPConnector setMasterAddress
       WARNING: Primary switching from /192.168.122.21:27017 to /192.168.122.148:27017

But it still keeps trying to connect to the old node (forever):

       Dec 01, 2014 7:08:50 PM com.mongodb.ConnectionStatus$UpdatableNode update
       WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: couldn't connect to [/192.168.122.21:27017] bc:java.net.NoRouteToHostException: No route to host
       .....
       Dec 01, 2014 7:10:43 PM com.mongodb.ConnectionStatus$UpdatableNode update
       WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException -message: couldn't connect to [/192.168.122.21:27017] bc:java.net.NoRouteToHostException: No route to host

The document count in the database stays the same from the moment the primary fails, even after a secondary has become primary. Here is the output from the same node during the process:

"rs0":SECONDARY> db.test_collection.find().count()
12260161

"rs0":PRIMARY> db.test_collection.find().count()
12260161

Update: With WriteConcern Unacknowledged it works as designed. Insert operations continue on the new master, and all operations issued during the election process are lost.

With WriteConcern Acknowledged, the operation seems to wait indefinitely for an ACK from the crashed master. This would explain why the program continues once the crashed server boots up again and joins the cluster as a secondary. In my case, though, I don't want the driver to wait forever; it should raise an error after a certain time.

Update: WriteConcern Acknowledged also works as expected when I kill the mongod process on the primary. In this case the failover takes only about 3 seconds. No inserts are done during this time, and once the new primary is elected the insert operations continue.

So I only get the problem when simulating a node failure (power off/network down). In this case the operation hangs until the failed node starts up again.
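The difference between the two failure modes can be reproduced without MongoDB. When only the mongod process is killed, the operating system is still running and closes the socket, so a blocked read on the client ends immediately. A powered-off host sends nothing at all, so a read without a timeout has nothing to wake it up. The following is a minimal sketch using plain `java.net`; the class `SilentPeerDemo` and the method `awaitReply` are hypothetical names, and a local server socket that never answers stands in for the powered-off primary.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SilentPeerDemo {

    /**
     * Connects to host:port and waits for one byte of reply.
     * Returns "reply", "closed" (the peer closed the connection) or "timeout".
     */
    static String awaitReply(InetAddress host, int port,
                             int connectTimeoutMs, int readTimeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            // Bounded connect: fails with an exception instead of hanging.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            // Bounded read: a value of 0 here would mean "block forever".
            socket.setSoTimeout(readTimeoutMs);
            InputStream in = socket.getInputStream();
            try {
                return in.read() == -1 ? "closed" : "reply";
            } catch (SocketTimeoutException e) {
                return "timeout";
            }
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Stand-in for the powered-off primary: the OS completes the connection
        // through the listen backlog, but no one ever sends a reply.
        try (ServerSocket silent = new ServerSocket(0, 1, loopback)) {
            System.out.println(awaitReply(loopback, silent.getLocalPort(), 1000, 500));
        }
    }
}
```

With the read timeout set, the call gives up after 500 ms; with `setSoTimeout(0)` the same read would block for as long as the peer stays silent, which matches the hang described above.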


There are 2 best solutions below

BEST ANSWER

Explicitly specifying a connection timeout value solved the problem. See also: http://api.mongodb.org/java/2.7.0/com/mongodb/MongoOptions.html
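For reference, a sketch of what this could look like with the `MongoClient` setup from the question. It assumes a 2.x driver that ships `MongoClientOptions` (the counterpart of the linked `MongoOptions` for `MongoClient`). The timeout values are arbitrary examples, the class name `TimeoutConfigSketch` is hypothetical, and setting `socketTimeout` in addition to `connectTimeout` is an assumption that goes beyond what the answer states.

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;

import java.util.Arrays;
import java.util.List;

public class TimeoutConfigSketch {
    public static void main(String[] args) throws Exception {
        // Same seed list as in the question.
        List<ServerAddress> addrs = Arrays.asList(
                new ServerAddress("192.168.122.21", 27017),
                new ServerAddress("192.168.122.147", 27017),
                new ServerAddress("192.168.122.148", 27017));

        // Both values are in milliseconds; 0 means "no timeout".
        // 5000 and 10000 are illustrative values, not recommendations.
        MongoClientOptions options = MongoClientOptions.builder()
                .connectTimeout(5000)   // give up on unreachable hosts
                .socketTimeout(10000)   // assumption: also bound blocked reads/writes
                .build();

        MongoClient mongoClient = new MongoClient(addrs, options);
        try {
            // ... obtain the DB and collection as in the question ...
        } finally {
            mongoClient.close();
        }
    }
}
```

With timeouts in place, an insert that gets no reply should fail with an exception, which the existing `catch` block in the insert loop would then report.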


Does your app still work? As far as I know, the driver will keep trying to connect to that server because it is still in your seed list. Your app should still work as long as one of the other servers in your seed list can gain primary status.