We are trying to set up a SolrCloud environment with 1 shard and 2 replicas (one of which is the leader). The cluster is coordinated by a 3-node ZooKeeper ensemble.
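For reference, the collection is created roughly like the following SolrJ sketch (a minimal example, not our exact code; the ZooKeeper host names, the collection name `mycollection`, and the configset name `myconfig` are placeholders):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollection {
    public static void main(String[] args) throws Exception {
        // Hypothetical 3-node ZooKeeper ensemble (host names are placeholders)
        List<String> zkHosts = Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181");

        try (CloudSolrClient client = new CloudSolrClient.Builder(zkHosts, Optional.empty()).build()) {
            // 1 shard, replicationFactor = 2 -> one leader replica + one non-leader replica
            CollectionAdminRequest
                .createCollection("mycollection", "myconfig", 1, 2)
                .process(client);
        }
    }
}
```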
The setup works fine under normal operation: data is replicated between the replicas at runtime.
Now we are trying to simulate failure behavior in two cases:
- CASE 1: Shutting down one of the replicas (tested in two scenarios: leader and non-leader)
- CASE 2: Cutting off the network so that the non-leader replica becomes unreachable
In both cases, data is continuously written to the SolrCloud cluster while the failure is in effect.
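The "continuous writes" in our test are essentially a loop like the following SolrJ sketch (simplified; the collection name, field names, and write rate are placeholders, and we commit explicitly per document just for the test):

```java
import java.util.Arrays;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ContinuousIndexer {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty()).build()) {

            // Keep writing documents while one replica is down / cut off from the network
            for (int i = 0; ; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("payload_s", "written at " + System.currentTimeMillis());

                client.add("mycollection", doc);
                client.commit("mycollection"); // explicit commit per document, test only

                Thread.sleep(1000);            // roughly one document per second
            }
        }
    }
}
```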
CASE 1: Replication starts as soon as the failed machine boots up again. The complete data set ends up on both replicas. Everything works fine.
CASE 2: Once reconnected to the network, the non-leader replica starts the recovery process, but for some reason the new data from the leader is not replicated onto the previously disconnected replica.
Comparing the logs of both cases, I don't understand why in CASE 2 Solr reports
RecoveryStrategy ###### currentVersions as populated while RecoveryStrategy ###### startupVersions=[[]] is empty,
whereas in CASE 1 RecoveryStrategy ###### startupVersions is filled with the same objects that appear in currentVersions in CASE 2.
The general question is: why does restarting Solr result in a successful recovery, while reconnecting the network does not?
Thanks for any tips / leads! Greg