We are currently trying to get a Cassandra cluster configuration working with Karaf on 6 VMs (3 at each site).
The cluster works as expected within one site: when you query the CQL shell manually, all of the QA nodes return the same count of transactions, but the NBY nodes (the other site) return different counts - see below.
DC1: APP3 = 228604, APP4 = 228604, SET = 228604
DC2: NBY11 = 228599, NBY16 = 228597, NBY-SET = 228601
We had the QA cluster working successfully and then added the second site as an empty database, so it took some time to replicate and synchronise. However, we had expected the counts to end up identical on all 6 VMs, not to vary even a little.
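For reference, the counts above are gathered by running the same count against each node in turn, roughly like this (the transactions table name is illustrative rather than our real schema, and the LIMIT is just set high enough to cover the whole table):

for ip in 192.168.1.1 192.168.1.2 192.168.1.3 192.168.2.1 192.168.2.2 192.168.2.3; do
    echo "== $ip =="
    # connect cqlsh directly to this node and run the count there
    echo "SELECT COUNT(*) FROM test_cluster.transactions LIMIT 1000000;" | cqlsh "$ip"
done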
We have set the Cassandra consistency level to LOCAL_QUORUM because we want to be able to take a site down easily without any impact to our service.
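(Our understanding is that with RF = 3 in each DC, LOCAL_QUORUM only needs floor(3/2) + 1 = 2 replica acknowledgements in the coordinator's own data centre, so the other site being completely down should not block reads or writes - please correct us if we have misunderstood that.)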
Querying from the CQL shell also doesn't always work first time - it sometimes times out. The error isn't much to go on - see below:
Request did not complete within rpc_timeout.
After a few attempts it works fine.
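Our assumption is that the rpc_timeout message is just the CQL shell hitting the request timeouts in cassandra.yaml - a COUNT(*) with no WHERE clause has to scan every partition, so presumably it is slow by nature. This is how we check the current settings (the cassandra.yaml path is simply where our installation puts it; read_request_timeout_in_ms and range_request_timeout_in_ms look like the relevant ones):

grep 'request_timeout' /etc/cassandra/cassandra.yaml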
The counts above were taken on Friday, but we have since re-run the query and the figures now differ even more - see below:
DC1: APP3 = 228603, APP4 = 228601, SET = 228604
DC2: NBY11 = 228602, NBY16 = 228599, NBY-SET = 228599
Is using the CQL shell the best way to check that synchronisation has completed? Is there another way to do this that would give us more accurate results? Is this working as designed?
The QA cluster seemed to work as expected as a single site - it is only since we added the new site that we have seen the issues above.
Happy to provide more information as required, so please let me know if anyone has come across this before. There is some more explanation below, cut and pasted from a document written by one of our technical sysadmins.
(Data centre names and IPs changed)
Initial config: One Data Centre (DC1), 3 nodes, NetworkTopologyStrategy, PropertyFileSnitch
Second Data Centre (DC2), also with 3 nodes, configured and tested OK as a standalone cluster. Cassandra then stopped on all nodes, the PropertyFileSnitch file modified on all 6 nodes, cassandra.yaml modified to correctly list one seed per DC, and the cluster name on the second data centre adjusted to match the first (see the cassandra.yaml excerpt after these steps).
auto_bootstrap set to false on the 3 nodes in DC2
sudo rm -rf /var/lib/cassandra/* - done on all 3 nodes in DC2
Cassandra started on all 3 nodes in DC2
Keyspace altered so that data is replicated to all nodes, which is what we need for our intended purposes; LOCAL_QUORUM used for consistency:
ALTER KEYSPACE test_cluster WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 3};
nodetool rebuild run on all 3 new nodes in DC2, with DC1 specified as the source to ensure they streamed from that data centre; all completed successfully.
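The cassandra.yaml changes referred to above were along these lines on the DC2 nodes (the cluster name and seed IPs shown are illustrative, not our real values):

cluster_name: 'OurCluster'              # adjusted on DC2 to match DC1
endpoint_snitch: PropertyFileSnitch
auto_bootstrap: false                   # set on the DC2 nodes only, before the rebuild
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.1.1,192.168.2.1"   # one seed per DC

and the rebuild command run on each DC2 node was:

nodetool rebuild DC1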
PropertyFileSnitch file (cassandra-topology.properties), with names/IPs changed:
# Cassandra Node IP=Data Center:Rack
192.168.1.1=DC1:DC1-LS2
192.168.1.2=DC1:DC1-LS2
192.168.1.3=DC1:DC1-LS2
192.168.2.1=DC2:DC2-LS1
192.168.2.2=DC2:DC2-LS1
192.168.2.3=DC2:DC2-LS2
# default for unknown nodes
default=DC1:r1
nodetool status output (truncated):
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Rack
UN 192.168.2.2 120.59 MB 16.2% DC2-LS1
UN 192.168.2.1 120.62 MB 16.1% DC2-LS1
UN 192.168.2.3 120.59 MB 16.1% DC2-LS2
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Rack
UN 192.168.1.3 120.66 MB 17.6% DC1-LS2
UN 192.168.1.2 121.2 MB 15.8% DC1-LS2
UN 192.168.1.1 121.03 MB 18.2% DC1-LS2
The Cassandra version is 2.0.7 and Karaf is 3.0.1.
Let me know if you need more information
Thanks Helen