One of my C* cluster designs expects each node to hold between 1 and 2 TB of data, and I expect a huge amount of data within a few months. Supposing I eventually reach 1 PB of data and each node holds 1 TB, I should plan for 1000x growth over time: starting from a modest N=3 nodes with RF=3 for 1 TB of data, I would keep adding nodes up to N=3000.
Such a high number of nodes puts pressure on how to deal with disk/server failures, how to keep the cluster healthy, and how to perform backups.
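To make the sizing arithmetic explicit, here is a rough sketch in Python. The function name `nodes_needed` and its parameters are just my own illustration, and it assumes data is spread perfectly evenly across nodes with no headroom for compaction or snapshots, which a real cluster would need.

```python
import math

def nodes_needed(unique_data_tb: float, rf: int, per_node_tb: float) -> int:
    """Nodes required to store unique_data_tb of data replicated rf times.

    Assumes perfectly even distribution and no free-space headroom.
    """
    return math.ceil(unique_data_tb * rf / per_node_tb)

# Growth path from 1 TB to 1 PB of unique data, RF=3, 1 TB per node.
for unique_tb in (1, 10, 100, 1000):
    print(f"{unique_tb:>5} TB unique -> {nodes_needed(unique_tb, 3, 1.0)} nodes")
```

With 2 TB per node instead of 1 TB, the same call gives half the node count, so the per-node density matters as much as the RF here.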
Healthy Cluster
Assuming you don't want any data loss and perform reads/writes at the LOCAL_QUORUM consistency level, RF=3 is very reasonable when you have N<10 nodes. However, as N goes up, the mean time between node failures across the cluster goes down accordingly (each node's own MTBF stays the same, but there are more nodes that can fail), so keeping RF=3 is asking for trouble and you may want to "upgrade" to RF=5 or more.
Q1: What's a good RF to counter the decreased cluster-wide MTBF and keep the cluster healthy (and you sleeping peacefully) with, say, 100 nodes? And with 500? And with 1000?
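The reasoning behind this worry can be sketched with a bit of Python. This is only a back-of-the-envelope model: it assumes node failures are independent with a constant failure rate, and the 30,000-hour node MTBF is a made-up figure for illustration, not a measured one. The function names are mine.

```python
def quorum(rf: int) -> int:
    """Replicas that must respond for a quorum read/write: floor(rf / 2) + 1."""
    return rf // 2 + 1

def tolerated_replica_failures(rf: int) -> int:
    """Replicas of a token range that can be down while quorum still succeeds."""
    return rf - quorum(rf)

def hours_between_node_failures(node_mtbf_hours: float, n_nodes: int) -> float:
    """Expected time between failures anywhere in the cluster.

    Assumes independent nodes with a constant failure rate.
    """
    return node_mtbf_hours / n_nodes

NODE_MTBF_HOURS = 30_000  # hypothetical value, for illustration only

for rf in (3, 5, 7):
    print(f"RF={rf}: quorum={quorum(rf)}, tolerates {tolerated_replica_failures(rf)} down")

for n in (10, 100, 500, 1000, 3000):
    print(f"N={n}: a node fails roughly every {hours_between_node_failures(NODE_MTBF_HOURS, n):.0f} h")
```

Under these assumptions RF=3 survives one replica being down per token range and RF=5 survives two, while the interval between failures shrinks linearly with N. What really matters is whether a second replica of the same range fails before the first one is repaired, which this sketch does not model.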
Backup
Backing up all the nodes doesn't seem viable, for the following reasons:
- It instantly doubles the cost of the solution.
- I would be backing up the redundant data produced by the RF of the cluster.
I see no way to remove the redundancy introduced by the RF and back up only the data, except adding another DC to C* with RF=2 (I could go for RF=1, but then losing a single node would take the whole backup cluster down). That would add 2/RF of the cost of the main cluster for backup purposes (2/3 with RF=3), which seems to me a good alternative.
Q2: Are there any other methods to perform this task without increasing the cost of the solution too much?
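For comparison, this is how I get those cost figures, as a small Python sketch. It assumes cost scales linearly with the number of replicas stored and that backup nodes cost the same as production nodes; `backup_cost_fraction` is just a name I picked.

```python
from fractions import Fraction

def backup_cost_fraction(backup_rf: int, main_rf: int) -> Fraction:
    """Extra cost of a backup DC relative to the main cluster.

    Assumes cost is proportional to the number of replicas stored.
    """
    return Fraction(backup_rf, main_rf)

# Backing up every node is like a backup DC with the same RF as the main one.
print("full node backup, RF=3:", backup_cost_fraction(3, 3))
# A backup DC with RF=2 next to a main DC with RF=3 or RF=5.
print("backup DC RF=2, main RF=3:", backup_cost_fraction(2, 3))
print("backup DC RF=2, main RF=5:", backup_cost_fraction(2, 5))
```

So the relative overhead of an RF=2 backup DC actually gets smaller if the main RF is raised to 5, although the absolute cost of the whole solution still goes up.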