We are currently facing a situation where we cannot avoid a full collection scan. We have already optimized the query and the data structure, but we would like to go further and take full advantage of sharding and replication.
Configuration
- MongoDB version: 3.2
- mongo-java-driver: 3.2
- storage engine: WiredTiger
- compression: snappy
- database size: 6 GB
Document structure:
individuals collection
{
"_id": 1,
"name": "randomName1",
"info": {...}
},
{
"_id": 2,
"name": "randomName2",
"info": {...}
},
[...]
{
"_id": 15000,
"name": "randomName15000",
"info": {...}
}
values collection
{
"_id": ObjectId("5804d7a41da35c2e06467911"),
"pos": NumberLong("2090845886852"),
"val":
[0, 0, 1, 0, 1, ... 0, 1]
},
{
"_id": ObjectId("5804d7a41da35c2e06467912"),
"pos": NumberLong("2090845886857"),
"val":
[1, 1, 1, 0, 1, ... 0, 0]
}
The "val" array contain an element for each individual (so the length of the array is up to 15000). The id of the individual is it's corresponding index in the "val" array.
Query
The query is to find the documents in the values collection where the sum of val[individual._id], over a given list of individuals, is above a specific threshold. We can't simply pre-compute the sum of the array, because the list of individuals we want changes at runtime (we may want the result for only the first 2000 individuals, for example). This query uses the aggregation framework.
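For reference, here is a minimal sketch of what one such subquery could look like with the 3.2 Java driver, using the $arrayElemAt and $sum operators (both available in MongoDB 3.2). This is an illustration rather than our exact code: SubqueryExample, individualIds, lo, hi and threshold are placeholder names, and it assumes an individual's _id maps directly to its index in "val".

import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubqueryExample {
    // Runs one pos-range subquery: sum val[id] for the requested individuals
    // and keep only documents whose sum reaches the threshold.
    static List<Document> runSubquery(MongoCollection<Document> values,
                                      List<Integer> individualIds,
                                      long lo, long hi, int threshold) {
        // Build {$sum: [{$arrayElemAt: ["$val", id]}, ...]} for the wanted ids.
        // Use (id - 1) here instead if _id 1 maps to array index 0.
        List<Document> terms = new ArrayList<>();
        for (int id : individualIds) {
            terms.add(new Document("$arrayElemAt", Arrays.asList("$val", id)));
        }
        List<Document> pipeline = Arrays.asList(
            new Document("$match", new Document("pos",
                new Document("$gt", lo).append("$lte", hi))),
            new Document("$project", new Document("pos", 1)
                .append("score", new Document("$sum", terms))),
            new Document("$match", new Document("score",
                new Document("$gte", threshold)))
        );
        return values.aggregate(pipeline).into(new ArrayList<Document>());
    }
}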
What we're currently doing:
We split the query into 100-500 subqueries and run them in parallel, 5 at a time.
The first subquery covers documents where pos > 0 and pos <= 50000, the second where pos > 50000 and pos <= 100000, and so on (see the sketch below).
We would like to run more subqueries at the same time, but we see a performance loss when running more than 5 against a single mongod instance.
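Reusing runSubquery from the sketch above, the scheduling looks roughly like this (ParallelScan and all parameter names are illustrative):

import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScan {
    // Split the pos space into fixed-size ranges and run the subqueries with
    // a bounded level of parallelism (5 in our case, beyond which a single
    // mongod starts to degrade).
    static List<Document> scan(final MongoCollection<Document> values,
                               final List<Integer> individualIds,
                               final int threshold,
                               long maxPos, long rangeSize, int parallelism)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        List<Future<List<Document>>> futures = new ArrayList<>();
        for (long lo = 0; lo < maxPos; lo += rangeSize) {
            final long from = lo;
            final long to = lo + rangeSize;
            futures.add(pool.submit(new Callable<List<Document>>() {
                public List<Document> call() {
                    return SubqueryExample.runSubquery(
                        values, individualIds, from, to, threshold);
                }
            }));
        }
        List<Document> results = new ArrayList<>();
        for (Future<List<Document>> f : futures) {
            results.addAll(f.get()); // the pool caps concurrency at 'parallelism'
        }
        pool.shutdown();
        return results;
    }
}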
So the question is: should we go for replication, for sharding, or for both, in order to run the maximum number of subqueries at the same time? And how should we configure MongoDB to dispatch the subqueries among the replicas/shards as evenly as possible?
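In case it helps frame an answer: with the 3.2 Java driver we could route reads to secondaries like this (hostnames and the database name are placeholders), but we don't know whether a replica set with secondary reads, a sharded cluster with pos as the shard key, or a combination of both would scale our subqueries best:

import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ReadPreference;
import com.mongodb.ServerAddress;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

public class ReplicaSetClient {
    // Connect to a replica set and allow reads on secondaries, so subqueries
    // can be spread across all members instead of hitting only the primary.
    static MongoCollection<Document> openValues() {
        MongoClientOptions options = MongoClientOptions.builder()
                .readPreference(ReadPreference.secondaryPreferred())
                .build();
        MongoClient client = new MongoClient(
                Arrays.asList(new ServerAddress("node1", 27017),
                              new ServerAddress("node2", 27017),
                              new ServerAddress("node3", 27017)),
                options);
        return client.getDatabase("mydb").getCollection("values");
    }
}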