I am facing a dilemma in choosing the correct affinity key.
Basically, the data structure is as follows:
ClientId relationshipType AccountId
Client is the client, a distinct person.
Account is the credit, with an attribute of closed/open.
relationshipType is the type of link, e.g. a client can be a borrower (or main borrower), a co-borrower, or a guarantor.
1. An account has a flag that specifies whether it is closed or open.
2. The data set grows over time, but the application is not interested in accounts (and related clients) that are closed (meaning after they have been processed), with the exception in 2a.
2a. The scoring engine can potentially navigate all the links in a distinct group (graph), e.g. if a credit is open, scoring may navigate all links in the graph related to this open account, through either client id or account id.
3. The size of the working data should be limited; it depends on the number of open accounts and their associated links.
4. Data for each group should be located on the same node so that the compute grid can process it locally.
5. The application should be able to process events in the form of inserts/updates/deletions of client -> account links (if the group-based solution is used).
6. The client-to-account links table holds 80M rows, of which an estimated 8M links are actually needed (the links in independent graphs with at least one open account).
1. The first thing I thought of is to create synthetic independent groups, which have no links to any other client/account ids, each with a flag of being open/closed.
G1 - C1 - A1
G2 - C2 - A2
G3 - C3 - A1
G4 - C2 - A6
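To make the idea of independent groups concrete, here is a minimal sketch of how such groups could be computed from the links table with a union-find structure, and then filtered down to the groups that contain at least one open account. This is only an illustration in Python, not grid code; the function names (build_groups, open_groups) and the sample links in the usage note are hypothetical.

```python
from collections import defaultdict

def find(parent, node):
    # Register unseen nodes as their own root, then walk up to the root,
    # halving the path along the way.
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node

def union(parent, a, b):
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a

def build_groups(links):
    """links: list of (client_id, relationship_type, account_id) tuples.

    Returns a list of groups, each group being the list of links that
    belong to one connected component of the client/account graph.
    """
    parent = {}
    for client_id, _rel, account_id in links:
        # Prefix ids so a client and an account with the same id never collide.
        union(parent, ("C", client_id), ("A", account_id))
    groups = defaultdict(list)
    for link in links:
        groups[find(parent, ("C", link[0]))].append(link)
    return list(groups.values())

def open_groups(groups, open_accounts):
    # Keep only groups that contain at least one open account (hypothetical helper).
    return [g for g in groups if any(acc in open_accounts for _c, _r, acc in g)]
```

For example, with made-up links C1-A1, C2-A1, C2-A3 and C4-A9, build_groups returns two groups (three links and one link), and open_groups with only A3 open keeps just the three-link group.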
But while investigating real production data, I found anomalous groups with more than 20,000 links (one even has 300k, but I think that is a data issue; investigating further), which would bring the individual group size up to several hundred MB.
This could bring the system to its knees when processing certain groups, and I cannot think of any good safeguard. E.g. limiting groups to a size of, let's say, 200 and just adding new relationships to existing groups may be an option, but this would have the same data affinity key problem as option 2 below.
Also, when merging groups (processing an update/insert) or breaking them up (delete/update), new groups may be created and old ones deleted. This means the partition key would have to be updated, creating a potential window of inconsistency in the grid, as I would have to update groupId in the account and client entities.
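The merge case can be sketched as follows, to show which entities would need their groupId (and therefore their partition) changed when a new link joins two existing groups. This is a rough in-memory illustration only; the GroupIndex class and its add_link method are hypothetical names, it always merges the smaller group into the larger one, and it does not handle deletions (splitting a group).

```python
import itertools

class GroupIndex:
    def __init__(self):
        self.group_of = {}   # node -> group id
        self.members = {}    # group id -> set of nodes
        self._ids = itertools.count(1)

    def add_link(self, client_id, account_id):
        """Register a link; return the set of nodes whose group id changed."""
        a, b = ("C", client_id), ("A", account_id)
        ga, gb = self.group_of.get(a), self.group_of.get(b)
        if ga is None and gb is None:
            # Both ends are new: start a fresh group, nothing is re-keyed.
            gid = next(self._ids)
            self.members[gid] = {a, b}
            self.group_of[a] = self.group_of[b] = gid
            return set()
        if ga is None or gb is None:
            # One end is new: it simply joins the existing group.
            gid = ga if ga is not None else gb
            new_node = a if ga is None else b
            self.members[gid].add(new_node)
            self.group_of[new_node] = gid
            return set()
        if ga == gb:
            return set()
        # Two existing groups are joined: move the smaller into the larger.
        if len(self.members[ga]) >= len(self.members[gb]):
            keep, drop = ga, gb
        else:
            keep, drop = gb, ga
        moved = self.members.pop(drop)
        for node in moved:
            self.group_of[node] = keep
        self.members[keep] |= moved
        return moved
```

The returned set is exactly the list of client/account entities that would have to be rewritten with a new groupId, which is where the window of inconsistency would come from.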
2. Use as the root the account with the borrower relationship (of which we know there can be only one), and potentially duplicate links across the client-account groups.
However, this breaks the requirement of having a proper data affinity key, e.g.
G1 - C1 - BORROWER - A1 (ROOT)
G1 - C2 - GUARANTOR - A1
G2 - C2 - BORROWER - A3 (ROOT)
G3 - C3 - GUARANTOR - A3
G3 - C2 - GUARANTOR - A1
So if I choose groupId as the affinity key, obviously this would not work, as the chosen roots would be different and the data (client/account entities) would end up on different nodes, or even worse, randomly across different nodes.
If anyone has experience with such a problem, any help would be appreciated.
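A small sketch of why the root-based key fails: if each link is keyed by the account it points to (my simplified reading of option 2, with the root account id standing in for groupId), any client linked to more than one account ends up in more than one group. The function name clients_spanning_groups is hypothetical.

```python
from collections import defaultdict

def clients_spanning_groups(links):
    """links: list of (client_id, relationship_type, account_id) tuples.

    Assumes the group key of a link is its account id (the root).
    Returns the clients that would belong to more than one group.
    """
    groups_per_client = defaultdict(set)
    for client_id, _rel, account_id in links:
        groups_per_client[client_id].add(account_id)
    return {c: g for c, g in groups_per_client.items() if len(g) > 1}

links = [
    ("C1", "BORROWER", "A1"),
    ("C2", "GUARANTOR", "A1"),
    ("C2", "BORROWER", "A3"),
    ("C3", "GUARANTOR", "A3"),
]
spanning = clients_spanning_groups(links)
```

Here spanning contains only C2, mapped to both A1 and A3, so no single affinity key value can collocate the C2 entity with both of its groups.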