We are trying to use the "Big data" support as a store for large binary data (about 2 GB or more per key/value pair). This is scientific data that is stored in compressed form.
The problem is that the "NoSQL API" does not provide a convenient interface for our case. Let me explain in more detail.
Data is written with "putLOB", which either creates a new value or replaces an old one. However, it cannot append to the end. With a partial "lob" it is possible to avoid overwriting what is already recorded and write only what is needed, but that only helps with a write-once model: record once and forget. If you want to record data gradually, as it becomes available, this method is not suitable.
I propose adding an overload of "putLOB" with the following signature:
OutputStream putLOB (Key lobKey, long offset, Durability durability, long lobTimeout, TimeUnit timeoutUnit)
It would open a stream for writing starting from "offset".
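To make the intended semantics concrete, here is a minimal sketch of what "open a stream for writing starting from offset" would mean, using a local RandomAccessFile in place of the store. The proposed putLOB overload is hypothetical; this only illustrates the offset-append behavior being asked for.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class OffsetAppendSketch {
    // Write a chunk starting at the given offset, leaving any earlier
    // bytes intact. Calling this repeatedly with offset == current length
    // gives the "append as data becomes available" pattern.
    static void writeAt(RandomAccessFile file, long offset, byte[] chunk)
            throws IOException {
        file.seek(offset);
        file.write(chunk);
    }
}
```

With such a method, each new batch of scientific data could be written at the end of the existing value instead of re-uploading the whole 2 GB LOB.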
LOB operations are not atomic. Failures (like the loss of network connectivity) while an insert operation is in progress may result in a partially inserted LOB.
An attempt to access a partial LOB (via getLOB) will result in a PartialLOBException being thrown.
In this case I cannot even find out the real size of the recorded data, which I need in order to append to the end.
In our case it does not matter whether the LOB is partial or not; what matters is knowing the actual size of the data. Data integrity checking is our responsibility.
I propose a new "getLOB" variant that does not throw PartialLOBException.
"InputStreamVersion" is missing a method "long getLOBLength()". We have to get the InputStream and do "skip(Long.MAX_VALUE)", even though the size is already known: internally, the "InputStreamVersion" object has a "lobsize" field.
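For reference, the skip-based workaround is not a single call: InputStream.skip() is allowed to skip fewer bytes than requested, so it has to be looped. A self-contained sketch of measuring a stream's length this way:

```java
import java.io.IOException;
import java.io.InputStream;

public class StreamLength {
    // Measure a stream's total size by consuming it. This is the
    // workaround for the missing getLOBLength(); skip() may return
    // less than requested, so it is called in a loop, falling back
    // to read() to distinguish a zero-skip from end of stream.
    static long measure(InputStream in) throws IOException {
        long total = 0;
        while (true) {
            long skipped = in.skip(Long.MAX_VALUE);
            if (skipped > 0) {
                total += skipped;
            } else if (in.read() < 0) {
                break;          // end of stream reached
            } else {
                total++;        // read() consumed one byte
            }
        }
        return total;
    }
}
```

This consumes the whole stream just to learn a number the server already holds, which is exactly the inefficiency a getLOBLength() accessor would remove.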
Would it be possible to implement these things?
Sorry for my English :)
We understand your requirement. Let me suggest a couple of alternatives worth trying.
Case 1: If you append relatively small amounts of data successively to the record (for example, you might store 100 KB of experiment data, then another 100 KB, etc.), then you can just use a sequence of minor keys to store each chunk of data. To retrieve, you would iteratively fetch the contents of each minor key in sequence.
Case 2: If you first store a LARGE amount (e.g. 2 GB) of data, and then add small (e.g. 100 KB) chunks of data subsequently, you could store the large content as a LOB and then store the small chunks as a sequence of minor keys. To retrieve, you'd first read the LOB and then iteratively get() the minor key contents in sequence.
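The chunking idea in Case 1 can be sketched as follows. An in-memory sorted map stands in for the store's minor keys here (an assumption purely for illustration); in the real database each chunk would live under its own minor key within one major key path.

```java
import java.io.ByteArrayOutputStream;
import java.util.TreeMap;

public class ChunkedValue {
    // Chunk index plays the role of the minor key; the sorted map
    // guarantees chunks come back in insertion order.
    private final TreeMap<Integer, byte[]> chunks = new TreeMap<>();
    private int next = 0;

    // Append: each new batch of data gets the next "minor key".
    void append(byte[] chunk) {
        chunks.put(next++, chunk);
    }

    // Retrieve: iterate the "minor keys" in order and concatenate.
    byte[] readAll() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] c : chunks.values()) {
            out.write(c, 0, c.length);
        }
        return out.toByteArray();
    }
}
```

Sequential reads stay fast because retrieval walks the chunks in key order; the trade-off is that the number of stored keys multiplies by the chunk count.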
Does this work for you?
There will be on the order of several million key/value pairs of data. If we split them up further, there will be even more. Can the database handle that many keys without loss of speed?
In my opinion it would be better to store all the data in one pair, because we need fast sequential access. If we needed fast random access, we could consider your proposal.
That is not our case.
My recommendation is to do some experiments with different sizes of "value" in order to get an idea of how the system performs. NoSQL Database is designed to handle large numbers of key-value pairs. We've tested NoSQL Database performance with over 2 billion rows. In our test scenario, each key-value pair was approx. 1000 bytes (we used the YCSB workload for testing).
Your scenario is different from the YCSB workload since you have very LARGE rows. Some experimentation with data modeling will be very helpful. I'd love to see the results of your experiments.
Hope this was helpful.