9 Replies Latest reply: Mar 4, 2012 12:27 AM by 921522

    Compression of long-values

    921522
      Hello,

      I want to compress an array of long values (packed long / sorted packed long), but I couldn't find a method for it. TupleOutput.writePackedLong(long) seems to fit, but only for single values, not for arrays. Maybe I should first write the values into a ByteArrayOutputStream via TupleOutput.writeTo(...)/writePackedLong and then compress the whole byte array once more with a Deflater. The background is that I want to compress the long pointers of a DOM-like tree structure, i.e. the tuple (firstChild, leftSibling, rightSibling, parent, childCount, descendantCount, depth, node). Whoever designed the structure decided to use long values for the pointers, which in almost all cases is far too much overhead; but since most nodes are still numbered in preorder, the values should compress very well. I could also store distances instead of absolute values, but if I sort the long values first I don't know how to tell which value is which during serialization/deserialization.

      So, maybe someone can tell me whether at least writing packed-long values into a byte buffer and then compressing it with a Deflater will work. Maybe I don't have to sort the values and can simply store the differences, although I assume this would sometimes result in bigger values :-(
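
      To make the "array" part concrete, this is roughly what I mean (the method and variable names are invented for illustration, using com.sleepycat.bind.tuple.TupleOutput); the resulting byte[] could then be fed to a Deflater:

      // Packs each pointer with a variable-length encoding; small magnitudes
      // need only one or two bytes instead of eight.
      private static byte[] packPointers(final long[] pointers) {
          final TupleOutput out = new TupleOutput();
          for (final long pointer : pointers) {
              out.writePackedLong(pointer);
          }
          return out.toByteArray();
      }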

      kind regards,
      Johannes
        • 1. Re: Compression of long-values
          Greybird-Oracle
          Why not call writePackedLong repeatedly for each long in the array?
           --mark
          • 2. Re: Compression of long-values
            724410
             I have combined TupleBindings with Google Protocol Buffers and compression; this code should get you started:

             import java.io.ByteArrayOutputStream;
             import java.util.Date;
             import java.util.logging.Level;
             import java.util.logging.Logger;

             import com.google.protobuf.InvalidProtocolBufferException;
             import com.sleepycat.bind.tuple.TupleBinding;
             import com.sleepycat.bind.tuple.TupleInput;
             import com.sleepycat.bind.tuple.TupleOutput;

             // ArchiveMessage, ByteUtil and the generated Messages.Archive class are
             // application-specific; they are not part of JE or protobuf.
             public final class ArchiveTupleBinding extends TupleBinding<ArchiveMessage> {

                 private static final Logger LOGGER = Logger.getLogger(ArchiveTupleBinding.class.getName());

                 @Override
                 public ArchiveMessage entryToObject(final TupleInput input) {

                     final ArchiveMessage object = new ArchiveMessage();

                     // Drain the TupleInput into a byte array.
                     final ByteArrayOutputStream baos = new ByteArrayOutputStream();
                     final byte[] buffer = new byte[1024];
                     int length;
                     while ((length = input.read(buffer)) != -1) {
                         baos.write(buffer, 0, length);
                     }

                     try {
                         // decompress is assumed to return null if the bytes could not be
                         // inflated (e.g. legacy records stored uncompressed).
                         byte[] decompressed = ByteUtil.decompress(baos.toByteArray());
                         if (decompressed == null) {
                             decompressed = baos.toByteArray();
                         }
                         final Messages.Archive pb = Messages.Archive.parseFrom(decompressed);

                         if (pb.hasContent()) {
                             object.setContent(pb.getContent());
                         }
                         if (pb.hasDate()) {
                             object.setDate(new Date(pb.getDate()));
                         }
                         if (pb.hasSourceName()) {
                             object.setSourceName(pb.getSourceName());
                         }
                         if (pb.hasSourceDate()) {
                             object.setSourceDate(new Date(pb.getSourceDate()));
                         }

                     } catch (final InvalidProtocolBufferException e) {
                         LOGGER.log(Level.SEVERE, null, e);
                         return null;
                     }

                     return object;
                 }

                 @Override
                 public void objectToEntry(final ArchiveMessage object, final TupleOutput output) {

                     final Messages.Archive.Builder builder = Messages.Archive.newBuilder();

                     if (object.getContent() != null) {
                         builder.setContent(object.getContent());
                     }
                     if (object.getDate() != null) {
                         builder.setDate(object.getDate().getTime());
                     }
                     if (object.getSourceName() != null) {
                         builder.setSourceName(object.getSourceName());
                     }
                     if (object.getSourceDate() != null) {
                         builder.setSourceDate(object.getSourceDate().getTime());
                     }

                     // Serialize to protobuf bytes, compress, and write into the tuple.
                     output.write(ByteUtil.compress(builder.build().toByteArray()));
                 }
             }
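
             For completeness, the decompress helper is roughly along these lines (only a sketch, not the exact ByteUtil code; the important point is that it returns null when the bytes are not valid zlib data, which is why entryToObject falls back to the raw bytes):

             import java.io.ByteArrayOutputStream;
             import java.util.zip.DataFormatException;
             import java.util.zip.Inflater;

             public final class ByteUtilSketch {

                 // Inflates zlib-compressed bytes; returns null if the input cannot be
                 // inflated, so callers can fall back to the raw (uncompressed) bytes.
                 public static byte[] decompress(final byte[] data) {
                     final Inflater inflater = new Inflater();
                     inflater.setInput(data);
                     final ByteArrayOutputStream out = new ByteArrayOutputStream();
                     final byte[] buffer = new byte[1024];
                     try {
                         while (!inflater.finished()) {
                             final int n = inflater.inflate(buffer);
                             if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                                 return null; // truncated or not zlib data
                             }
                             out.write(buffer, 0, n);
                         }
                     } catch (final DataFormatException e) {
                         return null; // not compressed (or corrupt) data
                     } finally {
                         inflater.end();
                     }
                     return out.toByteArray();
                 }
             }

             The compress side is just the Deflater mirror image of this.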
            • 3. Re: Compression of long-values
              Greybird-Oracle
              Thanks Peter, I'm sure this will be useful to others.

              But if the objective is to use protocol buffers, why use a TupleBinding? It seems more straightforward (and faster) to implement the EntryBinding interface. That way, the entryToObject method is passed a DatabaseEntry. You can call DatabaseEntry.getData to get the byte array, rather than reading it from the TupleInput. And likewise for objectToEntry -- you can call DatabaseEntry.setData to set the byte array.
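
               For illustration, an EntryBinding version of the same idea might look roughly like this (a sketch only; ArchiveMessage, Messages.Archive and ByteUtil are the application classes from the previous post):

               import com.google.protobuf.InvalidProtocolBufferException;
               import com.sleepycat.bind.EntryBinding;
               import com.sleepycat.je.DatabaseEntry;

               public final class ArchiveEntryBinding implements EntryBinding<ArchiveMessage> {

                   @Override
                   public ArchiveMessage entryToObject(final DatabaseEntry entry) {
                       try {
                           // No TupleInput needed: use the stored bytes directly.
                           byte[] decompressed = ByteUtil.decompress(entry.getData());
                           if (decompressed == null) {
                               decompressed = entry.getData();
                           }
                           final Messages.Archive pb = Messages.Archive.parseFrom(decompressed);
                           final ArchiveMessage object = new ArchiveMessage();
                           if (pb.hasContent()) {
                               object.setContent(pb.getContent());
                           }
                           // ... copy the remaining fields as in the TupleBinding version ...
                           return object;
                       } catch (final InvalidProtocolBufferException e) {
                           return null;
                       }
                   }

                   @Override
                   public void objectToEntry(final ArchiveMessage object, final DatabaseEntry entry) {
                       final Messages.Archive.Builder builder = Messages.Archive.newBuilder();
                       if (object.getContent() != null) {
                           builder.setContent(object.getContent());
                       }
                       // ... set the remaining fields as in the TupleBinding version ...
                       entry.setData(ByteUtil.compress(builder.build().toByteArray()));
                   }
               }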

               --mark
              • 4. Re: Compression of long-values
                724410
                 OMG thanks, I did not know I could use anything other than TupleBinding; I will check that out.

                 The history of my code is that I used to use TupleBinding with the writeXXX methods to serialize objects into tuples, but found out that it is not (easily) possible to keep the code backward compatible so that old records can still be read. So I decided that the easiest way was to rewrite my TupleBindings to use protobuf and to compress at the same time.
                • 5. Re: Compression of long-values
                  Greybird-Oracle
                  That makes sense. Yeah, EntryBinding is the interface implemented by all bindings, and is the interface expected by the collections API. SerialBinding and TupleBinding are just two classes that implement the interface, if you happen to need what they have to offer.

                   --mark
                  • 6. Re: Compression of long-values
                    921522
                    I'm also using TupleOutput, but I think that's OK. Right now I'm writing consecutive writeSortedPackedLong() calls. I don't know about the Google approach, but maybe running a Deflater afterwards over the byte[] representation of the underlying ByteArrayOutputStream would be sufficient -- what do others think? :-) Initially the node IDs are in consecutive preorder; after inserts/replaces, however, a first child or right sibling may get a much higher number (maxNodeKey + 1). Still, I think subsequently compressing the six or so long values will usually give a more compact result. It must be fast, though (Deflater.BEST_SPEED or something like that), because in almost all cases the data has to be decompressed in main memory for navigation (if some upcoming path index structures can't be used). That's why it has to be really fast.
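
                    A rough sketch of what I mean (class and method names are invented; the six fields stand for firstChild, leftSibling, rightSibling, parent, childCount and descendantCount), with the Deflater set to BEST_SPEED and the Deflater/Inflater instances reused via reset() to keep the per-record cost low:

                    import java.io.ByteArrayOutputStream;
                    import java.util.zip.DataFormatException;
                    import java.util.zip.Deflater;
                    import java.util.zip.Inflater;

                    import com.sleepycat.bind.tuple.TupleInput;
                    import com.sleepycat.bind.tuple.TupleOutput;

                    public final class NodePointerCodec {

                        // Reused across records; cheaper than allocating new instances per node.
                        private final Deflater deflater = new Deflater(Deflater.BEST_SPEED);
                        private final Inflater inflater = new Inflater();
                        private final byte[] buffer = new byte[256];

                        public byte[] serialize(final long[] sixFields) {
                            final TupleOutput out = new TupleOutput();
                            for (final long field : sixFields) {
                                out.writeSortedPackedLong(field);
                            }
                            deflater.reset();
                            deflater.setInput(out.toByteArray());
                            deflater.finish();
                            final ByteArrayOutputStream compressed = new ByteArrayOutputStream();
                            while (!deflater.finished()) {
                                final int n = deflater.deflate(buffer);
                                compressed.write(buffer, 0, n);
                            }
                            return compressed.toByteArray();
                        }

                        public long[] deserialize(final byte[] compressed) throws DataFormatException {
                            inflater.reset();
                            inflater.setInput(compressed);
                            final ByteArrayOutputStream packed = new ByteArrayOutputStream();
                            while (!inflater.finished()) {
                                final int n = inflater.inflate(buffer);
                                if (n == 0 && inflater.needsInput()) {
                                    throw new DataFormatException("truncated input");
                                }
                                packed.write(buffer, 0, n);
                            }
                            final TupleInput in = new TupleInput(packed.toByteArray());
                            final long[] sixFields = new long[6];
                            for (int i = 0; i < sixFields.length; i++) {
                                sixFields[i] = in.readSortedPackedLong();
                            }
                            return sixFields;
                        }
                    }

                    Whether the Deflater pass actually pays off on six packed longs is the open question, of course.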
                    • 7. Re: Compression of long-values
                      Greybird-Oracle
                      I have no experience with Deflater. But a question I have is whether the cost of compression will be worth the benefit. I assume the benefit is proportional to the size of the data, and that small data sizes are not worthwhile. But that's just a guess. What is your per-record data size?

                      If you are able to take measurements -- the decrease in the speed of a get() operation, for example, and the reduction in overall JE cache size -- I'm sure everyone would be interested in the results.

                       --mark
                      • 8. Re: Compression of long-values
                        921522
                        Currently I can't spend much time on benchmarking, but this has bothered me for a long time, because I have been involved in the project for about 4 or 5 years, with a little break ;-) I now have to finish my master's thesis, which is not about the storage structure itself; I'm currently more interested in tuning the storage and perhaps the in-memory cache of "node pages" which are stored on disk. Pointer-based trees are much more space-consuming than, for instance, array-based tree structures (or structures based on functional data types for efficient sequences), but we are aiming at persistent, growing trees (that is, we store revisions after a write transaction commits, based on a copy-on-write principle). My supervisor is currently working on a cloud-based secure storage, and sadly the node layer is currently neither used nor maintained. We store RevisionRootPages, a meta page for each root page (which I want to extend to provide a path-like index structure) and NodePages, where all pages are organized in a B-tree-like structure with the NodePages as the leaves (if someone is interested: https://github.com/disy/treetank, but I still have to put my java7 branch with lots of changes and improvements on GitHub as well).

                        I haven't done any compression work before (except that a few days ago I added text-value compression for TextNodes, which are currently created for XML character content), and I don't know whether it's worth compressing about 6 long values on top of the sorted packed-long encoding. I hope the latter already has a great effect, as probably all trees created in the first revisions -- or maybe all the time -- will be smaller than about 500,000 nodes and fit without any problems into integer range (at least I haven't found any XML databases which use long values), and I think even integers can profit greatly from UTF-8-like (or UTF-16-like?) variable-length encodings. Perhaps it sometimes makes more sense to store ranges (differences) of longs; for instance, shredding/importing a very simple XML document

                        <root>
                        <bla>blubb</bla>
                        foobar
                        </root>

                        into our internal tree representation will yield the following tuples (plus descendantCount/childCount, and maybe I'll add a lastChild pointer and a level integer in the future):

                        root: node-ID (1) / firstChild (1) / rightSibl(-1) / leftSibl(-1) / parent(0) (an implicit document-root with node-ID 0)
                        bla: node-ID (2) / firstChild(3) / rightSibl(4) / leftSibl(-1) / parent(1)
                        blubb: node-ID (3) / firstChild(-1) / rightSibl(-1) / leftSibl(-1) / parent(2)
                        foobar: node-ID (4) / firstChild(-1) / rightSibl(-1) / leftSibl(2) / parent(1)

                        This would benefit from a range encoding, for instance for the first node, the root: (1/0/-1/0/0), to keep the numbers smaller, plus compression of that (which will most probably have a great impact on large trees). It may even help after an insert (asFirstChild / asRightSibling) in some cases (the next node-ID in the example will be 5, i.e. maxID + 1, so document order is no longer preserved).

                        Maybe this is the way to go for the disk storage, with a "compressed" path-like structure for in-memory use which can be updated easily; but then again, usually only a small fraction of the tree is held in memory through a simple cache, and the updates are stored in a BerkeleyDB cache log.

                        kind regards,
                        Johannes
                        • 9. Re: Compression of long-values
                          921522
                          Ok, I've added the range encoding, i.e. node - rightSibl, node - leftSibl, ..., descendantCount - childCount and so on is stored, still using the writeSortedLong(long) method. But for the pointers I needed an additional bit, a boolean value to indicate that there is no neighbour node, so that I don't have to serialize and deserialize the special -1 long value which marks a missing pointer (and I can't run the reconstruction on it, because that would yield wrong results). For now I've simply used TupleOutput.writeByte(byte), but I think I really only need a single bit, and I don't know how to do that -- Java lacks a bit data type, so I wouldn't even know how to cast an integer (1 or 0) to a bit ;-)

                          private static void writePointer(final ITTSink pSink, final long pSelf, final long pPointer) {
                              // One byte flags a missing pointer so the special -1 value never has to be delta-encoded.
                              final boolean isNullNodeKey = pPointer == EFixed.NULL_NODE_KEY.getStandardProperty();
                              pSink.writeByte(isNullNodeKey ? (byte) 1 : (byte) 0);
                              if (!isNullNodeKey) {
                                  // Store the distance to the referencing node instead of the absolute key.
                                  final long toStore = pSelf - pPointer;
                                  assert toStore != 0 : "May never be 0!";
                                  pSink.writeLong(toStore);
                              }
                          }

                          and

                          private long deserializeStructDel(final ITTSource pSource, final long pSelf) {
                              final boolean isNullKey = pSource.readByte() == (byte) 1;
                              if (isNullKey) {
                                  return EFixed.NULL_NODE_KEY.getStandardProperty();
                              }
                              // Reconstruct the absolute key from the stored distance.
                              final long pointer = pSource.readLong();
                              assert pointer != 0;
                              return pointer >= 0 ? pSelf - pointer : Math.abs(pointer - pSelf);
                          }

                          are the two simple methods; they are called, for instance, via

                          writePointer(pSink, pDel.getNodeKey(), pDel.getRightSiblingKey());
                          writePointer(pSink, pDel.getNodeKey(), pDel.getLeftSiblingKey());
                          writePointer(pSink, pDel.getNodeKey(), pDel.getFirstChildKey());
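
                          A possible way to get rid of one whole byte per pointer (only a sketch, names invented): collect the three null flags as bits of a single byte, write that byte once, and then write only the distances of the non-null pointers:

                          // Packs the three "pointer is null" flags into one byte instead of three.
                          private static void writePointers(final ITTSink pSink, final long pSelf,
                                  final long pRightSibl, final long pLeftSibl, final long pFirstChild) {
                              final long nullKey = EFixed.NULL_NODE_KEY.getStandardProperty();
                              int flags = 0;
                              if (pRightSibl == nullKey) flags |= 1;        // bit 0
                              if (pLeftSibl == nullKey) flags |= 1 << 1;    // bit 1
                              if (pFirstChild == nullKey) flags |= 1 << 2;  // bit 2
                              pSink.writeByte((byte) flags);
                              if (pRightSibl != nullKey) pSink.writeLong(pSelf - pRightSibl);
                              if (pLeftSibl != nullKey) pSink.writeLong(pSelf - pLeftSibl);
                              if (pFirstChild != nullKey) pSink.writeLong(pSelf - pFirstChild);
                          }

                          // Reading side: test the bits in the same order as they were written.
                          private static long[] readPointers(final ITTSource pSource, final long pSelf) {
                              final long nullKey = EFixed.NULL_NODE_KEY.getStandardProperty();
                              final int flags = pSource.readByte();
                              final long[] keys = new long[3]; // rightSibl, leftSibl, firstChild
                              for (int bit = 0; bit < 3; bit++) {
                                  keys[bit] = (flags & (1 << bit)) != 0 ? nullKey : pSelf - pSource.readLong();
                              }
                              return keys;
                          }

                          The saving is only two bytes per node, so it may or may not show up after the Deflater pass, but at least it avoids the need for a bit data type.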

                          kind regards,
                          Johannes