This discussion is archived
13 Replies. Latest reply: Mar 28, 2013 8:32 AM by greybird

Any plans for supporting Direct Byte Buffers?

vinothchandar Newbie
Hi,

I was wondering if there are any plans on the roadmap to support direct byte buffers for the byte[] data inside the internal nodes? This would leave only the thin wrappers (the BIN and IN classes) on the heap, and I believe it would take a lot of pressure off the JVM heap (bringing down GC times). I have experimented with the speed of access and allocation of direct byte buffers, and it seems a feasible approach: http://distributeddreams.blogspot.com/2012/07/memory-allocation-speed-check.html

In fact, I attempted to change the code along these lines and made some decent progress before hitting a roadblock with the checksumming code. So even if you think it will not work, I am very interested in understanding why.
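As a rough illustration of the distinction in question (this is just the standard java.nio API, not JE code): a heap buffer is backed by a byte[] that the GC must track, while a direct buffer keeps its storage in native memory and leaves only a small wrapper object on the heap.

```java
import java.nio.ByteBuffer;

public class DirectBufferSketch {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[] on the JVM heap, fully GC-managed.
        ByteBuffer heap = ByteBuffer.allocate(1024);
        // Direct buffer: backed by native memory outside the heap; only the
        // thin ByteBuffer wrapper object lives on the heap.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);

        System.out.println(heap.hasArray());   // true: backing byte[] is accessible
        System.out.println(direct.hasArray()); // false: no heap backing array
    }
}
```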

Thanks for patiently answering all my deep questions.

Thanks
Vinoth
  • 1. Re: Any plans for supporting Direct Byte Buffers?
    stotch Newbie
    I agree, this would be very helpful in reducing heap usage and improving gc times.
  • 2. Re: Any plans for supporting Direct Byte Buffers?
    greybird Expert
    Vinoth,

    Could you please describe in more detail what you changed in JE that gave you a performance benefit? Did you implement your own "slab allocation" that handles variable-sized byte arrays, and use it to allocate the byte arrays used by INs?

    --mark
  • 3. Re: Any plans for supporting Direct Byte Buffers?
    vinothchandar Newbie
    Hi Mark,

    Let me try to be more descriptive.

    I have not done any perf tests against JE itself. Traditionally, JNI-based access has been slower than heap access, and ideally we want BDB to bottleneck on device IOPS, not on memory allocation. So I simply tested that I can allocate direct byte buffers fast enough (those are the numbers in the blog link I posted) without impeding throughput or adding milliseconds of latency.

    Here is my rationale for thinking it is fast enough:
    I could allocate up to 250,000 1 KB buffers per second in a single thread. Since keys are generally small for us (I am not sure if this holds for the majority of JE users), you can fit multiple keys' worth of data in a single 1 KB allocation. For example, if a key is 100 bytes, then with nodeMaxEntries at 128, 12-13 such 1 KB buffers are enough to allocate a BIN. So we could do up to 250K/12 ≈ 20K BIN allocations per second in a single application thread, which should be enough to saturate IOPS (even on SSD) once we add more threads.
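    The arithmetic above, as a sketch. The key size, nodeMaxEntries, and allocation rate are the assumed figures from this post, not measurements against JE itself:

```java
public class AllocationBudget {
    public static void main(String[] args) {
        int keyBytes = 100;            // assumed average key size
        int nodeMaxEntries = 128;      // assumed nodeMaxEntries setting
        int bufferBytes = 1024;        // 1 KB direct-buffer allocation unit
        long buffersPerSec = 250_000;  // single-thread allocation rate from the blog post

        // Key bytes per BIN, rounded up to whole 1 KB buffers.
        int bytesPerBin = keyBytes * nodeMaxEntries;                       // 12,800
        int buffersPerBin = (bytesPerBin + bufferBytes - 1) / bufferBytes; // 13

        long binsPerSec = buffersPerSec / buffersPerBin;                   // ~19K, i.e. roughly 20K
        System.out.println(buffersPerBin + " buffers per BIN, ~" + binsPerSec + " BINs/sec");
    }
}
```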

    Here is the code change I made:
    I simply replaced ByteBuffer.allocate() with ByteBuffer.allocateDirect() in the places where the code reads an LN/BIN/IN into the JVM. A lot of code needed to change, because you can no longer call bytebuffer.array(): all the code that accessed the underlying buffer as a byte[] and did buf[i] had to change to bytebuffer.get(i)/put(i, b). I kept making small changes, running the unit tests in between. Then I hit the place where JE verifies the checksum, which expects a byte[] (the underlying Java API only accepts one). The only option was to make a temporary copy and run the checksum on that. At that point it started to feel a little hacky, so I gave up and decided to check with you guys.
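    The checksum workaround described above might look roughly like this. A hypothetical sketch only (DirectChecksum is invented for illustration): java.util.zip.CRC32 historically accepts only byte[], so the direct buffer's contents must be copied out first.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class DirectChecksum {
    // Checksum the remaining contents of a direct buffer. CRC32.update()
    // only took a byte[] before JDK 9 added a ByteBuffer overload, so the
    // bytes are first copied into a temporary array -- the "temp copy"
    // workaround described above.
    static long checksum(ByteBuffer direct) {
        byte[] tmp = new byte[direct.remaining()];
        direct.duplicate().get(tmp); // duplicate() leaves the caller's position untouched
        CRC32 crc = new CRC32();
        crc.update(tmp);
        return crc.getValue();
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(4);
        buf.put(new byte[] {1, 2, 3, 4}).flip();
        System.out.println(Long.toHexString(checksum(buf)));
    }
}
```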

    Here is what I expect we would gain:
    Our heap sizes for Voldemort are at 32 GB and we don't want to go higher. Needless to say, a heap this big is problematic in terms of GC (to clarify, we are much better off than last year, though). So effectively, vertically scaling a box is now bottlenecked on heap space.

    -- YoungGen collection times: we have 30-40 environments (1 DB per env) sharing this single 32 GB heap, and when cleaning happens it makes ParNew run more frequently; every time it does, the app stalls for ~300 ms.
    -- CMS: this happens only once in a while, but when there is heavy cleaning (load spikes etc.) it also increases, and it is a function of the heap size.

    Replacing heap buffers with direct buffers would mean that only the light BIN/IN/LN wrappers (200 bytes or so) are allocated on the heap, as opposed to the entire index data, dramatically reducing ParNew frequency and giving much better, more predictable 99th-percentile latency. Of course, a smaller heap is also much more manageable. And yes, we could put more RAM in the box and scale vertically further with only a slight increase in heap size.

    Over to you for comments now..

    Thanks
    Vinoth
  • 4. Re: Any plans for supporting Direct Byte Buffers?
    greybird Expert
    I simply replaced ByteBuffer.allocate() with ByteBuffer.allocateDirect() in the places where the code reads an LN/BIN/IN into the JVM. A lot of code needed to change, because you can no longer call bytebuffer.array(): all the code that accessed the underlying buffer as a byte[] and did buf[i] had to change to bytebuffer.get(i)/put(i, b). I kept making small changes, running the unit tests in between. Then I hit the place where JE verifies the checksum, which expects a byte[] (the underlying Java API only accepts one). The only option was to make a temporary copy and run the checksum on that. At that point it started to feel a little hacky, so I gave up and decided to check with you guys.
    This isn't how the memory for INs (including BINs) is allocated. ByteBuffer is only used for reading from the file, after which the bytes are deserialized to create Java objects: INs, LNs, etc.

    To use direct memory for INs we'd have to implement what databases traditionally call a buffer pool, and what I think you're calling slab allocation. This is a complete rework of the IN code. However, it is something we've considered for a future release, not only because of the potential GC advantages but also because our memory management might be simpler if we went as far as keeping what are now Java objects in serialized form. But we haven't decided whether to do this. It's a big project, and certainly not something that's coming soon.
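    For readers unfamiliar with the term, the rough shape of such a buffer pool might look like the following. This is purely a hypothetical sketch (SlabPool and its methods are invented for illustration; a real pool would also handle variable sizes, freeing, eviction, and concurrency), not JE's actual design:

```java
import java.nio.ByteBuffer;

public class SlabPool {
    private final ByteBuffer region; // one large region allocated up front
    private final int slabSize;
    private int next = 0;            // bump-pointer allocator; no free() in this sketch

    SlabPool(int slabCount, int slabSize) {
        // A single allocateDirect() call at startup; the individual
        // "allocations" below are just slices of this region and do not
        // request any new native memory.
        this.region = ByteBuffer.allocateDirect(slabCount * slabSize);
        this.slabSize = slabSize;
    }

    // Hand out the next fixed-size slice of the region. The slice shares
    // the underlying native memory, so the per-allocation heap cost is
    // only the small ByteBuffer wrapper object.
    ByteBuffer allocate() {
        region.position(next).limit(next + slabSize);
        next += slabSize;
        ByteBuffer slab = region.slice();
        region.clear(); // reset position/limit for the next caller
        return slab;
    }

    public static void main(String[] args) {
        SlabPool pool = new SlabPool(4, 1024);
        System.out.println(pool.allocate().capacity()); // 1024
        System.out.println(pool.allocate().capacity()); // 1024
    }
}
```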

    --mark
  • 5. Re: Any plans for supporting Direct Byte Buffers?
    vinothchandar Newbie
    Hi Mark,

    Thanks for confirming the support plan for this.
    To use direct memory for INs we'd have to implement what databases traditionally call a buffer pool
    I was actually talking about simply pushing the IN::identifierKey off the heap.
    Anyway, I think I have my answer.

    Thanks!
  • 6. Re: Any plans for supporting Direct Byte Buffers?
    user4547579 Newbie
    Hi Mark and Vinoth,

    For what it's worth, I would also be interested in direct buffer based allocation. Very large heaps are difficult to manage and support. In addition, usage of DirectBuffers has potential advantages on NUMA based machines. The following recent concurrency-interest discussion may be of interest:

    http://cs.oswego.edu/pipermail/concurrency-interest/2013-February/010858.html

    If JE used direct buffers, then an application could partition data internally among NUMA nodes and avoid saturating the memory bus. This could be done crudely and less efficiently using thread-locals, or more precisely using JNI to figure out which threads belong to which NUMA nodes. Partitioning would have to be per JE environment rather than per DB, and would require the ability to pin JE's internal threads (cleaners, etc.) to specific NUMA nodes (i.e. all threads accessing a single JE environment would be pinned to a single common NUMA node). I'm not sure how feasible this is in practice; it's certainly not possible using plain Java APIs.
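    To illustrate the "crude" thread-local variant, a hypothetical sketch (PartitionRouter is invented here, the partitions stand in for per-NUMA-node JE environments, and actually pinning threads to NUMA nodes would still need JNI or OS tools like pbind/numactl):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PartitionRouter {
    static final int PARTITIONS = 4; // e.g. one JE environment per NUMA node
    private static final AtomicInteger nextId = new AtomicInteger();

    // Each thread is lazily assigned one partition and keeps it for life,
    // so all of that thread's work is routed to a single partition.
    private static final ThreadLocal<Integer> partition =
            ThreadLocal.withInitial(() -> nextId.getAndIncrement() % PARTITIONS);

    static int partitionForCurrentThread() {
        return partition.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println(
                Thread.currentThread().getName() + " -> partition "
                        + partitionForCurrentThread());
        Thread t1 = new Thread(task, "worker-1");
        Thread t2 = new Thread(task, "worker-2");
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```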

    The potential scalability improvements are very significant: in some JE-based benchmarks (32 GB JVM + 24 GB DB cache) I performed a couple of years ago, I compared a single-board non-NUMA 16-core machine with a 4-board NUMA-based 32-core machine (8 cores per board) with otherwise identical architecture (same chips, memory, etc.). I don't have the figures to hand any more, unfortunately, but I do remember clearly that the 16-core machine outperformed its bigger and more expensive brother by a significant margin. Collaboration with hardware/OS engineers at the time showed that the bottleneck was the memory bus.

    Matt
  • 7. Re: Any plans for supporting Direct Byte Buffers?
    greybird Expert
    Hi Matt!
    For what it's worth, I would also be interested in direct buffer based allocation. Very large heaps are difficult to manage and support. In addition, usage of DirectBuffers has potential advantages on NUMA based machines. The following recent concurrency-interest discussion may be of interest:
    I'll have to postpone comments on NUMA until I've had a chance to do some research. But I have a more immediate question about your comment on large heaps. When you say they're "difficult to manage and support", I assume this is due to GC issues. Is that right?

    If so, wouldn't these problems also be addressed if JE were to manage its cache memory using a fixed set of buffers allocated as regular byte arrays, rather than as direct buffers? This would require a large heap, and therefore compressed oops could not be used. But I assume GC would not be a problem since the buffers would be allocated at JE startup and freed at shutdown. Is there another drawback (not considering NUMA yet) that I'm missing?

    Thanks,
    --mark
  • 8. Re: Any plans for supporting Direct Byte Buffers?
    user4547579 Newbie
    Hi Mark,
    greybird wrote:
    Hi Matt!
    For what it's worth, I would also be interested in direct buffer based allocation. Very large heaps are difficult to manage and support. In addition, usage of DirectBuffers has potential advantages on NUMA based machines. The following recent concurrency-interest discussion may be of interest:
    I'll have to postpone comments on NUMA until I've had a chance to do some research. But I have a more immediate question about your comment on large heaps. When you say they're "difficult to manage and support", I assume this is due to GC issues. Is that right?

    If so, wouldn't these problems also be addressed if JE were to manage its cache memory using a fixed set of buffers allocated as regular byte arrays, rather than as direct buffers? This would require a large heap, and therefore compressed oops could not be used. But I assume GC would not be a problem since the buffers would be allocated at JE startup and freed at shutdown. Is there another drawback (not considering NUMA yet) that I'm missing?
    There are three issues with having very large heaps that I can think of off the top of my head:

    * GC impact on deterministic application response time: this is directly related to the number of live objects and hence the DB cache size. Using byte arrays would definitely help here regardless of how they are allocated.

    * JVM sizing: this is one of the most common problems we encounter. I suppose externalizing the cache via direct buffers is only going to change the sizing problem rather than eliminate it since users will still need to be careful not to exceed the amount of usable physical memory.

    * Detecting application memory leaks and other memory related issues. Quite often we ask users to take a heap dump for us to look at. This is clearly impractical for large JVMs due to email limitations and hardware limitations (e.g. I can't load a 32G heap dump on my laptop). I had assumed that DirectBuffers would not feature in a heap dump, but I'm not so sure now I think about it. Something worth checking...

    Matt
  • 9. Re: Any plans for supporting Direct Byte Buffers?
    greybird Expert
    Thanks, this info helps.
    * JVM sizing: this is one of the most common problems we encounter. I suppose externalizing the cache via direct buffers is only going to change the sizing problem rather than eliminate it since users will still need to be careful not to exceed the amount of usable physical memory.
    Do you find that users size the JE cache too large, not leaving enough within the heap for the app? Or the heap and JE cache too small, and performance suffers?

    --mark
  • 10. Re: Any plans for supporting Direct Byte Buffers?
    greybird Expert
    Matt,
    If JE used direct buffers, then an application could partition data internally among NUMA nodes and avoid saturating the memory bus. This could be done crudely and less efficiently using thread-locals, or more precisely using JNI to figure out which threads belong to which NUMA nodes. Partitioning would have to be per JE environment rather than per DB, and would require the ability to pin JE's internal threads (cleaners, etc.) to specific NUMA nodes (i.e. all threads accessing a single JE environment would be pinned to a single common NUMA node). I'm not sure how feasible this is in practice; it's certainly not possible using plain Java APIs.
    I read the concurrency post, thanks for the reference. I've been trying to think about how a single JE environment's cache could be partitioned and used by multiple NUMA nodes. It seems pretty difficult, especially because JE would allocate the direct buffers up front and then allocate memory for Btree information using a buffer pool approach.

    When you say that partitioning would be per JE environment, I don't understand how that would take advantage of multiple NUMA nodes. Is this in an app where you have multiple JE environments? Or are you thinking that the JE cache would be used in one NUMA node (I think that's what you described above), and other CPU intensive parts of the app use other nodes, and this would make good use of the processors?

    --mark
  • 11. Re: Any plans for supporting Direct Byte Buffers?
    user4547579 Newbie
    greybird wrote:
    Thanks, this info helps.
    * JVM sizing: this is one of the most common problems we encounter. I suppose externalizing the cache via direct buffers is only going to change the sizing problem rather than eliminate it since users will still need to be careful not to exceed the amount of usable physical memory.
    Do you find that users size the JE cache too large, not leaving enough within the heap for the app? Or the heap and JE cache too small, and performance suffers?

    --mark
    Both, actually, although it's often us that sizes the cache too small. The reason is that we have to provide a conservative default DB cache size during install. This is less of a problem these days with bigger default JVM sizes, but a few years ago we had to be much more conservative: the default JVM size was much smaller on a lot of machines, so a much greater proportion of the default heap was taken up by classes and other long-lived application data such as global constants, cached buffers, i18n strings, etc. Out of the box we had to opt for a default of 10% of the JVM size in order to cope with 256 MB or even 64 MB JVMs. We could have used a more complicated "ergonomic" algorithm that calculated the default size from the JVM size; that would have been better. Even simpler, though, would have been the ability to set a fixed default (e.g. 256 MB) regardless of the JVM size, producing more reliable results out of the box regardless of JVM size, version, or host OS.
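    The two default-sizing policies contrasted above can be sketched as follows (the figures are the ones from this post; CacheSizing and its method names are invented for illustration):

```java
public class CacheSizing {
    // Policy 1: a percentage of the JVM heap (the conservative 10% default
    // described above). Reasonable for big heaps, tiny for small ones.
    static long percentDefault(long maxHeapBytes) {
        return maxHeapBytes / 10;
    }

    // Policy 2: a fixed default (e.g. 256 MB) regardless of JVM size,
    // giving more predictable out-of-the-box behavior.
    static long fixedDefault() {
        return 256L * 1024 * 1024;
    }

    public static void main(String[] args) {
        long smallJvm = 64L * 1024 * 1024;      // 64 MB heap
        long bigJvm = 32L * 1024 * 1024 * 1024; // 32 GB heap
        System.out.println(percentDefault(smallJvm)); // ~6.4 MB: too small to be useful
        System.out.println(percentDefault(bigJvm));   // ~3.2 GB
        // In a real application, maxHeapBytes would come from
        // Runtime.getRuntime().maxMemory().
    }
}
```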

    The one that bites us most, though, is the case where the user oversizes the DB cache: for best performance they want to maximize their use of available heap memory without killing GC. The implication is that the user is forced to do some guesswork: the old-gen size, minus some application overhead for constants, classes, etc., minus some additional headroom for CMS to behave nicely (initiating occupancy threshold, fragmentation), which is definitely a black art. If the DB cache were externalized outside the heap, that would remove a big part of the guesswork and reduce the risk of a GC failure further down the road.

    Matt
  • 12. Re: Any plans for supporting Direct Byte Buffers?
    user4547579 Newbie
    Hi Mark,
    greybird wrote:
    Matt,
    If JE used direct buffers, then an application could partition data internally among NUMA nodes and avoid saturating the memory bus. This could be done crudely and less efficiently using thread-locals, or more precisely using JNI to figure out which threads belong to which NUMA nodes. Partitioning would have to be per JE environment rather than per DB, and would require the ability to pin JE's internal threads (cleaners, etc.) to specific NUMA nodes (i.e. all threads accessing a single JE environment would be pinned to a single common NUMA node). I'm not sure how feasible this is in practice; it's certainly not possible using plain Java APIs.
    I read the concurrency post, thanks for the reference. I've been trying to think about how a single JE environment's cache could be partitioned and used by multiple NUMA nodes. It seems pretty difficult, especially because JE would allocate the direct buffers up front and then allocate memory for Btree information using a buffer pool approach.
    I agree. I don't think that it is feasible to partition a single JE environment across multiple NUMA nodes, because it would be extremely difficult to control JE's internal memory usage and constrain memory access to specific threads.
    When you say that partitioning would be per JE environment, I don't understand how that would take advantage of multiple NUMA nodes. Is this in an app where you have multiple JE environments? Or are you thinking that the JE cache would be used in one NUMA node (I think that's what you described above), and other CPU intensive parts of the app use other nodes, and this would make good use of the processors?
    I'm talking about an app with multiple JE environments, one per NUMA node, where each JE environment would represent a single partition of the application data. It's probably a silly idea, since it would be much simpler to run multiple instances of the application process (one per partition), each bound to a single NUMA node (e.g. using pbind on Solaris).

    Matt
  • 13. Re: Any plans for supporting Direct Byte Buffers?
    greybird Expert
    Matt and Vinoth -- thanks for all your comments! This discussion has really helped to clarify the benefits of using direct buffers.
    --mark
