10 Replies Latest reply on Oct 12, 2012 6:09 PM by robvarga

    Coherence Data storage in object form

    948124
      As I understand it, Coherence always stores data in serialized form, i.e. the heap of a cache node is full of serialized data. Is there a way to store the data in object form, just as we do in a normal HashMap? I know that how the data is stored is transparent to the client, but keeping it in object form might still improve performance for reflection-based value extractors.

      Thoughts?
        • 1. Re: Coherence Data storage in object form
          Jonathan.Knight
          Hi,

          Personally I would stick with Binary storage. If you want to speed up extractors then you have a couple of choices. If the ReflectionExtractor is just calling a method that returns the value of a field then use PofExtractor as this is much more efficient. If the ReflectionExtractor is calling some other method that returns some computed value then create an Index using that extractor. Coherence will then use the index to return the result of the extractor rather than actually deserializing the binary.

          JK
          • 2. Re: Coherence Data storage in object form
            948124
            Thanks JK, but my question is: does Coherence support storing data (on the cache node heap) in object form, or does it always store the data in serialized form?
            • 3. Re: Coherence Data storage in object form
              robvarga
              945121 wrote:
              Thanks JK, but my question is: does Coherence support storing data (on the cache node heap) in object form, or does it always store the data in serialized form?
              It is ultimately up to the backing map manager and the backing map to decide how it actually stores it.

              Replicated caches actually can store either the binary or the Java object form of the value (key is always binary) depending on whether it has already been deserialized since last changed or not, so there is an example of storing Java objects.

              You need to know, though, that if you store Java object form, then sending it on the network involves serializing it which is bad for performance if reads are more frequent than writes, which is why Coherence stores the binary form (besides it consuming less memory).
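The read/write trade-off described above can be illustrated with a small toy model. This is only a sketch: `StorageFormDemo`, `BinaryStore` and `ObjectStore` are hypothetical names, plain Java serialization stands in for POF, and nothing here is a Coherence API. It simply counts how often a value has to be serialized for one write followed by several network reads.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these classes are Coherence APIs.
final class StorageFormDemo {
    // Counts how often a value had to be serialized.
    static int serializations = 0;

    static byte[] serialize(Serializable value) {
        serializations++;
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Binary store: serialize once on write, hand out the stored bytes on read.
    static final class BinaryStore {
        private final Map<String, byte[]> data = new HashMap<>();
        void put(String key, Serializable value) { data.put(key, serialize(value)); }
        byte[] readForNetwork(String key) { return data.get(key); }
    }

    // Object store: keep the live object, serialize on every network read.
    static final class ObjectStore {
        private final Map<String, Serializable> data = new HashMap<>();
        void put(String key, Serializable value) { data.put(key, value); }
        byte[] readForNetwork(String key) { return serialize(data.get(key)); }
    }

    // Serializations needed for one write followed by `reads` network reads.
    static int countBinary(int reads) {
        serializations = 0;
        BinaryStore store = new BinaryStore();
        store.put("A001", "John");
        for (int i = 0; i < reads; i++) {
            store.readForNetwork("A001");
        }
        return serializations;
    }

    static int countObject(int reads) {
        serializations = 0;
        ObjectStore store = new ObjectStore();
        store.put("A001", "John");
        for (int i = 0; i < reads; i++) {
            store.readForNetwork("A001");
        }
        return serializations;
    }

    public static void main(String[] args) {
        System.out.println("binary store serializations: " + countBinary(3));
        System.out.println("object store serializations: " + countObject(3));
    }
}
```

In this model the binary store pays the serialization cost once per write, while the object store pays it once per read, which is why a read-heavy workload favours the binary form.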

              Best regards,

              Robert
              • 4. Re: Coherence Data storage in object form
                robvarga
                945121 wrote:
                As I understand it, Coherence always stores data in serialized form, i.e. the heap of a cache node is full of serialized data. Is there a way to store the data in object form, just as we do in a normal HashMap? I know that how the data is stored is transparent to the client, but keeping it in object form might still improve performance for reflection-based value extractors.

                Thoughts?
                Btw, you can have an index created for IdentityExtractor if you want to have easy access to the Java object form without on-the-fly deserialization.

                Best regards,

                Robert
                • 5. Re: Coherence Data storage in object form
                  885867
                  Replicated caches actually can store either the binary or the Java object form of the value (key is always binary) depending on whether it has already been deserialized since last changed or not, so there is an example of storing Java objects.
                  You need to know, though, that if you store Java object form, then sending it on the network involves serializing it which is bad for performance if reads are more frequent than writes, which is why Coherence stores the binary form (besides it consuming less memory).
                  How about Java objects that implement the POF serializable interface? Can you elaborate on what you are alluding to?

                  We have a requirement to store data from the database in the form of POF objects.

                  1) For example, assume there are 4 tables in the database that store PersonID, Name, dob, ssn, address.
                  2) A POF object is created with all of those columns in a single object (usually by joining all 4 tables).
                  3) An index is added on a unique rowkey, which is PersonID, for faster lookups.
                  4) Our cache size is about 100 GB.
                  5) We are using parallel-aware aggregators for faster searches.
                  6) The client, which is a .NET client, exchanges the data in the form of POF objects. In other words, both request and response objects are POF objects.

                  Questions to clarify:
                  1) Are there any bottlenecks in the above design in terms of response time? For now our response times are a bit elevated, at 400-500 ms per call.
                  2) Is there a way to bring the response time down to <100 ms?

                  Let me know if you want to see any of our code/config; I will be happy to send it your way. Thanks. I would appreciate any insight into the serialization costs with POF objects, or any alternatives or recommendations that you may suggest.

                  Edited by: 882864 on Oct 8, 2012 11:54 AM
                  • 6. Re: Coherence Data storage in object form
                    robvarga
                    882864 wrote:
                    Replicated caches actually can store either the binary or the Java object form of the value (key is always binary) depending on whether it has already been deserialized since last changed or not, so there is an example of storing Java objects.
                    You need to know, though, that if you store Java object form, then sending it on the network involves serializing it which is bad for performance if reads are more frequent than writes, which is why Coherence stores the binary form (besides it consuming less memory).
                    How about Java objects that implement the POF serializable interface? Can you elaborate on what you are alluding to?
                    If you store the binary form of objects, then the complexity of sending them on the network is roughly equivalent to send(byte[]).

                    If you store the object form, then you have to jump through hoops to get the same data onto the network: either you write it in multiple pieces (individual PofWriter.write... method calls while traversing the entire object hierarchy you need to serialize), or you buffer it up, which requires preallocating a buffer, possibly a larger one than needed because the size is not known in advance. The overhead of the send is much higher in this case.

                    We have a requirement to store data from the database in the form of POF objects.

                    1) For example, assume there are 4 tables in the database that store PersonID, Name, dob, ssn, address.
                    2) A POF object is created with all of those columns in a single object (usually by joining all 4 tables).
                    3) An index is added on a unique rowkey, which is PersonID, for faster lookups.
                    4) Our cache size is about 100 GB.
                    5) We are using parallel-aware aggregators for faster searches.
                    6) The client, which is a .NET client, exchanges the data in the form of POF objects. In other words, both request and response objects are POF objects.

                    Questions to clarify:
                    1) Are there any bottlenecks in the above design in terms of response time? For now our response times are a bit elevated, at 400-500 ms per call.
                    2) Is there a way to bring the response time down to <100 ms?

                    Let me know if you want to see any of our code/config; I will be happy to send it your way. Thanks. I would appreciate any insight into the serialization costs with POF objects, or any alternatives or recommendations that you may suggest.

                    Edited by: 882864 on Oct 8, 2012 11:54 AM
                    Sorry, this is quite vague. Could you be more specific, please?

                    What are your entities? Person, I guess would be one. Name, DOB, SSN seem to be properties to me, I don't see the need for a separate table.
                    What is the cardinality of the relationship between the above entities/properties? E.g. can a person have multiple addresses? Can multiple person instances share the same address instance?
                    Do you have a single cache or multiple caches?
                    What is your cached object exactly? What is/are the cache key(s)?
                    What exactly are your typical searches?

                    Best regards,

                    Robert
                    • 7. Re: Coherence Data storage in object form
                      885867
                      NP, let me address each bullet so that we are on the same page... Thanks again.
                      What are your entities? Person, I guess would be one. Name, DOB, SSN seem to be properties to me, I don't see the need for a separate table.
                      Although they appear to be properties of a single entity, we have a distinct table for each of those properties. However, the cached POF object encapsulates all of those properties in a single POF object, a "Person".
                      What is the cardinality of the relationship between the above entities/properties? E.g. can a person have multiple addresses? Can multiple person instances share the same address instance?
                      PersonID- Name(1:1)
                      PersonID - dob(1:1)
                      PersonID - ssn(1:1)
                      PersonID- address (1:M)
                      Do you have a single cache or multiple caches?
                      single cache
                      What is your cached object exactly?
                      Person implements POFObject {

                      public String rowkey; // contains a unique value for each object stored in the cache, similar to an auto sequence number in the database
                      public String personID;
                      public String firstName;
                      public String lastName;
                      public String middleName;
                      public long ssn;
                      public Timestamp dob;
                      public Address address;

                      // get/set methods for each of the fields

                      }
                      What is/are the cache key(s)?
                      Person person = new Person();

                      person.setRowkey(MD5 generated sequence number);
                      person.setPersonID("A001");
                      person.setFirstName("John");
                      person.setxxx ....

                      Cache <key> and <Value> = <person> <person>
                      What exactly are your typical searches?
                      1) search by Person.getPersonID ("A001")
                      2) person.getFirstName like '%Joh%'
                      3) search by person.getSSN()
                      4) search by DOB
                      5) search by Address or any combinations for above columns mentioned

                      So, the .NET client contains "Person" in its POF config XML. When making a lookup request on the cache for any of the searches mentioned above, it sends a Person POF object with the values set for the search parameters. The cache performs the search based on the input parameters and returns one or many Person objects. For example, a search by person id returns more than one object if the person has more than one address; likewise, a search by ssn returns one Person object, and so on.

                      Edited by: 882864 on Oct 11, 2012 10:30 AM

                      Edited by: 882864 on Oct 11, 2012 10:46 AM
                      • 8. Re: Coherence Data storage in object form
                        robvarga
                        How many distinct person ids are in that 100 GB of data?

                        What is/are the cache key(s)?
                        Person person = new Person();

                        person.setRowkey(MD5 generated sequence number);
                        person.setPersonID("A001");
                        person.setFirstName("John");
                        person.setxxx ....

                        Cache <key> and <Value> = <person> <person>
                        This seems to be a bad idea to me. You should use either rowkey or person id (if it is unique) as a cache key. Using person id allows you to do the first query quite cheaply (as it would be a key-based get() request).
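The difference between the two key choices can be sketched with plain Java maps. This is an illustration only: `LeanKeyDemo` and its trimmed-down `Person` are hypothetical, and a `HashMap` stands in for the cache. With the person id as the key, the first search is a direct get(); without a usable key, the same search has to inspect every value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustration only: a HashMap stands in for the cache.
final class LeanKeyDemo {
    // Hypothetical, trimmed-down value class.
    static final class Person {
        final String personId;
        final String firstName;

        Person(String personId, String firstName) {
            this.personId = personId;
            this.firstName = firstName;
        }
    }

    // Cache keyed by the lean person id.
    static Map<String, Person> sampleCache() {
        Map<String, Person> cache = new HashMap<>();
        cache.put("A001", new Person("A001", "John"));
        cache.put("A002", new Person("A002", "Jane"));
        return cache;
    }

    // Keyed by person id: the lookup is a single hash-based get().
    static Person getByKey(Map<String, Person> cache, String personId) {
        return cache.get(personId);
    }

    // Without a usable key the same lookup has to visit the values one by one.
    static Optional<Person> findByScan(Map<?, Person> cache, String personId) {
        return cache.values().stream()
                .filter(p -> p.personId.equals(personId))
                .findFirst();
    }

    public static void main(String[] args) {
        Map<String, Person> cache = sampleCache();
        System.out.println(getByKey(cache, "A001").firstName);
        System.out.println(findByScan(cache, "A002").map(p -> p.firstName).orElse("not found"));
    }
}
```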


                        What exactly are your typical searches?
                        1) search by Person.getPersonID ("A001")
                        As I said above, if person id were the cache key, then this would be a simple get() request.
                        2) person.getFirstName like '%Joh%'
                        This is ugly to create an optimal index for. Can you make do with Joh%? If yes, then you can leverage a sorted index; otherwise a non-sorted index may be sufficient, and you can still benefit from the fact that there are not likely to be lots of different first names.
                        3) search by person.getSSN()
                        This can be indexed easily, or you can create another cache for the ssn (key) to person id (value) mapping (as it is not likely to change too frequently) instead of indexing.
                        4) search by DOB
                        This would benefit from a sorted index, if DOB search means you are searching in a date range. Again, materializing it into a reverse index cache may not hurt...
                        5) search by Address or any combinations for above columns mentioned
                        Search by address is a bit trickier to index. On the other hand, duplicating person because it has multiple addresses is a bad idea. You can store a list of addresses inside the Person object, and you can index it if necessary. This is likely going to save you more than the materialized indexes cost.

                        Combinations are easy: you can use an AndFilter, or an AllFilter for more than 2 criteria.

                        You can add multiple indexes and Coherence will likely use all relevant ones.
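Why a sorted index helps with Joh% but not with %Joh%, and how several indexes combine, can be sketched in plain Java. This is a toy model and not how Coherence implements its indexes: `SortedIndexDemo` is a hypothetical class in which a `TreeMap` maps an extracted value to the keys of the entries holding it. A prefix match becomes a range scan, and an AND of two criteria becomes an intersection of key sets.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of a sorted index: extracted value -> keys of matching entries.
final class SortedIndexDemo {
    private final TreeMap<String, Set<String>> index = new TreeMap<>();

    void add(String value, String key) {
        index.computeIfAbsent(value, v -> new HashSet<>()).add(key);
    }

    // Exact match: one lookup in the index.
    Set<String> equalsTo(String value) {
        Set<String> keys = new HashSet<>();
        Set<String> bucket = index.get(value);
        if (bucket != null) {
            keys.addAll(bucket);
        }
        return keys;
    }

    // Prefix match ("Joh%") as a range scan over [prefix, prefix + '\uffff').
    // Values that continue with '\uffff' itself are ignored in this sketch.
    Set<String> startsWith(String prefix) {
        SortedMap<String, Set<String>> range =
                index.subMap(prefix, prefix + Character.MAX_VALUE);
        Set<String> keys = new HashSet<>();
        for (Set<String> bucket : range.values()) {
            keys.addAll(bucket);
        }
        return keys;
    }

    // AND of two criteria: intersect the key sets produced by each index.
    static Set<String> and(Set<String> left, Set<String> right) {
        Set<String> result = new HashSet<>(left);
        result.retainAll(right);
        return result;
    }

    // Sample data: three people indexed by first name.
    static SortedIndexDemo sampleFirstNames() {
        SortedIndexDemo idx = new SortedIndexDemo();
        idx.add("John", "A001");
        idx.add("Johanna", "A002");
        idx.add("Jane", "A003");
        return idx;
    }

    // Sample data: the same three people indexed by last name.
    static SortedIndexDemo sampleLastNames() {
        SortedIndexDemo idx = new SortedIndexDemo();
        idx.add("Smith", "A001");
        idx.add("Jones", "A002");
        idx.add("Smith", "A003");
        return idx;
    }

    public static void main(String[] args) {
        Set<String> joh = sampleFirstNames().startsWith("Joh");
        Set<String> smith = sampleLastNames().equalsTo("Smith");
        System.out.println("firstName like 'Joh%': " + joh);
        System.out.println("... and lastName = 'Smith': " + and(joh, smith));
    }
}
```

A contains-style pattern such as %Joh% has no fixed starting point in the sort order, so in this model it would have to visit every distinct value in the index instead of a narrow range.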
                        So, the .NET client contains "Person" in its POF config XML. When making a lookup request on the cache for any of the searches mentioned above, it sends a Person POF object with the values set for the search parameters. The cache performs the search based on the input parameters and returns one or many Person objects. For example, a search by person id returns more than one object if the person has more than one address; likewise, a search by ssn returns one Person object, and so on.
                        As said above, duplicating person for multiple addresses is a bad idea.

                        Best regards,

                        Robert
                        • 9. Re: Coherence Data storage in object form
                          885867
                          Should I take this conversation in a new thread if it's getting too chatty? Thank you, Rob for your response.
                          How many distinct person ids are in that 100 GB of data?
                          Around 50 GB. The rest will contain data related to the person id, e.g. account info and so on.
                          This seems to be a bad idea to me. You should use either rowkey or person id (if it is unique) as a cache key
                          Well, you are right. However, we had a challenge supporting various searches by different parameters and their combinations if the key object contains just the personid, so here's what we did.

                          The key contains just the <searchable columns>, and the value object contains the searchable columns plus the other, non-searchable columns (a LARGER SET OF COLUMNS) and their values. As a result:
                          1) The key with the searchable columns has a smaller memory footprint, has indexes on all of those searchable columns, and supports complex queries.
                          2) Secondly, we don't have to pull the entire value object, which would incur additional deserialization costs.

                          Do you have any better recommendations to address this issue without compromising on the read performance at execution time? Please advise.
                          Can you make do with Joh%? If yes, then you can leverage a sorted index
                          yes, we are using that way and moved away from "%Joh%"
                          Search by address is a bit trickier to index. On the other hand, duplicating person because it has multiple addresses is a bad idea. You can store a list of addresses inside the Person object, and you can index it if necessary. This is likely going to save you more than the materialized indexes cost.
                          It's getting interesting. Well, that's a good idea indeed. If we have a collection object for addresses, does it create any locks during write operations to the cache when two or more threads compete to update the address collection? Are there any overheads or downsides to it? Please advise.
                          • 10. Re: Coherence Data storage in object form
                            robvarga
                            882864 wrote:
                            Should I take this conversation in a new thread if it's getting too chatty? Thank you, Rob for your response.
                            How many distinct person ids are in that 100 GB of data?
                            Around 50 GB. The rest will contain data related to the person id, e.g. account info and so on.
                            I meant how many different person ids are there, not what their size is...
                            This seems to be a bad idea to me. You should use either rowkey or person id (if it is unique) as a cache key
                            Well, you are right. However, we had a challenge supporting various searches by different parameters and their combinations if the key object contains just the personid, so here's what we did.

                            The key contains just the <searchable columns>, and the value object contains the searchable columns plus the other, non-searchable columns (a LARGER SET OF COLUMNS) and their values. As a result:
                            However, as far as I remember, this is not exactly what you described earlier. What you described there was that the key and value types are the same.
                            1) The key with the searchable columns has a smaller memory footprint, has indexes on all of those searchable columns, and supports complex queries.
                            If you need to deserialize what you are querying, you usually have a problem. You can use POF and put the searchable fields at the beginning of the value (lowest property ids); you get the same transient footprint and a smaller permanent and result-set footprint, as you can still use lean cache keys.
                            2) Secondly, we don't have to pull the entire value object, which would incur additional deserialization costs.

                            Do you have any better recommendations to address this issue without compromising on the read performance at execution time? Please advise.
                            As said above, use lean keys, with which you can implement key-based access for a frequently used scenario. POF-based extraction would give you equal or better performance than deserializing a big key if the searchable fields have the lowest property ids. When querying an indexed field, no deserialization is going to happen, and the transient footprint of creating the indexes with POF extractors is not going to be higher than that of creating them from a deserialized semi-thick key.
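The benefit of giving the searchable fields the lowest property ids can be illustrated with a toy binary layout. This sketch does not reproduce the POF wire format or any Coherence API: `FieldOrderDemo` is a hypothetical class that writes string fields in index order and, on extraction, reads sequentially and stops at the requested index. The assumption being illustrated is only that reaching a field means getting past the fields stored before it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Toy sequential format, NOT the POF wire format.
final class FieldOrderDemo {
    // Number of fields decoded by the most recent extract() call.
    static int fieldsDecoded = 0;

    // Writes the fields in index order; searchable fields should come first.
    static byte[] write(String... fields) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String field : fields) {
                out.writeUTF(field);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads fields one after another and stops once the requested index is reached.
    static String extract(byte[] binary, int index) {
        fieldsDecoded = 0;
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(binary));
            String value = null;
            for (int i = 0; i <= index; i++) {
                value = in.readUTF();
                fieldsDecoded++;
            }
            return value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Index 0 and 1 are the searchable fields, index 2 is a bulky non-searchable one.
        byte[] binary = write("A001", "John", "large non-searchable payload");
        System.out.println(extract(binary, 0) + " after decoding " + fieldsDecoded + " field(s)");
        System.out.println(extract(binary, 2) + " after decoding " + fieldsDecoded + " field(s)");
    }
}
```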
                            Can you make do with Joh%? If yes, then you can leverage a sorted index
                            yes, we are using that way and moved away from "%Joh%"
                            Search by address is a bit trickier to index. On the other hand, duplicating person because it has multiple addresses is a bad idea. You can store a list of addresses inside the Person object, and you can index it if necessary. This is likely going to save you more than the materialized indexes cost.
                            It's getting interesting. Well, that's a good idea indeed. If we have a collection object for addresses, does it create any locks during write operations to the cache when two or more threads compete to update the address collection? Are there any overheads or downsides to it? Please advise.
                            Each value is stored in serialized form when the entry is not being changed from an entry processor, so there are no locks to speak of for that.
                            If an entry processor deserializes the object, then there are no other threads to contend with, as Coherence guarantees exclusive access. So no locks are necessary for the Java object form either, unless your code needs them because the class is reused somewhere else outside of Coherence.
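The effect of per-entry exclusive access can be shown with an analogy in plain Java. This is not the Coherence EntryProcessor API: `ExclusiveUpdateDemo` is a hypothetical class that relies on `ConcurrentHashMap.compute`, which runs the update function atomically for a given key. The address list itself carries no lock, yet concurrent writers do not lose updates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Analogy only: ConcurrentHashMap.compute stands in for per-entry exclusive access.
final class ExclusiveUpdateDemo {
    private final ConcurrentHashMap<String, List<String>> addressesByPerson =
            new ConcurrentHashMap<>();

    // compute() applies the update atomically per key, so the list needs no lock of its own.
    void addAddress(String personId, String address) {
        addressesByPerson.compute(personId, (id, current) -> {
            // Copy-on-write keeps previously published lists unchanged for readers.
            List<String> updated = new ArrayList<>();
            if (current != null) {
                updated.addAll(current);
            }
            updated.add(address);
            return updated;
        });
    }

    int addressCount(String personId) {
        List<String> addresses = addressesByPerson.get(personId);
        return addresses == null ? 0 : addresses.size();
    }

    // Two threads append to the same person's address list concurrently.
    static int runTwoWriters(int perThread) {
        ExclusiveUpdateDemo demo = new ExclusiveUpdateDemo();
        Runnable writer = () -> {
            for (int i = 0; i < perThread; i++) {
                demo.addAddress("A001", "address-" + i);
            }
        };
        Thread first = new Thread(writer);
        Thread second = new Thread(writer);
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return demo.addressCount("A001");
    }

    public static void main(String[] args) {
        System.out.println("addresses stored: " + runTwoWriters(1000));
    }
}
```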

                            Best regards,

                            Robert