19 Replies Latest reply on Aug 11, 2020 6:07 PM by _Dylan_

# Makeshift Checksum

I came across a query at work today. It's trying to determine whether two sets contain the same IDs. It does this by calculating the standard deviation and average of the IDs in each set; any two sets with the same SD/Avg are considered a match. It seems to be working, but it doesn't feel like a bullet-proof way to compute a checksum (we're on Oracle 12.2.0.1 and I'm not aware of a checksum-like function in this version, but I could be wrong).

I started to see if I could find example values that create false matches. The things I know are:

• Order within the set doesn't matter.
• ID will be a positive integer from 1 to ~27,000
• A set will contain anywhere from 1 to ~1,200 IDs
• The universe of possible sets is smaller than the total number of combinations of all IDs across all set sizes. In reality, IDs are unique to an entity, and each set belongs to a single entity. The largest entity has about 1,200 IDs and can have some or all of those IDs in a set.

It seems difficult/impossible to cause a collision with a set of 2 numbers. But as the sets grow, it seems easier to get values to converge. I'm wondering if:

• anyone has seen this approach used to checksum a column.
• there's an existing function/method that actually does a checksum.
• there's a SQL/PL/SQL approach that could actually predict sets of IDs that have collisions (and not take years).

I know of other SQL approaches to verify that the sets match, so I'm not worried about that. I'm more focused on figuring out if/when the existing approach will break down. I was thinking about checking procedurally, or mathematically trying to backtrack into a formula that shows a breakdown somewhere.

```
-- Set 1: (1,2,3,4,5,6,7,8,9,10) | Avg = 5.5, SD = 3.0276503540974916654225328097181936992
-- Set 2: (2,3,4,5,6,7,8,9)      | Avg = 5.5, SD = 2.44948974278317809819728407470589139197
SELECT grp,
       AVG(id),
       STDDEV(id)
FROM (
       SELECT 1 AS grp, level AS id
       FROM dual
       CONNECT BY level <= 10
       UNION ALL
       SELECT 2 AS grp, level + 1 AS id
       FROM dual
       CONNECT BY level <= 8
     )
GROUP BY grp;
```
• ###### 1. Re: Makeshift Checksum

The approach is fatally flawed. In principle, if you don't add COUNT() to AVG() and STDDEV() you won't even know for sure the two sets have the same cardinality. Moreover, it is trivial to construct examples with the same COUNT, AVG and STDDEV, and yet the two sets don't even have a single ID in common. The smallest example:

set 1:    1, 5, 6

set 2:    2, 3, 7

For both sets, the average is 4 and the standard deviation is sqrt(14/3). That is the population standard deviation, which is what you - or whoever wrote that code - should be using. In your example you use STDDEV, which is the sample standard deviation; it is not clear why that should be used. Either way the flaw in that approach is the same; and, to be precise, the sample standard deviation of the two sets shown above is sqrt(7).
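Those values are easy to double-check outside the database; a quick sketch using Python's standard `statistics` module (where `pstdev` is the population standard deviation, i.e. Oracle's STDDEV_POP, and `stdev` is the sample standard deviation, i.e. Oracle's STDDEV):

```python
from math import isclose, sqrt
from statistics import mean, pstdev, stdev

s1 = [1, 5, 6]
s2 = [2, 3, 7]

# Same average for both sets
assert mean(s1) == mean(s2) == 4

# Population standard deviation: sqrt(14/3) for both
assert isclose(pstdev(s1), sqrt(14 / 3))
assert isclose(pstdev(s2), sqrt(14 / 3))

# Sample standard deviation (Oracle's STDDEV): sqrt(7) for both
assert isclose(stdev(s1), sqrt(7))
assert isclose(stdev(s2), sqrt(7))
```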

One trivial recipe to find "collisions" (I used it above) is to start with ANY set of integers with integer average (trivial to construct for any cardinality), then reflect each number over the set average. The reflection of a number x over another number a, in general, is the number 2*a - x. For example: the average of set 1 above is 4. Reflect the first element, 1, over 4, to get 7. (The meaning of "reflection" is that 1 and 7 sit on the opposite sides of 4, and at the same distance). Reflect 5 and 6 over 4, you get 3 and 2. In complete generality: any set of integers and its reflection about the average have the same COUNT, same AVG and same STDDEV (and same STDDEV_POP - which is the Oracle function for population standard deviation.)
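As a sketch, the reflection recipe in Python (the `reflect` helper is mine, purely for illustration):

```python
from statistics import mean, pstdev, stdev

def reflect(xs):
    """Reflect each element of xs over the set's average: x -> 2*a - x."""
    a = mean(xs)
    return [2 * a - x for x in xs]

s1 = [1, 5, 6]        # integer average: 4
s2 = reflect(s1)      # the reflection of s1 about its average

assert sorted(s2) == [2, 3, 7]
assert set(s1) != set(s2)            # different sets of IDs...
assert len(s1) == len(s2)            # ...same COUNT
assert mean(s1) == mean(s2)          # ...same AVG
assert pstdev(s1) == pstdev(s2)      # ...same STDDEV_POP
assert stdev(s1) == stdev(s2)        # ...same STDDEV
```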

Beyond that, I don't know what "checksum of a column" means. I have heard of checksum for a single number. Besides, in your case you are not checking (in whatever manner) "a column", but "a set of integers". I have no clue what you mean with any of that. I won't bore you with the correct way to check that two (unordered) sets of integers are the same, since you said you already know about that, and you aren't interested.

• ###### 2. Re: Makeshift Checksum

Take a look at the built-in Oracle function ORA_HASH() or perhaps the Oracle package DBMS_CRYPTO.

• ###### 3. Re: Makeshift Checksum

It's really easy to prove the existing approach is flawed from the outset and will have issues. Even with combinations of just 10 IDs I can find 116 collisions:

```
create or replace type ty_numbers is table of number;
/

with data as (
  select p.rn, v.column_value as id, p.column_value as id_set
  from (
         select rownum rn, column_value
         from table(powermultiset(ty_numbers(1,2,3,4,5,6,7,8,9,10)))
       ) p, table(column_value) v
),
test as (
  select rn, avg(id) avgid, stddev(id) stddevid
  from data
  group by rn
)
select avgid, stddevid, collect(rn order by rn), count(*)
from test
group by avgid, stddevid
having count(*) > 1;
```

The approach is rubbish and should be dropped.
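The same brute-force search is easy to mirror outside Oracle. A sketch in Python over all non-empty subsets of 1..10, grouping by the (AVG, STDDEV) pair as the query does; it uses exact `Fraction` arithmetic and groups on the variance rather than its square root (equivalent, since stddev is a monotone function of variance), so the printed count could in principle differ slightly from Oracle's rounded-NUMBER grouping:

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

IDS = list(range(1, 11))
groups = defaultdict(list)

for k in range(1, len(IDS) + 1):
    for subset in combinations(IDS, k):
        avg = Fraction(sum(subset), k)
        # sample variance, with Oracle's convention that STDDEV of one row is 0
        var = (Fraction(0) if k == 1
               else sum((Fraction(x) - avg) ** 2 for x in subset) / (k - 1))
        groups[(avg, var)].append(subset)

# (AVG, STDDEV) values shared by two or more distinct subsets
collisions = {key: subs for key, subs in groups.items() if len(subs) > 1}
print(len(collisions))
assert len(collisions) > 0
```

For example, the sets (1, 5, 6) and (2, 3, 7) from Reply 1 land in the same group, with average 4 and sample variance 7.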

• ###### 4. Re: Makeshift Checksum

Gaz in Oz wrote:

Take a look at the built-in Oracle function ORA_HASH() or perhaps the Oracle package DBMS_CRYPTO

I don't quite see how that would help in the OP's problem.

Hash functions (at least ORA_HASH) can indeed be used with nested tables, and the hash value does not depend on the order of elements. OK. But that means that the OP must first collect the ID's into nested tables - and then he could compare them directly, there would be no need for hashing. (If anything, if hashing is helpful in comparing two nested tables, one would hope that the Oracle engineers would have thought about that already, so it would be used in their proprietary implementation of "comparing nested tables".) Or if, instead, you were thinking of aggregating the ID's in some other way (perhaps LISTAGG), the same comment would apply: once they are already aggregated that way, they can be compared directly. (Of course, LISTAGG will crash at some point - the OP's inputs are up to five digits long and there may be up to 1,200 ID's in a set, resulting in aggregated strings longer than 4000 characters in some cases.)

But there are additional issues with hashing. Take ORA_HASH: it's a 32 bit hash. If you just consider all sets of six ID's between 1 and 27000, there are more than 5 * 10^23 of those, or about 2^78.  You need at least 64-bit hashes to have any hope to avoid massive collisions even for sets that small.  There are over 2^230 sets of cardinality 20 (distinct ID's between 1 and 27000); you don't even want to know how many subsets of 60 ID's there are. Yet the OP's ID sets may have cardinality up to 1,200.
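The counting argument is easy to verify directly; a quick sketch with Python's `math.comb`:

```python
from math import comb

# Sets of 6 distinct IDs drawn from 1..27000
n6 = comb(27000, 6)
print(f"{n6:.3e}")        # well over 5 * 10^23

# That is more than 2^78 sets, against only 2^32 possible ORA_HASH values,
# so massive collisions are unavoidable by the pigeonhole principle.
assert n6 > 5 * 10**23
assert n6 > 2**78
assert n6 > 2**32

# Sets of 20 distinct IDs: over 2^230 of them
assert comb(27000, 20) > 2**230
```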

EDIT

Using a query similar to Paulzip's, but computing ORA_HASH on the nested tables (instead of average and standard deviation), I found that there are 1,344 collisions for subsets of ID's just between 1 and 23. For example:

set 1   1, 2, 3, 5, 15, 17

set 2   1, 6, 7, 8, 10, 11, 12, 13, 16, 18, 21, 22

These are two sets of ID's, of cardinality 5 and 11 respectively, and the ID's are just between 1 and 22. Both have hash value (with the default bucket count and seed) of 3193200. Finding all 1,344 collisions for ID <= 23 took a little longer than 3 minutes on my machine; doing the same for longer hashes will take much longer, but the theoretical argument I presented above doesn't require concrete examples to support it.

• ###### 5. Re: Makeshift Checksum

Using your example, we can find small examples of sets with different cardinalities that have the same AVG and same STDDEV.  Modifying your example slightly, we can find the same for STDDEV_POP.

• ###### 6. Re: Makeshift Checksum

mathguy wrote:

Beyond that, I don't know what "checksum of a column" means. I have heard of checksum for a single number. Besides, in your case you are not checking (in whatever manner) "a column", but "a set of integers". I have no clue what you mean with any of that. I won't bore you with the correct way to check that two (unordered) sets of integers are the same, since you said you already know about that, and you aren't interested.

When referring to checksums, I was thinking of something along the lines of an MD5/SHA-1/SHA-256 file checksum, and wondering if a similar method to verify the contents of a file could be applied to a query result. I found a reference to a checksum function in the Oracle 20c Preview documentation which sounds like it will do this.

• ###### 7. Re: Makeshift Checksum

_Dylan_ wrote:

When referring to checksums, I was thinking of something along the lines of an MD5/SHA-1/SHA-256 file checksum, and wondering if a similar method to verify the contents of a file could be applied to a query result. I found a reference to a checksum function in the Oracle 20c Preview documentation which sounds like it will do this.

Hashing can be used to identify (with high probability - not with certainty) if something was changed. For a file, if someone tried to embed malicious code, it is exceptionally unlikely that the hash of the resulting file will be exactly the same as the hash of the original file.

What you found is a similar concept - to see if the content of a table has changed. Note that that, too, is a "quick check". I already gave a counter-example in Reply 4, at the bottom: suppose the Oracle-provided checksum function uses ORA_HASH in the background, and the table has a single column of data type NUMBER. If yesterday the table had six rows, with the six values in SET 1 at the bottom of Reply 4, and today the table has 12 rows, with the values in SET 2 (obviously my counts in Reply 4 were wrong!)  then the "checksum" will be the same, even though the table contents have changed. The assumption in the "checksum for changing tables" is that the table has many rows, and only a few may change, and with some luck we won't see collisions too often.

In your problem, you may use hashing (or even the silly thing your developers already did) to confirm quickly that two sets of ID's are NOT the same. But if the hashes are equal (or if the avg and stddev are the same), you can't conclude that the sets are identical. You will still need to use one of the "proper" ways to check for equality of sets, if "knowing with certainty" is critical in your application.
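To make that one-way guarantee concrete, here is a small Python sketch using `hashlib` (the fingerprint scheme is mine, purely illustrative): hashing the sorted IDs gives an order-independent fingerprint that can prove two sets are NOT equal, but equal fingerprints still call for a direct comparison if certainty matters.

```python
import hashlib

def id_set_fingerprint(ids):
    """Order-independent fingerprint: hash the sorted, comma-joined IDs."""
    payload = ",".join(str(i) for i in sorted(ids))
    return hashlib.sha256(payload.encode()).hexdigest()

a = [3, 1, 2]
b = [2, 3, 1]
c = [1, 2, 4]

assert id_set_fingerprint(a) == id_set_fingerprint(b)  # order doesn't matter
assert id_set_fingerprint(a) != id_set_fingerprint(c)  # proves NOT equal

# Equal fingerprints make equality very likely, but certainty still
# requires comparing the sets themselves:
assert sorted(a) == sorted(b)
```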

• ###### 8. Re: Makeshift Checksum

STANDARD_HASH() is perhaps what you are actually after.

The optional `method` argument lets you specify the name of the hash algorithm to be used. Valid algorithms are `SHA1`, `SHA256`, `SHA384`, `SHA512`, and `MD5`. If you omit this argument, then `SHA1` is used.

• ###### 9. Re: Makeshift Checksum

mathguy wrote:

The approach is fatally flawed. In principle, if you don't add COUNT() to AVG() and STDDEV() you won't even know for sure the two sets have the same cardinality. Moreover, it is trivial to construct examples with the same COUNT, AVG and STDDEV, and yet the two sets don't even have a single ID in common. The smallest example:

set 1: 1, 5, 6

set 2: 2, 3, 7

For both sets, the average is 4 and the standard deviation is sqrt(14/3). That is the population standard deviation, which is what you - or whoever wrote that code - should be using. In your example you use STDDEV, which is the sample standard deviation; it is not clear why that should be used. Either way the flaw in that approach is the same; and, to be precise, the sample standard deviation of the two sets shown above is sqrt(7).

One trivial recipe to find "collisions" (I used it above) is to start with ANY set of integers with integer average (trivial to construct for any cardinality), then reflect each number over the set average. The reflection of a number x over another number a, in general, is the number 2*a - x. For example: the average of set 1 above is 4. Reflect the first element, 1, over 4, to get 7. (The meaning of "reflection" is that 1 and 7 sit on the opposite sides of 4, and at the same distance). Reflect 5 and 6 over 4, you get 3 and 2. In complete generality: any set of integers and its reflection about the average have the same COUNT, same AVG and same STDDEV (and same STDDEV_POP - which is the Oracle function for population standard deviation.)

Beyond that, I don't know what "checksum of a column" means. I have heard of checksum for a single number. Besides, in your case you are not checking (in whatever manner) "a column", but "a set of integers". I have no clue what you mean with any of that. I won't bore you with the correct way to check that two (unordered) sets of integers are the same, since you said you already know about that, and you aren't interested.

Hi mathguy, I don't think I fully understand how reflection helps you find 'collisions', especially using your given example. For the given sets below we had

set 1: 1, 5, 6

set 2: 2, 3, 7

Why do we need to use reflection to find a 'number collision'? I thought the definition of 'collision' was having the same hash value or checksum. Thanks

• ###### 10. Re: Makeshift Checksum

In the original post, the OP talks about a particular way to verify if two sets of numbers are equal. They use the average and the standard deviation, not a hash function.

In that context, a "collision" means two different sets of numbers (which may even have different cardinalities) that have the same average and the same standard deviation. This is a more general definition of "collision" - the most general (theoretical) context is like this:  You have two sets X and Y, and a function from X to Y. For hashing, X is a set of strings (or of files, or of whatever) and Y is a set of numbers (for example, between 0 and 2^32 -1). For the OP's scenario, X is a set of SETS of ID's and Y is the set of PAIRS of numbers, the function associating to each set of ID's the pair (avg, stddev).  "Collision" means two distinct elements of X which have the same image in Y under the function you considered. As we learn in middle school, a function with no collisions is called a one-to-one (or "injective") function; one-to-one functions can tell us that elements from X are distinct with certainty, but most functions simply aren't one-to-one.

My point in my replies was that the number of sets of ID's in the OP's application is so much greater than the number of available hash values (even for 512-bit hashes, or even much longer than that) that the whole approach is doomed to fail.

As to the "reflection" thing - that is a simple method to construct examples of collisions for the specific approach of using average and stddev - NOT for hashing! I said later (in another reply) that we can find collisions for ORA_HASH using the brute-force method in Paulzip's reply; there is no simple analog of "reflection" that will help us construct collisions for hash functions (unless the hash function itself is total garbage; not the case for any of the hash functions mentioned in this thread).

• ###### 11. Re: Makeshift Checksum

Gaz in Oz wrote:

STANDARD_HASH() is perhaps what you are actually after.

The optional `method` argument lets you specify the name of the hash algorithm to be used. Valid algorithms are `SHA1`, `SHA256`, `SHA384`, `SHA512`, and `MD5`. If you omit this argument, then `SHA1` is used.

It doesn't appear to be an aggregate function, though.

I came across a similar requirement today. I have made a new version of a procedure that processes financial data, in the hope that it will perform better than the original. To test it, I need to prove that it

1. is not slower, and
2. gives the same results.

I run the new version in my dev database, and the old version in an unchanged test database. I'd like a convenient way to show that the results are the same. I wonder if such a thing is logically possible, without comparing the entire two datasets in some less convenient way (export to Excel, sort, save file, compare files in Beyond Compare etc).

• ###### 12. Re: Makeshift Checksum

William Robertson wrote:

Gaz in Oz wrote:

STANDARD_HASH() is perhaps what you are actually after.

The optional `method` argument lets you specify the name of the hash algorithm to be used. Valid algorithms are `SHA1`, `SHA256`, `SHA384`, `SHA512`, and `MD5`. If you omit this argument, then `SHA1` is used.

It doesn't appear to be an aggregate function, though.

I came across a similar requirement today. I have made a new version of a procedure that processes financial data, in the hope that it will perform better than the original. To test it, I need to prove that it

1. is not slower, and
2. gives the same results.

I run the new version in my dev database, and the old version in an unchanged test database. I'd like a convenient way to show that the results are the same. I wonder if such a thing is logically possible, without comparing the entire two datasets in some less convenient way (export to Excel, sort, save file, compare files in Beyond Compare etc).

I use hashing a lot to test these sorts of things. Where I've done some rework, say for performance reasons, or have had to add client-specific enhancements, but need to check that the generic calculations still work OK, I often log a set of before and after calculations to JSON in a specific order, DBMS_CRYPTO.HASH (say SHA1) those JSON samples, and if they match I'm confident I haven't broken it.
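Translated out of PL/SQL, the same before/after workflow looks like this in Python, with `json` and `hashlib` standing in for the JSON logging and DBMS_CRYPTO.HASH (SHA-1 as above; the field names are hypothetical):

```python
import hashlib
import json

def result_digest(rows):
    """Serialize rows to JSON in a stable order, then hash the result."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode()).hexdigest()

before = [{"acct": 1, "balance": "100.00"}, {"acct": 2, "balance": "55.10"}]
after  = [{"acct": 1, "balance": "100.00"}, {"acct": 2, "balance": "55.10"}]

# Matching digests -> confident the rework didn't change the calculations
assert result_digest(before) == result_digest(after)

# Any changed value shows up immediately
after[1]["balance"] = "55.11"
assert result_digest(before) != result_digest(after)
```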

• ###### 13. Re: Makeshift Checksum

...perhaps you would be better off ignoring *most* of my suggestions.

• ###### 14. Re: Makeshift Checksum

Paulzip wrote:

I use hashing a lot to test these sorts of things. Where I've done some rework, say for performance reasons, or have had to add client-specific enhancements, but need to check that the generic calculations still work OK, I often log a set of before and after calculations to JSON in a specific order, DBMS_CRYPTO.HASH (say SHA1) those JSON samples, and if they match I'm confident I haven't broken it.

Now that's a brilliant idea! Gotta add that one to my toolbox...
