1 2 3 Previous Next 40 Replies Latest reply on Jan 10, 2016 11:57 AM by 2790388 Go to original post
      • 30. Re: Hash and replace masking in oracle

        We are not masking Customer id that is the key across all the applications .

        Ok - but just so you are aware that is generally considered unacceptable and a potential security risk for many orgs. Because if you know a customer id you can usually use it to find sensitive info in other apps that have NOT been masked.

         

        Lets say you have medical data and a physician id. If you don't mask city and state in the U.S. and you know the info is for a heart surgeon. There are likely few heart surgeons in a small rural town.

         

        So if you know that a person is a heart surgeon and is located in that small town you might be able to use a phone book and narrow it down to just a handful of people.

         

        Unfortunately to do effective masking you have to mask ALL POSSIBLE links like that.

         

         

        We would be performing a lookup and replace based on a customer id. we are not performing based on customer name or phone numbers.

        i already had a re fence table with list of names , how should i map these names to a customer every time the customer appears in different tables i.e. consistent masking across

        the database. how would i know which name is used for a customer.

        So just replace rownum with cus_id in that simple example query I gave you.

         

        A customer id should be a one for one match with a customer name. You should NOT have two or more customer names for the same customer id.

         

        So the customer name should ONLY be in ONE customer table if your data is normalized. Other tables should be using customer id as a foreign key to the customer table.

         

        If your data is NOT normalized then you potentially have BIG, BIG problems. Someone could easily enter JOHN in one table and JOUN in another by mistake when they are supposed to be the same. With denormalized data you have no way to catch that problem for your actual application.

         

        Likewise you have no way to mask data like that properly either since you may not even know the data is wrong.

         

        Those are some of the problems you face and the prices you pay when you don't do things according to BEST PRACTICE.

        • 31. Re: Hash and replace masking in oracle
          2790388

          Hi rp0428 ,

           

          I'm bit confused . The reference names are present in a different table and the actual customer id and customer name is present in different table. Without a join on these tow tables how we are replacing the values ?

          • 32. Re: Hash and replace masking in oracle
            2790388

            Hi ,

             

            I had modified the query as below and got the execution time now improved from 38 minutes for 80,000 customers to 1.1 minute.

             

            CREATE TABLE TMP_CUS_CUS AS

              WITH ref_name_hash AS

                 (SELECT FIRST_NAME,

                         MIDDLE_NAME,

                         last_name,

                         row_number() over(order by FIRST_NAME, MIDDLE_NAME, LAST_NAME) - 1 FIRST_NAME_HASH

                    FROM ref_name),

            CUS_name_hash as

                (select CUS_id, CUS_FRST_NME, CUS_SNAM_NMe, CUS_OTHR_GVN_NME from CUS_CUS),

              BUCKET as

                (select count(*) max_buckets from ref_name)

             

            SELECT CUS_ID, N.FIRST_NAME, N.MIDDLE_NAME, N.LAST_NAME

                FROM CUS_name_hash c

               cross join BUCKET b

                JOIN ref_name_hash n

                  ON ora_hash( CUS_FRST_NME ,CUS_SNAM_NME ,CUS_OTHR_GVN_NME) ,

                              b.max_buckets) = n.FIRST_NAME_HASH;

                             

                              select count(CUS_ID) , first_name || middle_name || last_name  from TMP_CUS_CUS

                              group by  first_name || middle_name || last_name 

                              having count(CUS_id) > 1;

             

            Thanks

            • 33. Re: Hash and replace masking in oracle
              Paulzip

              rp0428 wrote:

               

              You can't declare a new function (Mapp) in the middle of a code block. Declarations have to come BEFORE any code.

               

              That responder did NOT post useable code. They posted one (poor) example of using hashing.

               

               

              I posted untested code snippets to act as a template for a solution and it was qualified as such "I don't have time...".  I also mentioned it should be used within a package for the stateful reasons.  It isn't a poor example either, cached hashing lookups (via a hash table or PL/SQL's nearest equivalent, an associative array) are used throughout software and hardware, it is a valid and extremely fast approach, in fact Oracle themselves use it internally.

              • 34. Re: Re: Hash and replace masking in oracle
                Paulzip

                Here's a simple example I've knocked up.  You will have empty buckets, as it's not a perfect hash, so I return the input string - that bit may need some work.  It does give an indication of performance though.


                create or replace package P_Hash is

                  function Map(pString varchar2) return varchar2;

                  function HashCount return integer;

                end;

                /

                 

                create or replace package body P_Hash is

                 

                type TAssocArray is table of varchar2(1000) index by PLS_Integer;

                vAssocArray TAssocArray;

                vBucketCount integer;

                 

                function HashCount return integer is

                begin

                  return vAssocArray.Count;

                end;

                 

                procedure InitMaps is

                begin

                  if vAssocArray is null or vAssocArray.Count = 0 then

                    select count(*)

                    into vBucketCount

                    from REF_NAME;

                    for rec in (select REF_NAME_VALUE, Ora_Hash(REF_NAME_VALUE, vBucketCount) HASH from REF_NAME)

                    loop

                      vAssocArray(rec.HASH) := rec.REF_NAME_VALUE;  -- Might need code to handle collisions, where AssocArray(vRefNameHash) has diff value already

                    end loop;

                  end if;

                end;

                 

                function Map(pString varchar2) return varchar2 is

                  vHash integer;

                begin

                  select Ora_Hash(pString, vBucketCount) into vHash from dual;

                  if vAssocArray.Exists(vHash) then

                    return vAssocArray(vHash);

                  else

                    return pString; -- Handling for empty bucket return default (needs work)

                  end if;

                end;

                 

                begin

                  InitMaps;  -- package body initialisation code

                end;

                /


                Test, create some dummy data....

                 

                create table REF_NAME as

                  select DBMS_RANDOM.string('A', 50) as REF_NAME_VALUE

                  from dual

                  connect by level < 80000

                 

                create table CUS_NAME as

                with names(NAME) as (

                  select 'JOHN' from dual union all

                  select 'MATT' from dual union all

                  select 'PHIL' from dual union all

                  select 'MARK' from dual union all

                  select 'PAUL' from dual union all

                  select 'JEFF' from dual union all

                  select 'CARL' from dual union all

                  select 'DYLAN' from dual union all 

                  select 'MATTHEW' from dual

                ),

                surname_prefix(NAME) as (

                  select 'JOHN' from dual union all

                  select 'SVEN' from dual union all

                  select 'SMITH' from dual union all

                  select 'MARK' from dual union all     

                  select 'ETHEL' from dual union all

                  select 'ERIC' from dual union all 

                  select 'REDD' from dual union all

                  select 'GREEN' from dual union all

                  select 'HICK' from dual union all   

                  select 'VICK' from dual union all

                  select 'GRAY' from dual union all 

                  select 'GRUN' from dual union all 

                  select 'HENDER' from dual union all 

                  select 'LU' from dual union all

                  select 'LUND' from dual union all   

                  select 'BUCK' from dual union all 

                  select 'BECK' from dual 

                ),

                surname_middle(NAME) as (

                  select 'ING' from dual union all

                  select '-' from dual union all

                  select ' ' from dual union all

                  select '' from dual

                ), 

                surname_suffix(NAME) as (

                  select 'SON' from dual union all

                  select 'MAN' from dual union all

                  select 'BET' from dual union all

                  select 'ARD' from dual union all     

                  select 'EN' from dual union all 

                  select 'ICK' from dual union all 

                  select 'ANGE' from dual union all

                  select 'WAY' from dual union all 

                  select 'TON' from dual union all 

                  select 'TONN' from dual union all 

                  select 'WICK' from dual union all 

                  select 'ERD' from dual union all 

                  select 'AN' from dual union all 

                  select 'ON' from dual union all 

                  select '' from dual 

                )

                select n.NAME FIRST_NAME, m.NAME as MIDDLE_NAME, p.NAME||sm.NAME||s.NAME as SURNAME

                from names n

                cross join names m

                cross join surname_prefix p

                cross join surname_middle sm

                cross join surname_suffix s

                where n.NAME <> m.NAME

                 

                select P_Hash.Map(FIRST_NAME),

                       P_Hash.Map(MIDDLE_NAME),

                       P_Hash.Map(SURNAME)

                from CUS_NAME

                 

                 

                Result time for me, 8 seconds for 73440 records (or 220320 mapped hash lookups)

                • 35. Re: Hash and replace masking in oracle

                  I'm bit confused . The reference names are present in a different table and the actual customer id and customer name is present in different table. Without a join on these tow tables how we are replacing the values ?

                  Clarify what you are confused about. My comments don't include anything saying not to use a join.

                  • 36. Re: Hash and replace masking in oracle

                    I posted untested code snippets to act as a template for a solution and it was qualified as such "I don't have time...".  I also mentioned it should be used within a package for the stateful reasons.

                    And that is EXACTLY what I was telling OP - that they can't take what you posted and try to execute it. That all you were posting was for illustration.

                      It isn't a poor example either, cached hashing lookups (via a hash table or PL/SQL's nearest equivalent, an associative array) are used throughout software and hardware, it is a valid and extremely fast approach, in fact Oracle themselves use it internally.

                    Yes - it IS a poor example IMHO for the reason I stated:

                    That will not, and can not, produce a viable, performant solution that will work for your entire issue.

                    . . .

                    Using a 'hash' method is NOT scaleable because if you need to expand the number of hash values it will reek havoc on any values previously hashed since the same value may now hash differently.

                    The issue isn't the hashing, per se, it is the way it is used.

                     

                    1. If the number of hash buckets changes it affects the ENTIRE dataset. Any previously hashed data may have to be rehashed since the hash value may change. Hashing a value into 1 of 4 buckets may produce a '3' for a given value. Hash that same value into 1 of 16 buckets and it may hash to an '11'.

                     

                    Any data already hashed will have '3's and data hashed later will use '11'. That affects scalabililty since data is frequently processed in batches.

                     

                    2. It can be EXTREMELY inefficient to make multiple function calls for each row of a query

                    Usage

                     

                    select MyPackage.Map(FIRST_NAME), MyPackage.Map(LAST_NAME)

                    from CUST_DATA

                     

                    No joins, fast cached lookup.

                    Even that simple example is making TWO function calls for every row of the result set. That number increases when there are more columns that need to be mapped.

                     

                    Every one of those function calls requires a context switch between the SQL and PL/SQL engines and those are costly. It is possible that SOME of that cost can be mitigated with the latest result 'caching' functionality available but that is unreliable and difficult to design so as to ENSURE that you get the advantage of that. Even a slight change to code/queries can cause performance to change unpredictably.

                     

                    An index on a 'NEW_NAME' column of a mapping table with a unique key of 'OLD_COLUMN' would be very compact any cached index blocks would tend to give benefits similar to hashing. Using joins to those lookup tables, even many of them, may look complex but totally eliminate any context switches and Oracle can manage them VERY effectively.

                     

                    So, also as I said, a many-column mapping process can often require PL/SQL code and row-by-row processing.

                     

                    For that 'solution' the use of hash tables becomes performant. Because instead of CAUSING context switches from SQL to PL/SQL as that simple query above does, Using hash tables in the code PREVENTS context switching in the other direction: from PL/SQL to SQL.

                     

                    That is because the hash tables are already in the PL/SQL context and so is the 'lookup' of the hash value to get the replacement.  The ORA_HASH function is available in BOTH the SQL and the PL/SQL engines so hashing a value in PL/SQL doesn't cause a context switch and neither does the lookup into  the collection.

                     

                    And the function calls won't even be needed since the code can do a direct lookup on the collection. But even if the function calls were needed, or just desired for modularity, they won't cause context switches either.

                    cached hashing lookups (via a hash table or PL/SQL's nearest equivalent, an associative array) are used throughout software and hardware, it is a valid and extremely fast approach, in fact Oracle themselves use it internally.

                    Correct - but they are valid and fast ONLY in the context of their use.

                     

                    When used in a way that CAUSES multiple, repeated context switches they can be both inappropriate and slow 'solutions'.

                     

                    When used within PL/SQL in a way that PREVENTS those multiple, repeated context switches they can be appropriate and fast.

                     

                    It all depends on HOW they are used. Your specific, two-column, example would prove to be problematic in the most common use cases involving multiple mappings per row. That is the context in which I said that 'function/hash' solution was a poor one for OPs use case.

                    • 37. Re: Hash and replace masking in oracle
                      Paulzip

                      It seems we understand the OP's problem differently.  I was under the impression the OP is only hash mapping on specific identifiable information - like the customer names etc.  In #3. I ask him if it is an obsfucation process and where it is occuring (because that requirement would massively affect any solution I recommended)..  and he answered about "getting a random name from reference table and applying them consistently for a customer".  So I assumed a few customer fields - and my approach is ideal for a small number of customer name fields.

                       

                      If he's planning a huge obfuscation exercise across large amounts of data, then he should've said so, then of course I wouldn't suggest such a hash map approach.  My solution as is most solutions to problems, requirements specific.

                      • 38. Re: Hash and replace masking in oracle

                        It seems we understand the OP's problem differently

                        Tough to tell since OP still hasn't really told us the actual BUSINESS PROBLEM he is trying to solve.

                         

                        Instead, like others, they are wanting help with the 'solution' they are intent on implementing.

                         

                        Don't know if they want to use 'random' values from that list or map based on cus_id or other.

                         

                        And that 'lack of requirements' is a typical problem when developing processes like OPs. You need to try to implement draft versions in a way that won't require a total rewrite if things change. Otherwise you can't even make incremental progress for a small set of data since the next set might get processed differently.

                         

                        That is why for new development I prefer putting data into tables and using those tables directly (e.g. in queries) for the early dev stages; at least until the requirements and scope of the project are well established.,

                         

                        Data in tables is easy to see. Queries on tables are easy to debug; much easier than debugging queries and collections in pl/sql.

                         

                        Once the code and/or process actually works the underlying mechanism can be changed if necessary for performance or other reasons.

                         

                        I especially recommend that 'tables first' method for less experienced developers

                        NOTE:- I'm new to oracle technology and my questions seems to be silly. Please guide me.

                        I have a strong preference for implementing things in the simplest possible way and only using advanced techniques when they are necessary to deal with a particular aspect (e.g. performance) of the use case.

                         

                        1. Use SQL whenever possible

                        2. Use PL/SQL when SQL alone can't do the job

                        3. Use BULK Sql with limits and collections only when PL/SQL and embedded SQL can't do the job

                        4. Use Dynamic SQL only when the above can't do the job

                         

                        IMNO new developers should focus on producing SOMETHING that actually works and produces the correct results. They should AVOID trying to create an optimal design and implementation on-the-fly before they understand ALL of the requirements and the potential implication of changes to those requirements that are likely to occur.

                         

                        I find it is relatively easy to 'tune' or refactor a solution that works. But find it very difficult to try to even understand, let alone fix a partial 'solution' that doesn't work properly and also doesn't implement all of the currently known requirements.

                         

                        You have a history of providing good examples and practical solutions,.

                         

                        I do NOT disagree with the individual components of what you proposed (hashing, package initialized collections, et al) only with the particular manner in which those components are used together to address OP's, mostly unclear, requirements.

                        • 39. Re: Hash and replace masking in oracle
                          Paulzip

                          Yes, it certainly is often a HUGE problem on this forum to get accurate requirements from many posters, I have a handful of noted ones who truly are incapable of describing anything, even simple scenarios, to the point where it becomes an iterative cycle of time wasting.  Some appear to use the forum as a coding workshop, to write all of their code.

                           

                          I totally agree with your 4 step priority order, categorically so - I generally do this religiously - I think anyone with good Oracle dev skills does, it's correct and not debatable. I do bend the rules on some occasions, for example, where a poster explicitly asks for a PL/SQL function.  I tend to read between the lines and assume they've greatly simplified things or have a specific business requirement for it.

                           

                          If I was solving a data masking situation in my field of work, I would use Oracle's own toolset, Oracle Data Masking Pack, no question.  I wouldn't reinvent the wheel or try to compete with 1000s of man hours worth of enterprise development.  But I originally thought it was a very simple 2-3 field mask problem, so thought the ODMP was a hammer to a nut.  Hence the little package.  It was also a proof of concept puzzle for me, which is always fun.

                           

                          It's nice to get feedback and bang a few ideas around with people, like yourself, who have an extensive knowledge of the wonderful beast that is Oracle. Cheers.

                          • 40. Re: Hash and replace masking in oracle
                            2790388

                            @paul and @rp0248 thanks for your suggestion from my next post i will list the requirements in details for all the scenarios.

                            Thanks for your guidance.

                            1 2 3 Previous Next