Forum Stats

  • 3,838,560 Users
  • 2,262,382 Discussions
  • 7,900,688 Comments

Discussions

Previous duplicate row removal - Reg

User_U1SBG
User_U1SBG Member Posts: 12 Red Ribbon

Hi Team,

Good Morning

I have following sample data

CREATE TABLE sample(

  CHILD_NUMBER INTEGER NOT NULL PRIMARY KEY 

 ,PARENT_NUMBER INTEGER 

 ,KEY_VALUE   VARCHAR(2) NOT NULL

 ,FLOW_NODE   INTEGER NOT NULL

);

INSERT INTO sample(CHILD_NUMBER,PARENT_NUMBER,KEY_VALUE,FLOW_NODE) VALUES (95,NULL,'P1',1);

INSERT INTO sample(CHILD_NUMBER,PARENT_NUMBER,KEY_VALUE,FLOW_NODE) VALUES (96,95,'P1',2);

INSERT INTO sample(CHILD_NUMBER,PARENT_NUMBER,KEY_VALUE,FLOW_NODE) VALUES (96,95,'P1',1);

INSERT INTO sample(CHILD_NUMBER,PARENT_NUMBER,KEY_VALUE,FLOW_NODE) VALUES (98,NULL,'P2',1);

INSERT INTO sample(CHILD_NUMBER,PARENT_NUMBER,KEY_VALUE,FLOW_NODE) VALUES (99,98,'P2',2);

INSERT INTO sample(CHILD_NUMBER,PARENT_NUMBER,KEY_VALUE,FLOW_NODE) VALUES (99,98,'P2',1);


child_number duplicate should be filtered since, it has already available with flow_node 2. Agains it should not be repeated for flow_node 1

my expected output



Tagged:

Answers

  • Arun Kumar Gupta
    Arun Kumar Gupta Member Posts: 1,029 Gold Badge
    edited Jul 7, 2022 7:26AM

    Try this.


    SELECT CHILD_NUMBER,PARENT_NUMBER,KEY_VALUE,FLOW_NODE
     FROM (
         SELECT a.*
            ,ROW_NUMBER() OVER (PARTITION BY CHILD_NUMBER,PARENT_NUMBER,KEY_VALUE ORDER BY FLOW_NODE DESC) ranking
          FROM sample   a
        )
     WHERE ranking = 1
    


    If you want to check for duplicate only in CHILD_NUMBER column then just use only this column in PARTITION BY clause instead of 3 columns in above query.

    BTW if you have primary key on child_number column it wont have duplicates.


    Regards

    Arun

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 42,243 Red Diamond

    Hi, @User_U1SBG

    Thanks for posting the sample data; that's very helpful.

    CREATE TABLE sample(

     CHILD_NUMBER INTEGER NOT NULL PRIMARY KEY 

    ...

    INSERT INTO sample(CHILD_NUMBER,PARENT_NUMBER,KEY_VALUE,FLOW_NODE) VALUES (96,95,'P1',2);

    INSERT INTO sample(CHILD_NUMBER,PARENT_NUMBER,KEY_VALUE,FLOW_NODE) VALUES (96,95,'P1',1);

    If child_number is the primary key, then you can't insert two rows with child_key=96. I had to remove the PRIMARY KEY constraint in order to make the INSERT statements work.


    Agains it should not be repeated for flow_node 1

    So, when the table contains two (or more) rows with the same combination of child_number, parent_number and key_value, the output should have only one row. That sounds like a job for GROUP BY, for example:

    SELECT   child_number, parent_number, key_value
    ,	 MAX (flow_node)  AS flow_node
    FROM	 sample
    GROUP BY child_number, parent_number, key_value
    ORDER BY child_number, parent_number, key_value -- or whatever you want
    ;
    

    Aggregate functions are usually more efficient than analytic functions, such as ROW_NUMBER.

    MATTEO RAPPAZZO