10 Replies Latest reply: Mar 1, 2012 1:26 PM by jihuyao

    performance

    RN
      In a staging table with 2 million records:

      I'm trying to find duplicates using two different approaches:
      1. using RANK
      2. joining the table to itself, e.g. select * from table1 a where rowid > (select min(rowid) from table1 b where ....)

      The cost of both queries comes out around 5-6.
      But when I run them, each takes more than 15 minutes (I canceled the execution after that).

      How can I optimize them? Maybe work on a small fraction of the data at a time? Can somebody advise me, please?

      Let me know if I need to provide more information.

      Thanks,
      RN
        • 1. Re: performance
          sb92075
          RN wrote:
          In a staging table with 2 million records:

          I'm trying to find duplicates using two different approaches:
          1. using RANK
          2. joining the table to itself, e.g. select * from table1 a where rowid > (select min(rowid) from table1 b where ....)

          The cost of both queries comes out around 5-6.
          But when I run them, each takes more than 15 minutes (I canceled the execution after that).

          How can I optimize them? Maybe work on a small fraction of the data at a time? Can somebody advise me, please?

          Let me know if I need to provide more information.

          Thanks,
          RN
          select emp_id, count(*) from emp group by emp_id having count(emp_id) > 1
          • 2. Re: performance
            JustinCave
            If the cost is in the single digits and you're scanning a 2 million row table, I would strongly suspect that the statistics on the table (or its indexes) are incorrect (missing, out of date, etc.). That's likely causing Oracle to massively underestimate the number of rows it is going to have to look at, which in turn causes it to choose a poor query plan.

            - Are the statistics on the table accurate?
            - Can you post the queries you're using and the query plans?
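
            For reference, a rough sketch of both steps (the schema, table, and column names below are placeholders, not from this thread; DBMS_STATS and DBMS_XPLAN are the standard packages):

            -- gather fresh optimizer statistics on the staging table and its indexes
            BEGIN
                DBMS_STATS.GATHER_TABLE_STATS(
                    ownname => 'YOUR_SCHEMA',
                    tabname => 'YOUR_STAGING_TABLE',
                    cascade => TRUE);
            END;
            /

            -- generate and display the plan for one of the duplicate-finding queries
            EXPLAIN PLAN FOR
            SELECT emp_id, COUNT(*) FROM your_staging_table GROUP BY emp_id HAVING COUNT(*) > 1;

            SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);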

            Justin
            • 3. Re: performance
              RN
              Thanks for your response. Actually, I have to delete those dupes too.
              I am already using count(*) > 1 in the RANK approach.

              For example, per http://www.dba-oracle.com/t_delete_duplicate_table_rows.htm#rank:
              delete from $table_name where rowid in
                (
                select "rowid" from
                   (select "rowid", rank_n from
                       (select rank() over (partition by $primary_key order by rowid) rank_n, rowid as "rowid"
                           from $table_name
                           where $primary_key in
                              (select $primary_key from $table_name
                                group by $all_columns
                                having count(*) > 1
                              )
                           )
                       )
                   where rank_n > 1
                )
              Thanks a lot!
              • 4. Re: performance
                32685
                RN wrote:
                Thanks for your response. Actually, I have to delete those dupes too.
                I am already using count(*) > 1 in the RANK approach.

                For example, per http://www.dba-oracle.com/t_delete_duplicate_table_rows.htm#rank:
                delete from $table_name where rowid in
                  (
                  select "rowid" from
                     (select "rowid", rank_n from
                         (select rank() over (partition by $primary_key order by rowid) rank_n, rowid as "rowid"
                             from $table_name
                             where $primary_key in
                                (select $primary_key from $table_name
                                  group by $all_columns
                                  having count(*) > 1
                                )
                             )
                         )
                     where rank_n > 1
                  )
                Thanks a lot!
                Christ... talk about overcomplicating it!
                DELETE
                FROM
                    $table_name
                WHERE rowid IN
                    (   SELECT
                            LEAD(r.ROWID) OVER (PARTITION BY <columns you use to determine a duplicate> ORDER BY <something if you need to>)
                        FROM
                            $table_name r
                    )
                /
                Just to clarify ORDER BY <something if you need to>: if there is a date or a sequential number that you want to use to pick which row to keep, then use it; if there is nothing and you just want to get rid of all duplicates, use ORDER BY NULL.
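
                As a concrete sketch (the EMP table and its LOAD_DATE column here are hypothetical, just to show the shape), keeping the most recently loaded row per EMP_ID would be:

                -- hypothetical example: keep the newest row per emp_id, delete the rest
                DELETE FROM emp
                WHERE rowid IN
                    (   SELECT LEAD(e.ROWID) OVER (PARTITION BY e.emp_id ORDER BY e.load_date DESC)
                        FROM emp e
                    )
                /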

                Cheers

                David

                Edited by: Bravid on Mar 1, 2012 5:38 PM
                • 5. Re: performance
                  RN
                  Thanks for your response, Justin.

                  You are absolutely right. I am working in dev. There are no statistics captured on the table.

                  1. So should I ask the DBA to gather statistics?
                  After that, the optimizer can come up with a better plan because it knows how much data it needs to work on. But then the cost may come out higher too, right?
                  And it can still take longer.
                  2. Should I try this approach: fetch the min and max PK of the data set,
                  work on the first 10,000 records, delete the dupes from them, and then move on to the next 10,000?
                  Does that make sense?


                  Thanks,
                  RN
                  • 6. Re: performance
                    JustinCave
                    RN wrote:
                    1. So should I ask the DBA to gather statistics?
                    After that, the optimizer can come up with a better plan because it knows how much data it needs to work on. But then the cost may come out higher too, right?
                    And it can still take longer.
                    You want the cost to be accurate. You shouldn't care whether the cost is "high" or "low". You care that the optimizer's estimates are accurate because when the optimizer's estimates are accurate, it's very likely picking the best plan.

                    Just like in the real world, you don't want estimates to be exceedingly low or exceedingly high. If a teammate estimates that he can single-handedly finish, in 30 seconds, a project you reckon should take a team of 4 people a month, you'd generally assume there had been a miscommunication and that he was estimating the difficulty of a very different problem than you intended, rather than getting upset when the project wasn't finished 30 seconds later. In the same way, you want Oracle's estimates to be accurate rather than small, and the way you do that is to communicate with the optimizer via statistics so that it can figure out how best to optimize the query.

                    99 times out of 100, you should ignore the COST. Focus on the cardinality estimates and on getting them to be accurate.
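
                    One way to check that, as a sketch (the GATHER_PLAN_STATISTICS hint and DBMS_XPLAN.DISPLAY_CURSOR are standard Oracle tooling; the query and table name are placeholders):

                    -- run the query once with runtime row counts collected
                    SELECT /*+ GATHER_PLAN_STATISTICS */ emp_id, COUNT(*)
                    FROM   your_staging_table
                    GROUP  BY emp_id
                    HAVING COUNT(*) > 1;

                    -- then compare E-Rows (estimated) with A-Rows (actual) line by line
                    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));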
                    2. Should I try this approach: fetch the min and max PK of the data set,
                    work on the first 10,000 records, delete the dupes from them, and then move on to the next 10,000?
                    Does that make sense?
                    No. That's going to be slower than doing it in a single SQL statement.


                    Justin
                    • 7. Re: performance
                      RN
                      Thanks again, Justin. I analyzed the table and it worked wonders.

                      Yes, the cost is super duper high.
                      Cost with the self-join approach = 55755, and it took 105 seconds to fetch the duplicates.
                      Cost with the RANK approach = 361536, and it took 41 seconds to fetch the duplicates.
                      I always paid more attention to the cost. Still, it seems very high... I've probably never seen it that high before.

                      But isn't it a problem that when I use the RANK approach and then filter records with rank > 1 to fetch duplicates (in order to delete them), Oracle will need a lot of memory to sort them if there are, say, 20,000 duplicates? Should I be worried?

                      Thanks a ton!
                      RN
                      • 8. Re: performance
                        JustinCave
                        Oracle isn't bound to sort the data in memory. It can happily use your TEMP tablespace for sort operations that don't fit in RAM. On the other hand, your session probably has plenty of PGA space to hold 20,000 values.
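
                        If you ever want to see whether a sort has actually spilled to disk, a rough check while the statement is running (V$SQL_WORKAREA_ACTIVE is a standard dynamic performance view):

                        -- TEMPSEG_SIZE is non-NULL for work areas that have spilled to TEMP
                        SELECT operation_type, actual_mem_used, tempseg_size
                        FROM   v$sql_workarea_active;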

                        Justin
                        • 9. Re: performance
                          RN
                          Got it! There's a lot more to learn on performance and optimization.
                          My question is answered.

                          Thanks,
                          • 10. Re: performance
                            jihuyao
                            Not sure what this part is doing:

                            where $primary_key in
                               (select $primary_key from $table_name  -- where are the dimensions in the list for group by?
                                group by $all_columns
                                having count(*) > 1
                               )

                            Anyway, you can try selecting all the unique rows from the source table, inserting them into a 'staging' table (with one partition), swapping the two, and then dropping the 'staging' table. Do it during a maintenance window.
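
                            A minimal sketch of that idea (all the names are placeholders; it assumes the source table is partitioned with a single partition, so that EXCHANGE PARTITION applies):

                            -- build a non-partitioned copy holding only the unique rows
                            CREATE TABLE stage_copy AS
                            SELECT DISTINCT * FROM stage_tab;

                            -- swap the deduplicated segment into the source table's single partition
                            ALTER TABLE stage_tab EXCHANGE PARTITION p1 WITH TABLE stage_copy;

                            -- the old data (with its duplicates) now lives in stage_copy; discard it
                            DROP TABLE stage_copy;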