Forum Stats

  • 3,838,621 Users
  • 2,262,385 Discussions
  • 7,900,705 Comments

Discussions

Best practice query

Hi,

For a while I'm questioning myself about a best practice query for performance.

I've 2 tables.

EMPLOYEES and DEPARTMENTS

The EMPLOYEES table can contains millions of records.


Which query is better for performance ?


Query 1

Select t1.emp_id, t1.emp_name, t2.dept_name
from EMPLOYEES t1, DEPARTMENTS t2
where t1.dept_id = t2.dept_id
order by t1.emp_name


Query 2

Select emp_id, emp_name, (select mypackage.get_dept_name(dept_id) from dual) as dept_name
from EMPLOYEES
order by emp_name

Answers

  • Mike Kutz
    Mike Kutz Member Posts: 6,195 Silver Crown
    edited Nov 21, 2021 11:30AM

    Query 1b

    Use ANSI joins

    Select t1.emp_id, t1.emp_name, t2.dept_name
    from EMPLOYEES t1
        Left join DEPARTMENTS t2. 
          On t1.dept_id = t2.dept_id
    order by t1.emp_name
    

    In Query 2, the scalar sub query is ran for each returning row of EMPLOYEES.

    Nic Pilot
  • User_H3J7U
    User_H3J7U Member Posts: 1,062 Gold Trophy

    To be equivalent, emp.department_id must be fk not null. And depends on function.

    Nic Pilot
  • Jonathan Lewis
    Jonathan Lewis Member Posts: 10,009 Blue Diamond

    Calling a pl/sql function to return a string that could simply be a column (or concatenation of columns) from a table is introducing an overhead and a possible threat. It's not necessary.

    Assuming you have the standard referential integrity between employees and departments (i.e. unique id for departments, foreign key on employees to departments) then your inline scalar subquery could be:

    (select dep.dept_name from departments dep where dep.dept_id = emp.dept_id)
    

    It's worth noting that the return from the function would be varchar2(4000) which might affect the output of the final query, or if you're pulling the data across a network might (depending on the client tool) waste a lot of memory if (for example) the client code allocated memory based on the maximum possible return length).

    If employees allows for employees not yet assigned to a department (i.e. null dept_id) this would report NULL in the final query, whereas the join would lose the employee unless you changed the join to an outer join.

    You are presumably hoping to benefit from scalar subquery caching with the scalar subquery approach - or possibly result caching, or pl/sql function caching. Two things to remember if you depend on such things - how long does the cache persist, what happens if you get a cache collision. The more different departments you have (and millions of employees may mean thousands of departments) the more likely you are to get a hash collision on the cache and end up with randomly variable performance.

    In your example the simply join is likely to be a hash join using the departments table as the build table with optimal workarea usage - which means an in-memory hash lookup by each employee, and you can't really get much faster than that.

    With an example like this "millions of employees, no filtering" your performance problems are more likely to be related to transporting the volume of data to the client - so setting a large fetch arraysize is likely to be far more important than fiddling with such simple SQL. And while there is a way to write the query so that it can perform a little faster, there are two other aspects of coding to consider: how fragile is the code, and what are the chances that the next person who has to modify the code will not understand the tweaks you've used and do something that makes things worse.


    Here's a very old example of the performance instability that can appear with scalar subquery caching - in this case it's a filter subquery, but the principle is the same: https://jonathanlewis.wordpress.com/2006/11/06/filter-subqueries/



    Regards

    Jonathan Lewis

  • Nic Pilot
    Nic Pilot Member Posts: 27 Green Ribbon

    Thank you all for your replies & tips.

    So better go for Query 1 or Query 1b.

    Is there any difference in term of performance between Query 1 & Query 1b ?

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 10,009 Blue Diamond

    Bear in mind that if employees.dept_id is allowed to be null then the two queries will return different results. In that case the outer join equivalent using Oracle's traditional syntax would be:

    Select t1.emp_id, t1.emp_name, t2.dept_name
    from EMPLOYEES t1, DEPARTMENTS t2
    where t1.dept_id = t2.dept_id(+)
    order by t1.emp_name
    

    (which you could view as "we add an empty row to t2/departments to give employees without a department something to join to")

    Generally it is now seen strategically as a good thing to use the "ANSI" syntax rather than traditional Oracle syntax. There are a few oddities (whichevet you choose) that mean in some boundary cases you get different plans as you switch. In this case there would be no performance difference.

    Regards

    Jonathan Lewis

  • Nic Pilot
    Nic Pilot Member Posts: 27 Green Ribbon

    Thank you Jonathan.

    Regards