This discussion is archived
10 Replies Latest reply: Jan 22, 2013 3:31 AM by TomasAlbinsson RSS

Possible new percentile function?

TomasAlbinsson Newbie
Currently Being Moderated
Hi,

I'm running "Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production".

I need to calculate percentiles, so I believe I have two choices: PERCENTILE_CONT() or PERCENTILE_DISC().
I also need to get the same results as the SPSS and SAS tools.
Comparing the results, neither of the Oracle versions give the same result as SPSS/SAS. I think this is due to how Oracle determines what row numbers to use.

Oracle takes row RN = (1+ (P*(N-1)) where P is percentile and N is number of ordered rows.
For PERCENTILE_CONT(0.1) from a 16 row set that would give row 2.5 as the starting point.
http://docs.oracle.com/cd/B28359_01/server.111/b28286/functions115.htm

SPSS and SAS uses the formula RN = P*(N+1) which gives row 1.7 for a 16 row set.

Now, would it be possible for me to write a PERCENTILE_SPSS() function?

The samples of analytical/pipelined functions I have seen (and written) are happy to use the rows in the set one by one, to calculate some special min/max/avg/etc.
For this function I would need either
1 - to know the number of rows in the set, from the first call of my function, and calculate the percentile as my function "passes" the desired rows
or
2 - save all values in the set and calculate the percentile at the end

The first sounds impossible, the second sounds expensive for a large table.
Any suggestions?


Kind regards

Tomas
  • 1. Re: Possible new percentile function?
    Frank Kulash Guru
    Currently Being Moderated
    Hi, Tomas,

    Sorry, I'm not sure what you're asking.

    Whenever you have a problem, it helps if you post a concrete example. Include CREATE TABLE and INSERT statements for a little sample data, or use some commonly available table for input values. In this case, perhaps you can use scott.emp as your input data. That table has 14 rows, each with a sal value (12 distinct values).
    Whether you use your own table or some other one, show the results you're trying to get.
    Tomas Albinsson wrote:
    ... Now, would it be possible for me to write a PERCENTILE_SPSS() function?
    Sure, but do you really need a function in this case? That is, if you're just using the function to find the .0, .1, .2, ..., 1.0 percentiles, would you be satisfied with a query that produced those 11 rows?
    The samples of analytical/pipelined functions I have seen (and written) are happy to use the rows in the set one by one, to calculate some special min/max/avg/etc.
    For this function I would need either
    1 - to know the number of rows in the set, from the first call of my function, and calculate the percentile as my function "passes" the desired rows
    or
    2 - save all values in the set and calculate the percentile at the end

    The first sounds impossible,
    It doesn't sound impossible to me, but I'm not sure I understand the problem. I could really use a concrete example.
    the second sounds expensive for a large table.
    Depending on how you're planning to use the results, there might be ways to make it more efficient.
  • 2. Re: Possible new percentile function?
    rp0428 Guru
    Currently Being Moderated
    >
    I also need to get the same results as the SPSS and SAS tools.
    >
    You mean you want to implement the same FIVE methods that SPSS supports?
    >
    SPSS and SAS uses the formula RN = P*(N+1) which gives row 1.7 for a 16 row set.
    >
    One of the SPSS methods might use that formula but SPSS supports five different methods for calculation percentile.

    Have you checked each of those to see if any of them

    And there are many, many more methods of calculating percentile.

    See this doc from IBM (there are many, many others with similar sentiments)which pretty much sums up the state of things:
    How does SPSS/PASW Statistics calculate percentiles in FREQUENCIES?
    http://www-01.ibm.com/support/docview.wss?uid=swg21480663
    >
    Many people are disturbed by the existence of different values for the same percentile on the same data, but this is unavoidable. Consider the definition of a percentile from the NIST web site: "The pth percentile is a value, Y(p), such that at most (100p)% of the measurements are less than this value and at most 100(1- p)% are greater."
    Even if you are working with a population rather than a sample, unless N*p (or W*p) is exactly equal to the cumulative case count up to a certain case, nothing actually satisfies the definition for the given data, and you're left trying to estimate the best approximation.
    >
    Notice the part that says:
    >
    unless N*p (or W*p) is exactly equal to the cumulative case count up to a certain case, nothing actually satisfies the definition for the given data,
    >
    All of which begs the question: why do the Oracle calcs have to match what the other two system produce?

    Select a 'System of Record' for such calculations and then use it.
  • 3. Re: Possible new percentile function?
    TomasAlbinsson Newbie
    Currently Being Moderated
    Hello Frank,

    I just thought my question was more fundamental than practical :)
    The end goal is to replace an existing report. As I must replace both future and historical (already printed) values, there are correct answers I must produce.
    As I need to present percentile values I believe I have two built-in choices, PERCENTILE_CONT() or PERCENTILE_DISC().
    SELECT PERCENTILE_CONT(0.1) WITHIN GROUP (ORDER BY nr) "Perc 10 cont",
           PERCENTILE_DISC(0.1) WITHIN GROUP (ORDER BY nr) "Perc 10 disc"
    FROM ( 
     select 2 nr from dual union select 5 from dual union select 6 from dual union select 8 from dual union
     select 13 from dual union select 14 from dual union select 15 from dual union select 18 from dual union
     select 20 from dual union select 23 from dual union select 24 from dual union select 28 from dual union
     select 30 from dual union select 31 from dual union select 32 from dual union select 35 from dual
    )
    
    Perc 10 cont Perc 10 disc
    ------------- -------------
              5,5             5
    For the above 16 numbers, SPSS gives the answer 4.1 (using it's default version of percentile, as in the original report).
    There are more than five ways to calculate a percentile and I don't care if anyone is more correct than the other. I think Oracle just made a different decision from the SPSS developers.

    If there is some Oracle analytical function that gives 4.1 for the above sample, please inform me and I'll be happy to use it.
    If there is not, my question remains: can I write my own?

    It's not a formula problem, I know what to do and how. My only concern (right now) with writing it myself is that the formula needs to know the total number of records to decide the starting point for the calculation.
    If my function "sees" the records one by one, it would know there were 16 at the very end, and would then need to get record 1.7 (records 1 and 2) to calculate.

    As the function would be used in the system, it's very unlikely that it will ever be used for more than 200 records. So I could use an array and store the values until the and.
    I'd just rather do it in a way that would work fine with a million records :)


    Kind regards

    Tomas
  • 4. Re: Possible new percentile function?
    TomasAlbinsson Newbie
    Currently Being Moderated
    I usually start off with "Hi" or "Hello" but Mr rp0428 seem to find that superfluous.

    >
    You mean you want to implement the same FIVE methods that SPSS supports?
    >
    No. I want to implement the one method I know is used in the old report I'm to replace.
    That is the default method, "haverage".
    The formula itself is not the problem. I'm asking if it's possible to write a custom-made analytical function when the formula needs to know the record count.

    Thank you for the lecture on the many ways and problems with calculating a percentile.

    >
    All of which begs the question: why do the Oracle calcs have to match what the other two system produce?
    >
    Oracle does not have to match, I've never said that.
    My report, running in Oracle Reports, do need to match the results of an old report.

    >
    Select a 'System of Record' for such calculations and then use it.
    >
    The report I'm replacing is actually printed (as "on paper"). I have considered typing in the old values and selecting them.
    That would mean I'd have to use SPSS to produce future values to store.
    That's just too much manual labour for my taste.
  • 5. Re: Possible new percentile function?
    KeithJamieson Expert
    Currently Being Moderated
    It is possible to create user-defined aggregate functions in Oracle, i.e. functions that operate on more than one row.
    So theoretically, you shoudl be able to create your own percentile function.

    Here is one link:

    http://docs.oracle.com/cd/B28359_01/appdev.111/b28425/aggr_functions.htm
  • 6. Re: Possible new percentile function?
    padders Pro
    Currently Being Moderated
    the second sounds expensive for a large table
    Yes, you're probably right. Writing your own aggregate function implies multiple context switches between the SQL and PL/SQL engines at runtime and possibly abuse of PGA memory by keeping excessive values in arrays and hence is unlikely to perform as well a pure SQL solution, however desirable encapsulating the function might initially appear to be.

    I think using COUNT (*) OVER () to get the row count and following the specification for PERCENTILE_CONT at http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions110.htm#i1000909 will give you what you need need to calculate this, e.g.
    SQL> SELECT DECODE (
      2            MAX (CEIL (prn)),
      3            MAX (FLOOR (prn)), MAX (nr),
      4           (MAX (CEIL (prn)) - MAX (prn)) *
      5              MAX (DECODE (rn, FLOOR (prn), nr)) +
      6           (MAX (prn) - MAX (FLOOR (prn))) *
      7              MAX (DECODE (rn, CEIL (prn), nr))) percentile
      8  FROM   (SELECT ROWNUM rn, COLUMN_VALUE nr,
      9                 0.1 * (COUNT (*) OVER () + 1) prn
     10          FROM   TABLE (sys.odcinumberlist (
     11                    2, 5, 6, 8, 13, 14, 15, 18, 20,
     12                    23, 24, 28, 30, 31, 32, 35)))
     13  WHERE  rn IN (CEIL (prn), FLOOR (prn));
    
    PERCENTILE
    ----------
           4.1
    
    SQL>
    Edit: in hindsight you might prefer the ROW_NUMBER analytic function to ROWNUM as it would allow you to explicity order by nr with unsorted data.
  • 7. Re: Possible new percentile function?
    ascheffer Expert
    Currently Being Moderated
    Something simular
    select min(nr) KEEP (DENSE_RANK LAST ORDER BY rn desc)
        + ( min(nr) KEEP (DENSE_RANK LAST ORDER BY rn)
          - min(nr) KEEP (DENSE_RANK LAST ORDER BY rn desc)
          ) * mod( max(x), 1 )
    from(
    SELECT nr
         , row_number() over ( order by nr ) rn
         , ( count(*) over () + 1 ) * 0.1 x
    FROM ( 
     select 35 nr from dual union select 5 from dual union select 6 from dual union select 8 from dual union
     select 13 from dual union select 14 from dual union select 15 from dual union select 18 from dual union
     select 20 from dual union select 23 from dual union select 24 from dual union select 28 from dual union
     select 30 from dual union select 31 from dual union select 32 from dual union select 2 from dual
    )
    )
    where rn in ( floor( x ), ceil( x ) )
  • 8. Re: Possible new percentile function?
    rp0428 Guru
    Currently Being Moderated
    You'll be more successful in the forums if you focus on the issues, questions and suggestions rather than make assumptions about why those questions or suggestions are being asked or made. Your unwarranted personal comments are unprofessional and indicate a lack of maturity.

    My questions were, and are, solely to elicit information about your actual issue so that suitable solutions or workarounds can be determined. There was no 'lecture' anywhere in any of the information that I provided; just statements and quotes from experts.

    The articles and quotes I provided were intended to show you that there isn't a universal agreement on a definition for those 'percentile' calculations. That would signal at least a 'caution' flag.
    >
    I'm asking if it's possible to write a custom-made analytical function when the formula needs to know the record count.
    >
    Yes - it is possible. In my opinion it isn't feasible. Any attempt to reproduce the exact proprietary algorithm that any vendor uses would require knowledge about the internals of that vendor's code that jsut aren't likely to be known or available: what precision is being used for intermediate results? What rounding methods are being used? What is the exact order of operations. Were any of those factors changed between versions. Did the vendor make any bug fixes?

    Any testing you do with any manageable set of test data won't ensure that there aren't data sets that will produce slightly different results.

    So writing such a function is one thing; trying to duplicate existing functionality is something else. The vendor's own implementation may have changed due to bug fixes so the results from that vendor may depend on the actual version being used.
    >
    Oracle does not have to match, I've never said that.
    My report, running in Oracle Reports, do need to match the results of an old report.
    >
    Well I read that as saying that Oracle does need to match. Your second sentence pretty much says exactly that.

    Again, my question was to try to understand why Oracle needed to produce a NEW report from OLD data that matches information in a PRINTED report from the past. That is a very unusual requirement. I've never run across such a requirement in more than 30 years and I have written thousands of reports using Crystal Reports (now BO), Oracle Reports, RPG and many other tools.

    It is often not possible to reproduce ANY report from the past due to the ever-changing nature of the data in the database. Data gets updated, deleted, inserted even when it shouldn't.
    >
    I have considered typing in the old values and selecting them.
    >
    That is the only method guaranteed to produce the same results. And that is about the only method that will pass the stringent auditing and Sarbanes-Oxley requirements that many organizations have to abide by. Those auditors won't allow you to just change the data to make things come out the way you want. You have to PROVE the provenance of the data: where did every piece of data come from, what changes were made to it (and when and by whome) and how were any accumulations performed.

    Using a custom function and merely showing that the results, on paper, match would not sway any auditor I've ever had to deal with. You may not have an issue of having to prove the data. That's why we are asking questions. To try to understand what options are viable for you.
    >
    That would mean I'd have to use SPSS to produce future values to store.
    >
    Yep! That's one of the main drivers for a lot of the ETL processes that exist. There are multiple systems of record. A PeopleSoft system might be the SOR for Inventory while an Oracle financials system might be the SOR for the financial and payroll data. ETL tries to merge and match the data. Trying to keep those ETL processes up to date with the changes going on in all of the source systems is a MAJOR headache.

    The standard solutiion for that use case is to select an SOR for reporting and then create ETL processes that bring the SOR data from each source system to the reporting SOR. For your use case that might mean using SPSS to produce data in report-ready tables and then use ETL to bring that data to the report server where it can be merged with data from other systems.
    >
    That's just too much manual labour for my taste.
    >
    Welcome to the real world!
  • 9. Re: Possible new percentile function?
    TomasAlbinsson Newbie
    Currently Being Moderated
    Thank you Anton and padders for your excellent suggestions!

    I think I even understand them :)


    Kind regards

    Tomas
  • 10. Re: Possible new percentile function?
    TomasAlbinsson Newbie
    Currently Being Moderated
    As I'm a consultant I don't have the time for this dispute.
    I tried to make it clear that I had no actual problem understanding why the standard percentile functions weren't giving me the results my client wanted. English is my second language.
    I found your reply unhelpful and unpolite. Could be cultural. So I knowingly responded likewise.

    Maybe there are more worlds than one.

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points