5 Replies Latest reply on Oct 11, 2012 6:27 PM by Sven W.

# Statistics question

I can easily calculate the standard deviation for some rows with the STDDEV function. But that returns me a value expressed in the same units as the things being measured. (In other words, if I have times of 1,12,6,11 and 8 seconds, I will be told that the average time is 7.6 seconds, and the standard deviation is 4.39 seconds).

What I'd like to be able to do is to say that the record with value 1 is (say) 3 std deviations away from the average. Or that the 6 second row is within 1 standard deviation.

Put it the other way around: I am trying to find records which are more than 3 standard deviations of the mean, because they are the outliers I am interested in.

I cannot see how to convert a STDDEV result into being a "number of standard deviations".

(If I have not expressed myself clearly, I'm using this sort of stuff http://en.wikipedia.org/wiki/68-95-99.7_rule and I'm trying to find >3-sigma records)

Edit: I may be being stupid on this, but is it correct simply to multiply the STDDEV result by 1, 2, 3 and so on, and then compare to the time? That is, is it OK to do the following:

Mean = 7.6 seconds (from my sample data mentioned in the above post)
StdDev = 4.29 seconds

So 2sigma would be 4.29*2 = 8.78 seconds, plus or minus the mean?

So any record with a time value between -1.18 (7.6-8.78) and 16.38 (7.6+8.78) seconds would be within 2-sigma, assuming normal distribution?

I am still wondering if there's a nice function or something that returns the sigma value for any given set of records?
• ###### 1. Re: Statistics question
>
can easily calculate the standard deviation for some rows with the STDDEV function. But that returns me a value expressed in the same units as the things being measured. (In other words, if I have times of 1,12,6,11 and 8 seconds, I will be told that the average time is 7.6 seconds, and the standard deviation is 4.39 seconds).

What I'd like to be able to do is to say that the record with value 1 is (say) 3 std deviations away from the average. Or that the 6 second row is within 1 standard deviation.

Put it the other way around: I am trying to find records which are more than 3 standard deviations of the mean, because they are the outliers I am interested in.

I cannot see how to convert a STDDEV result into being a "number of standard deviations".
>
You don't need to convert like that.

If the mean is 7.6 and the standard deviation is 4.39 then just compute the values that are +- 1 by adding/subtracting the values.

7.6 - 4.39 = 3.21
7.6 + 4.39 = 11.99

So 6 is within 1 standard deviation.
• ###### 2. Re: Statistics question
Cheers. Not sure if you saw my edit to my original question or not. I think I came to the same conclusion as you, and you've validated that, which is very helpful.

I'm a bit surprised there's no built-in function to say "this value is X-sigma away from the mean" -though I haven't seen one in any other database or even things like Excel, so maybe I just haven't thought it through enough. I think a query where you say "where sigma>2" would be a lot easier to write than "where time not between avg(time)-stddev(time)*2 and avg(time)+stddev(time)*2", no?!
• ###### 3. Re: Statistics question
>
I'm a bit surprised there's no built-in function to say "this value is X-sigma away from the mean"
>
No one can possibly foresee all of the different calculations one might want to do with values.
>
I am still wondering if there's a nice function or something that returns the sigma value for any given set of records?
>
Not sure what you mean. Sigma is just standard deviation so you would use that function.

See STDDEV in the SQL Language doc
http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions178.htm#sthref1566
>
STDDEV returns the sample standard deviation of expr, a set of numbers. You can use it as both an aggregate and analytic function. It differs from STDDEV_SAMP in that STDDEV returns zero when it has only 1 row of input data, whereas STDDEV_SAMP returns null.

Oracle Database calculates the standard deviation as the square root of the variance defined for the VARIANCE aggregate function.

This function takes as an argument any numeric data type or any nonnumeric data type that can be implicitly converted to a numeric data type. The function returns the same data type as the numeric data type of the argument.
• ###### 4. Re: Statistics question
"Sigma is just standard deviation so you would use that function."

No, sigma is not 'just standard deviation'. It's certainly related to standard deviation, but it's not the same thing at all.

If you know the standard deviation on a "elapsed time" column for a set of rows is 5, and the mean is 16, then a row within that sample which has an elapsed time value of 9 has a sigma of 1.4 ((16-9)/5). In general, (difference-from-mean as a proportion of stddev).

The standard deviation tells you one thing about an entire population; a sigma value is valid for each sample within that population (i.e., how close or far away from the mean is this particular sample?).

What's more, the standard deviation is measured in the same units as the thing it's calculated from -so "5" in the earlier example means "the population varies from the mean by plus or minus 5 seconds". Whereas the sigma is measured in 'number of standard deviations from the mean', so '1.4' does not mean '1.4 seconds' or anything like that.

So sigma is completely different from standard deviation... and if Oracle could return the sigma for a row as well as an aggregate standard deviation for an entire population, that would be quite useful. Fair enough that it doesn't, but it's not a particularly off-beam requirement, I think -my pseudo-query earlier from my previous post shows how it might be useful.

Fair enough that I'm no statistician, so my understanding of this might be wrong in many ways, but I believe I know what STDDEV does ...and it doesn't tell me what I need to know in this case. I can certainly derive what I need to know from STDDEV, but in a world of analytics, I was hoping I wouldn't have to.

Thank you for your help.
• ###### 5. Re: Statistics question
There are a lot more statistical functions that oracle provides. I do only understand half of them. i guess you need to be a proper maths guru to fully get the grasp of it.

For example there is also CUME_DIST and PERCENTILE_DISC which might be helpful for your case
But it is not so easy to get a good overview about all those functions.

here are a few links:

http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions043.htm#SQLRF00627
http://docs.oracle.com/cd/E11882_01/server.112/e25554/analysis.htm#i1006229

and http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions004.htm#i81407 (scroll down to see the list of analytical functions available).

And if thats not enough there is the dbms_stat_func package: http://docs.oracle.com/cd/E11882_01/appdev.112/e25788/d_stat_f.htm#ARPLS006