11 Replies Latest reply: Mar 25, 2013 1:14 PM by 781578 RSS

    Pearson Correlation of multiple values

    781578
      Hey,

      i have a problem. I want to calculate the pearson correlation of two sets which have multiple values. The function CORR(x,y) accepts only two numbers. How can i handle that?
      Is there an efficient way to do that because i have a lot of sets where i want to calculate the correlation.

      I hope you can help me.

      bladepit
        • 1. Re: Pearson Correlation of multiple values
          Frank Kulash
          Hi,
          bladepit wrote:
          Hey,

          i have a problem. I want to calculate the pearson correlation of two sets which have multiple values. The function CORR(x,y) accepts only two numbers.
          CORR accepts two arguments , such as two NUMBER columns. The result that it returns reflects the correlation over a group (or partition ) which typically includes a large number of rows.
          How can i handle that?
          Is there an efficient way to do that because i have a lot of sets where i want to calculate the correlation.

          I hope you can help me.
          Post a little sample data (CREATE TABLE and INSERT statements) and the results you want from that data.
          See the forum FAQ {message:id=9360003}
          • 2. Re: Pearson Correlation of multiple values
            781578
            I can't post the table definition or something else but i hope i can explain my table. I thought this select will make it little bit clearer:

            select value, ref, bez from table where (bez = 'A' OR bez = 'B') and ref in (200,201,202....) order by ref;

            The value is the important information here. The output could be something like that:

            value ref bez
            0,0251;194;A
            0,7063;194;B
            1,2265;195;B
            0,7063;198;B

            Over the amount of the two sets i want to calculate the pearson.

            I think if i put them into an tmptable in my database and execute this sql statement:

            select CORR(x,y) from tmptable;

            i will get my information but i need to do that for more then 2 million different bez and i think this will not be very efficient.
            • 3. Re: Pearson Correlation of multiple values
              Frank Kulash
              Hi,
              bladepit wrote:
              I can't post the table definition
              Don't be silly. Of course you can post CREATE TABLE and INSERT statmenents for some sample data. They don't need to have the same names as your real tables and columns, any more than they need to have the same size or the same data as your real tables.
              If you can show what you need to do using commonly available tables, such as those in the scott or hr schemas, then you don't need to post any tables of your own. But one way or another, you have to provide sample data to the people who want to help you.
              or something else but i hope i can explain my table. I thought this select will make it little bit clearer:

              select value, ref, bez from table where (bez = 'A' OR bez = 'B') and ref in (200,201,202....) order by ref;

              The value is the important information here. The output could be something like that:

              value ref bez
              0,0251;194;A
              0,7063;194;B
              1,2265;195;B
              0,7063;198;B

              Over the amount of the two sets i want to calculate the pearson.
              What are the results you would want from this small sample data set?

              I'm sorry I don't know much about your application, or about statistics. This is a SQL and PL/SQL forum. If you happen to find someone here that knows all about your application and about statistics, that's great, but in general, you'll have to explain what you want. With your knowledge of your application and of statistics, and our knowlege of SQL, I'm confident we (the poeple on this forum) can solve this problem, if we work together.
              I think if i put them into an tmptable in my database and execute this sql statement:

              select CORR(x,y) from tmptable;

              i will get my information but i need to do that for more then 2 million different bez
              What is a bez? Why is the number of different bez singnificant in this problem?
              and i think this will not be very efficient.
              Where is the data now?
              If it's not in an Oracle database, or someplace that can be read from Oracle, why are you even considering using an Oracle function?
              Given that you do want to use an Oracle function, then you have to get the data into Oracle, even if that takes time.
              • 4. Re: Pearson Correlation of multiple values
                chris227
                With T1 as (
                Select
                 Value,ref,bez
                From table
                Where
                Bez ='A'
                ),
                 T2 as (
                Select
                Value, ref,bez
                From table
                Where
                 bez ='B'
                )
                
                Select
                Corr(T1.value,t2.value)
                Over(Order by T1.ref) cor
                From
                T1,T2
                Wehre
                T1.ref=T2.ref
                Sorry, it is somehow RAW, behause i Worte this in an iPad, what i am Not Use to do.

                Edited by: chris227 on 23.03.2013 11:47
                • 5. Re: Pearson Correlation of multiple values
                  781578
                  Thank you that worked fine.

                  I have another question for a problem that i need to solve to use your solution.

                  I have a table:

                  CREATE TABLE TABLE
                  (
                  BEZ VARCHAR2(13 BYTE) NOT NULL
                  , REF NUMBER(10, 0) NOT NULL
                  , REF2 NUMBER(1, 0) NOT NULL
                  , VALUE NUMBER(10, 4)
                  , CONSTRAINT PK_K PRIMARY KEY
                  (
                  BEZ
                  , REF
                  , REF2
                  )
                  ENABLE
                  )

                  In this table are four million entries. Over all bez i want to generate the distinct pairs of them. These are the input pairs for your solution above.

                  I have no idea how i can do this with plsql. My first idea was so save all distinct BEZ from an sql in plsql and then generate an sql:

                  select distinct (a.BEZ), b.BEZ from DISTINCT_SUBSTID_SQL a, DISTINCT_SUBSTID_SQL b

                  This is only an idea and i thought it will slow down my complete procedure.
                  • 6. Re: Pearson Correlation of multiple values
                    chris227
                    Sorry, i dont really understand which column should be distinct.
                    But did you try just
                    With T1 as (
                    Select distinct
                     Value,ref,bez
                    From table
                    Where
                    Bez ='A'
                    ),
                    
                    ...
                    The distinct will slow it down in every case

                    Edited by: chris227 on 24.03.2013 12:37
                    • 7. Re: Pearson Correlation of multiple values
                      781578
                      I think that doesn't match my needs.

                      I want to get the distinct pairs of BEZ for an specific list of REFs. REF2 needs to be always 1 so that is trivial.

                      An example of data in table and what i want to get from the select:

                      BEZ;REF;REF2;VALUE

                      A 11 1 2
                      B 22 1 3
                      C 11 1 2
                      D 33 0 4


                      Now i want to get all combinations of BEZ pairs for an specific set of REFs for example for REF 11 and 22:

                      A,B
                      A,C
                      B,C

                      For four million BEZs this will explode and i know that so i tried to filter out the BEZ bevor i start this step (based on some criteria).
                      My question is: Can I handle efficiently with PLSQL?
                      • 8. Re: Pearson Correlation of multiple values
                        chris227
                        you mean something like
                        with data as (
                        select distinct
                          BEZ,REF,VALUE
                        from your_table
                        where
                         ref in (11,22)
                        )
                        
                        select
                         d1.bez , d2.bez
                        d1.value, d2.value
                        from
                        data d1, data d2
                        • 9. Re: Pearson Correlation of multiple values
                          781578
                          I think this could be the solution for my problem. But if i execute this sql i became the following error:

                          ORA-01652: Temp-Segment kann nicht um 128 in Tablespace TEMP erweitert werden
                          01652. 00000 - "unable to extend temp segment by %s in tablespace %s"
                          *Cause:    Failed to allocate an extent of the required number of blocks for
                          a temporary segment in the tablespace indicated.
                          *Action:   Use ALTER TABLESPACE ADD DATAFILE statement to add one or more
                          files to the tablespace indicated.

                          I guessed i have no tablespace with name TEMP. If i open my plugin insider view in sqldeveloper i didn't see this tablespace.

                          Can i execute this sql in another tablespace?
                          • 10. Re: Pearson Correlation of multiple values
                            chris227
                            With this table this is not really surprising, because of the cartesian product.
                            if the data is like in the example, one might try to reduce the number of rows (aside from extending the temp-space) with something like below to exclude permutations
                            select
                             d1.bez , d2.bez
                            d1.value, d2.value
                            from
                            data d1, data d2
                            where
                            d1.bez <= d2.bez
                            You may try to find the tablespace by starting with
                            select * from dba_tablespaces
                            There willbe one, read the docs on this or ask your administrator.
                            • 11. Re: Pearson Correlation of multiple values
                              781578
                              Can i handle for example four million bez with an collection with type like that:

                              TYPE intstring IS TABLE OF VARCHAR2(64) INDEX BY BINARY_INTEGER

                              In my example that works. But i am sure if i put four million strings in it it could be something like outofmemory?

                              Something like that became i if i do the following:
                                i:=0;
                                over402 := over40;
                                str := over40.first; 
                                while str is not null loop
                                  str2 := over402.first;
                                  while str2 is not null loop
                                    if str != str2 AND substidtosubstid.EXISTS(str || ';' || str2) = FALSE AND substidtosubstid.EXISTS(str2 || ';' || str) = FALSE then
                                      substidtosubstid(str || ';' || str2) := 'TEST';
                                      --dbms_output.put_line(str || ' ' || str2);
                                    end if;
                                    str2 := over402.next(str2);
                                  end loop;
                                  str := over40.next(str);
                                end loop;
                              over40 is an collection with 8767 strings. It throws:
                              04030. 00000 -  "out of process memory when trying to allocate %s bytes (%s,%s)"
                              *Cause:    Operating system process private memory has been exhausted
                              *Action:
                              Seems that the operating system under the database reachs end of memory?

                              Edited by: bladepit on 25.03.2013 11:13