      • 15. Re: Design suggestion for a very wide data set (4000 columns)
        Billy~Verreynne

        If the users only ever want to query a subset of columns, why export 4000 columns per line? Or is it not closer to the truth that "we do not know what columns will need to be queried, but it should be a subset, however it is safer to grab all 4000"? Which is an asinine argument.

         

        And that begs the question of what the structures and relationships within a single 4000 column line are. And if such structures exist, why are they not supplied as individual pieces of data via CSV/XML/etc. files? I fail to grasp the logic of how spooling a 4000 column CSV file is easy, never mind SENSIBLE!!

         

        If you had 40 different CSV files, each one containing an entity's data per line, then yes! That is EXACTLY how data needs to be loaded.

         

        The basic reason this has me foaming at the mouth is converting structured data on the source system into unstructured gunk as 4000 columns per CSV line, and then expecting the target system to magically make that work as structured data.

        • 16. Re: Design suggestion for a very wide data set (4000 columns)
          GS613

          "we do not know what columns will need to be queried, but it should be a subset, however it is safer to grab all 4000"


          The above is correct. And (now you had better sit down for the next statement) the columns that are needed on a monthly basis are in the range of 20-30, AND this number is fairly static.  It is only for the ad-hoc modelling that they may need other columns, and even then it would still be a minority of the 4000 columns - though potentially any of them.


          Before you punch the screen or blow an O-ring there is good and bad news.


          Good: the longer term intention is to aggregate the data on our end. We will receive much narrower, deeper files of the raw data; these will be aggregated (within SAS) and the resulting model persisted in Oracle. The potential issue is that the aggregation will need to be stored - very wide...


          Bad: until the above happens, and even for a period once it is happening, we need to import said aggregated data.


          Good news of the Bad news: Multiple aggregated files are possible. A composite key can be defined and will exist in each file. The number of columns per file will be limited to at most 255 - probably less in reality. The source system is able to provide files in this format. The users will know which tables to use for their analysis.


          You will ask, why don't they just give you the raw data now - a valid question. The aggregation process is a specialized, proprietary service that is paid for. We do have the skills to do this, and we will aggregate in the future, but not for a while.


          So, in an attempt to overcome some of the medium term pain of importing this aggregated data by structuring it, I can push for multiple files. It will likely make the import more efficient. Because the users define the split, they will know which tables they need to use for their modelling and therefore will not require any combined view of the data.

          • 17. Re: Design suggestion for a very wide data set (4000 columns)
            Billy~Verreynne

            Okay. Your and your database's funeral.

             

            Here is a method that supports a 4000 column wide view as a single table. It is likely to be horribly slow - but you will have 4000 "columns" in a single table.

             

            Simple method:

            Create an array type in SQL of varchar2(4000) (e.g. create or replace type TStrings as table of varchar2(4000); )

             

            Create a table with a single column of this type. E.g.

            create table foobar(
              id integer primary key,
              col TStrings
            ) ... nested table clause...

             

            The column type is an array and can hold 1,000, 10,000, or more values. Access from SQL is simply via the index/sequential number of that array, for a specific row.
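
            For example, spelled out a bit further (a rough sketch only - the storage clause name and the sample values are made up, and the nested table storage clause is written one possible way):

            create table foobar(
              id integer primary key,
              col TStrings
            ) nested table col store as foobar_col;

            -- one row = one very wide "record"; only three cells shown here
            insert into foobar (id, col)
            values (1, TStrings('ACME Ltd', '42', '2015-06-23'));
            commit;

            -- positional access from PL/SQL: read cell 2 of the row with id = 1
            -- (note that element order of a stored nested table is not formally guaranteed)
            declare
              v TStrings;
            begin
              select col into v from foobar where id = 1;
              dbms_output.put_line('cell 2 = ' || v(2));
            end;
            /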

             

            Views can be used to create specific SQL projections of the array cells needed.
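
            For example, a sketch of such a projection (the cell helper function, the cell positions, and the business column names are all made up for illustration):

            -- helper so views can project an individual cell of the collection
            create or replace function cell(p_row in TStrings, p_i in integer)
              return varchar2 deterministic
            is
            begin
              return p_row(p_i);
            end;
            /

            -- a view exposing only the handful of cells a given analysis needs,
            -- with explicit conversions for numbers and dates
            create or replace view foobar_subset as
            select id,
                   cell(col, 1)                         as customer_code,
                   to_number(cell(col, 17))             as balance,
                   to_date(cell(col, 42), 'YYYY-MM-DD') as open_date
            from   foobar;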

             

            Problems. Indexing. I/O to read a row. All data is assumed to be strings and implicit or explicit type conversions will be needed for dates and numbers.

             

            The more advanced method: create an object type that can intelligently handle a scalar of multiple types (string, number, date, etc.). Create a collection/table type of that. Make the table's column type the collection/table type.
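
            The bare bones of such types could look like this (a sketch only - the names are arbitrary, and the member methods that make the scalar "intelligent" are omitted):

            create or replace type TScalar as object (
              str_val  varchar2(4000),
              num_val  number,
              date_val date
            );
            /

            create or replace type TScalarList as table of TScalar;
            /

            create table foobar_typed(
              id integer primary key,
              col TScalarList
            ) nested table col store as foobar_typed_col;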

             

            The second method can be quite powerful in dealing with heterogeneous data values in a generic and flexible way. However, performance on large data sets is always an issue that needs to be addressed somehow - if at all possible. Never mind the additional complexity created by such an approach.

             

            In your shoes - I would stir the pot vigorously, and wave the old lead pipe threateningly, and refuse to accept a 4000 column wide entity via CSV file.

            • 19. Re: Design suggestion for a very wide data set (4000 columns)
              GS613

              Haha, the lead pipe is an option - I like it.

               

              I will push back against the single 4000 column file idea (for various reasons that have come up in this thread, thanks!) - but I don't see the issue with it, given that it would be imported into smaller tables anyway. If there is something about the import process of a wide file that is a problem then yes, but what?

              • 20. Re: Design suggestion for a very wide data set (4000 columns)
                John Stegeman

                What are you going to load that 4,000 column file into? You cannot load it into a single table.

                • 21. Re: Design suggestion for a very wide data set (4000 columns)
                  jgarry

                  I think the mention of importing into Oracle has led people astray from the requirement that this is for SAS.  Poking about online, it looks to me like SAS people tend to regard Oracle as if it were still several versions ago.  I think you should delve into the Oracle OLAP and DW docs.

                  • 22. Re: Design suggestion for a very wide data set (4000 columns)
                    GS613

                    John, the load will be into 20 or so tables with 255 or fewer columns, with an average row length of around 3000 bytes.

                    I don't see the actual import being an issue - split the file (based on business agreed grouping) and load into the various tables (with composite key).

                    As mentioned, I can get the files split at source into 20 files (based on that same data model) and then load. The effort is on their side or my side to do the split - nothing major.
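
                    For example, one way the import could be shaped (a sketch only - the directory object, file names and column names here are made up), with each split file mapped as an external table and then inserted into its target table:

                    -- external table over one of the ~20 split CSV files
                    create table file01_ext (
                      acct_no   varchar2(30),   -- composite key, part 1
                      period_id varchar2(8),    -- composite key, part 2
                      col_001   varchar2(100),
                      col_002   varchar2(100)
                      -- ... remaining columns for this file, at most 255 in total
                    )
                    organization external (
                      type oracle_loader
                      default directory data_dir
                      access parameters (
                        records delimited by newline
                        fields terminated by ',' optionally enclosed by '"'
                        missing field values are null
                      )
                      location ('file01.csv')
                    )
                    reject limit unlimited;

                    -- direct-path insert into the corresponding target table
                    insert /*+ append */ into file01_data
                    select acct_no, period_id, col_001, col_002 -- ...
                    from   file01_ext;
                    commit;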

                     

                    The decisions required are more around the 20 tables. How will the users be using the tables? It seems, based on the number of columns they usually need (very low), that they would query 1 or 2 of the tables at a time. There will inevitably be a lot of unused data, but what to load or not to load cannot be determined up front.

                     

                    jgarry, agreed - storing 4000 columns in an object at any point was never the intention. SAS in this case just uses Oracle as a repository / data store. The data is imported into SAS for processing, where all the magic and complexity that some clever guys have designed is applied to interpret the data and make decisions and predictions.  The interaction with Oracle will most likely be a bunch of full table scans to get the data into SAS, then a dump of the results back into Oracle.

                    • 23. Re: Design suggestion for a very wide data set (4000 columns)

                      John, the load will be into 20 or so tables with 255 or fewer columns, with an average row length of around 3000 bytes.

                      I don't see the actual import being an issue - split the file (based on business agreed grouping) and load into the various tables (with composite key).

                       

                      The issue is NOT the actual import.

                       

                      As I, and everyone else, have been saying, the issue is that you need the proper data model on the target system, and the wide file itself doesn't carry any data model info in it.

                       

                      The source system presumably has that data model info, with the data stored properly in relational tables. Then someone aggregates that data into a wide file and HIDES the data model from you.

                       

                      Now you have to figure out, from scratch, what the data model should be.

                       

                      What EVERYONE is saying is just have the data passed to you using the original data model. Then put the data from table1 into your table1, the data from table2 into your table2.

                       

                      If the source system doesn't have an aggregated data model then you will just need to create one from scratch using the method I outlined in an earlier reply.

                       

                      The point is that your WIDE file is basically an opaque file with an unknown data model. It is totally USELESS for any real purpose except displaying it AS IS.

                       

                      Individual files with proper data in them are useful by themselves since they are just simple delimited files. Each file has a SIMPLE data model and can be easily understood, loaded into a simple table, checked for errors, aggregated and manipulated.

                       

                      By producing the WIDE file the source has essentially DESTROYED all of that valuable metadata about the data.

                      • 24. Re: Design suggestion for a very wide data set (4000 columns)
                        GS613

                        As mentioned, receiving the data in its "unaggregated" form is not feasible at this point. The aggregation process, as I said, is a proprietary service provided by the external company. In the future we will embark on a project to design an in-house aggregation solution - but until then we need to receive the data aggregated.

                         

                        It is to be used for one purpose - building models in SAS, as and how required, in one or many ways.

                         

                        The wide file can be broken up into multiple narrower files, each containing the key that links the rows across files. However, it is still just aggregated data that has been split up. The groupings that are decided on could, in a way, define their own data model that has meaning to the business, who will decide which tables to query when loading into SAS.

                        • 25. Re: Design suggestion for a very wide data set (4000 columns)

                          Yes - I understood that the first time.

                           

                          I described an architecture you can use in my first reply.

                           

                          Just use arbitrary, skinny tables to get the raw data loaded.

                           

                          Then design your own un-aggregated data model and move the data to it.
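
                          For instance (table and column names here are purely illustrative), the skinny staging tables would simply mirror the incoming files, keyed on the composite key, and the remodelling is then an insert-select once the logical design exists:

                          -- raw landing tables: everything arrives as text, keyed on the composite key
                          create table stg_file01 (
                            acct_no   varchar2(30),
                            period_id varchar2(8),
                            col_001   varchar2(4000),
                            col_002   varchar2(4000)
                          );

                          create table stg_file02 (
                            acct_no   varchar2(30),
                            period_id varchar2(8),
                            col_003   varchar2(4000),
                            col_004   varchar2(4000)
                          );

                          -- once the logical model exists, move and properly type the data
                          insert into account_month (acct_no, period_id, region, balance, open_date)
                          select a.acct_no,
                                 a.period_id,
                                 a.col_001,
                                 to_number(b.col_003),
                                 to_date(b.col_004, 'YYYY-MM-DD')
                          from   stg_file01 a
                          join   stg_file02 b
                          on     b.acct_no   = a.acct_no
                          and    b.period_id = a.period_id;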

                           

                          But there is little point in worrying about chained rows and rows per block until you demonstrate that you really have an issue.

                           

                          The first step is to design a logical data model. Quit worrying about the physical data model until you have a logical one that meets your requirements.

                           


                          • 26. Re: Design suggestion for a very wide data set (4000 columns)
                            GS613

                            Agreed, I'll do just that. Thanks for your input.

                            • 27. Re: Design suggestion for a very wide data set (4000 columns)

                              Then please mark the thread ANSWERED.

