Oracle Analytics Cloud and Server Idea Lab


Please Bring Back the Data Set Profile Feature

Delivered
71
Views
6
Comments

Organization Name

Vlamis Software Solutions

Description

The data set profile feature was extremely useful. While I love "Explain", it has a different use case and structure. Exploratory Data Analysis typically starts with an objective overview of the full data set before diving into detailed insights. The original data profile feature offered exactly this. Each column was assigned the most appropriate default visualization (bar, line, etc.), allowing you to see the number of factors and the rough distribution. You could quickly scan a data set, see the shape of the distribution for a large number of attributes simultaneously, and then work through different measures. It was possible to "read" a data set in a matter of minutes and visually identify potentially interesting attributes and facts to explore further. I don't want to lose "Explain", but it never truly replaced the profile feature. The profile feature was one of the best aspects of Big Data Discovery that made it into the new Data Visualization interface.

Use Case and Business Need

The first thing to do when encountering a new data set is to profile it and get a sense of it. This is a classic technique and best practice from data science. That is, scan the entire data set very quickly in summary graphs first, and then, when reading through it a second or third time, pause to develop insights, relationships, and hypotheses. The latter part is what "Explain" is so good at, but straightforward profiling would be better for initial views. We find that many people begin either by diving immediately into a detail view or by throwing data visualizations at the wall to see what sticks. Having a methodology and process for exploratory data analysis greatly improves its completeness, objectivity, and effectiveness. It's too easy to miss an important insight if you go too deep too early. Profiling helps avoid this.
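As an illustration of the "profile first" pass described above, here is a minimal sketch in plain Python. It is not how Data Visualization or Big Data Discovery implemented profiling; the `profile` function, the "measure"/"attribute" labels, and the sample rows are all hypothetical and only show the idea of one quick summary per column.

```python
from collections import Counter
from statistics import mean, median


def profile(rows):
    """Return one summary entry per column for a list of dict rows.

    Hypothetical helper: numeric columns are treated as measures,
    everything else as attributes.
    """
    columns = rows[0].keys() if rows else []
    summary = {}
    for col in columns:
        # Ignore missing values so the counts reflect populated cells only.
        values = [r.get(col) for r in rows if r.get(col) is not None]
        info = {"count": len(values), "distinct": len(set(values))}
        is_numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if is_numeric:
            info.update(kind="measure", min=min(values), max=max(values),
                        mean=mean(values), median=median(values))
        else:
            # For attributes, the most frequent values hint at the distribution.
            info.update(kind="attribute", top=Counter(values).most_common(3))
        summary[col] = info
    return summary


# Made-up sample data, for illustration only.
rows = [
    {"region": "East", "sales": 120.0},
    {"region": "West", "sales": 80.0},
    {"region": "East", "sales": 100.0},
    {"region": None, "sales": 60.0},
]
result = profile(rows)
for column, info in result.items():
    print(column, info)
```

Scanning the printed summary for every column is the quick first read; deeper analysis of individual attributes comes afterwards.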

More details

Data Visualization used to have this feature. It was removed without any explanation.

Original Idea Number: 3dedb877bd

6 votes


Comments

  • Philipp Kaufmann-Oracle
    Philipp Kaufmann-Oracle Rank 4 - Community Specialist

    I agree. The data set profile feature would be (and was) a great way to quickly understand a new data set.

  • Michal Zima
    Michal Zima Rank 7 - Analytics Coach

    Everyone who was used to this feature really misses it, so please bring it back.
    Thanks
    Michal

  • No ETA at this point, but the team is working on it. Once designs and specs are ready, it will show up on the roadmap timeline. The current code name is Explain for Data Sets, but it is unclear whether that will be the name at delivery.

  • Tim, I wonder if the data quality tiles in 6.0 addressed the basic need that you specify here.

  • TimVlamis
    TimVlamis Rank 5 - Community Champion

    Hi Gabby. First, thank you so much for the data quality tiles. It's a great feature and much appreciated. It doesn't completely meet the needs of data profiling, though. When we were able to choose a fact for attributes, it enabled us to get a sense of the data set. Right now, we're only getting counts. We can see high-low, and the automatic equal-width binning gives us a sense of the distribution. It would be nice if the bins could be set for either equal width or ntiles (equal height). Binning for dates is often "unnatural". It can be useful to see what the bin width is, but it would be better if the binning followed whichever level of the natural hierarchy of time (day, week, month, year) is most appropriate. It would also be nice to be able to select a "group by".

    Some may argue we can do this with "Explain", but in my experience Explain almost does too much (weird, right?). It's a lot to run and read through for a full data set. For what it's worth, I'm not a huge fan of building automated models and then making predictions based on those models without understanding a bit more about how the predictions are made. The "expected" versus "unexpected" values are easily misinterpreted. Automated clustering works well, but it's really only good for attributes with the "right number" of factors. There's lots more we can talk about regarding "Explain", but I suppose it's been around long enough that it would be tough to change much.

    I'd perhaps look to other examples for what would be needed at a minimum for data profiling. One of the best implementations of profiling from Oracle that I've seen was in the Data Miner interface Explore node. It gave the summary statistics that one would normally get (median, mean, mode, max, min, kurtosis, and skew) and would automatically produce binned graphs. (Many data analysts would also like to see the basic stats from a box plot.) You could also select a "group by" column in the Data Miner Explore node. The profiling in Big Data Discovery was also very strong. It chose the natural graph type for columns pretty well (bar, line, map) and was a good data profiling summary tool for reading through a data set and understanding it. Maybe look to the EDA package in R or some of the similar Python packages. We don't need all the statistics, but summary stats would be extremely helpful. Thanks again for all the excellent work in OAC 6.0.

  • @Tim Vlamis the combination of the Data Quality insights in prep and the Auto Insights in workbooks is our replacement for the old insights feature. We will continue to enhance both with more capabilities.
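
For readers unfamiliar with the distinction raised in the comments between equal-width binning and ntiles (equal-height binning), the following is a minimal sketch in plain Python. It does not reflect how the data quality tiles or the Data Miner Explore node compute their bins; the function names and sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Count values in n_bins intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        idx = min(int((v - lo) / width), n_bins - 1) if width else 0
        counts[idx] += 1
    return counts


def equal_height_bins(values, n_bins):
    """Split sorted values into n_bins groups of (nearly) equal size.

    Returns (lowest value, highest value, count) for each non-empty group.
    """
    ordered = sorted(values)
    n = len(ordered)
    bins = []
    for i in range(n_bins):
        chunk = ordered[i * n // n_bins:(i + 1) * n // n_bins]
        if chunk:
            bins.append((chunk[0], chunk[-1], len(chunk)))
    return bins


# Made-up skewed sample: one outlier stretches the range.
values = [1, 2, 2, 3, 3, 3, 4, 50]
width_counts = equal_width_bins(values, 4)
height_bins = equal_height_bins(values, 4)
print(width_counts)
print(height_bins)
```

With a skewed column like this one, equal-width bins put almost every value in the first bin, while equal-height bins keep the counts balanced and let the bin boundaries show where the data is concentrated.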