When a dataset is created in Oracle Analytics Server / Oracle Analytics Cloud Data Visualization (DV), data profiling currently runs automatically. On large datasets this causes significant performance problems: profiling may not complete even after more than 10 minutes.
Typical dataset usage patterns (sourced from Vertica) include:
~10 billion records total, ~1,000 records added per day, 93 columns
~50 billion records, 42 columns
~15 billion records, 122 columns
To address performance and usability concerns, we propose adding a configurable Data Profiling mode with the following options:
No Profiling – Skip profiling entirely for faster dataset creation.
Sample-Based Profiling (default) – Profile a representative sample for balanced performance and insight.
Full Dataset Profiling – Profile the entire dataset, with a performance warning for very large data volumes.
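To make the proposal concrete, the three modes could be modeled roughly as in the sketch below. This is only an illustration of the requested behavior, not an existing OAS/OAC API: the names ProfilingMode, ProfilingPlan, and plan_profiling, as well as the sample size and the large-dataset threshold, are hypothetical and chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProfilingMode(Enum):
    """The three proposed options (hypothetical names)."""
    NONE = "none"      # skip profiling entirely
    SAMPLE = "sample"  # proposed default
    FULL = "full"      # profile every row


@dataclass(frozen=True)
class ProfilingPlan:
    rows_to_profile: int
    warning: Optional[str] = None


# Assumed values for illustration only; real limits would be configurable.
DEFAULT_SAMPLE_ROWS = 100_000
LARGE_DATASET_ROWS = 1_000_000_000


def plan_profiling(mode: ProfilingMode, total_rows: int,
                   sample_rows: int = DEFAULT_SAMPLE_ROWS) -> ProfilingPlan:
    """Decide how many rows to profile and whether to warn the user."""
    if mode is ProfilingMode.NONE:
        return ProfilingPlan(rows_to_profile=0)
    if mode is ProfilingMode.SAMPLE:
        # Never ask for more rows than the dataset contains.
        return ProfilingPlan(rows_to_profile=min(sample_rows, total_rows))
    # FULL: profile everything, but warn when the dataset is very large.
    warning = None
    if total_rows >= LARGE_DATASET_ROWS:
        warning = (f"Full profiling of {total_rows:,} rows may take a long "
                   "time and put heavy load on the source system.")
    return ProfilingPlan(rows_to_profile=total_rows, warning=warning)
```

With these assumed values, a 50-billion-row dataset in the default mode would have only 100,000 rows profiled, while choosing full profiling for the same dataset would still proceed but return a warning for the user to confirm.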
Why This Matters
Performance Optimization – Large-scale datasets often contain billions of rows; forcing full profiling significantly delays dataset creation and consumes system resources unnecessarily.
User Control & Flexibility – Different use cases have different needs. Analysts working with exploratory or time-sensitive data may prefer faster creation with sampling or no profiling, while data stewards might require full profiling for validation purposes.
Efficient Resource Utilization – Profiling large datasets puts a heavy load on database connections, memory, and compute resources. Selective profiling can reduce the impact on both the OAS/OAC server and source systems (e.g., Vertica).
Improved User Experience – Long-running or incomplete profiling operations lead to frustration and workflow interruptions. Configurable profiling modes enable smoother dataset setup and analysis.
Transparency & Control – Clear options and warnings let users make informed decisions about the trade-off between accuracy and performance.