
You’ll have to forgive me for spoofing the famous movie line “We don’t need no stinking badges” (Treasure of the Sierra Madre, 1948 and Blazing Saddles, 1974); it just seemed quite apropos.

 

I was discussing data modeling needs recently with a potential customer – and like many shops I visit, they saw no need for data modeling. They felt that their DBA’s, data administrators and application architects knew everything that was home grown, and that the rest was third-party applications over which they had no control and thus no need for a data model. That’s a very common belief these days, it seems.

 

So my reply was to try to enlighten them about why data modeling makes sense no matter what – and especially for the scenario they described. I thought it might be worthwhile to share those points with the general public in the hopes of convincing a few more people.

 

Five common misbeliefs about why data modeling is unnecessary:

 

1. Our DBA’s, data administrators and application architects have our databases under control.

 

That’s good news – you’re doing better than many shops. But what would happen if those guys and gals happen to play the Mega Millions lottery in a pool, and win $360 million? What’s going to happen when they come in Monday and quit? That’s one heck of a lot of knowledge, wisdom and experience to try to back fill – and quickly. Wouldn’t having a data model make that process a little easier? OK – so the likelihood of that happening is low. What if they all get new jobs or retire around the same time? These things happen – so like the Boy Scouts say, “Be Prepared”.

 

2. We only use 3rd party applications like SAP or Peoplesoft, so to us the database is a black box.

 

That’s perfectly true – often you cannot do anything about the design of such products. But knowing the design has several advantages. Ever been on a tech support call and wished you had a little more knowledge, either to communicate your issue better or to firmly dismiss a cause they claim is the problem when you know it is not? How about when you get a new version of the 3rd party application – wouldn’t knowing about some of the structural changes be useful in case the upgrade of the data runs into issues? And what if the business makes a requirements change, like having all addresses fully qualified and verified against a validity database? Wouldn’t having some idea of all the areas impacted be helpful? I may not be able to effect changes to the US highway system, but it sure helps to have a map for long drives outside my comfort zone.

 

3. We only use home grown applications, so we already know everything about all the databases.

 

Actually, this has already been addressed by the prior two points (just exchange home grown for 3rd party application). All the same items hold true: you need to protect your company from unforeseen knowledge loss, and it often helps to have a “road map” for impact analysis no matter who built the application. In many cases the justification for data models is even higher when the applications are developed in house – if you built the highways, you had better be able to navigate them, and a picture almost always helps.

 

4. We contract out application development, so no need to worry about such technical issues.

 

Ok – so when you buy a new home, you’re comfortable with just telling the builder to make something about 2,600 square feet? That you don’t need to see a floor plan that meets your requirements, and that the contractors building it don’t need any blueprints off which to work? We all know that’s not true – especially because a home is such a large investment of the most valuable kind of money – our own. Well, application development is no less specialized or critical, and it costs lots of precious corporate budget. What’s more, 80% or more of the IT project failures I’ve seen could have been averted, or at least lessened, by having a data model up front. I’ve even been an expert witness in court cases on this issue. The data model is nothing more than a collection of the business requirements. If an application fails and there was no such document defining the business requirements – then whose fault is it? The best companies I’ve visited not only do data models for their own projects (regardless of who builds them), they even require that 3rd party vendors provide data models as part of the RFP process.

 

5. Data modeling is quite outdated and has been replaced by newer methodologies such as UML.

 

That could well be true. If your application developers have extensive UML models that fully cover the underlying “persistent” data structures, you may not need to do data models as well. But application developers tend to be programmers, and programmers tend to think (and thus model) in terms of what the data-in-motion needs are – which is what their programs embody. So having just UML models may in fact provide you a false sense of security that you’re fully covered. If the UML models do not cover persistent data storage, then you may well need a data model to complement your UML models. It won’t hurt to double check to be safe.

Oracle’s Parallel Query Option (PQO) is a fantastic tool. But much like any good tool, it can very easily be used in the wrong situation or simply abused. In those cases, PQO can actually make database performance much worse. Let’s examine some common misuses or misperceptions regarding optimal and efficient PQO usage.

 

  1. PQO makes sense universally on any SMP or multi-processor database server.

 

Right – and I have some wonderful “magic beans” to sell to people who blindly believe that theory. Oracle has to spawn additional parallel server processes (or reuse those still lying around from recently completed PQO operations). So the cost of spawning additional processes and their footprints must be accounted for in the grand scheme of things. Plus, Oracle will need a “coordinator” process to coalesce all the results – and that too has a small cost. So calculate all this into your assumptions before using PQO. But for those of us less scientific, think of PQO as being like a grocery store checkout line. If you only have a couple of items in your grocery cart, will forcibly spreading them across multiple checkout lanes get you out the door any sooner? And even if you do have an overly full cart, will spreading your items across checkout lanes that are already busy be any better? There are obviously times when it helps and times when it hurts. We simply need to apply the same clear thinking when using PQO.
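To make that concrete, here is a minimal sketch (the table name, alias and degree are purely illustrative, not from any real system) of requesting PQO deliberately for one query via a hint, rather than blindly decorating every table with a default parallel degree:

SELECT /*+ PARALLEL(s 8) */ COUNT(*), SUM(s.sale_amount)
  FROM sales_history s            -- hypothetical large fact table
 WHERE s.sale_date >= DATE '2016-01-01';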

 

  2. When I say parallel N, Oracle spawns exactly and only N server processes.

 

OK – I’m 100% guilty of doing this on occasion. In fact, I might even do it more often than not. But I am cognizant of the above issues and have planned accordingly – and still have a reasonable PQO scenario where IO is clearly not my chief bottleneck. Thus in those clear cases I will extend beyond the actual CPU count in my PQO settings. I’ve even gone two to four times higher in the right circumstances. And in those cases it has made a clear and measurable difference. But be careful of when it does not make sense. For an easy test to verify this, simply set PQO to degree four on your dual-core desktop or laptop and run several concurrent queries. They will not only run slower, but painfully so.
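If you want to try that laptop experiment, a rough sketch (the table name is hypothetical) looks like this – launch several of these sessions concurrently, note the elapsed times, then compare against runs with the setting removed:

ALTER TABLE big_test PARALLEL (DEGREE 4);
SELECT /*+ PARALLEL(big_test 4) */ COUNT(*) FROM big_test;
-- when finished comparing timings, undo the setting
ALTER TABLE big_test NOPARALLEL;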

 

  3. I have a big multi-processor server, so full-table scans are always the way to go.

 

As silly as this may sound at first, I’ve seen this in many data warehousing scenarios. I think it happens because somewhere someone wrote that “parallel full-table scans are efficient”. But that author meant in clearly reasonable scenarios – not universally. Yet this remains one of the top data warehousing tuning mistakes I see. Think about a DW query – it may scan millions of rows in tables with many billions of rows. Why force a full-table scan when there is at least a ten-to-one, if not a one-hundred-to-one, reduction factor? That’s exactly what indexes are for – and index scans can be done in parallel. I’ve seen improvements as large as 1000% from undoing this simple PQO misuse.
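As a hedged sketch of the alternative (the fact table, index and degree below are hypothetical, and a parallel index range scan assumes the index is partitioned), you can ask the optimizer to use the index in parallel instead of forcing a parallel full-table scan:

SELECT /*+ INDEX(f, sale_date_idx) PARALLEL_INDEX(f, sale_date_idx, 8) */
       SUM(f.sale_amount)
  FROM sales_fact f               -- hypothetical billion-row fact table
 WHERE f.sale_date BETWEEN DATE '2016-01-01' AND DATE '2016-01-31';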

Bert Scalzo

Oracle Dynamic Scripting

Posted by Bert Scalzo Jan 15, 2017

Ever wanted to perform some operation against all the objects in a schema? For example, a one-time management task to turn off parallel for all tables and indexes. Another example might be an ongoing task to disable and later re-enable triggers and referential integrity to speed up a recurring data load job. These types of tasks are not uncommon; however, many people don’t know how to leverage dynamic SQL scripting to make such jobs trivially easy. So I’ll show dynamic SQL scripts for these two specific examples – which will provide you with both a working knowledge and a template for your own use.

 

Example #1: Turn off parallel for all tables and indexes for a named schema

 

-- @NOPAR.SQL  schema_name
-- Turns off parallel
-- for all tables and indexes
-- for the named schema

set echo off
set verify off
set feedback off

set linesize 256
set pagesize 0

set term off
spool %TEMP%\nopar.tmp

select 'alter table '||owner||'.'||table_name||' noparallel;'
  from all_tables
  where owner = upper('&1')
  order by table_name;

select 'alter index '||owner||'.'||index_name||' noparallel;'
  from all_indexes
  where owner = upper('&1')
  order by index_name;

spool off
set term on

@%TEMP%\nopar.tmp

 

A key aspect to understand is that as this SQL script runs, we want all of its output to be the new SQL script we’re creating to run. So we don’t want any extraneous output such as echoing of the commands being run, verification of parameter values (i.e. old vs. new), or feedback about the row counts processed. Hence the first three SET commands address this. We also don’t know how long an output text line will be, so we need to set the line size accordingly. Finally, we don’t want select headers or page breaks, and the easiest way to suppress both is to set the page size to zero. So now our generated script output is nothing more than just the SQL commands.

 

Note that I’m spooling the output to my Windows temporary directory as defined by the “%TEMP%” environment variable, plus I don’t want to see the select command output whiz by – so I set terminal output to off. Then all I do is run some common select commands against Oracle data dictionary views such as ALL_TABLES – but notice that I’m actually using text literals and concatenation operators with table columns to construct SQL commands rather than just typical select output. Once such output has completed, I simply set the terminal output back on and execute (i.e. @) the SQL script this script just created. That’s all there is to it.
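For example, assuming the script is saved as NOPAR.SQL (as its header comment suggests) and the schema of interest is named SCOTT (a placeholder for your own schema), the call from SQL*Plus would simply be:

SQL> @nopar.sql SCOTT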

 

Example #2: Disable and enable triggers and referential integrity (RI) for a named schema

 

-- @SETALL.SQL  schema_name  OFF|ON
-- Turns off or on triggers and referential integrity (RI)
-- for all tables
-- for the named schema

set echo off
set verify off
set feedback off

set linesize 256
set pagesize 0

set term off
spool %TEMP%\setall.tmp

select 'alter table '||owner||'.'||table_name||decode(upper('&2'),'OFF',' disable constraint ',' enable constraint ')||constraint_name||';'
  from all_constraints
  where owner = upper('&1')
  and constraint_type = 'R'
  and status = decode(upper('&2'),'OFF','ENABLED','DISABLED');

select 'alter trigger '||owner||'.'||trigger_name||decode(upper('&2'),'OFF',' disable',' enable')||';'
  from all_triggers
  where table_owner = upper('&1')
  and status = decode(upper('&2'),'OFF','ENABLED','DISABLED');

spool off
set term on

@%TEMP%\setall.tmp

 

This script is just a little more complicated, but it uses essentially the same technique. In fact, all the SET commands are the same. The SPOOL command differs only in the name of the temporary file. Then, of course, the select commands that build the SQL are different. Note that I added a second input parameter that allows me to use this one script to both turn off and turn on the items of interest via the DECODE function. So I only need this one script for both tasks – turning off vs. on.
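So a recurring data load job against a hypothetical SCOTT schema (again, just a placeholder name) might wrap itself with something like:

SQL> @setall.sql SCOTT OFF
SQL> -- ... run the recurring data load here ...
SQL> @setall.sql SCOTT ON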

 

There’s literally no limit to how creative you can be with such dynamic SQL scripts. For example, a long time ago, before Oracle provided the metadata package to regurgitate the DDL that creates an object, I had a complex dynamic SQL script to reverse engineer all of my database objects. The key takeaway here is that this technique should be on every DBA’s and senior database developer’s skill list.
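These days that particular reverse-engineering chore is far simpler thanks to that metadata package. A rough sketch using the very same spooling technique (script name and output settings are illustrative; note that DBMS_METADATA.GET_DDL returns a CLOB, hence the SET LONG) might look like:

-- @GENDDL.SQL  schema_name   (illustrative sketch, not a production script)
set echo off verify off feedback off
set long 100000 longchunksize 100000 linesize 256 pagesize 0
set term off
spool %TEMP%\genddl.tmp
select dbms_metadata.get_ddl('TABLE', table_name, owner)
  from all_tables
  where owner = upper('&1');
spool off
set term on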

It’s not uncommon to discover databases that have simply evolved over time rather than having been built according to some overall design or “master plan”. And there seems to be no shortage of answers as to why this occurs. Sometimes DBA’s are just far too busy with a myriad of other tasks. Sometimes developers are under the gun to deliver so quickly that there’s just no time. Or my favorite: the project started as just a couple of new tables and then ballooned into a monster. Whatever the reason, the situation exists far too often in the real world of information systems.

 

Let’s put this into perspective. Would you purchase a brand new house where the builder simply showed you a high-level “floor plan” and then said “trust us – we’ll just hammer together the lumber and stuff until we’re done, don’t worry”? Not too sure very many of us would so nonchalantly encumber ourselves with a 30-year loan on such vague promises. We automatically expect the home builder to use that rough diagram as the precursor to an architectural blueprint that meets local codes, standards and ordinances. We expect the city zoning commission or code enforcement division to sign off on key items during the construction. We expect to spend the week before taking ownership inspecting the home and marking things to be fixed before the closing. In short, when it’s our money and/or financial future – we expect things to simply work right.

 

The same is true in many other disciplines that we utilize every day. We get on elevators knowing that they were well designed and recently inspected. We readily fly on airplanes knowing that the FAA and other government agencies are making sure they’re safe and well regulated. We undergo surgery at hospitals knowing that doctors and administrators have jumped through major hoops to guarantee a reasonable level of safety. Yet we still build computerized systems with all the skill of a MASH surgeon – just get it stable enough to pass on to the next guy. No wonder then that both database administrators and application developers are quickly catching up to lawyers and politicians in terms of likeability.

 

So let’s start this week by just looking at the two most critical database success factors: effectiveness (doing the right thing) and efficiency (doing it expediently).

 

How can anything be effective if it’s just cobbled together haphazardly over time? I’m not saying that everyone must do a logical data model and transform that into a physical data model. That may not be most people’s “cup of tea”. But slowing down a wee bit to think through the overall physical design should yield a more consistent business solution. That should translate into higher effectiveness.

 

The same can be said for efficiency. Resisting the urge to simply and seemingly randomly add tables and columns on an as-needed basis can easily result in a more fluid design. It’s also much easier to code against such a design, which can mean lower development costs (i.e. it’s less complicated to work with). Furthermore, this generally translates into lower ongoing maintenance costs as well – and often far fewer unplanned crises too.

 

Find all this “pie in the sky” thinking to be too esoteric or too academic? Well, I’ve been an expert witness in several trials where companies have sued consulting firms that built substandard applications and databases. In every single case, the plaintiff won substantial judgments based solely on poor database design. So far my testimony has directly resulted in seven-figure financial awards. That’s millions of dollars per case that could easily have been saved by slowing down and doing more planning or design. So as the old song goes, “Look before you leap, still waters run deep” …

There are those who think that data modeling has become very passé these days. The belief is that because data modeling theory is over thirty years old, and because some data modeling tools have been around for 10-20 years, data modeling is somehow no longer relevant. Nothing could be further from the truth. In fact, data modeling may now be more necessary than ever.

 

While there are other modeling techniques and notations, such as BPM (Business Process Modeling) and UML (Unified Modeling Language), the need for accurately capturing the business data requirements and transforming that into a reliable database structural design is as paramount as ever. The key differentiator is that data modeling is the only technique and notation that focuses on the “data at rest”. All the others tend to focus more on “data in motion”. Another way to say this is that data modeling concentrates on issues that lead to a solid database design, while all the others tend to focus more on issues that will result in better application design or the things useful to programmers – such as data structures, objects, classes, methods, application code generation, etc.

 

I’ve personally been involved in several trials as an expert witness in which plaintiffs were suing defendants for serious financial remuneration because custom database applications had performance and/or data accuracy problems. In every single case there was a failure to data model the business requirements, and thus data effectiveness suffered. Moreover, ad-hoc database design, or database design using more programmatically oriented techniques and tools, often resulted in an inefficient database design. And no amount of coding trickery could overcome the resulting bad database design. So in every case, the plaintiff won.

 

The other reason data modeling has seen a measurable resurgence is the data warehousing phenomenon. With cheap storage these days, most companies can afford to, and benefit from, retaining historical aggregate and/or summary data for making significant strategic decisions. With the accumulation of numerous source legacy OLTP (Online Transaction Processing) systems, there are two key ways to approach populating a data warehouse: directly from source to warehouse (Figure 1) or through an intermediary database often referred to as an ODS or Operational Data Store (Figure 2).

 

[Figure 1: Populating the data warehouse directly from the legacy OLTP source systems]

[Figure 2: Populating the data warehouse through an intermediary Operational Data Store (ODS)]

 

There is already plenty of debate as to which approach is superior; I won’t address that here. However, regardless of which approach one decides to utilize, the database design (i.e. the data at rest) is paramount, because in a data warehouse the data itself, and the business information it contains, is the most relevant and valuable asset. Typical data warehouse queries and reports issued via BI (Business Intelligence) tools process that asset to yield key strategic decision-making results.

 

The other key area where data modeling often supports the whole data warehousing and business intelligence effort is through the mapping of legacy data fields to their DW and BI counterparts. This meta-data mapping of how front-line business data maps to the data warehouse helps with the design of the queries and/or reports, as well as with the ETL (Extract, Transform and Load) programming efforts. Without such a mapping, as the OLTP legacy systems evolve there would be no automatic tie to the dependent data warehousing information. Hence one would have to almost totally re-engineer rather than simply follow the OLTP source data ramifications and ripples downstream to the DW and BI end points.
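As a purely illustrative sketch (every name here is hypothetical, not a standard), such a source-to-warehouse mapping can be persisted in something as simple as a table that the ETL and BI teams share:

CREATE TABLE etl_field_map (
  source_system   VARCHAR2(30),    -- legacy OLTP system name
  source_table    VARCHAR2(30),
  source_column   VARCHAR2(30),
  dw_table        VARCHAR2(30),    -- target warehouse table
  dw_column       VARCHAR2(30),
  transform_rule  VARCHAR2(4000)   -- free-text or pseudo-code transformation
);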

 

For those not involved with data warehousing projects, possibly just performing the more traditional OLTP type systems development, data modeling is still no less important. But often people get so caught up in novel paradigms such as “extreme programming”, “agile software development” or “scrum” that they compromise or even possibly entirely skip data modeling. The problem is that these new approaches don’t always spell out exactly how data modeling should be incorporated, thus people often skip doing it. My belief is that no matter what latest and greatest approach one utilizes, data modeling should be integrated into your development process wherever it makes sense.  Figure 3 shows how both conceptual and physical data modeling should fit into an overall database design process - whether it’s for a totally new system, or one that’s simply being updated or even totally re-engineered.

 

[Figure 3: Where conceptual and physical data modeling fit into the overall database design process]

 

There’s one final reason why data modeling has been getting more attention these days. In many cases organizations are finally requiring data models as a “sign-off” deliverable as part of the development process. I attribute this to organizations attempting to adhere to the Software Engineering Institute’s (SEI’s) Capability Maturity Model (CMM) and Capability Maturity Model Integration (CMMI) concepts. The idea here is quite simple: in order to mature one’s development process regardless of technique, you need to mature in terms of both the processes and tools used to achieve the desired better end result. For example, the proper and effective use of project management tools is often quoted as the single best way to escalate from Level 1 (Chaos) to Level 2 (Repeatable). The central idea that both processes and tools can lead to maturity has helped to reinvigorate many people’s interest in data modeling.

 

Now comes the hard part: what data modeling tool should one use? That might seem like a tough or loaded question. There are numerous data modeling tools out there. Plus there are many enterprise modeling suites that contain data modeling capabilities. Rather than recommend any particular tool, I’m going to outline some basic guidelines of things to avoid. My belief is that any tool meeting some standard and minimal requirements will enable one to produce effective and efficient data models – and hence effective and efficient resulting databases.

 

  1. Avoid drawing tools that aspire to be data modeling tools. A good data modeling tool supports defining tons of meta-data that has business relevance. Think of the diagram as just the tip of the iceberg – 90% of the berg’s mass is what you don’t see because it’s underwater. The same is true for data modeling. If you concentrate on just the picture, you’ll probably compromise the effectiveness of the resulting database.
  2. Choose a tool that fits your needs. Often people will purchase a killer modeling tool that offers everything imaginable. But if all you need or will use is the data modeling portion, then why pay for more? The key concern here is that the more any tool does besides data modeling, the more chance that its data modeling capabilities have been compromised in order to do everything else. Sometimes more is not better.
  3. Verify that the data modeling tool you choose generates solid DDL, especially for your target database platform. This is another case where more might not be better. It is better to have a tool that supports 100% accurate CREATE and ALTER scripts for the few databases important to you than one that does them all at a lesser level. But be very careful: many tools, even those that focus on a few databases, can generate less than optimal DDL. You have to know what to look for – so engage your DBA in making the decision just to be safe.
  4. Verify that your data modeling tool provides robust model consistency and accuracy checking reports and/or utilities. As data models grow (and they will), it can be quite overwhelming to have to manually check everything. And you cannot expect the poor DBA to sanity check the thousands or tens of thousands of DDL lines a data modeling tool can quickly generate. Remember, above we talked about both effectiveness and efficiency. Effectiveness is mostly on your shoulders, but efficiency can be aided by good data modeling checking utilities.

 

In closing, data modeling has come a long way since its inception. And even though the heyday of CASE and software engineering passed with the ’90s, the need for and usefulness of data models has not subsided. And data modeling can assist with any effort regardless of development methodology or paradigm. So don’t pass up on data modeling these days just because it’s a mature technique. You might well be very sorry if you do.

 

Data Warehousing DBA’s usually work with huge tables, often possessing hundreds of millions or even billions of rows. And occasionally, these DBA’s might need to perform maintenance that requires either recreating the entire huge table or some of its partitions. For example, the DBA might need to drop an unnecessary column or simply desire to de-fragment that table or some of its partitions (i.e. reduce the number of extents allocated).

 

In either case, data warehousing tables, and even just their partitions, are often simply too big for unloading and reloading. For example, exporting and then importing a 700 million row table may require 2 to 3 days of run-time, whereas a “CREATE TABLE AS SELECT” (a.k.a. CTAS) would take just a few hours – assuming that sufficient space is available.

A run-time efficient version of a CTAS command to copy a huge table is as follows:

 

CREATE TABLE COPY_OF_ORDER
STORAGE (INITIAL 100M NEXT 100M PCTINCREASE 0)
PCTFREE 5 PCTUSED 90
NOLOGGING PARALLEL (DEGREE 16)
AS
SELECT *
FROM ORDER;

 

Thus, CTAS becomes the data warehousing DBA’s greatest ally. It permits the DBA to easily and quickly copy massive amounts of data. However, CTAS can also be a DBA’s worst enemy if they’re not careful. While the CTAS example above is run-time efficient for the copy operation itself, what about its effect on queries against the copied table?

 

Our test environment was a 700 million-row table on a 64-bit, 16-processor UNIX box with an EMC RAID-1 disk array. The CTAS command copied the table in less than one hour. However, queries against the copied table ran nearly four times as long as those same queries issued against the original table. What happened?

 

The problem is that the CTAS has negated the original table’s proximity of data. More simply stated, the CTAS has more uniformly spread related rows of data across many more blocks and files. This can be a very elusive problem to find. Look again at the CTAS example. Unfortunately, it looks harmless enough. It’s tough to see that it’s the source of the problem.

 

Most data-warehousing tables have their data loaded at regular intervals (e.g. nightly). As a result, related rows of data become naturally located in close proximity, often within the same or closely related data blocks inside a relatively few data files. So queries retrieving those densely packed rows perform a relatively minimum amount of physical IO.

 

The query to test a table’s data proximity is as follows:

 

SELECT
    COUNT(DISTINCT DBMS_ROWID.ROWID_RELATIVE_FNO(rowid)||'-'||
                   DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid)) "FILES-BLOCKS",
    AVG(COUNT(DISTINCT DBMS_ROWID.ROWID_ROW_NUMBER(rowid))) "AVERAGE ROW DENSITY"
FROM &table_name
WHERE &where_condition_for_returning_related_rows
GROUP BY
    DBMS_ROWID.ROWID_RELATIVE_FNO(rowid)||'-'||
    DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid);

 

The average row density for our original table was 90. In other words, each data block held approximately 90 related rows. However, the density for our new table was only 24, meaning that each block held only 24 related rows. Hence, all the related data rows are spread across nearly four times as many data blocks (90 ÷ 24 ≈ 3.75). Since Oracle does its physical IO at the block level, we have nearly four times as much physical IO occurring. That’s why our queries were running four times as long.

 

The corrected CTAS locates the related rows together and is as follows:

 

CREATE TABLE ORDERED_COPY_OF_ORDER
STORAGE (INITIAL 100M NEXT 100M PCTINCREASE 0)
PCTFREE 5 PCTUSED 90
NOLOGGING PARALLEL (DEGREE 16)
AS
SELECT *
FROM ORDER
GROUP BY &all_the_column_names_in_the_table;

 

The average row density for this newest table was 104 – which was even better than the original table’s density of 90. Note that the CTAS uses the GROUP BY clause, because the ORDER BY clause is illegal for a CTAS. Although the table copy run-time increased to nearly four hours (which was acceptable as a one time cost to copy the data versus 2-3 days for doing export/import), queries against this newest table ran within 5-10 minutes.

 

As you can see, data warehousing DBA’s have some very challenging issues to deal with. And often the problems and their solutions are not readily apparent. But that’s the real fun in working with terabytes of data …