Bert Scalzo

Managing Mountains of Data

Posted by Bert Scalzo Apr 24, 2018

When is data less valuable? How do you manage data as it becomes less valuable?

In the good old days the answer to these questions was easy: it was based simply on what the company was willing to pay for relatively expensive onsite disk space. So while a company might have preferred to keep data around as long as possible, the reality was closer to the bare minimum the company could afford to keep and was willing to pay to house. Back then this was not a major tradeoff, because we did not have data scientists or data analytics, and business intelligence was in its relative infancy. Below is a simple diagram of a data life cycle management schema that smart DBAs might have implemented. That was then.


Nowadays businesses see data as a leading differentiator, worth its weight in gold. Hence they want to keep as much data accessible online as possible, within reasonable cost. With today's much lower disk costs plus tiered cloud storage options, the amount of data that can practically be kept online is arguably near unlimited. Of course that's not a complete answer, since we also need to maintain certain performance service level agreements (SLAs). But today a DBA might have to maintain four to ten times as much historical data to meet new data driven business requirements, including regulatory and compliance needs. Clearly the data life cycle management picture above is no longer the optimal solution. For example, on the Amazon cloud you might implement something along these lines.
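To make the tiering idea concrete, here is a minimal sketch of how an Oracle 12c DBA might automate moving colder data to cheaper storage using Automatic Data Optimization (ADO). The table name SALES and tablespace COLD_TS are hypothetical, and ADO/Heat Map require the Advanced Compression option, so treat this as illustrative only:

```sql
-- Sketch only: SALES and COLD_TS are hypothetical names.
-- Heat Map tracking must be on for ADO policies to fire.
ALTER SYSTEM SET HEAT_MAP = ON;

-- Compress segments that have not been modified for 30 days.
ALTER TABLE sales ILM ADD POLICY
  ROW STORE COMPRESS ADVANCED SEGMENT AFTER 30 DAYS OF NO MODIFICATION;

-- Move cold segments to a cheaper storage tier as the hot tablespace fills.
ALTER TABLE sales ILM ADD POLICY TIER TO cold_ts;
```

The same pattern extends to the cloud era by pointing the cold tier at lower-cost storage, which is exactly the multi-tier layout the diagram above is meant to convey.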


For purposes of discussion let's assume that you have an Oracle 11g or 12c RAC cluster deployed on premise, and now you've been directed to investigate moving it to the cloud. Of course there is always the Oracle cloud, but let's further assume that your company has more than just Oracle databases and has therefore chosen a generic cloud provider such as Amazon AWS, Microsoft Azure or Google Cloud. These cloud providers generally do not offer network multicast, virtual IPs or database specific QoS controls. Furthermore, they also do not offer the true shared storage required by RAC, historically provided on premise via a SAN or NAS. So what are you to do?


Please note that I am not promoting any one cloud provider over another; I don't want to start any heated debates or discussions. I am simply asking: what do you do when management says to move your Oracle databases, including some RAC clusters, and mandates that the move be to a cloud provider on which the company has already standardized? As I see it you have four choices (some better than others, and some not really choices at all).


First, you might be able to educate management and push back on the decision to move Oracle databases to other cloud providers due to RAC infrastructure requirements. Only you know how much clout or political capital you have to spend on such an endeavor. My experience has been that more and more Oracle DBAs find the tail cannot wag the dog anymore. As many hesitant DBAs learned in the early days of database virtualization, resistance is futile. Thus this option may not really be feasible.


Second, you might suggest that since a cloud VM can be created and resized as needed, you could consolidate the RAC cluster into a single VM whose size equals the total of all the current nodes. While that might solve the scalability issue, it loses the most critical aspect of RAC: high availability (HA). So although this might sound good at first, it sacrifices HA, which is not a tradeoff you should readily make.


Third, you might consider using Oracle Data Guard to provide HA via a physical and/or logical standby database run in maximum protection mode. However, a physical standby introduces manual failover and maintenance tasks that simply do not exist with RAC. And to quote Ask Tom about logical standby: "There are more moving pieces which require more setup, care, and feeding". Also, your company may not be licensed for Data Guard, nor have the skills to run it. So in effect you'd be introducing two variables into this cloud migration: the cloud and Data Guard. That too may not be a viable choice.
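For reference, the maximum protection mode mentioned above is set and verified on the primary with standard commands like these. This is a sketch only; it assumes SYNC redo transport to at least one standby is already configured, otherwise the primary will refuse to run in this mode:

```sql
-- Sketch only: assumes a standby with SYNC redo transport already exists.
-- Raise the configuration to maximum protection.
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE PROTECTION;

-- Verify the role and protection mode actually in effect.
SELECT database_role, protection_mode, protection_level
  FROM v$database;
```

Even with the commands being simple, the ongoing care and feeding (gap resolution, switchover drills, patching both sites) is where the real operational cost lies.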


Fourth and finally, you could search for a way to deploy Oracle RAC in a non-Oracle cloud. Searching the web you should encounter a company called FlashGrid, which offers an interesting and novel solution. You will also find references to FlashGrid on Amazon's and Azure's marketplaces. Of course many skeptical DBAs might ask, "Does Oracle support such a solution?" You can easily find a few blog posts against the idea. While they present reasonable technical points for consideration, they also admit to not yet having tried to run Oracle RAC on AWS or Azure. You might remember 10-12 years ago when the same kinds of questions were asked about running Oracle on VMware. At that time too there were blog posts against it, and even some question as to the level of Oracle support. Yet today we all run Oracle on hypervisors without much concern, and Oracle fully supports it.


So my advice is to treat this like any new car you might buy: take this or any other promising technology for a test drive and form your own opinion. I have run Oracle RAC on both AWS and Azure, so far with no issues. You can create a two or three node Oracle RAC cluster via FlashGrid's simple provisioning wizards; it takes about 10 minutes to define what you want and then an additional 90-120 minutes to deploy a fully functioning environment. So for the mere cost of a few hours of cloud time you can judge for yourself whether this technology has merit for your database needs. Take it for a spin!

If you ever need a simple, lightweight and absolutely free data migration tool, then you should try Data Puppy Lite. It can migrate a schema at a time from a source to a target database, and it supports the following databases: Oracle, MySQL, SQL Server and PostgreSQL. Data Puppy Lite automatically handles all the required, complex data type conversions as you go from one database to another. As a Windows client tool it's not well suited for moving very large amounts of data, but when all you really need is something quick and easy, it fits the bill.



Microsoft has been extremely innovative the past few years, especially with regard to the SQL Server relational database both on premise and in the cloud. Although I am a Linux bigot and love the fact that Microsoft now offers SQL Server on Linux, that's not the most interesting new development. I also like the new pricing versus feature structure: all editions now offer almost all features, the only limiting factors being maximum CPU, memory and database size. That makes edition upgrades painless, with zero application changes required.

But arguably the most interesting and disruptive new feature is the "stretch database", which allows your on premise database to store a portion of your data as normal and the rest in the cloud, with total transparency to the application and end users. Some users adopt this feature to slowly migrate to the cloud, while others leverage the cloud's near infinite scalability to keep colder data online for longer periods of time, as required by the business. Here is a basic diagram depicting this concept and feature. Note that SQL Server's query optimizer handles this division of data totally transparently; a big table logically looks like it's all local when in fact part of it is on premise and part of it is in the cloud, as shown here.
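Enabling stretch is a three-step affair: instance, database, then table. Here is a minimal T-SQL sketch; the database, table, Azure server and credential names are all hypothetical, and in practice you would likely add a FILTER_PREDICATE so only cold rows migrate:

```sql
-- Sketch only: SalesDB, dbo.OrderHistory, the Azure server and
-- AzureCred are hypothetical names.
-- 1) Allow Stretch at the instance level.
EXEC sp_configure 'remote data archive', 1;
RECONFIGURE;

-- 2) Point the database at an Azure SQL Database server.
ALTER DATABASE SalesDB
  SET REMOTE_DATA_ARCHIVE = ON
      (SERVER = N'myazureserver.database.windows.net',
       CREDENTIAL = [AzureCred]);

-- 3) Start migrating the table's rows to the cloud.
ALTER TABLE dbo.OrderHistory
  SET (REMOTE_DATA_ARCHIVE = ON (MIGRATION_STATE = OUTBOUND));
```

After step 3, queries against dbo.OrderHistory transparently span the local and Azure-resident rows, which is exactly the transparency depicted in the diagram.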




But as interesting as this capability is, it has some key limitations. First, it requires SQL Server 2016 or greater, and it can only stretch into an Azure SQL Database, so it is limited to both one database platform and one cloud provider. Furthermore, there are several key data type, constraint and DML operation limits which further complicate this solution. Moreover, the costs for leveraging this capability are higher than many find acceptable, since one must pay several fees: basic cloud storage, database stretch units (DSUs), which are the computing power applied to stretch processing, and outbound data transfers.


What many people would prefer is this exact same capability for any on premise database, stretched into any cloud provider, with the target being any database (i.e. not requiring the same database on both sides) and without data type or other limits. Such a solution does exist! Cirro's federated query engine provides the capability to create "partition views" over tables regardless of their database platform. For example, you can stretch Oracle on premise to PostgreSQL on Amazon RDS. In fact you can stretch any number of tables, either on premise or in the cloud, and across any number of database platforms. So the figure shown below highlights only the most basic scenario.




Below is the Cirro partition view definition required to stretch my two largest on premise Oracle tables to PostgreSQL on Amazon RDS. I keep roughly 10% of the hottest data on premise and 90% of the colder data in the cloud.


create partition view stretch_table_1 as
    select * from ORACLE.ORCL.stretch.table_1a partition by range on c1 (minvalue, 1000)
    union all
    select * from POSTGRESQL.stretch.stretch.table_1b partition by range on c1 (1000, maxvalue);

create partition view stretch_table_2 as
    select * from ORACLE.ORCL.stretch.table_2a partition by range on c1 (minvalue, 1000)
    union all
    select * from POSTGRESQL.stretch.stretch.table_2b partition by range on c1 (1000, maxvalue);


Then from the end users' perspective, they simply query the Cirro view and Cirro automagically handles optimizing the query and returning the results. Note that in the example below the user has two tables, each with portions on premise and in the cloud, and performs a SQL join across them without issue. Now that's what I call truly stretching databases and tables. And only Cirro currently offers this.


select * from stretch_table_1 t1 join stretch_table_2 t2 on (t1.c1 = t2.c1)
    where t1.c1 between 500 and 2500 and t2.c2 between 4 and 6;