Machine Learning with R in Oracle Database

Version 1



    Learn to leverage the power of Oracle Database when performing data analysis with R.


    By Yuli Vasiliev


    In the era of cheap disk storage, high-speed Internet connections, and vast increases in processing power, collecting even large volumes of data has become incredibly simple. However, simply collecting data does not reveal its business value. To yield meaningful information, data must be processed and analyzed.

    Known for its ability to operate efficiently on large volumes of data, Oracle Database is ideal for hosting the powerful, yet resource-consuming, procedures that derive meaningful value from raw data, implementing the concept of moving data processing closer to the data. Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, turns this concept into reality by integrating R, an open source statistical programming language well suited for data analysis, with Oracle Database, enabling high-performance in-database execution of R commands and scripts.


    Getting started with Oracle R Enterprise requires a grasp of how it works and how to use it efficiently. A good structure to follow when learning Oracle R Enterprise includes:

    Transparency layer, which enables users to:

    • Leverage proxy objects (ore.frames) so that data remains in the database
    • Take advantage of overloaded R functions that translate functionality to SQL
    • Use standard R syntax to manipulate database data

    Parallel, distributed algorithms, which enable users to:

    • Improve scalability and performance
    • Use in-database algorithms from Oracle Data Mining (ODM)
    • Use additional R-based algorithms executing in database server

    Embedded R execution, which enables users to: 

    • Store and invoke R scripts in Oracle Database
    • Perform data-parallel, task-parallel, and non-parallel execution
    • Use open source CRAN packages


    This article does not pretend to cover all these topics in detail, of course. Rather, it explains some basic concepts behind Oracle R Enterprise and provides examples to show how it can be used.

    Oracle R Enterprise Bridges R to Oracle Database

    As a statistical programming language, R offers a rich set of tools for data analysis. Oracle R Enterprise enhances this functionality by introducing a set of objects and functions for working efficiently with data stored in Oracle Database.


    Oracle R Enterprise provides the ability to access database tables as R data.frames and to push R data.frames to the database as tables, creating corresponding proxy Oracle R Enterprise objects to be manipulated in R. Using proxy objects for tables enables you to overcome the memory limitations of a client R session and to take advantage of the database server's processing power when performing data analysis operations. In fact, Oracle R Enterprise has overloaded many standard R functions so that they transparently translate R operations into SQL that runs in the database.

    Users can also pull database objects to local R objects. To perform these push and pull operations, you use the ore.push and ore.pull functions, respectively. The first creates a temporary database table from a local R object. For example, you can use ore.push to push an R data.frame object to the database as a temporary table, receiving an ore.frame object as an R proxy for that table. Strictly speaking, an ore.frame object maps to a database table or view, representing an Oracle R Enterprise proxy object for that database object; it is a subclass of data.frame. Conversely, you can use ore.pull to pull a database table's data into an R data.frame object.
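A minimal round trip might look like the following sketch, assuming an already established ORE database connection:

```r
# Assumes a connected ORE session (ore.connect has been called).
df <- data.frame(id = 1:3, val = c(10, 20, 30))

# Push: creates a temporary database table and returns an ore.frame proxy
df_of <- ore.push(df)

# Pull: copies the table's data back into a local data.frame
df_local <- ore.pull(df_of)
```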


    When a client R session disconnects from the database, all the unsaved, temporary tables created for R objects pushed during the session are automatically removed. If you want to create proxy ore.frame objects for permanent database tables, you can use the ore.create function. This approach can be very useful when you want to access your data not only with R but also with SQL, while still performing in-database computation.

    Diagrammatically, this might look like the figure below:



    Figure 1: Oracle R Enterprise allows pushing R objects to the database as tables, pulling data back into R (if desired), and overloading R functions to implicitly translate R operations into SQL for interacting with those tables.

    An important limitation of pulling data into R is that you can pull a database table or view into an R data frame only if the data fits into the local R session memory. This can be a problem with large data sets. Moreover, there is no reason to pull data if you plan to feed the data set into one or more ORE functions.

    Several examples of how pushing and pulling work are provided in the Understanding the Transparency Layer section a bit later in this article.


    Setting Up Your Working Environment


    This section provides a quick overview of the Oracle R Enterprise installation. If you have already installed Oracle R Enterprise in your system, you can skip this section.

    Before installing Oracle R Enterprise, ensure that your system environment meets the hardware and software requirements as described in the Oracle R Enterprise System Requirements section of Oracle R Enterprise Installation and Administration Guide.

    It is important to note that Oracle R Enterprise consists of two components: Oracle R Enterprise Client and Oracle R Enterprise Server. The server component is installed on the computer running the Oracle Database server (note that Oracle R Enterprise Server is not a separate server; it is installed into Oracle Database). The client is installed separately on each computer running Oracle Database client software, which can be either Oracle Database Client or Oracle Database Instant Client. If you want to use both the client and the server of Oracle R Enterprise on the machine running Oracle Database, you do not need a separate client installation.


    Oracle R Enterprise Server installation is a multi-step process, during which you need to install the following software components on the machine hosting Oracle Database:


    • R (either open source or Oracle R Distribution)
    • Oracle R Enterprise Server
    • Oracle R Enterprise Supporting Packages


    If you do not have Oracle Database already installed in your system, the simplest way to prepare your working environment is to install Oracle VM VirtualBox and then import into it a pre-built appliance that provides a pre-installed and pre-configured Oracle database, such as the Database App Development VM. With such an appliance, you get not only a ready-to-use Oracle database, but also the predefined operating system user oracle, which is used by default and meets the requirements for the user that installs Oracle R Enterprise Server.

    Once you have made sure your database is up and running and its release satisfies the Oracle R Enterprise Server support matrix, you can move on to installing the components listed above. The next step is to install R for Oracle R Enterprise. Oracle recommends installing Oracle R Distribution, Oracle's free distribution of R, since it is best suited for Oracle R Enterprise.


    To install Oracle R Distribution on a Linux server, you must log in as root. For Oracle Linux systems, the recommended way to install Oracle R Distribution is from the Oracle public yum server. Alternatively, you can install the Oracle R Distribution RPMs directly. For that, you might visit Oracle Linux 7 (x86_64) Addons or Oracle Linux 6 (x86_64) Addons, depending on your system, and install the necessary RPMs from the list:



    Figure 2: Installing Oracle R Distribution RPMs directly from the Oracle Linux repositories.


    For the exact list of RPMs to be installed, refer to the Installing R for Oracle R Enterprise chapter of Oracle R Enterprise Installation and Administration Guide. This chapter also describes the Oracle R Distribution installation options for the other supported operating systems.


    After you have installed R in your system, you can install Oracle R Enterprise Server and the supporting packages. Unlike Oracle R Distribution, the installation of Oracle R Enterprise Server does not require root access. The user, however, must be a member of the dba group and must have write access to $ORACLE_HOME/lib. For further details on requirements, refer to the Oracle R Enterprise Server Requirements section in the Oracle R Enterprise Installation and Administration Guide.

    To install Oracle R Enterprise Server, download the Server and Supporting ZIP files from the Oracle R Enterprise Downloads page. Unpack the ZIPs to a local directory on your database server machine and find the installation script, called server. Using this script, you can install, upgrade, or uninstall Oracle R Enterprise Server, or perform a partial installation.

    To perform a default installation of Oracle R Enterprise Server, you can launch the server script without arguments. On Linux:

    $ ./server.sh

    For Windows:

    > server.bat

    During the process, you will be prompted for the information required to set up the underlying database configuration and install the necessary packages. For details, you can refer to the A Default, First-Time Installation of Oracle R Enterprise Server section in the Oracle R Enterprise Installation and Administration Guide.

    The successful completion of the server script completes the installation of Oracle R Enterprise on the server computer. As mentioned, a separate client installation is not needed on the computer hosting Oracle Database. However, if you plan to use a client computer or computers, you will need to perform a separate client installation on each. For details, refer to the Installing Oracle R Enterprise Client chapter of Oracle R Enterprise Installation and Administration Guide.

    Understanding the Transparency Layer

    Now that you have installed Oracle R Enterprise, let's look at how you can benefit from using it. Before moving on to particular examples, however, it is important to understand how Oracle R Enterprise works. In particular, you need to understand the transparency layer, one of the key components of the Oracle R Enterprise architecture. Understanding the transparency layer is fundamental to understanding how Oracle R Enterprise accesses and manipulates data, enabling R users to transparently leverage the power of the Oracle Database server.

    In a nutshell, the Oracle R Enterprise transparency layer enables users to operate on database tables as if they were R objects, implicitly translating R operations into SQL for in-database execution. Physically, the transparency layer functionality is implemented in the OREbase, OREgraphics, and OREstats packages, overloading functions found in the open source R base, graphics, and stats packages, respectively. The overloaded functions in the above ORE packages enable creating R objects mapped to database objects.

    Naturally, when you launch a new ORE session, your first step, after loading the ORE packages, is to connect to the underlying database. First, start R using the ORE script:

    $ ORE

    Next, in the R prompt, you load the ORE packages:

      > library(ORE)

    And then connect to an Oracle Database instance:

    > ore.connect(user = "ruser", sid = "orcl", host = "localhost", password = "pswd", port = 1521)

    After a database connection has been established, you can access the data in the database schema you have connected to. To make all the tables in the schema available in your ORE session as ore.frame objects, you can explicitly invoke:

    > ore.sync()

    > ore.attach()

    Or use the ore.connect argument all=TRUE to have Oracle R Enterprise do this for you implicitly. Another example of using these ORE functions is provided in the Dealing with Large Data Sets section later in this article.
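The implicit variant might look like this sketch, reusing the connection parameters shown above:

```r
library(ORE)

# all=TRUE implicitly performs ore.sync() and ore.attach() after connecting,
# exposing every table in the schema as an ore.frame proxy object
ore.connect(user = "ruser", sid = "orcl", host = "localhost",
            password = "pswd", port = 1521, all = TRUE)
```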

    If you don’t have your data already stored in the database, you can still, as an ORE user, take advantage of the processing power and memory capacity of the database server by pushing R objects to the database as tables.

    For the purposes of this example, let’s create a simple data frame object from scratch:

    > emps <- data.frame(employee=c("Maya Silver","John Polonsky","Tom Falking"),
    +                    title=c("programmer","manager","engineer"),
    +                    sex=c("F","M","M"))

    Then, you can push the above data frame to the database:

    > emps_of <- ore.push(emps)

    You might want to determine the class of the newly created object:

    > class(emps_of)

    [1] "ore.frame"
    attr(,"package")
    [1] "OREbase"

    Just like with an R data frame, you can look at the data contained within:

    > emps_of

       employee title sex

    1 Maya Silver programmer F

    2 John Polonsky manager M

    3 Tom Falking engineer M

    Unlike a data frame, however, an ore.frame object does not contain data in the local R session memory; it is simply a metadata object that maps to a database table. As mentioned, the ORE transparency layer implicitly translates R into SQL for in-database execution. So, you might want to know what SQL is actually executed when you access data in an ore.frame object. To reveal that, look at the value of the dataQry attribute of the ore.frame object:

    > attributes(emps_of)$dataQry


    "( select /*+ no_merge(t) */ VAL004 NAME001,VAL005 NAME002, VAL001 ,VAL002 ,VAL003 from \"RUSER\".\"ORE$5_1\" t )"

    As you can see, the data is taken from table ORE$5_1 in the ORE user schema.

    The table name is also available through the sqlTable attribute:

    > attributes(emps_of)$sqlTable

    [1] "\"RUSER\".\"ORE$5_1\""

    If you want to learn more about the table, this can be easily done with an SQL tool such as Oracle SQL Developer. The screenshot below shows the Data tab of the ORE$5_1 table in Oracle SQL Developer:


    Figure 3: Oracle SQL Developer allows users to view database content.

    It is interesting to note that ore.push is not the only way to create an ore.frame object along with an associated temporary database object. For example, joining data from ore.frame objects with the merge function also transparently creates a temporary database object for the joined data. To see this in action, create a data frame that can be joined with the emps data frame created earlier in this section:

    > bonus <- data.frame(employee=c("Maya Silver","John Polonsky","Tom Falking"),
    +                     bonus=c(2500,1000,1500))

    Then, push the above data frame to the database:

    > bonus_of <- ore.push(bonus)

    Finally, merge the ore.frame objects:

    > mgr_of <- merge(emps_of, bonus_of, by.x="employee", by.y="employee")

    > mgr_of

       employee title sex bonus

    1|1 Maya Silver programmer F 2500

    2|2 John Polonsky manager M 1000

    3|3 Tom Falking engineer M 1500

    It is important to emphasize that combining ore.frame objects does not cause data to move between the server (database) and the client (R session). Instead, the database joins the tables associated with the ore.frame objects being combined and generates a temporary view to represent the joined data, all within the database.


    Dealing with Large Data Sets

    In a typical scenario, you obtain a data set as raw data in CSV or other formats and then load it into R for analysis.

    As an ORE user, you have more options, which can be listed as follows:


    • You load a data set into an R data frame and then push it into a temporary database table, obtaining an ore.frame proxy object for that table, as discussed in the preceding section.
    • You load a data set into an R data frame and then move it into an automatically created permanent database table, obtaining an ore.frame proxy object for that table. In a later session, you can obtain an ore.frame for the same table without having to load the data set into an R data frame again.
    • You load a data set directly into a database table using SQL and then create an ore.frame proxy object for that table in your R session with ore.sync.

    Since the first option has already been discussed in the preceding section, let’s proceed to the second one. In the following example, you load a data set into an R data frame from a CSV file. For this example, I used a sample of the Airline On-Time Performance data set, containing 450K records.

    > setwd('/home/oracle/Downloads')

    > getwd()

    [1] "/home/oracle/Downloads"

    > ontime_df <- read.csv('ONTIME.csv')

    > class(ontime_df)

    [1] "data.frame"

    Assuming you have already connected to the database with ore.connect as discussed earlier, you can create a persistent database table, deriving the structure and data from the ontime_df data frame created above:

    > ore.create(ontime_df, table="dt_ontime")


    On the R side, this implicitly creates an ore.frame object with the same name as the table.

    Now you can access the dt_ontime object by name:

    > class(dt_ontime)

    Next time you connect an R session to the database, you can restore the dt_ontime object with this single line of code:

    > ore.sync(table = c("dt_ontime"))

    In practice, you may have to deal with data sets that are too large to fit into your client R session memory. If this is the case, the natural choice is to load the data directly into the underlying database. Alternatively, you might already have the data for analysis stored in the database. In that case, all you need to do is associate the table or view containing the required data set with an ore.frame proxy object in an R session.

    In either case, to make a data set stored in a database table or view available for analysis in R, you can execute the following code:

    > ore.sync(table = c("dt_ontime"))

    > ore.attach()

    As a result, you obtain a proxy ore.frame object for the specified table (dt_ontime, in this example), which you can then pass to ORE functions.
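Once the proxy object is attached, overloaded Transparency Layer functions operate on it in-database. A small sketch, assuming the dt_ontime columns used later in this article:

```r
# Each call is translated into SQL and runs in the database;
# no data is pulled into the local R session
nrow(dt_ontime)
summary(dt_ontime$DEP_DELAY)
aggregate(dt_ontime$DEP_DELAY,
          by = list(dt_ontime$DAY_OF_WEEK), FUN = mean)
```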

    Building Data Models


    Now that you know how to load and access data efficiently in an ORE environment, it’s time to explore ORE’s data modeling capabilities.


    Oracle R Enterprise comes with the OREmodels and OREdm packages that contain functions for building analytical data models on data available through ore.frame objects. In the rest of this section, you’ll look at some examples of using functions from both of the above packages.

    Let’s start with the OREmodels functions. The ore.randomForest function, available in the OREmodels package starting with Oracle R Enterprise 1.5, is definitely worth considering. It fits a random forest classification model by growing trees in parallel on data in an ore.frame object.

    Let’s put ore.randomForest into action by building a classification model on the dt_ontime data set discussed in the preceding section. Recall that it contains 450K records and can therefore be considered large.

    At a minimum, ore.randomForest requires only a formula that specifies the target and predictor variables, and an ore.frame object as the input data set. So, you might start with the following invocation:

    > mod <- ore.randomForest(DAY_OF_WEEK~DEP_DELAY+DISTANCE, dt_ontime)

    If you see the following error message as a result, the data passed in contains NA values.


    Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch, :

      ORA-20000: RQuery error

    Error in : missing values in object

    ORA-06512: at "RQSYS.RQGROUPEVALIMPL", line 121

    ORA-06512: at "RQSYS.RQGROUPEVALIMPL", line 118

    The na.action parameter of ore.randomForest defaults to na.fail, which instructs the function to fail if the data passed in contains NA values. As a solution, you could preprocess the data set to filter out NA values. However, the simplest solution is to set na.action to na.omit, instructing the function to drop rows with missing values:

    > mod <- ore.randomForest(DAY_OF_WEEK~DEP_DELAY+DISTANCE, dt_ontime, na.action=na.omit)


    This time, the model build started successfully on my test machine but, after a while, ended with the following error message:


    Error in value[[3L]](cond) :

      Model build exceeds the memory limit in bulk process. Increase the value of the argument 'groups' to build model in smaller pieces

    As you can see, the message refers to the groups argument, suggesting that you increase its value. If you look at the Usage section in the online help for ore.randomForest:



       ore.randomForest(formula, data, ntree=500, mtry = NULL,

       replace = TRUE, classwt = NULL, cutoff = NULL,

       sampsize = if(replace) nrow(data) else ceiling(0.632*nrow(data)),

       nodesize = 1L, maxnodes = NULL, confusion.matrix = FALSE,

       groups = getOption("ore.parallel", NULL), na.action = na.fail, ...)


    you will see that the groups argument defaults to the global option ore.parallel, or NULL if that option is not set. You can check the status of ore.parallel in your R environment by issuing:

    > getOption("ore.parallel", NULL)


    Then, you can set or reset it, as follows:

    > options(ore.parallel=8)

    > getOption("ore.parallel", NULL)

    [1] 8

    Alternatively, you could explicitly specify the groups argument when calling ore.randomForest, as in the following example:

    > mod <- ore.randomForest(DAY_OF_WEEK~DEP_DELAY+DISTANCE, dt_ontime, na.action=na.omit, groups = 8)

    The process of building a model may take some time to complete, depending on your machine's performance and the data set size.

    Building some models on a large data set can be a resource-consuming process, during which you may run out of space in the tablespace containing the ORE user's database schema. The simplest way to monitor how much free space currently remains in that tablespace is with the following query.
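One such query, sketched here against the standard USER_FREE_SPACE dictionary view (it assumes you are connected to the database as the ORE user):

```sql
-- Free space, in MB, per tablespace accessible to the current user
SELECT tablespace_name,
       ROUND(SUM(bytes) / 1024 / 1024) AS free_mb
FROM   user_free_space
GROUP  BY tablespace_name;
```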



    If the tablespace size is not large enough to accommodate all the objects being created in the ORE user schema during building a model, then the model build operation will fail. If so, consider increasing the tablespace size, as described in the Altering and Maintaining Tablespaces section of the Oracle Database Administrator’s Guide. The amount of RAM available is also a consideration here. For further details, you can refer to the Managing Memory Limits and Configuring Exadata for Embedded R Execution post.

    Note: The important thing to understand here is that space consumed during the model build is released only after the ORE session disconnects from the database. An ORE session's database connection can be closed explicitly with ore.disconnect, or implicitly either when ore.connect is invoked to establish a new connection or when you quit the session. Until one of these happens, even explicitly removing the model object may not free up the consumed space.
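For example (a sketch), removing the proxy object alone is not enough; the space is reclaimed only once the session disconnects:

```r
rm(mod)           # removes only the client-side proxy object
ore.disconnect()  # releases the session's temporary database objects and space
```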

    It is interesting to note that the database space, as well as the RAM on the database server machine, consumed by a random forest model build depends directly on the number of trees grown. Thus, you can reduce the burden on the ORE user schema tablespace and on RAM by lowering the value of the ntree argument passed to ore.randomForest. Reducing the number of trees also significantly reduces the time required to build a model. For example, if you set ntree to 100 (one fifth of the default 500), the database space required, as well as the time to complete the operation, is reduced by up to a factor of five.
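A sketch of such a reduced build, reusing the earlier formula and data:

```r
# ntree=100 grows one fifth of the default 500 trees,
# lowering tablespace, RAM, and runtime requirements
mod <- ore.randomForest(DAY_OF_WEEK ~ DEP_DELAY + DISTANCE, dt_ontime,
                        na.action = na.omit, groups = 8, ntree = 100)
```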

    After the model has been built successfully, you can use it to make predictions on new data.

    > mod



    ore.randomForest(formula = DAY_OF_WEEK ~ DEP_DELAY + DISTANCE, data = dt_ontime, groups = 8, na.action = na.omit)

       Type of random forest: classification

       Number of trees: 500

       Number of groups: 8

    No. of variables tried at each split: 1

    > pred <- predict(mod,dt_ontime_test,type="all",supplemental.cols="DAY_OF_WEEK", cache.model=FALSE)

    > confusion.matrix <- with(pred, table(DAY_OF_WEEK, prediction))

    As mentioned, ore.randomForest is part of the OREmodels package, which also contains several other functions you can use for building advanced analytical data models.

    Another ORE analytics package, OREdm, includes functions that expose the Oracle Data Mining algorithms, executed in the database kernel, through an R interface. The functions in this package can be used to build regression, classification, and clustering models. For example, the ore.odmKM function employs the Oracle Data Mining k-Means (KM) algorithm to build a clustering model. If you need to build classification or regression models, you might take advantage of ore.odmGLM, which uses the Oracle Data Mining Generalized Linear Models (GLM) algorithm.

    The entire list of the supported data mining models can be found in the Oracle Data Mining Models Supported by Oracle R Enterprise section of the Oracle R Enterprise User's Guide.

    Whichever OREdm function you choose, however, the usage is similar. The following example illustrates the ore.odmNB function, with which you can build a Naive Bayes classification model.


    > nb.pred <- predict(nb.mod,dt_ontime_test,supplemental.cols="DAY_OF_WEEK")

    > with(nb.pred, table(DAY_OF_WEEK, PREDICTION))



    DAY_OF_WEEK 1 2 3 4 5 6 7

       1 1176 0 398 12 462 575 754

       2 1023 647 481 11 389 635 49

       3 368 673 855 15 386 627 32

       4 383 0 838 375 391 667 34

       5 384 0 475 354 787 652 51

       6 365 0 416 11 664 660 58

       7 540 0 451 15 486 670 703


    As you can see, the syntax is similar to the syntax used in the preceding example.

    Embedded R Execution

    The R code examples in this article so far have been run through an R interface. However, the ability to invoke R code through a SQL interface is crucial for production database applications. Oracle R Enterprise supports both an R interface and a SQL interface for embedded R execution, which lets you store R scripts in the database's R script repository and invoke them in R sessions that run on the database server, under the control of Oracle Database. This section provides a simple example of how this can work.

    Suppose you want to perform scoring on new data, using the Naive Bayes model created in the preceding example (a common task in a production environment).

    The first step is to store the model in the database. Below, you create a datastore in the database and save the nb.mod object in it:

    >, name = "dt_store1")

    The remaining steps are implemented through SQL. Alternatively, you could do the same in R by creating a script with ore.scriptCreate and then invoking it with ore.tableApply.


    In the following PL/SQL code, you create a script named NBModelOnFlights1 in the database R script repository. The function contained in the script loads the nb.mod model object from the dt_store1 datastore and then uses this model to score the data passed in. The function returns a confusion matrix, to match the preceding example implemented in R.

    To execute the following code, the RQADMIN role is required.



    BEGIN
      sys.rqScriptCreate('NBModelOnFlights1',
       'function(dat, ds, obj) {
          ore.load(name=ds, list=obj)
          mod <- get(obj)
          dat <- ore.frame(dat)
          prd <- predict(mod, dat, supplemental.cols="DAY_OF_WEEK")
          rslt <- with(prd, table(DAY_OF_WEEK, PREDICTION))
          rslt
        }');
    END;
    /



    After the above code has completed successfully, you can invoke the NBModelOnFlights1 script using the rqTableEval SQL function, as in the following example:

    SELECT * FROM table(rqTableEval(
       cursor(SELECT DAY_OF_WEEK, DEP_DELAY, DISTANCE FROM dt_ontime_test),
       cursor(SELECT 'dt_store1' as "ds",
              'nb.mod' as "obj",
              1 as "ore.connect" FROM dual),
       'SELECT 1 "1", 1 "2", 1 "3", 1 "4", 1 "5", 1 "6", 1 "7" FROM dual',
       'NBModelOnFlights1'));
    The results might look like this:

         1 2 3 4 5 6 7

    1208 0 457 20 435 562 731

    1005 655 516 18 331 652 54

    360 698 900 22 353 633 32

    391 0 832 337 374 646 33

    396 0 512 332 788 653 51

    376 0 413 12 624 669 42

    489 0 532 22 415 714 704

    The rqTableEval function used in this example invokes an R script with a full table as input. To take advantage of parallelism, you could use rqRowEval instead. Its syntax is similar, but it lets you specify the number of rows to pass to the function at a time, enabling multiple invocations of the function in parallel. ore.rowApply is the R equivalent of rqRowEval.
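In R, a row-chunked scoring run might be sketched as follows; the chunk size is illustrative, and the function body mirrors the NBModelOnFlights1 script above:

```r
# Score dt_ontime_test in chunks of 1000 rows; each chunk may be
# processed by a separate database R engine in parallel
res <- ore.rowApply(dt_ontime_test,
                    function(dat, ds, obj) {
                      ore.load(name = ds, list = obj)
                      mod <- get(obj)
                      predict(mod, ore.frame(dat),
                              supplemental.cols = "DAY_OF_WEEK")
                    },
                    ds = "dt_store1", obj = "nb.mod",
                    rows = 1000, ore.connect = TRUE)
```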


    In this article, you looked at how Oracle R Enterprise bridges R to Oracle Database, providing the ability to move heavy data processing performed by R functions to the database server. You learned that Oracle R Enterprise overloads many R functions so that a model build launched in R can take advantage of the database server processing power and capabilities, such as parallelism. You also looked at how R code can be run through a SQL interface, providing a convenient way to employ R in production database applications.

    Next Steps


    Learn more about R technologies from Oracle.

    Learn more about Oracle R Enterprise

    Read more about Oracle R Enterprise 1.5 Predictive Analytics.

    Yuli Vasiliev is a software developer, freelance author, and consultant currently specializing in open source development, databases, business intelligence (BI), machine learning, and virtualization. He is the author of a series of books on Oracle technology, including Oracle Business Intelligence: An Introduction to Business Analysis and Reporting (Packt Publishing) and PHP Oracle Web Development: Data Processing, Security, Caching, XML, Web Services, and Ajax (Packt Publishing).