Sentiment Analysis Using Oracle Data Miner
We regularly come across the opinions the people have about products, companies, films, etc. These are generally located on various shopping websites, forums, blogs, etc. Yes the list could go on and on. A number of companies are now looking to see how they can use these various Big Data sources to gain a better insight into what is happening. For example, if a company launches a new product they can monitor in real-time how their customers and the wider market thinks of their product. They can take data feeds from a variety of Big Data sources, integrate them, analyze them and be able to get up to the minute anlysis of what their customers are saying. Depending on the opinions being expressed they can then tweak their marketing campaigns, product offerings, etc.. This topic goes by the name of Sentiment Analysis. Another commonly associated name is Opinion Mining.
In this article I will show you how you can use Oracle Data Miner, which is part of SQL Developer, to build a Sentiment Analysis model of customer product reviews. You should be able to replicate these instructions to work on data for your own company. The article is divided into three main parts. The first part shows you how to load data into your Oracle Schema using SQL Developer. The second part shows you how to build and apply a sentiment analysis model using Oracle Data Miner and the 11.2g Oracle Database. Then Part three shows you how to do the same functionality using the 12c Database and the 12c features in Oracle Data Miner.
Part 1: Our Sample Data Set
The sample scenario we are going to use in our examples is based on a reviews of the Da Vinci Code book. The reviews were taken from the a website that is no longer operational, but the data set is available from the Kaggle website (www.kaggle.com). The data set consists of two attributes. One of the attributes is the review comment left by a reader of the book. It is this comment that we want to be able to analyze to work out the sentiment i.e. is it positive or is it negative. The sample data was given to 300+ people to make human judgements on whether a book review was positive or negative.
We call this a pre-labelled data set. It is this label and the customer review comments that will be used to build our Sentiment Analysis model, using the machine learning algorithms that are build into the database (in-database data mining).
Load the Data into your Schema
The data set comes as a tab separated file. You can use the Data Import tool in SQL Developer to load this data into your schema. To do this right-click on Tables in the object tree for your schema and select Import Data from the drop down menu.
Navigate to the directory where the file is located, select the file and click next. You will now be in the Data Import Wizard. The first step of this wizard involves specifying the characteristics of the file. The following image shows the setting that are required. The only change that you really have to make is to set the file type to TXT. Click the Next button.
In Step 2 of the Wizard, give the Name of the table you want to create. I’ve called my DA_VINCI_BOOK_REVIEWS. Enter the table name and click the Next button. Then click the Next button again to advance to Step 4 of the Wizard. At this step you can rename the columns to OPINION_VALUE and BOOK REVIEW. Click the Next button and then the Finish button to complete the Data Import Wizard. All the data should be loaded into the new table in your schema.
Our data set contains 3995 positive reviews and 3091 negative reviews.
The last data setup step is to add a CASE_ID to the table. Oracle Data Miner likes to have one attribute that gives the case id (or primary key) for each record. We will do this by adding a new attriubte to the table and to assign it the ROWID.
alter table da_vinci_book_reviews
add case_id varchar2(50);
set case_id = rowid;
Alternatively you could create a view to do this and use this view in the following examples instead of the table DA_VINCI_BOOK_REVIEWS.
create or replace view da_vinci_book_reviews_v as select rowid case_id, book_review, opinion_value from da_vinci_book_reviews;
Part 2 – Building a Sentiment Analysis model using Oracle 11.2g Database
In this section of the article I will show how to build a Sentiment Analysis model using Oracle Data Miner and the Oracle 11.2g Database. Oracle Data Miner comes as part of SQL Developer 3 and SQL Developer 4. The example shown in this example will also work in Oracle 12c Database, but there is new functionality for 12c and this will be show in Part 3.
Preparing and Tokenizing the Data
I’m assuming that you already have your database setup with the Oracle Data Miner repository. If you don’t then you will need to do this before you can continue. There are details of how to setup the ODM repository available on my blog and on the Oracle Learning Library on the Oracle website.
The first step with all projects in Oracle Data Miner (part of SQL Developer) is to create a Data Node for the data we want to analyze. To do this we need to select the Data Source node from the Data section of the Components Workflow Edit.
Select the Data Source node from the Data section and move the mouse to your worksheet and click. The Data Source node will be created and the Define Data Source wizard with open. Search the table listing to find your table, then click the Next button and the the Finish button, to complete the setup of the Data Source node.
The Build Text node, located in the Text section of the Component Workflow Editor, is used to break the book review comment into individual words. This is called tokenization. It also removes certain words, called stop words and punctuation from the text. In addition it does a count of the top X words, specified in the Build Text node, and select the frequent occurring words that fit within the top X words.
To create the Build Text node, select it from the Text section of the Component Workflow Editor. Move the mouse to the workflow and click again to create the node. You then need to join the Data Source node with the Build Text node. To do this right click the Data Source node and select Connect from the menu. Then move the mouse to the Build Text node and click again. This will create a connection between the two nodes. This connection will feed the records from the DA_VINCI_BOOK_REVIEWS into the Build Text node.
To setup and configure the Build Text node you will need to double click on the node. The first thing we need to do is to set the Case ID. This is an attribute that is unique for each record. In an earlier section we created a CASE_ID attribute. Select this attribute from the down list for the Case ID. The next step is that we need to specify how we want the Build Text node to handle the BOOK_REVIEW attribute. The text that we want to analyze is in this attribute. Click on the BOOK_REVIEW attribute and then click on the Add Text Transform(s) icon. This is the small green plus symbol.
When then icon is clicked the Add/Edit Text Transform window opens. Here you can edit the type of text transformations we want to use. In our case we want to be able to analyze what words in the book review contribute to a positive review and what words contribute to a negative review. To do this will use the Token transform type. Each word in the review will be tokenized (separated and marked for analysis). The default stop list will be used. This stop list contain many of the common words that do not have meaning for a sentence. Examples of some of these include ‘all’,’and’,’but’,’me’,’he’,’she’, etc. You can accept the rest of the defaults and click the OK button.
You should see some changes to the attribute list in the Edit Build Text node widow. We get a new column added called BOOK_REVIEW_TOKEN. This attribute has a data type of DM_NESTED_NUMERICALS. The DM_NESTED data types are a clever feature in Oracle as it allows us to embed data and values within one attribute, instead of having to create an unknown number of attributes to contain the data. This new attribute will contain the tokenized data for the book review.
Click the OK button to finish the setup of the Text Build node.
The final step in setting up the data is to run the Build Text node. To do this right click on the Build Text node and select run from the menu. When complete we can view the tokens. Double click on the Build Text node. Then click on the BOOK_REVIEW_TOKEN attribute we get a list of the tokens i.e. the words used in the book review. The list of words/tokens are given with the frequency of each, across the data set.
Building the Sentiment Model
Now that we have prepared our data for text mining we can now build a Classification model that will take the tokenized data as the main input and will use this tokenized data to build a Classification model. Oracle Data Mining comes with four in-database classification algorithms. These are Support Vector Machine, Decision Tree, Generalize Linear Model and Naïve Bayes. In Oracle 11.2g Database you cannot use the Decision Tree algorithm for text models. So you will have to remove it from the Class Build node.
To build the Classification Build node into your workflow you will need to select the Classification node from the Models menu of the Component Workflow Editor. Move the mouse to be near the Build Text node and click again to create the node on your workflow. Next we need to join the Build Text node to the Classification node. Then double click on the Classification node to edit the settings. Set the Target to OPINION_VALUE and Case Id to CASE_ID. Click on the Decision Tree in the Model Setting list and then click on the red X mark to remove it. You can leave all the other detail settings.
To build the models close the Edit Classification Build Node window. In you workflow right click on the Classification node and select Run from the drop down menu. After a short period of time you will see a little green tick mark on the Classification node. This means that Oracle has finished building the three Sentiment models based on the Classification algorithms.
We can look at what Oracle Data Miner has produced for each model and we can examine how accurate each model is. To compare the models and their accuracies right click on the Classification node and select Compare Test Results from the menu. We get a number of graphs and statistical analysis of the performance of each model produce by Oracle.
By examining the performance information we can see that the Support Vector Machine seems to give the best or most accurate result. This is the Sentiment model that we can use to automatically determine new book reviews as they are entered.
What is the Sentiment of our new data?
As you gather new comments and reviews about your product, etc. you can automatically find out the sentiment of them using Oracle Data Miner, with having to get a human to sift through many hundreds of thousands of records. Oracle Data Miner can do this quickly for you. All you need to do is to have your new product reviews in a new table and then apply one of the text classification models to it. To show you an example of this, we are going to pretend that the DA_VINCI_BOOK_REVIEWS table contains new reviews.
To apply one of the text models to the “new”data we need to create an Apply node in out workflow. The Apply node is located in the Evaluate and Apply section of the Component Workflow Editor. Create this node on your workflow. Next you need to create two connections into the Apply node. The first is a connection from the “new” data source node to Apply node and the second is from the Classification Build node to the Apply node.
You could run each of the three classification models against the “new” data, but given the data volumes you might be dealing with in the Big Data world this big not be a good idea if you want to get a quick response. What you would typically do is to select one of the classification models. This would be the one that gives you the best results based on your initial training data. Let us take the GLM model to illustrate this.
After you have connected the “new” data source and the Classification Build node, to the Apply node you need to select the classification node you want to use. To do this click on the Classification Build node, go to the Properties tab. Under the Output column select into the other model to deselect them. You should end up with just the GLM model with a green arrow.
You can now run the Apply node to score the sentiment of the new product reviews. These can be persisted in your schema, using a Create Table or View node.
Part 3 – Building a Sentiment Analysis model using Oracle 12c Database
In part 2 of this article I showed how to create a sentiment model using Oracle Data Miner. The approach shown in Part 2 works for those of you who are using the Oracle 11.2g Database. This approach works with the version of Oracle Data Miner that comes with SQL Developer 3 or SQL Developer 4.
If you have the Oracle 12c Database and SQL Developer 4 there is an alternative way of building your sentiment analysis models. In this part of the article I will show you how to do this.
If you are using the Oracle 12c Database and SQL Developer 4, the processing that the Text Build node does, has been mergered into the Classification Build node. Another difference with the Oracle 12c Database is that you can now use Decision Tress for text classification. This part of the article will show you how to do this.
The first step is to create a Data Source node for the DA_VINCI_BOOK_REVIEWS table. Then create the Classification Build node and link the two tables. To edit the Classification node setting double click on the node. When the Classification Build node opens you will notice that there is a new tab in the Edit Classification Build Node window. This new tab is called Text. When you click on this tab you will notice that is most of the same information that the Text Build node had in Part 2. You can set up you Text specific values here instead of creating a separate node in your workflow.
An important setup step is that the Categorical Cutoff value must be less than the maximum length of your text to be analyzed. The Default value is 200 and this is suitable for most text scenarios. If you just analyzing twitter feeds then you will need to set this value less than 140.
You are now ready to run the classification node. When the workflow has finished you can view the results by right clicking the Classification Build node and selecting Compare Test Results from the menu.
We similar results to what we got in Part 2, for the same three algorithms and we get the additional information for the Decision Tree.
The same setup and processing that was shown in Part 2 to Apply a model to new data is applicable here.
In this article I have shown how you can build a Sentiment Analysis model for predicting the opinion/sentiment of a book review. There are two main steps in setting up a similar scenario for your organization. The first is that you need to take some existing text, be that product reviews, forum comments etc., and get they labelled as positive or negative. This is the human side of the project and you may need to get a few employees to do this over a short period of time. The second part involves using the machine learning algorithms and other data preparation features in Oracle to build a Sentiment model based on the data and the work completed by the employees. Once we have build and selected our best performing Sentiment model we can then deploy this in our applications to enhance the customer experience. For example, if a customer writes a positive review of a book then you might present them with certain offers or products that other similar people have expressed and interest in. We could use Association Rules in Oracle to help us determine these offers or products. Similarly if some write a negative review then we can present them with a different set of offer or products.
This is an example of how you can combine many different Oracle data mining features to enhance the customer experience and to easily integrate the data mining models developed in Oracle into your applications. It is just SQL after all.
The examples illustrated in this article shows you how to build a Sentiment Analysis model using Oracle 11.2g and Oracle 12c Databases, with the functionality available in SQL Developer 4.
About the Author
Brendan is an independent consultant and lectures on Data Mining and Advanced Databases in the Dublin Institute of Technology in Ireland. Brendan is an Oracle ACE Director (BI) and has extensive experience working in the areas of Data Mining, Data Warehousing, Data Architecture and Database Design. He has worked on projects in Ireland, UK, Belgium and USA. He has been working with the Oracle database and tools since 1992, starting with Oracle 5, Forms 2.3, ReportWriter 1.1. Brendan is the deputy editor of the UKOUG Oracle Scene magazine and is the deputy chair of the OUG Ireland BI SIG. Brendan has presented at ViraThon, UKOUG Annual Conference, OUG Ireland Conference, OUG Norway and the BIWA Summit. He is also author of the forthcoming book published by Oracle Press on Data Mining in Oracle