Introduction
Data mining (sometimes called knowledge discovery) is the process of analyzing and summarizing data into useful information that can be used to understand common features and the origin of data, and to extract hidden predictive information. Data mining is used in science, engineering, modeling and the analysis of financial markets.
This article discusses a free data-analysis framework called jHepWork, which is widely used to facilitate data analysis and data mining (see Figure 1). It was designed for scientists, engineers and students who need numerical and statistical computations, data and function visualization and even symbolic computation.
jHepWork is a 100% Java package, which means it is fully object-oriented and runs on any Java Virtual Machine regardless of computer architecture. Another notable feature is that it uses the Python programming language to call Java classes for numerical and statistical computation as well as for data visualization. To be more exact, jHepWork fully utilizes the power of Jython, which is an implementation of the Python programming language in Java.
Such a merge of Java and Jython is not accidental. According to the TIOBE Programming Community Index, Java is the world’s most popular programming language. Python is among the popular scripting languages widely used in science, engineering and education. It is also the fastest growing programming language of 2010 according to the same TIOBE index. jHepWork uses the Python language because of its short and clear syntax, which is handy for calling numerical Java libraries. As a result, data-analysis programs written with this approach are short and clear, while still utilizing the full strength of Java.
This is somewhat different from GUI-only programs, which typically require walking through various menus and sub-menus to perform certain tasks. In the jHepWork approach, one can write short commands in Python to perform computations with arbitrary algorithmic logic that can be changed at runtime. Such an approach is also important for repetitive tasks: analysis code, once saved into a file, can be executed multiple times with different inputs (which is a tedious task for GUI-only programs). In some sense, the scripting approach to data mining is similar to that of the R programming language. The difference is that jHepWork is based on Jython and takes full advantage of its object-oriented design, the Python programming language with its high-level standard library, the power of the Java API, and the jHepWork Java libraries for data manipulation and visualization.
That said, one can always use a pure Java approach to develop data-mining analysis programs with jHepWork, since all numerical and graphical libraries of jHepWork are implemented in 100% Java. Alternatively, one can use another scripting language, such as BeanShell, or the Java scripting API shipped with the javax.script package. Finally, one can use the powerful Eclipse or NetBeans IDEs while editing analysis programs.
Short tutorial
In this tutorial we illustrate the strength of jHepWork for data mining using the Jython language. We show how to analyze multidimensional data, display data on 2D and 3D canvases, plot a function, and perform a full-scale linear regression analysis of the kind widely used in the statistical interpretation of data.
Let us assume that we have a matrix of numbers organized as:
# this is a comment
1 2 3 4
5 6 7 8
.......
(the numbers of rows and columns can be arbitrary). The goal of this tutorial is to analyze this data and to extract some useful information. The numbers can be stored in a file which can be located on the Web.
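To make the expected file layout concrete, here is a minimal pure-Python sketch of how such a text matrix could be parsed and how a single column could be pulled out of it. It is not jHepWork code: the helper names parse_matrix and column are hypothetical, and it assumes that comments start with “#” and that numbers are separated by whitespace. It is written to run under both Jython/Python 2 and Python 3.

```python
def parse_matrix(text):
    """Parse whitespace-separated numbers, skipping blank lines and '#' comments."""
    rows = []
    for line in text.splitlines():
        # Assumption: everything after '#' on a line is a comment
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        rows.append([float(token) for token in line.split()])
    return rows


def column(rows, index):
    """Return one column (0-based index) as a flat list."""
    return [row[index] for row in rows]


sample = """# this is a comment
1 2 3 4
5 6 7 8
"""

rows = parse_matrix(sample)
print(rows)             # [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
print(column(rows, 1))  # second column: [2.0, 6.0]
```

Note the 0-based column index, which matches the convention used for the jHepWork containers later in this tutorial.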
First, make sure that the Java Virtual Machine (http://www.java.com/) is installed. Then download the jHepWork package from http://jwork.org/jhepwork/, unzip the package file and run the script “jhepwork.sh” (Linux/Mac) or “jhepwork.bat” (Windows). The first time you do this, Jython will create a cache directory, which may take twenty to forty seconds depending on the speed of your system. Jython needs to index all Java classes visible to the Java Virtual Machine; this simplifies programming (there is no need to specify every Java class in the import statements) and speeds up code execution.
After start-up, you will see the jHepWork IDE as shown in Figure 1. It is bundled with a powerful code editor and a code assist based on Java reflection. It also has a Jython shell (“JythonShell”, located below the main editor) and a BeanShell. Both help with interactive development of data-mining analysis code and can also be used to call external commands. For this tutorial, we will use the JythonShell, since one can see the program response immediately after entering commands line by line.
The first step is to read the data into a jHepWork data container designed for convenient data manipulation. We will read our data from a prepared file located on the Web. Make the JythonShell window bigger and enter the code shown below line by line, pressing [Enter] after each line:
>>> from jhplot import *
>>> pn=PND('data','http://jwork.org/jhepwork/examples/data/pnd.d')
>>> print pn.toString()
Here we create a PND object using the input file “pnd.d” stored on the Web and print the numbers stored in this container as a check. The PND class is located in the “jhplot” package, which is shipped together with jHepWork; this is the main jHepWork package for data manipulation and visualization. The input file has exactly the same structure as shown before, i.e. each row is on its own line. We use the Python print statement to print the string returned by the method toString(); the numbers appear in the JythonShell, which displays the output of the print command. Alternatively, one can use the pn.toTable() method to display all numbers in a sortable and searchable table.
Want to learn about the methods of the “pn” object? Just type “pn.” (the dot is important!) and press [Ctrl]-[Space]. You will see a drop-down menu with the methods of this class. Alternatively, one can look at the complete API of the PND class as
>>> pn.doc() # this brings up a window with the class API
Let us continue with the analysis of our data. The first thing we want to do is to extract the numbers from the second column and display them as a histogram (a bar-chart density plot) in order to understand the statistical characteristics of the data. Assuming that the “pn” object has been created as shown before, we extract the second column using the index 1 (the first column has the index 0):
>>> p0=pn.getP0D(1) # extract 2nd column and put to a 1D array
>>> print p0.getStat() # print a detailed statistical characteristics
>>> c1=HPlot('Plot') # create a canvas to display a histogram
>>> c1.visible() # pop-up canvas. c1.visible(False) creates the image in background
>>> c1.setAutoRange() # set auto-range for the X and Y axis
>>> h1=p0.getH1D(10) # convert 1D array into a histogram with 10 bins
>>> c1.draw(h1) # draw the histogram
You will see a long list of statistical characteristics of the array holding the second column (object p0) and a pop-up window with the histogram of this array. The code is self-explanatory and contains the necessary comments to explain each step. For example, the method p0.getH1D(10) fills a one-dimensional histogram (the Java class H1D) using ten bins between the minimum and the maximum value of the array “p0” (the Java class P0D). You will be surprised to find how many methods the H1D class contains. According to the Java API, the histogram class H1D has about 100 methods for data manipulation (excluding the methods for graphical representation).
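The binning step described above can be sketched in a few lines of plain Python. This is only an illustration of equal-width binning between the minimum and the maximum of an array, not the H1D implementation; the function name histogram is hypothetical, and the choice to count the maximum value in the last bin is an assumption (a real histogram class may treat the upper edge differently).

```python
def histogram(values, nbins):
    """Count values in nbins equal-width bins spanning [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(nbins)
    counts = [0] * nbins
    for v in values:
        if width == 0:
            i = 0                      # all values identical: use the first bin
        else:
            i = int((v - lo) / width)  # index of the bin containing v
        if i >= nbins:
            i = nbins - 1              # assumption: the maximum goes into the last bin
        counts[i] += 1
    edges = [lo + k * width for k in range(nbins + 1)]
    return edges, counts


data = [0.0, 1.0, 1.5, 2.0, 2.5, 4.0, 9.0, 10.0]
edges, counts = histogram(data, 10)
print(counts)  # [1, 2, 2, 0, 1, 0, 0, 0, 0, 2]
```

With ten bins over the range 0 to 10, each bin is one unit wide, so the counts can be checked by eye against the input list.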
If you want to make a file with high-quality vector graphics, use the method c1.export("fig.pdf") (for the PDF format) or c1.export("fig.ps") (for the PostScript format). jHepWork supports about a dozen formats for image output. Figures can be generated in the background without bringing up the canvas; in this case, use the method c1.visible(0). Finally, jHepWork has a powerful input-output mechanism for each data object (histograms, functions, data arrays), which allows storing all objects in files using either the Java serialization mechanism or simple text-based files with compression.
Scatter plot and linear regression
The next step in our analysis is to extract two arbitrary columns and make an X-Y scatter plot in order to find a correlation between the numbers in these columns. In the example below we extract the second and third columns, plot them on an X-Y canvas and then perform a least-squares linear regression:
>>> from jhplot.stat import *
>>> p1=pn.getP1D(1,2) # extract 2nd and 3rd columns
>>> c1=HPlot('X-Y plot')
>>> c1.visible(); c1.setAutoRange() # set autorange
>>> c1.draw(p1)
>>> r=LinReg(p1)
>>> print "Intercept=",r.getIntercept(), "+/-",r.getInterceptError()
>>> print "Slope=",r.getSlope(),"+/-",r.getSlopeError()
This code should follow the code that creates the object “pn”, as discussed before. The execution of this example creates an X-Y graph with the values of the second and third columns, performs a least-squares regression and prints the values of the intercept and the slope (with their statistical uncertainties) of the linear-regression line. But how can we visualize this line? We can create a function from the values of the slope and the intercept using Python string formatting:
>>> func='%4.2f*x+%4.2f' % (r.getSlope(),r.getIntercept()) # a string representing a function a*x+b
>>> f1=F1D( func, p1.getMin(0), p1.getMax(0)) # a function object in the data range
>>> c1.draw(f1)
This part should follow the code discussed before. Here we build a function a*x+b, substituting the slope and the intercept values for the symbols “a” and “b”. Note that we reduce the precision of these values during the string formatting (which is not too important in this example). Then we build a function object from the string in the X-axis range given by the data (p1.getMin(0) is the minimum value of our data on the X-axis and p1.getMax(0) is the maximum value).
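For readers who want to see what the slope, the intercept and their uncertainties mean, the standard textbook least-squares formulas can be sketched in plain Python as below. Whether the LinReg class computes its errors in exactly this way is an assumption; the function name linear_fit and the sample numbers are purely illustrative. The last lines also show a small variation of the string formatting that uses a signed format (%+4.2f) for the intercept, so that a negative intercept does not produce “+-” in the function string.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a*x + b with standard errors on a and b."""
    n = len(xs)
    xbar = sum(xs) / float(n)
    ybar = sum(ys) / float(n)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    # Residual variance with n-2 degrees of freedom (two fitted parameters)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)
    slope_err = math.sqrt(s2 / sxx)
    intercept_err = math.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
    return slope, slope_err, intercept, intercept_err


# Illustrative data, not taken from the tutorial file
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 4.0]
a, a_err, b, b_err = linear_fit(xs, ys)

# Signed format for the intercept avoids "a*x+-b" when b is negative
func = '%4.2f*x%+4.2f' % (a, b)
print(func)  # 0.80*x+1.30
```

The resulting string has the same a*x+b form as the one built from r.getSlope() and r.getIntercept() in the tutorial.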
Now we can do something more: we will calculate a 95% prediction interval for the regression line. The 95% prediction interval is the area in which 95% of all data points are expected to fall. Do not confuse it with the 95% confidence interval, which is the area that has a 95% chance of containing the true regression line. jHepWork can calculate both, but here we discuss only the 95% prediction interval and plot it in the form of a band on top of the data points.
>>> from java.awt import Color
>>> p=r.getPredictionBand(Color.green) # extract 95% prediction band
>>> p.setLegend(False) # do not draw the legend for this band
>>> p.setErrColor(Color.green) # set green color for error bars
>>> c1.draw(p) # show on the canvas
getPredictionBand() returns a P1D data container with a 95% prediction interval. We show this band using error bars colored in green, with the color taken from the “Color” class of the standard Java java.awt package.
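The difference between the prediction interval and the confidence interval can be made concrete with the standard textbook expressions for simple linear regression, written here for a fitted line a*x+b and n data points. Whether getPredictionBand() evaluates exactly this pointwise expression is an assumption; the formulas are given only to show where the two intervals differ.

```latex
% Fitted line, residual scale and spread of the x values
\[
\hat{y}(x) = a\,x + b, \qquad
s^2 = \frac{1}{n-2}\sum_{i=1}^{n}\bigl(y_i - \hat{y}(x_i)\bigr)^2, \qquad
S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2
\]

% 95% confidence interval for the regression line at x
\[
\hat{y}(x) \pm t_{0.975,\,n-2}\; s\,\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}}
\]

% 95% prediction interval for a new data point at x
\[
\hat{y}(x) \pm t_{0.975,\,n-2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}}
\]
```

Here t is the 97.5% quantile of the Student t distribution with n-2 degrees of freedom. The extra “1” under the square root accounts for the scatter of an individual data point around the line, which is why the prediction band is always wider than the confidence band.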
Showing data in 3D
Let us continue with this example by displaying the data in three dimensions (3D) using three arbitrary columns. This time we will display data for columns 1, 2, 3 and columns 1, 3, 4 using two separate interactive plot regions (the so-called “pads”). As before, we assume that this code follows right after the previously discussed lines and that the object “pn” has already been created:
>>> c2=HPlot3D('3D plot',600,400,2,1) # create a 600x400 canvas and make 2 drawing pads
>>> c2.visible()
>>> c2.cd(1,1); c2.setAutoRange() # navigate to first pad and set autorange
>>> p2=pn.getP2D(0,1,2) # extract columns 1,2,3 (indices 0,1,2)
>>> c2.draw(p2)
>>> c2.cd(2,1); c2.setAutoRange() # navigate to second pad and set autorange
>>> p3=pn.getP2D(0,2,3) # extract columns 1,3,4 (indices 0,2,3)
>>> c2.draw(p3)
The execution of the above code makes two interactive 3D pads which can be rotated and zoomed. Use the methods of the Java class “HPlot3D” to change the style. For example, one can change the color of the drawing box to gray using the java.awt.Color class as c2.setBoxColor(Color(200,210,210)), which can be inserted after the pad navigation call (c2.cd() in the code above).
It should be noted that, instead of using the JythonShell, one can use the jHepWork editor. Create a file called “example.py” and copy and paste the lines above (without the “>>>” prompts). To run this file using Jython, press [F8] or click on the corresponding icon on the toolbar of the jHepWork IDE. There is one essential advantage in using this approach: one can use the built-in code assist, which contains a detailed description of all methods. For example, assuming that the “pn” object has been created as shown before, type a dot after “pn” in the editor and press [F4]:
>>> pn. # + press [F4] to display a list of methods
This brings up a table showing all methods of this class. One can get a detailed description of each method and insert a selected method into the code editor. Later one can make the necessary modifications to the code and rerun it using [F8] or the toolbar icon.
Putting it all together
Now let us run all examples of this tutorial in one go. The above tutorial is given in the file “tutorial.py”, which can be found on the jHepWork web page. In the jHepWork IDE, go to the menu [File] and then [Open from URL]. Copy and paste the URL of this file into the URL window and press the button [Open] (to see the code in the editor) or [Run] (to run the code). You will see images with our tutorial as shown in Figures 2 and 3.
A final word. jHepWork comes with more than 200 example scripts, a detailed on-line tutorial and even a book describing all aspects of the Jython and jHepWork approach to data analysis. To run the examples included in the jHepWork IDE, simply go to the main menu, select [Tools] and then [jHPlot examples]. Then one can open a Jython example code and run it in the jHepWork IDE.
About the license: the core numerical and graphical Java libraries are licensed under the GNU General Public License v3. The documentation, examples, installer, code assist database and language files used by the jHepWork editor are licensed under the Creative Commons Attribution-Share Alike License (version 3.0) and are free for non-commercial usage (academic research, science and education).
References
The jHepWork project. http://jwork.org/jhepwork/
The Python language. http://www.python.org/
The Jython project. http://www.jython.org/
TIOBE index. http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
The R statistics project. http://www.r-project.org/
Confidence and prediction band. Wikipedia. http://en.wikipedia.org/wiki/Confidence_band
S.V. Chekanov, Scientific Data Analysis using Jython Scripting and Java. Book, 497 p. Springer-Verlag, London, 2010. ISBN 978-1-84996-286-5