OpenSymphony's ClickStream is a user tracking component for Java web applications. It lets you inspect and analyze the traffic paths, that is, the sequence of pages that users request as they browse your site. Such a traffic path is called a clickstream: the logical grouping of an HTTP session identifier and the requests associated with it, until the session ends. The good news is that you can easily add this feature to your application by embedding OpenSymphony's ClickStream and take advantage of this site usage information.
We'll look first at how ClickStream works and what information it collects. Then we'll proceed to configure your application to use ClickStream. Finally we'll log the ClickStream-generated information to a database and exploit it with standard database queries.
Understanding the ClickStream Lifecycle
ClickStream starts tracking the user's activity as soon as the web container creates an HTTP session. As you might know, the J2EE specification defines a listener model. Session listeners are among the listeners defined by the Servlet 2.3 specification and are notified each time an HTTP session is created or destroyed. ClickStream's session listener is called ClickstreamListener; together with ClickstreamFilter, it is one of ClickStream's two fundamental components.
ClickStream's main element is a servlet filter called ClickstreamFilter, which intercepts every request to a defined web resource (a single page) or resource set (a group of pages) designated by a URL pattern. Both components are configured in your application's web.xml file, but don't worry about this yet; we'll look at the configuration of the filter and session listener in a later section.
For now, we'll take a look at the information gathered from each clickstream.
The Logged ClickStream Data
ClickStream logs specific information from each request and accumulates it in the corresponding ClickStream object. This object is the one sent to the ClickStreamLogger for logging when the session ends. ClickStream uses the following information as the ClickStream header data:
- The remote host making the request, which it gets from request.getRemoteHost(). Remember that if your application runs behind a reverse proxy (a common scenario for firewalled applications), only the proxy's IP address is logged.
- The stream start time, which is the timestamp when the session listener creates the clickstream, as soon as the application server creates the new session.
- The last request time, which is the timestamp of the last request associated with that session.
- The HTTP referrer header, namely the user's previous page, if available.
- The bot flag, which is set to true when the client is actually a crawler bot. ClickStream detects more than 250 different kinds of bots.
- The session ID, for association purposes.
For each individual request in the stream, ClickStream logs the following detail data:
- The request protocol: HTTP 1.0, HTTP 1.1, etc.
- The request parameters: the parameters following the ? in the URL and separated by &. It also logs the parameter data submitted via the HTTP POST method.
- The absolute request URI: the JSP or servlet mapping.
- The session ID for associating with the header.
- The remote host port used to connect to the server.
- The request timestamp.
- The remote user, if available, from the container's
Embedding ClickStream into Your Application
The first step is to download the ClickStream distribution from the OpenSymphony site. Then, to embed ClickStream into your application, start by adding clickstream.jar and commons-logging.jar (if your project doesn't already use this component) to the WEB-INF/lib directory of your WAR application.
Then edit your application's web.xml descriptor with any text editor. You must add ClickStream's session listener and filter. The filter is mapped to each resource wildcard you want to track with ClickStream. For example, if you want to track every page hit, you must define the /* wildcard. On the other hand, if you want to record only the hits directed to the /MyServlet path, use a /MyServlet/* wildcard. See the servlet specification for more wildcard examples.
The following web.xml fragment instructs the filter to record only the hits directed to JSP and HTML pages.
<!-- Register the ClickStream filter -->
<filter>
    <filter-name>clickstream</filter-name>
    <filter-class>com.opensymphony.clickstream.ClickstreamFilter</filter-class>
</filter>
<!-- Track requests for JSP pages -->
<filter-mapping>
    <filter-name>clickstream</filter-name>
    <url-pattern>*.jsp</url-pattern>
</filter-mapping>
<!-- Track requests for HTML pages -->
<filter-mapping>
    <filter-name>clickstream</filter-name>
    <url-pattern>*.html</url-pattern>
</filter-mapping>
<!-- Register the ClickStream session listener -->
<listener>
    <listener-class>com.opensymphony.clickstream.ClickstreamListener</listener-class>
</listener>
The two <filter-mapping> elements associate the ClickStream filter with both extensions.
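To make the wildcard rules more concrete, the following sketch approximates how a request path is checked against a url-pattern. It is a simplified illustration, not the container's or ClickStream's actual code: the class and method names are hypothetical, and it ignores edge cases such as the default "/" mapping.

```java
// Hypothetical helper: approximates servlet-style url-pattern matching.
class UrlPatternSketch {

    // Returns true if the context-relative path matches the given pattern.
    static boolean matches(String pattern, String path) {
        if (pattern.endsWith("/*")) {
            // Path mapping: "/MyServlet/*" matches "/MyServlet" and anything below it.
            String prefix = pattern.substring(0, pattern.length() - 2);
            return path.equals(prefix) || path.startsWith(prefix + "/");
        }
        if (pattern.startsWith("*.")) {
            // Extension mapping: "*.jsp" matches any path ending in ".jsp".
            return path.endsWith(pattern.substring(1));
        }
        // Anything else is treated as an exact match.
        return pattern.equals(path);
    }

    public static void main(String[] args) {
        System.out.println(matches("*.jsp", "/index.jsp"));                // true
        System.out.println(matches("*.html", "/index.jsp"));               // false
        System.out.println(matches("/MyServlet/*", "/MyServlet/report"));  // true
        System.out.println(matches("/*", "/any/page.html"));               // true
    }
}
```

With the mappings shown earlier, a request for /index.jsp or /help.html would reach the filter, while a request for an image or a servlet path would not.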
After adding the listener and filter, you can put the included ClickStream JSPs into your web application's root directory. Both clickstreams.jsp and viewstream.jsp are needed to browse the ClickStream information online. Figure 1 illustrates clickstreams.jsp, which shows all the active clickstreams.
The clickstreams.jsp file lists all the active clickstreams of the application ordered by the remote host IP. When you click one of the host IPs, ClickStream's viewstream.jsp appears, as shown in Figure 2.
These two pages let you browse the clickstreams that have not yet been stored in the database, and are very useful for a quick look at current activity. The next section shows how to set up ClickStream to log the user traffic data into a database for further processing and analysis.
Logging the User Tracking Information to a Database
By default, ClickStream uses the Commons Logging component to store the tracking information to the console or to logging files. In this example, we'll use a custom ClickStreamLogger to save the information to a database. First we'll configure ClickStream to use our logger and then we'll create the corresponding database schema.
ClickStream offers the ability to change the logging strategy by creating a new logging class, which implements the ClickStreamLogger interface, and configuring its use in the clickstream.xml file located in the WEB-INF/classes folder. You can find the DatabaseClickStreamLogger custom database logger and the sample clickstream.xml configuration file in the included source code. Our clickstream.xml will look like this:
<clickstream>
    <logger class="net.java.cs.DatabaseClickStreamLogger"/>
    <bot-host name="inktomi.com"/>
    ... thousands of bots' names skipped for brevity.
</clickstream>
The logger is configured through a database.properties file. This properties file is also included in the sample code, and looks like this:
jdbc.driver.class=org.postgresql.Driver
jdbc.url=jdbc:postgresql://localhost/clickstream-db
jdbc.user=jdoe
jdbc.pass=secret
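As a rough illustration of how a logger might consume these four settings, here is a minimal sketch that parses them with java.util.Properties. The class DatabaseSettingsSketch and its methods are hypothetical and written only for this example; the real loading code of DatabaseClickStreamLogger lives in the sample source and may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for the settings found in database.properties.
class DatabaseSettingsSketch {
    final String driverClass;
    final String url;
    final String user;
    final String password;

    private DatabaseSettingsSketch(Properties props) {
        this.driverClass = required(props, "jdbc.driver.class");
        this.url = required(props, "jdbc.url");
        this.user = required(props, "jdbc.user");
        this.password = required(props, "jdbc.pass");
    }

    // Parses properties-format text; a real logger would read the file instead.
    static DatabaseSettingsSketch parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new DatabaseSettingsSketch(props);
    }

    // Fails fast when a key is missing, so a typo surfaces at startup.
    private static String required(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        DatabaseSettingsSketch settings = parse(
            "jdbc.driver.class=org.postgresql.Driver\n"
            + "jdbc.url=jdbc:postgresql://localhost/clickstream-db\n"
            + "jdbc.user=jdoe\n"
            + "jdbc.pass=secret\n");
        System.out.println(settings.driverClass + " @ " + settings.url);
    }
}
```

Validating every key up front is a design choice of this sketch: a missing entry is reported when the application starts rather than when the first session ends.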
Just replace the URL, JDBC driver class, user, and password with the appropriate values for your database. Our configuration is ready, so let's create ClickStream's database schema. The database model is made up of only two tables: one with the clickstream header data, and the other with the detailed request information of each clickstream. Figure 3 shows the structure graphically.
Figure 3. ClickStream DB schema
Execute the included SQL script, clickstream.sql, to create the tables in your favorite database.
We are all set up; now when your application starts, it'll begin to log the clickstream information to your database. The following section shows how to exploit the tracking information using some very useful metrics.
Exploiting the User Tracking Information
The fact that we've stored the user tracking information inside a database server means that we can classify, measure, and manipulate it at will. Some metrics you'll find very useful are:
- The distinct user count over a period of time
- The most-accessed pages
- The length of the average user browsing session, in minutes
You can also retrieve every request recorded for a single session with a query such as select * from clickstream_requests where sessionid = 'nlggs2ccbeb2'.
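To pin down what the three metrics above mean, the following sketch computes them in Java over a small in-memory list of hits. In practice you would express them as SQL against the two ClickStream tables; this version only makes the definitions concrete. The Hit class and its fields are hypothetical stand-ins for rows of the request table, and "distinct users" is approximated here by distinct remote hosts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory version of the metrics; not part of ClickStream.
class ClickstreamMetricsSketch {

    // Stand-in for one logged request row.
    static final class Hit {
        final String sessionId;
        final String remoteHost;
        final String uri;
        final long timestampMillis;

        Hit(String sessionId, String remoteHost, String uri, long timestampMillis) {
            this.sessionId = sessionId;
            this.remoteHost = remoteHost;
            this.uri = uri;
            this.timestampMillis = timestampMillis;
        }
    }

    // Distinct remote hosts seen in the half-open interval [fromMillis, toMillis).
    static long distinctHosts(List<Hit> hits, long fromMillis, long toMillis) {
        return hits.stream()
            .filter(h -> h.timestampMillis >= fromMillis && h.timestampMillis < toMillis)
            .map(h -> h.remoteHost)
            .distinct()
            .count();
    }

    // URIs ordered by descending hit count (ties broken alphabetically).
    static List<Map.Entry<String, Long>> mostAccessedPages(List<Hit> hits, int limit) {
        Map<String, Long> counts = hits.stream()
            .collect(Collectors.groupingBy((Hit h) -> h.uri, Collectors.counting()));
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    // Average of (last request - first request) per session, in minutes.
    static double averageSessionMinutes(List<Hit> hits) {
        Map<String, LongSummaryStatistics> perSession = hits.stream()
            .collect(Collectors.groupingBy((Hit h) -> h.sessionId,
                Collectors.summarizingLong((Hit h) -> h.timestampMillis)));
        return perSession.values().stream()
            .mapToLong(s -> s.getMax() - s.getMin())
            .average()
            .orElse(0.0) / 60_000.0;
    }
}
```

In SQL, the same ideas map naturally onto count(distinct ...), group by with order by, and an average over the difference between the header table's start and last-request timestamps.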
Under the Hood
You can visualize the interactions by looking at the sequence diagram in Figure 4, which depicts the complete lifecycle of ClickStream inside your web application.
Figure 4. ClickStream lifecycle sequence diagram
As the UML sequence diagram shows, ClickStream's activity starts when an HTTP request arrives. If no HTTP session is associated with the request, the web container creates one and notifies the session listeners; in this case, ClickstreamListener. ClickstreamListener generates a new ClickStream object to collect the user's page track and stores it in the newly created session.
Then, if the request matches one of the resources defined by the ClickstreamFilter wildcard inside the web.xml file, the web container calls the ClickstreamFilter. This filter adds the request information to the session's ClickStream object. This cycle continues until the session is explicitly invalidated or the session expires due to user inactivity. Each page or resource the user requests is logged into the same ClickStream object.
When ClickstreamListener is notified about the end of a user session, it logs the ClickStream by calling ClickStreamLogger. ClickStream configures this component with a Jakarta Commons Logging Logger by default, but this can be overridden with a custom ClickStreamLogger, as we saw earlier.
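The lifecycle described above can be modeled in a few lines of plain Java. This is only a sketch of the control flow and does not use the servlet API or ClickStream's own classes: LifecycleSketch, Stream, and StreamLogger are hypothetical stand-ins for the listener and filter roles, the per-session ClickStream object, and the pluggable logger, and a map keyed by session ID stands in for storing the object in the HTTP session.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the listener/filter/logger interaction.
class LifecycleSketch {

    // Stand-in for the per-session clickstream object.
    static final class Stream {
        final String sessionId;
        final List<String> requests = new ArrayList<>();

        Stream(String sessionId) {
            this.sessionId = sessionId;
        }
    }

    // Stand-in for a pluggable logger.
    interface StreamLogger {
        void log(Stream stream);
    }

    private final Map<String, Stream> sessions = new HashMap<>();
    private final StreamLogger logger;

    LifecycleSketch(StreamLogger logger) {
        this.logger = logger;
    }

    // Listener role: a new session gets a fresh, empty stream.
    void sessionCreated(String sessionId) {
        sessions.put(sessionId, new Stream(sessionId));
    }

    // Filter role: each matching request is appended to the session's stream.
    void requestReceived(String sessionId, String uri) {
        Stream stream = sessions.get(sessionId);
        if (stream != null) {
            stream.requests.add(uri);
        }
    }

    // Listener role: when the session ends, the stream is handed to the logger once.
    void sessionDestroyed(String sessionId) {
        Stream stream = sessions.remove(sessionId);
        if (stream != null) {
            logger.log(stream);
        }
    }

    public static void main(String[] args) {
        LifecycleSketch app = new LifecycleSketch(
            s -> System.out.println(s.sessionId + " -> " + s.requests));
        app.sessionCreated("abc");
        app.requestReceived("abc", "/index.jsp");
        app.requestReceived("abc", "/cart.jsp");
        app.sessionDestroyed("abc"); // prints "abc -> [/index.jsp, /cart.jsp]"
    }
}
```

Note that nothing reaches the logger until the session ends, which is why the data of active sessions is only visible through the JSP pages.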
Of course, you don't need to wait until the session expires to see the clickstream information gathered during the application's uptime. You can list your users' active clickstreams and browse their page hits with the provided clickstreams.jsp and viewstream.jsp pages.
Where to Go from Here
In this article we have covered the embedding of ClickStream into your web application, and we've seen how to exploit this information once stored in a database. Be aware that even slight user activity can generate a massive amount of tracking information, so it's highly recommended to do some pruning of this information every two or three days, depending on your users' activity. You'll probably encounter more uses for this information: finding unused pages and bottlenecks, spike predictions, etc.