Instant User Tracking with ClickStream Blog


    Introducing ClickStream

    OpenSymphony's ClickStream is a user tracking component for Java web applications. It lets you inspect and analyze the traffic paths, that is, the sequences of pages that users generate as they browse your site. Such a traffic path is called a clickstream: the logical grouping of an HTTP session identifier and the requests associated with it, until the session ends. The good news is that you can easily add this feature to your application by embedding OpenSymphony's ClickStream and take advantage of this site usage information.

    We'll look first at how ClickStream works and what information it collects. Then we'll proceed to configure your application to use ClickStream. Finally we'll log the ClickStream-generated information to a database and exploit it with standard database queries.

    Understanding the ClickStream Lifecycle

    ClickStream starts tracking the user's activity as soon as the web container creates an HTTP session. As you might know, the J2EE specification defines a listener model. Session listeners are one of the listener types specified by Servlet 2.3, and they are notified each time an HTTP session is created or destroyed. ClickStream's session listener is called ClickstreamListener; it and ClickstreamFilter are the fundamental components of ClickStream.

    ClickStream's main element is a servlet filter called ClickstreamFilter, which intercepts all the requests to a defined web resource (a single page) or resource set (a set of pages) designated by a page pattern. Both components are configured inside your application's web.xml file, but don't worry about this yet; we'll look into the configuration of this filter and session listener in a later section.

    For now, we'll take a look at the information gathered from each clickstream.

    The Logged ClickStream Data

    ClickStream logs specific information from each request and accumulates it in the corresponding ClickStream object. This object is the one sent to the ClickStreamLogger for logging when the session ends. ClickStream uses the following information as the ClickStream header data:

    • The remote host making the request, which it gets from request.getRemoteHost(). Remember that if your application runs behind a reverse proxy (a common scenario for firewalled applications), only the proxy's IP address is logged.
    • The stream start time, which is the timestamp when the session listener creates the clickstream, as soon as the application server creates the new session.
    • The last request time, the timestamp of the last request associated with that session.
    • The HTTP referrer header, namely the user's previous page, if available.
    • The bot flag: if the user is really a crawler bot, this flag will turn to true. ClickStream detects more than 250 different kinds of bots.
    • The session ID, for association purposes.
    It then retrieves and stores from each request the following items:
    • The request protocol: HTTP 1.0, HTTP 1.1, etc.
    • The request parameters: the parameters following the ? in the URL, separated by &. Parameter data submitted via the HTTP POST method is logged as well.
    • The absolute request URI: the JSP or servlet mapping.
    • The session ID for associating with the header.
    • The remote host port used to connect to the server.
    • The request timestamp.
    • The remote user, if available, from the container's request.getRemoteUser() method.
    This information is what we'll store in the database to exploit and analyze the user's traffic paths. In the next section, we'll see how easy it is to embed ClickStream into your application to start capturing the page hits.
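
    The header and per-request items listed above can be pictured as two small Java types. This is only a sketch to make the shape of the data concrete: the class and field names (TrackedStream, TrackedRequest, and so on) are hypothetical and are not ClickStream's actual classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One tracked request. Hypothetical type, not part of ClickStream. */
final class TrackedRequest {
    final String protocol;       // e.g. "HTTP/1.1"
    final String requestUri;     // the JSP or servlet mapping
    final String parameters;     // query string or POST data; may be null
    final int remotePort;        // remote host port used to connect
    final String remoteUser;     // null when the container reports no user
    final long timestampMillis;  // when the request was received

    TrackedRequest(String protocol, String requestUri, String parameters,
                   int remotePort, String remoteUser, long timestampMillis) {
        this.protocol = protocol;
        this.requestUri = requestUri;
        this.parameters = parameters;
        this.remotePort = remotePort;
        this.remoteUser = remoteUser;
        this.timestampMillis = timestampMillis;
    }
}

/** Header data plus the ordered requests of one session. Hypothetical type. */
final class TrackedStream {
    final String sessionId;
    final String remoteHost;
    final String referrer;       // may be null when no referrer was sent
    final boolean bot;
    final long startMillis;
    private long lastRequestMillis;
    private final List<TrackedRequest> requests = new ArrayList<TrackedRequest>();

    TrackedStream(String sessionId, String remoteHost, String referrer,
                  boolean bot, long startMillis) {
        this.sessionId = sessionId;
        this.remoteHost = remoteHost;
        this.referrer = referrer;
        this.bot = bot;
        this.startMillis = startMillis;
        this.lastRequestMillis = startMillis;
    }

    /** Appends a request and moves the "last request time" forward. */
    void add(TrackedRequest request) {
        requests.add(request);
        lastRequestMillis = request.timestampMillis;
    }

    long getLastRequestMillis() {
        return lastRequestMillis;
    }

    List<TrackedRequest> getRequests() {
        return Collections.unmodifiableList(requests);
    }
}
```

    In this sketch the requests live inside their stream, so they carry no session ID of their own; once the data is split across two database tables, the session ID is what links each request row back to its header.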

    Embedding ClickStream into Your Application

    The first step is to download the ClickStream distribution from the OpenSymphony site. Then, to embed ClickStream into your application, start by adding clickstream.jar and commons-logging.jar (if your project doesn't already use this component) to the WEB-INF/lib directory of your WAR application.

    Then edit your application's web.xml descriptor with any text editor. You must add ClickStream's session listener and filter. The filter is mapped to each resource wildcard you want to track with ClickStream. For example, if you want to track every page hit, you must define the /* wildcard. On the other hand, if you want to record only the hits directed to the /MyServlet path, use a /MyServlet/* wildcard. See the servlet specification for more wildcard examples.

    The following web.xml fragment instructs the filter to record only the hits directed to JSP and HTML pages.

    <filter>
        <filter-name>clickstream</filter-name>
        <filter-class>com.opensymphony.clickstream.ClickstreamFilter</filter-class>
    </filter>
    <filter-mapping>
        <filter-name>clickstream</filter-name>
        <url-pattern>*.jsp</url-pattern>
    </filter-mapping>
    <filter-mapping>
        <filter-name>clickstream</filter-name>
        <url-pattern>*.html</url-pattern>
    </filter-mapping>
    <listener>
        <listener-class>com.opensymphony.clickstream.ClickstreamListener</listener-class>
    </listener>

    The <filter-mapping> element associates the ClickStream filter with both extensions.
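
    The wildcard rules mentioned above (/* for everything, /MyServlet/* for one path, *.jsp for an extension) can be approximated in a few lines of Java. This is a simplified sketch of servlet URL pattern matching, not the container's implementation; the class UrlPatternSketch and its matches method are hypothetical, and the path is assumed to be context-relative with no query string.

```java
/** Simplified url-pattern matching. Hypothetical helper, not a servlet API. */
final class UrlPatternSketch {
    private UrlPatternSketch() {}

    static boolean matches(String pattern, String path) {
        if (pattern.endsWith("/*")) {
            // Path mapping: "/MyServlet/*" matches "/MyServlet" and anything below it.
            // For "/*" the prefix is empty, so every path starting with "/" matches.
            String prefix = pattern.substring(0, pattern.length() - 2);
            return path.equals(prefix) || path.startsWith(prefix + "/");
        }
        if (pattern.startsWith("*.")) {
            // Extension mapping: "*.jsp" matches when the last path segment ends in ".jsp".
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            return lastSegment.endsWith(pattern.substring(1));
        }
        // Anything else is treated as an exact match.
        return pattern.equals(path);
    }
}
```

    With the two mappings shown in the fragment, a request for /shop/cart.jsp would reach the filter, while a request for /images/logo.png would not.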

    After adding the listener and filter, you can put the included ClickStream JSPs into your web application's root directory. Both clickstreams.jsp and viewstream.jsp are needed to browse the ClickStream information online. Figure 1 illustrates clickstreams.jsp, which shows all the active clickstreams:

    ClickStream JSP Screenshot
    Figure 1. ClickStream clickstreams.jsp page

    The clickstreams.jsp file lists all the active clickstreams of the application ordered by the remote host IP. When you click one of the host IPs, ClickStream's viewstream.jsp appears, as shown in Figure 2:

    ClickStream detail Screenshot
    Figure 2. ClickStream viewstream.jsp page

    These two pages allow you to browse the not-yet-stored-in-the-database clickstreams, and are very useful for quick browsing. The next section shows how to set up ClickStream to log the user traffic data into a database for further processing and analysis.

    Logging the User Tracking Information to a Database

    By default, ClickStream uses the Commons Logging component to write the tracking information to the console or to log files. In this example, we'll use a custom ClickStreamLogger to save the information to a database. First we'll configure ClickStream to use our logger and then we'll create the corresponding database schema.

    ClickStream lets you change the logging strategy by creating a new logging class that implements the ClickStreamLogger interface and configuring its use in the clickstream.xml file located in the WEB-INF/classes folder. You can find the DatabaseClickStreamLogger custom database logger and the sample clickstream.xml configuration file in the included source code. Our clickstream.xml will look like this:

    <clickstream>
        <logger class=""/>
        <bot-host name=""/>
        ... thousands of bots' names skipped for brevity.
    </clickstream>

    The logger itself is configured through a properties file. This file is also included in the sample code, and looks like this:

    jdbc.driver.class=org.postgresql.Driver
    jdbc.url=jdbc:postgresql://localhost/clickstream-db
    jdbc.user=jdoe
    jdbc.pass=secret
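
    A database logger has to read these four settings before it can open a connection. The sketch below shows one way to do that with java.util.Properties, failing early when a key is missing. The JdbcSettings class is hypothetical and is not taken from the sample code; only the four property keys come from the listing above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Holds the four JDBC settings. Hypothetical helper, not part of ClickStream. */
final class JdbcSettings {
    final String driverClass;
    final String url;
    final String user;
    final String password;

    private JdbcSettings(String driverClass, String url, String user, String password) {
        this.driverClass = driverClass;
        this.url = url;
        this.user = user;
        this.password = password;
    }

    /** Reads the settings from properties-file text supplied by the caller. */
    static JdbcSettings load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new JdbcSettings(
                require(props, "jdbc.driver.class"),
                require(props, "jdbc.url"),
                require(props, "jdbc.user"),
                require(props, "jdbc.pass"));
    }

    /** Returns the trimmed value, or fails with a message naming the missing key. */
    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value.trim();
    }
}
```

    Failing at load time with the name of the missing key tends to be easier to diagnose than a connection error that only shows up when the first session ends.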

    Just replace the URL, JDBC driver class, user, and password with the appropriate values for your database. Our configuration is ready, so let's create ClickStream's database schema. The database model is made up of only two tables: one with the clickstream header data, and the other with the detailed request information of each clickstream. Figure 3 shows the structure.

    ClickStream DB Schema
    Figure 3. ClickStream DB schema

    Execute the included SQL script, clickstream.sql, to create the tables in your favorite database.

    We are all set up; now when your application starts, it'll begin to log the clickstream information to your database. The following section shows how to exploit the tracking information using some very useful metrics.

    Exploiting the User Tracking Information

    The fact that we've stored the user tracking information inside a database server means that we can classify, measure, and manipulate it at will. Some metrics you'll find very useful are:

    • The distinct user count over a period of time
    • The most-accessed pages
    • The length of the average user browsing session, in minutes
    You'll find some of these SQL queries, written in PostgreSQL syntax, in the sample code. Most of the time, you'll also want to browse a few sample sessions to see what the user activity looks like; you can do this by selecting a single session identifier (select * from clickstream_requests where sessionid = 'nlggs2ccbeb2').
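
    To make the three metrics above concrete, here is what each one computes, written as an in-memory Java sketch rather than SQL. It is not the sample code's queries: the ClickMetrics class and its Session type are hypothetical, and "distinct users" is approximated by distinct remote hosts, which undercounts users who share a proxy.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative metric calculations. Hypothetical helper, not part of ClickStream. */
final class ClickMetrics {
    private ClickMetrics() {}

    /** Minimal stand-in for one logged clickstream. */
    static final class Session {
        final String remoteHost;
        final long startMillis;
        final long lastRequestMillis;
        final List<String> requestUris;

        Session(String remoteHost, long startMillis, long lastRequestMillis,
                List<String> requestUris) {
            this.remoteHost = remoteHost;
            this.startMillis = startMillis;
            this.lastRequestMillis = lastRequestMillis;
            this.requestUris = requestUris;
        }
    }

    /** Distinct users, approximated here by distinct remote hosts. */
    static int distinctHosts(List<Session> sessions) {
        Set<String> hosts = new HashSet<String>();
        for (Session s : sessions) {
            hosts.add(s.remoteHost);
        }
        return hosts.size();
    }

    /** The URI with the most hits, or null when there are no requests. */
    static String mostAccessedPage(List<Session> sessions) {
        Map<String, Integer> hits = new LinkedHashMap<String, Integer>();
        for (Session s : sessions) {
            for (String uri : s.requestUris) {
                Integer count = hits.get(uri);
                hits.put(uri, count == null ? 1 : count + 1);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Average time between stream start and last request, in minutes. */
    static double averageSessionMinutes(List<Session> sessions) {
        if (sessions.isEmpty()) {
            return 0.0;
        }
        long totalMillis = 0;
        for (Session s : sessions) {
            totalMillis += s.lastRequestMillis - s.startMillis;
        }
        return totalMillis / 60000.0 / sessions.size();
    }
}
```

    To restrict any of these to a period of time, filter the session list by start time before calling the method. In the database, each calculation would become an aggregate query over the two tables.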

    Under the Hood

    You can visualize the interactions by looking at the sequence diagram in Figure 4, which depicts the complete lifecycle of ClickStream inside your web application.

    ClickStream lifecycle sequence diagram
    Figure 4. ClickStream lifecycle sequence diagram

    As you can see from the UML sequence diagram, the ClickStream activity starts when an HTTP request arrives. If no HTTP session is associated with the request, the web container creates one and calls the session listeners; in this case ClickstreamListener is notified. ClickstreamListener generates a new ClickStream object to collect the user's page track and stores it in the newly created session.

    Then, if the request matches one of the resources defined by the ClickstreamFilter wildcard inside the web.xml file, the web container calls the ClickstreamFilter. This filter adds the request information to the session's ClickStream object. This cycle continues until the session is explicitly invalidated or the session expires due to user inactivity. Each page or resource the user requests is logged into the same ClickStream object.

    When the ClickstreamListener is notified about the end of a user session, it logs the ClickStream by calling ClickStreamLogger. ClickStream configures this component with a Jakarta Commons Logging Logger by default, but this can be overridden with a custom ClickStreamLogger, as we saw earlier.
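
    The three steps just described (session created, request filtered, session ended) can be modeled without a servlet container. The sketch below uses plain Java collections to show the control flow only; LifecycleSketch and StreamSink are hypothetical names that stand in for the listener/filter pair and the logger, and the code is single-threaded, unlike a real container.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Receives a finished stream; stands in for the pluggable logger. Hypothetical. */
interface StreamSink {
    void log(String sessionId, List<String> requestUris);
}

/** Container-free model of the lifecycle. Hypothetical, not ClickStream code. */
final class LifecycleSketch {
    private final Map<String, List<String>> active = new HashMap<String, List<String>>();
    private final StreamSink sink;

    LifecycleSketch(StreamSink sink) {
        this.sink = sink;
    }

    /** Listener side: a new session gets a fresh, empty stream. */
    void sessionCreated(String sessionId) {
        active.put(sessionId, new ArrayList<String>());
    }

    /** Filter side: a matching request is appended to the session's stream. */
    void requestFiltered(String sessionId, String uri) {
        List<String> stream = active.get(sessionId);
        if (stream != null) {
            stream.add(uri);
        }
    }

    /** Listener side: when the session ends, the whole stream goes to the sink. */
    void sessionDestroyed(String sessionId) {
        List<String> stream = active.remove(sessionId);
        if (stream != null) {
            sink.log(sessionId, stream);
        }
    }

    /** Number of streams that have not been logged yet. */
    int activeCount() {
        return active.size();
    }
}
```

    Because the sink is an interface, swapping a console logger for a database logger does not touch the tracking code, which mirrors the role the configurable logger plays here.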

    Of course, you don't need to wait until the session expires to see the clickstream information gathered during the application's uptime. You can browse and list your users' clickstreams and page hits with the provided clickstreams.jsp and viewstream.jsp pages.

    Where to Go from Here

    In this article we have covered how to embed ClickStream into your web application, and we've seen how to exploit the tracking information once it is stored in a database. Be aware that even slight user activity can generate a massive amount of tracking information, so it's highly recommended to prune this information every two or three days, depending on your users' activity. You'll probably find more uses for this information: finding unused pages and bottlenecks, predicting traffic spikes, etc.