Building a Better Brain, Part 1: The Protocol Blog

Version 2



    Define the Purpose
    Model the Data Clearly
    Download the Data
    Searching Only Updates
    Updating Entries
    Render the Data
    Documentation for Implementors

    I have a problem. Two problems, actually. Maybe you can help me.

    Problem One: I need to remember lots of small bits of arcane technical knowledge. I keep a text file and wish it was easier to search, copy, store, transport, etc.

    Problem Two: I search the Web a lot and often get the wrong results because what I'm looking for is very specific. Sometimes what I want isn't on the Web. It's in newsgroups, online documentation, forums, or just not on the net anywhere. What I'd really like to do is search through someone else's brain, or at least the part they'd let me access. I'd even pay a small fee to do it.

    Solution: solve both problems at once. Create a remotely searchable and cacheable database. To be nice to use, and widely deployed, it must be efficient and super-easy to implement: the searchable equivalent of RSS.

    This article is the first in a two-part series. We will explore designing a simple but robust web service protocol called BrainFeed, considering alternatives and balancing pros and cons. Throughout the process, we will stay focused on making the protocol simple and reuse existing (preferably open) technologies wherever possible. The second article will build on top of the protocol to create an advanced thick client for searching right from the desktop.

    I've used lots of bad web service protocols. SOAP and even XML-RPC are often overkill. They are complicated to implement and obscure the real problem you are trying to solve. The most successful web service I've seen (apart from web pages themselves) is RSS. Why? Because it's simple. It does only one thing, and it does it well. It was built on top of a stack of other open and widely deployed technologies. For intranets and extranets, the other web service specifications are quite useful, because one end (you) has some sort of a formal relationship with the other end (your customers, partners, or other departments). They can afford the costs of being highly structured. For widely deployed web services that go over the Internet, where you have either an informal or at least low-overhead relationship with the other end, we need something simpler: an RSS-level web service. We will use RSS as the inspiration for BrainFeed.

    Following in RSS' footsteps, our web service will have to do the following:

    • Define its purpose.
    • Model the data clearly.
    • Download the data.
    • Search the data.
    • Update the data.
    • Finally, render the data.

    all while:

    • Reusing existing technology wherever possible.
    • Being adaptable to many platforms and languages.
    • Being simple and adequately documented for others to implement.

    If we had tried this 10 years ago, it probably wouldn't have worked. We would have had to invent too much from scratch. Today, however, we are blessed to have open specs with free implementations all over the place. It doesn't matter what language you program in, you can probably find an XML parser and an HTML renderer of some sort. XHTML is a clean, semantic document language and CSS2 can support almost any style we want. And the best part is someone else has already written almost all of the pieces. In the Java universe, we are especially blessed to have XML parsers and HTTP access built right into the platform. We just have to put it together! Welcome to the Lego school of software design.

    Define the Purpose

    First things first: what does our web service do? A technical definition would be: a network-accessible service for searching and downloading a fairly flat data repository of small documents. So what does that mean? Well, it's on the network, so it's remote. That means we have to deal with network reliability issues, encryption, and authentication. The next part is that we are searching through the database, so we need to specify search semantics, keywords, and a query language. We are downloading the documents, so we need to specify encoding and formatting. Finally, we are rendering the documents to the screen, so we need hints on how they should be presented to the end user.

    Everything we are talking about storing in this database is small snippets of information: the syntax of an SSH command, some sample code to create a window, or Javadocs of the Robot class. We'd also like to update and add to the database, so let's drop that in there, too. So our final definition adds up to this: "A protocol for searching, downloading, and updating small documents from a network-accessible service."

    Model the Data Clearly

    Now that we have our definition of what the service should do, how should it be structured? Since we are talking about small documents, we should probably use the lingua franca of the networked document world: XML. This means we can bring all of our XML expertise and tools to bear on the project. In fact we can structure the entire database as a single logical XML document. Note the word logical here. It doesn't actually have to be stored as an XML document on either end. In fact, for any sufficiently large dataset, it probably won't be. But by specifying that it's logically an XML document we get a whole lot of useful semantics thrown in for free. We can use IDs and know that they will be unique. We get an infinitely nestable structure. We get well-defined white-space handling, structure validation, character encoding, and all of the other goodies that come with XML. All for free! I'm sure glad we live in the 21st century, unlike those encodingless savages back in the late 20th.

    So what does it look like? I'm a visual person, so I like examples. And we have no page limits on the Web, so let's start off with a simple sample.

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE brain SYSTEM "../brainfeed.dtd">
    <brain searchable="true" language="blah">
      <entry id="1">
        <content>
          <p>This is an example</p>
          <pre><code>This is the code for the example; </code></pre>
          <img src="images/output.png" alt="output"/>
          <p>and that&apos;s how you do it!</p>
        </content>
      </entry>
      <entry id="b">
        <content>
          <p>Here&apos;s how you do this:</p>
          <code>cmdline --args</code>
          <p>and that&apos;s how you do it!</p>
        </content>
      </entry>
    </brain>

    There's our logical XML with the usual headers and doctype at the top. A single <brain> element encloses multiple entry elements, each of which has an ID to make it unique. The entry has a <content> element, which contains valid XHTML strict markup. XHTML strict is important because it is extremely well defined and is becoming more and more supported by the HTML renderers of the world (IE, Mozilla, Safari, etc.).
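    To see how little code a client needs to read this structure, here is a minimal sketch using the DOM parser that ships with Java. The class and method names (BrainFeedReader, entryIds) are made up for illustration, and the sample string leaves out the DOCTYPE so the parser does not go looking for the DTD.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of any published BrainFeed code.
final class BrainFeedReader {

    /** Returns the id attribute of every entry element, in document order. */
    static List<String> entryIds(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList entries = doc.getElementsByTagName("entry");
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < entries.getLength(); i++) {
            ids.add(((Element) entries.item(i)).getAttribute("id"));
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<brain searchable=\"true\">"
                + "<entry id=\"1\"><content><p>This is an example</p></content></entry>"
                + "<entry id=\"b\"><content><code>cmdline --args</code></content></entry>"
                + "</brain>";
        System.out.println(entryIds(xml));
    }
}
```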

    Download the Data

    Following our rule to reuse existing technology whenever possible, we will keep things simple with an HTTP GET. Now we can reuse all of the existing HTTP code out there, and we can use web servers and CGI scripts for our server instead of designing custom request code. Java implements HTTP right in the java.net package (new URL("...").openStream()). Using HTTP also means we get to ride on port 80 and go through firewalls.

    Now, why use a GET instead of a POST? Just because it's simpler. A GET can describe everything we are likely to want. Since this is mainly one-way communication (send a few bits for a query and get a lot of bits back), we don't need the complexity of POSTing. A GET names a server and a location on that server, and it allows the document on the server to be a literal file on disk instead of a program. This gives us our next rule: make it simple to implement. It also fits the inherent semantics of GET, which is to get a document. POST and PUT are for editing.


    The feed could be an actual file on disk served by a stock Apache web server, and it would work just fine.
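    A download client can be sketched in a few lines with java.net.URL. The class name BrainFeedDownloader is invented for this example, and the code assumes the feed is served as UTF-8. The feed address is taken from the command line, since no particular URL is implied here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: fetches a whole feed with a plain GET.
final class BrainFeedDownloader {

    /** Reads everything the URL returns and decodes it as UTF-8 (an assumption). */
    static String download(URL feed) throws IOException {
        try (InputStream in = feed.openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the feed address as the first argument.
        System.out.println(download(new URL(args[0])));
    }
}
```

    Because openStream() works for file: URLs too, the same code reads a feed straight off the local disk, which is handy for testing a Level 1 style flat file.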


    Now that we can get a document, how can we search through it? This is more complicated, since our choice of a query language somewhat dictates the structure of our database and how users interact with our service. Timothy Bray wrote an excellent series of articles about searching. He recommends a simple GET-based API with keywords ANDed together. AND searching is intuitive for users and pretty easy to implement. It also has the side benefit of being very easy to describe in our protocol.

    I've always envisioned an advanced version of this working as a filter on every keystroke, much like the search in iTunes, so a keyword-based API will probably work well. I personally type ANDed keywords into Google to find things, so something similar works here.

    We can do full-text searching, but the search engine probably won't know which parts of a document are more important than others. A great way to deal with this is to simply mark which parts are important. How does the search engine know what each document is about? Historically, documents have had titles to specify what they are about, so we'll add something like that here, along with explicit keywords:

    <entry id="d">
      <keyword>Java</keyword>
      <keyword>JNI</keyword>
      <title>JNI: What is JNI</title>
      <content>
        <p>JNI stands for Java Native Interface. It&apos;s a way of......</p>
      </content>
    </entry>

    To keep things simple, we will say that the search should be case-insensitive and white space and punctuation should be ignored. Such detailed searching is rarely useful, and usually causes valid results to be skipped. We will also specify searching in order of keywords, titles, and body text, but actual search is left up to the server to implement in whatever way it feels will return the best results. A Java server will probably use Lucene, which has a wide array of algorithmic tweaks available.
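    The matching rule itself (AND the terms together, ignore case, white space, and punctuation) fits in a few lines. This is only a sketch of one possible reading: it assumes whole-word matching, which the protocol deliberately leaves to the server, and the class name AndSearch is made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical matcher; a real server might use Lucene or a database instead.
final class AndSearch {

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    /** True only when every query term occurs as a word of the text (AND search). */
    static boolean matches(String text, List<String> queryTerms) {
        Set<String> haystack = words(text);
        for (String term : queryTerms) {
            if (!haystack.containsAll(words(term))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String body = "JNI stands for Java Native Interface.";
        System.out.println(matches(body, Arrays.asList("java", "INTERFACE")));
        System.out.println(matches(body, Arrays.asList("java", "sql")));
    }
}
```

    A server that wants the keyword, title, body ordering could run the same check against each field in turn and rank the hits accordingly.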

    To add searching to the spec, since downloading uses an HTTP GET request, we'll just extend that with some query parameters:

    This will search for entries matching both "java" and "interface." HTTP allows us to specify a parameter name more than once (which is how groups of checkboxes are often submitted), so we take advantage of this by sending multiple query values.

    If the end user program wants to specify just a particular field, then we can specify it explicitly with keyword, title, and content:
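    As a rough illustration of how a client might assemble such requests, the sketch below repeats one parameter name for each value. The base address http://example.com/brain.xml and the class name QueryUrl are placeholders for this example only; the parameter names query and title are the ones described in the text.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for building search URLs.
final class QueryUrl {

    /** Appends paramName=value once per value, URL-encoding each value. */
    static String build(String baseUrl, String paramName, List<String> values) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (String value : values) {
            url.append(separator).append(paramName).append('=').append(encode(value));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://example.com/brain.xml"; // placeholder address
        System.out.println(build(base, "query", Arrays.asList("java", "interface")));
        System.out.println(build(base, "title", Arrays.asList("what is JNI")));
    }
}
```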

    Searching Only Updates

    Now suppose we have a client that would like to cache the dataset and just receive updates when something changes. After all, if this is a source that you use all of the time, you'd like to save the dataset for faster access, plus you don't want to hog bandwidth all of the time.

    First we need to be able to specify what to download. Since a caching reader will know when it last checked for entries, it can just ask for whatever is new since a particular timestamp. We can specify this with a modified parameter, and then add a modified element to each entry:

    <entry> <modified>timestamp</modified> </entry>

    Simple enough. This will then return all entries modified on or after the modified timestamp. We just have to specify the timestamp format.

    This one is a bit tricky. We could go grab an existing format like RFC 822: "Sat, 07 Sep 2002 00:00:01 GMT," or we could go for a completely numeric Unix timestamp like this: "1027568712." Both of these are bad, though. The first one uses abbreviations for the month and day, which won't internationalize well and can vary within a nation, requiring us to add a language marker just for the date. Plus, the day of week isn't needed just to tell when something was modified. On the other hand, it is human-readable, which is a plus. The second format, a Unix timestamp, is not human-readable at all, and it's Unix-specific. Not all platforms have a way of calculating the time based on milliseconds from a particular date, and the start date varies. Not to mention the fact that there is no timezone marker, so we could be off by as much as 24 hours. To satisfy our needs, we must clearly specify an absolute time in a format that is at least somewhat human-readable, doesn't have language issues, can be parsed fairly easily, and preferably, doesn't need HTTP escaping. I propose the following:

    format: MM/dd/yyyy-HH:mm:ss-z
    ex: 08/31/2004-14:35:00-GMT

    This encoding uniquely specifies time in a format somewhat familiar to humans without using any language-specific terms or punctuation that would need to be escaped. The example date is my next birthday at 2:35 in the afternoon, GMT. We can put this in a query like so:
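    In Java, this format maps onto a SimpleDateFormat pattern. The sketch below assumes the month-first order and 24-hour clock that the example value 08/31/2004-14:35:00-GMT uses, and it always writes timestamps in GMT; the class name BrainFeedTimestamp is made up.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for reading and writing BrainFeed timestamps.
final class BrainFeedTimestamp {

    // SimpleDateFormat is not thread-safe, so build a fresh one per call.
    private static SimpleDateFormat newFormat() {
        SimpleDateFormat format = new SimpleDateFormat("MM/dd/yyyy-HH:mm:ss-z", Locale.US);
        format.setTimeZone(TimeZone.getTimeZone("GMT"));
        return format;
    }

    /** Parses a timestamp such as 08/31/2004-14:35:00-GMT. */
    static Date parse(String text) throws ParseException {
        return newFormat().parse(text);
    }

    /** Formats an instant in GMT using the same pattern. */
    static String format(Date date) {
        return newFormat().format(date);
    }

    public static void main(String[] args) throws ParseException {
        Date birthday = parse("08/31/2004-14:35:00-GMT");
        System.out.println(birthday.getTime());
        System.out.println(format(birthday));
    }
}
```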

    Now we can search for all entries modified after any particular date. For completeness, we will add a modified-before parameter as well, allowing the client to specify any range of dates, with the range being open if only one of them is specified.
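    The open-ended range logic is simple enough to sketch directly. Here a null bound means "not specified." The lower bound is inclusive, matching "on or after"; treating the upper bound as exclusive is my assumption, since the text does not say. The class name ModifiedRange is invented.

```java
import java.time.Instant;

// Hypothetical range check a server could apply to each entry's modified time.
final class ModifiedRange {
    private final Instant since;   // null means no lower bound
    private final Instant before;  // null means no upper bound

    ModifiedRange(Instant since, Instant before) {
        this.since = since;
        this.before = before;
    }

    /** Inclusive lower bound, exclusive upper bound (the latter is an assumption). */
    boolean contains(Instant modified) {
        boolean afterStart = since == null || !modified.isBefore(since);
        boolean beforeEnd = before == null || modified.isBefore(before);
        return afterStart && beforeEnd;
    }

    public static void main(String[] args) {
        ModifiedRange sinceNewYear = new ModifiedRange(Instant.parse("2004-01-01T00:00:00Z"), null);
        System.out.println(sinceNewYear.contains(Instant.parse("2004-08-31T14:35:00Z")));
    }
}
```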

    A nice side benefit of searching for updates is that we can cache the data on the client side and do our own searching in a way that the server may never have thought of, or that would be impossible to do over a network connection, like the aforementioned incremental searching.

    Updating Entries

    Now that we can search for anything we want, let's make our system a little more two-way. How do we update entries? Well, we already have a means of representing a document, so let's just reuse it. Instead of downloading a logical XML document, we will upload it. By doing a POST to the same URL as the one we downloaded, sending an XML document as the complete request body (rather than as a parameter), we can push our changes up to the server with a minimum of fuss. We also need to specify which entry we would like to update. Fortunately, we have already specified that each entry must have an ID attribute that is unique across the entire XML document. So we declare that an existing entry with that ID should be replaced with the new one. This also means we can upload multiple entry updates in the same document. It's true that this means we are uploading the entire entry, even if there was only a spelling change, but in general this will not be a problem, because we are talking about small entries. Only the dataset itself is large. The simplicity of this scheme outweighs the benefits we would get from diffing the individual entries. It also avoids the corruption issues of diff and merge synchronization.

    To add new entries, we upload the new entry without an ID. The server will add it to the XML, assign it a new ID (based on whatever scheme the server deems appropriate), and return the entries in a new document with the IDs added.
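    On the server side, the replace-or-create rule might look like the sketch below. It keeps entries in memory and hands out IDs of the form e1, e2, and so on; both choices are assumptions for illustration, since the ID scheme is left to the server. EntryStore is a made-up name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory store showing the update semantics only.
final class EntryStore {
    private final Map<String, String> contentById = new LinkedHashMap<>();
    private int nextId = 1;

    /**
     * Stores an entry and returns its ID. An entry with a known ID replaces
     * the old one; an entry without an ID (null) gets a fresh one.
     */
    String put(String id, String content) {
        String key = id;
        if (key == null) {
            do {
                key = "e" + nextId++;   // invented ID scheme
            } while (contentById.containsKey(key));
        }
        contentById.put(key, content);
        return key;
    }

    String get(String id) {
        return contentById.get(id);
    }

    int size() {
        return contentById.size();
    }

    public static void main(String[] args) {
        EntryStore store = new EntryStore();
        store.put("1", "<p>first draft</p>");
        store.put("1", "<p>fixed a typo</p>");       // replaces entry 1
        String newId = store.put(null, "<p>new</p>"); // server assigns the ID
        System.out.println(newId + " " + store.size());
    }
}
```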

    Uploading data will often be a restricted action. Sometimes, even downloading will need security. Some people might be storing sensitive data, such as an internal company-only knowledge base. Once again, we will solve this issue with existing standards. HTTPS using SSL is the standard for the web world, and most XML-savvy platforms support it. J2SE 1.4 now supports it without needing any external libraries.

    For authentication, we will do the same, using HTTP Auth rather than any proprietary system. This does introduce one complication, though. We are downloading and uploading via the same URL, and HTTP Auth works by protecting a particular URL. In an ideal world, we would tell the server to use HTTP Auth only for POSTing and not for GETing (or to implement whatever other restrictions we might want). However, most web servers attach the authentication to a particular URL or directory, instead of the pair of a URL and HTTP request type. In the interest of simplicity, we can say that authenticated uploading will go to a different URL than downloading. This makes the implementation simple, but introduces a usability problem. Now we have two URLs to remember, not one. If you've ever tried to explain the difference between IMAP and SMTP servers to your mother, you know what I'm talking about. It would be better if we could tell someone, "The web address to your brainfile is this," and then let the software autodetect how to post. Autodetection might be useful for other things, too.
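    For the authenticated upload itself, a client has to send credentials with its POST. The sketch below assumes the Basic scheme of HTTP Auth (the text does not pick a scheme) and only builds the header value; the class name BasicAuth is made up. Basic credentials are merely encoded, not encrypted, so this only makes sense over HTTPS.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; assumes HTTP Basic authentication.
final class BasicAuth {

    /** Builds the value of the Authorization header for the given credentials. */
    static String headerValue(String user, String password) {
        String credentials = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // With HttpURLConnection this would be passed to
        // connection.setRequestProperty("Authorization", headerValue(user, password)).
        System.out.println(headerValue("Aladdin", "open sesame"));
    }
}
```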

    To implement autodetection of configuration we need to add some metadata to the XML file. By adding a meta tag, we can store whatever we need. Clients are told to only use what they understand and ignore the rest. Here I've added some simple bits of metadata along with the URL to use for posting.

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE brain SYSTEM "../brainfeed.dtd">
    <brain searchable="true" language="blah">
      <meta>
        <uri></uri>
        <author>Joshua Marinacci</author>
        <author-email></author-email>
        <description>This is Joshua&apos;s Brain</description>
        <post-url> </post-url>
      </meta>

    As an extra feature to help with autodetection when first adding a BrainFeed to a client, we will say that doing a GET with the parameter meta=only will return just the metadata. meta=true and meta=false will do a normal request and include or not include the metadata with it. True is the default.
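    A server could interpret the meta parameter along these lines. Falling back to the default for a missing or unrecognized value is my assumption, in the spirit of "use what you understand and ignore the rest." MetaMode is a made-up name.

```java
import java.util.Locale;

// Hypothetical server-side interpretation of the meta request parameter.
enum MetaMode {
    ONLY, TRUE, FALSE;

    /** Maps the raw parameter value to a mode; missing or unknown values mean TRUE. */
    static MetaMode fromParameter(String value) {
        if (value == null) {
            return TRUE; // the default when the parameter is absent
        }
        switch (value.trim().toLowerCase(Locale.ROOT)) {
            case "only":
                return ONLY;
            case "false":
                return FALSE;
            default:
                return TRUE; // assumption: unknown values fall back to the default
        }
    }

    boolean includesMeta() {
        return this != FALSE;
    }

    boolean includesEntries() {
        return this != ONLY;
    }

    public static void main(String[] args) {
        MetaMode mode = fromParameter("only");
        System.out.println(mode + " " + mode.includesMeta() + " " + mode.includesEntries());
    }
}
```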

    Render the Data

    To render the data on screen, we need to choose an appropriate font for the character set. This means we need to know the encoding and language of the BrainFeed. All XML files can set the encoding at the top, and most XML parsers will take care of converting the files' encoding into the native platform's preferred encoding. For Java, the parsers are required to take care of this automatically and convert all text into Unicode. We can reasonably expect Win32, Cocoa, GTK, and other platforms to provide similar capabilities.

    For example:

    <?xml version="1.0" encoding="UTF-8"?>

    The language can be specified with a lang attribute on the brain element. We will also add an optional lang attribute to each entry in case it's a multilingual feed.

    <?xml version="1.0" encoding="UTF-8"?>
    <brain lang="en-us">
      <entry id="1" lang="en-us">
      ...

    The actual display of an entry is left up to the client. XHTML specifies everything semantically with no style. To allow the author of the BrainFeed to suggest some style, we can attach CSS to every entry by including a style attribute and a <style> element. If this is a web-based client, the CSS will be passed to the browser. In the case of a thick client, such as a Swing app using an HTML pane like JEditorPane, the CSS would be parsed and pulled into the display.

    <entry id="c" style="mystyle.css">
      <style type="text/css">
        .cool { color: blue; }
      </style>
      <content>
        <p class="cool">my cool entry</p>
      </content>
    </entry>
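    A client picking a font needs the language that actually applies to each entry. One plausible rule, sketched below, is that an entry's own lang attribute wins and the brain element's lang is the fallback. That fallback rule is my reading rather than something spelled out here, and EntryLanguage is a made-up name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for resolving the language of each entry.
final class EntryLanguage {

    /** The entry's own lang attribute, or the root element's lang if the entry has none. */
    static String effectiveLang(Element entry) {
        String own = entry.getAttribute("lang"); // empty string when absent
        if (!own.isEmpty()) {
            return own;
        }
        return entry.getOwnerDocument().getDocumentElement().getAttribute("lang");
    }

    /** Resolves the language of every entry in the feed, in document order. */
    static List<String> languages(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList entries = doc.getElementsByTagName("entry");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < entries.getLength(); i++) {
            result.add(effectiveLang((Element) entries.item(i)));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(languages(
                "<brain lang=\"en-us\"><entry id=\"1\"/><entry id=\"2\" lang=\"fr\"/></brain>"));
    }
}
```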


    Creating a specification that can be implemented on a variety of devices and platforms, each with different needs and resources, can be quite a challenge. Scalability is a hard problem. However, virtually every technology we are using was designed with scalability in mind. Again the beauty of open standards shines through. All we need to do is specify which parts of our spec are optional and how. The underlying tech of HTTP, XHTML, and CSS will take care of most of the rest.

    Riffing off of CSS, we will define our specification in different levels, each one building on top of the previous one. We will group the different features by estimated difficulty of implementation and need for the feature.

    • Level 1: Downloading only. No searching, no posting of updates, and no incremental download of changes. This makes our service not as useful as it might otherwise be, but it means anyone can publish by just dropping a flat file onto their server. No scripts or CGI programs required.

    • Level 2: Searching by the query, keyword, title, body, and timestamp. This doesn't require the infrastructure for authentication and posting, but allows everything else.

    • Level 3: Posting

    • Levels 4 and up: Reserved for future use

    Again, for autodetection, we can add another metatag for the level:

    <brain> <meta> <level>1</level> </meta>
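    A client can use the level to decide which features to offer. The sketch below reads the level element and derives two capability checks from the level list above. Treating a feed with no level element as Level 1 is my assumption, and FeedLevel is a made-up name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical client-side capability detection.
final class FeedLevel {

    /** Reads the first level element; feeds without one are treated as Level 1 (assumption). */
    static int levelOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList levels = doc.getElementsByTagName("level");
        if (levels.getLength() == 0) {
            return 1;
        }
        return Integer.parseInt(levels.item(0).getTextContent().trim());
    }

    /** Searching arrives at Level 2. */
    static boolean canSearch(int level) {
        return level >= 2;
    }

    /** Posting arrives at Level 3. */
    static boolean canPost(int level) {
        return level >= 3;
    }

    public static void main(String[] args) throws Exception {
        int level = levelOf("<brain><meta><level>2</level></meta></brain>");
        System.out.println(level + " " + canSearch(level) + " " + canPost(level));
    }
}
```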

    Documentation for Implementors

    That's it. We have completely described a web service, or at least its protocol. However, if we want our service to be popular, we have one thing left to do: documentation. The documentation of an XML web-based service really needs three parts. First, we need a computer-readable spec. XML conveniently provides this in the form of DTDs. Even though DTDs are intended for XML parsers, they are often the documentation of last resort for client implementers, so it helps to format them nicely with good comments. The DTD declares, unambiguously, what goes where. If you document the DTD well, you will have fewer emails from developers screaming, "What the @&#$ does this tag do?!"

    Here is a portion of the BrainFeed DTD:

    <!-- The root level element. There is only one of these. -->
    <!ELEMENT brain (meta?, entry*)>

    <!-- ======== META =========== -->
    <!-- This contains the meta information about the feed.
         It is optional. If it exists then it should go at
         the top to make parsing easier. -->
    <!ELEMENT meta (uri,owner,description)>

    <!-- These all go inside the meta. They are really just descriptive -->
    <!ELEMENT uri (#PCDATA)>
    <!ELEMENT owner (#PCDATA)>
    <!ELEMENT description (#PCDATA)>
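    To get a feel for what the DTD buys us, here is a small sketch that validates a feed with the parser built into Java. To keep it self-contained, the declarations from the portion above are pasted into an internal DOCTYPE subset rather than loaded from brainfeed.dtd, and the entry declarations are bare placeholders of mine, not the real ones. DtdCheck is a made-up name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical validation helper using an inline copy of part of the DTD.
final class DtdCheck {
    static final String DOCTYPE =
            "<!DOCTYPE brain [\n"
            + "<!ELEMENT brain (meta?, entry*)>\n"
            + "<!ELEMENT meta (uri,owner,description)>\n"
            + "<!ELEMENT uri (#PCDATA)>\n"
            + "<!ELEMENT owner (#PCDATA)>\n"
            + "<!ELEMENT description (#PCDATA)>\n"
            + "<!ELEMENT entry (#PCDATA)>\n"            // placeholder, not the real definition
            + "<!ATTLIST entry id CDATA #IMPLIED>\n"    // placeholder, not the real definition
            + "]>\n";

    /** Parses the body against the inline DTD and returns any validity problems. */
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // turn on DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        final List<String> problems = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) { problems.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        // meta must come before the entries, so this document is reported as invalid.
        System.out.println(validate("<brain><entry id=\"a\">x</entry>"
                + "<meta><uri>u</uri><owner>o</owner><description>d</description></meta></brain>"));
    }
}
```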

    One tricky thing to note here is that we are using XHTML as part of our definition. Now, it's all fine and well to say that we are using XHTML, but if we expect brain files to properly validate, then we need to have XHTML embedded in our DTD. What we want is a definition for content that looks something like this:

    <!ELEMENT content (#PCDATA | div | p | pre | h1 | h2 | h3 | .....

    If we took this approach, then we would also have to redefine div, p, and all of the other tags in XHTML. We could also copy and paste the full XHTML DTD into ours. Either approach would be a lot of work and would require hacking around in someone else's DTD. Fortunately, the designers of XHTML thought of this and designed a modular version. All we have to do is import their DTD and turn off the parts we don't want.

    <!-- ignore the meta and title elements -->
    <!ENTITY % title.element "IGNORE" >
    <!ENTITY % xhtml-meta.module "IGNORE" >
    <!-- XHTML include -->
    <!ENTITY % xhtml11.mod
        PUBLIC "-//W3C//DTD XHTML 1.1//EN"
               "" >
    %xhtml11.mod;
    <!-- define content to contain mixed block level content -->
    <!ELEMENT content (%Block.mix;)*>

    The first two ENTITY lines turn off the title element (since we don't want it to clash with our own title element) and the meta module, which contains the meta tag and all of its attribute definitions. The next four lines import the actual XHTML DTD. First we define the entity, xhtml11.mod, which points to the external DTD. We give it both the PUBLIC identifier (the official W3C name for XHTML) and a SYSTEM URL (a location from which to download the DTD). Then we use the entity on the next line (%xhtml11.mod;) to include all of XHTML's definitions in our DTD. Finally, the last line actually declares the content element as containing %Block.mix;, which the XHTML DTD defines as the contents of a body element. It's shorthand for div, p, h1, h2, h3, etc. And with that, we have imported and customized XHTML into our BrainFeed DTD.

    The next step is to create a human-readable document; one that takes the reader through each part of the protocol, explaining its purpose and usage. It should also give a 30,000-foot view of the system to help implementors get the general idea. Remember that more people will be attracted to your project if they get a good, quick description up front. For the BrainFeed project, I have decided to write an article about it for a notable technical web site, but this is not always required. :)

    The last step is to create a sample implementation. It should be as simple as possible. Don't worry about speed or efficiency; just make it clear and comprehensive. If you can release the code as open source, then it's even better, since this will give other developers a base upon which to build their versions.

    For BrainFeed, I have created a simple client and server implementation. The client is a pair of JSPs, one for searching and one for posting. They do simple queries to a BrainFeed URL and return the results. The server is implemented with a simple servlet. The servlet responds to GETs and POSTs, saving the results into an on-disk XML file. Production-quality implementations would never do this, of course; they would store the entries in a database. But for our sample implementation, this will more than suffice. You can try out the sample client here. This contains a few entries about Java development. Try searching on "Java," "JNI," and "SQL."


    Now that we've created our protocol, and documented it enough for others to build their own versions of it, what can we do with it? The first thing is, of course, a personal database of hand-entered snippets, but as the implementations mature and become popular we can imagine many other uses:

    • Search through the brainfeeds of famous Java developers.
    • Search through all of your old emails.
    • Search through Javadocs and Java forums from within your IDE.
    • Easy searchable FAQs for web sites.
    • Search through weblog archives.
    • An open helpfile format.
    • A searchable collection of cheatsheets and quickrefs.
    • Quick lookup from a dictionary and thesaurus.
    • Search through your book collection from Project Gutenberg.
    • Search through an intranet knowledgebase.
    • Search through Google and Yahoo from your own program.

    The next logical step is to design a new interface. Instead of a web-based thin client, we could have a thick client that does real-time searching and caches the results. In the second article of this series, we will build a Swing application with local caching, keystroke incremental searching, and an embedded HTML renderer.