Hack 75. Run a Search Engine Inside Your Browser

Use many of the techniques of web search engines inside your browser.

In many cases, adopting Ajax techniques means tying a web application more tightly to the server; however, as this hack shows, that's not always the case. Using data stored in JSON format [Hack #7], it's possible to include all the required data through an ordinary script tag. Once the initial page is loaded, no further network access is required.

In fact, some purists might argue that since no special server interaction takes place, the techniques shown here shouldn't be considered part of Ajax. Nevertheless, a self-contained web application has many potential uses, from CD-ROM documentation to situations where the user might go offline at times.

The application we'll look at here demonstrates web search capabilities, using many of the same techniques employed by major search engines, but on a smaller scale. The material searched is my 2003 O'Reilly book, XForms Essentials, which was ideal because the amount of data needed by the searcher is small enough to easily fit into browser memory, and the text is available under an open content license favorable for the purposes of an online demo.

As with many different kinds of searches, the key is an appropriately constructed index.

Indexing 101

An inverted index, in its most basic form, is a simple data structure that maps terms to specific locations. For example, at the end of this book, you'll find an index containing a list of common terms used in the book, and for each term you'll find a list of page numbers where discussion of that topic occurs. The terms themselves are in alphabetical order, so readers can quickly find the terms they seek and go to the appropriate pages. (If you wanted to find, say, information in this book on the topic of JSON, you could either methodically page through the entire book, or turn to the index for a list of pages where it's discussed.)

In this hack, the index maps from individual words to a list of up to some 200 small documents in which those words occur, based on the version of the text at http://www.xformsinstitute.com/essentials/browse/. It turns out that an ordinary JavaScript object, acting as an associative array, is the perfect data structure for a simple index; it is designed to perform rapid lookups based on a given key.

Prior to a new page appearing in the search results, online search engines devote a significant amount of their resources to retrieving web pages and creating a suitable index. Similarly, this hack performs preprocessing to create the necessary index. For the heavy lifting of generating such an index, I used an excellent resource: David Mertz's public-domain Gnosis Utilities for Python, found at http://www.gnosis.cx/download/ and described in the article at http://www-128.ibm.com/developerworks/library/l-pyind.html. Once the library code has constructed the index, a small fragment of Python writes out the data structure as a pair of object literals, a fragment of which is as follows (line breaks added for readability):

var jswords={'NFORMS':[18,45],'LATEX':[18,7],
'MODIFICATIONS':[10,11,18,6,7], 'EVERYONE':[18,78,5,6],
'OCCURRENCE':[18,26],'SUPPOSED':[18,50,21], LENGTHS':[40,72,18,21],
'APPEARANCE':[69,7,45,49,18,21,23,24,26,60,29,63] 
... 
var jsfiles=['unused','apa.php','apas02.php','apas03.php','apas04.php', 
...

A few things of note in this code:

All the words are normalized to uppercase, indicating a case-insensitive search mode.
Instead of spelling out the full name of each result file over and over, file references are given as an offset into a separate array of filenames.

Further, to keep index size down, common stopwords such as "and," "the," and "a" are omitted from the index.

From this short fragment, it's obvious that "NForms" appears twice (in the 18th and 45th files), "LaTeX" appears twice (in the 18th and 7th files), and so on.

Putting It Together

Given the JavaScript index in a file named xfi.js, a small bit of additional script is needed to implement the query engine. Instead of a submit button, the code sets a timer of 250 milliseconds. When the time expires, it checks whether the value in the query control has changed and, if so, provides an immediate update:

<html>
<head>
    <title>Full-text search of XForms Essentials (beta)</title>

    <script type="text/javascript" src="xfi.js"></script>
    <script type="text/javascript">
        var lastq = "";
        var delay = 250;

        function requery(  ) {
            if (!jswords || !jsfiles) return;  //still loading
            var currentq = document.query.q.value;
            if (currentq == lastq)
                    return;

            var results = localfind(currentq.split(" "))
            lastq = currentq;
            updateResults(results);
        }

        //This function is adapted from David Mertz's public domain 
        //Gnosis Utils for Python with some extra gymnastics since 
        //jsfiles uses the more compact js array instead of object/dicts
        function localfind(wordlist) {
            var entries = {};
            var hits = {}
            for (var idx=0; idx < jsfiles.length; idx++) {
                hits[idx] = jsfiles[idx];    //copy of the fileids index
            }
            for (var idx in wordlist) {
                var word = wordlist[idx]
                word = word.toUpperCase(  )
                if (!jswords[word]) return {}   //nothing for this one word 
                 (fail)
                var entry = {}
                //For each word, get index
                //of matching files
                for (var idx=0; idx < jswords[word].length; idx++) { 
                    entry[jswords[word][idx]] = "hit";
                }

                //eliminate hits for every non-match
                for (var fileid in hits) { 
                    if (!entry[fileid]) {
                        delete hits[fileid];
                    }
                }
            }
            return hits;
        }

        function updateResults(results) {
            var upd_loc = document.getElementById("results");
            var url_base = "http://xformsinstitute.com/essentials/browse/";
            //remove previous results, if any
            while (upd_loc.hasChildNodes(  )) {
                upd_loc.removeChild(upd_loc.childNodes[0]); 
            }
            var newh1 = document.createElement("h1");
            newh1.appendChild(document.createTextNode("results:"));
            upd_loc.appendChild(newh1); 
            for (var fileid in results) { 
                var hit = jsfiles[fileid]; 
                var newp = document.createElement("p"); 
                newp.appendChild( makeHyperlink( url_base + hit, hit )); 
                upd_loc.appendChild(newp); 
            } 
        } 

        function makeHyperlink( url, text, title ) { 
            var aelem = document.createElement("a"); 
            if (title) aelem.setAttribute( "title", title );
            if (url) aelem.setAttribute( "href", url ); 
            aelem.appendChild(document.createTextNode(text)) 
            return aelem; 
        } 

setInterval(requery, delay);</script> 
</head> 
<body>
<p>Just type here, and watch the results magically appear. (JavaScript 
required, but other than initially loading the document, no network access 
is required)</p> <p>Currently, only whole-word matching is implemented. 
Multiple words are ANDed together.</p> 
<form name="query" action="no_submit" method="POST"> <input type="text" 
name="q" autocomplete="off"/> </form> <div id="results"> </div> 
<p><a href="http://dubinko.info/blog/2004/12.html#perm2004-12-
26_localindex">technical details</a></p> </body>
</html>

The function localfind( ) does the actual lookup, taking into account one additional wrinkle: the query might contain more than one keyword. In that case, the list of hits for each keyword needs to be combined. The code treats this as a Boolean AND, so only documents containing all the keywords get returnedthat is, the intersection of the lists gets computed.

The functions updateResults( ) and makeHyperlilnk( ) use standard DOM manipulation to show the results directly, as simple unstyled hyperlinks, sidestepping the round trip normally associated with a search engine request.

To operate the search engine, just open the document, found online at http://www.xformsinstitute.com/essentials/xfi.html, in a browser, and enter some terms in the text control. The results appear immediately in the page, providing hyperlinks to full-text sections of the book. Despite all the work going on behind the scenes, the queries return results nearly instantly.

Hacking the Hack

In modifying the code for your own purposes, you might want to experiment with different values for the timerthe delay variable in the JavaScriptthat periodically checks whether to rerun the query. Also take note of the url_base variable, which sets the common part of the URL for each result.

The search engine could be enhanced in a number of ways. Perhaps the most obvious would be to include the ability to search for exact phrases, typically indicated via quotation marks in web search engine queries. To do so would require a more sophisticated index structure that keeps track of not only what document each word occurs in, but also where in the document it occurs.

Another enhancement would be to add stemming, so, for instance, a search for "appear" would find pages containing "appear," "appears," "appearance," and "appearing." Doing so also would involve using a slightly more sophisticated data structure for the index.

Finally, since not every browser supports JavaScript or has it enabled, this hack could be modified to include a submit button to perform a normal, server-based query. When JavaScript is enabled, the button can be hidden, performing instant lookups as described above.

Resources

You can find more information on inverted indexes at http://en.wikipedia.org/wiki/Inverted_index.

For additional discussion of the Gnosis Utilities from David Mertz's Text Processing in Python (Addison-Wesley), check out http://www.gnosis.cx/TPiP/.

Micah Dubinko