ColdFusion in Context: Rank-Ordered Site Search

Suppose you want your users to find pages on your site by supplying a rank-ordered list of words. That's similar to the way most people use my favorite search engine/portal, Yahoo! (TM). They don't use Boolean qualifiers. They simply enter a few words, ideally the most important first, and links come back with excerpts to help them decide which links will meet their needs. (Many other portals use an implicit "any" instead of scoring pages based on all the terms supplied; they're less useful as a result.)

ColdFusion Server comes with a version of Verity, a powerful indexing engine that lets you dump in raw text and then do quick searches against it. The only barrier that keeps some sites from using Verity for this purpose is their lack of control over the information that Verity returns.

Verity returns a summary that usually appears to have nothing to do with the search terms but hopefully provides enough context to let a user decide if the link might be useful. This is nice, but it isn't the same feedback that users are used to receiving with other search engines.

This tip lets you use Verity to do the heavy lifting and then use ColdFusion to provide excerpts more closely related to what the user asked for.

Create a Collection

Put this code in setup.cfm. Much of this code is commented out because you'll only run it once. Don't delete it, however. Life will be much easier a year from now if you've kept the code to create and delete collections.

Start by setting defaults. You could enter these values in-line, but placing them in variables with names that match their meaning will make this code clearer months from now. MyIndexPath provides the disk location of the directory that will hold the index for the collection. MyWebLocation provides, relative to the Web root, the directory tree containing the files that you want to index. MyDiskLocation contains the absolute disk location of the directory tree - the same tree defined by MyWebLocation but from a different perspective - that you want to index.

<!--- Set defaults --->
<cfset MyIndexPath=
"c:\web\ht\context\search\myindex">
<cfset MyWebLocation=
"/context">
<cfset MyDiskLocation=
"c:\web\ht\context\">

Leave the following code commented out until told to uncomment the snippet that creates the collection. However, do make a directory for Verity to put this index into. The directory should be empty when you start off. Verity will fill it when you eventually execute this code.

<!--- Uncomment to create the collection
<cfcollection action="create" collection="articles"
path="#MyIndexPath#">
--->
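
If you'd rather not create the directory by hand, ColdFusion can do it. This is only a sketch and not part of the original tip; it assumes that MyIndexPath has been set as shown above, that its parent directory already exists, and that the ColdFusion service is allowed to write there.

```cfml
<!--- Sketch: make the index directory if it
is missing (assumes MyIndexPath is set above) --->
<cfif not directoryExists(MyIndexPath)>
  <cfdirectory action="create"
  directory="#MyIndexPath#">
</cfif>
```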

This code will be run each time that setup.cfm is run. It indexes or reindexes the collection. It also sets the value of form.Reindex to an empty string so that the page that calls this one doesn't keep calling it over and over. (This will be clearer when you supply code for search.cfm.)

Some explanation of the index code is in order. The cflock tag makes the update exclusive, so that only one request at a time can run this block; a competing request waits up to 30 seconds for the lock before timing out. The collection attribute names the collection. The urlpath is relative to the Web root as defined above. The key is the disk location of the directory tree to be indexed. Specifying type="path" tells Verity to index every file in that directory whose extension appears in the extensions list, and recurse="yes" extends this to the subdirectories.

<!--- Index/Reindex --->
<cflock name="reindexmyarticles" timeout="30" type="exclusive">
<cfindex action="update" collection="articles"
urlpath="#MyWebLocation#" key="#MyDiskLocation#"
type="path" recurse="yes" extensions=".htm">
</cflock>
<cfset form.Reindex="">

When you're tired of experimenting, or you want to drop this index so you can create another one where you want it, then you can uncomment the following code, run it once, and comment it out again.

<!--- Uncomment to delete the collection
<cfcollection action="delete" collection="articles"
path="#MyIndexPath#">
--->

Search

Put this code in search.cfm. It will introduce a search concept, get input, perform the Verity search, then display relevant excerpts based on that search.

Introduce the concept. Provide a default empty value for the search string. Set a reasonable input size and a generous maximum length.

<!--- Introduce concept --->
Enter desired words separated by spaces.
Enter the most important words first.

<!--- Get input --->
<cfparam name="form.Want" default="">
<form name="Seek" action="search.cfm" method="post">
<input type="text" name="Want" size="50"
value="<cfoutput>#htmlEditFormat(form.Want)#</cfoutput>"
maxlength="100">
<input type="submit" name="doit"
value="Search">
</form>

Verity will return a URL. In order for ColdFusion to read the file to provide a user-friendly excerpt, you'll need to convert the url to an absolute disk path. This function does that by replacing forward slashes with backslashes and by appending the result to the absolute parent directory ("DiskRoot").

<!--- Define function to convert
a url to an absolute disk path
and filename --->
<cfscript>
function urlToDir(UrlIn)
{
  var DiskRoot="c:\web\ht";
  var Dir="";
  Dir="#DiskRoot##replace(UrlIn,"/","\","all")#";
  return Dir;
}   
</cfscript>
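
Here is a quick way to check the function before relying on it. This is a sketch; SampleUrl is a made-up value, and the fileExists test is an added precaution rather than part of the original tip. With the DiskRoot shown above, the sample URL converts to c:\web\ht\context\tips\search.htm.

```cfml
<!--- Sketch: convert a sample URL and confirm
the file exists before trying to read it.
SampleUrl is a made-up value for illustration. --->
<cfset SampleUrl="/context/tips/search.htm">
<cfset SamplePath=urlToDir(SampleUrl)>
<cfoutput>
#htmlEditFormat(SamplePath)#
<cfif fileExists(SamplePath)>
  (found)
<cfelse>
  (not found; skip the excerpt for this row)
</cfif>
</cfoutput>
```

Note that the function assumes the URL begins at the Web root with a forward slash, which is the case when urlpath is set as in setup.cfm.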

When input is present, do the search. To do this, convert the list of space-separated words to a list of comma-separated words. Verity performs a rank-ordered search when it encounters a comma-separated list. This is more restrictive than having "or" between each word and less restrictive than having "and" between each word. Verity will also find the desired words embedded in other words, but it doesn't rank this kind of match as high as when it finds the words by themselves. Tell the user what to expect if Verity finds anything at all.

<!--- When input is present... --->
<cfparam name="Want" default="">
<cfset Want=trim(Want)>
<cfif len(Want)>
  <!--- Get and display query
  for files that match criteria --->
  <cfsearch collection="articles" name="mysearch"
  type="simple" criteria="#listChangeDelims(Want,',',' ')#">

  <cfif mysearch.recordcount>
    Best matches are presented first.
    Excerpts display the first search
    term in context.
  </cfif>
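
The conversion from spaces to commas can be tried on its own. The following sketch uses listChangeDelims, which changes the delimiter for every element of the list; a plain replace() call would need its optional "all" scope to change more than the first space. The function name toCriteria is hypothetical and is not used elsewhere in this tip.

```cfml
<cfscript>
// Sketch: turn space-separated input into the
// comma-separated criteria described above.
// toCriteria is a hypothetical helper name.
function toCriteria(Words)
{
  // Swap the space delimiter for a comma on
  // every element of the list
  return listChangeDelims(trim(Words),","," ");
}
</cfscript>
<!--- Displays: rank,ordered,search --->
<cfoutput>#toCriteria("rank ordered search")#</cfoutput>
```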

Now for each link that Verity found, show the query row number, show the URL, and try to find an excerpt containing the first word in the search. If you can't, return Verity's "summary" instead with a caveat that the summary only represents nearby text and does not contain an exact match on the first word. Be sure to wrap the excerpt or summary in the htmlEditFormat function so that any HTML it contains is displayed in the search list instead of being rendered by the browser.

To create the excerpt, treat the page as a space-delimited list and try to find an element (a word) that matches the first search word. If successful and the word is less than 21 words from the beginning of the page, display from the beginning to the word. If successful and the word is not less than 21 words from the beginning, display the previous 20 words. Use similar logic to also display either the next 20 words or to the end of the page, whichever is shorter.

If the first word is not found - it might be embedded or not present at all - display Verity's summary accompanied by its score and a short disclaimer.

  <cfoutput query="mysearch">

  <!--- On a new row, show query row
  number.  Could show score but won't --->
  <p>
  #CurrentRow#.  
  <!--- Score: #score# --->

  <!--- Show URL --->
  <a href="#url#">#url#</a><br>

  <!--- Show first excerpt containing
  the first word of the criteria --->
  <cffile action="read" file="#urlToDir(url)#" variable="Guts">
  <cfset Pivot=listFindNoCase("#Guts#",
  listgetat(Want,1," ")," ")>
  <cfif Pivot lt 21>
    <cfset BeginAt=1>
  <cfelse>
    <cfset BeginAt=Pivot-20>
  </cfif>
  <cfif (listlen(Guts," ")-Pivot) gt 20>
    <cfset EndAt=Pivot+20>
  <cfelse>
    <cfset EndAt=listlen(Guts," ")>
  </cfif>
  <cfif Pivot is not 0>
    ...<cfloop from="#BeginAt#" to="#EndAt#"
    index="WordNr">
    #htmleditformat(listGetAt(Guts,WordNr," "))#
    </cfloop>...
  <cfelse>
    [Variations on your input were found, scoring #score# out of 1.000.
    Here is some nearby text...]<br>
    #htmleditformat(summary)#
  </cfif>
  </cfoutput>
</cfif>
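
The excerpt logic can also be packaged as a function. The sketch below is a variation, not the code above: it treats tabs and line breaks as word delimiters in addition to spaces, so a word at the end of a line can still be matched. The names getExcerpt, Text, Term, and Span are hypothetical.

```cfml
<cfscript>
// Sketch: build an excerpt around the first
// whole-word match of Term in Text.
// getExcerpt and its arguments are hypothetical
// names, not part of the original tip.
function getExcerpt(Text,Term,Span)
{
  // Treat spaces, tabs, and line breaks as
  // word delimiters
  var Delims=" " & chr(9) & chr(10) & chr(13);
  var Pivot=listFindNoCase(Text,Term,Delims);
  var Total=listLen(Text,Delims);
  var Excerpt="";
  var WordNr=0;
  // No whole-word match: return nothing so the
  // caller can fall back to Verity's summary
  if (Pivot is 0) return "";
  // Collect up to Span words on either side
  for (WordNr=max(1,Pivot-Span);
       WordNr lte min(Total,Pivot+Span);
       WordNr=WordNr+1)
  {
    Excerpt=listAppend(Excerpt,
      listGetAt(Text,WordNr,Delims)," ");
  }
  return Excerpt;
}
</cfscript>
<!--- Displays: two three four --->
<cfset Sample="one two" & chr(10) & "three four five">
<cfoutput>#htmlEditFormat(getExcerpt(Sample,"three",1))#</cfoutput>
```

Inside the query loop, you could call it as getExcerpt(Guts,listFirst(Want," "),20) and display Verity's summary whenever it returns an empty string.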

Maintain

Provide a second form at the bottom of the list to demonstrate how an Administrator can reindex the collection when necessary. If the button is pressed, have this page include setup.cfm to run whatever snippets aren't commented out.

<!--- Reindex if asked --->
<cfparam name="form.Reindex" default="">
<cfif len(trim(form.Reindex))>
  <cfinclude template="setup.cfm">
</cfif>

<!--- Provide reindex option --->
<form name="Maintain" action="search.cfm" method="post">
<p>
[ Administrators would also see:
<input type="submit" name="Reindex"
value="Reindex Collection"> ]
</form>

Demonstrate

Once you've gathered some text (or provided a real tree that contains text), here are some logical steps to follow.

1) Create the directory that will hold the index.
2) Uncomment the snippet in setup.cfm that creates the collection.
3) Browse search.cfm and press "Reindex Collection". This includes setup.cfm, which creates the collection and indexes it.
4) Comment out the snippet that created the collection.

Now you're ready to try it out. Browse search.cfm. Try various lists of words against your collection. (Don't press "Reindex Collection" unless your tree full of text changes and you need a new index.) Notice that you could use this code immediately in your Web site just by changing the parameters for the tree to be indexed and following the steps above.

You may find that searches you would expect to bring back pages don't. When you examine the pages manually, you'll find that your pages don't really contain the terms that people would use to find them and that it might be a good investment to modify those pages. Rewrite your descriptions, or experiment with meta tags to get the results you want.

Along the same lines, be sure not to make pages available unless you want users to be able to browse them. Within an application, it's normal to have pages that you don't want browsed directly. This index lets users browse potentially every page that matches the "filter" in the index. This tip indexes files whose names end in .htm. You can readily change the list of file extensions to which you'll provide this index and access. Just be sure to consider the consequences.
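
Changing the filter is a matter of editing the extensions attribute in setup.cfm. This sketch indexes three kinds of files; the list itself is only an example, and it assumes the defaults set at the top of setup.cfm.

```cfml
<!--- Sketch: index more than one kind of file.
The extension list here is only an example. --->
<cflock name="reindexmyarticles" timeout="30" type="exclusive">
<cfindex action="update" collection="articles"
urlpath="#MyWebLocation#" key="#MyDiskLocation#"
type="path" recurse="yes"
extensions=".htm, .html, .txt">
</cflock>
```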

Improve

This interface harnesses only a fraction of Verity's power. If you used the exact string entered by the user instead of simplifying the interface as shown here, the user could search within a given tag type in HTML documents (say, only within title tags or only within h4 tags). The user could exclude words, tell Verity to treat the words being entered as stems so that any form of the word is acceptable, or do soundex searches.
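
As a starting point, here is a sketch of passing the user's string to Verity untouched. It assumes the same collection and form field as search.cfm. The sample inputs in the comment use Verity operators as I understand them, so check them against your version's documentation. The cftry block is there because a string that Verity can't parse may raise an error.

```cfml
<!--- Sketch: pass the user's own string to
Verity as an explicit query. Sample inputs:
  <STEM> index               any form of "index"
  <SOUNDEX> verity           words that sound alike
  search <AND> <NOT> engine  exclude a word
--->
<cftry>
  <cfsearch collection="articles" name="mysearch"
  type="explicit" criteria="#trim(form.Want)#">
  <cfoutput>#mysearch.recordcount# page(s) found.</cfoutput>
  <cfcatch type="any">
    Verity could not interpret that query.
    Please check the operators and try again.
  </cfcatch>
</cftry>
```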

Of course, the user would need some help, perhaps in the form of text fields joined or modified by radio buttons to hide the underlying syntax. Now it's your turn to demonstrate a powerful search interface that requires only a few lines of code. Tag! You're it. =Marty=
[To keep the code displayed in these demonstrations from being executed, I use character entities (&lt; and &gt;) instead of less-than and greater-than signs "under the hood". So, when I use this tip, I add the replaceNoCase function to change those entities back to less-than and greater-than signs as I return the results of the site search.]