ColdFusion in Context: Spellcheck with a Web Site Indexer

Suppose your users want to check the spelling of their inputs; suppose you want to learn about Verity, a tool bundled with ColdFusion to index documents. You can do both with this tip, and along the way, you'll do a little filtering, looping, and string display.

Make "Dictionary" and "Index" Directories

First, a bit of perspective. One advantage of using Verity as your spellcheck engine is that you can toss some of your own documents into a directory, and every word Verity finds in them becomes part of its "dictionary" the next time you reindex. A disadvantage is that you'll have to find a suitable word list to start from and then add your favorite touches to it. Look around on the Web. Check your local UNIX box. Any straight text document works well. Verity will interpret and index many types of documents; however, the file extension is significant, so see your favorite ColdFusion reference for details. For this demonstration, just make a directory - call it "mydocs" - and toss a few of your text documents (having .txt extensions) into it. You'll find that Verity can use a dictionary of perhaps half a megabyte to check the text area of a form in a few seconds. This approach isn't something you'd use on a term paper - the words it doesn't like are shown out of context - but it gives reasonable results if you put enough of your typical work into the dictionary directory ("mydocs" in this example).

Now you'll need an index directory. Just make the directory; call it "myindex". Verity will fill it when the time comes.
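
If you'd rather let ColdFusion make the two directories, a minimal sketch might look like the following. It assumes that the placeholder base path used later in this tip (c:\PathOnYourBox\spell\) already exists and that the ColdFusion service is allowed to write there. The loop variable "dirname" is just an illustrative name.

```cfml
<!--- Assumption: placeholder base path; substitute a real directory on your box --->
<cfset mypath = "c:\PathOnYourBox\spell\">

<!--- Create "mydocs" and "myindex" only if they are not already there --->
<cfloop index="dirname" list="mydocs,myindex">
  <cfif not directoryExists(mypath & dirname)>
    <cfdirectory action="create" directory="#mypath##dirname#">
  </cfif>
</cfloop>
```

Because of the directoryExists() check, running the fragment a second time should do nothing.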

Make an Input Form

You'll need an input form with a place to enter text. The textarea tag is appropriate for this, so build the form with the plain HTML form tag, because the cfform tag doesn't understand the textarea tag. Further, to be sure that the field's contents wrap after refresh in major browsers, specify wrap="physical".

The first time through, the form variables don't exist, so give them default values with the cfparam tag. Notice that the submit buttons carry useful values, but the cfparam tag defines the variables for these buttons as empty strings. That's so you can tell which button was pressed: the button that was pressed will have a value whose length is greater than zero, and the other will have a value whose length is zero.

As mentioned above, this form has two submit buttons. If you've added documents to the "dictionary", you'll want a way to reindex. If you just want to spellcheck using the current index, you'll want a button for that also. Put all of this tip's code into spell.cfm.

<cfparam name="form.myinput" default="enter text here">
<cfparam name="form.doit" default="">
<cfparam name="form.redo" default="">
<form action="spell.cfm" name="myspeller" method="post">
<!--- htmlEditFormat keeps stray markup in the text from breaking the form --->
<textarea name="myinput" wrap="physical" rows="4" cols="30">
<cfoutput>#htmlEditFormat(form.myinput)#</cfoutput></textarea>
<input name="doit" type="submit" value="Spellcheck">
<input name="redo" type="submit" value="Reindex">
</form>
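
To make the "which button was pressed" test concrete, here is a small sketch of the branching it allows. The three messages are placeholders added for illustration. The tip itself only tests form.redo and then spellchecks on every request.

```cfml
<!--- Same defaults as the form above: an unpressed button stays an empty string --->
<cfparam name="form.doit" default="">
<cfparam name="form.redo" default="">

<cfif len(form.redo)>
  <!--- "Reindex" was pressed, so form.redo holds the button's value --->
  <p>Reindex requested.</p>
<cfelseif len(form.doit)>
  <!--- "Spellcheck" was pressed --->
  <p>Spellcheck requested.</p>
<cfelse>
  <!--- Neither button was pressed: the page was just opened --->
  <p>First visit.</p>
</cfif>
```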

Define the Collection

First, define the physical paths for your index and your documents. Doing it in the manner shown simplifies subsequent code for this example, but you can simply provide the entire path where appropriate if you wish. The documents and the index don't really even have to be on the same server, let alone in the same parent directory.

<cfset mypath="c:\PathOnYourBox\spell\">
<cfset myindex=mypath & "myindex">
<cfset mydocs=mypath & "mydocs">

You'll want to tell Verity where to put the initial index. If you ever move or get rid of it, you'll want to tell Verity to forget where it put the index. These fragments of code won't hurt the original pages; they only affect the index directory.

Leave the comments around the fragment to delete a collection. You won't need to run this code until you want to get rid of the collection someday.

<!--- Delete a Collection
<cfcollection action="delete" collection="stuff" path="#myindex#">
--->

You will run the following code once to create the collection. It's shown surrounded by comments. REMOVE THE COMMENTS. Run the code once. THEN REPLACE THE COMMENTS.

<!--- Make a Collection
<cfcollection action="create" collection="stuff" path="#myindex#">
--->
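
If toggling comments by hand feels error-prone, one possible alternative is to gate the same two tags behind a URL parameter. This is only a sketch: "setup" is a hypothetical parameter name invented here, not a ColdFusion feature, and the path is the same placeholder used above.

```cfml
<!--- Assumption: same placeholder path and collection name as the tip --->
<cfset mypath = "c:\PathOnYourBox\spell\">
<cfset myindex = mypath & "myindex">

<!--- "setup" is a hypothetical URL parameter; it is empty on ordinary requests --->
<cfparam name="url.setup" default="">

<cfif url.setup is "create">
  <!--- Runs only for spell.cfm?setup=create --->
  <cfcollection action="create" collection="stuff" path="#myindex#">
<cfelseif url.setup is "delete">
  <!--- Runs only for spell.cfm?setup=delete --->
  <cfcollection action="delete" collection="stuff" path="#myindex#">
</cfif>
```

Anyone who can guess the URL could trigger these actions, so a fragment like this probably belongs on a protected admin page rather than in the public form.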

Reindex When Asked

If the submit button named "redo" has a value whose length is greater than zero (because you pressed the "Reindex" button), you want to build a new index. Since this tip can't know which version of ColdFusion you're running, it explicitly purges the old index and then builds a new one. While you're changing the index, you should lock it; this example does that. It assumes that all your documents have .txt extensions. If they have other extensions, include them in the "extensions" string also.

<cfif len(form.redo)>
  <!--- Purge a Collection's Index --->
  <cflock name="purgemystuff" timeout="20">
  <cfindex action="purge" collection="stuff"
  key="#mydocs#" type="path" extensions=".txt">
  </cflock>
  <!--- Index a Collection; extension is necessary --->
  <cflock name="indexmystuff" timeout="20">
  <cfindex action="update" collection="stuff"
  key="#mydocs#" type="path" extensions=".txt">
  </cflock>
</cfif>

Prepare to Spellcheck

You'll need to clean up the input and prepare to display an output. To know whether to display a message explaining the list of misspelled words, you need to count them. You'll need a string to put the bad words into. You'll want a separator to precede each bad word except the first. (That's why "outsep" is initially set to an empty string.) You'll want to remove double quotes. (chr(34) is a double quote, and the "all" argument makes replace() remove every occurrence rather than just the first.) And you'll want to ignore spaces, line breaks, tabs, digits, and special characters. (This doesn't ignore a colon; you may want to add stuff like "subj:" as words, and you'll need the colon to know when the abbreviation makes sense.)

<cfset badcount=0>
<cfset outstring="">
<cfset outsep="">
<cfset form.myinput=replace(form.myinput,chr(34),"","all")>
<cfset mynulls=" 0123456789.;!=+-,$%^&()*?" & chr(13) & chr(10) & chr(9)>
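
To see what an ignore list like this does before any searching happens, you can loop over a sample string and print each piece. The sample text below is made up for illustration, and the delimiter string is like the one above.

```cfml
<!--- Sample text is invented for this demonstration --->
<cfset sample = "Subj: teh quick, brown fox (42)!">
<cfset mynulls = " 0123456789.;!=+-,$%^&()*?">

<!--- Every character in mynulls acts as a delimiter; empty pieces are skipped --->
<cfoutput>
<cfloop index="word" list="#sample#" delimiters="#mynulls#">
  [#word#]
</cfloop>
</cfoutput>
```

This should display five pieces: Subj:, teh, quick, brown, and fox. The colon survives because it isn't in the ignore list, while the digits, parentheses, and punctuation disappear.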

Spellcheck

You're actually going to do a Verity search for each word one at a time. Verity is so fast that you can get away with that. The loop will use your list of characters to ignore as the delimiter between words. Just before it looks in the collection, it makes sure the resulting "word" doesn't have a special meaning to Verity: you don't want to use "and", "or", or "not" for this kind of search.

When you use Verity for other tasks, you want to bring back a summary of what was found. However, in this context, you just want to know if you got at least one hit: recordcount is greater than zero. If the word was NOT found, add the output separator - it starts off empty - in front of the bad word. Then, make the output separator into a comma followed by a space so that the new combination will precede each bad word from now on. Increment the badcount so you'll know at least one word was NOT found. Then, loop to do more searches until spellchecking is complete.

<cfloop index="word" list="#form.myinput#" delimiters="#mynulls#">
<!--- Skip Verity's operator words; listFindNoCase matches whole words only --->
<cfif not listFindNoCase("and,or,not",word)>
  <cfsearch collection="stuff" name="mysearch" type="simple" criteria="#word#">
  <cfif not mysearch.recordcount>
    <!--- No hit: add the word to the list of words to check --->
    <cfset outstring=outstring&outsep&word>
    <cfset outsep=", ">
    <cfset badcount=badcount+1>
  </cfif>
</cfif>
</cfloop>

Display the Results

If at least one word was NOT found, badcount will be greater than zero. When this happens, add an explanation to the bare list of words and display the list.

<cfif badcount>
  <cfoutput>#outstring# ...should be checked</cfoutput>
</cfif>
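
The closing section of this tip imagines showing the words in an alert box instead. One way that might be sketched is to write the list into a small script block. The wording of the message is an assumption, and the two cfparam lines are there only so the fragment also runs on its own.

```cfml
<!--- Defaults so the fragment runs standalone; the tip sets both variables earlier --->
<cfparam name="badcount" default="0">
<cfparam name="outstring" default="">

<cfif badcount>
  <cfoutput>
  <script type="text/javascript">
    // jsStringFormat escapes quotes and similar characters so they cannot break the string
    alert("Please check: #jsStringFormat(outstring)#");
  </script>
  </cfoutput>
</cfif>
```

The alert appears as the result page loads, so the user still has to return to the form afterward. Opening the check in a separate window would take more work than this sketch shows.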

Put it Together

Make sure you've laid the groundwork. You have two directories: myindex and mydocs. You have run the "create" collection code once and then have put the comments back around it so it won't run again. You have put some documents having .txt extensions into the mydocs directory. Maybe you've even found a mid-size text dictionary and have added that too.

Now run spell.cfm, type in some text if you wish, and click on "Reindex" to get Verity to build an index for your "dictionary". Reindexing may take a minute or so. If you typed in text that contains a misspelled or uncommon word, you should find it in a list of words for you to check. If you think the word should be in your dictionary, put it in one of the text documents in mydocs or add another document that has that word, then reindex again.

Now that you have an index, type some more text and press "Spellcheck". This time, the search will take only a few seconds to run.

To add spellchecking to an existing form, you can add a "Spellcheck" submit button and have your current action page spellcheck if it sees the value for this button. Better yet, it may be possible to do something less intrusive. Imagine opening a separate window for the spellcheck, leaving your original form ready to proceed. Imagine opening an alert box with the incorrectly spelled words. Then do these things and tell us how. =Marty=