ColdFusion in Context: Parsing Variable-Length Fields

Text files that carry data between applications often do it in one three ways: fixed-length fields, variable-length fields, and labeled fields. A report printed to disk is an example of the first method. A comma-delimited file is an example of the second method. XML is an example of the third.

Suppose you wanted ColdFusion to correctly parse a file consisting of variable-length fields. How would you do it? Here's an example from ANSI X12, a transfer method with variable-length fields, to show why the obvious approach to parsing this kind of data isn't the right approach. Along the way, you'll use ColdFusion index loops and regular expressions.

Peek at American National Standards Institute (ANSI) X12

ANSI X12 defines economical formats for the completely automated exchange of standard business data between computers. Each record begins with a label followed by fields separated by asterisks. Here's an example of one record in an X12 transaction:

G72*41*02***-2.180*36.000*CA

All users of ANSI X12 work from a common dictionary. They don't use every element available, but if they do use an element, it has the meaning given in the standard.

The power of X12 is that its segment and element meanings have been agreed on through decades of negotiation across all the business sectors you might think of, the X12 transactions built up from these pre-defined building blocks have achieved the same broad level of acceptance, and the X12 protocol uses very little bandwidth. And did I mention that X12 includes checksums and automated feedback that assures the sender that the transmission was received?

XML, a transfer method using named fields, takes perhaps six times as much space as X12 to pass the same amount of data, lacks these additional features, and XML document formats are still hammered out on a case-by-case basis (or at best by a few companies within a single type of business).

Set Up an Index Loop

That's enough pep talk describing the importance of X12. Now, how would you parse this string to put the label and its elements into separate chunks to look them up against a dictionary?

You might be tempted to make this the heart of your parser to break down a single line:

<cfloop index="Field" list="#Record#" delimiters="*">

This would be OK if you never had two delimeters with nothing between them. However, you saw in the example that when an optional value is left out, the delimiters before and after it touch.

G72*41*02***-2.180*36.000*CA

If you tried to parse this line using the method above, here is how the loop would break it down: G72 (the label), 41, 02, -2.180, 36.000, CA. This would cause -2.180 to appear to be the charge number (element 3) instead of the charge amount (element 5). Your parser needs to break it down as G72, 41, 02, {empty}, {empty}, -2.180, 36.000, CA.

So what causes this problem? It turns out that the index loop treats multiple consecutive delimiters as a single delimiter; it will not loop at all for the empty values between them.

So in order to loop for empty values as well as non-empty ones, what can you do? One approach is to replace those empty values with a predefined non-empty value. Then the loop will run the proper number of times, and the logic within the loop will know which field number corresponds with each value (or with your standard replacement for an empty value).

Prepare the Data with Regular Expressions

Putting a real value such as "{EMPTY}" between consecutive delimiters so that no fields are really empty seems simple enough. ColdFusion has a function which finds and replaces substrings that match a regular expression. The REReplace function will start with a string, look for a substring within it, and replace the substring as indicated. Its instructions for use state that specifying the keyword "ALL" will cause it to replace all occurences of the substring. The regular expression "\*{2}" escapes an asterisk so it will be treated as a plain character instead of as a special value. The number 2 in curly braces says to look for two of these in a row. Knowing this information, you might be tempted to handle the problem with a single line preceding the index loop:

<cfset Record=REReplace(#Record#,"\*{2}","*{EMPTY}*","ALL")>

This almost does the job. It expands the original record into this:

G72*41*02*{EMPTY}**-2.180*36.000*CA

So what happened? You specified that you wanted all occurences of two consecutive delimiters to be replaced with *{EMPTY}*, but it only picked up the first occurrence. The problem seems to be that characters used in the first replacment are ignored as the search continues. The scan doesn't see the first of two asterisks following the first empty value; because, that asterisk is part of what the first occurrence was replaced with. If you experiment on your own, you'll see that it picks up every OTHER pair. A stream of six asterisks would be converted to this:

*{EMPTY}**{EMPTY}**{EMPTY}*

So now what? It turns out that there's a simple fix: just do the replacement twice. With an input form and index loops, here's a demonstration parser for records of variable-length delimited fields. Thanks to Don Gilliland for pointing out that some browsers need the textarea field to have a wrap="physical" attribute for proper display within the input field. chr(10) is a linefeed, a reasonable character to look for in order to recognize the end of a line of text.

<cfif not isDefined("form.Doc")>
  <cfset Doc="">
</cfif>
<cfform name="parser" action="parseDoc.cfm" method="post">
Paste an X12 document here, and press "Parse"<br>
<cfoutput><pre>
<textarea name="Doc" wrap="physical" rows="8" cols="50">#trim(Doc)#</textarea>
<br></pre></cfoutput>
<input type="submit" name="go" Value="Parse">
<p>
<hr>
</cfform>
<cfif len(trim(Doc)) gt 0>
  <cfloop index="Record" list="#trim(Doc)#" 
delimiters="#chr(10)#">
  <cfset Demo="">
  <cfif len(trim(Record)) gt 0>
    <cfset FieldNo=-1>
    <cfset Record=ReReplace(#Record#,"\*{2}","*{EMPTY}*","all")>
    <cfset Record=ReReplace(#Record#,"\*{2}","*{EMPTY}*","all")>
    <!--- (yes, the above Record must be used twice) --->
    <cfloop index="Field" list="#Record#" delimiters="*">
    <cfset FieldNo=#FieldNo#+1>
    <cfif FieldNo is 0>
      <!--- FieldNo 0 is the label for the record type --->
      <cfset RecType=trim(Field)>
      <cfif len(RecType)>
        <!--- In real life, you'd do a lookup here to get the record layout --->
        <cfset Demo=Demo&"RecType: #Field#">
      <cfelse>
        <cfset Demo=Demo&"RecType: {EMPTY}">
      </cfif>
    <cfelse>
      <!--- Subsequent fields are really fields --->
      <!--- In real life, you'd look up this field number for this record type --->
      <cfset Field=trim(Field)>
      <cfif len(Field)>
        <cfset Demo=Demo&", FieldNo_#FieldNo#: #Field#">
      <cfelse>
        <cfset Demo=Demo&", FieldNo_#FieldNo#: {EMPTY}">
      </cfif>
    </cfif>
    </cfloop>
  </cfif>
  <cfoutput>#Demo#</cfoutput><br>
  </cfloop>
</cfif>

Watch it Work

Copy the code to a file; name the file parseDoc.cfm, and paste something into it. Here's an input example you can paste into the form. The first record would normally be an ISA and the last would be IEA (its complement), but the ISA record is longer than one screen line. Since long lines will probably be broken up when this tip is published, and broken lines won't parse right, the ISA and IEA are omitted for this example:

GS*GP*802896274*4058407200*20001108*1312*47*T*004010UCS
ST*880*0060
G01*20000916*6076*20000927*426151-ME
G61*IC*MARTY LADNER
G62*11*20000927
G27*M****DEF
G23*01*3*20000916*01*2*20000926*10*20000927*11*18286*896024*914310
G25*PC*04
N1*BT*PARENTSTORE*9*006943773ME00
N3*236 MONTGOMERY HWY
N4*BIRMINGHAM*AL*35216*USA
N1*ST*LOCALSTORE*9*006943773ME00
N3*2548 REDSTONE RD SW
N4*HUNTSVILLE*AL*35803*USA
N1*RE*LIGERS*9*8028962740001
N3*3410 MCGEHEE RD
N4*MONTGOMERY*AL*36111*USA
G17*60*CA*26.09*074816211179******60*CA
G69*CHOCOLATE MUFFINS
G20*12*16*OZ
G17*330*CA*23.34*074816261452******330*CA
G69*BRAN MUFFINS
G20*12*11*OZ
G17*30*CA*26.09*074816291547******30*CA
G69*PUMPKIN MUFFINS
G20*12*18*OZ
G31*420*CA
G33*914310
SE*28*0060
GE*1*47

Extension

The result (a list of parsed records) doesn't look very imposing. Why bother doing this at all? It's because the result proves that you can definitely identify the record type for each record and the field contents for each field. With this knowledge, you can accept the information into a database as a transaction. With this knowledge, you can (using a database that contains the ANSI X12 standard) spit out a human-readable interpretation of the entire transaction. With this technique, you can parse any document that uses variable-length fields and can create low-bandwidth interfaces. =Marty=