ColdFusion in Context: Order and Rank by Subset

Suppose you want to enter pairs of data, sort them only on the first element in the pair, and then determine their rank order (including averaging for tie values) based on that sort. This is a requirement that comes up frequently in statistical work. Built-in sorting functions will sort on both elements at once; this tip shows you one way to defeat that behavior. This tip also shows you how to look ahead to determine the average that must be applied to tie values without having to step both forward and backward iteratively to perform the task. Among other list functions, you'll use the listValueCount function.

Overview

Everything will take place on a single page: subsort.cfm. The page must initialize its own variables. The bulk of it will be a form whose textarea fields are shaped into tall, thin columns for data entry and display. The user will enter data into the first two columns, click "calculate", and view the result.

Initialize and Begin the Form

The setup consists of housekeeping details. One column of the outer table will hold an inner table that formats the user's input from two columns Xraw and Yraw. After a separator, remaining columns will display the scores as ordered pairs and will show their rank based solely on the X column score.

Sort by Subset

<!--- Set defaults --->
<cfparam name="form.Xraw" default="">
<cfparam name="form.Yraw" default="">
<cfset form.Xrank="">

<!--- Begin form and overall table --->
<form name="Ranks" action="subsort.cfm" method="post">
<input type="submit" name="doit" value="Submit">
<table border="1">
<tr><td>RAW PAIRS</td>
<td bgcolor="LightGrey"><hr width="0" height="0"></td>
<td>ORDERED</td>
<td>X RANK</td></tr>

<!--- Begin inner table for raw data --->
<tr><td>
<table>
<tr><td>X</td><td>Y</td></tr>

Display and Store X Data

The textarea is tall and thin to encourage the user to enter just one number per row, press enter, and enter another number in the same field. The textarea enforces a physical wrap to avoid losing this separation between numbers. A break and horizontal rule are added after the textarea as a placeholder that can be replaced with counts or sums as desired.

The break between numbers might be a space (not desirable but possible), a newline, or the combination of a newline and carriage return; so, all are shown as delimiters when treating the form input as a list. Consecutive delimiters are treated as one.

<!--- Show X raw data --->
<tr><td><textarea name="Xraw" wrap="physical" rows="25" cols="3">
<cfoutput>#trim(form.Xraw)#</cfoutput></textarea><br><hr></td>

<!--- Store X raw data --->
<cfset XrawPoints=ArrayNew(1)>
<cfset XrawCount=0>
<cfloop list="#form.Xraw#" index="Value"
delimiters=" #chr(10)##chr(13)#">
<cfset XrawCount=XrawCount+1>
<cfset XrawPoints[XrawCount]=val(Value)>
</cfloop>

Display and Store Y Data

This step is nearly identical to the display and storage of X data.

<!--- Show Y raw data; end inner table --->
<td><textarea name="Yraw" wrap="physical" rows="25" cols="3">
<cfoutput>#trim(form.Yraw)#</cfoutput></textarea><br><hr></td></tr>
</table>
</td>

<!--- Store Y raw data --->
<cfset YrawPoints=ArrayNew(1)>
<cfset YrawCount=0>
<cfloop list="#form.Yraw#" index="Value"
delimiters=" #chr(10)##chr(13)#">
<cfset YrawCount=YrawCount+1>
<cfset YrawPoints[YrawCount]=val(Value)>
</cfloop>

Sort by X Values Alone

Test to be sure that both input columns have the same number of values; proceed if they do.

The actual sort takes a bit of thought and preparation. You want to wind up with ordered pairs, but you don't want the second value of the pair to influence the sort. You can achieve this result by inserting a sequence number between the first value and the second value. The sequence number will be considered before the second value, thus removing its influence.

However, you want to rank order these scores from highest to lowest. That means that a descending sort should be used. Therefore, to correctly mask the influence of the second value, the sequence to be inserted should also be descending. It should begin with a high value and descend to a lower one.

Because this sort will be run against a composite element consisting of an X value, a sequence number, and a Y value, it can't be numeric. However, if you sort "1", "2", and "11" as strings, they'll come out with "11" in between. To fix this behavior, you need a common format that right-aligns and zero-fills numbers. This is easy to do for the X and Y values; just use the numberFormat function. For the sequence, pretend that the lowest value has sequence 1001 (forcing four positions) and adjust later to get the real sequence when needed.

Build each composite element using colons as separators within the element. Finally, sort the overall string. Here's the result.

<!--- Stop if columns aren't ready --->
<cfif XrawCount neq YrawCount>
  </table></table>
  Cannot proceed until the X and Y raw score columns have have the same number of values.
  <cfabort>  
</cfif>

<!--- Make, then sort triplets like 000X:{1000+Seq}:000Y by X alone --->
<cfset Seq=1000+YrawCount>
<cfset Xlist="">
<cfloop from="1" to="#YrawCount#" index="Point">
<cfset Seq=Seq-1>
<cfset Entry="#numberFormat(XrawPoints[Point], "0000")#:#Seq#:#numberFormat(YrawPoints[Point], "0000")#">
<cfset Xlist=listAppend(Xlist, Entry)>
</cfloop>
<cfset Xsort=listSort(Xlist, "text", "desc")>

Display the Sorted Pairs

After inserting a table cell separator to serve as a reminder that the pairs are not in the same sequence as the original input, construct and display sorted pairs from the newly sorted list. Each of the data points is extracted from the Xsort string whose elements are separated by commas but whose subelements (x:sequence:y) are separated by colons.The ShowByX field will hold XY pairs (separated by a tilde) for display. The chr(10) forces a newline between pairs to cause them to appear on separate lines of the display. The Xord variable will hold a list of X values to be ranked.

<!--- Separate input from processed data --->
<td bgcolor="LightGrey"><hr height="0" width="0"></td>

<!--- Format pairs for display; store X values for manipulation --->
<cfset Xord="">
<cfset form.ShowByX="">
<cfloop list="#Xsort#" index="Pair">
<cfset form.ShowByX=listAppend(form.ShowByX,
"#val(listGetAt(Pair,1,":"))#~#val(listGetAt(Pair,3,":"))#",
"#chr(10)#")>
<cfset Xord=listAppend(Xord,"#val(listGetAt(Pair,1,":"))#")>
</cfloop>

<!--- Display sorted pairs --->
<td>
<table>
<tr><td>by X</td></tr>
<tr><td><textarea name="ShowByX" wrap="physical" rows="25" cols="6">
<cfoutput>#trim(form.ShowByX)#</cfoutput></textarea><br><hr></td></tr>
</table>
</td>

Rank by X Values

This is in some ways the toughest part to understand. If the X values are 8,6,6,4,2, then their ranks are 1,2.5,2.5,4,5. The tie values receive a rank equal to the average rank of the individual tied values. It would be difficult for each value to count forward for each value that is the same, note its raw rank, and then average those ranks and remember to use the average ranks rather than the raw ranks for those values. Fortunately, there's an easier way.

  1. Get the value to be ranked.
  2. Use the listValueCount function to return a count of all matching values.
  3. Sum this rank to the next rank and so on until the count has been reached. The sum of numbers in a series is in other contexts referred to as a Fibbonachi progression; so, "Fib" was used as the variable name for this sum.
  4. Divide the sum by the number of matching values.
  5. Post this average rank to each of the affected rows.
Note that if there are no duplicates, this procedure posts the raw rank as you might expect. It continues until all the rows have been handled.

Once this is done, build a newline-separated string, Xrank, to display the ranks in a form field. Close out the table and the form.

<!--- Get X Rank --->
<cfset Xarray=ArrayNew(1)>
<cfset Row=1>
<cfset Max=Listlen(Xord)>
<cfloop condition="Row le Max">

<!--- Get value to be ranked --->
<cfset Seek=val(listGetAt(Xord,Row))>

<!--- Count the duplicates --->
<cfset NrSame=(listValueCount(Xord,"#Seek#"))>

<!--- Sum the like row numbers --->
<cfset Fib=0>
<cfset LastLike=Row+NrSame-1>
<cfloop from="#Row#" to="#LastLike#" index="RawRank">
<cfset Fib=Fib+RawRank>
</cfloop>

<!--- Divide by the number of duplicates --->
<cfset Rank=Fib/NrSame>

<!--- Post to matching rows --->
<cfloop condition="#Row# le #LastLike#">
<cfset Xarray[Row]=Rank>
<cfset Row=Row+1>
</cfloop>

<!--- Continue until all rows have been processed --->

</cfloop>

<!--- Format X rank --->
<cfloop from="1" to="#listLen(Xord)#" index="Row">
<cfset form.Xrank=listAppend(form.Xrank,
"#Xarray[Row]#",
"#chr(10)#")>
</cfloop>

<td>
<table>
<tr>
<td>
1..N
</td>
<tr>
<td><textarea name="Xrank" wrap="physical" rows="25" cols="6">
<cfoutput>#trim(form.Xrank)#</cfoutput></textarea><br><hr></td>
</tr>
</table>
<td>
<td>
</tr>
</table>
</form>

Try It, Then Improve It

Browse subsort.cfm, fill in some values, and watch it work. In theory, array-based functions will be faster than the list functions used here. Try your hand at converting this tip, and tell us what you've learned. =Marty=