ColdFusion in Context: Instant Replay In Theory

In an effort to reduce communications requirements and to speed up the apparent response to users, browsers often display cached data. This data may come from intermediate caches or the browser itself. (The Web server can also be configured to cache pages, but that involves global controls that are outside the scope of this discussion.)

Suppose, however, that the user really needs to see new data on every visit to the page. If it's your bank statement, old data simply won't do. This memo explores theoretical ways to avoid an instant replay of old data.

Four Basic Approaches to Avoid Instant Replay

The following overview is simplified a bit. It omits some seldom-used headers. However, reviewing it will place the details that follow in a useful context. There are basically four approaches you can take to avoid instant replay: change the description of the data so cached copies aren't used; keep the data from being cached; immediately mark cached data as expired; and convince the caches to check for newer data AND convince them based on that check that the data they have is old.
Change the Description of the Data
One way to keep the user from seeing an instant replay is to change the URL in links and controls every time the page they are a part of is displayed, perhaps by adding extra path information or by varying a query string. This is the most reliable method. The pages will get cached when users visit them, but the cached copies won't be requested again and therefore won't be used.
Keep the Data from Being Cached
Another way to get fresh content each time is to keep the original content from being cached in the first place.
Immediately Mark Cached Data as Expired
Another way to get fresh content each time is to cause the caches and browser to mark the page as expired as soon as they get it.
Convince Caches to Check Data AND Consider It Old
The usual way to get fresh content each time is to cause the caches and browser to revalidate their copies of the data, i.e., compare the stored page header with the served page header when the user asks for the page again.

First, the caches have to decide that perhaps they should revalidate (check for newer data).

Next, the caches have to decide during revalidation that their copy isn't good enough. When the browser tries to revalidate, it passes a HEAD request or a conditional GET request through intermediate caches to (we hope) the Web server. The HEAD request asks only for headers, not content. (If the devices infer from the headers that the content has changed, they will then ask for the entire page in a separate GET request.) The conditional GET request performs both functions in one pass: if the page has been modified since the date of the cache's copy, then the entire page is provided at once.

Tool-by-tool Discussion

Here is an explanation of each of the tools at our disposal and why the preceding discussion is so full of "ifs" and "shoulds". References beginning with P are paragraphs from http://www.ietf.org/rfc/rfc2616.txt - Hypertext Transfer Protocol HTTP/1.1. References labeled Microsoft are from links formatted this way: http://support.microsoft.com/default.aspx?scid=kb;EN-US;{reference number}. See also http://www.ietf.org/rfc/rfc1945.txt for RFC 1945, the final Request for Comment that describes the theoretically obsolete but very much in force legacy HTTP/1.0.
Set a Query String or Extra Path
We can vary the URL by modifying the query string; this can be as simple (and as seemingly useless) as adding a random number to the query string of the URL and then ignoring that number. Because the browser and intermediate caches have no stored copy under the new URL, they should ask for a fresh page. This is the most reliable method of making sure that the user receives fresh content every time.

One side effect we noted while troubleshooting another person's code with extra path information is that if a meta refresh specifies a new destination as a partial URL, then on refresh, the partial URL will be CONCATENATED with the existing URL, and the browser will remain on the current page. Because documentation of the meta refresh capability states that full URLs should be used when refreshing to a different location, this will only be a problem with improper code. We can live with the other side effects: caches will fill with "use once" pages, and search engines probably won't index them.

Another side effect that has been reported is that some Web servers won't accept form data (via Post) if a URL query is present. This does not affect Microsoft IIS.

Still another side effect is that browsers that have large page caches set by default will fill those caches with completely useless pages (which will eventually be deleted automatically).

Here's an example showing how the URL might be easily modified so it is different nearly every time.

<cfoutput>
<a href="mypage?Nr=#rand()#">My page</a>
</cfoutput>

Or if we want to use extra path information instead of a query string...

<cfoutput>
<a href="mypage/Nr/#rand()#">My page</a>
</cfoutput>

... which may fill search engines with useless pages but otherwise should work.
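
As a variation on the examples above, the random number can be swapped for the createUUID function described later in this memo, which should repeat far less often than a random number might. This is only a sketch: the Nr parameter name and the link text are placeholders, and the receiving page is assumed to ignore the value.

```cfml
<!--- Sketch: Nr is a throwaway parameter that the receiving page ignores --->
<cfoutput>
<a href="mypage?Nr=#createUUID()#">My page</a>
</cfoutput>
```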

One caveat applies here. The links won't change if the user can press a back button to reach the page they're on, because when you go to a page using the back button, the page isn't reloaded to let ColdFusion make them different. To make the links different every time they are encountered, even if the page they're on is retrieved from history, you'll need a JavaScript function instead of a true link.
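
One way the back-button caveat might be handled is sketched below. The markup is plain HTML and script that ColdFusion passes through unchanged, so the URL is built in the browser at click time rather than on the server. The function name freshLink and the Nr parameter are hypothetical, and this has not been tested against any particular browser.

```cfml
<!--- Sketch: freshLink() is an illustrative name, not part of any library --->
<script type="text/javascript">
  // Build a different URL on every click, even when the page came from history
  function freshLink(page) {
    window.location.href = page + "?Nr=" + Math.random();
    return false; // cancel the anchor's default navigation
  }
</script>
<a href="mypage" onclick="return freshLink('mypage');">My page</a>
```

The plain href is kept as a fallback for browsers with scripting turned off; those users would get the unvarying URL.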

Set a Cookie
Intermediate servers are not supposed to cache cookies. Under HTTP/1.0, a response containing a "set-cookie" header tells the cache not to cache this object. However, under HTTP/1.1, a response containing a "set-cookie" header AND (cache-control: no-cache="set-cookie" or cache-control: private) tells the cache not to cache this object. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) This implies that cookies might be cached under HTTP/1.1 if cache-control headers are not used. The bottom line is that cookies by themselves should not be counted on to keep page content from being cached.
Post
Caches should go back to the original server when a form posts data to a page. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) However, it is not clear that caches really always do this.

P 9.5 says: "Responses to this method are not cacheable, UNLESS the response includes appropriate Cache-Control or Expires header fields [My emphasis]." This implies in theory that using cache-control and expires headers in a page to which the user will post form variables may actually work AGAINST us by allowing the page that receives the form variables to be cached when otherwise it might not be cached at all.

We could just omit cache-control headers for posted pages. However, the Web server may have a default that provides less-restrictive headers on its own, allowing the data to be cached for longer than we would like. It seems safest, therefore, to explicitly set headers for these pages to values we can live with, just as we would with other pages.

Configure the Browser
If the browser is set to retrieve fresh pages NEVER or ONCE PER SESSION, then it will probably ignore anything else we might do. To combat this, we could make a browser walk-through part of the account approval process. Otherwise, if the customer never calls, we won't know that this problem exists.

IE 5 and 6 are set to use HTTP/1.1 by default for direct connections to the Internet, but they are set to use HTTP/1.0 by default when proxy servers are used. To modify this setting, go to Tools..Internet Options..Advanced: Browsing: and set "use HTTP 1.1" and "use HTTP 1.1 through proxy connections". For maximum flexibility with all but some very old Web sites (still running only HTTP/1.0), we would like to have both boxes checked, but it's unlikely that our users will do this unless we tell them to.

To make matters worse, we can't assume that browser configurations will stay fixed even if every user follows these instructions. Many customers have large IT departments who push default browser settings to offices or entire departments without stopping to consider that most users don't want or need an instant replay of static data when they request dynamic pages.

Upgrade the Browser and Server?
IE 4.0 and Netscape 4.5 request objects using HTTP/1.0 format first (as of 1996), in case the Web server doesn't support HTTP/1.1. This means that many of the features of HTTP/1.1 are not available even though both the browser and the Web server actually support them. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.)

Some personal Web servers - mine included - still only support HTTP/1.0. This may cause a difference in behavior between the development and production environment.

As noted under browser configuration, newer browsers do use HTTP/1.1 by default if no proxy server is involved. Unfortunately, they use HTTP/1.0 by default when proxies are involved, and many customers have proxy servers.

Use SSL?
Multiple sources claim that secure content isn't cached, but in my experience, it is. According to browser documentation (below), the no-cache command is more likely to mean what the term implies if a secure connection rather than a non-secure connection is used, but in practice, it doesn't seem to matter.
Use Meta Tags?
Don't bother using meta tags to keep pages from being cached. Most caches don't read page content, so they won't see the meta tags contained in the content. Perhaps because caches won't look at meta tags, newer versions of IE ignore cache-control meta tags. (See Microsoft q234067.) Newer versions of IE will look at other kinds of meta tags; see below.
Use Header - Pragma: No-Cache
The pragma header is understood by both HTTP/1.0 and HTTP/1.1 caches. (See http://ietf.org/rfc/rfc1945.txt for HTTP/1.0.)

Older caches don't understand cache-control headers. However, if they see "pragma: no-cache", they are supposed to avoid retaining the data for re-use. In some references, Microsoft claims that IE will obey this directive and will also honor the pragma meta tag. (See Microsoft 189409, 172896, 165150.)

However, Microsoft also says that an older IE browser (pre-5.0) doesn't begin to put the page into cache until it has seen 64K. If it sees the tag before this point, it "removes" nothing from cache, then finishes storing the page: not quite what we want. The cure seems nothing short of bizarre: create two "head" sections in the page. That is to say, put the meta tag within a second (!) head tag placed after the body and before the html closes. (See Microsoft 222064.)

Further, Microsoft clarifies that if this is a non-secure connection (i.e., not https), IE will place the data in cache after all when it sees this pragma but will immediately mark the data as expired. (See Microsoft q234067.)

The bottom line seems to be that the best one can consistently hope for when using this header with IE is that the cached content is immediately marked as expired.

Multiple sources report that the "pragma: no-cache" header prevents Netscape Navigator from caching the page. A Boston University study says that this header tells the browser to validate the resource even if it has a cached copy.

To set the pragma header, simply say:

<cfheader name="pragma" value="no-cache">

The corresponding meta tag (that we won't bother with) looks like this...

<meta http-equiv="pragma" content="no-cache">
Use Header - Date: {GMT date/time in HTTP format}
The date header is understood by both HTTP/1.0 and HTTP/1.1 caches. (See http://ietf.org/rfc/rfc1945.txt for HTTP/1.0.) Most Web servers supply this automatically. If yours doesn't, then assuming we've converted the current time to Greenwich Mean Time (GMT), formatted it as an HTTP-compliant date/time stamp (such as Tue, 15 Nov 1994 08:12:31 GMT), and stored it to the variable "Today", we can set it in this manner:

<cfheader name="date" value="#Today#">
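
The "Today" variable assumed above has to come from somewhere. One possible way to build it is ColdFusion's GetHTTPTimeString function, which converts a date/time value to a GMT string in HTTP format. This is a sketch under that assumption, not necessarily how the variable was originally populated.

```cfml
<!--- Sketch: convert the current server time to an HTTP-format GMT string --->
<cfset Today = GetHTTPTimeString(Now())>
<!--- Send it as the date header, as in the example above --->
<cfheader name="date" value="#Today#">
```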
Use Header - Last-Modified: {date/time}
The last-modified header is understood by both HTTP/1.0 and HTTP/1.1 caches. (See http://ietf.org/rfc/rfc1945.txt for HTTP/1.0.)

Some proxies (such as Squid) assume an object is still current if it was modified very long ago and has been seen recently. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) This suggests that the last-modified date should be very recent to force such caches to revalidate.

The last-modified header is useful because it's supposed to tell a cache that doesn't understand Etags that the content has changed. If the last-modified header time has changed, the content of the URL has presumably changed. However, some caches will assume the resource is still good if the current time (on their machines) isn't at least 60 seconds after the last-modified header time. One way to get around this might be to set the last-modified header time to a few minutes earlier than the current moment. However, this seems risky if other caches do what the RFC seems to intend. If the expires and date headers aren't present at all, then the RFC says that caches "should" assume an expiration based on how old the last-modified time is. (See P 9.4; P 13.2.4; P 13.3.3.) America Online says it will store a page for 20% of its age (or 24 hours, whichever is shorter) if it sees a last-modified date without an expires date. (See http://webmaster.info.aol.com/caching.html for an enlightening discussion.)

If an object has no Etag or last-modified date, then one source postulates that the object should be reloaded. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) However, it seems more likely that if the browser can't confirm that something has changed, it won't get a fresh page.

Therefore, the key seems to be to set the last-modified date to the current GMT date/time and to set expires to an old date. Assuming we've formatted the current time as an HTTP-compliant time stamp and stored it to the variable "Today", set the last-modified date this way:

<cfheader name="last-modified" value="#Today#">
Use Header - Expires: {date/time}
The expires header is understood by both HTTP/1.0 and HTTP/1.1 caches. (See http://ietf.org/rfc/rfc1945.txt for HTTP/1.0.) If the expires date has passed, then the resource is stale and becomes a candidate for revalidation. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.)

Set the Expires date to a date/time in the past. Here's an example:

<cfheader name="expires"
  value="Mon, 06 Jan 2003 04:15:30 GMT">
Use Header - Etag: "{unique label}"
According to the HTTP/1.1 RFC, an entity tag header (i.e., Etag) is useful because, in theory, it's the strongest way to tell a modern cache that the content has changed. If the tag of the cached object doesn't match the tag of the object on the Web server, the content of the object has presumably changed. (See P 9.4; P 13.3.3; P 13.3.4; P 13.3.5.)

An automated summary (compiled via a Web robot) of educational sites in the United Kingdom reported that 40 percent of their HTML pages and 45 percent of their images (!) used the Etag header. (See http://www.ukoln.ac.uk/web-focus/webwatch/reports/hei-nov1998/ - Figure 10.)

However, because IE 4.0 and Netscape 4.5 request objects in HTTP/1.0 format first (as of 1996) in case the Web server doesn't support HTTP/1.1, the Etag (in theory) doesn't actually get used by these browsers. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) Further, if you check the settings for IE 5 and 6, you'll see that they specify HTTP/1.0 by default when a proxy server is involved (which is most of the time). So, these modern browsers don't consistently use the Etag either.

Through browser and server upgrades, it is likely that newer browsers will default to HTTP/1.1 someday and will eventually find the Etag useful. The ColdFusion createUUID function creates a unique identifier. Because an entity tag is sent as a quoted string, the double quotes are part of the header value. Set a unique etag with the following code:

<cfheader name="etag" value='"#createUUID()#"'>
Use Cache-Control Headers
Use Header - Cache-Control: Private
This should force a shared cache to revalidate subsequent requests for a resource (since shared caches are not supposed to store private data anyway). (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) Note that this doesn't stop the browser (which is a private cache) from caching the resource. Here's how to use it:

<cfheader name="cache-control" value="private">
Use Header - Cache-Control: No-Cache
This should force caches to revalidate subsequent requests. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) Here's how to use it:

<cfheader name="cache-control" value="no-cache">
Use Header - Cache-Control: No-Store
In theory, this should keep data from getting stored in the first place. In practice, this should at least force caches to revalidate subsequent requests. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) Here's how to use it:

<cfheader name="cache-control" value="no-store">
Use Header - Cache-Control: Must-Revalidate
The "cache-control: must-revalidate" directive should be useful because it is supposed to force modern caches to ask the chain of caches if their data is fresh. (See P 13.1.6.)

Without it, caches are free to press the boundaries a bit, continuing to serve content that may be old. (See P 13.2.1; P 13.8; P 14.9.3; P 14.4.)

The following tag generates this header:

<cfheader name="cache-control" value="must-revalidate">
Use Header - Cache-Control: Max-Age=0
This HTTP/1.1 directive says that the cached item is only expected to remain current for zero seconds. Here's how to use it:

<cfheader name="cache-control" value="max-age=0">
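
The directives above need not be sent as separate headers. HTTP/1.1 allows several cache-control directives in a single comma-separated header value, so a page could send them together. The particular combination below is an assumption chosen for illustration, not a recommendation; which directives a given browser or proxy actually honors still has to be tested.

```cfml
<!--- Sketch: several cache-control directives in one comma-separated header value --->
<cfheader name="cache-control" value="private, no-cache, no-store, must-revalidate, max-age=0">
<!--- HTTP/1.0 caches ignore cache-control, so the older headers are sent as well --->
<cfheader name="pragma" value="no-cache">
<cfheader name="expires" value="Mon, 06 Jan 2003 04:15:30 GMT">
```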

Conclusion

First, it's worth noting that caches often do the right thing from your standpoint. The previous discussion notwithstanding, caches usually somehow infer that the page is different. Therefore, you may decide not to take any special action unless you've detected a problem. (Of course, learning that you have a problem can be difficult when your browser lies.)

The most straightforward way to get a fresh page to the user every time (well, almost every time) is to add a random number to each URL, or better yet, a short string that's unique from the client perspective. This is appropriate for all pages that should not be cached except for pages that the user would want to bookmark. Fix code that uses a partial URL to refresh to a different page; the URL should be a full one anyway. As noted above, if the user can reach the links or controls via a back button, use a JavaScript function instead of a true link to change the URL.

For pages that the user would want to bookmark and for which the browser will therefore always use the same URL, the preceding theoretical discussion can help you decide which header combinations to test. You'll have to do this testing yourself. The experts have agreed to disagree, and recommendations based solely on conflicting theory are risky.
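
A small test page can make that testing easier. The sketch below sets whichever headers are under test, prints the time the page was generated, and reports whether the request arrived with an If-Modified-Since header (that is, as a conditional GET). If the displayed time doesn't change on a repeat visit, a cached copy was served without reaching ColdFusion. The two headers shown are only an example combination, and the variable name requestData is arbitrary.

```cfml
<!--- Sketch of a test page; swap in the header combination being tested --->
<cfheader name="cache-control" value="no-cache">
<cfheader name="expires" value="Mon, 06 Jan 2003 04:15:30 GMT">

<!--- Headers as received from the browser or the last proxy in the chain --->
<cfset requestData = GetHttpRequestData()>

<cfoutput>
<p>Generated at #TimeFormat(Now(), "HH:mm:ss")#</p>
<cfif StructKeyExists(requestData.headers, "If-Modified-Since")>
  <p>Conditional request: If-Modified-Since = #HTMLEditFormat(requestData.headers["If-Modified-Since"])#</p>
<cfelse>
  <p>Unconditional request</p>
</cfif>
</cfoutput>
```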

Major HTTP/1.1 References by Paragraph

Paragraph 9.4, Head, says:

Paragraph 9.5, Post, says:

Paragraph 13.1.6, Client-Controlled Behavior, says:

Paragraph 13.2.2, Heuristic Expiration, says:

Paragraph 13.2.3, Age Calculations, says:

Paragraph 13.2.4, Expiration Calculations, says:

Paragraph 13.3.3, Weak and Strong Validators, says:

Paragraph 13.3.4, Rules for When to Use Entity Tags and Last-Modified Dates, says:

Paragraph 13.3.5, Non-validating Conditionals, says:

Paragraph 13.5.2, Non-modifiable Headers, says:

Paragraph 13.9, Side Effects of GET and HEAD, says:

Paragraph 13.10, Invalidation After Updates or Deletions, says:

Paragraph 13.11, Write-Through Mandatory, says:

Paragraph 13.12, Cache Replacement, says:

Paragraph 14.9, Cache-Control, says:

Paragraph 14.9.1, What is Cacheable, says:

Paragraph 14.9.2, What May be Stored by Caches, says:

Paragraph 14.9.3, Modifications of the Basic Expiration Mechanism, says:

Paragraph 14.9.4, Cache Revalidation and Reload Controls, says:

Paragraph 14.18, Date, says:

Paragraph 14.19, Etag, says:

Paragraph 14.21, Expires, says:

Paragraph 15.1.3, Encoding Sensitive Information in URIs, says:

Experimentation

The gap between theory and practice can be narrowed through experimentation. Ultimately, the only way to be sure what works is to try something. Do so, and tell us what you've learned. Make unwanted instant replay a thing of the past. =Marty=