ColdFusion in Context: Instant Replay In Theory
In an effort to reduce communications requirements and to speed up the apparent response to users, browsers often display cached data. This data may come from intermediate caches or the browser itself. (The Web server can also be configured to cache pages, but that involves global controls that are outside the scope of this discussion.)
Suppose, however, that the user really needs to see new data on every visit to the page. If it's your bank statement, old data simply won't do. This memo explores theoretical ways to avoid an instant replay of old data.
Four Basic Approaches to Avoid Instant Replay
The following overview is simplified a bit. It omits some seldom-used headers. However, reviewing it will place the details that follow in a useful context. There are basically four approaches you can take to avoid instant replay: change the description of the data so cached copies aren't used; keep the data from being cached; immediately mark cached data as expired; and convince the caches to check for newer data AND convince them based on that check that the data they have is old.
Change the Description of the Data
One way to keep the user from seeing an instant replay is to change the URL in links and controls every time the page they are a part of is displayed, perhaps by adding extra path information or by varying a query string. This is the most reliable method. The pages will get cached when users visit them, but the cached copies won't be requested again and therefore won't be used.
Keep the Data from Being Cached
Another way to get fresh content each time is to keep the original content from being cached in the first place.
- Posted data isn't supposed to be cached, but beware: the use of cache-control headers might cause it to be cached.
- Pages containing cookies aren't supposed to be cached, but omitting certain headers may cause them to be cached anyway.
- Caches aren't supposed to cache secure content, but they do anyway.
- An HTTP/1.0 exchange can send and understand "pragma: no-cache" and sometimes honors it by not placing data in cache. Whether and how it honors it can depend on whether the connection is secure and whether the device sees this directive before or after it has actually started to store content. If the connection is secure and the directive is seen after content storage has begun, then documentation indicates that the data won't be cached in an IE browser.
- An HTTP/1.1 exchange can also send and understand "cache-control: private" which should tell intermediate (public) caches (but not the browser) not to store this material for later use; "cache-control: no-store" which means what it sounds like; and "cache-control: no-cache" which may affect storage or subsequent use of the data.
Immediately Mark Cached Data as Expired
Another way to get fresh content each time is to cause the caches and browser to mark the page as expired as soon as they get it.
- In an HTTP/1.0 exchange, "pragma: no-cache" may have this effect, especially when combined with an "expires: {Greenwich Mean Time (GMT) date/time in HTTP format}" that is the same as "date: {GMT date/time in HTTP format}" and when used for non-secure content. (Date examples will follow later.)
- In an HTTP/1.1 exchange, "cache-control: no-cache" may also have this effect for all uses except history traversal (e.g., the back button).
Convince Caches to Check Data AND Consider It Old
The usual way to get fresh content each time is to cause the caches and browser to revalidate their copies of the data, i.e., compare the stored page header with the served page header when the user asks for the page again.
First, the caches have to decide that perhaps they should revalidate (check for newer data).
- In an HTTP/1.0 exchange, devices may notice that "expires: {GMT date/time in HTTP format}" has passed the device clock time, or they may infer an expiration date from "last-modified: {GMT date/time in HTTP format}" if the expires header is not present.
- In an HTTP/1.1 exchange, they may also notice "cache-control: max-age=0" or the presence of "cache-control: must-revalidate". The latter directive tells them not to bend the rules in favor of using older content but to revalidate if other headers imply that the data might not be current.
Next, the caches have to decide during revalidation that their copy isn't good enough. When the browser tries to revalidate, it passes a head request or a conditional get request through intermediate caches to (hopefully) the Web server. The head request asks only for headers, not content. (If the devices infer from the headers that the content has changed, they will then ask for the entire page in a separate get request.) The conditional get request performs both functions in one pass. If the page has been modified since the date of the cache's copy, then the entire page is provided at once.
- In an HTTP/1.0 exchange, these requests compare the copied and current "content-length: {bytes}" and "last-modified: {Greenwich Mean Time (GMT) date/time in HTTP format}". If one or both have changed, the entire document should be reloaded.
- In an HTTP/1.1 exchange, these requests also compare the "etag: {value}" (if any). If the etag is the same, then in theory the document will not get reloaded.
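Whether a particular browser or proxy really revalidates is easier to judge when the server records what it was asked. The following is a minimal diagnostic sketch, not part of the original memo. The log file name "cachetest" and the variable names are assumptions chosen for illustration.

```cfml
<!--- Hypothetical diagnostic: record whether this request was a conditional one. --->
<cfset req = GetHttpRequestData()>
<cfset reqHeaders = req.headers>
<cfset ims = "">
<cfset inm = "">
<!--- If-Modified-Since carries the last-modified time of the cache's copy --->
<cfif StructKeyExists(reqHeaders, "If-Modified-Since")>
  <cfset ims = reqHeaders["If-Modified-Since"]>
</cfif>
<!--- If-None-Match carries the etag of the cache's copy (HTTP/1.1 only) --->
<cfif StructKeyExists(reqHeaders, "If-None-Match")>
  <cfset inm = reqHeaders["If-None-Match"]>
</cfif>
<cflog file="cachetest" text="#CGI.REQUEST_METHOD# #CGI.SCRIPT_NAME# If-Modified-Since=[#ims#] If-None-Match=[#inm#]">
```

If a page view produces no log entry at all, the request never reached ColdFusion, which suggests that a cache answered it without revalidating.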
Tool-by-tool Discussion
Here is an explanation of each of the tools at our disposal and why the preceding discussion is so full of "ifs" and "shoulds". References beginning with P are paragraphs from http://www.ietf.org/rfc/rfc2616.txt - Hypertext Transfer Protocol HTTP/1.1. References labeled Microsoft are from links formatted this way: http://support.microsoft.com/default.aspx?scid=kb;EN-US;{reference number}. See also http://www.ietf.org/rfc/rfc1945.txt for RFC 1945, the final Request for Comment that describes the theoretically obsolete but very much in force legacy HTTP/1.0.
Set a Query String or Extra Path
We can vary the URL by modifying the query string; this can be as simple (and as seemingly useless) as adding a random number to the query string of the URL and then ignoring that number. If the browser and intermediate servers don't remember the entire URL, they should ask for a fresh page. This is the most reliable method of making sure that the user receives fresh content every time.
One side effect we noted while troubleshooting another person's code with extra path information is that if a meta refresh specifies a new destination as a partial URL, then on refresh, the partial URL will be CONCATENATED with the existing URL, and the browser will remain on the current page. Because documentation of the meta refresh capability states that full URLs should be used when refreshing to a different location, this will only be a problem with improper code. We can live with the other side effects: caches will fill with "use once" pages, and search engines probably won't index them.
Another side effect that has been reported is that some Web servers won't accept form data (via Post) if a URL query is present. This does not affect Microsoft IIS.
Still another side effect is that browsers that have large page caches set by default will fill those caches with completely useless pages (which will eventually be deleted automatically).
Here's an example showing how the URL might be easily modified so it is different nearly every time.
<cfoutput>
<a href="mypage?Nr=#rand()#">
</cfoutput>
Or if we want to use extra path information instead of a query string...
<cfoutput>
<a href="mypage/Nr/#rand()#">
</cfoutput>
... which may fill search engines with useless pages but otherwise should work.
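The same idea can be sketched with a value that is unique rather than merely random, using the ColdFusion createUUID function. This is an illustrative variation only. The page name "mypage.cfm" and the link text are placeholders.

```cfml
<cfoutput>
<!--- mypage.cfm is a hypothetical target; the receiving page simply ignores Nr --->
<a href="mypage.cfm?Nr=#createUUID()#">My page</a>
</cfoutput>
```

A random number can repeat now and then. A UUID should not, at the cost of a longer URL.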
One caveat applies here. The links won't change if the user presses the back button to reach the page they're on, because a page reached through the back button isn't reloaded, so ColdFusion has no chance to make them different. To make the links different every time they are encountered, even if the page they're on is retrieved from history, you'll need a JavaScript function instead of a true link.
Set a Cookie
Intermediate servers are not supposed to cache cookies. Under HTTP/1.0, a header containing "set-cookie" tells the cache not to cache this object. However, under HTTP/1.1, it takes a header containing "set-cookie" AND (cache-control: no-cache="set-cookie" or cache-control: private) to tell the cache not to cache this object. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) This implies that cookies might be cached under HTTP/1.1 if cache-control headers are not used. The bottom line is that cookies by themselves should not be counted on to keep page content from being cached.
Post
Caches should go back to the original server when a form posts data to a page. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) However, it is not clear that caches really always do this.
P 9.5 says: "Responses to this method are not cacheable, UNLESS the response includes appropriate Cache-Control or Expires header fields [My emphasis]." This implies in theory that using cache-control and expires headers in a page to which the user will post form variables may actually work AGAINST us by allowing the page that receives the form variables to be cached when otherwise it might not be cached at all.
We could just omit cache-control headers for posted pages. However, the Web server may have a default that provides less-restrictive headers on its own, allowing the data to be cached for longer than we would like. It seems safest, therefore, to explicitly set headers for these pages to values we can live with, just as we would with other pages.
Configure the Browser
If the browser is set to retrieve fresh pages NEVER or ONCE PER SESSION, then it will probably ignore anything else we might do. To combat this, we could make a browser walk-through part of the account approval process. Otherwise, if the customer never calls, we won't know that this problem exists.
IE 5 and 6 are set to use HTTP/1.1 by default for direct connections to the Internet, but they are set to use HTTP/1.0 by default when proxy servers are used. To modify this setting, go to Tools..Internet Options..Advanced: Browsing: and set "use HTTP 1.1" and "use HTTP 1.1 through proxy connections". For maximum flexibility with all but some very old Web sites (still running only HTTP/1.0), we would like to have both boxes checked, but it's unlikely that our users will do this unless we tell them to.
To make matters worse, we can't assume that browser configurations will stay fixed even if every user follows these instructions. Many customers have large IT departments who push default browser settings to offices or entire departments without stopping to consider that most users don't want or need an instant replay of static data when they request dynamic pages.
Upgrade the Browser and Server?
IE 4.0 and Netscape 4.5 request objects using HTTP/1.0 format first (as of 1996), in case the Web server doesn't support HTTP/1.1. This means that many of the features of HTTP/1.1 are not available even though both the browser and the Web server actually support them. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.)
Some personal Web servers - mine included - still only support HTTP/1.0. This may cause a difference in behavior between the development and production environment.
As noted under browser configuration, newer browsers do use HTTP/1.1 by default if no proxy server is involved. Unfortunately, they use HTTP/1.0 by default when proxies are involved, and many customers have proxy servers.
Use SSL?
Multiple sources claim that secure content isn't cached, but in my experience, it is. According to browser documentation (below), the no-cache command is more likely to mean what the term implies if a secure connection rather than a non-secure connection is used, but in practice, it doesn't seem to matter.
Use Meta Tags?
Don't bother using meta tags to keep pages from being cached. Most caches don't read page content, so they won't see the meta tags contained in it. Perhaps because caches won't look at meta tags, newer versions of IE ignore cache-control meta tags. (See Microsoft q234067.) Newer versions of IE will look at other kinds of meta tags; see below.
Use Header - Pragma: No-Cache
The pragma header is understood by both HTTP/1.0 and HTTP/1.1 caches. (See http://ietf.org/rfc/rfc1945.txt for HTTP/1.0.)
Older caches don't understand cache-control headers. However, if they see "pragma: no-cache", they are supposed to avoid retaining the data for re-use. In some references, Microsoft claims that IE will obey this directive and will also honor the pragma meta tag. (See Microsoft 189409, 172896, 165150.)
However, Microsoft also says that an older IE browser (pre-5.0) doesn't begin to put the page into cache until it has seen 64K. If it sees the tag before this point, it "removes" nothing from cache, then finishes storing the page: not quite what we want. The cure seems nothing short of bizarre: create two "head" sections in the page. That is to say, put the meta tag within a second (!) head tag placed after the body and before the html closes. (See Microsoft 222064.)
Further, Microsoft clarifies that if this is a non-secure connection (i.e., not https), IE will place the data in cache after all when it sees this pragma but will immediately mark the data as expired. (See Microsoft q234067.)
The bottom line seems to be that the best one can consistently hope for when using this header with IE is that the cached content is immediately marked as expired.
Multiple sources report that the "pragma: no-cache" header prevents Netscape Navigator from caching the page. A Boston University study says that this header tells the browser to validate the resource even if it has a cached copy.
To set the pragma header, simply say:
<cfheader name="pragma" value="no-cache">
The corresponding meta tag (that we won't bother with) looks like this...
<meta http-equiv="pragma" content="no-cache">
Use Header - Date: {GMT date/time in HTTP format}
The date header is understood by both HTTP/1.0 and HTTP/1.1 caches. (See http://ietf.org/rfc/rfc1945.txt for HTTP/1.0.) Most Web servers supply this automatically. If yours doesn't, then assuming we've converted the current time to Greenwich Mean Time (GMT), formatted it as an HTTP-compliant date/time stamp (such as Tue, 15 Nov 1994 08:12:31 GMT), and stored it to the variable "Today", we can set it in this manner:
<cfheader name="date" value="#Today#">
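One way to build the "Today" variable is sketched below. It assumes the built-in GetHttpTimeString function, which converts a local date/time to GMT and formats it as an HTTP date. The variable name simply follows the convention used in this memo.

```cfml
<!--- Convert the server's current local time to an HTTP-format GMT time stamp --->
<cfset Today = GetHttpTimeString(Now())>
<!--- Produces a value shaped like: Tue, 15 Nov 1994 08:12:31 GMT --->
<cfheader name="date" value="#Today#">
```

Because the conversion starts from the server's local clock, the result is only as accurate as that clock and its time zone setting.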
Use Header - Last-Modified: {date/time}
The last-modified header is understood by both HTTP/1.0 and HTTP/1.1 caches. (See http://ietf.org/rfc/rfc1945.txt for HTTP/1.0.)
Some proxies (such as Squid) assume an object is still current if it was modified very long ago and has been seen recently. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) This suggests that the last-modified date should be very recent to force such caches to revalidate.
The last-modified header is useful because it's supposed to tell a cache that doesn't understand Etags that the content has changed. If the last-modified header time has changed, the content of the URL has presumably changed. However, some caches will assume the resource is still good if the current time (on their machines) isn't at least 60 seconds after the last-modified header time. One way to get around this might be to set the last-modified header time to a few minutes earlier than the current moment. However, this seems risky if other caches do what the RFC seems to intend. If the expires and date headers aren't present at all, then the RFC says that caches "should" assume an expiration based on how old the last-modified time is. (See P 9.4; P 13.2.4; P 13.3.3.) America On-Line says it will store a page for 20% of its age (or 24 hours, whichever is shorter) if it sees a last-modified date without an expires date. (See http://webmaster.info.aol.com/caching.html for an enlightening discussion.)
If an object has no Etag or last-modified date, then one source postulates that the object should be reloaded. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) However, it seems more likely that if the browser can't confirm that something has changed, it won't get a fresh page.
Therefore, the key seems to be to set the last-modified date to the current GMT date/time and to set expires to an old date. Assuming we've formatted the current time as an HTTP-compliant time stamp and stored it to the variable "Today", set the last-modified date this way:
<cfheader name="last-modified" value="#Today#">
Use Header - Expires: {date/time}
The expires header is understood by both HTTP/1.0 and HTTP/1.1 caches. (See http://ietf.org/rfc/rfc1945.txt for HTTP/1.0.) If the expires date has passed, then the resource is stale and becomes a candidate for revalidation. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.)
Set the Expires date to a date/time in the past. Here's an example:
<cfheader name="expires"
value="Mon, 06 Jan 2003 04:15:30 GMT">
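A hard-coded date works, but the past date can also be computed on each request. The sketch below assumes the built-in DateAdd and GetHttpTimeString functions. The variable name "Yesterday" and the one-day offset are arbitrary choices for illustration.

```cfml
<!--- Subtract one day from the current time, then format it as an HTTP GMT date --->
<cfset Yesterday = GetHttpTimeString(DateAdd("d", -1, Now()))>
<cfheader name="expires" value="#Yesterday#">
```

A computed value stays in the past relative to the server's clock. A device whose own clock is badly wrong could still disagree about whether the date has passed.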
Use Header - Etag: "{unique label}"
According to the HTTP/1.1 RFC, an entity tag header (i.e., Etag) is useful because, in theory, it's the strongest way to tell a modern cache that the content has changed. If the tag of the cached object doesn't match the tag of the object on the Web server, the content of the object has presumably changed. (See P 9.4; P 13.3.3; P 13.3.4; P 13.3.5.)
An automated summary (compiled via a Web robot) of educational sites in the United Kingdom reported that 40 percent of their HTML pages and 45 percent of their images (!) used the Etag header. (See http://www.ukoln.ac.uk/web-focus/webwatch/reports/hei-nov1998/ - Figure 10.)
However, because IE 4.0 and Netscape 4.5 request objects in HTTP/1.0 format first (as of 1996) in case the Web server doesn't support HTTP/1.1, the Etag (in theory) doesn't actually get used by these browsers. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) Further, if you check the settings for IE 5 and 6, you'll see that they specify HTTP/1.0 by default when a proxy server is involved (which is most of the time). So, these modern browsers don't consistently use the Etag either.
Through browser and server upgrades, it is likely that newer browsers will default to HTTP/1.1 someday and will eventually find the Etag useful. The ColdFusion createUUID function creates a unique identifier. Set a unique etag with the following code (the RFC defines the tag as a quoted string, so the quotation marks are part of the value):
<cfheader name="etag" value='"#createUUID()#"'>
Use Cache-Control Headers
Use Header - Cache-Control: Private
This should force a shared cache to revalidate subsequent requests for a resource (since they're not supposed to be storing private data anyway). (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) Note that this doesn't stop the browser (which is a private cache) from caching the resource. Here's how to use it:
<cfheader name="cache-control" value="private">
Use Header - Cache-Control: No-Cache
This should force caches to revalidate subsequent requests. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) Here's how to use it:
<cfheader name="cache-control" value="no-cache">
Use Header - Cache-Control: No-Store
In theory, this should keep data from getting stored in the first place. In practice, this should at least force caches to revalidate subsequent requests. (See http://www.cs.bu.edu/techreports/pdf/2000-019-web-cachability.pdf from Boston University.) Here's how to use it:
<cfheader name="cache-control" value="no-store">
Use Header - Cache-Control: Must-Revalidate
The "cache-control: must-revalidate" directive should be useful because it is supposed to force modern caches to ask the chain of caches if their data is fresh. (See P 13.1.6.)
Without it, caches are free to press the boundaries a bit, continuing to serve content that may be old. (See P 13.2.1; P 13.8; P 14.9.3; P 14.4.)
Here's how to use it:
<cfheader name="cache-control" value="must-revalidate">
Use Header - Cache-Control: Max-Age=0
This HTTP/1.1 directive says that the cached item is only expected to remain current for zero seconds. Here's how to use it:
<cfheader name="cache-control" value="max-age=0">
Conclusion
First, it's worth noting that caches often do the right thing from your standpoint. The previous discussion notwithstanding, caches usually somehow infer that the page is different. Therefore, you may decide not to take any special action unless you've detected a problem. (Of course, learning that you have a problem can be difficult when your browser lies.)
The most straightforward way to get a fresh page to the user every time (well, almost every time) is to add a random number to each URL, or better yet, a short string that's unique from the client perspective. This is appropriate for all pages that should not be cached except for pages that the user would want to bookmark. Fix code that uses a partial URL to refresh to a different page; the URL should be a full one anyway. As noted above, if the user can reach the links or controls via a back button, use a javascript function instead of a true link to change the URL.
For pages that the user would want to bookmark and for which the browser will therefore always use the same URL, the preceding theoretical discussion can help you decide which header combinations to test. You'll have to do this testing yourself. The experts have agreed to disagree, and recommendations based solely on conflicting theory are risky.
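As a starting point for that testing, the headers discussed above might be combined as in the sketch below. This is one candidate combination under the assumptions stated in this memo, not a verified recipe. The variable names are illustrative, and the cache-control directives are sent as a single comma-separated header value.

```cfml
<!--- Candidate combination to test; adjust or remove headers based on your results --->
<cfset Today = GetHttpTimeString(Now())>
<cfset Yesterday = GetHttpTimeString(DateAdd("d", -1, Now()))>
<!--- For HTTP/1.0 caches: pragma plus an expires date in the past --->
<cfheader name="pragma" value="no-cache">
<cfheader name="expires" value="#Yesterday#">
<!--- A current last-modified time so the cached copy looks older than the served page --->
<cfheader name="last-modified" value="#Today#">
<!--- For HTTP/1.1 caches: several directives in one header --->
<cfheader name="cache-control" value="no-store, must-revalidate, max-age=0">
```

Placing these tags in a file that every non-cacheable page includes keeps the combination in one place, so a change discovered through testing has to be made only once.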
Major HTTP/1.1 References by Paragraph
Paragraph 9.4, Head, says:
- Servers often test links with a head request to revalidate them. The head request is a shortcut for validation and does not return the body. "If the new field values indicate that the cached entity differs from the current entity (as would be indicated by a change in Content-Length, Content-MD5, ETag or Last-Modified), then the cache MUST treat the cache entry as stale." [Of these fields, Last-Modified is the easiest for us to supply.]
Paragraph 9.5, Post, says:
- "Responses to this method are not cacheable, unless the response includes appropriate Cache-Control or Expires header fields." [The standard implies here that using cache-control and expires headers in a page to which the user will post form variables - an action page - may actually work AGAINST us by allowing the page that receives the form variables to be cached.]
Paragraph 13.1.6, Client-Controlled Behavior, says:
- "cache-control: must-revalidate" is a preferred method of forcing caches to validate every request.
- "Servers specify explicit expiration times using either the Expires header, or the max-age directive of the Cache-Control header." The cache is not required to reload expired data unless a user asks for it again and then only if the cache believes that the data has really changed.
Paragraph 13.2.2, Heuristic Expiration, says:
- If an Expires value is not available, servers usually use other header values to estimate an expiration time.
- "We encourage origin servers to provide explicit expiration times as much as possible."
Paragraph 13.2.3, Age Calculations, says:
- "HTTP/1.1 requires origin servers to send a Date header, if possible, with every response, giving the time at which the response was generated (see section 14.18)."
Paragraph 13.2.4, Expiration Calculations, says:
- "The max-age directive [and probably also s-maxage] takes priority over Expires..."
- Freshness is the max-age value. s-maxage is the same thing as max-age, but for shared caches only. If these are not present, freshness is the expires value minus the date value.
- If expires and date are not present either, then the cache "should" assume an expiration based on the relationship between the current date and the Last-Modified date.
Paragraph 13.3.3, Weak and Strong Validators, says:
- Entity tags are strong validators by default.
- If the last-modified time is not at least 60 seconds before the date value - some caches may specify an even longer period - then the cache might treat it as a weak validator.
Paragraph 13.3.4, Rules for When to Use Entity Tags and Last-Modified Dates, says:
- "The preferred behavior for an HTTP/1.1 origin server is to send both a strong entity tag and a Last-Modified value."
- "HTTP/1.0 clients and caches will ignore entity tags. Generally, last-modified values received or used by these systems will support transparent and efficient caching."
Paragraph 13.3.5, Non-validating Conditionals, says:
- Only entity tags (and last-modified headers, in the case of HTTP/1.0) are used to validate a cache entry.
Paragraph 13.5.2, Non-modifiable Headers, says:
- A transparent proxy may add an Expires header equal to the Date header in that response.
Paragraph 13.9, Side Effects of GET and HEAD, says:
- "Since some applications have traditionally used GETs and HEADs with query URLs (those containing a '?' in the rel_path part) to perform operations with significant side effects, caches MUST NOT treat responses to such URIs as fresh unless the server provides an explicit expiration time. This specifically means that responses from HTTP/1.0 servers for such URIs SHOULD NOT be taken from a cache." This appears to mean that servers should not cache query URLs unless the explicit expiration time says the copy is probably still valid.
Paragraph 13.10, Invalidation after Updates and Deletions, says:
- Caches should mark data obtained as the result of a post operation in a manner to require its mandatory revalidation before re-use. This doesn't say it won't be re-used. It says the cache shall revalidate it first. If the page looks the same as one it already has, this directive by itself would not prevent the data from being used.
Paragraph 13.11, Write-Through Mandatory, says:
- All methods except for get and head must be written through to the origin server. This implies that post requests must be passed completely through the chain.
Paragraph 13.12, Cache Replacement, says:
- The cache may store any response that is at least as new as ones it currently has. It may use any of this data in its responses. Unless the response is "not modified" or "partial", the cache does not have to provide its newest copy! Caches are free to fill with data they don't need and to provide data that isn't their newest.
Paragraph 14.9, Cache-Control, says:
- HTTP/1.0 clients might not implement cache-control; they might only implement pragma: no-cache.
Paragraph 14.9.1, What is Cacheable, says:
- "cache-control: private" says that part of the data is intended for a single user. I suspect it also controls the scope of other cache-control directives.
- "cache-control: public" says that the data may be kept in a shared cache. I suspect it also controls the scope of other cache-control directives.
- "cache-control: no-cache" is not recognized or obeyed by most HTTP/1.0 caches.
Paragraph 14.9.2, What May be Stored by Caches, says:
- "cache-control: no-store" tells caches not to permanently store this data except in a history buffer.
Paragraph 14.9.3, Modifications of the Basic Expiration Mechanism, says:
- "cache-control: max-age=0" says "the client is not willing to accept a stale response".
- The max-age directive trumps the expires directive, but older caches don't understand it.
- If the expires header matches the date header, then many HTTP/1.0 caches will treat the data as non-cacheable. If this match occurs and there is no cache-control header, then HTTP/1.1 caches "should" also treat the data as non-cacheable for compatibility with HTTP/1.0 caches.
Paragraph 14.9.4, Cache Revalidation and Reload Controls, says:
- "cache-control: no-cache" requests end-to-end reload for current clients. It says that this should not be combined with "cache-control: max-age=0". [However, IE may listen to the max-age setting better than it does to the no-cache setting.]
- "pragma: no-cache" requests end-to-end reload for HTTP/1.0 clients.
- "cache-control: max-age=0" requests specific end-to-end revalidation. (This is also mentioned in
- HTTP/1.1 clients must obey the "cache-control: must-revalidate" directive.
Paragraph 14.18, Date, says:
- "date: {UTC date}" says when the resource was created. The date of creation is useful when combined with the expired date. See below.
Paragraph 14.19, Etag, says:
- "etag: "{string}"" says that if the tag has changed, the resource has changed and should be reloaded completely.
Paragraph 14.21, Expires, says:
- "expires: {UTC date}" says when the resource is expected to change.
- "To mark a response as 'already expired,' an origin server sends an Expires date that is EQUAL [my emphasis] to the Date header value."
Paragraph 15.1.3, Encoding Sensitive Information in URIs, says:
- "Clients SHOULD NOT include a Referer header field in a (non-secure) HTTP request if the referring page was transferred with a secure protocol." This implies that if we use a non-secure help application to augment a secure operational application that the help application should not expect to use the presence and content of a referer header to enforce navigation within the help application.
Experimentation
The gap between theory and practice can be narrowed through experimentation. Ultimately, the only way to be sure what works is to try something. Do so, and tell us what you've learned. Make unwanted instant replay a thing of the past. =Marty=