Based on recent experience with an indexing robot, I discovered a critical architecture flaw in the way I set up my calendar for viewing Log Entries by Day. Out of a bit of laziness, and because I liked the cool factor of being able to go to any possible date between the years 1000 and 9999, I didn’t originally include stops at each end of the archive (earliest post and latest post). The calendar allows a user to tab through each month, one at a time, displaying the dates for which entries are available as links. Nothing special here.
Until recently, the calendar allowed navigation beyond my earliest and latest entries. Way beyond. When a value is provided in the query string for a date which has no entry, I return a friendly error message stating no entries were found for that period, and provide suggestions for where to go next. But the calendar still rendered, and still allowed navigation in the same direction. I thought it was an interesting “leave-in” to be able to go further back in time (or forward into the future). I knew human behavior would eventually stop at some point, realizing no more entries existed beyond a certain date.
But robot behavior doesn’t match human behavior. A robot will follow every single link on a site, and won’t stop as long as it continues to arrive at an existing page containing more links to follow. It won’t use human logic to realize it has crawled to a date beyond my first entry. Nor would it stop once it reaches a date in the future for which entries don’t yet exist.
Without my own stops in place on either side, an indexing robot could follow links to every possible date combination, regardless of whether an entry is present or not. On each of these pages, it would receive (and index) my friendly error message, then keep following the links on that page, which carry it even further away from my actual entries. That’s exactly what happened.
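To make the runaway crawl concrete, here is a toy model in Python. It is only a sketch under stated assumptions: months are plain integers (year * 12 + month - 1), the robot always follows the first unvisited link on a page, and the names `links_without_stops`, `links_with_stops`, and `crawl` are made up for illustration. Real crawlers schedule their visits in more complicated ways.

```python
def links_without_stops(month):
    # Every calendar page links to both neighbours, entry or no entry.
    # "Previous Month" comes first, as it did on the page.
    return [month - 1, month + 1]


def links_with_stops(month, earliest, latest):
    # Hypothetical stops: drop a link once the month has reached
    # or gone beyond the archive boundary in that direction.
    links = []
    if month > earliest:
        links.append(month - 1)
    if month < latest:
        links.append(month + 1)
    return links


def crawl(start, links_for, page_limit):
    """Follow the first unvisited link on each page until the
    page limit is hit or there is nothing left to follow."""
    seen = set()
    stack = [start]
    visited = []
    while stack and len(visited) < page_limit:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        visited.append(page)
        # Push in reverse so the first link on the page is followed first.
        for target in reversed(links_for(page)):
            if target not in seen:
                stack.append(target)
    return visited


start = 1970 * 12  # January 1970 as a month index

# Without stops the robot only quits when it exhausts its page budget.
runaway = crawl(start, links_without_stops, page_limit=1000)

# With stops (made-up archive bounds) it runs out of links instead.
earliest, latest = 2001 * 12, 2003 * 12
bounded = crawl(start, lambda m: links_with_stops(m, earliest, latest),
                page_limit=1000)
```

In this model the unbounded crawl walks backwards one month at a time and uses up its whole page budget, while the bounded crawl walks forward into the archive and then stops for lack of links.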
For some reason, this recent robot started indexing my /archive/ section with a date value somewhere in the year 1970. (Don’t know why it chose this year, since it wasn’t a particularly good one, and I wasn’t born yet.) Since the “Previous Month” link appeared before the “Next Month” link, it followed the previous link first, happily crawling along backwards, one month at a time, indexing each of my identical friendly error pages, all the way back to the year 1542, where it hit its limit on the number of pages to index on my domain. Of course, my traffic numbers were completely skewed that day.
Adding some conditional stops, which remove the navigation in a given direction once the date has reached or gone beyond the relevant stop, solved the problem. I still allow date values in the query string for which there are no available entries. This compensates for a possible typo when linking to a specific date of this weblog. Yes, you could, in theory, use my site to determine on what day of the week John Travolta’s birthday would have fallen in the year 1236. Or for that matter, if you’re considering cryogenics, you could also check any date into the future [“the future, Conan?”] all the way to the year 9999.
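For the curious, the stop logic boils down to something like the following Python sketch. This is not the code the site actually ran: the function names and the archive bounds (`EARLIEST`, `LATEST`) are invented for illustration, and the real first and last post dates are assumptions here.

```python
from datetime import date

# Hypothetical archive bounds; the real dates are not stated in this post.
EARLIEST = date(2001, 3, 1)
LATEST = date(2003, 5, 1)


def shift_month(d, delta):
    """Return the first day of the month `delta` months away from d."""
    index = d.year * 12 + (d.month - 1) + delta
    return date(index // 12, index % 12 + 1, 1)


def month_nav(current, earliest=EARLIEST, latest=LATEST):
    """Decide which navigation links the calendar should render.

    A link is left out (None) once the requested month has reached
    or gone beyond the stop in that direction.
    """
    current = current.replace(day=1)
    earliest = earliest.replace(day=1)
    latest = latest.replace(day=1)

    nav = {"previous": None, "next": None}
    if current > earliest:
        nav["previous"] = shift_month(current, -1)
    if current < latest:
        nav["next"] = shift_month(current, 1)
    return nav


# A date far before the archive still renders, but only points forward.
print(month_nav(date(1970, 1, 15)))
# A month inside the archive gets both links.
print(month_nav(date(2002, 6, 10)))
```

A request for an out-of-range date still gets a page (and the friendly error message), but the only link offered leads back toward the real entries, so a robot has a finite path to follow.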
Update: With the installation of, and conversion to, Movable Type in June 2003, these possibilities no longer exist. Links demonstrating the respective query strings have been removed from this post.