Part 2 – SEO cheat codes: Making dynamic content search spider friendly

Many search engine spiders first request only the HTTP headers for a page. The important piece of information the spider looks for is the "Last-Modified:" line. If the header shows the page has changed since the last crawl, the spider requests the full page; if the header has not changed, it moves on to the next page.

If the spider keeps indexing the same URLs over and over, at some point it decides it has enough information from the site, even though it has not picked up the new content, and pages that have been added never get indexed. This spider activity can be seen in the site's log files. It is in the site's best interest to tell the spider a page has not changed when it has not.
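Going one step further, the script can answer such conditional requests itself: when the spider sends back an If-Modified-Since header matching the date the script served, a 304 status tells it nothing has changed. Here is a minimal sketch, assuming the spider echoes back the exact Last-Modified string it was previously sent, so a byte-for-byte comparison is enough:

```perl
#!/usr/bin/perl
# Sketch: reply 304 Not Modified when the spider's If-Modified-Since
# header matches the Last-Modified value we serve for this content.
use strict;
use warnings;

sub not_modified {
    my ($last_modified, $if_modified_since) = @_;
    # spiders normally echo the exact date string back, so a plain
    # string comparison is sufficient for this sketch
    return defined $if_modified_since && $if_modified_since eq $last_modified;
}

my $last_modified = 'Thu, 01 Jan 1970 00:00:00 GMT';   # computed elsewhere
if (not_modified($last_modified, $ENV{HTTP_IF_MODIFIED_SINCE})) {
    print "Status: 304 Not Modified\n\n";   # no body: spider keeps its copy
    exit;
}
```

A 304 response saves bandwidth on both ends and lets the spider spend its crawl budget on pages that actually changed.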

Here is a simple script that accomplishes this trick and helps the spider get through the site to the new content.

$acton = $ENV{REQUEST_URI};

# the request is for the zzz directory; everything after /zzz/ names the dynamic content.

$acton =~ /\/zzz\/(.+)/;

$acton = "$1";
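For example, with a hypothetical request for /zzz/DVDs the capture works like this:

```perl
# illustration of the capture with a sample URI
my $acton = '/zzz/DVDs';
$acton =~ /\/zzz\/(.+)/;
$acton = "$1";
# $acton now holds "DVDs"
```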

# get the date of the dynamic content. An xml file tells the script what
# to deliver for the request, and that file's date is the date of the
# content. /DVDs.xml would be an example; $acton contains the value DVDs.

$t = -M "$acton.xml";

# -M gives an age in days relative to now; convert it to an actual
# GMT epoch time, then format that as an http date.
$now = time;

$t = $t * 24 * 60 * 60;
$t = $now - $t;
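As an aside, the same epoch timestamp can be read straight from the file with stat(), which skips the -M arithmetic entirely. A small sketch, wrapped in a helper for illustration:

```perl
# Alternative sketch: stat() returns the modification time directly
# as seconds since the epoch, no day-to-second conversion needed.
sub content_mtime {
    my ($file) = @_;
    return (stat $file)[9];   # element 9 of stat() is the mtime
}
```

With this, $t = content_mtime("$acton.xml") replaces the three lines of arithmetic above and avoids the small drift introduced by taking the current time separately.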

# now parse the time into fields
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = gmtime($t);
$year=$year+1900; # gmtime() gives years since 1900; 2010 comes back as 110.

# http dates have leading zeros; gmtime() does not provide them.
$mday = "00$mday"; $mday =~ /(..\Z)/; $mday=$1;
$hour = "00$hour"; $hour =~ /(..\Z)/; $hour=$1;
$min = "00$min"; $min =~ /(..\Z)/; $min=$1;
$sec = "00$sec"; $sec =~ /(..\Z)/; $sec=$1;
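The pad-and-truncate dance above can also be written in one step with sprintf, which pads numbers with leading zeros directly:

```perl
# sprintf "%02d" pads a number to two digits with a leading zero
my $mday = sprintf '%02d', 5;    # "05"
my $hour = sprintf '%02d', 23;   # "23" (already two digits, unchanged)
```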

# day of week and month come back from gmtime() as numbers; http dates
# use English text names, so map them through lookup arrays.
@WEEK = qw(Sun Mon Tue Wed Thu Fri Sat);
@YEAR = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# data has been cooked; now format and deliver.
print "Last-Modified: $WEEK[$wday], $mday $YEAR[$mon] $year $hour:$min:$sec GMT\n";


print "Content-type: text/html\n\n";
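Putting the date handling together, the steps above can be collected into one helper that turns an epoch time into an http-format date. This is a sketch using core Perl only; the English day and month names are fixed by the http date format, which is why locale-dependent formatting (such as POSIX strftime's %a and %b) is avoided here:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @WEEK  = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTH = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# turn an epoch timestamp into an http-format GMT date string
sub http_date {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf '%s, %02d %s %04d %02d:%02d:%02d GMT',
        $WEEK[$wday], $mday, $MONTH[$mon], $year + 1900, $hour, $min, $sec;
}

print 'Last-Modified: ', http_date(time), "\n";
print "Content-type: text/html\n\n";
```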

The goal here is to get more of a site indexed. If the entire site is smaller than about 100 pages, there is little need to worry about spiders re-reading the same pages, and if all the pages on the site are already indexed normally there is no need at all. Different search engine spiders do have different behavior, however.

The .htaccess needs to look something like this so that requests for the directory are delivered to a script instead of returning a 404 error.

# enable .pl
AddHandler cgi-script .pl

# open index for requests to this folder first
DirectoryIndex index.html

# for folder requests, process them through the script

RewriteEngine On
RewriteOptions inherit
RewriteBase /zzz/
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# index.pl here is an assumed name for the handler script; adjust to yours
RewriteRule . index.pl [L]

This .htaccess code runs the script for requests under the /zzz/ directory that do not match an existing file ("!-f") or an existing directory ("!-d"); real files and directories under /zzz/ are still served normally.

If index.html exists in the /zzz/ directory, it is served for requests for /zzz/ itself ("DirectoryIndex index.html"), while the script is called for all requests for URLs under /zzz/ that do not actually exist (the RewriteRule).

