Site Map Generator for Google


Submitting a sitemap to Google is an easy and effective way to make sure all the pages of your website are indexed by Googlebot. Without a sitemap, you depend on Googlebot following links from one page to another until it has eventually found and indexed all your pages. This works, but it can take months.

Sitemap submission is one of the Google Webmaster Tools you can use to monitor and improve how Google handles your website. And since everyone wants to score well in a Google search, why pass on this opportunity?

Now, if your website is just a bunch of HTML pages in the free web space that came with your internet account, you have no access to the web server itself, so you can't run the Google Sitemap Generator there. You do, however, have a copy of your website on your local hard disk. To create a sitemap, you can simply list the paths and filenames of all the HTML files. If you then replace the top-level directory name with the relevant http://hosting_domain/site_directory string, you have a text file that can be submitted to Google as a (simple) sitemap. You can also transform the text file into Google's preferred format, an XML file, by feeding it to the Site Map Generator.

	#!/bin/bash
	# script to create sitemap.txt
	# Koen Noens, October 2006

	LOCAL_ROOT="/home/jp/websites/mysite"		# replace with your path
	SITE_ROOT="http://my.isp.com/my_site"		# replace with your site URL
	EXTENSIONS=".htm .html .php .asp .aspx .jsp"

	# work from the local site root; stop if it cannot be entered
	cd "$LOCAL_ROOT" || exit 1

	# find all .htm, .html, .php, ... pages, remove leading dot and concatenate with SITE_ROOT

	rm sitemap.txt 2>/dev/null || echo "no previous sitemap found"
	FOUNDFILES=$(mktemp)

	for ext in $EXTENSIONS ; do
		find . -name "*$ext" >> "$FOUNDFILES"
	done

	# remove leading . and insert site_root to build urls
	# (read line by line so paths containing spaces stay intact)
	sed -i 's/^\.//' "$FOUNDFILES"
	while IFS= read -r FILE; do
		echo "$SITE_ROOT$FILE" >> "$FOUNDFILES.0"
	done < "$FOUNDFILES"


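	# if there is an exclude list, exclude the files in it from the sitemap
	# (each entry is used as a sed pattern, so it must not contain commas)
	empty=""
	if [[ -e exclude.lst ]]; then
		while IFS= read -r entry; do
			[[ -n "$entry" ]] || continue	# skip blank lines in the list
			sed -i "s,$entry,$empty,g" "$FOUNDFILES.0"
		done < exclude.lst
		# remove blank lines as well
		sed -i '/^$/d' "$FOUNDFILES.0"
	fi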

	# finishing touches
	sort -f -u "$FOUNDFILES.0" > sitemap.txt
	rm -f "$FOUNDFILES.0" "$FOUNDFILES"

	# add sitemap to files_to_upload
	echo "$LOCAL_ROOT/sitemap.txt" >> "$LOCAL_ROOT/upload"
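
If you would rather produce the XML file yourself than feed the text file to the Site Map Generator, something along these lines might do. It is only a sketch: it assumes the sitemaps.org 0.9 format (a urlset element with one url/loc entry per page), which Google accepts, and the function name txt_to_xml and the output name sitemap.xml are made up for this example.

```shell
#!/bin/bash
# Sketch: wrap each URL from sitemap.txt in the sitemaps.org XML format.
# txt_to_xml and the file names are assumptions made for this example.

# txt_to_xml: read URLs (one per line) on stdin, write sitemap XML to stdout
txt_to_xml() {
	echo '<?xml version="1.0" encoding="UTF-8"?>'
	echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
	# drop blank lines, escape the characters that are special in XML
	# (& must come first), then wrap every remaining line in <url><loc>
	sed -e '/^$/d' \
	    -e 's/&/\&amp;/g' \
	    -e 's/</\&lt;/g' \
	    -e 's/>/\&gt;/g' \
	    -e 's/"/\&quot;/g' \
	    -e "s/'/\&apos;/g" \
	    -e 's|.*|  <url><loc>&</loc></url>|'
	echo '</urlset>'
}

# convert the text sitemap in the current directory, if there is one
if [[ -f sitemap.txt ]]; then
	txt_to_xml < sitemap.txt > sitemap.xml
fi
```

Run it in the directory that holds sitemap.txt. Optional elements such as lastmod or priority are left out, and URLs that contain spaces would still need percent-encoding before they are valid.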
	

I'm sure there are even shorter ways to do this, using pipes and more advanced sed scripts, but so far this is the best I can do. For Windows, you can use a Visual Basic "sitemap generator" script that does roughly the same (see also these Visual Basic scripts). Or create an HTML sitemap to add to your website.
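
For what it's worth, here is one possible shape of such a pipe-based variant. Treat it as a sketch, not a drop-in replacement: the function name build_sitemap is invented here, SITE_ROOT is assumed not to contain the characters | or &, and the exclude list is handled differently from the script above, since grep -v -F drops every URL that contains an entry from exclude.lst as a literal string (so the list must not contain blank lines, which would match everything).

```shell
#!/bin/bash
# Sketch of a pipeline variant; build_sitemap is a hypothetical helper name.

LOCAL_ROOT="/home/jp/websites/mysite"		# replace with your path
SITE_ROOT="http://my.isp.com/my_site"		# replace with your site URL

# build_sitemap LOCAL_ROOT SITE_ROOT
# prints one URL per line, sorted and without duplicates
build_sitemap() {
	local root=$1 site=$2
	(
		cd "$root" || exit 1
		# list the pages, turn the leading "." into the site URL,
		# filter out excluded entries (if any), then sort
		find . -type f \( -name '*.htm' -o -name '*.html' -o -name '*.php' \
			-o -name '*.asp' -o -name '*.aspx' -o -name '*.jsp' \) |
			sed "s|^\.|$site|" |
			{ if [[ -f exclude.lst ]]; then grep -v -F -f exclude.lst; else cat; fi; } |
			sort -u
	)
}

# only write a sitemap when the local copy of the site exists
if [[ -d "$LOCAL_ROOT" ]]; then
	build_sitemap "$LOCAL_ROOT" "$SITE_ROOT" > "$LOCAL_ROOT/sitemap.txt"
fi
```

Because nothing is written to temporary files, there is nothing to clean up afterwards, and file names with spaces pass through the pipe untouched.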


Koen Noens
July 2003