Bulk Text Processing

--=oOo=--


Bash, the default shell on most Linux distributions, is an advanced command interpreter that allows a system administrator to program (or 'script') complex system administration tasks. Linux is very much a text-oriented system, and most (if not all) of its configuration is stored in text files. This allows for easy configuration with basic tools, such as a text editor. The control mechanisms of the shell and the text manipulation tools also allow you to process text files in batches: it's quite easy to modify a large number of text files automatically, without having to open, modify and save them one by one.
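
As a minimal sketch of such a batch edit, the loop below changes one setting in every .conf file in a scratch directory. The directory, the file names and the `port` setting are made up for illustration, and `sed -i` (edit in place) assumes GNU sed, as found on most Linux systems.

```shell
# scratch directory with a few sample files (hypothetical names and contents)
demo=$(mktemp -d)
printf 'port=80\nname=alpha\n' > "$demo/a.conf"
printf 'port=80\nname=beta\n'  > "$demo/b.conf"

# change the same setting in every file, without opening any of them
for f in "$demo"/*.conf ; do
	sed -i 's/^port=80$/port=8080/' "$f"
done

# show the result, prefixed with the file name
grep -H '^port=' "$demo"/*.conf
```

The `for` loop is the 'control mechanism' here and `sed` the text manipulation tool; the same pattern works for any number of files.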

In the following exercise, we want to add a footer (e.g. a script for a web statistics hit counter) to our web pages. We've been doing this manually so far, and got sloppy, so some pages have the script and some don't. We want to find the pages that don't, and add the text to them. If you run your own web server, the server could probably do this for you, but let's assume you have a collection of static web pages that you upload to some web space your ISP has assigned to you.
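
Finding the pages that lack the script relies on grep's -L option, which lists the files in which a pattern does not occur. Here is a small sketch on two made-up pages in a scratch directory; the file names and the counter script tag are assumptions for the example, not part of any real site.

```shell
# two hypothetical pages: one already has the counter script, one doesn't
site=$(mktemp -d)
printf '<html><body>\n<p>home</p>\n<script src="webstat.js"></script>\n</body></html>\n' > "$site/index.htm"
printf '<html><body>\n<p>about</p>\n</body></html>\n' > "$site/about.htm"

# -L : list files WITHOUT a match, -R : search the directory recursively
missing=$(grep -L -R "webstat" "$site")
echo "$missing"
```

Only the path of about.htm is printed, since index.htm already contains the 'webstat' text.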

What do you need?

Procedure

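## find files that don't have the 'webstat' text in them
#	 grep, recursive, list non-matching --> files with .htm or .html extension only --> list in 'targets.lst' file
#	 (the dot is escaped and the pattern anchored, so only names ending in .htm or .html match)
grep -L -R "webstat" /home/me/website | grep -E '\.html?$' > targets.lst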

## review and edit target list (remove files that don't need changing)
vim targets.lst

## read file list and process files therein
while IFS= read -r filename ; do
	# remove /body and /html tags at end of file so insertion doesn't fall outside html document body
	# (file names are quoted so that paths containing spaces are handled correctly)
	sed -i -e 's/<\/html>//g' -e 's/<\/body>//g' -e 's/<\/HTML>//g' -e 's/<\/BODY>//g' "$filename"

	# append text from a file (e.g. the webstat counter script)
	cat srcfile >> "$filename"

	# append body and html end tags again
	echo "  </body>" >> "$filename"
	echo "</html>" >> "$filename"
done < targets.lst
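
A cautious way to try the procedure above is to run it on a scratch copy first, keep a backup of every file it touches, and check the result afterwards. The sketch below does that for one made-up page; the page, the footer snippet and the .bak suffix are assumptions for the example, and `sed -i` again assumes GNU sed.

```shell
# scratch copy of a hypothetical page, footer snippet and target list
work=$(mktemp -d)
printf '<html>\n<body>\n<p>about</p>\n</body>\n</html>\n' > "$work/about.htm"
printf '<script src="webstat.js"></script>\n' > "$work/srcfile"
echo "$work/about.htm" > "$work/targets.lst"

while IFS= read -r filename ; do
	# skip pages that already have the footer, so a second run does no harm
	grep -q "webstat" "$filename" && continue

	# keep a backup copy, then strip the end tags, append the footer and close the document again
	cp -p "$filename" "$filename.bak"
	sed -i -e 's/<\/html>//g' -e 's/<\/body>//g' -e 's/<\/HTML>//g' -e 's/<\/BODY>//g' "$filename"
	cat "$work/srcfile" >> "$filename"
	echo "  </body>" >> "$filename"
	echo "</html>" >> "$filename"
done < "$work/targets.lst"

# check: no page should be reported as missing the footer any more
still_missing=$(grep -L "webstat" "$work"/*.htm)
echo "pages still missing the footer: ${still_missing:-none}"
```

If something goes wrong, the original page can be restored from its .bak copy.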
	


Koen Noens
December 2007