Clean up a Text File or two

with Visual Basic Script


A while ago, the hard disk of my PC gave up on me. I had no backup of my web pages, figuring the only backup I needed was ... on the web. So I downloaded all the files from my provider's web server. When I opened them with Notepad - my web authoring tool of choice - I saw that the nicely formatted code had disappeared: all line ends were gone, seemingly replaced by some garbage character, a small black rectangle.

I figured out that this character had ascii value 10 (LF - Line Feed). Notepad ends its lines with a combination of ascii value 13 (CR - Carriage Return) and ascii 10 (LF - Line Feed). This can also be seen in the Visual Basic constant vbCrLf.
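
A minimal sketch to check how these constants relate, assuming it is saved as a .vbs file and run under Windows Script Host (cscript or wscript):

	' CR is ascii 13, LF is ascii 10; vbCrLf is CR followed by LF
	WScript.Echo Asc(vbCr)		' 13
	WScript.Echo Asc(vbLf)		' 10
	WScript.Echo Len(vbCrLf)	' 2
	If vbCrLf = Chr(13) & Chr(10) Then
		WScript.Echo "vbCrLf is CR followed by LF"
	End If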

So all I had to do was to replace the character of ascii 10 with ascii 13 + ascii 10. Because it would take too long to go through every html file and put back all the line ends in their right place, I made a script to do that:

	'Koen Noens
	'july 2004
	' script takes text file as input, and creates a text file as output.
	' contents is the same in both files, except that ascii char 10 (LF, newline)
	' is replaced by Microsoft's CR & LF
	' this restores end of line in files edited in non-Microsoft text editors
	' so that they appear less messy in Microsoft's NotePad etc.
	'
	' script might be modified to work the other way around as well
	' (remove Microsoft's carriage return/line feed and replace by normal newline)
	' but Unix / Linux also has its own tools to accomplish that
	' ------------------------------------------------------------

	msgbox "go !"
	Const ForReading = 1
	Const ForWriting = 2
	srcfilename = "j:\src.txt"
	destfilename = "j:\dest.txt"
	
	'Dim fso 	
	'Dim fsoStream 
	Set fso = CreateObject("Scripting.FileSystemObject")
	Set src = fso.OpenTextFile(srcfilename, ForReading)
	Set dest = fso.OpenTextFile(destfilename, ForWriting, True)
	Do While Not src.AtEndOfStream
        	char = src.read(1)	
		'ms returns are ascii 13 & 10 = CR and LF, or vbCrLf
		'unix returns are ascii 10 only (LF, newline)
		'cleanup = insert a 13 before every 10 (or replace 10 by 13,10)
		If asc(char) = 10 Then
			dest.Write (vbCrLf)
		Else
			dest.Write (char)
		End If
    	Loop
    	src.Close
    	dest.Close
    	Set src = Nothing
    	Set dest = Nothing
    	Set fso = Nothing
	MsgBox "Finished"
	

That worked well, but it would still have to be run for every file separately. So the next step was to include the above routine in a script that would go through a directory and its subdirectories, clean up all the htm files it comes across, and save them in a new file. In order for me not to have to put every file back in its place manually, the cleaned files needed to be stored in a directory tree that is a mirror of the original file locations, and the filenames would need to be identical. That way, I would just have to replace the root of the web page directories with the root that contains the clean files, all in their rightful place.
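
To illustrate the mirroring idea, here is a minimal sketch of how a source path can be mapped onto the output tree. The paths are made-up examples; the helper does the same thing as the function of that name in the script.

	' Sketch: map a source file onto its place in the output tree
	Function getPathWithoutDriveLetter(strPath)
		' drop the drive letter and the colon, keep the leading backslash
		getPathWithoutDriveLetter = Right(strPath, Len(strPath) - 2)
	End Function

	OutputFolder = "C:\CLEANED"					' example output root
	srcPath = "C:\MyMessedupTextFiles\pages\index.htm"		' example source file
	WScript.Echo OutputFolder & getPathWithoutDriveLetter(srcPath)
	' shows C:\CLEANED\MyMessedupTextFiles\pages\index.htm

Because only the drive letter is stripped, the whole source path reappears under the output root, so relative locations stay the same.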

The script looked like this:

	'Koen Noens
	'july 2004
	'
	' script to process files in a given directory, including subdirectories
	' the given directory is mirrored at a given location, so that the relative
	' location of all files and directories is maintained after processing.
 	' The actual processing of files is defined in a subroutine and can therefore
	' be easily replaced by any relevant process
	'
	' The use of subroutines and the fact that they can be called recursively I got from
	' the 'Get Folder Size' script by Hans Van der Zaag, September 23, 2003
	' ------------------------------------------------------------

	Const ForReading = 1
	Const ForWriting = 2
	Const ForAppending = 8

   	FirstFolder = Inputbox("Enter directory: " & _
                        	vbCrLf & "(e.g. C:\Program Files )" , _
                        	"File Processing Tool", "C:\MyMessedupTextFiles")
	Wscript.Echo ("All Subdirectories of " & FirstFolder & " will be processed, " _
			& vbCrLf & "and their subdirectories as well, and so on ..." _
			& vbCrLf & "a new directory tree with processed files will be created " _
			& vbCrLf & "under the directory you specify :"	)
	OutputFolder = Inputbox("Enter directory where output will be saved: " & _
                         	vbCrLf & "(e.g. C:\Program Files )" , _
                         	"File Processing Tool", "C:\CLEANED")

	'start running, keep time
	startTime = now

	'start tree for output
   	Set fso = CreateObject("scripting.filesystemobject")
	If  fso.FolderExists (Outputfolder) = False Then
		fso.CreateFolder (OutputFolder)
	End If
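	'keep a reference to the start folder: the subroutines reset the global fso
	Set objFirstFolder = fso.GetFolder(FirstFolder)

	'process the files in the start folder itself
	'(this also creates its mirror directory if it does not exist yet)
	ProcessFolder objFirstFolder

	'Run checkfolder
   	CheckFolder objFirstFolder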

	'Done running, check time
	endTime = Now

	Wscript.Echo "Done" & vbcrlf & _
			"  Started at " & Starttime & vbcrlf & _ 
			"  Finished at " & EndTime & vbcrlf

	Set fso = Nothing

Sub CheckFolder(objCurrentFolder)

       	For Each objFolder In objCurrentFolder.SubFolders
		'do something with or in this folder
        	ProcessFolder(objFolder)
       	Next
      	'Recurse through all of the folders
       	For Each objNewFolder In objCurrentFolder.subFolders
              CheckFolder objNewFolder
      	Next

	Set objFolder = Nothing
	Set objNewFolder = Nothing
End Sub

Sub ProcessFolder(objThisFolder)
	Set fso = CreateObject("scripting.filesystemobject")

	'create a corresponding directory for output
	OutputPath = OutputFolder & getPathWithoutDriveLetter(objThisFolder.Path)
	If fso.FolderExists (OutputPath) = False Then
 		fso.CreateFolder (OutputPath)
	End If

	'process the files in this directory
	For Each objFile in ObjThisFolder.Files
		If fso.FileExists(objFile) Then
			Process (objFile)
		End If
	Next
	Set objFile = Nothing
	Set fso = Nothing
End Sub

Sub Process(objSrcFile)
	
	'In this case, only process my .htm files, not the .bmp and .jpg etc ...
	ext = ".htm"
	'compare in lower case so that .HTM and .Htm are picked up as well
	If LCase(Right(objSrcFile.Name,4)) = ext Then
		srcfilename = objSrcFile.Path
		destfilename = OutputFolder & getPathWithoutDriveLetter(srcfilename)
		Set fso = CreateObject("Scripting.FileSystemObject")
		Set src = fso.OpenTextFile(srcfilename, ForReading)
		Set dest = fso.OpenTextFile(destfilename, ForWriting, True)
		Do While Not src.AtEndOfStream
	        	char = src.read(1)
			If asc(char) = 10 Then
				dest.Write (vbCrLf)
			Else
				dest.Write (char)
			End If
		Loop
		'close both files so the output is flushed to disk
		src.Close
		dest.Close
	Else
		'log skipped files
		'Set log = fso.OpenTextFile(LogFileName, ForAppending, true)
		'log.Write (objSrcFile.Path & " : file skipped - not a " & ext & " file." & vbcrlf)
		'log.Close
	End If
	set src = Nothing
	set dest = Nothing
	'set fso = Nothing
End Sub

Function getPathWithoutDriveLetter (strPath)
	'remove 2 characters at beginning of path (e.g. c:) so it can be concatenated to another path
	getPathWithoutDriveLetter = Right(strPath,(Len(strPath)-2))
End Function	

And it processed my entire website in a little under 5 minutes.
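
One caveat: the loop puts a CR in front of every LF, so a file that already has CR + LF line ends would come out with CR + CR + LF. Below is a minimal sketch of a variant that avoids this, assuming the files are small enough to be read into memory in one go. NormalizeLineEnds is a name made up for this illustration, and the file names are placeholders.

	' Sketch: restore line ends without doubling a CR that is already there
	Const ForReading = 1
	Const ForWriting = 2

	Function NormalizeLineEnds(strText)
		' first reduce every CR+LF to a bare LF, then expand every LF to CR+LF
		NormalizeLineEnds = Replace(Replace(strText, vbCrLf, vbLf), vbLf, vbCrLf)
	End Function

	Set fso = CreateObject("Scripting.FileSystemObject")
	Set src = fso.OpenTextFile("j:\src.txt", ForReading)
	' ReadAll raises an error on an empty file, so check first
	If src.AtEndOfStream Then
		strText = ""
	Else
		strText = src.ReadAll
	End If
	src.Close

	Set dest = fso.OpenTextFile("j:\dest.txt", ForWriting, True)
	dest.Write NormalizeLineEnds(strText)
	dest.Close

The function could take the place of the character loop in Process, at the cost of holding each file in memory while it is converted.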


Koen Noens
December 2003