A while ago, the hard disk of my PC gave up on me. I had no backup of my web pages, figuring the only backup I needed was ... on the web. So I downloaded all the files from my provider's web server. When I opened them with Notepad - my web authoring tool of choice - I saw that the nicely formatted code had disappeared: all line ends had gone, seemingly replaced by some garbage character, a small black rectangle.
I figured out that this character had ASCII value 10. Notepad seems to use a combination of ASCII 13 (CR - Carriage Return) and ASCII 10 (LF - Line Feed) to end its lines. This can also be seen in the Visual Basic constant vbCrLf.
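The values are easy to check from a script. The following is only a small sketch for Windows Script Host; the file name in the comment is an example, not something from my original setup.

```vbscript
' Run with: cscript //nologo lineends.vbs   (the file name is only an example)
' vbCr, vbLf and vbCrLf are built-in VBScript string constants.
WScript.Echo "vbCr   = ASCII " & Asc(vbCr)
WScript.Echo "vbLf   = ASCII " & Asc(vbLf)
WScript.Echo "vbCrLf = " & Len(vbCrLf) & " characters: " & _
    Asc(Left(vbCrLf, 1)) & ", " & Asc(Right(vbCrLf, 1))
```

This prints 13 for vbCr, 10 for vbLf, and shows that vbCrLf is the two characters 13 and 10, in that order.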
So all I had to do was to replace the character of ASCII 10 with ASCII 13 + ASCII 10. Because it would take too long to go through every HTML file and put back all the line ends in their right place, I made a script to do that:
'Koen Noens
'july 2004
' script takes text file as input, and creates a text file as output.
' contents are the same in both files, except that ascii char 10 (LF, the Unix newline)
' is replaced by Microsoft's CR & LF
' this restores end of line in files edited in non-Microsoft text editors
' so that they appear less messy in Microsoft's NotePad etc.
'
' script might be modified to work the other way around as well
' (remove Microsoft's carriage return/line feed and replace by normal newline)
' but Unix / Linux also has its own tools to accomplish that
' ------------------------------------------------------------
msgbox "go !"
Const ForReading = 1
Const ForWriting = 2
srcfilename = "j:\src.txt"
destfilename = "j:\dest.txt"
'Dim fso
'Dim fsoStream
Set fso = CreateObject("Scripting.FileSystemObject")
Set src = fso.OpenTextFile(srcfilename, ForReading)
Set dest = fso.OpenTextFile(destfilename, ForWriting, True)
Do While Not src.AtEndOfStream
char = src.read(1)
'ms returns are ascii 13 & 10 = CR and LF or vbCrLf
'unix returns are ascii 10 only (LF, newline)
'cleanup = insert a 13 before every 10 (or replace 10 by 13,10)
If asc(char) = 10 Then
dest.Write (vbCrLf)
Else
dest.Write (char)
End If
Loop
src.Close
dest.Close
Set src = Nothing
Set dest = Nothing
Set fso = Nothing
MsgBox "Finished"
That worked well, but it still had to be run for every file separately. So the next step was to include the above routine in a script that goes through a directory and its subdirectories, cleans up all the .htm files it comes across, and saves each one to a new file. To avoid having to put every file back in its place manually, the cleaned files are stored in a directory tree that mirrors the original file locations, under identical filenames. That way, I just had to replace the root of the web page directories with the root that contains the clean files, all in their rightful place.
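To make the mirroring idea concrete, here is a minimal sketch of the path mapping on its own. MirrorPath and EnsureFolder are hypothetical helper names used only for this illustration, and the example paths are made up. EnsureFolder is included because FileSystemObject.CreateFolder only creates the last component of a path, so missing parent folders have to be created first.

```vbscript
Option Explicit

' Hypothetical helper: maps a source path onto the output root by dropping
' the drive letter and colon, e.g. "C:\Site\a.htm" -> outputRoot & "\Site\a.htm".
Function MirrorPath(outputRoot, srcPath)
    MirrorPath = outputRoot & Mid(srcPath, 3)
End Function

' Hypothetical helper: creates a folder together with any missing parents.
Sub EnsureFolder(fso, folderPath)
    Dim parent
    If fso.FolderExists(folderPath) Then Exit Sub
    parent = fso.GetParentFolderName(folderPath)
    If parent <> "" Then EnsureFolder fso, parent
    fso.CreateFolder folderPath
End Sub

' Example with made-up paths: only shows the mapping, creates nothing.
WScript.Echo MirrorPath("C:\CLEANED", "C:\Site\pages\index.htm")
```

The Echo line prints C:\CLEANED\Site\pages\index.htm. Before writing a file there, one would call EnsureFolder with a FileSystemObject and the parent folder of that path.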
The script looked like this:
'Koen Noens
'july 2004
'
' script to process files in a given directory, including subdirectories
' the given directory is mirrored at a given location, so that the relative
' location of all files and directories is maintained after processing.
' The actual processing of files is defined in a subroutine and can therefore
' be easily replaced by any relevant process
'
' The use of subroutines and the fact that they can be called recursively I got from
' the 'Get Folder Size' script by Hans Van der Zaag, September 23, 2003
' ------------------------------------------------------------
Const ForReading = 1
Const ForWriting = 2
Const ForAppending = 8
FirstFolder = Inputbox("Enter directory: " & _
vbCrLf & "(e.g. C:\Program Files )" , _
"File Processing Tool", "C:\MyMessedupTextFiles")
Wscript.Echo ("All Subdirectories of " & FirstFolder & " will be processed, " _
& vbCrLf & "and their subdirectories as well, and so on ..." _
& vbCrLf & "a new directory tree with processed files will be created " _
& vbCrLf & "under the directory you specify :" )
OutputFolder = Inputbox("Enter directory where output will be saved: " & _
vbCrLf & "(e.g. C:\Program Files )" , _
"File Processing Tool", "C:\CLEANED")
'start running, keep time
startTime = now
'start tree for output
Set fso = CreateObject("scripting.filesystemobject")
If fso.FolderExists (Outputfolder) = False Then
fso.CreateFolder (OutputFolder)
End If
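'create the root of the mirrored tree, unless it is already there from an earlier run
'CreateFolder only creates the last path component, so FirstFolder is assumed
'to sit directly under the drive root (e.g. C:\MyMessedupTextFiles)
rootOutput = OutputFolder & getPathWithoutDriveLetter(fso.GetFolder(FirstFolder).Path)
If fso.FolderExists (rootOutput) = False Then
fso.CreateFolder (rootOutput)
End If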
'Run checkfolder
CheckFolder (FSO.getfolder(FirstFolder))
'Done running, check time
endTime = Now
Wscript.Echo "Done" & vbcrlf & _
" Started at " & Starttime & vbcrlf & _
" Finished at " & EndTime & vbcrlf
Set fso = Nothing
Sub CheckFolder(objCurrentFolder)
'do something with or in this folder
'(this includes the top-level directory, so the files directly in it are processed too)
ProcessFolder objCurrentFolder
'Recurse through all of the subfolders
For Each objNewFolder In objCurrentFolder.SubFolders
CheckFolder objNewFolder
Next
Set objNewFolder = Nothing
End Sub
Sub ProcessFolder(objThisFolder)
Set fso = CreateObject("scripting.filesystemobject")
'create a corresponding directory for output
OutputPath = OutputFolder & getPathWithoutDriveLetter(objThisFolder.Path)
If fso.FolderExists (OutputPath) = False Then
fso.CreateFolder (OutputPath)
End If
'process the files in this directory
For Each objFile in ObjThisFolder.Files
If fso.FileExists(objFile) Then
Process (objFile)
End If
Next
Set objFile = Nothing
Set fso = Nothing
End Sub
Sub Process(objSrcFile)
'In this case, only process my .htm files, not the .bmp and .jpg etc ...
ext = ".htm"
'compare in lower case so that .HTM and .Htm are picked up as well
If LCase(Right(objSrcFile.Name,4)) = ext Then
srcfilename = objSrcFile.Path
destfilename = OutputFolder & getPathWithoutDriveLetter(srcfilename)
Set fso = CreateObject("Scripting.FileSystemObject")
Set src = fso.OpenTextFile(srcfilename, ForReading)
Set dest = fso.OpenTextFile(destfilename, ForWriting, True)
Do While Not src.AtEndOfStream
char = src.read(1)
If asc(char) = 10 Then
dest.Write (vbCrLf)
Else
dest.Write (char)
End If
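Loop
'close both files so the output is written completely before the next file is opened
src.Close
dest.Close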
Else
'log skipped files
'Set log = fso.OpenTextFile(LogFileName, ForAppending, true)
'log.Write (objSrcFile.Path & " : file skipped - not a " & ext & " file." & vbcrlf)
'log.Close
End If
set src = Nothing
set dest = Nothing
'set fso = Nothing
End Sub
Function getPathWithoutDriveLetter (strPath)
'remove the first 2 characters of the path (drive letter and colon, e.g. c:) so it can be concatenated to another path
getPathWithoutDriveLetter = Right(strPath,(Len(strPath)-2))
End Function
And it processed my entire website in a little under 5 minutes.
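One caveat with the loop used above: it puts a CR in front of every LF, so a file that already has CR+LF line ends would come out with CR+CR+LF. A possible variation, which is only a sketch and not the script I actually ran, reads the whole file at once and uses Replace in two passes. ToCrLf and ConvertFile are hypothetical names, and the file names are the same placeholders as in the first script.

```vbscript
Option Explicit

Const ForReading = 1
Const ForWriting = 2

' Hypothetical helper: returns the text with every line end written as CR+LF.
Function ToCrLf(text)
    Dim t
    t = Replace(text, vbCrLf, vbLf)     ' existing CR+LF -> LF, so CRs are not doubled
    ToCrLf = Replace(t, vbLf, vbCrLf)   ' every LF -> CR+LF
End Function

' Hypothetical helper: converts one file, reading it into memory in one go.
Sub ConvertFile(srcFileName, destFileName)
    Dim fso, src, dest, text
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set src = fso.OpenTextFile(srcFileName, ForReading)
    text = ""
    ' ReadAll raises an error on an empty file, hence the check
    If Not src.AtEndOfStream Then text = src.ReadAll
    src.Close
    Set dest = fso.OpenTextFile(destFileName, ForWriting, True)
    dest.Write ToCrLf(text)
    dest.Close
End Sub

ConvertFile "j:\src.txt", "j:\dest.txt"
```

Running this twice on the same file gives the same result as running it once. It does hold the whole file in memory, which is fine for web pages but may not suit very large files, and a lone CR (the old Mac line end) is left untouched.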