The following two scripts make a list of all the hyperlinks found in a file (a web page, for example) by scanning the file for 'href'. Combined with File Processing in multiple directories, they can collect the hyperlinks of an entire web site. The output can also serve as the basis for a tool that checks for dead links.
There are two versions. This is version 1:
' Koen Noens
' july 2004
'
' script takes text file as input, eg. htm file.
' script detects hyperlinks and lists them in its outputfile
' purpose is to make an inventory of hyperlinks so that they can be checked
' therefore, html tags are added/included in the output
' ------------------------------------------------------------
msgbox "go !"
Const ForReading = 1
Const ForWriting = 2
srcfilename = "j:\src.txt"
destfilename = "j:\dest.txt"
Set fso = CreateObject("Scripting.FileSystemObject")
Set src = fso.OpenTextFile(srcfilename, ForReading)
Set dest = fso.OpenTextFile(destfilename, ForWriting, True)
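Do While Not src.AtEndOfStream
    CurrentLine = src.ReadLine
    ' look up both positions once per line
    LinkStart = FindHref(CurrentLine)
    LinkEnd = FindEndTag(CurrentLine)
    ' only extract when the opening and closing tag are both on this line, in that order
    ' (only the first link on a line is picked up)
    If LinkStart > 0 And LinkEnd > LinkStart Then
        Link = Mid(CurrentLine, LinkStart, LinkEnd - LinkStart)
        'report and / or log found links
        WScript.Echo Link
        dest.WriteLine Link
    End If
Loop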
src.Close
dest.Close
Set src = Nothing
Set dest = Nothing
Set fso = Nothing
MsgBox "Finished"
Function FindHref (TheString)
    'find in TheString the position where a href tag starts (0 if there is none)
    FindHref = 0
    For pos = 1 To Len(TheString)
        'LCase so that <A HREF is found as well
        If LCase(Mid(TheString, pos, 7)) = "<a href" Then
            FindHref = pos
            Exit Function
        End If
    Next
End Function
Function FindEndTag (TheString)
    'find in TheString the position just past the </a> tag (0 if there is none)
    FindEndTag = 0
    For pos = 1 To Len(TheString)
        'LCase so that </A> is found as well
        If LCase(Mid(TheString, pos, 4)) = "</a>" Then
            FindEndTag = pos + 4
            Exit Function
        End If
    Next
End Function
Version 1 lists the links as they appear in the original page. That means that a 'click here' or 'follow this link' link will show 'click here' or 'follow this link' as its description. The underlying URL is preserved, but because these links appear out of context, it may be more useful to show the URL itself, with the appropriate html tags to make it clickable. That is achieved in version 2.
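The idea of showing the URL itself can be sketched roughly as follows. This is not the version 2 script, only an illustration of the principle: it assumes the href value is enclosed in quotes, it uses VBScript's built-in RegExp object instead of the character-by-character scan of version 1, and the function name UrlAsLink is made up for this example.

```vbscript
' Sketch only: not the original version 2 script.
' UrlAsLink is a hypothetical helper name used for illustration.
Function UrlAsLink(TheLine)
    ' Returns a clickable anchor whose text is the url itself,
    ' or "" when the line holds no quoted href attribute.
    Dim re, matches, url
    Set re = New RegExp
    re.IgnoreCase = True
    ' capture whatever sits between the quotes after href=
    re.Pattern = "<a\s[^>]*href\s*=\s*[""']([^""']*)[""']"
    UrlAsLink = ""
    Set matches = re.Execute(TheLine)
    If matches.Count > 0 Then
        url = matches(0).SubMatches(0)
        UrlAsLink = "<a href=""" & url & """>" & url & "</a>"
    End If
End Function
```

Called on a line such as `see <a href="http://example.com/page.htm">click here</a>`, the function should return `<a href="http://example.com/page.htm">http://example.com/page.htm</a>`, so the description 'click here' is replaced by the address it points to. Like version 1, this sketch only handles the first link on a line.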
If you are running Linux, you can use wget and/or linkchecker to follow and download or check all links in your website. linkchecker can produce output in csv or sql format: csv can be opened as an (OpenOffice.org) spreadsheet that can be sorted and filtered to find broken links. For large sites or large numbers of links, you might prefer sql output, which you can load into a database and query for broken links and the pages they appear on.
# linkchecker with csv output
k@nix:~$ linkchecker --output=csv --no-proxy-for= http://users.telenet.be/mydotcom/ > mydotcom_linkcheck.csv
Status: 9 active threads, 278 URLs checked, 758 URLs queued, runtime 5.149 seconds
[ ... ]
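If you stay on Windows, the dead-link check could be roughed out in VBScript as well. The following is a minimal sketch under several assumptions, not a finished tool: it expects a text file with one absolute URL per line, it uses the MSXML2.ServerXMLHTTP object that comes with Windows, and the names IsAlive and CheckFile are invented for this example.

```vbscript
' Sketch only: IsAlive and CheckFile are hypothetical names.
Function IsAlive(url)
    ' True when the server answers a HEAD request with a status below 400.
    ' Any error (bad url, no connection, unknown host) counts as not alive.
    Dim http
    IsAlive = False
    On Error Resume Next
    Set http = CreateObject("MSXML2.ServerXMLHTTP")
    http.Open "HEAD", url, False
    http.Send
    If Err.Number = 0 Then IsAlive = (http.Status < 400)
    On Error GoTo 0
End Function

Sub CheckFile(urlfilename)
    ' Reads one url per line and reports those that do not answer.
    Const ForReading = 1
    Dim fso, src, url
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set src = fso.OpenTextFile(urlfilename, ForReading)
    Do While Not src.AtEndOfStream
        url = Trim(src.ReadLine)
        If url <> "" Then
            If Not IsAlive(url) Then WScript.Echo "dead or unreachable: " & url
        End If
    Loop
    src.Close
End Sub
```

A call such as `CheckFile "j:\urls.txt"` (the path is only an example) would then echo every address that fails. Some servers refuse HEAD requests, so a link reported here is a candidate for a manual check rather than a proven dead link.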