Look For Links


The following two scripts can be used to make a list of all the hyperlinks that can be found in a file ( a web page or so), by scanning the file for 'href'. Combined with File Processing in multiple directories , it can collect hyperlinks in an entire web site. The output can also be used as the basis for a tool that checks for dead links.

There are two versions. This is version 1 :

'Koen Noens'july 2004'' script takes text file as input, eg. htm file.
' script detects hyperlinks and lists them in its outputfile
' purpose is to make an inventory of hyperlinks so that they can be checked
' therefor, html tags are added/included in the output
' ------------------------------------------------------------
	msgbox "go !"
	Const ForReading = 1
	Const ForWriting = 2
	srcfilename = "j:\src.txt"
	destfilename = "j:\dest.txt"

	Set fso = CreateObject("Scripting.FileSystemObject")
	Set src = fso.OpenTextFile(srcfilename, ForReading)
	Set dest = fso.OpenTextFile(destfilename, ForWriting, True)

	Do While Not src.AtEndOfStream
		CurrentLine = src.ReadLine
		If FindHref(CurrentLine) > 0 Then
			LinkStart = FindHref(CurrentLine)
			LinkEnd = FindEndTag(CurrentLine)
			Link = Mid (CurrentLine, LinkStart, (LinkEnd-LinkStart))
			'report and / or log found links
			Wscript.Echo (Link)
			dest.Write (Link & vbcrlf)
		End If
    	Loop

    	src.Close
    	dest.Close
    	Set src = Nothing
    	Set dest = Nothing
    	Set fso = Nothing

	MsgBox "Finished"

Function FindHref (TheString)
'find in TheSTring the position where a href tag starts

	for pos=1 to len(thestring)
		if mid(thestring,pos,7)="<a href" then
			exit function
		else
			findhref = pos
	 	end if
	nex
end function

function findendtag (thestring)
	'find in thestring the position of </a>   tag

	For pos = 1 to Len(TheString)
		If Mid (TheString,pos,4) = "" Then
			FindEndTag = pos + 4
			Exit Function
		Else
			FindEndTag = 0
		End If
	Next
End Function
	

Version 1 lists the links as the appear in the original page. That means that a 'click here' or 'follow this link' link will show 'click here' or 'follow this link' as description. The underlying url will be preserved, but as these links appear out of their context, it may be more interesting to show the url itself - with the appropriate html tags to make it clickable. That is achieved in version 2 .

If you are running Linux, you can use wget and or linkchecker to follow and download or check all links in your website. linkchecker can produce output in csv or sql format : csv can be opened as an (OpenOffice.org) Spreadsheet that can be sorted and filtered to find broken links. For large sites or heaps of links, you might prefer sql output that you can use to create a database that you can query for broken links and the pages they appear on.

	# linkchecker with csv output
	k@nix:~$ linkchecker --output=csv --no-proxy-for=  http://users.telenet.be/mydotcom/ > mydotcom_linkcheck.csv 

	Status:  9 active threads,  278 URLs checked,   758 URLs queued, runtime  5.149 seconds
	[ ... ]
	

Koen Noens
July 2004