This PHP transaction spiders the local website and checks every internal and external link found in the HTML markup, where permitted by the relevant robots.txt file. No attempt is made to access files forbidden by robots.txt.
It displays the following four lists:
- A list of those links which return an HTTP error (normally 404 Not Found)
- A list of those links which return HTTP 301 (Moved Permanently)
- A list of URLs not checked due to robots.txt rules
- A list of all URLs checked, together with the HTTP status code returned
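The robots.txt handling above can be sketched roughly as follows. This is a minimal illustration only, not the script's actual parser: it honours only the `User-agent: *` section and treats each `Disallow` value as a simple path prefix (function names are hypothetical).

```php
<?php
// Collect the Disallow prefixes from the "User-agent: *" section.
function disallowed_prefixes(string $robotsTxt): array
{
    $prefixes = [];
    $applies  = false;                      // true while inside a "User-agent: *" section
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));  // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            $applies = (trim(substr($line, 11)) === '*');
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $p = trim(substr($line, 9));
            if ($p !== '') {
                $prefixes[] = $p;
            }
        }
    }
    return $prefixes;
}

// A path is allowed unless it starts with one of the disallowed prefixes.
function robots_allows(string $robotsTxt, string $path): bool
{
    foreach (disallowed_prefixes($robotsTxt) as $p) {
        if (strpos($path, $p) === 0) {
            return false;
        }
    }
    return true;
}
```

A URL failing this check would go on the "not checked due to robots.txt rules" list rather than being fetched.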
The transaction uses the PHP cURL extension and processes each URL synchronously, so it can take a few minutes to run on larger sites (it spiders about 14 pages per second on my server, taking four minutes in total).
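A synchronous cURL check of the kind described might look like the sketch below (a hedged example, not the script's actual code; it requires the cURL extension). It issues a HEAD request and returns the raw status code without following redirects, so that 301 responses can be reported separately:

```php
<?php
// Return the HTTP status code for a URL, or 0 if the request fails entirely.
function http_status(string $url): int
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,   // HEAD request: only the status line is needed
        CURLOPT_RETURNTRANSFER => true,   // don't echo anything to the output
        CURLOPT_FOLLOWLOCATION => false,  // report 301s rather than following them
        CURLOPT_TIMEOUT        => 10,     // don't hang on a dead server
    ]);
    curl_exec($ch);
    $code = curl_close_and_code($ch);
    return $code;
}

// Helper: read the code, then free the handle.
function curl_close_and_code($ch): int
{
    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $code;
}
```

Calling a function like this once per URL, in a loop, is what makes the run synchronous and accounts for the running time on larger sites.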
A constant in the code determines which file extensions are spidered on your site. As distributed, it spiders shtml, htm, html, and php files. This can easily be changed by editing the value of the constant; all other files (e.g. PDF files) only have their existence checked:
define ("EXTENSIONS", "shtml,html,htm,php");
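The decision the constant drives can be sketched as follows (a hypothetical helper for illustration; the script's own logic may differ). A URL whose path ends in one of the listed extensions is fetched and parsed for further links; anything else, such as a PDF, is only existence-checked:

```php
<?php
define("EXTENSIONS", "shtml,html,htm,php");

// True if the URL's extension is in EXTENSIONS, i.e. the page should be
// downloaded and parsed for links rather than merely existence-checked.
function should_spider(string $url): bool
{
    $path = parse_url($url, PHP_URL_PATH) ?? '/';
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return in_array($ext, explode(',', EXTENSIONS), true);
}
```

So `should_spider("http://example.com/index.html")` is true, while a link to `report.pdf` would only be checked for existence.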
Once the constant has been updated, upload the link checker into your main web directory and run the transaction from your browser.
Download compressed PHP file (6,285 bytes)