Unix shell and Python scripts for automatic HTML/XHTML validation of pages; they also work with generated pages.
If you want to check generated HTML code with a shell script, you will need tidy and Lynx, or some other browser that supports dumping the source of the page to stdout.
#!/bin/bash
PATH=/bin/:/usr/bin
lynx="/usr/bin/lynx"
tidy="/usr/bin/tidy"
TMPFILE=`mktemp -q /tmp/$0.XXXXXX`
if [ $? -ne 0 ]; then
    echo "$0: Can't create temp file, exiting..."
    exit 1
fi
files="index.php foo.php bar.php"
for i in $files
do
    printf '\n%s\n' "$i" >> "$TMPFILE"
    $lynx -source http://localhost/$i | $tidy -eq 2>&1 | grep line >> "$TMPFILE"
done
less "$TMPFILE"
rm -f "$TMPFILE"
You will have to change the lynx and tidy variables near the top to point to the location of the executables on your system. If your site isn't in the DocumentRoot, you will also need to make some other modifications. The grep line at the end of the pipe ensures that you won't get pages of general information from tidy (which comes out even with -q). The script collects the output in a temporary file and shows it with less; if you expect few errors, you can drop both and get incremental output of the errors instead (otherwise you have to wait until all pages have been checked). You have to call the script from the directory you saved it in, like ./check_tidy; otherwise replace the $0 in the mktemp line, because the file name template can't contain "/" (without the quotes).
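The grep line filter works because tidy prefixes each diagnostic with its position. As a minimal sketch, assuming tidy's usual "line N column M - Severity: message" diagnostic format, the same filtering could be done in Python 3 so that position and severity end up in separate fields. The name parse_tidy_output is illustrative and not part of the scripts above.

```python
import re

# Assumed shape of a tidy diagnostic: "line 3 column 1 - Warning: message".
TIDY_LINE = re.compile(r"^line (\d+) column (\d+) - (\w+): (.*)$")


def parse_tidy_output(text):
    """Return (line, column, severity, message) tuples for tidy diagnostics.

    Lines that do not match the assumed format (summaries, hints) are
    skipped, which is roughly what the `grep line` step does in the shell
    script.
    """
    results = []
    for raw in text.splitlines():
        match = TIDY_LINE.match(raw.strip())
        if match:
            line, column, severity, message = match.groups()
            results.append((int(line), int(column), severity, message))
    return results
```

The text passed in would be the combined stdout and stderr of tidy -eq for one page; unlike grep line, the anchored pattern does not keep unrelated lines that merely contain the word "line".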
If you have XHTML or XML that you want to check for validity, you can use a validating XML parser, which should find some errors that tidy won't report at all. I use Xerces C++ and, again, a browser to dump the source.
#!/bin/bash
lynx="/usr/bin/lynx"
parser="/usr/bin/StdInParse"
files="index.php foo.php bar.php"
cd /var/www/html
for i in $files
do
    printf '\n%s\n' "$i"
    $lynx -source http://localhost/$i | $parser 2>&1
done
This one does not use a temporary file. To achieve the effect of the previous script you can run check | less, assuming that you have saved this file as check and that it is in your path. The cd is needed so that relative system identifiers can be used in the doctype declaration, like this one:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "dtd/xhtml1-strict.dtd">
Xerces is a good XML parser, but it sometimes outputs really weird error messages. Fortunately there is a great parser by James Clark called nsgmls, included in SP and in openjade; it is an SGML parser, so you can also validate HTML.
It is also used by the W3C Validation Service. The nice thing about the validator is that it can output your generated source, which is very useful.
The next one is a Python script that uses nsgmls. Because Python has module support for HTTP, you don't need an external application (a browser) to get the source.
#!/usr/bin/env python
# Python 2 script: uses httplib and the print statement.
import httplib, os

catalogue = '/usr/share/sgml/dtd/xhtml.soc'
options = '-wxml -s -c' + catalogue
parser = '/usr/bin/nsgmls'
files = ['index.php',
         'cert.php',
         'javascript.php', 'slideshow.php', 'bounce.php', 'fading.php',
         'linux.php', 'valid.php']
# nsgmls writes its error messages to this file (-f option).
errors_name = os.tempnam()

for file in files:
    # Fetch the generated page from the local web server.
    h = httplib.HTTP('localhost')
    h.putrequest('GET', '/' + file)
    h.endheaders()
    errcode, errmsg, headers = h.getreply()
    print file, errcode, errmsg
    f = h.getfile()
    data = f.read()
    f.close()
    # Feed the page to the parser on stdin.
    pipe = os.popen(parser + ' ' + options + ' -f ' + errors_name, 'w')
    pipe.write(data)
    pipe.close()
    # Split the error output on ':'; this yields five elements per error.
    errors = open(errors_name)
    err = errors.read()
    err = err.split(':')
    if len(err) > 1:
        data = data.split('\n')
        for i in range(1, len(err)):
            if i % 5 == 0:
                print 'column:', err[i-2], err[i].split(parser)[0]
            if i % 5 == 2:
                print data[int(err[i])-1]
        print
    errors.close()
os.remove(errors_name)
The first line tells the shell to use python to execute the code; then we import the needed libraries and define some constants (you may need to change some of them). The main loop iterates over all of the files: it connects to the server, sends a GET request, ends the headers, then gets the returned status code and displays it with the filename. Next it reads the returned HTML and pipes it to the parser, then opens the temporary file with the error output and splits it into an array on ":", the field separator used by nsgmls.
Then we check whether the error array has more than one element (each error line is split into 5 elements); if so, we create an array with each line of the source. We iterate over the elements and, within each group of five, use the element at index 2 as the line number to look up in the source, also printing the column number (the element at index 3) and the error message (every 5th element). An empty line is printed to make the output easier to read.
We finish off by closing the temporary file and deleting it.
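The splitting step described above can also be done one error line at a time, which avoids the modulo arithmetic and copes with messages that themselves contain a colon. The following is a minimal sketch in Python 3 (the script above is Python 2), assuming each nsgmls error line has the form program:sysid:line:column:type: message. The function name parse_nsgmls_errors and the sample values in the usage note are illustrative.

```python
def parse_nsgmls_errors(error_text, source_text):
    """Pair each nsgmls error with the source line it refers to.

    Assumed line format: program:sysid:line:column:type: message
    Returns a list of (line, column, type, message, source_line) tuples.
    """
    source_lines = source_text.split("\n")
    report = []
    for raw in error_text.splitlines():
        # Split at most five times so colons inside the message survive.
        fields = raw.split(":", 5)
        if len(fields) < 6:
            continue  # not an error line in the assumed format
        _program, _sysid, line_no, column, kind, message = fields
        if not (line_no.isdigit() and column.isdigit()):
            continue
        line_no = int(line_no)
        # Line numbers start at 1; guard against out-of-range values.
        if 1 <= line_no <= len(source_lines):
            source = source_lines[line_no - 1]
        else:
            source = ""
        report.append((line_no, int(column), kind, message.strip(), source))
    return report
```

Here error_text would be the contents of the file written via -f, and source_text the page that was piped to the parser.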
A simple improvement (if you are using PHP's sessions with trans-sid) is to add a header that tells the PHP module that we already have a session, which prevents it from mangling links and forms. This can be done with a line like the one below, added before h.endheaders() is called.
h.putheader('Cookie', 'PHPSESSID=b319002a4f7d5c46a32cc819e8526cce')
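For readers on Python 3, where httplib has become http.client, the fetch step with the session cookie might look roughly like the sketch below. The helper names session_headers and fetch, the timeout, and the default port are assumptions made for illustration; only the Cookie header itself comes from the line above.

```python
import http.client


def session_headers(session_id=None, cookie_name="PHPSESSID"):
    """Build request headers; add a session cookie when an id is given."""
    headers = {}
    if session_id:
        headers["Cookie"] = "%s=%s" % (cookie_name, session_id)
    return headers


def fetch(path, host="localhost", port=80, session_id=None):
    """GET a page and return (status, reason, body as text)."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("GET", "/" + path.lstrip("/"),
                     headers=session_headers(session_id))
        response = conn.getresponse()
        # Decode leniently: generated pages may not be valid UTF-8.
        body = response.read().decode("utf-8", errors="replace")
        return response.status, response.reason, body
    finally:
        conn.close()
```

Calling fetch('index.php', session_id='b319002a4f7d5c46a32cc819e8526cce') would then return the status, reason and page source, and the body can be handed to the parser as before.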