Bulk Validator

by John Walker


The World Wide Web Consortium (W3C) Markup Validation Service is an essential resource for Web developers who wish to create standards-compliant documents. This freely available service checks HTML and XHTML documents for compliance with a variety of versions of the relevant standards and reports any errors it finds in the markup. The validator can check documents specified by a Web URL, files uploaded from the user's computer, or text pasted directly into a text box on the validator request page.

These options suffice when you're developing a new page, but if you're generating a sizable collection of documents automatically (for example, with a content management system for a Web log), or you have a complicated existing Web tree you wish to check for standards compliance, submitting each document individually for validation can become tedious. BulkValidator is a Perl program which automates the process of validating multiple documents. It submits either all of the HTML/XHTML documents in a directory or all documents in that directory and its subdirectories to the W3C validator and reports the results. For any documents which failed validation, the error reports are saved in a “discrepancies” directory whence they can be subsequently scrutinised.

Downloading and Installation

The BulkValidator may be downloaded from the following link:
BulkValidator-1.2.tar.gz: Gzipped TAR archive (12 Kb)

Included in the archive are the Perl program BulkValidator.pl and the manual page for the program extracted from the documentation embedded within it, as well as this document. You can use these files in the directory in which you extracted them or install them in your system's library directories to make them available to all users. You may wish to rename the Perl program as BulkValidator so it can be run as a regular command line program; if you do so, make sure the location of Perl in the first line of the program corresponds to where Perl is installed on your system.

This program requires the Perl modules Data::Dumper, Pod::Usage, LWP, and URI::Escape. If your Perl installation lacks one or more of these modules, you will have to install them (either system-wide or for your own user account) before you can use BulkValidator. In addition, validation of files in subdirectories requires the Unix find command. While most systems which support Perl provide this command, if it is not present (for example, on a minimalist Cygwin configuration), you will have to install it if you wish to use this feature.
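
Before a first run, it may help to confirm that these prerequisites are present. The following is a minimal sketch and not part of the distribution: the function name check_prerequisites is made up for illustration, and it assumes the perl found on your PATH is the one you will use to run BulkValidator.

```shell
#!/bin/sh
# Hypothetical helper (not part of BulkValidator): report whether each
# prerequisite named above is available.
check_prerequisites() {
    for module in Data::Dumper Pod::Usage LWP URI::Escape; do
        # "perl -MModule -e 1" exits nonzero if the module cannot be loaded.
        if perl -M"$module" -e 1 2>/dev/null; then
            echo "$module: found"
        else
            echo "$module: MISSING"
        fi
    done
    # The find command is needed only when validating subdirectories.
    if command -v find >/dev/null 2>&1; then
        echo "find: found"
    else
        echo "find: MISSING"
    fi
}

check_prerequisites
```

Any line ending in MISSING names something to install before running the program.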

Manual Page

BulkValidator

SYNOPSIS

BulkValidator [--copyright] [--density num] [--discrepancy dir] [--firstfiles num] [--help] [--man] [--pause num] [--rpause factor] [--shuffle] [--skipfile file] [--tree] [--validator url] [--verbose] [--version] [directory]


DESCRIPTION

BulkValidator submits all of the HTML/XHTML files either in a specified directory (the current directory is assumed if none is given) or in that directory and any subdirectories to the W3C HTML validator and reports the results. The validation reports for any files which failed validation are saved for review.


OPTIONS

All options may be abbreviated to their shortest unambiguous prefix.
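
To see which abbreviations this rule permits, the shortest unambiguous prefix of each option can be worked out from the names in the SYNOPSIS. The sketch below is only an illustration of the rule, not code from BulkValidator; the function name shortest_prefix and the OPTIONS variable are hypothetical.

```shell
#!/bin/sh
# Long option names, copied from the SYNOPSIS.
OPTIONS="copyright density discrepancy firstfiles help man pause rpause shuffle skipfile tree validator verbose version"

# Print the shortest prefix of option $1 that no other option shares.
shortest_prefix() {
    target=$1
    length=1
    while [ "$length" -le "${#target}" ]; do
        prefix=$(printf '%s\n' "$target" | cut -c "1-$length")
        matches=0
        # Count how many option names begin with this prefix.
        for candidate in $OPTIONS; do
            case $candidate in
                "$prefix"*) matches=$((matches + 1)) ;;
            esac
        done
        if [ "$matches" -eq 1 ]; then
            echo "--$prefix"
            return 0
        fi
        length=$((length + 1))
    done
    # No unique prefix exists; fall back to the full name.
    echo "--$target"
}

for option in $OPTIONS; do
    printf '%-15s %s\n' "--$option" "$(shortest_prefix "$option")"
done
```

For example, --density shortens to --de and --discrepancy to --di, while --verbose needs --verb to distinguish it from --version.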

--copyright

Display copyright information.

--density num

A randomly chosen subset of num percent of the files will be validated. If you have a large collection of mostly similar files and do not want to spend the time or burden the validator with processing them all, specify a modest percentage of the files to test a statistical sample of them. Use the --firstfiles option if you wish to unconditionally validate some number of the first files in the list. If no --density is specified, all files will be validated (equivalent to a num specification of 100).

--discrepancy dir

The validation reports for any files which failed validation will be stored in the directory dir, which will be created if it does not already exist. If no --discrepancy directory is specified, reports will be stored in a ValidationDiscrepancies directory created within the current directory.

--firstfiles num

The first num files will always be validated regardless of the --density specification. The default is 0, which causes no files to be unconditionally validated.

--help

Display a brief summary of how to call the program.

--man

Display this complete manual page.

--pause num

After each file is validated, BulkValidator will pause for num seconds (plus an additional delay governed by --rpause, see below). The default is 15 seconds. A modest delay after each request avoids unduly burdening the W3C Validator.

--rpause factor

If --pause is nonzero, a random increment between zero and the --rpause factor multiplied by the --pause num will be added to the delay after each request. The factor is a floating point number; the default is 1, which results in a delay between the --pause specification and twice that value.
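
The arithmetic of the delay can be sketched as follows. This illustrates the rule as described here and is not the program's own code; the function name compute_delay is hypothetical, and awk's rand() stands in for whatever random number source BulkValidator uses.

```shell
#!/bin/sh
# Hypothetical helper: print a delay in seconds for the given --pause
# ($1) and --rpause ($2) values.
compute_delay() {
    awk -v pause="$1" -v factor="$2" 'BEGIN {
        srand()
        if (pause > 0) {
            # rand() returns a value in [0, 1), so the increment lies
            # between zero and factor * pause.
            print pause + rand() * factor * pause
        } else {
            # With no pause, no random increment is added either.
            print 0
        }
    }'
}

compute_delay 15 1    # the defaults: between 15 and 30 seconds
compute_delay 15 0    # no random component: exactly 15 seconds
compute_delay 0 1     # no pause at all
```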

--shuffle

If specified, files will be validated in random order. Otherwise, files are validated in alphabetical order.

--skipfile file

The specified file is the output from one or more previous runs of BulkValidator (which you can capture by redirecting standard output to a file or piping it to tee). All files which passed validation in previous runs will be skipped on this run. Use this option when you're chasing down validation errors in a collection of files; only the files which failed before will be re-examined in this run.

--tree

All .html and .htm files in the directory specified on the command line and, recursively, in all of its subdirectories will be validated.
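
To preview which files such a traversal would pick up, a find command along the following lines can be used. This is a sketch under assumptions: how BulkValidator itself invokes find is not documented here, and the function name list_html_tree is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list the .html and .htm files in a directory and
# all of its subdirectories (default: the current directory), sorted.
list_html_tree() {
    find "${1:-.}" -type f \( -name '*.html' -o -name '*.htm' \) | sort
}
```

Calling list_html_tree with a directory name prints one path per line, which gives a rough idea of how many requests a run with this option would send to the validator.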

--validator url

The specified url is used to request validation instead of the default http://validator.w3.org/check. The validator must accept file uploads with the same form fields as the W3C HTML validator and return pass/fail results in the same syntax.

--verbose

Generate verbose output to indicate what's going on.

--version

Display version number.


EXAMPLES

Validate all HTML files in the current directory, placing discrepancy reports in a ValidationDiscrepancies subdirectory of the current directory.

    perl BulkValidator.pl

Validate the first 10 files in alphabetical order, then 15% of the remaining files chosen at random from the directory /var/www/html/recipes/ratburger and subdirectories, placing discrepancy reports for any files which fail validation in /home/chef/goofs.

    perl BulkValidator.pl --tree --firstfiles 10 --density 15 \
                          --discrepancy /home/chef/goofs \
                          /var/www/html/recipes/ratburger
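
The way --firstfiles and --density combine in this example can be sketched as a filter over the sorted file list. This is an illustration only: the function name select_files is hypothetical, and it assumes each remaining file is kept by an independent random draw, whereas the program may instead pick an exact percentage of the files.

```shell
#!/bin/sh
# Hypothetical helper: read file names on standard input, one per line,
# and print those selected for validation.
#   $1 = --firstfiles value, $2 = --density value (percent)
select_files() {
    awk -v first="$1" -v density="$2" '
        BEGIN { srand() }
        # Keep the first "first" names unconditionally; keep each later
        # name with probability density/100.
        NR <= first || rand() * 100 < density
    '
}
```

Piping a sorted list of file names through select_files 10 15 keeps the first 10 names and roughly 15% of the rest.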

Validate files in /var/www/html/recipes/ratburger, saving the pass/fail results in /home/chef/goofs/val.log. Then, after editing, revalidate all the files which failed to validate the first time.

    perl BulkValidator.pl /var/www/html/recipes/ratburger \
            | tee /home/chef/goofs/val.log
       . . . Edit, edit, edit . . .
    perl BulkValidator.pl --skipfile /home/chef/goofs/val.log \
            /var/www/html/recipes/ratburger


FILES

If no directory is specified on the command line, the current directory is validated.

The validation summary is written to standard output. You can redirect this to a file or make a copy with tee if you wish to use it in subsequent runs to exclude already-validated files with the --skipfile option.

The validator reports for any files which failed validation are stored in the --discrepancy directory, which defaults to ValidationDiscrepancies in the current directory. Files in this directory are named with the path name of the validated file, with all slashes replaced by underscores. Validation reports for files which previously failed validation but passed this time will be automatically deleted, and the --discrepancy directory will be removed if, at the end of the run, no files remain within it.
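
The naming rule for report files can be illustrated with a one-line transformation. This is a sketch of the rule as stated, not the program's code; the function name report_name is hypothetical, and details such as the treatment of a leading slash or any added file extension are not specified here.

```shell
#!/bin/sh
# Hypothetical helper: derive a report file name from the path of a
# validated file by replacing every slash with an underscore.
report_name() {
    printf '%s\n' "$1" | tr '/' '_'
}

report_name recipes/ratburger/index.html    # prints recipes_ratburger_index.html
```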


BUGS

Please report bugs to bugs@fourmilab.ch, indicating the version numbers of BulkValidator, Perl, and the Perl LWP module installed on your system.


This software is in the public domain. Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, without any conditions or restrictions. This software is provided “as is” without express or implied warranty.
