static site checker

content

introduction
try
why
README
usage
known issues
build
download
boot notes
copyright & licence


introduction

The static site checker is an opinionated HTML nitpicker, a command–line tool to validate static HTML & XHTML websites. I built it to nitpick my hand–coded identity website.

It should not be used on untrusted content; its parsers are holier than Robin’s cow.

If you want to try it, play with the form below, or, if you’re really enthusiastic, here’s the current source.

Dylan Harris
August 2022


try







        

notes

SSC is a site checker, not a page checker, yet here you can only input a page, at most. It is not the best possible illustration of the program’s abilities, but it is better than nothing. Testing a web site would require proof of control of that site, and I’m not going there for a simple demo.

SSC is pre–alpha software. It has more errors than Fido has fleas. This page is a demo of its potential, that’s all. Do not presume reported issues are correct. Do not presume unreported issues aren’t issues. If you’re unsure, check against the appropriate standard. There is no guarantee of anything, let alone accuracy.

There is a fairly tight limit on the size of a snippet.


why ssc

Why did I make the static site checker? Aren’t there a lot of other HTML validators around? Well, first of all, I’ve not found a website validator, only web page validators. Perhaps I didn’t search sufficiently.

I have a fairly big website, with more than 100,000 pages. I’m too impatient to run each through a validator individually; I want to validate my site as a whole. Some errors occur between pages, not specifically on pages: not just missing links, but, for example, it’s invalid to link to an otherwise valid id on another page when that id’s element, or one of its ancestors, is HIDDEN.

The validators I did find were incomplete. Now, admittedly, I checked these out a few years ago, and some may well have got better. Editor based validators are certainly very useful, but they only work on individual pages, not sites. You have to edit a page to get it validated, and if you have rather a lot of them, that’s a lot of time wasted opening and closing individual pages. To expand that ID example above, if, in a specialist editor, you add a HIDDEN attribute to an element which has an id on a child element, does that editor then name & shame the other page you’ve just invalidated?

All this is part of the reason why many people use frameworks. (Another, the obvious one, is to get a site up quickly.) One of the difficulties I have with frameworks is that so many of them are, visually speaking, boring and trite. The visual arts world has had centuries to work out excellent form and vision to fit in a rectangular space, and it seems to me the modern web hasn’t noticed. The best that can be said is that some of them have approached the advances made in the 14th century, and that’s just in the Western artistic tradition. So much more is possible, yet it hasn’t happened. I want to break free from this dull, stultifying conservatism.

It may be that I’m making the wrong comparison, that the web isn’t about image, it’s about type. The comparison should not be with pictures, but papers. There’s certainly something to that. The Western visual high arts never did really suss mixing writing and form (actually, that’s not really true, but, IMHO, such arts never broke out of their context). But arts from Japan, for example, certainly did, and the web doesn’t seem to have noticed them either.

Also, to be absolutely fair, there are experimental websites mixing imagery and text rather well. But then we get back to my point about the 14th century. Those I’ve seen, and I’ve certainly not seen as many as there are, nor come close to it, still seem not to have noticed the changes in visual form made since the Middle Ages.

Anyway, enough of this. Rather than criticising other people for not doing, I should do. I should make my point, not by criticising others for not thinking of it, but by example. I need to knock up some example sites. That’s where SSC comes in.

You see, if I am to build a site using what is effectively an experimental visual process, I can’t use existing web site design frameworks. But if I can’t use a framework, I have to hand code everything. And there’s a key problem: HTML is such a convoluted, evolved mess, that the people who design it, in their own design presentations, make errors. Ok, I only found this out by testing SSC on them, which perhaps illustrates my point about things being overcomplicated. Anyway, I’m not going to reveal any names because these people are actually working hard to make the web a better place. Let’s just say W3 has broken links, WhatWG references withdrawn standards, and many other authors’ sites have other internal inconsistencies. I must mention that my HTML code is far worse than any of these mild examples of technical naughtiness. But the fact that the people who define the web make mistakes in the usage of their design in the documents that espouse the design, does rather explain why most other people are forced to use dull, formulaic, archaic, boring, tools.

I’ve not yet built a site inspired by the visual art world’s lessons in form and layout. My efforts have been spent in building the tool to make that possible. But now, I contend, it is at least a little more possible than it was.

Since I’m here, I’ll list other issues I have with frameworks:

  1. They have to be regularly maintained. Every time an update comes out, that update has to be applied to a site, or, alternatively, the update is ignored and the site becomes vulnerable to exploits blocked by the update, and published when the update is released. This is time lost.
  2. Updates for frameworks don’t always work. Instead of fixing issues, they break the site. This is why I dropped my experimental Drupal site a few years ago. This is why I stopped using NextCloud.
  3. Frameworks are usually written in scripted languages, such as PHP. Scripts are unavoidably insecure compared to no scripts: a script cannot be hacked if it does not exist. Thus a site with no scripts is inherently more secure than a site with scripts. For example, there are, as I write, five known vulnerabilities in PHP, a very popular server scripting language (better than it once was, admittedly). If you use PHP, your site, in principle, has those vulnerabilities, and you have to spend time mitigating them. If you do not use PHP, your site cannot have those vulnerabilities. This is why my sites have no scripts, and another reason why SSC does not analyse scripts (the main one being lack of time). I do accept that sophisticated sites have no choice but to use scripts, but I suggest many sites use them unnecessarily.
  4. The worst of them all, for me, is that some frameworks, and many scripts, pull in code residing on other sites, in third–party repositories and the like, as the script is run. If you do that, the integrity and security of your site is entirely dependent on the security of the repository. There are unfortunately many examples of repositories being hacked, and, in consequence, all the sites that used those repositories being broken in turn.

Dylan Harris
October 2021


README

Static Site Checker
(an opinionated HTML nitpicker)
version 0.0.129
https://ssc.lu/



(c) 2020–2022 dylan harris
see LICENCE.txt for copyright & licence notice
see W3-LICENCE.txt for additional copyright & licence information



WARNING: this code is:
— incomplete
— pre–alpha
— IT PROBABLY WON’T BEHAVE AS YOU EXPECT :-)
— do NOT feed it untrusted data



ssc analyses static HTML snippets, files and sites:
— HTML 1.0/+/2.0/3.0/3.2/4.00/4.01/5.0/5.1/5.2/5.3–draft
— HTML living standard, Jan 2005 to Apr 2022
— SVG 1.0/1.1/1.2 Tiny/1.2 Full/2.0/2.x draft Apr 2021
— MathML 1/2/3/4–draft
— XHTML 1.0/1.1/2.0/5.x
— finds broken links (requires curl)
— processes server side includes, mostly
— checks microdata & RDFa

with opinions on:
— standard english where dialect is required
— perfectly legal but sloppy HTML
— abhorrent rudeness such as autoplay on videos

It does NOT:
— behave securely: its parser is holier than robin’s cow
— analyse or understand scripts
— analyse or understand styles, beyond nicking class names from CSS
— analyse or understand XML or derivatives except as noted above

It can output:
— ‘repaired’ HTML (not XHTML)
— HTML with resolved Server Side Includes
— JSON summaries of microformat and microdata content
— website statistical information
— updated website with datafile deduplication


ssc -h
for a usage summary.

ssc -f config_file
analyse site using preprepared configuration

ssc directory
analyse website based in directory



To build & run:
1. Follow the build instructions in build.txt
2. Gleefully run ssc. It will misbehave if you are insufficiently gleeful.



NOTE
SSC can be run in a CGI environment. This is intended for use with OpenBSD’s native httpd web server
(https://man.openbsd.org/httpd.8). You are reminded that SSC is pre-alpha software. Do NOT expose it
to untrusted data sources, such as the open web, without taking serious precautions. SSC probably has
more bugs than the Creator’s Ultimate All–Beetle Extravaganza (J.B.S. Haldane, apocryphal : “[the
Creator has] an inordinate fondness for beetles.”).



Notes on names:
- recipe: a nod to Vernor Vinge’s “A Fire Upon the Deep”
- tea: without tea, nothing works; then there’s builders’ tea
- sauce: identifies those who presume; and anyway, it’s obvious
- toast: toasts code; i like burnt toast
- heater: i’m not stopping now
- unii: my preferred plural of unix; both unixes and unices sound like they sing castrato




SEE ALSO
build.txt        notes on building ssc
gen.txt          a model man page
usage.txt        how to use ssc
releasenotes.txt a slight history of releases
LICENCE.txt      ssc licence information
LICENSE.txt      formal GPL 3 licence
more licences    licences for borrowed external content



written by dylan harris
mail@ssc.lu
April 2022

usage

NAME
ssc - analyse static web site source



SYNOPSIS
ssc [...] directory
ssc -f config
ssc



DESCRIPTION
ssc (the Static Site Checker) is an opinionated HTML nit-picker, intended for
people, such as its author, who hand code websites. It doesn't just check
static websites for broken links, dubious syntax, and bad semantic data, it
will actively complain about things that are perfectly legal but just a little
bit untidy, like its author.

Except when serving CGI queries, it recursively scans the directory looking
for HTML source files to analyse. It produces a list of errors, warnings,
comments, and other hints of imperfection. Once complete, it summarises
internal site inconsistencies, and can produce some simple statistics.

Scripts are ignored. CSS is only processed for class declarations.



COMMAND LINE ONLY SWITCHES

These options are only available on the command line:

-f file                 Load configuration from file, which should be in .INI
                        file format. See CONFIGURATION FILE FORMAT below.

-F                      Load the configuration file .ssc/config in the current
                        directory.

-h                      Show a summary of switches, then exit.

-H snippet              Only nitpick this snippet of HTML.

--ontology.list         List known schema versions, then exit.

-V                      Show version details, then exit.

--validation            Show attribute extensions, then exit. Attribute
                        extensions are additional values that can be
                        associated with attributes on some X/HTML elements,
                        and are intended for use with bespoke extensions of
                        HTML.



COMMAND LINE AND CONFIGURATION FILES SWITCHES

These options are available on the command line (with dashes) and in
configuration files (without dashes). The short form alternative switches
only work on the command line.

Most binary options, i.e. those without arguments below that turn on a feature
(which may be the default), have a corresponding "no-" switch to turn it off.
The "no-" is inserted after the dot, so, for example, the contradiction to
"--general.class" is "--general.no-class". When both are specified, perhaps in
a configuration file and on the command line, the "no-" switch always applies.
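
The naming rule can be illustrated with a short Python sketch. It is not taken from ssc's source: the helper names negate and resolve are hypothetical, and the code only models the rule as described above, namely that "no-" goes after the dot and that the "no-" form applies whenever both forms are given.

```python
from typing import Iterable, Optional


def negate(switch: str) -> str:
    """Return the "no-" form of a dotted switch such as "--general.class"."""
    section, dot, name = switch.partition(".")
    if not dot or not name:
        raise ValueError(f"expected a dotted switch, got {switch!r}")
    return f"{section}.no-{name}"


def resolve(switch: str, given: Iterable[str]) -> Optional[bool]:
    """Decide a binary option from every switch supplied, wherever it came from.

    Returns False if the "no-" form is present (it always applies),
    True if only the positive form is present, and None if neither was
    given, in which case the documented default is left to apply.
    """
    supplied = set(given)
    if negate(switch) in supplied:
        return False
    if switch in supplied:
        return True
    return None
```

For example, resolve("--general.class", ["--general.class", "--general.no-class"]) returns False, whatever order the two switches arrive in.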

--corpus.article        Prefer the content of <ARTICLE> when gathering corpus
                        text.

--corpus.body           Prefer the content of <BODY> when gathering corpus
                        text. This is the default.

--corpus.main           Prefer the content of <MAIN> when gathering corpus
                        text.

--corpus.output file    Dump XML corpus of site into file. This is intended
                        for use by a local search engine. If none of
                        --corpus.article, --corpus.body, or --corpus.main are
                        specified, the content of <BODY> is used. If more than
                        one are specified, then the text collected depends on
                        a page's content. This is incompatible with
                        --shadow.update.

--general.class         Report unrecognised classes.

--general.classic       Report all classes used.

--general.cgi           Check environment variables for snippets of HTML.
-W                      Expects environment variables as produced by
                        OpenBSD's native web server, httpd, produced
                        using <FORM METHOD=GET …>. Do NOT let ssc
                        anywhere near untrusted data should you do so,
                        unless you are very confident about your system
                        configuration (because it'll defend you against
                        naughty people who may take advantage of ssc's
                        pre-alpha nature and its corresponding myriad
                        flaws). Ignores many options including shadowing.

--general.css           Do NOT process .css files (for class names).

--general.custom EL     Define a custom element <EL> for verifying the IS
                        attribute. May be repeated.

--general.datapath dir  Look for any configuration, caches, and other useful
-p dir                  files, in this directory.

--general.error x       If nits of the specified category or worse are
-E                      generated, then exit with an error code. Values are:
                        'catastrophe', 'error' (the default), 'warning',
                        'info', or 'comment'.

--general.exclude REG   Ignore all files that match the POSIX regular
                        expression REG. May be repeated.

--general.file XXX      File for persistent data. Requires -N. See also
                        --general.datapath. Default extension: .ndx.

--general.git           Ignore internal git files.

--general.ignore EL     Ignore attributes and content of the element <EL>. May
                        be repeated.

--general.info          Report launch context when starting.

--general.lang LA       If an X/HTML file does not have a language / dialect
                        specified (e.g. "en" for generic English, "en-IE" for
                        Irish English, "lb-LU" for Luxembourgish, etc.),
                        default to 'LA'. If not given, the default is your
                        system default, or, if none, then "en-US" (standard
                        American English).

--general.maxfilesize n Do not process HTML source files that exceed n bytes
                        in size (default: 4M). Specify 0 for unlimited,
                        although be warned that ssc is stunningly stupid in
                        such circumstances and may even attempt to load files
                        bigger than available memory.

--general.output        Output to the specified file. If this switch is not
-o file                 used, standard output is used.

--general.progress      Dump progress information to standard output. This can
-D                      interfere with formatted output.

--general.rdfa          Check RDFa attributes.

--general.rel           Only mention <LINK> REL values, found neither in the
                        living standard nor at microformats.org, in debug
                        output.

--general.rpt           Report CSS files that are opened. ssc opens CSS files
                        to verify classes; it does not verify CSS.

--general.slob          Ignore perfectly legal but inefficient, indeed
                        thoroughly slobby, HTML, such as being far too lazy to
                        get round to bothering to close elements.

--general.spec          Reset the values of most switches to false.
-j

--general.ssi           Process Server Side Includes. Although ssc can process
-I                      many server side includes, it cannot process those
                        containing formulae. Note that processing SSIs may
                        cause incorrect line numbers to be mentioned when an
                        issue is reported.

--general.test          Output data in automated test format. Used by ssc-test.
-T                      Not generally useful. Documented so you can avoid it!

--general.thread N      Use N threads when running. Defaults to a value
                        appropriate for the hardware. Too high a value can
                        cause problems.

--general.verbose x     Output nits to the specified verbosity: 'catastrophe',
-v                      'error', 'warning', 'info', 'comment' (the default),
                        or '0' for silence. Additional values are available
                        when debugging. Each level includes its preceding
                        level, so, for example, 'warning' will also output
                        'catastrophe' and 'error' nits.
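
As a rough illustration of how the levels nest, here is a minimal Python sketch. The names LEVELS and is_output are hypothetical, the additional debugging values are not modelled, and this is not ssc's own code; it only encodes the ordering given above.

```python
# Severity levels named above, from most to least severe.
LEVELS = ["catastrophe", "error", "warning", "info", "comment"]


def is_output(nit_level: str, verbosity: str) -> bool:
    """Return True if a nit of nit_level is shown at the given verbosity."""
    if verbosity == "0":
        # '0' means silence: nothing is output.
        return False
    # A level includes every level that precedes it in LEVELS.
    return LEVELS.index(nit_level) <= LEVELS.index(verbosity)
```

With verbosity 'warning', is_output returns True for 'catastrophe', 'error' and 'warning' nits, and False for 'info' and 'comment' nits.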

--html.rfc1867          Ignore the RFC 1867 (INPUT=FILE) extension when
                        processing HTML 2.0.

--html.rfc1942          Ignore the RFC 1942 (tables) extension when processing
                        HTML 2.0.

--html.rfc1980          Ignore the RFC 1980 (client side image maps) extension
                        when processing HTML 2.0.

--html.rfc2070          Ignore the RFC 2070 (internationalisation) extension
                        when processing HTML 2.0.

--html.tags             When an HTML file is loaded that contains no DOCTYPE,
                        ssc normally presumes it's an HTML 1 file. This switch
                        tells it to presume the file follows an earlier HTML
                        Tags specification (the one at CERN). This is
                        overridden by --html.version.

--html.title n          If <TITLE> text is longer than n characters, say so.
-z n                    This applies to text enclosed by a <TITLE> element
                        under <HEAD>, not the value of TITLE attributes.

--html.version X        If no doctype (or xml header) is specified, presume
                        version X of HTML. X can be:
                            tags        HTML tags (1991, informal),
                            1.0         HTML 1.0 (June 1993 draft),
                            +           HTML Plus (November 1993 draft),
                            2.0         HTML 2.0,
                            3.0         HTML 3.0 (March 1995 draft),
                            3.2         HTML 3.2,
                            4.0         HTML 4.0,
                            4.1         HTML 4.01,
                            4.2         XHTML 1.0,
                            4.3         XHTML 1.1 core,
                            4.4         XHTML 2.0 (December 2010 draft),
                            5.0         W3 HTML 5.0,
                            5.1         W3 HTML 5.1,
                            5.2         W3 HTML 5.2,
                            5.3         W3 HTML 5.3 (October 2018 draft),

                            2005/1/1    WhatWG WebApps draft (January 2005),
                            ...         (half-yearly)
                            2007/1/1    WhatWG WebApps draft (January 2007),
                            2007/7/1    WhatWG HTML 5 (July 2007),
                            ...         (half-yearly)
                            2021/1/1    WhatWG HTML 5 (January 2021),
                            ...         (quarterly)
                            2022/4/1    WhatWG HTML 5 (April 2022),

                            XHTML 1.0   XHTML 1.0,
                            XHTML 1.1   XHTML 1.1 core,
                            XHTML 2.0   (December 2010 draft),
                            XHTML 5.x   XHTML corresponding to equivalent W3
                                        HTML.

                        Although you can specify exact dates for versions of
                        the WhatWG HTML 5 living standard, currently only
                        broad versions published in January and July are
                        supported (quarterly from 2021).

                        Certain versions of HTML offer variants, such as loose
                        and strict definitions. ssc picks those up from the
                        <!DOCTYPE …> in the HTML file, if any, and then
                        carefully ignores them.

                        Validation of XHTML is even less strict.

                        Just to remind you, there are no guarantees of
                        accuracy (or inaccuracy).

                        Copies of the appropriate standards can be found
                        online. A copy of the copies referenced during ssc's
                        development can be found at https://ssc.lu/.

--link.301              Normally, when ssc checks external links
-3                      (--link.external), it does not report HTTP forwarding
                        errors 301 and 308. Use this switch to have it do so.

--link.check            Check internal links, i.e. those within the website
-l                      being analysed.

--link.example          Report links to faux domains, as defined by RFC 2606
                        (note ssc also reports links to example.edu).

--link.external         Check external links, i.e. those not on the site being
-e                      checked. This requires a copy of curl on the path.
                        Note that ssc will NOT check RFC 2606 links, such as
                        example.com (see --link.example).

--link.forward          Report HTTP forwarding errors encountered when checking
                        external links (e.g. 301 and 308).

--link.ignore DOMAIN    When checking external links, ignore this domain. May
                        be repeated.

--link.local            Report links to local domains, such as domains ending
                        in .lan, .home, .corp, and others.

--link.once             Only report each broken external link once. If, for
-O                      example, the site has a number of references to a page
                        that does not exist, ssc will only report the first
                        instance of the broken link. Note that, even if it
                        reports every occurrence of the link, it will only
                        check it the first time it's encountered (requires
                        --link.external).

--link.report DOMAIN    Report links to domain and its descendants. May be
                        repeated.

--link.revoke           Do not check whether links' https certificates have
-r                      been revoked (requires --link.external).

--link.xlink            Check crosslink IDs on the site being analysed. For
-X                      example, if a link goes to /index.html#id, then, when
                        this switch is set, ssc will verify that the id exists
                        and that it is not hidden.
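
The idea behind this check can be sketched in Python using only the standard library. This is an illustration, not ssc's implementation: the names IdCollector and check_fragment are hypothetical, and the sketch assumes an id counts as hidden when its own element or any open ancestor carries a HIDDEN attribute.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never left open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class IdCollector(HTMLParser):
    """Record each id on a page and whether it sits in a hidden subtree."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, hidden) for every element currently open
        self.ids = {}    # id -> True if the element or an ancestor is hidden

    def handle_starttag(self, tag, attrs):
        names = dict(attrs)
        inherited = self.stack[-1][1] if self.stack else False
        hidden = inherited or "hidden" in names
        if names.get("id"):
            self.ids[names["id"]] = hidden
        if tag not in VOID:
            self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        # Close back to the nearest matching open element, which tolerates
        # sloppy markup that leaves inner elements unclosed.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def check_fragment(html: str, fragment: str) -> str:
    """Return 'ok', 'hidden' or 'missing' for a link target such as #fragment."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    if fragment not in collector.ids:
        return "missing"
    return "hidden" if collector.ids[fragment] else "ok"
```

A site checker would run something like this against the target page of every link that carries a fragment, which is why the check cannot be done one page at a time.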

--math.version          Presume this version of MathML. The following versions
                        are supported:
                                0   work it out from the (HTML) version of the
                                    file being analysed,
                                1   MathML 1,
                                2   MathML 2,
                                3   MathML 3,
                                4   MathML 4 (December 2020 draft).

--microdata.export      Export schema.org microdata encountered. This data is
                        exported in JSON format (not JSON-LD).

--microdata.root DIR    When exporting microdata with --microdata.export,
                        write files into the directory DIR. ssc will create
                        the directory tree structure as appropriate.

--microdata.verify      Check microdata found in WhatWG microdata attributes
-m                      (itemprop, itemtype, etc.). Note that ssc only knows
                        about certain ontologies (see --ontology.list).

--microdata.virtual v=d When exporting microdata using --microdata.export,
                        export the contents of virtual directory 'v' to 'd'.
                        'v' must match a directory identified with
                        --site.virtual. For example:
                            --microdata.virtual virtual=X:\virtual.

--microformat.export    Export microformat data encountered in JSON format.
                        This option will write files in the same directory as
                        the source, with the extension .json.

--microformat.verify    Verify Microformats data in class and rel attributes
-M                      (see https://microformats.org/).

--microformat.version x Presume microformats version x. The following values
                        are currently accepted:
                                1   microformats version 1 only,
                                2   microformats version 2 only,
                                3   both microformats versions 1 and 2.

--nits.catastrophe n    Redefine nit n as a catastrophe; may be repeated (the
                        value of n can be determined using --nits.nids below).

--nits.codes            Output nit codes.

--nits.comment n        Redefine nit n as a comment; may be repeated (the
                        value of n can be determined using --nits.nids).

--nits.debug n          Redefine nit n as a debug message; may be repeated
                        (the value of n can be determined using --nits.nids).

--nits.error n          Redefine nit n as an error; may be repeated (the
                        value of n can be determined using --nits.nids).

--nits.format F         Specify the output format; F is a template file (see
                        OUTPUT TEMPLATE below).

--nits.info n           Redefine nit n as information; may be repeated (the
                        value of n can be determined using --nits.nids).

--nits.nids             Output nit ids, which can be used to redefine nits.

--nits.override F       Use this output format, not the one specified by
                        --nits.format. F is a template file (see OUTPUT
                        TEMPLATE below). This switch is intended to aid
                        automation.

--nits.quote X          Specify quote style when using --nits.format. X can be
                        one of 'text' or 'html'.

--nits.root             By default, seek nit output template files in the
                        website root.

--nits.silence n        Silence nit n; may be repeated (the value of n can be
                        determined using --nits.nids).

--nits.unique           Do not output repeated nits, even if they may contain
                        additional information.

--nits.warning n        Redefine nit n as a warning; may be repeated (the
                        value of n can be determined using --nits.nids).

--nits.watch            Output debug nits (intended for automation).

--rdfa.version          When checking RDFa files, presume this version
                        (default: 1.1.3). Note, RDFa analysis is incomplete,
                        and only intended for supporting HTML analysis.

--ontology.ONT X.Y      Presume version X.Y of ontology ONT. For example:
                            --ontology.xsd 1.1
                        defaults usage of XSD to version 1.1. The versions
                        apply to RDFa, microdata, and microformats (using
                        class) analysis. If .Y is omitted, .0 is presumed.
                        X must be present. Unspecified defaults are
                        derived from the HTML version. For a list of
                        possible values, use --ontology.list.

                        At the time of writing, the following ontology
                        versions can be verified. Note that single version
                        ontologies cannot have their version changed:
                                article 12,14,18,22
                                as 1.0,2.0
                                bibo 1.3
                                book 12,14,18,22
                                cc 1.0
                                content 1.0
                                csvw 1.0
                                ctag 1.0
                                daq 1.0
                                dbp 1.0
                                dbp-owl 1.0
                                dbr 1.0
                                dc11 1.0,1.1
                                dcam 1.0
                                dcat 1.0,2.0
                                dcmi 1.0
                                dcterms 1.0,1.1
                                doap 1.0
                                dqv 1.0
                                describedby 1.0
                                duv 1.0
                                earl 1.0
                                event 1.0
                                foaf 0.1-0.99
                                frbr_core 1.0
                                gr 1.0
                                grddl 1.0
                                gs1 1.1-1.5
                                ical 1.0
                                icaltzd 1.0
                                jsonld 1.0,1.1
                                ldp 1.0
                                license 1.0
                                locn 1.0
                                ma 1.0
                                mf 1.0-2.255
                                music 12,14,18,22
                                oa 1.0
                                odrl 1.0
                                og 10,12,14,18,22 (see below)
                                org 1.0
                                owl 1.0,2.0
                                poetry 1.0
                                profile 12,14,18,22
                                prov 1.0
                                ptr 1.0
                                qb 1.0
                                rdf 1.0-1.3
                                rdfa 1.0-1.3
                                rdfg 1.0
                                rdfs 1.0
                                rev 1.0
                                rif 1.0
                                role 1.0
                                rr 1.0
                                schema 0.10-14.0 (see below)
                                sd 1.0
                                sioc 1.0
                                sioc_s 1.0
                                sioc_t 1.0
                                skos 1.0
                                skosxl 1.0
                                sosa 1.0
                                ssn 1.0
                                taxo 1.0
                                time 1.0
                                v 1.0
                                vann 1.0,1.1
                                vcard 1,2,3,4 (see below)
                                video 12,14,18,22
                                void 1.0
                                wdr 1.0
                                wdrs 1.0
                                website 12,14,18,22
                                wwg 1.0
                                xhv 1.0
                                xml 1.0
                                xsd 1.0,1.1
                        vCard versions correspond to RDFa specs, published in
                        2001, 2006, 2010 & 2014. They do NOT correspond to
                        vCard data format specifications.
                        Open Graph versions correspond to snapshots of the
                        specs from 2010, 2012, 2014, 2018 & 2022.
                        Most versions of schema (schema.org) should be
                        specified by their version number, but this doesn't
                        work with early versions, which should be specified as
                        follows:
                                Use         For
                                0.10        June 2011
                                0.15        July 2011
                                0.20        August 2011
                                0.25        September 2011
                                0.30        October 2011
                                0.35        November 2011
                                0.40        December 2011
                                0.45        January 2012
                                0.50        February 2012
                                0.55        March 2012
                                0.60        April 2012
                                0.91-0.99   as version number
                                1.0         1.0a
                                1.1         1.0b
                                1.2         1.0c
                                1.3         1.0d
                                1.4         1.0e
                                1.5         1.0f
                                1.10        1.1
                                1.20        1.2
                                1.30        1.3
                                1.40        1.4
                                1.50        1.5
                                1.60        1.6
                                1.70        1.7
                                1.80        1.8
                                1.90        1.9
                                1.91...     as version number

--shadow.changed        When shadowing a site that has been previously
                        shadowed, only copy/link files that have changed.

--shadow.comment        Do not delete comments when writing shadow pages.

--shadow.copy X         Create a shadow directory structure from source HTML
                        files, with errors removed and some things tidied up.
                        X can be:
                                no     copy nothing (default);
                                pages  write 'fixed' source files, ignore non
                                       source files;
                                hard   set up hard links to non-source files
                                       (requires source and shadow directories
                                       to be on the same disk);
                                soft   set up soft links to non-source files;
                                all    copy non HTML files too;
                                dedu   copy non HTML files too, but
                                       deduplicate them, changing links in
                                       HTML source if necessary;
                                report report duplicates (no shadowing).
                        ssc cannot convert between versions of HTML, nor
                        between HTML and XHTML. The soft and hard link options
                        are only available on systems that support them.

--shadow.enable         Enable shadowing (set by other shadow options). If
                        shadowing is enabled, but shadow.root is not set, SSC
                        will litter the site source directories with .ndx
                        files.

--shadow.file f         Write ssc's shadow cache to file f, to accelerate
                        future shadowing of the same content.

--shadow.ignore ext     When shadowing, ignore files with this extension (may
                        be repeated).

--shadow.info           Add a comment at or near the top of each shadowed HTML
                        file noting its generation time.

--shadow.msg text       Insert a comment containing the text at the top of
                        every generated page. Note that, if any SSI included
                        file is updated, the comment will appear whether or
                        not the original page is updated.

--shadow.root dir       Where to write the shadowed site.

--shadow.space          Leave excess/repeated spaces and blank lines in the
                        shadowed files untidily untouched.

--shadow.ssi            Do NOT resolve Server Side Includes when shadowing,
                        even if --general.ssi is set.

--shadow.update         Only examine files that have changed since the last
-u                      time ssc ran. This is incompatible with --corpus.file.
                        This requires --shadow.file. Nits of files that have
                        not changed will not be reported.

--shadow.virtual v=d    When shadowing virtual directories, output the shadow
                        of virtual directory 'v' to directory 'd'. 'v' must
                        match a directory set up using --site.virtual.

--site.domain domain    The domain name of the site is 'domain'. This can be
-S domain               repeated. This is used to identify any URL that is
                        apparently external but is actually internal to the
                        site.

--site.extension ext    Treat files with this extension as X/HTML source
-x ext                  files. This may be repeated. Files with extension
                        .html are always checked.

--site.index file       This is the name of the index file in a directory.
-i file                 This can be repeated. This is used for checking
                        internal links.

--site.root dir         This is the root of the website to analyse. ssc will
-g dir                  recursively scan the directory analysing any HTML
                        files it finds. The default is the current directory.

--site.virtual v=d      The virtual directory 'v' is located in actual
-L v=d                  directory 'd' on the local filesystem. For example:
                            --site.virtual virtual=D:\actual

--spell.accept XXX      XXX is a correct spelling of a word (or a list of
                        words) in all languages.

--spell.cased           Nitpick correctly spelt but wrongly cased words.

--spell.check           Check text spelling. Uses external spelling checkers,
                        so results may be inconsistent between systems.

--spell.dict LANG,DICT  Unix only. Associate dictionary DICT with LANG. For
                        example, if the standard English dictionary is
                        en_GB-large:
                            --spell.dict en-GB,en_GB-large
                        (Under Windows, ssc uses the OS dictionaries.)

--spell.icu             If "no", do not use the ICU libraries at all (they are
                        rather slow). This will increase the inaccuracy and
                        incorrectness of the spell checks.

--spell.list FN,LANG    The file FN contains a list of valid spellings for
                        language LANG (which may include country info). If
                        LANG is omitted, the valid spellings apply to all
                        languages. For example:
                            --spell.list villages.txt,en-IE
                            --spell.list dorfer.txt,de
                            --spell.list letzstied.txt

--spell.path PATH       Unix only. Path to spelling executable. Hunspell or
                        a compatible program is expected. If none is specified,
                        ssc will seek hunspell. (Under Windows, ssc uses the
                        OS spellchecker.)

--stats.export F        Export stats to file F.

--stats.meta            Produce statistics on <META> usage in <HEAD>. Note
                        that pragmas reported (http-equiv) are those found in
                        the HTML source, not those returned by the HTTP
                        protocol. Remember that many web servers (not all)
                        will remove some pragmas when serving pages.

--stats.page            Produce statistics for each source file encountered.

--stats.summary         Produce a summary of overall statistics for the
                        website.

--svg.version x         Presume any SVG code encountered is this version,
                        unless the SVG code itself specifies a version.
                        Versions recognised:
                            1.0,
                            1.1,
                            1.2 (really 1.2/tiny),
                            1.2/tiny,
                            1.2/full (May 2004 draft, incomplete, any conflict
                                      with tiny always resolved in favour of
                                      tiny),
                            2.0,
                            2.1 (April 2021 draft).
                        If this switch is not used, and some SVG code does not
                        identify its version, the version is derived from the
                        version of the host X/HTML code.

--validation.attribute ATT
                        Add the custom attribute ATT. This attribute will
                        be ignored, not validated.

--validation.charset CH Accept CH as a charset. May be repeated.

--validation.class CL   Add the valid class CL. May be repeated.

--validation.color COL  Accept COL as a colour. May be repeated.

--validation.colour COL Accept COL as a colour. May be repeated.

--validation.country CC Accept CC as a valid two-letter country code. May be
                        repeated.

--validation.currency CUR
                        Accept CUR as a valid currency. May be repeated.

--validation.element EL Accept <EL> as a valid element. This element will be
                        ignored, not validated. May be repeated.

--validation.element-attribute EL,ATT
                        Accept the known attribute ATT on the element <EL>.
                        Doesn't work with namespaces (names containing ':').
                        May be repeated.

--validation.extension EXT
                        Accept the extension EXT as a mimetype file extension.
                        May be repeated.

--validation.httpequiv HEQ
                        Accept HEQ as a valid macro for httpequiv on <META>
                        elements. May be repeated.

--validation.lang LANG  Accept LANG as a valid language code. May be repeated.

--validation.minor x    When validating W3 HTML 5 source code, use this
-m x                    minor version of W3 HTML 5. Valid values are 0, 1, 2,
                        and 3 (draft). WhatWG versions are determined by date,
                        corresponding roughly to the date of the (online)
                        publication of the specific version. See the
                        --html.version switch.

--validation.metaname M
                        Accept M as valid for the NAME attribute of the <META>
                        element. The VALUE will be ignored. May be repeated.

--validation.microdata  Validate (schema.org) microdata.

--validation.mimetype MT
                        Accept MT as a valid mimetype. May be repeated.

--validation.sgml SGML  Accept SGML as a valid SGML schema identification (as
                        found in <!DOCTYPE …>). May be repeated.

--validation.XXX YYY    Accept YYY as a valid value for attribute type XXX.
                        For a list of possible values of XXX, use the command
                        line switch --validation.



CONFIGURATION FILE FORMAT

If a configuration file is used, it should be in INI file format. All content
is optional.

Section and option names are derived from the long form switch name, which
consists of SECTION.OPTION, laid out in the format:
[SECTION]
OPTION=yes
OPTION=123456

Switches that do not have a long form version cannot be used in a
configuration file.
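
As a rough illustration of that naming rule, here is a small shell sketch that
turns one long form switch into the matching INI lines. The helper name
switch_to_ini is made up for this example and is not part of ssc; it assumes a
switch without a value is written as OPTION=yes, as in the layout above.

```shell
# Hypothetical helper: print the INI lines for one long form switch.
# Usage: switch_to_ini --SECTION.OPTION [VALUE]
switch_to_ini() {
    name=${1#--}          # drop the leading "--"
    section=${name%%.*}   # text before the first dot
    option=${name#*.}     # text after the first dot
    # A switch given without a value is assumed to mean OPTION=yes.
    printf '[%s]\n%s=%s\n' "$section" "$option" "${2:-yes}"
}

switch_to_ini --site.domain example.edu
switch_to_ini --stats.summary
```

Each call prints its own section header, so when several options share a
section the repeated headers would need merging by hand.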

Each ssc test (in the toast folder) has a configuration file; browse them for
examples.



ENVIRONMENT

If you set --general.cgi, ssc will check these environment variables:

QUERY_STRING            Run under OpenBSD's httpd server. See notes below.
SSC_CONFIG              If no configuration file is given on the command line,
                        use this one
SSC_ARGS                Preliminary command line parameters

If, when SSC is run, the environment variable QUERY_STRING is set to an
OpenBSD httpd server CGI value that includes the parameter html.snippet, then
SSC will nitpick that snippet only. Some other parameters are processed,
including general.verbose and html.version.



EXIT STATUS
If no significant nits are found, ssc exits with 0, otherwise it exits with a
value > 0.
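
A script can use that exit status to decide whether to carry on. The sketch
below is one way to do it; report_check is a made-up helper name, and the
commands true and false stand in for ssc so the example runs without ssc
installed.

```shell
# Run a checker command and report on its exit status.
# The command is passed as arguments, e.g.: report_check ssc -f site.conf
report_check() {
    if "$@"; then
        echo "no significant nits"
    else
        status=$?    # exit status of the command that just failed
        echo "significant nits found (exit status $status)"
        return "$status"
    fi
}

# Stand-ins for ssc: 'true' exits with 0, 'false' exits with 1.
report_check true
report_check false || echo "skipping upload"
```

In real use, replace the stand-ins with something like
report_check ssc -f site.conf, and put the upload step after it.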



OUTPUT TEMPLATE

The --nit.format switch allows control of output format. It takes a file name.
The format of that text file is a sequence of fixed section names, enclosed in
square brackets on their own lines, each optionally followed by text. In that
text, certain specific identifiers, enclosed in brace pairs, are substituted.
For example:

[dog-section]
My pet dog {{dog-name}} is a {{bad-dog}}.

For examples, browse recipe/toast/output/*.nit

If no file is specified, or if the file cannot be loaded, a default template is
used.

Note also the --nit.quote switch.
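
To make the substitution idea concrete, here is a sketch that mimics it with
sed on the made-up dog template above. It is an illustration only: ssc does
its own substitution, and the section name, identifiers and values here are
not real ssc ones.

```shell
# Illustration only: replace {{identifier}} placeholders with values.
# Values containing '/' or '&' would need escaping for sed.
fill_template() {
    sed -e "s/{{dog-name}}/$1/g" -e "s/{{bad-dog}}/$2/g"
}

printf '%s\n' '[dog-section]' 'My pet dog {{dog-name}} is a {{bad-dog}}.' |
    fill_template Fido 'good dog'
```

The section line passes through unchanged; only the text inside the brace
pairs is replaced.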



EXAMPLES

To verify the version of ssc:
ssc -V

To check the static website source directory /home/site/wwwroot:
ssc /home/site/wwwroot

To check a static website for example.com which uses server side includes, that
lies in the current directory, with verification of external links, giving
rather verbose output:
ssc -e -I -x html -x shtml -S example.com -v 5 -i index.shtml

To check a static website in the current directory, with a virtual directory,
verifying microformats:
ssc -L virtual=/home/site/virtual -M

To check a static web site using a configuration file:
ssc -f config.file

A simple configuration file might contain:
[general]
verbose=4
output=simple.out
[site]
domain=example.edu
extension=html
index=index.html
root=simple

A configuration file to check a site against HTML 5.2 and SVG 1.1 might
contain:
[general]
output=site.out
class=yes
[link]
check=yes
[site]
domain=example.edu
extension=html
index=index.html
root=site
[html]
version=5.2
[svg]
version=1.1

A configuration file to check against a particular WhatWG living standard,
gathering statistics:
[general]
output=jan21.out
[html]
version=2021/01/01
[link]
check=yes
[microdata]
version=11.0
[site]
domain=example.edu
extension=html
index=index.html
root=site
[stats]
summary=yes
meta=yes

A configuration file to shadow copy and deduplicate a site might contain:
[general]
output=dedu.out
class=yes
[site]
domain=example.edu
extension=html
index=index.html
root=site
[shadow]
copy=dedu
root=shadow
file=dedu.ndx
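
The hard value of --shadow.copy needs the source and shadow directories to be
on the same disk. The sketch below is one way to check that before choosing a
value. It takes "same disk" to mean the same mounted filesystem, which is my
assumption, and the helper names are made up for this example.

```shell
# Print the mount point that holds a path, from POSIX 'df -P' output.
# Limitation: mount points containing spaces are truncated.
mount_of() {
    df -P "$1" | awk 'NR == 2 { print $6 }'
}

# Hypothetical helper: suggest a --shadow.copy value for non-source files.
# Prints "hard" when both directories share a mount point, else "soft".
suggest_copy_mode() {
    if [ "$(mount_of "$1")" = "$(mount_of "$2")" ]; then
        echo hard
    else
        echo soft
    fi
}

suggest_copy_mode . .
```

Pass the site root and the shadow root as the two arguments, then use the
printed word as the copy value in the [shadow] section.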

A configuration file to export microdata, checked against schema.org version
7.2, might contain:
[general]
output=export.out
class=yes
[site]
domain=example.edu
extension=html
index=index.html
root=site
[link]
check=yes
[microdata]
export=yes
root=export
version=7.2
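
Early schema.org releases need the substitute values listed in the table of
schema versions in the switch descriptions. The sketch below encodes that
table as a shell function; schema_version is a made-up name for this example,
and the function does nothing except look up the value to write.

```shell
# Hypothetical helper: translate a schema.org release label into the
# version value ssc expects, following the table of early versions.
schema_version() {
    case "$1" in
        "June 2011")      echo 0.10 ;;
        "July 2011")      echo 0.15 ;;
        "August 2011")    echo 0.20 ;;
        "September 2011") echo 0.25 ;;
        "October 2011")   echo 0.30 ;;
        "November 2011")  echo 0.35 ;;
        "December 2011")  echo 0.40 ;;
        "January 2012")   echo 0.45 ;;
        "February 2012")  echo 0.50 ;;
        "March 2012")     echo 0.55 ;;
        "April 2012")     echo 0.60 ;;
        1.0a) echo 1.0 ;;
        1.0b) echo 1.1 ;;
        1.0c) echo 1.2 ;;
        1.0d) echo 1.3 ;;
        1.0e) echo 1.4 ;;
        1.0f) echo 1.5 ;;
        1.[1-9]) echo "${1}0" ;;   # releases 1.1 to 1.9 become 1.10 to 1.90
        *) echo "$1" ;;            # everything else: use the version number
    esac
}

schema_version "1.0c"
schema_version "1.3"
schema_version "7.2"
```

The three calls print 1.2, 1.30 and 7.2, one per line.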



PREPARING and UPDATING a SITE

These notes are based on the steps I take to update an OpenBSD website.

Presume a directory containing the following:
site.conf    ssc configuration file for a website
site         shadow output produced by ssc

Then I run a script like this:

ssc -f site.conf
upload.sh site /var/www/site-upload server user 0
ssh user@server "cd /var/www ; mv site x ; mv site-upload site ;
mv x site-upload ; ln -sf site htdocs"

upload.sh is a macos bash script that can be found among the source code. Note
that I have rather naughtily replaced OpenBSD's httpd document directory
/var/www/htdocs with a link.
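
The directory swap in the ssh command can be tried out locally first. This is
a sketch of the same three moves, run in a throwaway directory rather than
/var/www; swap_site is a made-up helper name, and the final ln -sf step is
left out.

```shell
# Swap a freshly uploaded directory with the live one, keeping the old
# live copy as the next upload target.
# Arguments: parent directory, live name, upload name.
swap_site() {
    (
        cd "$1" || exit 1
        # Refuse to start unless both directories exist.
        [ -d "$2" ] && [ -d "$3" ] || exit 1
        mv "$2" x && mv "$3" "$2" && mv x "$3"
    )
}

# Demonstration in a temporary directory.
demo=$(mktemp -d)
mkdir "$demo/site" "$demo/site-upload"
echo old > "$demo/site/index.html"
echo new > "$demo/site-upload/index.html"
swap_site "$demo" site site-upload
cat "$demo/site/index.html"
```

After the swap, site holds the new pages and site-upload holds the previous
ones, ready to be overwritten by the next upload.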

Here is site.conf:

[general]
verbose=info
class=yes
output=site.out
ssi=yes
ignore=pre
rpt=yes

[html]
version=2021/04/01

[link]
check=yes
xlink=yes

[microformat]
verify=yes

[site]
domain=example.com
extension=html
extension=shtml
index=index.shtml
root=corrupt_source

[stats]
summary=yes

[shadow]
copy=dedu
root=site
file=site.ndx
ignore=inc
info=yes




SEE ALSO
tidy
linkchecker




HISTORY
ssc is written by Dylan Harris, https://ssc.lu/.

known issues

SSC is pre–alpha software. It doesn’t do what it’s supposed to do, and what it’s supposed to do is wrong.

Note that GitHub hosts a list of known issues.

* How can such a dangerous animal have such a cuddly name? It’s like calling the Hound of Hell ‘Fluffy’.


build

BUILD NOTES
static site checker
https://ssc.lu/
(c) 2020-2022 Dylan Harris


Introduction
============
SSC can be built from various unii using CMake, or with Visual Studios
2017 / 2019 / 2022 under Windows. I have built & tested it under x64 in
selected OSs.


Libraries
=========

Common dependencies
-------------------
You should install boost version 1.75 or better (https://boost.org), a recent
version of the ICU libraries (https://icu-project.org/), & Microsoft's GSL
library (https://github.com/Microsoft/GSL). Most unii have most available as
packages. You can build and install them yourself if you prefer.

You may need to set these environment variables:
- BOOST: if you're not using your operating system's packaged flavour of boost,
  then set BOOST to your boost source root directory (CMake may welcome
  BOOST_LIBRARYDIR & BOOST_INCLUDEDIR being set appropriately);
- ICU_ROOT: similarly, if you're not using your operating system's packaged
  ICU, set ICU_ROOT to your ICU library source root directory;
- GSL: set it to your GSL root directory.


hunspell
--------
Building SSC under unii, including macos, requires a development installation
of hunspell (https://hunspell.github.io/). You may need to set these
environment variables:
- HUNSPELL_INCLUDE to point to the hunspell include directory
- HUNSPELL_LIB to point to the hunspell library directory
- HUNSPELL_VERSION, the actual library name (such as "hunspell-1.7.so")

Once you've got them, navigate to recipe/tea, and run cmake.
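
For example, the hunspell variables might be set like this before running
cmake. The paths are placeholders I have picked for illustration, so adjust
them to wherever hunspell lives on your system.

```shell
# Example values only: adjust each path to your hunspell installation.
# ${VAR:-default} keeps any value already set in the environment.
export HUNSPELL_INCLUDE="${HUNSPELL_INCLUDE:-/usr/local/include/hunspell}"
export HUNSPELL_LIB="${HUNSPELL_LIB:-/usr/local/lib}"
export HUNSPELL_VERSION="${HUNSPELL_VERSION:-hunspell-1.7.so}"

echo "include: $HUNSPELL_INCLUDE"
echo "library: $HUNSPELL_LIB/$HUNSPELL_VERSION"
```

Putting these lines in a small file and sourcing it keeps the settings in one
place between builds.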

winspell
--------
The Windows build, by default, uses the native Windows spellchecker, although,
in multilingual contexts that doesn't work so well, unless you use Windows 11.


Building
========

Windows
-------
To build from Visual Studio, navigate to recipe/tea, open the appropriate .sln
file, then build. Only Visual Studios 2017 / 2019 / 2022, 64 bit, have been
built & tested, for Windows 8.1 & 10.

Unii & mock Unii
----------------
You will need CMake 3.12 or better. From the home ssc directory, compile a
normal release build thus:
cd recipe/tea
cmake .
make
ctest
make install

For a debug build:
cd recipe/tea
cmake -DCMAKE_BUILD_TYPE=Debug .
make
ctest
make install

If everything works correctly, then everything will be built, a series of
tests run, with a final result at the very end saying no failures. Having said
that, given SSC is pre-alpha, don't be too surprised to see some warnings or
some final test errors. Note in particular that complaints about being unable
to find or copy files during testing are not of concern. They come from
scripts that set up or tear down individual tests; the standard commands
used sometimes complain if they can't find files they're supposed to delete,
which is a bit silly given things are already in the desired state.

The following have been successfully built as x64 amd/intel, although not
always under all versions of ssc:
Linux: CentOS Stream 8/9, Ubuntu Server 20.04/20.10
OpenBSD: 7.0 / 6.9 / 6.8
MacOS: Monterey, Big Sur, Catalina, Mojave, High Sierra

Note: Use clang if possible, gcc takes a wee while.
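
One way to ask for clang is through the CC and CXX environment variables,
which CMake reads when it first configures a build tree. The sketch below
prints the build steps by default and only runs them when RUN=1 is set. The
helper names step and build_ssc are made up for this example, and it assumes
clang and clang++ are on the PATH.

```shell
# Print a command, or run it when RUN=1 is set in the environment.
step() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# Arguments: C compiler, C++ compiler, optional build type (e.g. Debug).
build_ssc() {
    step cd recipe/tea || return
    step env CC="$1" CXX="$2" cmake ${3:+-DCMAKE_BUILD_TYPE="$3"} . &&
    step make && step ctest && step make install
}

build_ssc clang clang++
```

Run it from the home ssc directory. A tree that has already been configured
keeps its cached compiler, so start from a clean tree when switching.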


OpenBSD
-------
I've only tested the amd64 build under 6.8 / 6.9 / 7.0.

The versions of boost and cmake in packages are sufficient. You will need to
increase significantly the available memory setting in login.conf for the
build account, if you have not done so already.

OpenBSD 6.8 offers hunspell 1.6, so if you use that version, you will need to
set the HUNSPELL_VERSION environment variable appropriately.



source

0.0.132

0.0.131

0.0.130

0.0.129

0.0.128

0.0.127

0.0.126

0.0.125

0.0.124

0.0.123

0.0.122

0.0.121

0.0.120

0.0.119

0.0.118

0.0.117

0.0.116

0.0.115

0.0.114

0.0.113

0.0.112

0.0.111

0.0.110

0.0.109

0.0.108

0.0.107

0.0.106

0.0.105

0.0.104

0.0.103

0.0.102

0.0.101

0.0.100

0.0.99

0.0.98

0.0.97

0.0.96

0.0.95

0.0.94

0.0.93

0.0.92

0.0.91

0.0.90

0.0.89

0.0.88

0.0.87

0.0.86

0.0.85

0.0.84

0.0.83

0.0.82

0.0.81

0.0.80

0.0.79

0.0.78

0.0.77

0.0.76

0.0.75

0.0.74

0.0.73

0.0.71

0.0.70

0.0.60

0.0.55

0.0.2


boot notes

Notes on folder names:

Here’s a reference documentation collection.


copyright & licence

Any dispute shall be resolved in accordance with the law of the Grand Duchy of Luxembourg.


SSC

SSC, static site checker, https://ssc.lu/
copyright (c) 2020-2022 dylan harris

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public
License along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA


W3

Some test files come from w3.org (some directly, in W3 documents, etc.), and are licensed as follows:

License

By obtaining and/or copying this work, you (the licensee) agree that you have read, understood, and will comply with the following terms and conditions.

Permission to copy, modify, and distribute this work, with or without modification, for any purpose and without fee or royalty is hereby granted, provided that you include the following on ALL copies of the work or portions thereof, including modifications:

    The full text of this NOTICE in a location viewable to users of the redistributed or derivative work.
    Any pre-existing intellectual property disclaimers, notices, or terms and conditions. If none exist, the W3C Software and Document Short Notice should be included.
    Notice of any changes or modifications, through a copyright statement on the new code or document such as
    "This software or document includes material copied from or derived from [title and URI of the W3C document]. Copyright © [YEAR] W3C® (MIT, ERCIM, Keio, Beihang)."

Disclaimers

THIS WORK IS PROVIDED "AS IS," AND COPYRIGHT HOLDERS MAKE NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE OR DOCUMENT WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.

COPYRIGHT HOLDERS WILL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF ANY USE OF THE SOFTWARE OR DOCUMENT.

The name and trademarks of copyright holders may NOT be used in advertising or publicity pertaining to the work without specific, written prior permission. Title to copyright in this work will at all times remain with copyright holders.

Notes

This version: http://www.w3.org/Consortium/Legal/2015/copyright-software-and-document

Previous version: http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231

This version makes clear that the license is applicable to both software and text, by changing the name and substituting "work" for instances of "software and its documentation." It moves "notice of changes or modifications to the files" to the copyright notice, to make clear that the license is compatible with other liberal licenses.


WhatWG

Some test files come from whatwg.org (some directly, in WhatWG documents, etc.), and are licensed under a Creative Commons Attribution 4.0 International License. See https://whatwg.org/ for details.


corruptpress.com

Some test files are derived from pages at corruptpress.com. They are licensed under a Creative Commons Attribution 4.0 International License. Browse https://corruptpress.com/ for details.


dylanharris.org

Some test files are derived from pages at https://dylanharris.org/. They are licensed under a Creative Commons Attribution 4.0 International License. Browse https://dylanharris.org/ for details.