Advanced JTidyPlugin Tips: Custom Rules and Performance Tweaks

JTidyPlugin is a powerful tool for integrating HTML cleanup and validation into Java-based build processes and development workflows. Its default behavior is helpful out of the box, but advanced projects often require fine-grained control over how HTML is corrected, validated, and reported, and they need good performance when processing large codebases or running in continuous-integration pipelines. This article covers practical, advanced techniques for creating custom rules, tuning JTidyPlugin for speed and resource efficiency, integrating it smoothly into builds, and troubleshooting common issues.


1. When to extend JTidyPlugin behavior

JTidyPlugin shines at automatically correcting malformed HTML and enforcing baseline cleanliness. However, there are cases where the default behavior is insufficient:

  • You need to enforce organization-specific HTML conventions that Tidy doesn’t check by default (class naming, ARIA usage patterns, forbidden inline styles).
  • You want different correction/validation rules per module, environment (dev vs CI), or file type (templates vs static pages).
  • You must integrate JTidyPlugin results into a custom reporting system, or fail builds only on certain types of violations.
  • Your project is large and performance or memory usage becomes a bottleneck.

For these scenarios, consider customizing rules (via configuration and extensions), using selective file targeting, parallelizing work, and tuning memory/IO behavior.


2. Configuring JTidyPlugin: configuration file essentials

JTidyPlugin typically reads its configuration from a Tidy options file (often tidy.conf or tidy.properties) or from options passed directly to the plugin. Key settings to master:

  • doctype — control output doctype (e.g., html5, strict, transitional).
  • indent — enable consistent indentation for easier diffs.
  • char-encoding — set input/output encoding (UTF-8 recommended).
  • wrap — line-wrap behavior for long text nodes.
  • tidy-mark — disable/enable generator meta tag.
  • show-warnings, show-errors — control verbosity for CI thresholds.
  • output-xhtml — produce XHTML when needed by your toolchain.
  • fix-backslash, merge-divs, join-classes — automatic structural fixes.

Use an explicit, checked-in config per project so all developers and CI use identical rules.

Example tidy options file (tidy.conf):

  indent: yes
  indent-spaces: 2
  wrap: 0
  doctype: html5
  input-encoding: utf8
  output-encoding: utf8
  show-warnings: yes
  quiet: yes
  tidy-mark: no
  char-encoding: utf8
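
Because the options file uses a simple "key: value" layout, it can also be read from Java code, for example to check in a build step that the checked-in settings are the ones you expect. The following is a minimal sketch that uses only the JDK: java.util.Properties accepts ':' as a key/value separator, so it can parse this layout. The class name TidyOptionsLoader and its helper methods are hypothetical names chosen for illustration, not part of JTidy or JTidyPlugin.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

/** Hypothetical helper: loads Tidy-style "key: value" options into Properties. */
final class TidyOptionsLoader {

    /** Parses option text; java.util.Properties treats ':' as a separator. */
    static Properties parse(String optionsText) {
        Properties props = new Properties();
        try (StringReader reader = new StringReader(optionsText)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Interprets yes/no style switches; a missing option counts as disabled. */
    static boolean isEnabled(Properties props, String key) {
        String value = props.getProperty(key, "no").trim().toLowerCase(Locale.ROOT);
        return value.equals("yes") || value.equals("true") || value.equals("1");
    }

    public static void main(String[] args) {
        String conf = String.join("\n",
                "indent: yes",
                "indent-spaces: 2",
                "wrap: 0",
                "tidy-mark: no");
        Properties props = parse(conf);
        System.out.println("indent enabled: " + isEnabled(props, "indent"));
        System.out.println("wrap column: " + props.getProperty("wrap"));
        // Assumption: with JTidy on the classpath, a Properties object like this
        // can usually be handed to org.w3c.tidy.Tidy#setConfigurationFromProps.
        // That call is not exercised here; verify it against your JTidy version.
    }
}
```

In a real build you would read the file from disk (for example with java.nio.file.Files.newBufferedReader) instead of using an inline string.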

3. Creating custom rules and validations

JTidy (the underlying library) focuses on parsing and normalizing markup, not on arbitrary lint rules. To enforce custom rules you can combine several approaches:

  • Post-process JTidy output with a custom validator: Run JTidy to normalize markup, then parse the normalized HTML with an HTML parser (jsoup, HTMLCleaner, or a DOM API) and run bespoke checks (forbidden attributes, required ARIA labels, naming conventions).
  • Use XPath/CSS selectors for checks: After normalization, query the document for elements that violate rules (e.g., img:not([alt]), or input[role="button"] without an accessible name).
  • Integrate with existing linters: Pipe JTidy output into linters like htmllint (via Node) or custom Java-based linters to apply rule sets not provided by Tidy.
  • Implement a small extension layer in your build: Many build tools (Maven, Gradle) let you write a plugin step that invokes JTidy, then runs additional Java code to apply rules and aggregate results.
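
As an illustration of the selector-based approach, here is a minimal sketch that runs XPath checks over markup that has already been normalized. It assumes the input is well-formed XHTML (for example, JTidy output produced with output-xhtml enabled) with no DOCTYPE declaration, and it uses the JDK's built-in XML parser and XPath engine rather than jsoup so that the example has no external dependencies. The class name XPathRuleCheck and the two rule expressions are hypothetical; the role="button" rule is only a rough approximation of "has no accessible name".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical rule runner: each rule is an XPath that selects violating nodes. */
final class XPathRuleCheck {

    /** Images that carry no alt attribute at all. */
    static final String IMG_WITHOUT_ALT = "//img[not(@alt)]";

    /** role="button" elements with neither an aria-label nor text content. */
    static final String BUTTON_ROLE_WITHOUT_NAME =
            "//*[@role='button'][not(@aria-label) and normalize-space(.)='']";

    /** Returns the node name of every node matched by the violation XPath. */
    static List<String> findViolations(String xhtml, String violationXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(violationXPath, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                names.add(hits.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            // Parsing fails if the markup is not well-formed XML.
            throw new IllegalStateException("Could not check markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<img src=\"a.png\" alt=\"Logo\"/>"
                + "<img src=\"b.png\"/>"
                + "<span role=\"button\"></span>"
                + "<span role=\"button\" aria-label=\"Close\"></span>"
                + "</body></html>";
        System.out.println("img without alt: "
                + findViolations(page, IMG_WITHOUT_ALT).size());
        System.out.println("unnamed role=button: "
                + findViolations(page, BUTTON_ROLE_WITHOUT_NAME).size());
    }
}
```

If you prefer CSS selectors, the same rules can be expressed with an HTML parser such as jsoup; the XPath form is used here only because it ships with the JDK.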

Example checklist pseudo-flow:

  1. Run JTidy to clean and produce normalized HTML.
  2. Load normalized HTML using jsoup.
  3. Run rules: check missing alt attributes, prohibited inline styles, required header structure, etc.
  4. Produce machine-readable report (JSON) and fail CI based on configured thresholds.
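
The reporting and threshold steps of this flow (steps 3 and 4) might be sketched as follows. This is an illustrative, dependency-free outline rather than a feature of JTidyPlugin: the LintReport class, its severity labels ("error", "warning"), and the threshold parameters are all hypothetical names. The JSON is assembled by hand with minimal escaping (backslashes and double quotes only); a real build would more likely use a JSON library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical aggregate of rule violations found across files. */
final class LintReport {

    /** One finding: which file, which rule, and how severe. */
    static final class Violation {
        final String file;
        final String rule;
        final String severity;

        Violation(String file, String rule, String severity) {
            this.file = file;
            this.rule = rule;
            this.severity = severity;
        }
    }

    private final List<Violation> violations = new ArrayList<>();

    void add(String file, String rule, String severity) {
        violations.add(new Violation(file, rule, severity));
    }

    /** Counts the findings recorded with the given severity label. */
    long count(String severity) {
        return violations.stream().filter(v -> v.severity.equals(severity)).count();
    }

    /** The build fails when either count exceeds its configured maximum. */
    boolean shouldFailBuild(int maxErrors, int maxWarnings) {
        return count("error") > maxErrors || count("warning") > maxWarnings;
    }

    /** Serializes the findings as a JSON array of objects. */
    String toJson() {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < violations.size(); i++) {
            Violation v = violations.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"file\":\"").append(escape(v.file))
              .append("\",\"rule\":\"").append(escape(v.rule))
              .append("\",\"severity\":\"").append(escape(v.severity))
              .append("\"}");
        }
        return sb.append(']').toString();
    }

    /** Minimal escaping: backslashes and double quotes only. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        LintReport report = new LintReport();
        report.add("index.html", "img-alt", "error");
        report.add("about.html", "inline-style", "warning");
        System.out.println(report.toJson());
        // Example policy: no errors allowed, up to 10 warnings tolerated.
        System.out.println("fail build: " + report.shouldFailBuild(0, 10));
    }
}
```

Keeping the thresholds as parameters lets a dev build tolerate warnings while CI uses stricter limits against the same report.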

4. Examples of useful custom rules

  • Accessibility checks:
    • img elements with missing or empty alt attributes.
    • Elements with role="button" lacking keyboard-accessible handlers or aria-label.
    • Heading-order enforcement (no jumping from h1 to h4 without intermediate headings).
  • Security/consistency:
    • No inline event handlers (onclick, onmouseover).
    • No inline styles (style attribute) in production templates.
    • Disallow