Automated Accessibility Checks for Downloadable PDFs
Learn about the technical architecture, features, and the motivation behind the PDF A11y Auditor tool.
Development and License
Developed by Dr. Harald Hutter.
License: MIT License.
https://a11y-pdf-audit.fly.dev/
Open Source:
View the source code, contribute, or report issues in the GitHub repository.
Purpose and Idea
The a11y PDF Audit is a modular web application that automatically checks the PDF files offered on a website for accessibility.
It crawls any given URL, downloads discovered PDFs, validates them using VeraPDF,
and generates structured HTML and PDF reports automatically.
VeraPDF is a purpose-built, open source, file-format validator covering all PDF/A and PDF/UA parts and conformance levels.
VeraPDF - Industry Supported PDF/A Validation
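The discovery step of the pipeline described above (finding PDF links on a crawled page) can be sketched with the Python standard library alone. This is a minimal illustration, not the tool's actual crawler: the names `PdfLinkParser` and `find_pdf_links` are hypothetical, and downloading and VeraPDF validation are deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> targets whose path ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pdf_links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page that contained them.
        absolute = urljoin(self.base_url, href)
        # Check the path only, so a query string such as ?v=2 does not hide a PDF.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.pdf_links:
            self.pdf_links.append(absolute)


def find_pdf_links(html: str, base_url: str) -> list[str]:
    """Return the de-duplicated PDF links found in one HTML document."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links
```

A recursive crawler would call `find_pdf_links` on every page it visits and follow the remaining links until the configured depth or limit is reached.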
German Federal Monitoring Agency for Accessibility in Information Technology
The Federal Monitoring Agency for Accessibility of Information Technology (BFIT-Bund) began its work in autumn 2019.
It was established on the basis of Section 13(3) of the Disability Equality Act (BGG).
As the federal monitoring body, BFIT-Bund performs the tasks assigned to Germany by the European Union (EU) directive on monitoring, reviewing, and reporting on the accessibility of digital services provided by public bodies (Article 8 of Directive (EU) 2016/2102).
Many people do not know that PDFs must be barrier-free. A common misunderstanding is that a PDF is not part of the website and is therefore exempt. In fact, PDFs offered on a website must be just as accessible as the site itself. I would like to clear that up.
Main Features (Some are New in Version 1.2.0)
- Dual-Audit System: Validates PDFs simultaneously against the strict ISO PDF/UA-1 standard and our custom, pragmatic ScreenReadable profile (ignoring visual font metrics and strict matrix algorithms that don't affect screen readers like JAWS/NVDA).
- Recursive Crawler: Searches websites for downloadable PDFs (configurable depth & limit) with smart error handling.
- Reporting: Generates detailed reports in JSON, HTML, and PDF formats (using WeasyPrint) with side-by-side Strict vs. ScreenReadable results.
- Web Interface: Easy-to-use Flask frontend with live server logs and report overview.
- Resilient Processing: Intelligent timeout handling with dynamic multipliers (x4) for large PDFs to prevent crashes.
- Auto-Cleanup: Automatically deletes reports older than 14 days to preserve server storage.
- Modular Architecture: Config-driven setup (via `config.json`), separated CSS, and Facade & Controller patterns for scalability (Docker/Fly.io ready).
- Perfect Performance: Achieves 100/100 in Google PageSpeed Insights (Performance, Accessibility, Best Practices, SEO).
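Two of the features listed above, the x4 timeout multiplier for large PDFs and the 14-day auto-cleanup, are simple enough to sketch. The following is an illustration under stated assumptions, not the tool's code: the base timeout of 60 seconds, the 10 MB threshold for a "large" PDF, and the function names are all hypothetical. Only the multiplier of 4 and the 14-day retention come from the feature list.

```python
import time
from pathlib import Path
from typing import Optional

LARGE_PDF_TIMEOUT_MULTIPLIER = 4   # from the feature list
REPORT_MAX_AGE_DAYS = 14           # from the feature list
BASE_TIMEOUT_SECONDS = 60                      # assumption for illustration
LARGE_PDF_THRESHOLD_BYTES = 10 * 1024 * 1024   # assumption for illustration


def validation_timeout(pdf_size_bytes: int) -> int:
    """Return the validation timeout, multiplied for large PDFs."""
    if pdf_size_bytes >= LARGE_PDF_THRESHOLD_BYTES:
        return BASE_TIMEOUT_SECONDS * LARGE_PDF_TIMEOUT_MULTIPLIER
    return BASE_TIMEOUT_SECONDS


def delete_old_reports(report_dir: Path, now: Optional[float] = None) -> list[Path]:
    """Delete files in report_dir older than the retention period.

    Returns the paths that were removed. `now` can be injected for testing.
    """
    now = time.time() if now is None else now
    cutoff = now - REPORT_MAX_AGE_DAYS * 24 * 60 * 60
    removed = []
    for path in sorted(report_dir.iterdir()):
        # Age is judged by last modification time; subdirectories are skipped.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

In a deployment, values like these would more naturally be read from `config.json` than hard-coded.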
Limitations and Issues
VeraPDF vs. axesCheck (PAC)
There is a known discrepancy between VeraPDF (used by this tool) and axesCheck/PAC regarding ISO 14289-1:2014 (PDF/UA-1), specifically rule 7.5 (Tables).
- VeraPDF tends to be very strict and may report `FAIL` on tables where the headers cannot be determined algorithmically according to its strict interpretation of the standard.
- axesCheck might pass the same file if the logical structure is semantically sufficient for screen readers.
- Solution: Version 1.2.0 introduces the ScreenReadable Profile alongside the strict check to bridge this gap.
Solution: The "ScreenReadable" Profile
To bridge the gap between strict ISO validation and the real-world behavior of screen readers such as JAWS and NVDA (and of checkers such as axesCheck), this tool runs a dual audit. It first tests each PDF against the strict PDF/UA-1 standard, and then against a custom ScreenReadable profile, which ignores visual font metrics and strict matrix checks.
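The effect of the dual audit on a single file can be sketched as follows. Note the simplification: the tool is described as running two validations with different profiles, whereas this sketch approximates the second verdict by filtering one list of failed rules. The rule identifiers and the names `DualAuditResult` and `dual_audit` are hypothetical; the tool's real exclusion list is not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical rule identifiers, standing in for the excluded font-metric
# and matrix checks. They are NOT real veraPDF rule IDs.
SCREENREADABLE_EXCLUDED_RULES = frozenset(
    {"font-metrics-example", "table-matrix-example"}
)


@dataclass(frozen=True)
class DualAuditResult:
    strict_passed: bool
    screenreadable_passed: bool
    remaining_failures: tuple[str, ...]


def dual_audit(failed_rule_ids: list[str]) -> DualAuditResult:
    """Derive both verdicts from the rules that failed the strict check."""
    # Failures that still count once the excluded rules are ignored.
    remaining = tuple(
        rule for rule in failed_rule_ids
        if rule not in SCREENREADABLE_EXCLUDED_RULES
    )
    return DualAuditResult(
        strict_passed=not failed_rule_ids,
        screenreadable_passed=not remaining,
        remaining_failures=remaining,
    )
```

A file that fails only excluded rules is reported as a strict failure but a ScreenReadable pass, which is the side-by-side result the reports show.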
Performance & Infrastructure
This instance is hosted in a high-efficiency containerized environment. To provide advanced AI features on minimal hardware, we use a Smart-Resource-Architecture:
- Memory: 1GB physical RAM supplemented by a 4GB high-speed Swap-File for AI workloads.
- Storage: A persistent 15GB SSD Volume stores our machine learning models and ensures your reports are available for 14 days.
- Stability: We enforce a "Single-Worker" policy for AI tasks. If the system is busy, jobs are queued to prevent crashes.
Note: AI reconstruction of large PDFs (50+ pages) may take several minutes. Our server is configured with a 1000-second timeout to ensure even complex documents are finished successfully.
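A "Single-Worker" policy with queuing, as described above, can be approximated in Python with an executor limited to one worker thread. This is a sketch of the general technique, not the server's actual configuration: `submit_ai_job` is a hypothetical name, and the 1000-second server timeout is not modeled here.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

# A single worker thread: at most one AI job runs at a time. Jobs submitted
# while it is busy wait in the executor's internal FIFO queue.
_ai_executor = ThreadPoolExecutor(max_workers=1)


def submit_ai_job(job: Callable[[], str]) -> Future:
    """Queue a job and return a Future that resolves to the job's result."""
    return _ai_executor.submit(job)
```

Callers can poll the returned `Future` or call `result()` on it. Because only one job is active at any time, peak memory stays bounded by a single workload.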
For heavy enterprise use, you can deploy your own instance using our Docker image.
Quality and Testing
| Tool | Purpose | Status / Result |
|---|---|---|
| flake8 | Formatting & Style Checking | No critical issues found. |
| pylint | Code Quality / Docstrings Review | Score: > 9.63 / 10 points. |
| bandit | Security Analysis | No high severity findings. |
| radon cc | Cyclomatic Complexity Tests | Mainly A-level functions. |
| PageSpeed Insights | Provides suggestions on how the page may be improved | Performance 100 (LCP = 0.2 s), Accessibility 100, Best Practices 100, SEO 100 |
| Screaming Frog SEO Spider | Used for crawling up to 500 URLs at a time | All reported issues solved. |