Testing Adobe’s online PDF to HTML conversion tool
15 January 2009
Many websites refer readers experiencing problems accessing their PDFs to Adobe's online PDF to HTML conversion tool, presumably on the assumption that this will solve those problems. It is perhaps not surprising that they do, given that Adobe makes the following statement on its conversion tool FAQ page:
"The Access conversion technology was developed to allow blind and visually impaired users to read Adobe PDF documents with speech synthesis software."
However, there is a problem. The technologies for making PDFs accessible (Acrobat, Word, screen readers etc) have improved beyond all recognition over the past few years and it is now easy to make almost all PDF content fully accessible. Over the same period the PDF to HTML conversion tool has not been developed. It may once have been the best available option, despite its acknowledged limitations, but it has been left far behind by developments elsewhere.
Testing the conversion tool
We tested the conversion tool using four different types of PDF: a scanned document, an untagged document, a tagged document and a form. These categories are the same as those used in the joint Adobe/American Foundation For the Blind (AFFB) publication Accessing PDF Documents with Assistive Technology: A Screen Reader User's Guide (PDF, 306KB) and represent a broad cross-section of the types of PDFs users are most likely to encounter. Incidentally, the Adobe/AFFB document gives many tips on how to deal with inaccessible PDFs but makes no mention of converting them to HTML.
Where possible, we compared the performance of the original PDFs against the HTML versions using JAWS screen reader software. The four PDFs used for testing are listed below.
- Scanned document (PDF, 106KB)
- Untagged document (PDF, 64KB)
- Tagged document (PDF, 78KB)
- PDF form (PDF, 137KB)
Summary of findings
- The scanned document could be read well enough in the PDF version although it lacked structure and hence had limited usability. However, the conversion tool was unable to create an HTML version.
- The untagged document performed moderately at best in both PDF and HTML (although it fared slightly better in the HTML version).
- The tagged PDF worked without any problems in JAWS. However, the HTML version performed moderately at best.
- The PDF form also worked well in JAWS. However, as with the scanned document, the conversion tool was unable to create an HTML version.
These findings are further summarised in the table below.
| Document | PDF | HTML |
|---|---|---|
| Scanned | Moderate | Not possible to convert |
| Untagged | Moderate to poor | Moderate to poor |
| Tagged | Excellent | Moderate to poor |
| Form | Excellent | Not possible to convert |
Conclusion
Details of the test findings are given below. The conclusions drawn from them are straightforward: out of the eight document/format combinations only two are fully accessible – the PDF form and the tagged PDF. The conversion tool is unable to convert some types of PDF to HTML, and those that it can convert are not particularly accessible in comparison to the original PDFs. In a nutshell, the PDF to HTML conversion tool isn't the solution. Making the PDFs themselves accessible is.
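The count behind that conclusion can be checked with a short tally. This is only an illustrative sketch: the ratings are copied from the summary of findings above, while the dictionary layout and variable names are our own.

```python
# Ratings taken from the summary of findings; the data layout is illustrative.
results = {
    ("Scanned", "PDF"): "Moderate",
    ("Scanned", "HTML"): "Not possible to convert",
    ("Untagged", "PDF"): "Moderate to poor",
    ("Untagged", "HTML"): "Moderate to poor",
    ("Tagged", "PDF"): "Excellent",
    ("Tagged", "HTML"): "Moderate to poor",
    ("Form", "PDF"): "Excellent",
    ("Form", "HTML"): "Not possible to convert",
}

# Combinations that worked without problems in JAWS.
fully_accessible = [key for key, rating in results.items() if rating == "Excellent"]

# Documents for which the conversion tool produced no HTML at all.
not_convertible = [
    document
    for (document, fmt), rating in results.items()
    if rating == "Not possible to convert"
]

print(len(results), "combinations tested")
print("Fully accessible:", fully_accessible)
print("Could not be converted:", not_convertible)
```

Both fully accessible combinations are PDFs, and neither of those two documents has a fully accessible HTML counterpart.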
The text-only option
The conversion tool also offers a text-only option. Text-only documents can, of course, be read by screen readers. However, for anything except the shortest of documents the ability to navigate is crucial for accessibility, and navigation is only possible via structural elements such as links, headings and lists which, by definition, are entirely absent from text-only content. Therefore, this can only be considered an if-all-else-fails option.
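To make the point about structure concrete, the sketch below counts the elements a screen reader user could navigate by in a fragment of HTML and in its plain-text equivalent. It is a rough illustration using Python's standard `html.parser` module; the set of tags treated as navigable and the two sample strings are our own assumptions, not output from the conversion tool.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements a screen reader user can typically jump between (an assumed, partial set).
NAVIGABLE_TAGS = {"a", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "dl", "table"}


class StructureCounter(HTMLParser):
    """Counts opening tags that provide navigable structure."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in NAVIGABLE_TAGS:
            self.counts[tag] += 1


def count_structure(markup):
    parser = StructureCounter()
    parser.feed(markup)
    parser.close()
    return dict(parser.counts)


# Hypothetical samples: the same content with and without markup.
structured = "<h1>Report</h1><p>See <a href='#data'>the data</a>.</p><ul><li>One</li></ul>"
text_only = "Report\nSee the data.\nOne"

print(count_structure(structured))  # {'h1': 1, 'a': 1, 'ul': 1}
print(count_structure(text_only))   # {}
```

The words are identical in both samples, but the text-only version offers nothing to jump to.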
But, given the technological and know-how advances of recent years, there is, frankly, no longer any excuse to fail …
Test 1 – the scanned document
The first test was of a simple text document, created in Word, scanned and converted to PDF. On opening the PDF, JAWS warns that it contains only an image of text and prompts the user to run Optical Character Recognition (OCR). It then prompts the user to run a type of auto-tagging process. Once both processes are complete JAWS can read the text.
Such a document contains no structural elements by which to navigate, which can be a problem, the more so the longer the document. But for short documents this format is manageable.
Testing the HTML
The conversion tool is unable to convert a scanned document to HTML. Attempting to do so simply returns a blank screen.
Test 2 – the untagged PDF
The next test was of an untagged PDF in the form of a typical short report. In addition to the main body of text it contained:
- a table of contents
- headings
- a link to an external site
- an image
- a data table
- a bullet list
- a footnote
On opening the document, as with the scanned document, JAWS prompts the user to run the auto-tagging process. Once done the document can be accessed. The table of contents and headings can be read but there is nothing to navigate by as the links in the table of contents don't work and the headings aren't recognised as such. In addition, various content items will be difficult or impossible to access. For example, the image's alt text (added in the Word original) is missing, the data from the table makes no sense and the footnote appears in the wrong place in the reading order (it relates to content in the first paragraph but is read as the last item on that page). Furthermore, the bullet list isn't recognised as a list although its text is voiced, and although the external link is clickable with a mouse it isn't keyboard accessible, nor is it recognised as a link by JAWS. All in all, the accessibility of this document is quite poor by modern standards.
Testing the HTML
On conversion to HTML the untagged PDF also performs quite poorly, although not quite as poorly as the PDF version. The links in the table of contents aren't active, the image's alt text is missing, the table does have some structure, but not enough to make sense of the data, and the footnote is voiced in the wrong place.
On the plus side, the HTML version does have a proper heading structure and does generate proper paragraph and list tags. For these reasons the HTML version does have the edge on the original PDF, although neither works particularly well.
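Two of the problems noted above, missing alt text and contents links that lead nowhere, are the kind of thing that can be checked mechanically before a screen reader is even opened. The following is a minimal sketch using Python's standard `html.parser`. The sample markup is invented for illustration, and a link pointing at a missing target is only one way a contents link might fail; we are not claiming this is how the conversion tool's output is structured.

```python
from html.parser import HTMLParser


class BasicAuditor(HTMLParser):
    """Collects images without an alt attribute and in-page link targets."""

    def __init__(self):
        super().__init__()
        self.images_missing_alt = []
        self.fragment_links = []
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Any element's id, or a named anchor, can be the target of an in-page link.
        if attributes.get("id"):
            self.targets.add(attributes["id"])
        if tag == "a" and attributes.get("name"):
            self.targets.add(attributes["name"])
        # An empty alt="" is legitimate for decorative images, so only a
        # completely absent attribute is flagged here.
        if tag == "img" and "alt" not in attributes:
            self.images_missing_alt.append(attributes.get("src") or "(no src)")
        href = attributes.get("href") or ""
        if tag == "a" and href.startswith("#") and len(href) > 1:
            self.fragment_links.append(href[1:])


def audit(markup):
    auditor = BasicAuditor()
    auditor.feed(markup)
    auditor.close()
    broken = [name for name in auditor.fragment_links if name not in auditor.targets]
    return {
        "images_missing_alt": auditor.images_missing_alt,
        "broken_in_page_links": broken,
    }


# Hypothetical converted page: one heading lacks an id and the image lacks alt text.
sample = """
<p><a href="#intro">1. Introduction</a></p>
<p><a href="#results">2. Results</a></p>
<h2 id="intro">Introduction</h2>
<img src="chart.png">
<h2>Results</h2>
"""

print(audit(sample))
# {'images_missing_alt': ['chart.png'], 'broken_in_page_links': ['results']}
```

A check like this is no substitute for testing with a screen reader, since it says nothing about reading order or whether a table makes sense when voiced.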
Test 3 – the tagged PDF
The third document tested was a properly formatted and tagged version of the untagged PDF above. Here, properly formatted means that, after conversion to PDF, the document was edited to correct reading order problems with the footnote and the image's alt text. Tagging errors in the data table were also corrected. Note: this document was originally created in Word 2003. In Word 2007 Microsoft has fixed the problem of alt text and footnotes appearing in the wrong place in the reading order (a common problem with documents generated in Word 2003). At the time of testing there was a serious problem with tables of contents created in Word 2007, which is why Word 2003 was used to generate the test documents. Happily, this problem too has now been resolved.
The tagged PDF works perfectly in JAWS. It is easy to navigate via semantically correct headings and the table of contents, the data table makes sense, the image's alt text is available and is in the correct place in the reading order, as is the footnote. The external link works correctly and so does the bullet list.
Testing the HTML
On conversion to HTML the tagged version has a good semantic heading structure and the alt text is readable and in the right place. The data table is mostly understandable and navigable using the standard JAWS table navigation commands (Ctrl + Alt + arrow keys). However, JAWS is unable to read the headings correctly for data in the last column of the table. This was originally also a problem in the PDF but was fixed relatively easily (by amending the column span value of one of the table headings). Unfortunately, the fix wasn't carried through in the conversion to HTML.
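The effect of a wrong column span value is easy to see in miniature. The sketch below works out which heading applies to each column of a table from the headings' span values. The heading names and column count are hypothetical (the test table's actual headings are not reproduced here), and the sketch assumes the span was too small, which is only one way the value could have been wrong.

```python
def headers_by_column(header_cells, column_count):
    """Map each column to the heading that covers it, or None if nothing does.

    header_cells is a list of (text, colspan) pairs in left-to-right order.
    """
    mapping = []
    for text, colspan in header_cells:
        mapping.extend([text] * colspan)
    # Pad with None where the heading row stops short; trim if it overshoots.
    mapping.extend([None] * (column_count - len(mapping)))
    return mapping[:column_count]


# Hypothetical four-column table whose second heading should span three columns.
faulty = [("Region", 1), ("Sales", 2)]
fixed = [("Region", 1), ("Sales", 3)]

print(headers_by_column(faulty, 4))  # ['Region', 'Sales', 'Sales', None]
print(headers_by_column(fixed, 4))   # ['Region', 'Sales', 'Sales', 'Sales']
```

With the faulty span, the last column has no heading to announce, which is consistent with the behaviour described above. Changing a single span value restores it.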
Less positively, and confusingly, the HTML version generates two tables of contents, one of which contains working links while the other does not. Also, somewhat curiously, the contents of the footnote go missing entirely, as does all of the text from the bullet points – the conversion tool tries to create a definition list rather than the unordered list that it should (and does with the untagged PDF), but in the process it loses the list's content.
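Lost list content of this kind can be detected automatically. The sketch below reports any list element (`ul`, `ol` or `dl`) that contains no text at all. It uses Python's standard `html.parser`, and both sample fragments are our own guesses at what well-formed and lossy output might look like, not markup captured from the conversion tool.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol", "dl"}


class EmptyListFinder(HTMLParser):
    """Records list elements that close without having contained any text."""

    def __init__(self):
        super().__init__()
        self.open_lists = []   # stack of [tag, has_text] entries
        self.empty_lists = []

    def handle_starttag(self, tag, attrs):
        if tag in LIST_TAGS:
            self.open_lists.append([tag, False])

    def handle_data(self, data):
        if data.strip():
            # Text counts towards every list it is nested inside.
            for entry in self.open_lists:
                entry[1] = True

    def handle_endtag(self, tag):
        if tag in LIST_TAGS and self.open_lists and self.open_lists[-1][0] == tag:
            closed_tag, has_text = self.open_lists.pop()
            if not has_text:
                self.empty_lists.append(closed_tag)


def find_empty_lists(markup):
    finder = EmptyListFinder()
    finder.feed(markup)
    finder.close()
    return finder.empty_lists


# Hypothetical fragments: an intact bullet list and a definition list with no content.
intact = "<ul><li>First point</li><li>Second point</li></ul>"
lossy = "<dl><dt></dt><dd></dd><dt></dt><dd></dd></dl>"

print(find_empty_lists(intact))  # []
print(find_empty_lists(lossy))   # ['dl']
```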
The HTML version of the tagged document is an improvement on the HTML version of the untagged document in that it is navigable to an extent via its heading structure. However, unlike the untagged version, it also adds a few highly confusing features and some of its content goes missing entirely. Therefore, it too can only be said to have a moderate level of accessibility at best.
Test 4 – the PDF form
The last test was of a form built with LiveCycle Designer (a program within Acrobat Professional, Windows only). This test produces the starkest contrast of all. The PDF form is very simple and straightforward to use with a screen reader. However, as with the scanned document, the conversion tool is unable to convert the PDF to HTML.
Wrapping up
Adobe's online PDF to HTML conversion tool does not provide the accessibility safety net that it is so often assumed to provide. Some PDFs can't be converted to HTML at all, and those that can are significantly less accessible than a properly authored PDF. The real solution is to make the PDFs themselves accessible.
Ted Page, Director, PWS
