1
Fork 0
mirror of https://github.com/RGBCube/serenity synced 2025-07-25 06:37:43 +00:00

LibPDF: Tolerate trailing whitespace after %%EOF marker

At first I tried implmenting the quirk from PDF 1.7 Appendix H,
3.4.4, "File Trailer": """Acrobat viewers require only that the %%EOF
marker appear somewhere within the last 1024 bytes of the file.""
This would've been like #22548 but at end-of-file instead of at
start-of-file.

This helped a bunch of files, but also broke a bunch of files that
made more than 1024 bytes of stuff at the end, and it wouldn't have
helped 0000059.pdf, which has over 40k of \0 bytes after the %%EOF.
So just tolerate whitespace after the %%EOF line, and keep ignoring
and arbitrary amount of other stuff after that like before.

This helps:
* 0000599.pdf
  One trailing \0 byte after %%EOF. Due to that byte, the
  is_linearized() check fails and we go down the non-linearized
  codepath. But with this fix, that code path succeeds.
* 0000937.pdf
  Same.
* 0000055.pdf
  Has one space followed by a \n after %%EOF
* 0000059.pdf
  Has over 40kB of trailing \0 bytes

The following files keep working with it:
* 0000242.pdf
  5586 bytes of trailing HTML
* 0000336.pdf
  5586 bytes of trailing HTML fragment
* 0000136.pdf
  2054 bytes of trailing space characters
  This one kind of only worked by accident before since it found
  the %%EOF block before the final %%EOF block. Maybe this is
  even an intentional XRefStm compat hack? Anyways, now it
  find the final block instead.
* 0000327.pdf
  11044 bytes of trailing HTML
This commit is contained in:
Nico Weber 2024-01-03 17:56:16 -05:00 committed by Andreas Kling
parent 2d12647e29
commit 9d69c5d434

View file

@ -726,6 +726,7 @@ bool DocumentParser::navigate_to_before_eof_marker()
while (!m_reader.done()) {
m_reader.consume_eol();
m_reader.consume_whitespace();
if (m_reader.matches("%%EOF")) {
m_reader.move_by(5);
return true;