Reduce PDF Size of a Scanned Document
I recently had to e-mail a multi-page document (e-mail being that awesome 1990s technology for submitting documents when the recipient doesn’t have a secure file transfer server), only to find that the attachment was too large.
I scanned three separate multi-page documents and then used PDFsam to combine them all into a single 9-page PDF document.
Are you with me so far?
The final file size was over 80 MB!
I didn’t think anything of it, until my e-mail server complained about the attachment size.
After a little bit of thinking…
- I had used xsane, a GUI front-end for SANE.
- Where Windows has TWAIN, Linux has SANE.
Connecting a scanner to a PC pre-TWAIN used to be a kludge, and TWAIN was the answer: a manufacturer-independent hardware interface and driver standard that provides a scanner user interface, and also allows third-party programs to access any TWAIN-compliant scanner via a standard Application Programming Interface (API).
SANE works like TWAIN, but is split into two pieces: front-end and back-end. The back-end drivers allow the OS to talk to the scanner hardware, and provide generic access to other applications (like the TWAIN API). This allows any of several SANE front-end user interfaces to access any SANE back-end, even across a network. Very cool stuff, and there is even a SANE shim for Windows that provides TWAIN services while using SANE on the back-end.
- As my print server is a Raspberry Pi (Linux) and my main PC is Linux, it made the most sense to set the Pi up as a SANE back-end server.
- I scanned all three multi-page documents using xsane, as mentioned, which allows you to scan a multi-page document and then save it as a single file.
- While poking around a bit, I noticed that the intermediate format for xsane’s multipage applet is TIFF.
- Like most other computing standards from the 1980s, TIFF just isn’t very good for this job: scans are often saved with little or no compression, especially compared to modern file formats that use much better compression algorithms.
Unfortunately, TIFF was developed as a standard file format for scanners, which means that any time you’re dealing with scanners (or, oh lord, fax machines), be prepared to deal with TIFF, despite the fact that it’s a completely obsolete standard. I mean, while we’re at it, let’s just fire up the 1200 baud modem and a copy of Telix, and we can XMODEM some TIFF files, shall we??
So it seemed clear that xsane / SANE was just saving a bunch of TIFF images and then copying them more or less straight into the PDF.
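A guess like this can be checked rather than inferred from file sizes: poppler’s pdfimages tool has a -list option that reports how each embedded image is stored. What follows is only a minimal sketch of a Python wrapper around it. The function names and the scanned.pdf file name are made up for illustration, and it assumes poppler-utils is installed.

```python
import shutil
import subprocess


def pdfimages_list_cmd(pdf_path):
    """Build the poppler command that lists the images embedded in a PDF."""
    # -list prints one row per image (dimensions, color space, encoding, ...)
    # instead of extracting the images to disk.
    return ["pdfimages", "-list", pdf_path]


def describe_images(pdf_path):
    """Return the listing as text, or None if pdfimages is not installed."""
    if shutil.which("pdfimages") is None:
        return None
    result = subprocess.run(
        pdfimages_list_cmd(pdf_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Usage (placeholder file name):
#     print(describe_images("scanned.pdf"))
```

The encoding and resolution columns in that listing should show whether the scans went in uncompressed and at what DPI.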
It turns out that poppler-utils, which most Linux distributions either ship or package, includes a command to split a PDF apart into page images:
pdftoppm -png file.pdf imagefile
This will create an image file for each page of the PDF, stored in PNG format. For example, if file.pdf is 5 pages, you will get:
imagefile-1.png
imagefile-2.png
...
imagefile-5.png
These can then be recombined using ImageMagick’s convert:
convert imagefile-*.png newfile.pdf
As predicted, the resulting file was much smaller: around 20 MB, or 25% of the original file size.
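If you want to script the round trip, here is a rough sketch in Python. It is not the exact procedure I ran: the helper names are hypothetical, and it passes pdftoppm’s -r option so the rendering resolution is explicit, since resolution has a large effect on the output size. It also sorts pages by number rather than relying on shell glob order.

```python
import glob
import re
import subprocess


def pdftoppm_cmd(pdf_path, prefix, dpi=150):
    """Build the command that renders each PDF page to PREFIX-N.png."""
    # -r sets the rendering resolution in DPI; 150 matches pdftoppm's default.
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def page_number(png_path):
    """Extract N from a name like imagefile-N.png (N may be zero-padded)."""
    match = re.search(r"-(\d+)\.png$", png_path)
    if match is None:
        raise ValueError(f"unexpected file name: {png_path}")
    return int(match.group(1))


def convert_cmd(png_paths, out_pdf):
    """Build the ImageMagick command, with pages in numeric order."""
    return ["convert", *sorted(png_paths, key=page_number), out_pdf]


def round_trip(pdf_path, out_pdf, prefix="imagefile", dpi=150):
    """Render pdf_path to PNG files, then recombine them into out_pdf."""
    subprocess.run(pdftoppm_cmd(pdf_path, prefix, dpi), check=True)
    pages = glob.glob(glob.escape(prefix) + "-*.png")
    subprocess.run(convert_cmd(pages, out_pdf), check=True)


# Usage (placeholder file names):
#     round_trip("file.pdf", "newfile.pdf", dpi=150)
```

Raising dpi keeps more detail at the cost of a larger file, so it is worth trying a couple of values and comparing the results.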
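Why did this work? PDF doesn’t store TIFF files as such; it embeds image data as streams using its own compression filters (Flate, JPEG, CCITT fax and so on), so the original was most likely holding nine lightly compressed 300 DPI scans.
The round trip changes two things. First, pdftoppm renders pages at 150 DPI unless you pass -r, and halving a 300 DPI scan in each direction leaves a quarter of the pixels, which lines up suspiciously well with the 4x reduction I saw. Second, convert re-encodes every image when it writes the new PDF. PNG does perform “filtering” of the image before it’s compressed, which makes the data “more compressible”, but that step is part of PNG’s own encoding and is undone when the image is decoded, so it doesn’t carry over into the new PDF.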
Moral of the story:
If you have a huge scanned PDF, try rendering it to PNG and then back again. Re-rendering and re-compressing every page can shrink the file dramatically; just check that the result is still sharp enough to read.
Other options:
- Downsample the document to a lower resolution. I scanned at 300 DPI, and my next step if the round trip hadn’t been enough was going to be downsampling to 150 DPI, which leaves a quarter of the pixels and so cuts the file size by roughly 4x.
- Go from color to grayscale or black-and-white. Going from typical 24-bit color to 8-bit grayscale cuts the raw image data by 3x. Going to 1-bit black and white might result in readability issues, but cuts the raw data by 24x. The savings after compression will vary.
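To put rough numbers on these options, here is a small back-of-the-envelope calculation in Python. It works on raw, uncompressed pixel data only, so real files will differ once compression is applied. The page size (US Letter, 8.5 x 11 inches) and the 24-bit color depth are assumptions for illustration, not measurements of my scans.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumption: US Letter pages (8.5 x 11 inches) scanned in 24-bit color.
color_300 = raw_page_bytes(8.5, 11, 300, 24)
color_150 = raw_page_bytes(8.5, 11, 150, 24)
gray_300 = raw_page_bytes(8.5, 11, 300, 8)
mono_300 = raw_page_bytes(8.5, 11, 300, 1)

print(color_300)              # 25245000 bytes, about 25 MB per page
print(color_300 / color_150)  # 4.0  (300 DPI -> 150 DPI)
print(color_300 / gray_300)   # 3.0  (color -> grayscale)
print(color_300 / mono_300)   # 24.0 (color -> black and white)
```

The resolution factor is exact for raw pixels because halving the DPI halves both the width and the height.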