trial version annoyances

I had a quick-and-dirty task to do today at work: I wanted to write a very simple program that would split an Adobe PDF document into its individual pages. It didn’t sound like a difficult thing to accomplish, to be honest. By the end of the day, however, I found myself in hacker mode, putting much more effort into doing an end-run around someone’s idea of security.

Options

Of course, this is relatively easy on OS X with the Automator utility. You can create a service, associate it with a folder, say, and then drag and drop a PDF into that folder. Done.

But this needed to work on Windows-based computers, and I preferred to do it in C# within Visual Studio unless there was an easier way. Researching a bit, I confirmed that there weren’t any native tools within Windows that would take care of this. Next, I looked for free libraries or similar. This search turned up:

  1. iText (ruled out since it’s just a .Net wrapper over Java)
  2. PdfBox.net (ruled out since it’s just a .Net wrapper over Java)
  3. Spire.pdf
  4. Aspose.pdf

And yet, each of these seems to expect money from me in order to build a solution. Granted, somebody probably put a lot of effort into these libraries. I remember creating a very nice one-pass XML-to-PDF compiler perhaps ten years ago, and I was very fond of it. Perhaps it was that experience that led me to the solution I chose: I decided to use Aspose.pdf and then programmatically render their trial-version watermark void.

You might be thinking, “why don’t you just pay for the library?” That’s a good question. The people who wrote Aspose.pdf expect me to pay a minimum of $799 per year just to be a developer. And then, presumably, each client would also need to pay this amount for a licensed DLL. They have seven even higher pricing tiers, running into the many thousands of dollars. Given the need to simply split a PDF file, I don’t see the value.

The Difficulty of Starting From Scratch

Granted, I could begin from scratch and write a PDF “tree-walker” that finds the pages and iterates through them to re-create the content page by page. Since I understand the underlying storage method in a PDF file, this could be done in under a month. I could then build this into my own library and charge money for it, presumably cutting the knees out from under these players in the market space.

That said, splitting a PDF file isn’t an $800 problem nor is it a one-man-month problem. A program which splits a PDF file should cost about… $10 tops.

The Problem With the Trial Version of Aspose.pdf-generated PDFs

Unfortunately, the trial version of the Aspose.pdf library places an obtrusive watermark at the top of each page.

Figure: Example output of the trial version of the Aspose library

Programmatically Removing Watermarks From PDFs

So then, I researched whether there were any free methods available for removing watermarks from PDF files. There don’t appear to be any. I would need to write it myself.

One challenge is patching a binary file in place with C#. To be honest, I expected the .Net framework to have something like this, but that doesn’t appear to be the case. In addition to hacking the PDF object code, I would need to write a rudimentary binary search-and-replace routine for C#.

Hacking the PDF File

It’s good to be familiar with the object storage model for PDF files in order to understand what approach I then took.

A typical PDF file includes many objects and a cross-reference (“xref”) table at the end, which is essentially a table of contents recording the byte offset of each of those objects. If you’re familiar with a Rich Text Format (RTF) file, then it’s much like this except for that table at the end.

That table at the end provides the first challenge: when editing a binary PDF file, you can’t change the size of an object or move it. Doing so would invalidate the byte offsets the table records.
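To make the offset problem concrete, here is a minimal sketch, separate from the splitter itself, that reads the offset recorded after the "startxref" keyword and checks whether it still points at the table. It assumes a classic "xref" table (newer PDFs may use cross-reference streams instead), .NET 5 or later for Encoding.Latin1, and a toy document that is not a complete PDF. The names XrefDemo, FindStartXref and OffsetPointsAtXref are made up for illustration.

```csharp
using System;
using System.Text;

static class XrefDemo
{
    // Returns the number that follows the last "startxref" keyword, or -1 if none is found.
    public static long FindStartXref(byte[] pdf)
    {
        // Latin-1 maps each byte to exactly one char, so string indexes equal byte offsets.
        string text = Encoding.Latin1.GetString(pdf);
        int keyword = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0) return -1;

        // Skip the whitespace after the keyword, then collect the decimal digits.
        int start = keyword + "startxref".Length;
        while (start < text.Length && char.IsWhiteSpace(text[start])) start++;
        int end = start;
        while (end < text.Length && text[end] >= '0' && text[end] <= '9') end++;

        return long.TryParse(text.AsSpan(start, end - start), out long offset) ? offset : -1;
    }

    // True when the recorded offset really lands on the "xref" keyword.
    public static bool OffsetPointsAtXref(byte[] pdf)
    {
        long offset = FindStartXref(pdf);
        if (offset < 0 || offset + 4 > pdf.Length) return false;
        return Encoding.Latin1.GetString(pdf, (int)offset, 4) == "xref";
    }

    static void Main()
    {
        // Toy document: just enough structure to show the idea, not a valid PDF.
        string body = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n";
        string table = "xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n";
        string pdf = body + table + "startxref\n" + body.Length + "\n%%EOF\n";

        Console.WriteLine(OffsetPointsAtXref(Encoding.Latin1.GetBytes(pdf)));       // True

        // One extra byte up front moves the table, but the recorded offset stays the same.
        Console.WriteLine(OffsetPointsAtXref(Encoding.Latin1.GetBytes("%" + pdf))); // False
    }
}
```

Growing the file by a single byte is enough to leave the recorded offset pointing at the wrong place, which is why any patch has to keep every object exactly the same size.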

The second challenge when editing a binary PDF file is the frequent use of inline compression/encoding. You can’t easily find the actual object that you’d like to overwrite. And yet, with a simple PDF file you can accomplish this by using a hexadecimal editor and iteratively changing one character per object until you “break” the object in question, that pesky watermark.

Figure: Typical PDF file contents

The Achilles Heel of Watermark-based Prevention

So now, what would it take to nuke that watermark? One method would be to find the object, physically remove the entire object from the file and remove its entry from the cross-reference table at the end of the file. And yet, then I’d need to update the file offsets for half of the other objects within the file itself.

Inside the body of the PDF file, each of these compressed-content objects includes the key to its own demise: FlateDecode. This is the filter that compresses the stream inside an object, and I believe it’s zlib (Deflate, a Lempel-Ziv variant) at work. A zlib stream ends with an Adler-32 checksum of the uncompressed data. Replace even a single byte of that compressed stream, presumably without updating the checksum, and that object’s content is broken.
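As a rough illustration of why one changed byte is enough, here is a small hand-rolled Adler-32, written from the definition in the zlib format (RFC 1950). It is a sketch for this post rather than anything from the Aspose library or the PDF reader, and the names Adler32Demo and Adler32 are my own. Keep in mind that the checksum covers the uncompressed data: patching the compressed bytes typically either makes the Deflate data undecodable or makes it decode to different bytes, and those no longer match the stored checksum.

```csharp
using System;
using System.Text;

static class Adler32Demo
{
    // Adler-32: two running sums, both taken modulo 65521 (the largest prime below 65536).
    public static uint Adler32(byte[] data)
    {
        const uint Mod = 65521;
        uint a = 1, b = 0;
        foreach (byte value in data)
        {
            a = (a + value) % Mod;
            b = (b + a) % Mod;
        }
        return (b << 16) | a;
    }

    static void Main()
    {
        byte[] original = Encoding.ASCII.GetBytes("Wikipedia");

        // Same data with only the first byte changed.
        byte[] patched = (byte[])original.Clone();
        patched[0] = (byte)'1';

        Console.WriteLine($"{Adler32(original):X8}"); // 11E60398
        Console.WriteLine(Adler32(original) == Adler32(patched)); // False
    }
}
```

A decoder that verifies the checksum will therefore reject the stream as soon as its decoded content differs by even one byte.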

But what does Adobe Reader do with a broken object? It silently swallows it without displaying it, which is exactly what we want to do here! Replace even a single encoded byte in that unwanted watermark and it’s effectively gone.

So the hack was then a few lines of code. As I mentioned before, I used a trial-and-error method of temporarily editing one compressed section of the PDF after another until I’d broken the watermark. At that point, I determined that the text for my target search was “xœ}OM” or, more simply, “}OM”. (The leading “xœ” is just the bytes 0x78 0x9C, the common two-byte zlib header, as rendered in Windows-1252, so it typically starts every compressed stream and isn’t distinctive.) Confirming that the watermark contained the only occurrence of this combination of characters in the file allowed me to do a binary comparison and replacement.

// Above this was the Aspose sample code to write each page
// to a file. I inserted this code on a per-page basis to
// then modify that newly-created PDF file.
// Requires "using System.IO;" at the top of the source file.

// This is our own code to find/replace their watermark
string baseName = pdfDocument.FileName.Substring(
	0, pdfDocument.FileName.IndexOf('.'));
string fileToModify = baseName + "_p" + pageCount + ".pdf";
string fileModified = baseName + "_p" + pageCount + "_no-watermark.pdf";

// Read the whole single-page file at once so that the sequence
// can't be missed by straddling two fixed-size reads
byte[] buffer = File.ReadAllBytes(fileToModify);

// Now look for our sequence; "<=" lets the final three bytes match too
for (int j = 0; j <= buffer.Length - 3; j++) {
	if (	buffer[j] == '}' &&
		buffer[j + 1] == 'O' &&
		buffer[j + 2] == 'M')
		{
		buffer[j] =     0x31; // 1
		buffer[j+1] =   0x32; // 2
		buffer[j+2] =   0x33; // 3
		}
	}

// Having patched in place (same length, so no offsets move),
// write to the destination file
File.WriteAllBytes(fileModified, buffer);

I’m sure there are prettier ways of searching a buffer, but this was easy enough. Note that I only actually need to change one byte, say the first character at “buffer[j]”, which is sufficient to break that checksum mechanism.
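One possibly prettier variant, offered as a sketch rather than as the code I ran: it uses the span-based IndexOf from System.MemoryExtensions (available in .NET Core 2.1 and later), refuses to touch the file unless the marker occurs exactly once, and overwrites only a single byte. The names MarkerPatcher, FindAll and PatchIfUnique are hypothetical, and the input and output paths are assumed to come from the command line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class MarkerPatcher
{
    // Returns the offset of every occurrence of marker within data.
    public static List<int> FindAll(ReadOnlySpan<byte> data, ReadOnlySpan<byte> marker)
    {
        var offsets = new List<int>();
        int start = 0;
        while (start <= data.Length - marker.Length)
        {
            int hit = data.Slice(start).IndexOf(marker);
            if (hit < 0) break;
            offsets.Add(start + hit);
            start += hit + 1; // resume just past the start of this match
        }
        return offsets;
    }

    // Overwrites the first byte of the marker, but only when the marker is unique.
    // Returns true if the data was patched.
    public static bool PatchIfUnique(byte[] data, ReadOnlySpan<byte> marker, byte replacement)
    {
        List<int> offsets = FindAll(data, marker);
        if (offsets.Count != 1) return false;
        data[offsets[0]] = replacement; // same length, so no offsets in the file move
        return true;
    }

    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("usage: MarkerPatcher <input.pdf> <output.pdf>");
            return;
        }

        byte[] data = File.ReadAllBytes(args[0]);
        byte[] marker = { (byte)'}', (byte)'O', (byte)'M' };

        if (PatchIfUnique(data, marker, (byte)'1'))
        {
            File.WriteAllBytes(args[1], data);
            Console.WriteLine("Patched one occurrence.");
        }
        else
        {
            Console.WriteLine("Marker missing or not unique; nothing written.");
        }
    }
}
```

Checking for uniqueness first guards against silently corrupting some other stream that happens to contain the same three bytes.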

And the rest, as they say, is history.

Figure: Same example, after breaking the watermark

You might ask why I’d post about such things. I do it for the sake of my own curiosity, and I assume that others like you are curious as well. Just as little kids build sand castles and then smash them to bits, we bigger kids like to build security and then smash that as well. One of the reasons why this is good practice is that it teaches us what is “good enough” security and what is “better” security. Just because you can’t think of a way around something, that doesn’t mean that some other clever person can’t work their magic.
