Creating PDFs

In this section we'll detail how to take your manual scans and combine them into easy-to-read PDF files that also show up well in the Internet Archive reader. You'll need each page as a separate image file, ordered correctly by name, so if you haven't done that already, do so now. If you're scanning with an ADF scanner, you can follow my directions for doing that here.

So at this point I expect you to have all your manual pages ordered correctly by filename (for example page1.jpg, page2.jpg, page3.jpg, etc.) and to have done any post-production/editing on them if you decided to.
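One thing to watch with names like page1.jpg, page2.jpg ... page10.jpg is that anything sorting names character by character will put page10.jpg before page2.jpg. Zero-padding the numbers avoids that. Here's a minimal Python sketch of the idea. The function name pad_page_number, the three-digit width, and the "scans" folder in the commented-out usage are my own assumptions for illustration, so adjust them to your files.

```python
import re
from pathlib import Path


def pad_page_number(filename: str, width: int = 3) -> str:
    """Zero-pad the last run of digits in the name, e.g. page2.jpg -> page002.jpg."""
    path = Path(filename)
    matches = list(re.finditer(r"\d+", path.stem))
    if not matches:
        # No number in the name (e.g. cover.jpg), so leave it alone.
        return filename
    last = matches[-1]
    padded = last.group().zfill(width)
    return path.stem[:last.start()] + padded + path.stem[last.end():] + path.suffix


names = ["page10.jpg", "page2.jpg", "page1.jpg"]
print(sorted(names))
# ['page1.jpg', 'page10.jpg', 'page2.jpg']   (page10 lands before page2)
print(sorted(pad_page_number(n) for n in names))
# ['page001.jpg', 'page002.jpg', 'page010.jpg']

# To actually rename files in a (hypothetical) "scans" folder, something like:
# for path in sorted(Path("scans").glob("*.jpg")):
#     path.rename(path.with_name(pad_page_number(path.name)))
```

I'd try it on a copy of the folder first, since renaming can't be undone easily.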

Adobe Acrobat is not a free program, but it does come with some ADF scanners and is pretty easy to use for the most part. First, go to the folder with all your ordered images and select them all, either by pressing CTRL-A or by dragging a box around them. Next, right-click them and choose “Combine files in Acrobat”.

Upon doing that you'll see this screen. Click Combine Files and it will process your images.

At this point a new screen will pop up that looks like this.

From here, if you want, you can go up to File - Save and save your PDF under whatever name you want. However, there are some additional things you can do if you want to go to the trouble.

If you look on the right side of the screen you should see a Tools option; clicking it opens a menu with more options. Choose Text Recognition and then click In This File. It should look like this -

From here you'll be shown a Recognize Text window that lets you OCR your text so it can be searched more easily. You'll need to select the proper settings first, though, so click the Edit button. From there you should see this -

From here you'll select the language, and you'll also be able to pick the PDF Output Style. I generally recommend Searchable Image (Exact). There's also another option called just Searchable Image. It will attempt to straighten your scans and also applies some additional JPEG compression, which can drop the file size dramatically, but that compression hurts image quality. If you're only going to make one PDF, I'd recommend “Searchable Image (Exact)” so it's the best quality. If you're willing to make two PDF versions (which is what I personally do), you can also run it with just “Searchable Image” and save the result under a different file name, for example with (Compressed) at the end. This is all optional, but it's a nice thing to do for users imo and something I personally go ahead and do.

No matter which option you run, afterwards go through each page and make sure it's oriented correctly. Sometimes the OCR process rotates or skews pages incorrectly. Rotations aren't hard to spot, but skews can be, and I still miss them on occasion. Try to be patient and go through each page to make sure it looks correct. If a page doesn't, you can right-click and rotate it, or replace it when it's skewed. After I replace/rotate pages, I save my PDF files and that's that!

One bug with Adobe Acrobat I've noticed, especially when OCR'ing Japanese, is that it will randomly crash. This is pretty annoying, but you can get around it and still OCR your file. What you'll have to do is watch for which page it crashes on during the OCR process and remember it. Now when you do the Recognize Text step, instead of telling it to do all pages, tell it to go up to the page before the one it keeps crashing on. Once those are OCR'd, run it again, this time from the page AFTER the one causing the crash through to the end. Sometimes multiple pages cause crashes, which is very annoying, especially with larger documents. Unfortunately that's the only way I have figured out to OCR files that have this issue, even though it's a lot of work.
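When several pages crash, working out which page ranges to feed into Recognize Text gets fiddly, so here's a small Python sketch that does the bookkeeping. It doesn't talk to Acrobat at all; it only computes the ranges you'd type in by hand. The function name ocr_ranges and the example page numbers are made up for illustration.

```python
def ocr_ranges(total_pages, crash_pages):
    """Return inclusive (first, last) page ranges that skip every crashing page."""
    ranges = []
    start = 1
    for bad in sorted(set(crash_pages)):
        if not 1 <= bad <= total_pages:
            continue  # ignore page numbers that aren't in the document
        if bad > start:
            ranges.append((start, bad - 1))  # everything up to the page before the crash
        start = bad + 1  # resume on the page after the crash
    if start <= total_pages:
        ranges.append((start, total_pages))
    return ranges


# Hypothetical 40-page manual where pages 12 and 27 crash the OCR:
print(ocr_ranges(40, [12, 27]))
# [(1, 11), (13, 26), (28, 40)]
```

Each pair is one pass through the Recognize Text step, first page to last page.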