We’ve all been there. At one point or another, for whatever reason, instead of a proper text document you are handed an image of that document, packaged up into a PDF and sent to you. And now you have to retype everything by hand… GREAT!
Enter: OCR readers
OCR (Optical Character Recognition) software is the tech used to “read” text from an image and turn it into… well, real text that you can copy, paste, edit etc. How does it work? By detecting changes in color, shapes, contrast, differences between lines etc. Plenty of algorithms and math that usually only tech people and scientists care about. Point being, there is a lot going on under the hood while the ML model is trying to divine what the image says. And that’s all before we get into all the different languages, dialects and scripts (Latin or Cyrillic) etc. There are a few applications that do this.
Understanding the Problem
- Text from Images:
  - Handling skewed or rotated text.
  - Recognizing characters in multiple languages.
  - Managing noise and poor image quality.
- Text from Embedded Images in PDFs: PDFs can contain layers of data, including text, images, and annotations. Extracting text from embedded images is more complex:
  - Images must first be separated from the PDF in the correct format.
  - OCR is applied to the extracted images.
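One way to find out which pages need OCR at all is to check each page for a usable text layer and for embedded images. The sketch below assumes the third-party pypdf library, which is not part of the original workflow. `classify_page` and `classify_pdf` are hypothetical helper names, and the 20-character cutoff is an arbitrary assumption rather than a recommended value.

```python
def classify_page(text: str, image_count: int, min_chars: int = 20) -> str:
    """Label a page from its extracted text and its number of embedded images.

    min_chars is an arbitrary cutoff (an assumption): pages with less
    extractable text than this are treated as having no real text layer.
    """
    has_text = len(text.strip()) >= min_chars
    if has_text and image_count == 0:
        return "text"      # ordinary text page, no OCR needed
    if has_text:
        return "mixed"     # text layer plus images that may hold more text
    if image_count > 0:
        return "scanned"   # image-only page, OCR required
    return "empty"


def classify_pdf(path: str) -> list[str]:
    """Classify every page of a PDF. Requires `pip install pypdf`."""
    # Imported lazily so classify_page itself has no dependencies.
    from pypdf import PdfReader

    reader = PdfReader(path)
    labels = []
    for page in reader.pages:
        text = page.extract_text() or ""
        labels.append(classify_page(text, len(page.images)))
    return labels
```

Pages labelled "scanned" or "mixed" are the ones worth sending to an OCR engine; "text" pages can be read directly.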
1. Tesseract OCR + Python
Tesseract is an open-source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google. While effective, it requires multiple steps to process PDFs and extract text.
Workflow:
- Extract Images from PDF: Use a library like pdf2image in Python to render each page as an image. pdf2image needs Poppler installed on the system, and pytesseract needs the Tesseract program itself.
pip install pdf2image
pip install pytesseract
- Process Images with Tesseract:
from pdf2image import convert_from_path
import pytesseract

# Step 1: Convert each PDF page to an image
pages = convert_from_path('example.pdf', dpi=300)

# Step 2: Run OCR on each page image
for i, page in enumerate(pages):
    text = pytesseract.image_to_string(page)
    print(f"Page {i + 1}:\n{text}")
Challenges:
- Detecting which parts of the PDF are actually images and which are not.
- Extracting the images in the correct format.
- Tesseract requires precise image preprocessing for optimal accuracy (e.g., binarization, denoising).
- Configuring language packs for multi-language PDFs adds complexity.
- Reading the image.
- Saving the result.
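For the language and saving points above, pytesseract accepts several languages joined with `+` in its `lang` argument, provided the matching language packs are installed. A minimal sketch follows. `build_lang` and `ocr_pdf_to_file` are hypothetical helper names, and the codes `eng` and `fra` are only examples.

```python
from pathlib import Path


def build_lang(codes: list[str]) -> str:
    """Join Tesseract language codes into the 'eng+fra' form its lang option expects."""
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def ocr_pdf_to_file(pdf_path: str, out_path: str, codes: list[str], dpi: int = 300) -> None:
    """Render each PDF page, OCR it, and write all pages to one UTF-8 text file.

    Needs pdf2image (plus Poppler) and pytesseract (plus the Tesseract
    program and the language packs named in `codes`) to be installed.
    """
    # Lazy imports: only required when OCR actually runs.
    from pdf2image import convert_from_path
    import pytesseract

    lang = build_lang(codes)
    chunks = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        text = pytesseract.image_to_string(page, lang=lang)
        chunks.append(f"--- Page {number} ---\n{text}")
    Path(out_path).write_text("\n".join(chunks), encoding="utf-8")
```

Calling `ocr_pdf_to_file("example.pdf", "example.txt", ["eng", "fra"])` would then cover reading, recognizing and saving in one go, assuming both language packs are present.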
2. ABBYY FineReader CLI
ABBYY FineReader is a commercial OCR product known for its accuracy. While its graphical interface is user-friendly, the CLI version has a steeper learning curve.
Workflow:
- Install ABBYY CLI: This involves purchasing a license and installing the software.
- Run OCR with Configuration Files:
  - Create a configuration file specifying processing rules.
  - Run the CLI against the PDF (the executable name and flags vary by ABBYY product and version):
frenginecli --input example.pdf --output output.txt --lang English+French --mode OCR
Challenges:
- The command-line interface can be overwhelming for beginners.
- Configuring language and image preprocessing settings requires familiarity with ABBYY’s syntax.
Advanced Considerations
- Preprocessing: Successful OCR often depends on preprocessing steps like deskewing, resizing, or converting images to grayscale. Tools like OpenCV or ImageMagick can be integrated to handle these.
- Multi-language Support: Configuring OCR engines to handle multiple languages often requires installing language packs and fine-tuning settings.
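To make the preprocessing idea concrete, here is a dependency-free sketch of two common steps, grayscale conversion and Otsu binarization, on an image held as nested lists of RGB tuples. In practice OpenCV or ImageMagick would do this far faster. The function names are hypothetical, and the luminance weights are the common ITU-R BT.601 ones.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray levels (BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_rows]


def otsu_threshold(gray_rows):
    """Return the gray level that best separates dark and light pixels (Otsu's method)."""
    hist = [0] * 256
    for row in gray_rows:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        if weight_bg == 0:
            continue            # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break               # every pixel is already in the background class
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance: larger means a cleaner split.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(gray_rows):
    """Map pixels at or below the Otsu threshold to 0 (ink) and the rest to 255 (paper)."""
    threshold = otsu_threshold(gray_rows)
    return [[0 if value <= threshold else 255 for value in row] for row in gray_rows]
```

The output of `binarize(to_grayscale(pixels))` is the kind of clean black-on-white image OCR engines tend to handle best. Deskewing and resizing are separate steps not shown here.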
3. Online tools
While they may be relatively easy to use (since you just upload an image), there are major privacy concerns. Is the website trusted? Is it storing your data somewhere, and for how long? Is it collecting metadata about the file and its contents?
You simply cannot know, because that website is essentially running on someone else’s computer and you are just using it.
4. OfficeBrief
This is where OfficeBrief shines. An offline-ready application with a user-friendly interface, multi-language support and complete privacy.
How can you be sure that it’s private? Let the app validate the license key at startup, then disconnect the PC from the internet. It will work fine. The images will still be read, multiple languages will still work, and everything is done on your own personal machine. A more private solution basically does not exist (besides pen and paper, maybe).
- Native support for most image formats (JPG, JPEG, PNG etc)
- Native support for PDFs, and images inside PDFs
- Native multi-language scan available
- Fully private and offline capable
- Local caching for instant access later should you need it
How to
Examples used:
1 – https://nlsblog.org/wp-content/uploads/2020/06/image-based-pdf-sample.pdf
Download the example PDF and run it yourself! See the results!