OCR preparation

Prepare scanned documents for better OCR.

OCR accuracy starts before recognition. A cleaner scan, correct rotation and better contrast can matter more than the OCR engine itself.

Updated June 7, 2026

Use the clearest source you have

OCR reads shapes. If the source image is blurry, dark, compressed or tilted, the engine has less information to work with. Use the original scan or highest quality photo available before trying to extract text.

Avoid screenshots of scans when possible. A screenshot often reduces detail and adds extra borders. If you have the original PDF or photo, use that instead of a captured preview.

Keep pages upright and flat

Text should be horizontal and upright. Rotated pages, curved paper and perspective distortion from an angled camera can cause words to be misread or read in the wrong order. If you are photographing paper, place it on a flat surface and hold the camera parallel to the page.

For multi-page PDFs, skim the thumbnails before running OCR. One upside-down page can produce poor output even when the rest of the file is clean. Fixing rotation first is usually faster than correcting the text later.

Improve lighting and contrast

Good lighting reduces shadows and improves the difference between text and background. Avoid strong shadows from your hand or phone. Natural light or a broad desk lamp usually works better than a single harsh light source.

Low contrast is a common problem with receipts, faded forms and grey photocopies. If the text is barely visible to you, OCR will struggle too. Try rescanning with higher contrast or a clearer source before accepting a weak result.
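
To put a rough number on "low contrast", the sketch below measures the spread between the dark and light pixels of a page and applies a simple linear contrast stretch. It is only an illustration: the function names, the 5th and 95th percentile cut-offs and the plain list of 0-255 grayscale values are assumptions made for this example, not the behaviour of any particular scanner or OCR tool.

```python
def percentile(sorted_values, fraction):
    """Return the value at the given fraction (0.0-1.0) of a sorted list."""
    index = round(fraction * (len(sorted_values) - 1))
    return sorted_values[index]


def contrast_spread(pixels, low=0.05, high=0.95):
    """Difference between the light and dark grayscale levels (0-255)."""
    ordered = sorted(pixels)
    return percentile(ordered, high) - percentile(ordered, low)


def stretch_contrast(pixels, low=0.05, high=0.95):
    """Linearly map the low..high percentile range onto 0..255."""
    ordered = sorted(pixels)
    dark, light = percentile(ordered, low), percentile(ordered, high)
    if light == dark:
        return list(pixels)  # flat image: nothing to stretch
    scale = 255 / (light - dark)
    return [min(255, max(0, round((p - dark) * scale))) for p in pixels]


# Hypothetical faded receipt: grey text (120) on a grey background (170).
faded = [120] * 20 + [170] * 80
print(contrast_spread(faded))                    # 50
print(contrast_spread(stretch_contrast(faded)))  # 255
```

A small spread suggests the text and background are close in tone. Stretching can help a borderline scan, but it cannot restore detail that was never captured, so rescanning remains the better fix when it is possible.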

Crop distracting borders

Large borders, desk backgrounds, fingers, other documents and page shadows can confuse OCR. Cropping the page helps the engine focus on the actual text. It also creates a cleaner PDF if you later combine images into a document.

Do not crop too aggressively. Page numbers, labels, signatures and small notes can be important. The goal is to remove noise around the document, not remove meaningful content.
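
One way to crop without cutting into content is to find the area that contains dark pixels and then pad it with a margin. The sketch below shows that idea on a small grid of grayscale values. The function name, the threshold of 128 and the margin size are assumptions for illustration. A real photo with a dark desk in the background would need a smarter content test than "any dark pixel".

```python
def crop_box(grid, threshold=128, margin=2):
    """Return (top, left, bottom, right) around dark pixels, padded by margin.

    grid is a list of rows of grayscale values (0 = black, 255 = white).
    bottom and right are exclusive, so they can be used directly in slices.
    """
    rows = [y for y, row in enumerate(grid) if any(p < threshold for p in row)]
    cols = [x for x in range(len(grid[0])) if any(row[x] < threshold for row in grid)]
    if not rows or not cols:
        return None  # blank page: leave it uncropped
    top = max(0, rows[0] - margin)
    left = max(0, cols[0] - margin)
    bottom = min(len(grid), rows[-1] + 1 + margin)
    right = min(len(grid[0]), cols[-1] + 1 + margin)
    return top, left, bottom, right


# Hypothetical 10x10 white page with dark "text" at rows 4-5, columns 3-6.
page = [[255] * 10 for _ in range(10)]
for y in (4, 5):
    for x in range(3, 7):
        page[y][x] = 0

print(crop_box(page, margin=2))  # (2, 1, 8, 9)
```

The margin is the safety buffer: a larger value keeps page numbers and small notes near the edge, at the cost of leaving a little more border.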

Choose the right language

OCR language settings matter because they guide the engine's expectations. English, Italian and mixed-language documents use different spelling patterns and accented characters. Choosing the closest language improves recognition and reduces strange character substitutions, such as an accented letter being read as a plain one.

For documents with two languages, select both languages or use a mixed setting if the tool offers one. For documents dominated by one language with a few foreign names, use the main language and review names manually.
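
How a language is selected depends on the tool. As one example, the Tesseract engine accepts several three-letter language codes joined with a plus sign, such as "ita+eng". The sketch below builds such a string with the main language first. It assumes a Tesseract-style setting, and the LANGUAGE_CODES table and language_setting function are hypothetical helpers written for this example. Other OCR tools may use a different format.

```python
# Assumed mapping from language names to Tesseract-style three-letter codes.
LANGUAGE_CODES = {
    "english": "eng",
    "italian": "ita",
    "french": "fra",
    "german": "deu",
}


def language_setting(languages):
    """Join language codes with '+', keeping order and dropping duplicates."""
    codes = []
    for name in languages:
        code = LANGUAGE_CODES[name.lower()]  # KeyError if the language is not in the table
        if code not in codes:
            codes.append(code)
    return "+".join(codes)


print(language_setting(["Italian", "English"]))  # ita+eng
print(language_setting(["English"]))             # eng
```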

Prepare different document types

  • For receipts, flatten folds and avoid glare on thermal paper.
  • For contracts, make sure small clauses and page numbers remain readable.
  • For forms, keep labels and filled fields together in the crop.
  • For invoices, review totals, dates and invoice numbers after OCR.
  • For handwritten notes, expect manual correction even with a clean image.

Use page-by-page judgement

A multi-page scan is rarely perfectly consistent. The first page may be clean, while later pages are darker, rotated or partially cropped. Before processing the full PDF, skim through the page previews and look for pages that need attention. Fixing one poor page before OCR can save more time than correcting dozens of wrong words afterward.
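
Skimming previews can be backed by a quick numeric check. The sketch below flags pages whose average brightness is far from the typical page in the same file. It assumes you already have one mean grayscale value per page. The function name and the tolerance of 30 are illustrative choices, not settings from any OCR tool.

```python
from statistics import median


def pages_to_check(mean_brightness, tolerance=30):
    """Return 1-based page numbers whose brightness is far from the median page."""
    typical = median(mean_brightness)
    return [
        number
        for number, value in enumerate(mean_brightness, start=1)
        if abs(value - typical) > tolerance
    ]


# Hypothetical per-page mean grayscale values (0 = black, 255 = white).
brightness = [212, 208, 215, 150, 210]
print(pages_to_check(brightness))  # [4]
```

A flagged page is only a hint to look more closely. Rotation and cropping problems do not show up in brightness, so a visual check is still needed.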

For long documents, consider splitting out the pages that actually need OCR. If only a few scanned pages contain the information you need, there is no reason to run recognition on the whole packet. This keeps the workflow faster and makes review easier.
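
If you script this step, the page selection is often written as a short range string. The sketch below turns a string like "2-4, 9" into a list of page numbers and rejects ranges outside the document. The format and the parse_pages name are assumptions for this example. How you then extract those pages depends on your PDF tool.

```python
def parse_pages(spec, page_count):
    """Turn a spec like '2-4, 9' into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty entries such as a trailing comma
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"page range out of bounds: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_pages("2-4, 9", page_count=12))  # [2, 3, 4, 9]
```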

When the scan cannot be improved

Sometimes you only have a poor source: a faded receipt, a photo from someone else or a scan that has already been compressed. In that case, run OCR as a draft and treat the output as assistance rather than truth. Copy useful text, then manually verify the details that matter.

If the document has legal, medical or financial importance, do not rely on a weak OCR result. Ask for a clearer copy when possible, or keep the original scan attached to any extracted text so the information can be checked later.

Test before processing everything

Run OCR on one representative page first. If that page contains the same kind of text as the rest of the file, the result tells you whether the scan is ready. If the output is poor, fix the source and try again before spending time on the full document.

Keep the first OCR result as a comparison when changing scan quality or language settings. If the second attempt is only slightly better, manual correction may be faster than repeated processing.
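
To judge whether a second attempt is really better, you can compare both results against a short passage you have typed out by hand. The sketch below uses Python's difflib for a character-level similarity score. The function names and the 0.05 minimum gain are assumptions chosen for illustration, and the sample strings are made up.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, reference):
    """Character-level similarity between 0.0 and 1.0, ignoring extra whitespace."""
    a = " ".join(ocr_text.split())
    b = " ".join(reference.split())
    return SequenceMatcher(None, a, b).ratio()


def meaningful_gain(first, second, reference, min_gain=0.05):
    """True if the second attempt beats the first by at least min_gain."""
    return similarity(second, reference) - similarity(first, reference) >= min_gain


# Hypothetical example: a hand-typed line and two OCR attempts at it.
reference = "Invoice total: 1,250.00"
first_try = "lnvoice tota1: 1,25O.OO"
second_try = "Invoice total: 1,250.00"
print(meaningful_gain(first_try, second_try, reference))  # True
```

If the gain on the sample is small, that supports switching to manual correction rather than running the whole file again.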

Better input produces better OCR. Rotate, crop and check contrast before blaming the text recognition step.