pdfsandwich: OCR PDFs containing images
pdfsandwich is a handy tool developed by Tobias Elze for OCR’ing scanned documents (via tesseract). Recognized text is added as a background layer, making it possible to search and index scanned documents.
pdfsandwich -lang eng+deu scanned.pdf
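The command above handles a single file whose text is in English and German (`eng+deu`). For a whole folder of scans, a small wrapper along these lines might help. This is only a sketch: `ocr_output_name` and `ocr_directory` are hypothetical helper names, the `_ocr.pdf` suffix mirrors the default output naming described in the pdfsandwich man page, and the output path is passed explicitly with `-o` so the script does not depend on that default.

```shell
# Hypothetical helper: map scanned.pdf -> scanned_ocr.pdf
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical helper: OCR every PDF in a directory.
#   $1 = directory, $2 = tesseract language list (assumed default: eng)
ocr_directory() {
    dir=$1
    langs=${2:-eng}
    for f in "$dir"/*.pdf; do
        # Skip the literal pattern when no PDF matched
        if [ ! -e "$f" ]; then continue; fi
        # Skip files that are themselves OCR output
        case $f in *_ocr.pdf) continue ;; esac
        out=$(ocr_output_name "$f")
        # Skip scans that already have an OCR'ed counterpart
        if [ -e "$out" ]; then continue; fi
        pdfsandwich -lang "$langs" -o "$out" "$f"
    done
}
```

Calling `ocr_directory ~/scans eng+deu` would then process each scan in `~/scans` that has not been OCR'ed yet and leave the originals untouched.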