04-26-2023 21:17 PM - last edited 04-26-2023 21:24 PM
As of this writing, Desktop Flows has no direct action for extracting text from an image-based PDF.
Extracting image-based text therefore needs a different approach.
This article explains how to achieve it step by step.
i) Consider a sample input PDF with the two pages below.
Both pages are image-based, so the text cannot be selected or copied.
ii) Add the 'Extract images from PDF' action.
Follow the numbering in the above screenshot.
1. Select the image-based source PDF file.
2. Choose 'All', assuming you would like to extract all the pages from the PDF.
3. This is a prefix for the extracted images. For example, if 'Img' is the prefix as shown above, the images extracted from the PDF will have file names such as Img_0, Img_1 and so on.
4. Choose a folder where the extracted images will be saved.
iii) Get all the files from the folder chosen in item 4 above.
(You can store the path in a variable and use that variable both in the step above and here.)
iv) The Output variable produced in the above step is seen as the Files variable.
Use a 'For each' loop to iterate through the files in this folder.
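One thing to watch while iterating: if the files happen to come back in plain alphabetical order, Img_10 lands before Img_2 once a PDF has more than ten pages, and the pages end up out of order in the output. The Python sketch below runs outside Power Automate and only illustrates the ordering issue. The file names, the .png extension and the page_index helper are assumptions made for the example, not something the flow produces or exposes.

```python
import re


def page_index(file_name: str) -> int:
    """Return the trailing number in a name such as 'Img_12.png' (hypothetical naming)."""
    match = re.search(r"_(\d+)(?:\.\w+)?$", file_name)
    if match is None:
        raise ValueError(f"no page number found in {file_name!r}")
    return int(match.group(1))


# Made-up file names that follow the Img_<n> pattern described in step ii.
names = ["Img_10.png", "Img_2.png", "Img_0.png", "Img_1.png"]

plain_order = sorted(names)                  # text comparison: Img_10 sorts before Img_2
page_order = sorted(names, key=page_index)   # numeric comparison: pages stay in sequence
```

If the page order matters for your document, check the order of the Files variable before looping over it.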
v) Inside the loop, use the 'Extract text with OCR' action on each of the extracted images.
Make sure to select 'Image on disk' and to use %CurrentItem%, the loop variable that holds the path of each image.
vi) As you can see above, the extracted text is saved in an output variable called 'OcrText'.
For demonstration purposes we will be writing this variable to a text file.
Follow the numbering above:
1. Full path to an Output text file
2. The output text variable coming from Step v.
I have also added a dotted line to act as a separator between the outputs of the two pages of the PDF.
3. Make sure to append the content and not overwrite it.
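For readers who think in code, the loop from steps iv to vi can be sketched in Python as follows. This is not what Power Automate runs internally. The write_pages function, the SEPARATOR value and the ocr parameter are hypothetical names introduced for illustration, and the OCR step is passed in as a plain function so the sketch stays independent of any OCR library.

```python
from pathlib import Path
from typing import Callable, Iterable

# Stand-in for the dotted separator line written between pages (assumed format).
SEPARATOR = "-" * 40


def write_pages(image_paths: Iterable[Path], output_file: Path,
                ocr: Callable[[Path], str]) -> None:
    """Append the OCR text of each image to output_file, followed by a separator."""
    for image_path in image_paths:
        text = ocr(image_path)
        # Mode "a" appends. Mode "w" would overwrite the file on every
        # iteration, leaving only the text of the last page.
        with output_file.open("a", encoding="utf-8") as handle:
            handle.write(text + "\n" + SEPARATOR + "\n")
```

The comment on the file mode is the point of item 3 above: with overwrite instead of append, the output file would contain only the last page.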
vii) If you followed along, the entire flow and its variables look as shown below.
viii) On running the Flow, three files will be created in the given Output path.
Notice the Img prefix for the two extracted images and one single Output text file.
ix) The Output file shows the output of both the pages of the PDF with a separator between them.
These are now in Text format.
Note that this article does not explain how to retrieve only a specific part of the extracted text, for example the word 'England'.
That will have to be done using Regex or Text functions.
The good part is that content that was previously only an image is now available as text.
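As a rough illustration of the Regex approach mentioned above, the Python sketch below pulls a word out of OCR output. The sample text and the 'Country:' label are made up for the example; the real OCR output of your PDF will differ, and the patterns would need adjusting to match it.

```python
import re

# Made-up OCR output, used only to demonstrate the patterns.
ocr_text = "Country: England\nCity: London"

# Whole-word, case-insensitive search for a known word.
found = re.search(r"\bengland\b", ocr_text, flags=re.IGNORECASE)

# Capture whatever word follows a known label.
country = re.search(r"Country:\s*(\w+)", ocr_text)

if found and country:
    print(found.group(0), country.group(1))
```

The second pattern is usually the more useful one in practice, since the value to extract is rarely known in advance but the label next to it often is.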
x) A sample Desktop Flow and an image-based PDF are attached.
Change the paths and try it out.
This Flow was built using Version 2.31
xi) How to copy-paste any desktop flow from its raw form: