Re: Failed to Extract Text with OCR with Tesseract...

afmc2238 · ‎06-16-2021

I am currently trying to extract a small bit of text from a scanned PDF file. I am using the "Extract Text with OCR" action and get the error below every time. I have tried reading either all of the text from the file or a subregion, with the same result. I have confirmed that the Tesseract connector is on my local machine. I've also tried this with "Create Tesseract OCR engine" as the prior action (even though I believe that is no longer needed), with the same result.

Parameter is not valid.: Robin.Core.ActionException: Failed to extract text with OCR ---> System.ArgumentException: Parameter is not valid.
   at System.Drawing.Bitmap..ctor(String filename)
   at Robin.Modules.OCR.Utilities.Utilities.GetImageForOCR(OCRSource source, SourceScanMode sourceScanMode, Nullable`1 scanRegionX1, Nullable`1 scanRegionY1, Nullable`1 scanRegionX2, Nullable`1 scanRegionY2, IEnumerable`1 imagesToFind, Int32 tolerance, Boolean waitForImage, Boolean timeoutSet, Nullable`1 timeout, Nullable`1 searchRegionImageX1, Nullable`1 searchRegionImageY1, Nullable`1 searchRegionImageX2, Nullable`1 searchRegionImageY2, Action suspendSecureScreen, Action restoreSecureScreen, String imageFilepath, IImageFinder imageFinder)
   at Robin.Modules.OCR.Actions.ExtractTextWithOCRBase.Execute(ActionContext context)
   --- End of inner exception stack trace ---
   at Robin.Modules.OCR.Actions.ExtractTextWithOCRBase.Execute(ActionContext context)
   at Robin.Runtime.Engine.ActionRunner.RunAction(String action, Dictionary`2 inputArguments, Dictionary`2 outputArguments, IActionStatement statement)

I would greatly appreciate some help with this!

Pavel_NaNoi · ‎06-23-2021

I'm just making sure here, but is the file a PDF or an actual image? I'm fairly certain that action cannot extract text from an actual PDF file, only from images or a foreground window. If it is an image, this might honestly be a case of an unusual image extension, so make sure it's a .jpeg or .png.
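The stack trace above fails inside the Bitmap constructor, which is consistent with the file not being an image that can be decoded. As a rough way to check what a file really is regardless of its extension, here is a minimal Python sketch that looks at the first few bytes. The helper names `sniff_bytes` and `sniff_format` are made up for illustration and are not part of Power Automate Desktop; only the three formats mentioned in this thread are covered.

```python
from pathlib import Path

# Leading "magic" bytes of the formats discussed in this thread.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"%PDF-": "pdf",
}


def sniff_bytes(head):
    """Return 'png', 'jpeg', 'pdf', or 'unknown' for the given leading bytes."""
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown"


def sniff_format(path):
    """Read the first 8 bytes of a file (hypothetical path) and classify them."""
    with Path(path).open("rb") as handle:
        return sniff_bytes(handle.read(8))
```

If this reports "pdf" for the file passed to the OCR action, the file would need to be converted to an image first, since renaming the extension alone does not change the contents.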

afmc2238 · ‎06-23-2021

I had played around with this and got it to partially work when I changed the file to a .png. However, it still doesn't work when I use the selector tool to grab only a certain area of the image. It only works if I grab all text from the image, and the results are very inaccurate.

Most likely we will just need to incorporate a better OCR tool to get it to work as we need for our use case.

Thanks for the suggestion!!

Pavel_NaNoi · ‎06-23-2021

Oh wait, I forgot to ask: isn't there a PDF action in Power Automate Desktop that extracts all the text instantly?

You could probably just parse the text that you want from the variable that action produces, using regex.

Also, yeah, the OCR can be a bit of a pain when it comes to this. I recommend the free trial of AI Builder on the Power Automate platform if you haven't accessed it yet. That thing works with PDFs and images, you can select exactly what you want to extract, and it's fairly simple to understand as well. God, that sounds like an advertisement when I read it out loud ^^| but yeah, give that a spin if you're out of options.

afmc2238 · ‎06-23-2021

Well the problem is that this is a scanned document rather than a readable PDF so that's why I needed to use OCR.

I started a free trial of AI Builder last week but didn't see how to use it with desktop Power Automate. I see that you could use Microsoft Computer Vision, but I would love to play around with AI Builder in PAD if possible. Do you know how to make that work?

Pavel_NaNoi · ‎06-23-2021

It depends on whether you have Windows 10 Pro, Windows Server 2016, or Windows Server 2019. If you do, it should be easy to feed AI Builder items into PAD through Power Automate, and I can help guide you through it a bit. Otherwise it won't work.

Also, if you got it to run and it's just being inaccurate, change the image width and height multipliers in the Tesseract OCR engine to 2 instead of 1. This should help it a lot. From there it's more a matter of finding the correct x and y positions of the text (use "If text on screen (OCR)" to find the position of a specific text value more accurately).

afmc2238 · ‎06-23-2021

Great, thank you!

henryhvb5 · ‎10-28-2021

I have the same problem, but it appeared after updating from version 2.13xx to 2.14.173.21294. My account is a free account, and the OCR engine variable value shows blank without any error message. Before the update, this engine could extract the value. Now I have started a new flow using the same PDF image and the same extraction method, but it is unable to extract any text. What should I do?

Pavel_NaNoi · ‎10-29-2021

It's because the Tesseract engine initialization action has been deprecated in that update. The OCR engine initialization action didn't have much use outside of being an extra action, so it's now built into any "Extract text with OCR" action: under OCR engine type, select "Tesseract engine" instead of "OCR engine variable" and it will work just like before. If that's not it, you can also keep increasing the width and height like I mentioned in the previous post, as that can also be the reason, because OCR is just very janky.

Also, there's an action for extracting text from a PDF directly called "Extract text from PDF". Try that if you get stuck and just parse the result.

afmc2238 · ‎10-29-2021

Unfortunately I was never able to get this to work consistently. Luckily the option to use an API call instead became available, and that works every time.

henryhvb5 · ‎10-31-2021

Thank you for your reply. My case can't use "Extract text from PDF", since the PDF is an invoice that the user signs and then scans back as an image.

In this case, based on my understanding of your advice, I should get another OCR engine to install in Windows and use the OCR engine variable in my flow, am I right? (By the way, this version can select the Tesseract engine in the pull-down menu.)

If the Tesseract engine is not working, where should I get those OCR engines (both paid and free ones)?

afmc2238 · ‎11-02-2021

Hi!

Based on the recent update that was mentioned by @Pavel_NaNoi, Power Automate just took that extra engine action away. This does not mean you need to find another tool to use OCR; just use the built-in action "Extract text with OCR". No "OCR Engine Variable" needs to go before the "Extract text with OCR". Hope this works for you!

Pavel_NaNoi · ‎11-02-2021

Apologies, I've missed this in my notifications, what henry said is what i meant. ^^

henryhvb5 · ‎11-03-2021

Hi,

Thank you for your reply, but up to this moment the same workflow, with the same PDF image and the same X/Y extraction coordinates, remains unable to capture anything after upgrading from version 2.13xxx to 2.14xxx. I have no idea what is going on. Nothing changed except the version update.

That's why I posted this problem to ask the experts. I am not a programmer; I think I am a power user with a little bit of technical knowledge. I have searched for expert suggestions on cloud APIs, but most of them require a monthly fee. Going from a free engine and a working flow in 2.13xx to paying for a cloud OCR API service with uncertain results in 2.14xx makes it hard for me to ask my boss to cover this cost.

I am still looking for an alternative solution and waiting to see whether the next PAD update fixes this OCR engine problem, in case other users have had the same unhappy experience.

Pavel_NaNoi · ‎11-04-2021

Can you give me a screenshot of what your flow looks like?

henryhvb5 · ‎11-10-2021

Hi Pavel,

Sorry for the late reply, here is the screenshot of my flow.

The major task for this flow is to capture the DEL number in the above image. I used the desktop recorder with image recording turned on, then right-clicked to extract text from the image. First I highlighted the DEL number, then anchored on the "1 of 1" below the DEL number and set the image matching algorithm to advanced. Before the upgrade, this worked fine in 2.13xxx, but it extracts nothing in version 2.14xxxx. Did I do anything wrong in defining the anchor, or do I need to change something to make this flow work again?

Thank you for your help.

Pavel_NaNoi · ‎11-11-2021

Since I don't have the actual files, I had to improvise a little bit, I think your best bet might be doing something like this when you open the PDF up:

This will basically take a screenshot of the PDF, save it as an image, and then OCR it via the image-on-disk OCR type (width and height are both 1; if it fails to get the DEL number, increase them both a few times). I also did a regex with a lookbehind to ignore the DEL part, though I'm not sure if that's needed. Otherwise just do DEL\d+, which will keep the DEL part in.
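To make the difference between the two patterns concrete, here is a small Python sketch. The sample text and the number in it are invented for illustration; in the flow, the input would be whatever the OCR action stores in its output variable.

```python
import re

# Hypothetical OCR output, made up for this example.
ocr_text = "Delivery note\nDEL12345\n1 of 1"

# Lookbehind: the digits must follow "DEL", but "DEL" is not part of the match.
digits_only = re.search(r"(?<=DEL)\d+", ocr_text)

# Plain pattern: the "DEL" prefix stays in the match.
with_prefix = re.search(r"DEL\d+", ocr_text)

print(digits_only.group())  # 12345
print(with_prefix.group())  # DEL12345
```

Both patterns are case-sensitive, so the "Del" in "Delivery" is not picked up.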

Unfortunately the image file cannot be deleted via Delete File(s) action (no idea why) but there will only ever be 1 image file as it will get overwritten every time a new screenshot is taken, so just store it in a useless folder or delete it manually.

Hope this helps you out man.

henryhvb5 · ‎11-11-2021

Thank you for your help. Please try this image; this interface can't attach a PDF file.

Thank you very much Pavel

Pavel_NaNoi · ‎11-12-2021

Yep, it worked with this image. Please try using the method I've shown in the screenshot in the previous post; put it after you've opened the PDF file and it should work.

Tell me if it doesn't and I'll see what else I can do.

henryhvb5 · ‎11-14-2021

Hi Pavel,

Unfortunately, I am still unable to capture anything from the attached image. Would you mind sharing the parameters inside the OCR capture part? I have tried adjusting the X, Y coordinates many times. Capturing the whole image works, but the result is unstable across different images: sometimes the DEL number is stored in line 30 of the variable, sometimes in line 60, and sometimes there is no record at all. That's why I would like to ask for the detailed settings in your OCR text capture flow. I will keep trying my best to use your method to find the right parameters for capturing the DEL number.
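When the DEL number lands on a different line each run, one possible workaround is to search every line of the whole-image OCR text instead of reading a fixed line. A minimal Python sketch of that idea follows, reusing the DEL\d+ pattern from earlier in the thread. The function name `find_del_numbers` and the sample text are made up for illustration.

```python
import re


def find_del_numbers(ocr_text):
    """Return (line_number, match) pairs for every DEL number, with 1-based lines."""
    hits = []
    for line_no, line in enumerate(ocr_text.splitlines(), start=1):
        for match in re.finditer(r"DEL\d+", line):
            hits.append((line_no, match.group()))
    return hits


# Hypothetical OCR output with the number on an unpredictable line.
sample = "Invoice\n\nsome noise\nDEL98765 1 of 1\n"
print(find_del_numbers(sample))  # [(4, 'DEL98765')]
```

An empty list corresponds to the "no record" case, which points to the OCR itself missing the text rather than the parsing step.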

Once again thank you for your help.
