Re: Extract Data From PDFs and Images With GPT

takolota · ‎06-15-2023

Extract Data From PDFs & Images With GPT

This template uses AI Builder's OCR for PDFs & Images to extract the text present in a file, replicates the file in a text (txt) format, then passes it off to a GPT prompt action for things like data extraction.

It seems to have an 85% or greater reliability for returning requested data fields from most PDFs. That is likely good enough for more direct data entry in some use-cases with well-formatted, clean PDFs, and in many other cases it is good at doing a 1st pass on a file & providing a default / pre-fill value for fields before a person checks & completes the data.
It does not require training on different formats, styles, wording, etc. It works on multiple pages at once. And you can always adjust the prompt to extract the different data you want on different documents & adjust how you want the data to be represented in the output.

It also...
-Runs in less than a minute, usually 10-35 seconds, so it can respond in time for a Power Apps call.
-Handles 10-20 document pages at a time given the recent Create text with GPT update to a 16k model.
-Does not use additional 3rd party services, maintaining better data privacy.

The AI Builder Recognize text action returns a JSON array of each piece of text found in the PDF or image.

The Convert to txt loop goes through each row of text down the page (each distinct vertical position) in the PDF or image & creates a line of text that approximately matches both the text & the spacing between pieces of text on that row.

Each row of text is then combined into a single block of text, like a big txt file, in the final Compose action before it is passed to GPT through the AI Builder Create text action.
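
For anyone who thinks better in code, here is a minimal Python sketch of the idea behind the Convert to txt step. It is not the flow's actual expressions. The `pieces` shape (a `text` value plus `x` / `y` as fractions of the page), the grid sizes, & the function name are all assumptions for illustration.

```python
from collections import defaultdict


def ocr_to_text(pieces, rows_per_page=60, cols_per_page=120):
    """Place each OCR piece on a character grid that roughly mirrors the page.

    `pieces` is a hypothetical simplified shape: dicts with "text" plus
    "x" and "y" in the 0..1 range (fraction of page width / height).
    """
    rows = defaultdict(list)
    for piece in pieces:
        row = int(piece["y"] * rows_per_page)  # bucket by vertical position
        rows[row].append(piece)

    lines = []
    for row in sorted(rows):
        line = ""
        for piece in sorted(rows[row], key=lambda p: p["x"]):
            col = int(piece["x"] * cols_per_page)  # target column for this piece
            # Pad with spaces up to the target column, keeping at least one
            # space between neighbouring pieces on the same row.
            pad = max(col - len(line), 1 if line else 0)
            line += " " * pad + piece["text"]
        lines.append(line)
    return "\n".join(lines)


pieces = [
    {"text": "Total", "x": 0.50, "y": 0.10},
    {"text": "Invoice", "x": 0.00, "y": 0.10},
    {"text": "8304933707", "x": 0.00, "y": 0.20},
]
replica = ocr_to_text(pieces)
```

Here `replica` has two lines: "Invoice" & "Total" share a row with "Total" starting at column 60, and the invoice number sits on the next row.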

Example

Demonstration Invoice Example...

The AI Builder action uses optical character recognition (OCR) on this invoice PDF to return each piece of text & its associated x, y coordinates.

Then the Convert to txt loop produces this output shown in the final Compose...

And if we copy that output over to a text (txt) notebook, then this is what it looks like...

That is then fed into this GPT action prompt...

Which produced this output...

{
"Invoice Date": "2022-09-20",
"Invoice Number": "8304933707",
"Purchase Order (PO) Number": "PO10022556-NIMR",
"Incoterms": "DAP",
"Delivery Or Ship To Address": "Dr The Mission Director, [REDACTED]",
"Consignee Address": "CHEMONICS INTERNATIONAL INC, ATT: ACC PAYABLE, GLOBAL HEALTH SUPPLY CHAIN (PSM), 1275 New Jersey Ave SE, Suite 200, WASHINGTON, DC, 20003 USA 20006, UNITED STATES OF AMERICA",
"Mode Of Shipment": "N/A",
"Product Lines": [
{
"Product Name": "KIT COBAS 58/68/8800 LYS, REAGENT IVD",
"Product Quantity": "49",
"Product Unit Price": "213.00",
"Product Line Total or Amount": "10,437.00",
"Manufacturer": "[REDACTED]"
},
{
"Product Name": "KIT COBAS 58/68/8800 MGP, IVD",
"Product Quantity": "165",
"Product Unit Price": "50.00",
"Product Line Total or Amount": "8,250.00",
"Manufacturer": "[REDACTED]"
},
{
"Product Name": "KIT COBAS 6800/8800 HIV 96T, IVD",
"Product Quantity": "5",
"Product Unit Price": "838.95",
"Product Line Total or Amount": "4,194.75",
"Manufacturer": "[REDACTED]"
},
{
"Product Name": "KIT COBAS 6800/8800 HIV 96T, IVD",
"Product Quantity": "313",
"Product Unit Price": "838.95",
"Product Line Total or Amount": "262,591.35",
"Manufacturer": "[REDACTED]"
},
{
"Product Name": "KIT COBAS 6800/8800 HIV 96T, IVD",
"Product Quantity": "65",
"Product Unit Price": "838.95",
"Product Line Total or Amount": "54,531.75",
"Manufacturer": "[REDACTED]"
},
{
"Product Name": "KIT COBAS HBV/HCV/HIV-1, CONTROL CE-IVD",
"Product Quantity": "72",
"Product Unit Price": "290.00",
"Product Line Total or Amount": "20,880.00",
"Manufacturer": "[REDACTED]"
}
],
"Invoice Total": "360,884.85",
"Banking Details": "[REDACTED]"
}
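
Since the output is plain JSON, a quick arithmetic check can flag extractions that need a closer look before a person reviews them. Below is a minimal Python sketch that assumes the field names from the example output above (trimmed here to the numeric fields). The check is only a suggestion & is not part of the template.

```python
import json
from decimal import Decimal


def to_decimal(value):
    """Convert a string like "10,437.00" to a Decimal."""
    return Decimal(value.replace(",", ""))


def check_invoice(data):
    """Return a list of human-readable problems found in the extracted data."""
    problems = []
    running_total = Decimal("0")
    for i, line in enumerate(data["Product Lines"], start=1):
        qty = to_decimal(line["Product Quantity"])
        price = to_decimal(line["Product Unit Price"])
        amount = to_decimal(line["Product Line Total or Amount"])
        if qty * price != amount:
            problems.append(f"line {i}: {qty} x {price} != {amount}")
        running_total += amount
    if running_total != to_decimal(data["Invoice Total"]):
        problems.append(f"lines sum to {running_total}, not the invoice total")
    return problems


gpt_output = """
{
  "Product Lines": [
    {"Product Quantity": "49", "Product Unit Price": "213.00", "Product Line Total or Amount": "10,437.00"},
    {"Product Quantity": "165", "Product Unit Price": "50.00", "Product Line Total or Amount": "8,250.00"},
    {"Product Quantity": "5", "Product Unit Price": "838.95", "Product Line Total or Amount": "4,194.75"},
    {"Product Quantity": "313", "Product Unit Price": "838.95", "Product Line Total or Amount": "262,591.35"},
    {"Product Quantity": "65", "Product Unit Price": "838.95", "Product Line Total or Amount": "54,531.75"},
    {"Product Quantity": "72", "Product Unit Price": "290.00", "Product Line Total or Amount": "20,880.00"}
  ],
  "Invoice Total": "360,884.85"
}
"""
problems = check_invoice(json.loads(gpt_output))  # empty list for this invoice
```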

And remember you can always adjust the prompt to extract the different data you want on different documents & adjust how you want the data to be represented in the output. You can also often improve the output with more data specifications like "A PO number is always 2 letters followed by 8 digits. Only return those 2 letters & 8 digits."
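
If the prompt specification alone is not enough, the same rule can also be enforced after the GPT action. Here is a small Python sketch of that post-check for the "2 letters followed by 8 digits" example. The function name & the N/A fallback are assumptions that mirror the prompt's convention, not something in the template.

```python
import re

# Two letters followed by exactly eight digits, e.g. "PO10022556".
PO_PATTERN = re.compile(r"[A-Za-z]{2}\d{8}")


def clean_po_number(raw):
    """Return the first 2-letters-plus-8-digits match, or "N/A" if there is none."""
    match = PO_PATTERN.search(raw or "")
    return match.group(0) if match else "N/A"


cleaned = clean_po_number("PO10022556-NIMR")  # "PO10022556"
```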

Also if you are working with some Word/.docx files, there are built in OneDrive actions to convert them to .pdf files. So you should be able to process PDF, Image, and/or Word documents on the same type of set-up.

Also if you need something that can handle much larger files with a better page text filter/search set-up & larger GPT context window, check out this Query Large PDFs With GPT RAG template.

Remember, you may need AI Builder credits for the OCR & GPT actions in the flow to work. Each Power Automate premium license already comes with 5000 credits that can be assigned to your environment. Depending on your license & organization, you may already have a few credits assigned to the environment.

If you are new, you can get a trial license to test things out: https://learn.microsoft.com/en-us/ai-builder/administer-licensing

Lastly, Microsoft recently started requiring approval actions after every GPT action. If you want to get around this requirement, see this post on setting the approval step to automatically succeed & move to the next action.

Version 1.7 simplifies some expressions. Download this version if you are just trying to initially understand the programming of the flow, & don't care as much about speed or efficiency.

Version 1.8 adds a PageNumbers compose action that allows one to input specific pages of a PDF or image packet to pass on to the text conversion & GPT prompt. This could be useful for scenarios where the relevant data is always on the 1st couple of pages or for scenarios where one must filter to only the relevant pages/images because the full packet of PDF page data or image data would exceed the GPT prompt token / character limit.
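
The page filter idea looks roughly like this in Python. This is only a sketch: the function name is made up, and treating an empty PageNumbers input as "keep every page" is an assumption rather than confirmed behavior of the flow.

```python
def select_pages(pages, page_numbers):
    """Keep only the requested 1-based page numbers.

    An empty `page_numbers` list keeps all pages (an assumed default).
    """
    if not page_numbers:
        return pages
    wanted = set(page_numbers)
    return [page for number, page in enumerate(pages, start=1) if number in wanted]


first_two = select_pages(["page 1 text", "page 2 text", "page 3 text"], [1, 2])
```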

Version 2 redesigns the Convert to txt section of the flow to use several clever Select actions & expressions to avoid an additional level of Apply to each looping. So for an example 3 page document with 50 lines per page, instead of taking 15-20 seconds and 156 action calls, it takes 1 second and 21 action calls to create the text replica document.
This makes the entire flow 2X faster (15 seconds vs. 30 seconds) and 7X more efficient for daily action limits.
This makes some use-cases like real-time processing on a Power Apps document upload or processing of larger batches of documents each day much more viable.

Version 2.5 More changes to the Convert to txt component to create a little more accurate text replicas and a change to the placeholder prompt to make the message a little more concise & more accurate. Also moved the spaces & line-break into a single Compose called StaticVariables & changed the variable name to the now more accurate EachPage.
The Convert to txt piece now calculates the minimum X coordinate so it can subtract that number from all X coordinates & thus remove additional spaces on the left margin, helping to reduce the characters fed to the GPT prompt.

The Convert to txt piece also now has a ZoomX parameter in the StaticPageVariables action which sets the spaces multiple, or the number of spaces, per coordinate point. So for example, 200=More Accurate Text Alignment, but 100=Fewer GPT Tokens. So there may be some trade-offs here. (The recognize text bounding box coordinates around longer pieces of text seem to be disproportionately larger than on smaller pieces of text & mess up the text alignment for rows/lines with multiple boxes / text entries.)
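
To make the left-margin trim & the ZoomX trade-off concrete, here is a rough Python sketch. It assumes X coordinates are fractions of the page width, and `to_columns` is a hypothetical helper, not an action in the flow.

```python
def to_columns(x_positions, zoom_x):
    """Map X coordinates to character columns.

    The smallest X is subtracted first so the left margin costs no spaces,
    then each coordinate is scaled by zoom_x (spaces per coordinate point).
    """
    min_x = min(x_positions)
    return [round((x - min_x) * zoom_x) for x in x_positions]


xs = [0.125, 0.375, 0.875]
wide = to_columns(xs, 200)    # [0, 50, 150]
narrow = to_columns(xs, 100)  # [0, 25, 75]
```

Halving ZoomX halves the number of spaces between pieces of text, which is where the token savings come from.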

In addition, the Convert to txt piece will now include line-breaks for blank Y coordinate rows/lines to more accurately replicate the vertical spacing of pieces of text. I figured since each line should be just a line-break character, it shouldn't add much to the character / token count for the GPT prompt.
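
A short Python sketch of the blank-row handling, assuming each line of text has already been assigned a row number (the dict shape & function name are made up for illustration):

```python
def join_rows(rows_by_index):
    """Join lines of text in row order, emitting an empty line for every
    row number between the first and last row that has no text."""
    if not rows_by_index:
        return ""
    first, last = min(rows_by_index), max(rows_by_index)
    return "\n".join(rows_by_index.get(i, "") for i in range(first, last + 1))


page_text = join_rows({2: "Invoice Date", 5: "Invoice Total"})
```

Rows 3 & 4 are empty here, so they only add two line-break characters to the output.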

So overall 2.5 adds some better options for increased extraction accuracy or for decreased characters/tokens per page & thus for slightly larger file capacity.

Version 2.7 Another adjustment to the conversion from OCR coordinates to the text (txt) replica.
It now calculates the X coordinates of a piece of text from the mid-point between X coordinates 0 & 1. So along with the Y coordinates that were already being calculated from the mid-point between Y coordinates 0 & 3, this now registers the position of each piece of text from the center point of each coordinates box.
I also set it to start using an estimate of the length of text characters instead of the length of the overall coordinates box to calculate the whitespace / number of spaces between each piece of text.
Overall this makes this set-up even more accurate, improving text alignment, improving performance on more tilted pages, & adjusting the spacing/alignment for different font / text sizes on the same line.
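
The center-point calculation can be sketched in Python like this. The corner order (0 = top-left, 1 = top-right, 2 = bottom-right, 3 = bottom-left) is my reading of the description above, and both function names are hypothetical.

```python
def center_of_box(corners):
    """Return the (x, y) center of a 4-corner bounding box.

    X is the mid-point between corners 0 and 1; Y is the mid-point
    between corners 0 and 3.
    """
    x0, y0 = corners[0]
    x1 = corners[1][0]
    y3 = corners[3][1]
    return (x0 + x1) / 2, (y0 + y3) / 2


def start_column(center_x, text, zoom_x):
    """Column where the text should start so it is centered on center_x,
    using the character count as the estimate of the text's width."""
    return max(round(center_x * zoom_x - len(text) / 2), 0)


box = [(0.25, 0.5), (0.75, 0.5), (0.75, 0.75), (0.25, 0.75)]
center = center_of_box(box)                     # (0.5, 0.625)
column = start_column(center[0], "Amount", 200)  # 97
```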

Version 2.9 Adjustment For New MS Approval Requirement & Adjust Retry Policy

I added in the automatic approval step to get around the new MS approval action requirement. I also set the retry policy on the GPT action to retry every 5 seconds up to 7 times so it will fail less if wrongful 429 too many request errors occur.
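
For reference, a fixed-interval retry policy like that behaves roughly like this Python sketch. `TooManyRequests` is a stand-in for a 429 response & `call_with_retries` is a hypothetical helper. In the flow itself this is just the retry policy setting on the action.

```python
import time


class TooManyRequests(Exception):
    """Stand-in for an HTTP 429 response."""


def call_with_retries(action, retries=7, interval_seconds=5, sleep=time.sleep):
    """Run action(); on a 429, wait a fixed interval and try again,
    up to `retries` extra attempts before giving up."""
    for attempt in range(retries + 1):
        try:
            return action()
        except TooManyRequests:
            if attempt == retries:
                raise
            sleep(interval_seconds)


attempts = []


def flaky_action():
    """Fails with a 429 twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TooManyRequests()
    return "ok"


# A no-op sleep is passed in so the example runs instantly.
result = call_with_retries(flaky_action, sleep=lambda seconds: None)
```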

If the standard import of the flow-only packages below does not work for you, you can also try importing the flows through a Power Apps solution package here: Re: Extract Data From PDFs and Images With GPT - Power Platform Community (microsoft.com)

Microsoft is deprecating the original Create text with GPT action this template relies on.

Users may need to use the new “Create text with GPT using a prompt” action & create a custom prompt on that action instead.

https://learn.microsoft.com/en-us/ai-builder/use-a-custom-prompt-in-flow

See this post for an example set-up: https://powerusers.microsoft.com/t5/Power-Automate-Cookbook/Extract-Data-From-PDFs-and-Images-With-G...

The ExtractPDFImageDataWithGPT_1_0_0_x Power Apps solution package contains a version of the flow where this is outlined.

Thanks for any feedback,

Please subscribe to my YouTube channel (https://youtube.com/@tylerkolota?si=uEGKko1U8D29CJ86).

And reach out on LinkedIn (https://www.linkedin.com/in/kolota/) if you want to hire me to consult or build more custom Microsoft solutions for you.


takolota · ‎08-22-2023

@twidd

So you changed the field name to "Wording"?

And you aren't even using the Create text GPT action for this?

twidd · ‎08-22-2023

@takolota

Yes, so I am aiming to return the entire contents of a PDF into an excel cell.

takolota · ‎08-22-2023

@twidd

The only thing I know that would replicate that output is using the First( ) expression on all the List files action output values for each run of the loop. But it doesn’t look like you are doing that / it looks like you are referencing the file id of the current loop item.

Is that what the reference is showing in the Get content?

Is each txt output for each run the same or different?

takolota · ‎08-25-2023

@Btr1 @JamesHinton

The newer Version 2.5 includes a setting on the StaticPageVariables action called ZoomX that helps determine the number of space characters / whitespace between the coordinates of each piece of text. Reducing this number will reduce the number of space characters across the text replica & compress the text together more.

So if you still experience issues with prompt lengths, then you could try reducing this setting down to like 80 or 90 to reduce the tokens taken up by spaces. It might lose some accuracy but it will have more capacity for many/dense pages.

takolota · ‎08-25-2023

Version 2.5

More changes to the Convert to txt component to create a little more accurate text replicas and a change to the placeholder prompt to make the message a little more concise & more accurate. Also moved the spaces & line-break into a single Compose called StaticVariables & changed the variable name to the now more accurate EachPage.
The Convert to txt piece now calculates the minimum X coordinate so it can subtract that number from all X coordinates & thus remove additional spaces on the left margin, helping to reduce the characters fed to the GPT prompt.

The Convert to txt piece also now has a ZoomX parameter in the StaticPageVariables action which sets the spaces multiple, or the number of spaces, per coordinate point. So for example, 200=More Accurate Text Alignment, but 100=Fewer GPT Tokens. So there may be some trade-offs here. (The recognize text bounding box coordinates around longer pieces of text seem to be disproportionately larger than on smaller pieces of text & mess up the text alignment for rows/lines with multiple boxes / text entries.)

In addition, the Convert to txt piece will now include line-breaks for blank Y coordinate rows/lines to more accurately replicate the vertical spacing of pieces of text. I figured since each line should be just a line-break character, it shouldn't add much to the character / token count for the GPT prompt.

So overall 2.5 adds some better options for increased extraction accuracy or for decreased characters/tokens per page & thus for slightly larger file capacity.

takolota · ‎08-26-2023

Note that since ZoomX is exposed as an input, you can use more dynamic criteria to determine the amount of page spaces/whitespace compression like so...

if(equals(1, length(body('Filter_array_RemoveUnselectedPageBlanks'))), 190,
if(equals(2, length(body('Filter_array_RemoveUnselectedPageBlanks'))), 150,
if(equals(3, length(body('Filter_array_RemoveUnselectedPageBlanks'))), 110, 
75)))

With this set-up it changes the page compression based on the number of pages being processed. So for processing 1 or 2 pages, it uses higher accuracy settings, but if it starts to go beyond 2 pages then it starts to prioritize compressing the number of space characters / whitespace in order to fit everything into the GPT prompt. In some of my original invoice examples each page was about 4000 characters at 185 ZoomX, so 1000 tokens in the action's 4000 token limit for both input & output length. With the extra whitespace compression at a lower ZoomX it is possible to fit one more page into the prompt.
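
The capacity math can be sketched in Python. The 4 characters per token ratio comes from the numbers above (4000 characters, about 1000 tokens). The 500 tokens reserved for the prompt instructions & the response, and the 3000 characters per page at a lower ZoomX, are made-up figures for illustration.

```python
CHARS_PER_TOKEN = 4  # rough ratio implied by 4000 characters ~ 1000 tokens


def estimate_tokens(text_length):
    """Estimate tokens from a character count, rounding up."""
    return -(-text_length // CHARS_PER_TOKEN)


def pages_that_fit(chars_per_page, token_limit, reserved_tokens):
    """How many whole pages fit once tokens are reserved for the
    prompt instructions and the response."""
    budget = token_limit - reserved_tokens
    return max(budget // estimate_tokens(chars_per_page), 0)


at_high_zoom = pages_that_fit(4000, 4000, 500)  # 3 pages
at_low_zoom = pages_that_fit(3000, 4000, 500)   # 4 pages
```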

takolota · ‎08-27-2023

Version 2.7

Another adjustment to the conversion from OCR coordinates to the text (txt) replica.
It now calculates the X coordinates of a piece of text from the mid-point between X coordinates 0 & 1. So along with the Y coordinates that were already being calculated from the mid-point between Y coordinates 0 & 3, this now registers the position of each piece of text from the center point of each coordinates box.
I also set it to start using an estimate of the length of text characters instead of the length of the overall coordinates box to calculate the whitespace / number of spaces between each piece of text.
Overall this makes this set-up even more accurate, improving text alignment, improving performance on more tilted pages, & adjusting the spacing/alignment for different font / text sizes on the same line.

takolota · ‎09-15-2023

I just had a 35,000 character prompt & response go through on the Create text with GPT action. It looks like MS may have enabled a 16k token model for the action.
I hope they keep this functionality going forward as this enables extracting text from 4x more pages.

I've changed the default ZoomX property on the StaticPageVariables settings action in the Apply to each Convert to txt loop to use some of this extra capacity to further improve the output's accuracy. Here is the new expression I use to adjust the amount of horizontal X spacing characters based on the number of pages...

if(equals(1, length(body('Filter_array_RemoveUnselectedPageBlanks'))), 190,
if(equals(2, length(body('Filter_array_RemoveUnselectedPageBlanks'))), 170,
if(equals(3, length(body('Filter_array_RemoveUnselectedPageBlanks'))), 150,
if(equals(4, length(body('Filter_array_RemoveUnselectedPageBlanks'))), 130,
if(equals(5, length(body('Filter_array_RemoveUnselectedPageBlanks'))), 110,
90)))))
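
If the nested if( ) calls are hard to read, the same mapping is easier to see as a lookup table. This Python sketch mirrors the values in the expression above. The names are hypothetical & it is only meant as a reading aid, not a replacement for the flow expression.

```python
# ZoomX by number of selected pages; anything else falls back to 90.
ZOOM_BY_PAGE_COUNT = {1: 190, 2: 170, 3: 150, 4: 130, 5: 110}
DEFAULT_ZOOM = 90


def zoom_x_for(page_count):
    """Return the ZoomX value for a given number of pages."""
    return ZOOM_BY_PAGE_COUNT.get(page_count, DEFAULT_ZOOM)


single_page = zoom_x_for(1)  # 190
big_packet = zoom_x_for(12)  # 90
```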

takolota · ‎09-15-2023

@JamesHinton @Btr1

A Microsoft Staff member just confirmed their Create text with GPT action has been updated to use a 16k token GPT model. So now this template should be able to handle 12-20 pages worth of a given document.

JJ27 · ‎09-26-2023

@takolota I am using the 1.7 version and encountering an issue that didn't appear in the video. I slightly edited the Create text with GPT step to extract data from a specific document I uploaded in the Get file metadata step, just replaced the invoice relevant fields with two new fields: State, PO Box, and made the necessary changes as shown below:

From the text provided below, please extract the...
State
PO Box

Be aware, the text was captured by optical character recognition (OCR) & it may contain some errors like wrong characters or generally miss some formatting of the original file.
If a piece of data can not be found or determined from the text, return N/A.

[Start of text]
@{outputs('Combined_txt_output')}
[End of text]

Create a JSON object with the extracted data that follows the following example format:
{
"State": [data],
"PO Box": [data],
}

However, when I run the flow, I keep getting the error "the create text with GPT action doesn't have a content approval after it". That wasn't the case in the video, so I am not sure why I am getting this error. I tried to add that action, but the flow never finished; it was stuck forever in that step. Any ideas what is going on here?