naelaiman
Frequent Visitor

Extract Text From Structured PDF

Hi RPA Community,

I have a PDF file that I want to extract text from. The PDF is in a structured form, and the output text file should follow that structure. Can someone advise which approach I should use to get the correct output?

 

I am sharing the sample PDF and the desired text format after extraction:

https://pktgroup-my.sharepoint.com/:u:/p/nael_rashid/EfGkG-1n0KRAuHfFlbuV0q8BgPINNKLcVgNOfzcUqXk2lw?... 

 

Appreciate your time and assistance,

Thanks and Regards,

Nael

 

21 REPLIES 21
yoko2020
Responsive Resident

Hi @yoko2020 ,

Thank you for your suggestion. Does this mean I need to rely on AI Builder or another 3rd party software service to get the extracted text in my situation?

I was hoping there is a way to get my desired output using the commands that are available in PAD.

 

Thanks and Regards,

Nael

I never use the parse/regex actions or the Extract text from PDF action in PAD when dealing with invoice, sales order, or custom form document (PDF/image) extraction. I always use third-party software specialized for this purpose.

 

Things to consider when dealing with this:

1. Does the document always come as a text PDF?

2. What happens if the document comes as an image PDF?

3. Are we dealing with 1000 or more documents per month, or just 10 documents per month?

4. What if one document contains multiple invoices that need to be separated?

    See this video for what I mean by document separation/invoice splitting:

     https://www.youtube.com/watch?v=9fFjQn_E8dI

5. Sometimes an invoice spans multiple pages, so we face a variable number of pages per invoice that need to be processed.

 

 

Most of this software can handle invoice splitting, except Power Automate AI Builder.

If you only process a small quantity, you can try the built-in PAD actions, but take note of those 5 points or your project will get stuck in the future.

 

 

 

 

 

Hi @yoko2020 

If the PDF has a constant header, you can use regex directly.

First, use the Extract text from PDF action.

After that, use Parse text with a regex based on the data you need.

 

Regards

Ahammad Riyaz


UK_Mike
Post Prodigy

"The PDF will be in a structured form and the output text file should follow the structure accordingly"

This makes no sense. Surely you're just pulling particular values rather than the whole PDF text?

 

If particular values, yes it can be done wholly in PAD...

 

Screenshot 2022-03-29 140513.png

Hi @yoko2020 ,

I previously used regex and string manipulation to extract data from this PDF format. However, that was in Automation Anywhere (AA), which can extract text in a structured format, so it was easy for me to extract the data line by line with string conditions. I have now had to migrate to PAD, and its text extraction does not behave the same as AA's, so the result is different from what I expected. Here is the output I got from PAD's Extract PDF to Text command:

https://pktgroup-my.sharepoint.com/:t:/p/nael_rashid/EZ0SpMbDJKhNuXr_s5REAQMB5J-syMWQw4s519g602_dRw?...

Let me answer your questions:

1- Yes, this file will always come as a readable PDF, as it is generated by a system.

2- If an image PDF comes through, the extraction will return nothing, so error handling should be able to cover that.

3- Yes, we are dealing with 1000+ documents per month.

4- No, invoices won't be combined, as the system generates 1 invoice per order.

5- If there are multiple pages, I should still be able to extract all the necessary information as long as the text output is in a structured format.
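The error handling mentioned in point 2 could look roughly like the Python sketch below. This is not PAD code, only an illustration of the check: if almost no text comes out of the extraction, treat the file as an image PDF and route it elsewhere. The function names, the `min_chars` threshold, and the route labels are all hypothetical.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess that a PDF had no text layer when almost nothing was extracted.

    min_chars is an arbitrary, assumed threshold; tune it to your documents.
    """
    # Drop all whitespace so blank lines and spaces do not count as content.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def route_document(path, extracted_text):
    """Return a (route, path) pair; the route names are hypothetical."""
    if needs_ocr(extracted_text):
        return ("manual_review", path)
    return ("parse", path)


print(route_document("scan.pdf", "  \n\n "))
print(route_document("invoice.pdf", "Invoice No: INV-001 Date: 29/03/2022"))
```

In a PAD flow the same idea would be an If condition on the extracted text variable before any parsing starts.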

 

Thanks and Regards,

Nael

Hi @Ahammad_Riyaz ,

Yes, the PDF will have a constant header, and it repeats if the invoice has multiple pages. I tried this approach, but the result I get from "Extract PDF to Text" makes it hard to apply regex or string operations. This is probably due to the format of the PDF, which is why the text result is cluttered and the labels and values do not come in pairs. Here is the output I got from PAD's Extract PDF to Text: https://pktgroup-my.sharepoint.com/:t:/p/nael_rashid/EZ0SpMbDJKhNuXr_s5REAQMB5J-syMWQw4s519g602_dRw?...

Is there any way for me to extract the text without using 3rd party applications or AI-builder for this?

 

Thanks and Regards,

Nael

Hi @UK_Mike ,

Yes, I want to pull particular values, but the result I get from Extract PDF to Text is not organized, and applying regex or string manipulation is difficult because the values don't seem to come in pairs for this PDF file. If the extracted text followed the same format as the PDF, I could extract the invoice details as well as the item details. Here is the text output I get from PAD. As you can see, each value is hard to differentiate. https://pktgroup-my.sharepoint.com/:t:/p/nael_rashid/EZ0SpMbDJKhNuXr_s5REAQMB5J-syMWQw4s519g602_dRw?...

There are plenty of fields that I want to extract and evaluate, and if the extracted text comes in this format it can be quite troublesome to extract the details. I was hoping to find a solution without relying on 3rd party applications or AI Builder, as they would incur additional cost and there are over 1000 PDF invoices to process per month.

 

Thanks and Regards,

Nael

@Ahammad_Riyaz 

 

Yes, I know that technique very well.

But I never use that method; it wastes time and means double work in the future when dealing with >2000 documents and 100 vendors (document layouts).

@naelaiman 

 

Which PAD version are you using? It looks like the Extract text from PDF action has a bug: it does not keep indentation.

In that case you will have to go for AI Builder; you can train it and use it for different vendors.

 

 

Yes, I already finished that a long time ago, but not using AI Builder.

Well, I extract the PDFs to a variable; not sure what you're extracting to?

Line numbers cannot be relied on: one PDF could have the invoice date on line 20, the next PDF on line 21. Too unreliable.

When I have the read pdf text in a variable I then parse this variable with regex.

One regex per value required, Date, Amount, Customer etc.

The resulting variables from the parse then get written to Excel.
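As a rough illustration of the one-regex-per-value approach described above, here is a minimal Python sketch. It is not PAD code (in a flow, the Parse text action would do the matching), and the sample text, field labels, and patterns are all hypothetical. CSV output stands in for the Excel step to keep the example self-contained.

```python
import csv
import io
import re

# Hypothetical extracted text; a real invoice will look different.
SAMPLE_TEXT = """Invoice No: INV-001
Invoice Date: 29/03/2022
Customer: ACME Sdn Bhd
Total Amount: 1,250.00
"""

# One pattern per value, each with a single capture group (assumed labels).
FIELD_PATTERNS = {
    "invoice_no": r"Invoice No:\s*(\S+)",
    "invoice_date": r"Invoice Date:\s*(\d{2}/\d{2}/\d{4})",
    "customer": r"Customer:\s*(.+)",
    "amount": r"Total Amount:\s*([\d,]+\.\d{2})",
}


def extract_fields(text):
    """Return a dict of field name -> matched value, or None when not found."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        result[name] = match.group(1).strip() if match else None
    return result


def to_csv(rows):
    """Serialize a list of field dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fields = extract_fields(SAMPLE_TEXT)
print(to_csv([fields]))
```

Because each pattern searches the whole text for its label, the match does not depend on which line the value sits on, which is the point made above about line numbers being unreliable.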

I'm sure others have their own way of doing it, but I'm trying to avoid AI Builder or any 3rd parties.

I see that in the latest PAD update PDF tables are catered for. I've not tried it yet; scared of updating 😂

I have a look now and again on the AI Builder forum; it is not the Holy Grail of PDF data extraction.

I see @Ahammad_Riyaz  referred to this method, it works for me.

 

@yoko2020, care to elaborate on what software you use, whether it's more than one, and the costs?

 

Mike

 

 

 

 

 

@UK_Mike 

 

Chronoscan Advanced version + Nuance Plug-Ins https://www.chronoscan.org/

ABBYY® FlexiCapture® https://www.abbyy.com/flexicapture/

Artsyl’s docAlpha https://www.artsyltech.com/products/docAlpha

 

 

 

 

 

 

 

 

 

Sorry @yoko2020, as soon as I posted I saw your post further up mentioning the software you used.

Thanks x 2 😂

np.

 

You can also test these 2 services if you want to get a headache every minute. 😂

 

Azure Form Recognizer
Amazon Textract

Ermmmmmmmmmmmm...................... no thanks 😂

Hi @yoko2020 ,

I'm using PAD version 2.18.146.22083, the free version, not licensed. If this is a bug, then I can get support for this issue. Can you share the Extract PDF to Text result that you get from the file I shared?

 

Thanks and Regards,

Nael

 

Hi @UK_Mike ,

In my situation, I'm still in the research phase of this project, trying out the PDF extraction command. Previously I used Automation Anywhere (AA), where Extract PDF to Text doesn't store the extracted data in a variable but writes it to a text file. I could then extract the details line by line and create conditions based on a counter variable.

From what I can see in PAD, the Extract PDF to Text command puts all the details into a variable. I can then convert that variable to a list and begin my regex and string operations to find the necessary details in each line. Do correct me if I'm wrong.

The issue is just that when I use Extract PDF to Text in PAD, the indentation of the PDF file is removed and the indented content is placed on a new line. This makes the data inconsistent even if all the PDFs are in the same format.
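If the extraction really does push each value onto the line after its label, one possible workaround is to look up the label and take the next non-empty line. The Python sketch below only illustrates that idea; the sample text and labels are hypothetical, and the assumption that a value directly follows its label may not hold for the actual PAD output.

```python
def value_after_label(lines, label):
    """Return the first non-empty line after the line equal to `label`, else None."""
    for index, line in enumerate(lines):
        if line.strip().lower() == label.lower():
            # Skip blank lines until some content shows up.
            for candidate in lines[index + 1:]:
                if candidate.strip():
                    return candidate.strip()
            return None
    return None


# Hypothetical extraction result with each value on its own line.
sample = "Invoice No\nINV-001\n\nInvoice Date\n\n29/03/2022\n"
lines = sample.splitlines()

print(value_after_label(lines, "Invoice No"))
print(value_after_label(lines, "Invoice Date"))
```

The same lookup could be done in PAD with a loop over the list of lines, but it only stays reliable if the label-then-value order is the same in every document.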

Does this issue only happen to me? Can you share your Extract PDF to Text result from PAD so that I can confirm my situation?

 

Thanks and Regards,

Nael
