03-11-2024 10:56 AM - last edited 07-04-2024 11:39 AM
Query Large PDFs With GPT RAG
Need to efficiently prompt / query on very large PDFs with dozens, hundreds, maybe even thousands of pages, but can't possibly fit and/or pay for all that in a prompt?
This template builds off a previous Extract Data From PDFs and Images With GPT template. But this template uses Retrieval Augmented Generation (RAG) to essentially do a google search for the file pages with the most relevant text related to a user's query before using those pages to answer the query with a GPT prompt.
It returns the most relevant file page texts to the GPT prompt up to the chosen character limit so the GPT model can use only those most relevant pages to answer the user's query. For example, if there is a question like "According to the file, what happens if goods are damaged?" about a 600 page PDF and the MaxFileTextCharacters parameter is set to 80000 & each page is roughly 4000 characters / 1000 tokens, then the flow will select only the 20 pages' texts most relevant to the question of damaged goods & submit those pages' texts along with the question to the GPT prompt to generate an answer.
(Will not be very useful for more aggregate queries that need the entire document like "Summarize the entire PDF".)
To read more on semantic search & embeddings used in Retrieval Augmented Generation, see these resources
https://platform.openai.com/docs/guides/embeddings
https://www.youtube.com/watch?v=orLGv2LgWDE
Flow Run Example
Overall the flow takes in a query to ask GPT, the MaxFileTextCharacters controlling the amount of file context given to the GPT prompt, & the file relevant to the question.
It then OCRs each page of the file, converts the OCR text & coordinates to a text replica of each page, uses text embeddings values of the query text & each page text to give each page text a score of how relevant it is to the query text, filters to only the most relevant pages that can fit in the MaxFileTextCharacters limit, resorts to a chronological page order to make a combined replica text of the most relevant pages in page order, & then feeds those pages' texts & the query to a GPT prompt to get a response.
Setting of the MaxFileTextCharacters parameter & setting of the Query.
Getting the query text embeddings.
Selecting the file from OneDrive & OCR it for a list of all the document text values & their coordinates.
Run actions to process each page of text values & coordinates into replications of each document page with proper horizontal & vertical spacing between text values.
Take the text replica for each page & get the text embeddings.
So this replicated page text...
{
"input": " xxxxxxxx\n\n xxxxxxxxxx\n\n\n\n\n\n Indefinite Delivery / Indefinite Quantity Subcontract\n\n\n Between\n\n\n xxxxxxxxxxx, Inc.\n\n\n And\n\n xxxxxxx xxxxxxx\n\n\n Hereinafter referred to as the \"Subcontractor\"\n\n For\n\n xxxxx Global Health Supply Chain Program- Procurement and\n\n Supply Management (GHSC-PSM) project\n\n Contract No .: xxx-xxx-xx-xxxxxx Task Order No .: xxx-xxx-xx xxxxxx\n\n\n\n Subcontract number: xxxx\n\n Start Date: 10/01/2019\n\n End Date: 11/23/2020\n - -\n Subcontract Ceiling Price: $212,523,728.56\n\n\n\n ISSUED BY:\n\n xxxxxxxxx, Inc.\n\n xxxx x Street, xx, Washington, DC, xxxxx, United States of America\n\n\n ISSUED TO:\n\n\n A to Z xxxxxxxxxxxx\n\n xxxxxxx xxxxx xxxx, xxxxxx\n\n P.O. Box\n\n xxxxxxxxxx\n Limited\n\n\n Subcontractor Tax ID Number: N/A\n\n Subcontractor DUNS Number: xx-xxx-xxxxxx\n xxxxxxxx\n\n\n O.Box xxx - xxxxx - xxxxxx\n\n\n\n\n Page 1 of 49\n\n\n\n\n\n\n\n\n\n\n\n"
}
Becomes this set of vectors / text embeddings.
[0.0050294143,0.011339636,0.0052986285,0.025061412,-0.017213404,-0.0768811,-0.01173122,0.035634194,-0.030413067,0.044412214,0.01928554,-0.038212124,0.021080302,-0.004319667,-0.024033502,0.008268145,-0.046990145,0.027916715,0.025012463,-0.0068690456,0.040920585,0.03573209,0.041246906,0.025469312,0.014839423,0.004132033,-0.021602415,0.010287252,-0.0076522147,0.016462868,-0.03651526,0.031489924,0.016789189,-0.027459867,0.0599777,-0.008884074,0.021259777,0.041246906,-0.0028777386,-0.020705033,
...
...
...
-0.0130365025,-0.007668531,0.030184643,0.018224997,-0.0099772485,0.0034202463,-0.005804425,-0.0076603726]
With the query text embeddings & the page text embeddings we can now run cosine similarity / dot product calculations across them to get a score of how relevant the page text is to the query text.
Run a Filter array action to filter out all the least relevant page texts that don't fit in the MaxFileTextCharacters limit.
Re-sort all the most relevant pages to be ordered by page number so we can combine all the most relevant page texts into the relevant file text we will feed the prompt.
Example combined page texts (some pages in the middle removed for brevity). This started with page 19 & ended on page 29. And it also skipped pages between, like page 28, if their cosine similarity relevance scores were lower.
xxxxxxxx
arrangements, and the Services to be provided, along with the information specified in
AIDAR 752.7004, EMERGENCY LOCATOR INFORMATION.
(2) The Supplier shall ensure that its personnel, while in a Cooperating Country, abide by all
applicable laws of the Cooperating Country and political subdivisions thereof.
(3) Other than work performed under the Subcontract for which personnel are assigned by
the Supplier, the Supplier's personnel shall not engage, directly or indirectly, either in
their own name or in the name or through the agency of another person, in any business,
profession or occupation in the Cooperating Country, nor shall they make loans or
investments to or in any business, profession or occupation in the Cooperating Country,
without xxxxxxxx' approval. This provision does not apply to personnel who are citizens
or legal residents of the Cooperating Country.
(4) The Supplier shall obtain (a) worker's compensation (Defense Base Act) insurance
pursuant to FAR 52.228-3 and AIDAR 752.228-3, and (b) medical evacuation insurance
for personnel travelling to a Cooperating Country in connection with this Subcontract.
(5) Personnel travelling on the Supplier's behalf for performance of Related Services shall
possess appropriate language skills, if any, stated in the Subcontract, and shall be
physically fit in accordance with AIDAR 752.7033.
(6) In performing Related Services, the Supplier shall comply with USAID guidance, if any,
relating to branding/marking of activities.
(7) FAR 52.246-4 INSPECTION OF SERVICES - FIXED PRICE (AUG 1996) shall apply
to Related Services.
(8) All logistics support, visas, legal compliance matters and taxes in connection with its
personnel overseas shall be the sole responsibility of the Supplier, as will all liability for
the acts and omissions of the Supplier's personnel performing the Related Services.
(9) Compensation for satisfactory performance of Related Services shall be paid upon
completion thereof in compliance with the terms and conditions of the Subcontract and
solely in the form of the firm, fixed, all-inclusive prices.
(10) Notwithstanding any other provisions of this Subcontract, no additional
compensation or reimbursement will be provided to the Supplier for complying with these
requirements concerning provision of Related Services
ARTICLE 8. PACKING, EXPORT MARKING, PREPARATION FOR
SHIPMENT AND PACKAGING
A. All Goods supplied under this Subcontract shall be packed and marked for export as required
by the Subcontract/Orders and by all applicable transportation regulations, carrier tariffs, US
FDA/SRA regulations (if any), and sound commercial practice. Without limiting the generality
of the foregoing, all Goods shall be properly prepared for export according to the best
international packing standards suitable to prevent theft, loss, or damage and to withstand
exposure to the elements, including extreme temperature and water, and rough handling
during air, sea or land shipment.
B. The Supplier shall be solely responsible for complying with all applicable laws and sound
international practices, which includes having all relevant licenses in places at the Supplier's
factory for the Goods and for shipping/loading in accordance with the applicable INCOTERM, for
the packaging and labeling of the Goods (including, if applicable, hazardous materials safeguards).
xxxxxxxx
C. Packaging shall be prepared in accordance with the Subcontract and to ensure that:
(1) All tertiary, secondary, and primary (whenapplicable) packaging for Goods are properly
Page 18 of 49
xxxxxxxx
labelled per Section D below and clearly identifies any special handling instructions and/or
temperature requirements
(2) Preference is for EUR2 pallets (100x120). EUR1 (80x120) pallets are also acceptable.
Other pallet types may be acceptable, in consultation with xxxx-xxx. In cases wherein
the destination requires a specific pallet type/size, it will be specified in the relevant
Purchase Order, and the Supplier will provide goods utilizing the specified pallet type/size.
(3) Pallet height not to exceed 1.25 m (incl. pallet) for shipments using air freight. Pallet height
may not exceed 2.1 m (incl. pallet) for shipments using sea freight.
(4) Partial cartons, including those with batch-end products require an extra label clearly
marking the cartons as "Partial" or equivalent and the quantity of units included within.
(5) Like product and batches should be kept contiguous when loaded into containers and
should not be separated. Corrugated separator sheets should be used between batches when
multiple batches are packed on the same pallet.
D. xxxxxxxx may be implementing xxx labeling requirements on tertiary packaging
(pallet/logistics unit and carton/trade item) and/or secondary packaging and/or on the LLIN care
label during the period of performance of this Subcontract. The Supplier may be required to
comply with GS1 General Specifications for identification and marking details under the
Subcontract. The Supplier may refer to the xxx barcode specifications for detailed requirements
(xxxxxxxxxxxxx). xxxxxxxx will provide
the Supplier with reasonable notice of the implementation requirement applicable to the
Subcontract.
-- - -
E. Transaction and Production Data
For orders with Incoterms other than DAP or DDP, all transaction and production data must be
provided to xxxx-xxx through the xxxxx Logistics System
(xxxx), including but not limited to the SSCC, GTIN, batch/lot number, and expiration date.
For orders with incoterms DAP or DDP, all transaction and production data must be provided to
xxxx-xxx via the Procurement Specialist. Data presented on transaction documents -
including but not limited to the packing list, commercial invoice, and advanced ship notice -
must align with the identifiers used on the shipping label (i.e. once the Subcontractor has
transitioned to using the GTIN as the primary identifier, this must be used on packing lists as
well).
(1) Within 30 days of a request, the Supplier will make serial number data for goods procured
under this subcontract in the format requested by xxxxxxxx.
(2) A complete itemized packing list shall be carried in a secure, durable clearly-marked
"packing list" envelope affixed to the outside of each pallet, shipping container or box
that represents a separate unit of the shipment used to deliver the Goods. Each packing
list must show the specified xxxxxxxx Subcontract/Order number (unless otherwise
required by xxxxxx in writing, a complete narrative description of the Goods, all
applicable part numbers, and the corresponding line item number.
(3) Damage resulting from improper packing, export marking and preparation for shipment
shall be the liability of the Supplier and deducted from amounts due.
xxxxxxxxxxxxxxxxx Page 19 of 49
...
...
...
...
...
xxxxxxxxx
10 business days of a request by xxxxxxx, xxxx-QA, and or xxxx-xxx QA. Finished
product must be retained for at least one year past the expiration date or according to supplier's
retention procedure, whichever is longer.
ARTICLE 13. TITLE AND RISK OF LOSS OR DAMAGE
A. Supplier shall ensure that the title to Goods delivered and supplied hereunder shall pass
directly to xxxxx upon acceptance pursuant to Article Quality Assurance, Testing,
Inspection and Acceptance above.
B. Notwithstanding completion of delivery, Supplier shall bear all risk of loss or damage to the
Goods prior to acceptance, except to the extent that any loss or damage is due to xxxxxxx'
fault, or occurs after delivery and not due to fault on the Supplier's part.
ARTICLE 14. PAYMENT AND PAYMENT TERMS
A. xxxxxxx will pay the total Order price as a lump sum, or in installments for agreed upon
shipments, after the Supplier's delivery of the corresponding Goods and/or Related Services
and xxxxxxx' designated agent's acceptance thereof, or as otherwise provided in the Order,
according to the delivery schedule agreed by the Parties. xxxxxxx will pay the Supplier's
invoice within forty-five (45) net days of receipt of a complete invoice and receipt of the
corresponding evidence of delivery per the INCOTERM. The Supplier's submission must be
in compliance with the Article labeled "Invoice Requirements" below.
In the specific event that the Supplier agrees to hold xxxxxxxx' Orders after quality assurance
processes have been completed but import waivers are still pending, then xxxxxxx will pay
the total price within forty-five (45) net days after xxxxxxxx has issued a Certificate of
,Compliance (CoC) indicating that the Goods have passed the requisite QA testing and informed
, the subcontractor of the pending importation waiver. Invoices for any orders placed with
INCOTERMs other than FCA or ExWorks will also require proof of delivery and acceptance.
xxxxxx will pay the total Subcontract price as a lump sum, or in installments for agreed upon
shipments corresponding to complete Subcontract Order documentation. Should products
require additional Quality Assurance testing prior to shipment, in the event of a batch rejection,
Supplier will refund the pro rata amount of the invoice price applicable to the rejected products
to xxxxxx, within ten (10) business days of notification of rejection in addition to remedies
for non-conforming goods herein including as set forth in Article 13. Quality Assurance Testing,
Inspection and Acceptance.
B. Payments for approved invoices will be made by check or via Electronic Funds Transfer
(EFT) for US bank/financial institution accounts or Wire Transfer for non-US bank accounts.
Payment will be sent to the Supplier's designated recipient account name, account number,
and bank or financial institution as identified in the Subcontract and in the payment account
forms required herein to establish a payment account with xxxxxx xxxxxxxx.
Incomplete or incorrect payment account forms to establish a new account or update an
existing account will delay payment. All costs and risks arising out of, relating to, or resulting
from EFT or Wire Transfer shall be borne by the Supplier. The following account forms are
required to establish or update a payment account;
(1) All US based Suppliers are required to complete the xxxxxxxx Electronic Funds Transfer
Form and W9 Tax form to set up a payment account with xxxxxxxxx.
(2) The Supplier with international banks are required to complete the xxxxxxxx
International Wire Transfer form, including the Domestic (US) Intermediary Bank
xxxxxxxx section. Selecting a US intermediaty/ bank facilitates an efficient transfer of funds and is Page 27 of 49
xxxxxxxxx
INCOTERM:
INCOTERMS
Documents Information Attributes
EXW/FCA CIP/CPT/DDP
Air Freight Dimensions or Volume and Gross
Weight, Airport Departure, Airport
Shipping/Delivery
Doc: Airway Bill Destination, Shipper's Name,
(AWB) X (Provided by Consignee, Carrier Charges
X (Prepared for
xxxxxxx) Supplier) Dimensions or Volume and Gross
Ocean
Shipping/Delivery Weight, Seaport Departure,
Seaport Destination, Shipper's
Doc: (BOL) Name, Consignee, Carrier Charges
Delivery to
Freight
Forwarders X (Supplier
Volume and Gross/Net Weight,
Certificate of collects from Consignee, Shipper's Name,
Receipt Designated X (Provided by
Freight Supplier) Invoice #, PO #, Description of
(Note: for Goods, Packaging Details,
International Forwarder and destination
Provides)
Trucking only the
Freight Forwarders
Certificate of
Receipt is
provided)
Volume and Gross/Net Weight,
End Recipient Consignee, Shipper's Name, Invoice
Goods Receipt X (Provided by #, PO #, Description of Goods,
IN/A Supplier) Packaging Details, destination
Notice or Proof of
Delivery Receipt
D. Invoices determined to be proper will be paid by xxxxxxx in accordance with the Article
labeled "Payment and Payment Term" above and the terms of the Subcontract and the
Order. Invoices determined not to be proper due to the existence of deficiencies will be
rejected and the Supplier promptly notified, generally within ten (10) business days of
submission, with deficiencies noted for correction. In the event that an invoice is submitted,
which is partially proper, xxxxxxx may, in its sole discretion, either reject the entire
invoice for correction or make payment of the proper portion and return the portion deemed
not to be proper."
ARTICLE 16. COOPERATING COUNTRY FEES, TAXES, AND DUTIES
A. This Subcontract is entered into by xxxxxxx on behalf of the xxxx-xxx Project, in Cooperating Country(ies).
As such, the Subcontract is free and exempt from any taxes, VAT, tariffs, duties, or
other levies imposed by the laws in effect in the Cooperating Country(ies). The Supplier
shall not pay any host country taxes, VAT, tariffs, duties, levies, etc. from which this
xxxxx program is exempt. In the event that any exempt charges are paid by the Supplier,
they will not be reimbursed to the Supplier by xxxxxxx unless approved in advance in
writing by xxxxxxxx. The Supplier shall immediately notify xxxxxx if any such taxes
alls Limited * quezue
are assessed against the Supplier of if/subcontractors/suppliers at any tier.
Page 29 of 49
See how the prompt is structured & the run of the GPT action. In this run I used GPT4 Turbo
GPT Response:
"According to the file, if goods are damaged as a result of improper packing, export marking, and preparation for shipment, the liability falls on the Supplier, and the cost associated with such damage shall be deducted from amounts due to the Supplier."
Import & Set-Up
To import the flow, download the GPTRAGQueryLargePDFs_1_0_0_x.zip file at the bottom of this post. Go to https://make.powerapps.com/, go to Solutions on the left-side menu, select Import solution, select Browse, & select the file you just downloaded. Select Next, select Next again. Then provide and/or create the connections for the solution / flow. Select Import & wait for the import to load. Select GPT RAG Query Large PDFs in the list of Solutions. Select the 3 vertical dots to the side of the GPT RAG Query Large PDFs flow title & in the pop-up menu select edit.
Once in the flow, you can delete the Delete after import action.
Then go to the Azure Function DotProduct Code action so we can set up a call to an Azure function to perform cosine similarity / dot product calculations to get our text relevance to query scores. Go to https://portal.azure.com/#view/HubsExtension/BrowseResource/resourceType/Microsoft.Web%2Fsites/kind/.... Select Create and input Resource, Name, & Node.js as the stack, Select Review + create & then select Create. Then select go to Resource.
Select Create Function & select HTTP trigger. Then select Create.
If you get an error, then you may have to refresh the resource page & Select the HTTPtrigger1 link.
Go to Code + Test, remove the placeholder code in the editor, copy the code from the flow Azure Function DotProduct Code action & paste that code into the Azure editor.
On the Azure Function Code + Test editor page, select Get function URL & copy the URL. Go to the flow, inside the "Convert to txt and select most relevant pages" scope & inside the "Apply to each Convert to txt" loop the "HTTP Page text to Query score" action will need that URL pasted to the URI input.
After setting up the Azure Function call for the dot product calculations, we can then check in our Static Variables action & Query action. Adjust MaxFileTextCharacters & the Query to your needs. The larger the MaxFileTextCharacters, the more pages' texts will be sent to the prompt. So the higher the number of characters, the more file context the prompt will use, but the larger the prompt & token count will be.
So the higher the number, the higher the accuracy, but the lower the number, the lower the cost per query.
Next we will set up our call to a custom text embeddings model to get the text embeddings / vectors for the text we are using in our Query.
Go to https://portal.azure.com/#view/Microsoft_Azure_ProjectOxford/CognitiveServicesHub/~/OpenAI. Select Create. Input Resource, Name, & Price tier. Select Next 3 times. Select Create & then select Go to Resource.
Go to OpenAI Studio. Select Models on the left-side menu. Then select text-embedding-3-large & select Deploy. Then input Name for your deployment, in Advanced options set the rate to 200k+ Tokens & select Create.
Go back to the Azure AI Services resource page, & copy the name of the resource you just created. Go to the "HTTP Query text reference" flow action & paste that resource name over where the URI input says YOUR_RESOURCE_NAME.
Then go to the OpenAI Studio page again. Go to Deployments on the left-side menu. Copy the name of the deployment you just made to the clipboard & paste that deployment name to the same URI input, but this time paste over where it says YOUR_DEPLOYMENT_NAME.
Go back to the Azure AI services resources page, select the resource you just created, then select Keys and Endpoint on the left-side menu & copy KEY 1 to the clipboard. Go back to the "HTTP Query text reference" flow action, remove all the text in the api-key value input & paste in the API key.
Go to the "Get file metadata" flow action and select a large PDF for your use-case / for your test.
Next we will set up the final call to the custom Azure GPT4 Turbo model we create.
Go to https://portal.azure.com/#view/Microsoft_Azure_ProjectOxford/CognitiveServicesHub/~/OpenAI. Select Create. The input Resource, Name, & Price Tier. Select Next 3 times. Select Create & then select Go to Resource.
Go to OpenAI Studio. Select Models on the left-side menu. Select what GPT instance you want to use (at the time of writing this GPT4 Turbo 0125 Preview was the latest model). Select Deploy. Then input the Name for your deployment, to to the advanced options & set the rate to 30k+ Tokens. Then select Create.
Go back to the Azure AI services resources page & copy the name of the resource you just created. Go to the flow "HTTP LLM Prompt" action & paste the resource name over where the URI input says YOUR_RESOURCE_NAME.
Then go to the OpenAI Studio page again. Go to Deployments on the left-side menu. Copy the name of the deployment you just made to the clipboard & paste that deployment name to the same URI input, but this time paste over where it says YOUR_DEPLOYMENT_NAME.
Go back to the Azure AI services resources page, select the resource you just created, then select Keys and Endpoint on the left-side menu & copy KEY 1 to the clipboard. Go back to the "HTTP Query text reference" flow action, remove all the text in the api-key value input & paste in the API key.
Thanks for any feedback,
Please subscribe to my YouTube channel (https://youtube.com/@tylerkolota?si=uEGKko1U8D29CJ86).
And reach out on LinkedIn (https://www.linkedin.com/in/kolota/) if you want to hire me to consult or build more custom Microsoft solutions for you.
hiya! impressive flow, thanks for the detailed exposition.
i imported into my project and made the changes you described, but get the following error (in the node "HTTP Query Text Embeddings") when the flow runs:
"The completion operation does not work with the specified model, text-embedding-3-large. Please choose different model and try again."
I put a company report pdf into one drive, and my query is "what companies are mentioned in the document?"
@mm00 What did you use for the URI?
It should be something like
https://xxxxxxx.openai.azure.com/openai/deployments/xxxxxxxxx/embeddings?api-version=2023-05-15
There should not be a /completion part of the uri
@mm00 Ah good catch. Must have copied something from the LLM prompt when resetting the templates. I adjusted the downloads & pictures to fix that.
Thanks,
Bug Fix
I found the Convert to txt loop inside the Convert to txt scope would error if it was passed a page that the Recognize text action found no text on (so the lines parameter was an empty array [ ]).
I changed the "Filter array RemoveUnselectedPageBlanks" action logic to...
@And(greater(length(string(item())), 0),not(equals(empty(item()?['lines']), true)))
to remove any blank pages & avoid this error.