Schema-driven Agentic Document Field Extraction | Extract Structured Data with ADE from LandingAI

By LandingAI

Summary

Topics Covered

  • Iterate your schema until edge cases disappear
  • Send PDFs, scans, and photos in one request
  • Every extracted value links to its source

Full Transcript

[Music] In this video, we'll look at Agentic Document Extraction use cases that focus on field extraction when you have many different input documents that contain similar details, and you want to map all of those details into structured output.

This can present in a couple of different ways.

For example, with utility bills, the originals may be digital PDFs, scanned PDFs, scanned images, or even photographs. Or, in the case of continuing education certificates, the original documents may also contain important logos, figures, and signatures.

In the case of documents such as invoices, you need to expect that the documents may have different currencies or different standard date formats. And lastly, when dealing with documents such as identity documents, we need to be able to plan for documents that contain multiple languages.

Let's see how Agentic Document Extraction accomplishes the task of mapping input documents with similar details to structured output using these four as our test use cases.

For continuing education certificates, I've selected three of them to test in the Visual Playground. I've uploaded this schema, and the schema is requesting just eight items from the certificate, including the number of credits as a numeric and booleans for which particular accreditation categories it belongs to.
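As a rough illustration of what an eight-item schema like this could look like, here is a sketch written as a JSON Schema held in a Python dictionary. The transcript does not list the actual fields, so every field name and the three accreditation categories below are assumptions made for the example; only the shape (one numeric credit count plus boolean category flags) comes from the video.

```python
import json

# Hypothetical field names: the transcript only says the schema has eight
# items, a numeric credit count, and booleans for accreditation categories.
certificate_schema = {
    "type": "object",
    "properties": {
        "recipient_name": {"type": "string", "description": "Person the certificate was issued to"},
        "course_title": {"type": "string", "description": "Title of the course or activity"},
        "provider_name": {"type": "string", "description": "Organization that issued the certificate"},
        "completion_date": {"type": "string", "description": "Completion date as printed on the certificate"},
        "credits": {"type": "number", "description": "Number of credits awarded"},
        # The category names below are placeholders, not real accreditation bodies.
        "is_medical_credit": {"type": "boolean", "description": "True if accredited for the medical category"},
        "is_nursing_credit": {"type": "boolean", "description": "True if accredited for the nursing category"},
        "is_pharmacy_credit": {"type": "boolean", "description": "True if accredited for the pharmacy category"},
    },
}

# Serialize the schema so it can be saved to a .json file or pasted elsewhere.
print(json.dumps(certificate_schema, indent=2))
```

Typing the credit count as a number and the categories as booleans means the results can be summed and filtered later without parsing text.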

Underneath the data, you see all of the metadata, including the reference chunk for where a particular piece of information was found. Let's check this same schema on a few other certificates.

This is the same schema on a different certificate, and I'm happy with the extractions.

Let's check one more. And a third certificate—same schema—happy with those extractions.

For this use case, it would now be possible to switch to the API to create structured outputs such as this, where every row is one certificate and the results can be loaded into an enterprise database very easily.
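The one-row-per-certificate layout can be sketched with the Python standard library. This assumes each extraction has already been returned as a flat dictionary; the records and their field names are made up for illustration and are not output from the API.

```python
import csv
import io

def certificates_to_csv(records):
    """Write one CSV row per certificate; columns follow first-seen key order."""
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record are left blank
    return buffer.getvalue()

# Hypothetical extraction results, one dictionary per certificate.
records = [
    {"recipient_name": "Example Person A", "credits": 1.5, "is_nursing_credit": True},
    {"recipient_name": "Example Person B", "credits": 3.0, "is_nursing_credit": False},
]

csv_text = certificates_to_csv(records)
print(csv_text)
```

A CSV like this is one simple hand-off format; most databases can bulk-load it directly.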

In the case of identity documents such as passports, the information that they contain is highly standardized. I've written an initial schema in English, and let's see how it performs. If you look carefully at the original, you'll see that values such as the surname and the given names are printed both in Greek and in English. So for the surname in my extraction, I actually have a combination of the two languages, which is not what I want. What I need to do is revise my schema to be more specific. Let's start over and upload a second attempt.

This looks much better. You'll note that in the schema itself, I've now created two values for the surname. In the first value, I've asked for it to be returned in the issuing language for the passport. Then I've created surname_english and asked for that to be returned in English. I've repeated that process for the given names and for the nationality.
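The revised passport schema might look something like the sketch below. Only the name surname_english appears in the video; the other field names, the helper function, and the description wording are assumptions that follow the same pattern.

```python
import json

def bilingual_pair(name, label):
    """Build two string fields: one in the issuing language, one in English."""
    return {
        name: {
            "type": "string",
            "description": f"{label} in the issuing language of the passport, exactly as printed",
        },
        f"{name}_english": {
            "type": "string",
            "description": f"{label} in English only",
        },
    }

# Field names other than surname_english are assumed for illustration.
passport_properties = {}
passport_properties.update(bilingual_pair("surname", "Surname"))
passport_properties.update(bilingual_pair("given_names", "Given names"))
passport_properties.update(bilingual_pair("nationality", "Nationality"))

passport_schema = {"type": "object", "properties": passport_properties}
print(json.dumps(passport_schema, indent=2, ensure_ascii=False))
```

Naming the target language in each description is what keeps the two scripts from being merged into a single value.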

Let's try this second schema on another passport.

Here's another example where names are printed both in the primary language of the issuing country as well as in English. Let's apply that second extraction schema that we developed. Excellent. This is exactly what I was looking for in the extraction.

Just to confirm that everything is working as expected, let's take a look at a few more examples. Here's one from the Republic of Sudan and another example from the United Arab Emirates.

In this example, I've learned that not every passport separates the given name and the surname, so this might cause me to revise my schema again until I get exactly what I want.

Let's take a closer look at utility bills such as these and understand how we can convert these into structured output. As I click through the images, you’ll notice that some of them were images and some of them were PDFs. The PDFs then also vary in their page count.

In this example, we sent them all to the API simultaneously, and you can see that they all finished within about 20 seconds of each other.
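Submitting a mixed batch at the same time can be sketched with a thread pool. The extract_fields function here is a hypothetical placeholder, not a call from the Agentic Document Extraction library; in practice its body would be replaced with the real request, and the file names are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def extract_fields(path: Path) -> dict:
    """Hypothetical stand-in for a real extraction request for one file."""
    return {"file": path.name, "kind": path.suffix.lower().lstrip(".")}

def extract_all(paths, max_workers=4):
    """Run extractions concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_fields, paths))

# A mixed batch of PDFs and images, as in the utility bill example.
bills = [Path("bill_01.pdf"), Path("bill_02.jpg"), Path("bill_03.png")]
results = extract_all(bills)
print(results)
```

Threads suit this kind of work because each request spends most of its time waiting on the network rather than on the CPU.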

There are some other interesting things to note in this schema. First, the schema contains a boolean to ask whether the bill includes a usage bar chart for the amount of electricity or gas used. Some of these are electric-only bills versus some of them that are combined electric and gas.

Let's look at these in the Visual Playground.

Here's one of the bills in the set. It is four pages, and you can quickly glance at the parsing results. Notice that it has both electric and gas, and it does include a usage bar chart for both.

On the Extract tab, we've already applied the extraction schema. Here you can see that the usage bar chart is true. We have a total amount due, and that’s broken down into electric and gas charges. Notice that there are also separate meter numbers for both of those, but I've been able to reuse that variable name in the metadata.
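One way a schema could reuse the same meter-number field name for both services is to nest it under separate electric and gas objects. The sketch below is a guess at that layout: the transcript does not show the actual schema, so all field names and the helper are assumptions.

```python
import json

def service_section(service):
    """Nested object reused for each service so field names can repeat."""
    return {
        "type": "object",
        "properties": {
            "charges": {"type": "number", "description": f"Total {service} charges on this bill"},
            "meter_number": {"type": "string", "description": f"Meter number for the {service} service"},
        },
    }

# Assumed field names; only the kinds of values come from the video.
utility_bill_schema = {
    "type": "object",
    "properties": {
        "account_number": {"type": "string", "description": "Customer account number"},
        "total_amount_due": {"type": "number", "description": "Total amount due on this bill"},
        "has_usage_bar_chart": {"type": "boolean", "description": "True if the bill includes a usage bar chart"},
        "electric": service_section("electric"),
        "gas": service_section("gas"),
    },
}

print(json.dumps(utility_bill_schema, indent=2))
```

With this layout an electric-only bill can simply leave the gas section empty.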

Observe that for something like account number, there are actually five different reference chunks. This is because the account number appears multiple times—at least once per page.

Notice also that our usage bar chart has two chunk references, and indeed, we did see two different bar charts in the original.
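Counting how many source chunks back each field can be sketched as follows. The metadata layout used here (a field name mapped to a chunk_references list) is an assumption for illustration; check the actual response format in the documentation before relying on it. The chunk IDs are invented, with counts chosen to mirror the five and two references seen in the video.

```python
def count_chunk_references(extraction_metadata):
    """Return the number of source chunks referenced by each extracted field."""
    return {
        field: len(details.get("chunk_references", []))
        for field, details in extraction_metadata.items()
    }

# Hypothetical metadata shape and chunk IDs.
sample_metadata = {
    "account_number": {"chunk_references": ["c01", "c08", "c15", "c22", "c30"]},
    "has_usage_bar_chart": {"chunk_references": ["c11", "c19"]},
    "total_amount_due": {"chunk_references": ["c03"]},
}

reference_counts = count_chunk_references(sample_metadata)
print(reference_counts)
```

A field with no references at all would be a natural candidate for manual review.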

Here's a very different-looking document. It's a photograph and it's only one page. We've applied the extraction schema. There is no bar chart present and there are no gas charges because this is electric-only.

This particular photograph is clear and decluttered, but the extraction schema works just as well with a lower-quality photograph such as this one. Note that because this is a photograph of just the first page, there are no specific electric or gas charges detailed on page one.

Let's take a look at invoices such as this one from a metal company in India, or this one from a music provider in Germany. Since every organization has invoices, we'll use this section to also share some of the resources to help you get started quickly with this invoice example or another field extraction example of your own.

As you look at these invoices, you may be wondering, What can I extract from them?

The answer to that question is found here on the Extraction Schema Help page in the documentation.

Here you can learn about the supported field types, restricting results to specific values, arrays and nested objects, and get some tips and best practices.
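Restricting a field to specific values is commonly expressed in JSON Schema with the standard enum keyword. The sketch below assumes that keyword is how the restriction is written; the help page is the place to confirm the exact syntax the service accepts. The currency codes and helper function are illustrative only.

```python
# A string field limited to a fixed set of values (standard JSON Schema "enum").
currency_field = {
    "type": "string",
    "enum": ["EUR", "INR", "USD"],
    "description": "Three-letter currency code of the invoice",
}

def is_allowed(field_schema, value):
    """Check an extracted value against the field's enum, if it has one."""
    allowed = field_schema.get("enum")
    return allowed is None or value in allowed

print(is_allowed(currency_field, "EUR"))
print(is_allowed(currency_field, "euros"))
```

A closed list like this keeps free-text variants of the same currency out of the structured output.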

If you're inspired to try field extraction for your own use case, definitely check out this invoices workflow demo. In this folder, you'll also find the schema.

This invoice schema was created using Pydantic, and we do support both Pydantic and JSON schemas. In the schema, you'll see certain invoice-level information such as dates, customers, or suppliers, and you'll see line-item information such as the specific SKU purchased, quantity, and unit price.
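The split between invoice-level fields and repeating line items can be sketched as a nested schema. The demo schema itself was written with Pydantic; this version uses plain JSON Schema in a Python dictionary so that it runs with only the standard library. The field names are assumptions based on the items mentioned in the video, not a copy of the demo schema.

```python
import json

# One line item; repeated for every row on the invoice.
line_item_schema = {
    "type": "object",
    "properties": {
        "sku": {"type": "string", "description": "SKU of the item purchased"},
        "quantity": {"type": "number", "description": "Quantity purchased"},
        "unit_price": {"type": "number", "description": "Price per unit"},
    },
}

# Invoice-level fields plus an array of line items (assumed names).
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "invoice_date": {"type": "string", "description": "Invoice date as printed"},
        "customer_name": {"type": "string", "description": "Customer the invoice was issued to"},
        "supplier_name": {"type": "string", "description": "Supplier that issued the invoice"},
        "line_items": {"type": "array", "items": line_item_schema},
    },
}

print(json.dumps(invoice_schema, indent=2))
```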

When you follow the accompanying notebook, the output is multiple tables, of which Table 3 contains that invoice-level information such as the date, shipment terms, or the currency associated with the invoice, and Table 4 returns all of the individual line items. Here we see three line items associated with Invoice #3.
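Splitting nested extraction results into an invoice-level table and a line-item table can be sketched like this. It is not the notebook's code; the function name, record layout, and sample values are assumptions for illustration.

```python
def split_into_tables(invoices):
    """Return (invoice-level rows, line-item rows) from nested invoice records."""
    main_rows, line_rows = [], []
    for invoice in invoices:
        # Everything except the nested list belongs to the invoice-level table.
        main_rows.append({k: v for k, v in invoice.items() if k != "line_items"})
        for position, item in enumerate(invoice.get("line_items", []), start=1):
            line_rows.append(
                {"invoice_number": invoice["invoice_number"], "line_number": position, **item}
            )
    return main_rows, line_rows

# Hypothetical extraction results.
invoices = [
    {"invoice_number": "3", "currency": "EUR", "line_items": [
        {"sku": "SKU-A", "quantity": 2, "unit_price": 10.0},
        {"sku": "SKU-B", "quantity": 1, "unit_price": 4.5},
        {"sku": "SKU-C", "quantity": 5, "unit_price": 1.2},
    ]},
    {"invoice_number": "4", "currency": "INR", "line_items": [
        {"sku": "SKU-D", "quantity": 1, "unit_price": 99.0},
    ]},
]

main_rows, line_rows = split_into_tables(invoices)
print(len(main_rows), len(line_rows))
```

Carrying the invoice number onto every line-item row is what lets the two tables be joined again later.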

Having processed these invoices and generated these tables—perhaps as part of a daily batch—the logical next step would be to insert that information into a database such as Snowflake.

Here my table named invoices_main is capturing the invoice-level information such as the invoice number, who it was sold to, perhaps the account representative, or the payment terms. Notice in the first couple of columns here, we're also capturing when we processed the invoice and which version of the Agentic Document Extraction library was used.

Then the next table contains the invoice line items. If there's just one row, it means there was just one line item on that invoice. But down here at the bottom, we can see that invoice number 25 has multiple line items, and we're capturing the SKU, description, quantity, and unit price for each one of those.
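The two-table layout can be sketched with SQLite from the Python standard library, used here only as a local stand-in for a warehouse such as Snowflake. The table name invoices_main comes from the video; the line-item table name, the column names, and the sample rows are assumptions.

```python
import sqlite3
from datetime import datetime, timezone

connection = sqlite3.connect(":memory:")
connection.executescript("""
CREATE TABLE invoices_main (
    processed_at   TEXT NOT NULL,   -- when the invoice was processed
    ade_version    TEXT NOT NULL,   -- library version used for extraction
    invoice_number TEXT PRIMARY KEY,
    sold_to        TEXT,
    payment_terms  TEXT
);
CREATE TABLE invoice_line_items (   -- hypothetical table name
    invoice_number TEXT NOT NULL REFERENCES invoices_main(invoice_number),
    sku            TEXT,
    description    TEXT,
    quantity       REAL,
    unit_price     REAL
);
""")

processed_at = datetime.now(timezone.utc).isoformat()
connection.execute(
    "INSERT INTO invoices_main VALUES (?, ?, ?, ?, ?)",
    (processed_at, "0.0.0-example", "25", "Example Customer", "Net 30"),
)
connection.executemany(
    "INSERT INTO invoice_line_items VALUES (?, ?, ?, ?, ?)",
    [("25", "SKU-001", "Example item A", 2, 10.0),
     ("25", "SKU-002", "Example item B", 1, 4.5)],
)
connection.commit()

# One invoice row can have many line-item rows.
line_item_count = connection.execute(
    "SELECT COUNT(*) FROM invoice_line_items WHERE invoice_number = ?", ("25",)
).fetchone()[0]
print(line_item_count)
```

Recording the processing time and library version with each invoice makes it possible to find and re-run rows after a schema or library change.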

Hopefully, this gives you enough information and inspiration to get started with Agentic Document Extraction on your own. Consult the documentation, follow the workflows, and have fun building out your own document extraction use case.

[MUSIC]
