Extract Fields and Tables from Documents in Python using Document AI
4/30/2026 - Brian O'Neill


A lot of the documents that move through enterprise document processing workflows are structured (or at least semi-structured). For example, invoices tend to follow predictable layouts, and tax forms always have labeled fields. Financial data shows up in reports and spreadsheets with consistent row and column arrangements. The information is all there, but extracting data from these documents in a way that’s reliable across varying formats and document quality levels is a harder problem than it seems.

Rules-based extraction can work if the input is perfectly consistent. In practice, however, it rarely is. Brittle parsing logic tends to fail on difficult edge cases, like layout shifts and fundamental differences in how vendors or institutions format their documents. What works better is a solution that can reason through document structure rather than simply pattern-match against it.

Extracting fields and tables with Cloudmersive Document AI

The Cloudmersive Extract Fields and Tables API applies intelligent reasoning to “simple” data extraction. It identifies fields (both labeled and unlabeled) and structured tables across a document, and it returns the extracted content in a clean, hierarchical format that directly follows the original data organization. It can handle a range of common document formats, including DOCX, PDF, XLSX, PPTX, EML, MSG, JPG, PNG, and WEBP.

Walking through an example API call in Python

In this brief walkthrough, we’ll use code from the Document AI Swagger page to build an example API call in Python and walk through the API response structure. We’ll use Google Colab (Python 3) as a staging ground to simulate a live environment.

To follow along with this walkthrough, you’ll need a Cloudmersive API key, which you can get by signing up on the Cloudmersive website. Signing up for a free account gets you 800 API calls per month with no commitments, and that’ll be enough to test small documents. Note that the Extract Fields and Tables API consumes 100 API calls per page of any input document, and you’ll need to factor that in when choosing test documents (a 1-page example invoice will do the trick).
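Because the per-page cost decides which test documents fit inside the free tier, a quick budget check can be handy. The sketch below only restates the arithmetic from the paragraph above; the function names and constants are illustrative and are not part of the Cloudmersive SDK.

```python
CALLS_PER_PAGE = 100       # cost per page quoted above for this API
FREE_MONTHLY_CALLS = 800   # free-tier allowance quoted above


def calls_needed(page_count: int) -> int:
    """Return how many API calls a document with page_count pages consumes."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * CALLS_PER_PAGE


def fits_free_tier(page_count: int, calls_used: int = 0) -> bool:
    """Check whether a document still fits in this month's free allowance."""
    return calls_used + calls_needed(page_count) <= FREE_MONTHLY_CALLS


print(calls_needed(1))        # a 1-page invoice
print(fits_free_tier(3, 600)) # 3 more pages after 600 calls already used
```

Under these numbers, a fresh free-tier month covers eight pages in total, so a 1-page invoice leaves plenty of room for retries.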

Installing the SDK

To get started, we’ll first run the below pip command to install the Document AI SDK in our environment.

pip install cloudmersive-documentai-api-client

Importing Resources

Next up, we’ll add the imports we need to pull resources into our file. Note that nothing in our request depends on the print_function or time imports, so we can remove those if we want. We do need to keep pprint, which we’ll use to print the API response.

from __future__ import print_function
import time
import cloudmersive_documentai_api_client
from cloudmersive_documentai_api_client.rest import ApiException
from pprint import pprint

Structuring the Request

We’ll now build out the meat of our request body using the below code:

# Configure API key authorization: Apikey
configuration = cloudmersive_documentai_api_client.Configuration()
configuration.api_key['Apikey'] = 'YOUR_API_KEY'

# create an instance of the API class
api_instance = cloudmersive_documentai_api_client.ExtractApi(cloudmersive_documentai_api_client.ApiClient(configuration))
recognition_mode = 'recognition_mode_example' # str | Optional; Recognition mode - Advanced (default) provides the highest accuracy but slower speed, while Normal provides faster response but lower accuracy for low quality images (optional)
preprocessing = 'preprocessing_example' # str | Optional: Set the level of image pre-processing to enhance accuracy.  Possible values are 'Auto' (default), 'Paged', and 'Compatability'.  Use 'Paged' to treat each page as a separate document for extraction (requires Advanced recognitionMode).  Default is Auto. (optional)
input_file = '/path/to/inputfile' # file | Input document, or photos of a document, to extract data from (optional)

try:
    # Extract All Fields and Tables of Data from a Document using AI
    api_response = api_instance.extract_all_fields_and_tables(recognition_mode=recognition_mode, preprocessing=preprocessing, input_file=input_file)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ExtractApi->extract_all_fields_and_tables: %s\n" % e)

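Before we handle our request parameters, we’ll first deal with the configuration block. This is where we set our API key (and optionally our host). Since we’re working in Colab, we’ll read the key from a Colab secret named freekey through the userdata module (imported from google.colab) rather than pasting it into the code: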

configuration = cloudmersive_documentai_api_client.Configuration()
configuration.api_key['Apikey'] = userdata.get('freekey')
configuration.host = "api.cloudmersive.com"

We’ve set configuration.host to “api.cloudmersive.com” in this instance, which is the correct endpoint for free-tier subscriptions like the one we’re using in this walkthrough.
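The userdata module is only available inside Colab, so the same code needs another source for the key when it runs elsewhere. One possible approach, sketched below, is to read the key from an environment variable. The variable name CLOUDMERSIVE_API_KEY and the load_api_key helper are hypothetical choices for this example, not something the SDK defines.

```python
import os


def load_api_key(env_var: str = "CLOUDMERSIVE_API_KEY") -> str:
    """Read the API key from an environment variable (name is our own choice)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early instead of spending time on a request that cannot authenticate
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key
```

The returned value can then be assigned to configuration.api_key['Apikey'] in place of the userdata.get call.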

Now we’ll take a closer look at our request parameters. It’s worth understanding how these work before we start setting them.

Our first parameter, recognitionMode, works the same way here as it does across most other Cloudmersive Document AI endpoints. Advanced is the default value, and it gives the highest accuracy. Normal trades some accuracy for faster processing, and it’s a reasonable choice if we’re dealing with high-resolution, clean inputs. If our inputs have unpredictable levels of quality, we should keep recognitionMode at its default setting.

Our second parameter, preprocessing, is specific to this endpoint. It gives us more control over how the API handles multi-page documents. The default value is Auto, which tends to work well for most inputs. Paged, on the other hand, treats each page as a separate document for extraction purposes, which is useful when different pages contain independent structured data rather than a single continuous document. Note that Paged requires Advanced recognition mode to function. Compatibility is also available for cases where standard preprocessing produces inconsistent results (e.g., on unusual input).
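The one hard constraint described above (Paged requires Advanced) is easy to check before a request is sent. The sketch below uses a hypothetical helper, build_extract_kwargs, that assembles keyword arguments for extract_all_fields_and_tables and leaves out any option we haven't set. It assumes that omitting an optional parameter lets the API fall back to its default, which is how the SDK comments describe these parameters.

```python
from typing import Optional


def build_extract_kwargs(input_file: str,
                         recognition_mode: Optional[str] = None,
                         preprocessing: Optional[str] = None) -> dict:
    """Assemble keyword arguments, omitting unset options so defaults apply."""
    # An unset recognition mode means the default (Advanced), which Paged accepts
    if preprocessing == "Paged" and recognition_mode not in (None, "Advanced"):
        raise ValueError("Paged preprocessing requires Advanced recognition mode")
    kwargs = {"input_file": input_file}
    if recognition_mode:
        kwargs["recognition_mode"] = recognition_mode
    if preprocessing:
        kwargs["preprocessing"] = preprocessing
    return kwargs


print(build_extract_kwargs("Invoice.pdf"))
print(build_extract_kwargs("Invoice.pdf", preprocessing="Paged"))
```

The result can be unpacked straight into the call, e.g. api_instance.extract_all_fields_and_tables(**build_extract_kwargs('Invoice.pdf')).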

The example document we’ll use in this walkthrough is an invoice, and we’ll leave all settings at their defaults in our request. Here’s the completed request code:

from __future__ import print_function
from google.colab import userdata
import time
import cloudmersive_documentai_api_client
from cloudmersive_documentai_api_client.rest import ApiException
from pprint import pprint

# Configure API key authorization: Apikey
configuration = cloudmersive_documentai_api_client.Configuration()
configuration.api_key['Apikey'] = userdata.get('freekey')
configuration.host = "api.cloudmersive.com"

# create an instance of the API class
api_instance = cloudmersive_documentai_api_client.ExtractApi(cloudmersive_documentai_api_client.ApiClient(configuration))
recognition_mode = '' # str | Optional; Recognition mode - Advanced (default) provides the highest accuracy but slower speed, while Normal provides faster response but lower accuracy for low quality images (optional)
preprocessing = '' # str | Optional: Set the level of image pre-processing to enhance accuracy.  Possible values are 'Auto' (default), 'Paged', and 'Compatability'.  Use 'Paged' to treat each page as a separate document for extraction (requires Advanced recognitionMode).  Default is Auto. (optional)
input_file = 'Invoice.pdf' # file | Input document, or photos of a document, to extract data from (optional)

try:
    # Extract All Fields and Tables of Data from a Document using AI
    api_response = api_instance.extract_all_fields_and_tables(recognition_mode=recognition_mode, preprocessing=preprocessing, input_file=input_file)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ExtractApi->extract_all_fields_and_tables: %s\n" % e)

Interpreting the API Response

After a few moments, we’ll get our API response structured like this:

{
    'field_results': [
        {'additional_field_string_values': None, 'field_name': 'Invoice No.',     'field_string_value': 'INV-2024-00847'},
        {'additional_field_string_values': None, 'field_name': 'Invoice Date',    'field_string_value': 'March 14, 2024'},
        {'additional_field_string_values': None, 'field_name': 'Due Date',        'field_string_value': 'April 14, 2024'},
        {'additional_field_string_values': None, 'field_name': 'Bill To',         'field_string_value': 'Northgate Logistics LLC\n880 Industrial Pkwy, Chicago, IL 60601\nAttn: Accounts Payable'},
        {'additional_field_string_values': None, 'field_name': 'Subtotal',        'field_string_value': '$2,180.00'},
        {'additional_field_string_values': None, 'field_name': 'Tax (6.25%)',     'field_string_value': '$136.25'},
        {'additional_field_string_values': None, 'field_name': 'Amount Due',      'field_string_value': '$2,316.25'},
        {'additional_field_string_values': None, 'field_name': 'Payment Terms',   'field_string_value': 'Net 30. Please remit payment via ACH or check payable to Acme Supply Co.'},
    ],
    'successful': True,
    'table_results': [
        {
            'title': 'Line Items',
            'rows': [
                {'cells': [{'cell_header': 'Description', 'cell_value': 'Industrial Steel Brackets (12")'},  {'cell_header': 'Qty', 'cell_value': '50'}, {'cell_header': 'Unit Price', 'cell_value': '$14.00'},  {'cell_header': 'Total', 'cell_value': '$700.00'}]},
                {'cells': [{'cell_header': 'Description', 'cell_value': 'Heavy-Duty Mounting Hardware Set'}, {'cell_header': 'Qty', 'cell_value': '20'}, {'cell_header': 'Unit Price', 'cell_value': '$22.50'},  {'cell_header': 'Total', 'cell_value': '$450.00'}]},
                {'cells': [{'cell_header': 'Description', 'cell_value': 'Warehouse Shelving Unit (72"H)'},   {'cell_header': 'Qty', 'cell_value': '5'},  {'cell_header': 'Unit Price', 'cell_value': '$189.00'}, {'cell_header': 'Total', 'cell_value': '$945.00'}]},
                {'cells': [{'cell_header': 'Description', 'cell_value': 'Freight & Handling'},               {'cell_header': 'Qty', 'cell_value': '1'},  {'cell_header': 'Unit Price', 'cell_value': '$85.00'},  {'cell_header': 'Total', 'cell_value': '$85.00'}]},
            ]
        }
    ]
}

We’ll notice this response splits extracted invoice content into two arrays: field_results for key-value data, and table_results for structured tabular data. In practice, this is the difference between extracting things like invoice numbers, pay-to names, and due dates on the one hand, and invoice line items on the other.

Each entry in field_results contains a field_name and a field_string_value, which map to a specific field and its extracted content respectively. While it’s rare to find unlabeled fields in invoices, the API will infer field names in those cases.
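For downstream code, a flat lookup by field name is often more convenient than a list of entries. The sketch below works on the dictionary form of the response, using the keys exactly as they print in the response above. The fields_to_dict helper is our own, and the sample data is a trimmed copy of that response. With the live SDK object, a dictionary could likely be obtained through api_response.to_dict(), but that method is an assumption about the generated client and worth confirming in the SDK.

```python
def fields_to_dict(response: dict) -> dict:
    """Map each extracted field name to its string value."""
    fields = {}
    for entry in response.get("field_results") or []:
        # If a name repeats, the later entry overwrites the earlier one
        fields[entry["field_name"]] = entry["field_string_value"]
    return fields


sample_response = {
    "field_results": [
        {"additional_field_string_values": None,
         "field_name": "Invoice No.", "field_string_value": "INV-2024-00847"},
        {"additional_field_string_values": None,
         "field_name": "Amount Due", "field_string_value": "$2,316.25"},
    ],
    "successful": True,
}

fields = fields_to_dict(sample_response)
print(fields["Invoice No."])
print(fields.get("PO Number", "not found"))
```

Using fields.get with a fallback keeps the code from failing when a given vendor’s invoice lacks a field we expected.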

The table_results array is where the response structure really shines. Each table entry carries a title value (if one was detected), followed by a rows array. Each row then contains a cells array, and each cell contains both a cell_header and a cell_value. This structure preserves the column context for every piece of data in the table, which means we don’t need to do any positional inference to understand what a given value represents.
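Because every cell carries its own header, each row converts directly into a record keyed by column name. The sketch below does that for a trimmed copy of the Line Items table above, then checks each line’s arithmetic. The helper names are our own, and the Qty, Unit Price, and Total headers are specific to this sample invoice, so other documents will need their own column names and value formats.

```python
from decimal import Decimal


def table_to_records(table: dict) -> list:
    """Turn each row into a {column header: cell value} dictionary."""
    return [
        {cell["cell_header"]: cell["cell_value"] for cell in row["cells"]}
        for row in table["rows"]
    ]


def parse_money(text: str) -> Decimal:
    """Parse a string such as '$2,180.00' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


def line_total_mismatches(records: list) -> list:
    """Return the records where Qty x Unit Price does not equal Total."""
    return [
        r for r in records
        if int(r["Qty"]) * parse_money(r["Unit Price"]) != parse_money(r["Total"])
    ]


line_items = {
    "title": "Line Items",
    "rows": [
        {"cells": [{"cell_header": "Description", "cell_value": 'Industrial Steel Brackets (12")'},
                   {"cell_header": "Qty", "cell_value": "50"},
                   {"cell_header": "Unit Price", "cell_value": "$14.00"},
                   {"cell_header": "Total", "cell_value": "$700.00"}]},
        {"cells": [{"cell_header": "Description", "cell_value": "Freight & Handling"},
                   {"cell_header": "Qty", "cell_value": "1"},
                   {"cell_header": "Unit Price", "cell_value": "$85.00"},
                   {"cell_header": "Total", "cell_value": "$85.00"}]},
    ],
}

records = table_to_records(line_items)
print(records[1]["Description"])
print(line_total_mismatches(records))
```

Decimal is used instead of float so that currency comparisons stay exact. A non-empty mismatch list is a cheap signal that a row deserves a second look before it reaches an accounting system.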

Conclusion

The Cloudmersive Extract Fields and Tables API makes structured data extraction from documents straightforward. It’s fully automated, adaptive to a variety of layouts, and customizable depending on the expected quality of input.

Rather than building and maintaining custom parsing logic for every document layout you might encounter, you can use a single API call to handle structure recognition and hand back data that’s ready to use.

For help fitting this API into a larger document processing pipeline, feel free to reach out to our sales team.
