AWS Comprehend — Using the newest Layout Aware feature

Michael Tran
3 min read · Oct 10, 2023

Introduction

In the realm of Intelligent Document Processing (IDP), Amazon Comprehend has proven to be a powerful tool for extracting valuable insights from unstructured data. However, as I set out to train a custom classifier with it, I ran into a challenge: the training data had to be prepared in exactly the format Comprehend expects, and my source files were far from that format.

The Challenge

My objective was to create a Comprehend custom classification model that seamlessly integrated with my IDP pipeline. To achieve this, I turned to Amazon’s detailed documentation on how to leverage their Comprehend Document Classifier, particularly focusing on the newly added layout support for enhanced accuracy. (https://aws.amazon.com/blogs/machine-learning/amazon-comprehend-document-classifier-adds-layout-support-for-higher-accuracy/)

The Manifest File Conundrum

One crucial aspect of this process was the requirement for a manifest file in a specific format. This file served as a roadmap for Comprehend: each row is a CSV record with no header that gives the class of a document, its file name, and the page number within that file. Example entries looked like this:

discharge_summary,summary-1.pdf,1
discharge_summary,summary-2.pdf,1
invoice,invoice-1.pdf,1
invoice,invoice-1.pdf,2
invoice,invoice-2.pdf,1
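
To make the three columns concrete, here is a minimal sketch that reads manifest text like the sample above into typed rows. The `ManifestRow` type and `parse_manifest` function are names made up for this illustration; they are not part of any AWS SDK.

```python
import csv
import io
from typing import NamedTuple


class ManifestRow(NamedTuple):
    label: str      # document class, e.g. "invoice"
    document: str   # file name of the training document
    page: int       # 1-based page number within that document


def parse_manifest(text: str) -> list[ManifestRow]:
    """Parse header-less manifest CSV text into typed rows."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        label, document, page = record
        rows.append(ManifestRow(label, document, int(page)))
    return rows


sample = """discharge_summary,summary-1.pdf,1
invoice,invoice-1.pdf,1
invoice,invoice-1.pdf,2
"""
rows = parse_manifest(sample)
```

Note that the same file name can appear on several rows, once per page that should be used for training.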

The Excel Twist

However, a curveball was thrown my way when a client provided an Excel document referencing classes and file locations. To add to the complexity, I discovered that the files were in TIF format, necessitating an entirely different approach for semi-structured data within Comprehend.

Navigating the Solution

To tackle this, I embarked on a multi-step process.

  1. Parse the Excel document, extracting the class and file location for each document.
  2. List all the objects in the designated S3 folder, in preparation for the TIF file treatment.
  3. Split each individual TIF file into single pages.
  4. Save only the first 4 pages into a new folder. I found these to be the most uniquely identifying pages in an envelope of documents.
  5. Keeping the S3 URL reference, tie each new object to its class in a list in memory.
  6. Finally, convert that list to CSV format without a header, with the number “1” in the last column. The documentation requires that each split TIF reference its page number as “1”.
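
Steps 4 to 6 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code I actually ran: the `split/` prefix, the `-pageN.tif` naming scheme, and the `(class, key, page count)` tuples are all hypothetical, and the S3 listing and upload are left out.

```python
import csv
import io

MAX_PAGES = 4  # keep only the first four pages of each TIF


def page_key(source_key: str, page_index: int, out_prefix: str = "split/") -> str:
    """Hypothetical key for a single-page TIF split out of source_key."""
    stem = source_key.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return f"{out_prefix}{stem}-page{page_index + 1}.tif"


def build_manifest(documents, max_pages: int = MAX_PAGES) -> str:
    """Build header-less manifest CSV text.

    documents: iterable of (label, source_key, page_count) tuples.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, source_key, page_count in documents:
        for index in range(min(page_count, max_pages)):
            # Each split TIF holds one page, so its page number is always 1.
            writer.writerow([label, page_key(source_key, index), 1])
    return buffer.getvalue()


docs = [
    ("invoice", "raw/invoice-1.tif", 6),
    ("discharge_summary", "raw/summary-1.tif", 2),
]
manifest = build_manifest(docs)
```

A six-page invoice contributes four rows and a two-page summary contributes two, each pointing at its own single-page file.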

TIF File Challenges

TIF files presented a unique set of challenges. Determining whether they were multipage was critical, as Comprehend required a specific treatment for image formats. I learned that for multi-frame TIF files, they needed to be split into separate TIFs to be used effectively in the training process.
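
One way to detect and split a multi-frame TIF locally is with the Pillow library. The sketch below assumes Pillow is installed and that the files have already been downloaded from S3; `split_tif` and `frames_to_keep` are illustrative names, and the four-page cap mirrors the choice described earlier.

```python
from pathlib import Path


def frames_to_keep(frame_count: int, max_pages: int = 4) -> range:
    """Indices of the frames to export: the first max_pages, or fewer."""
    return range(min(frame_count, max_pages))


def split_tif(source: Path, out_dir: Path, max_pages: int = 4) -> list[Path]:
    """Write the leading frames of a TIF as separate single-page TIFs."""
    # Pillow is a third-party dependency; it is imported here so the
    # helper above stays usable without it.
    from PIL import Image

    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with Image.open(source) as image:
        # n_frames is 1 for a single-page TIF and greater for multi-frame ones.
        frame_count = getattr(image, "n_frames", 1)
        for index in frames_to_keep(frame_count, max_pages):
            image.seek(index)  # move to the requested frame
            target = out_dir / f"{source.stem}-page{index + 1}.tif"
            # Without save_all=True, only the current frame is written.
            image.save(target, format="TIFF")
            written.append(target)
    return written
```

Calling `split_tif(Path("envelope.tif"), Path("split"))` on a hypothetical seven-page file would write four single-page files into `split/`.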

Selective Extraction

Considering documents like ‘invoice-1.pdf,’ which had multiple pages, it was essential to include all relevant pages in the classification dataset. A PDF can list each page on its own row with its page number, as in the example manifest. Image formats such as PNG and TIFF are different: their page number value must always be 1. This meant that I had to meticulously extract the relevant pages from each multi-page TIF into separate single-page files.

Streamlining for Comprehend

Once I had performed these operations, the next step was to consolidate the newly extracted files into a common directory for Comprehend to access. This ensured a seamless flow of data for classification.

The Final Pieces

In the end, I had two critical components ready for Comprehend: the directory space containing the prepared files and the meticulously crafted manifest file. With these in hand, I was poised to achieve accurate and meaningful results in my document classification efforts.

A few other hiccups were related to Comprehend's quotas:

  1. You cannot have more than 10,000 pages across all documents.
  2. Your annotation (CSV) file cannot exceed 5 MB.
  3. Your document corpus size (meaning the directory where you save all your individual files) cannot exceed 10 GB.
  4. Each file (TIFF, PDF, etc.) cannot exceed 10 MB.
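
A quick pre-flight check against these limits can save a failed training job. The sketch below hard-codes the numbers listed above and assumes binary units (1 MB = 1024 × 1024 bytes), which is my assumption rather than something the quota list states; `quota_violations` is a hypothetical helper, and the limits should be checked against the current AWS quotas.

```python
MB = 1024 * 1024  # assumption: binary megabytes
GB = 1024 * MB

# Limits as listed in this post; verify against the current quotas.
MAX_TOTAL_PAGES = 10_000
MAX_ANNOTATION_BYTES = 5 * MB
MAX_CORPUS_BYTES = 10 * GB
MAX_FILE_BYTES = 10 * MB


def quota_violations(
    file_sizes: dict[str, int], total_pages: int, annotation_bytes: int
) -> list[str]:
    """Return a human-readable message for every limit that is exceeded."""
    problems = []
    if total_pages > MAX_TOTAL_PAGES:
        problems.append(f"{total_pages} pages exceeds {MAX_TOTAL_PAGES}")
    if annotation_bytes > MAX_ANNOTATION_BYTES:
        problems.append("annotation file exceeds 5 MB")
    if sum(file_sizes.values()) > MAX_CORPUS_BYTES:
        problems.append("document corpus exceeds 10 GB")
    for name, size in file_sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name} exceeds 10 MB")
    return problems


ok = quota_violations({"a.tif": 2 * MB}, total_pages=100, annotation_bytes=1_000)
bad = quota_violations({"big.tif": 11 * MB}, total_pages=10_001, annotation_bytes=1_000)
```

Here `ok` is an empty list, while `bad` reports both the page count and the oversized file.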

Conclusion

The journey to seamlessly integrate Amazon Comprehend into my IDP pipeline was certainly filled with twists and turns. However, by navigating through the intricacies of manifest files, Excel documents, and TIF files, I emerged with a robust system that promises accurate and insightful document classification. This experience underscored the importance of adaptability and attention to detail in the world of Intelligent Document Processing.
