Introducing Ingestum: A Unified Content Ingestion Framework

Today, Sorcero is announcing the open-sourcing of Ingestum™ (pronounced “ingest’em”), our unified content ingestion framework that supports sourcing and transformation of a wide variety of data and document types into a uniform document format.

Written in Python and built around reusable, programmable pipelines, Ingestum is largely agnostic of both source and output formats and is designed to be extended through plugins. It can be deployed as a command-line tool or as a web service. Ingesting content from a variety of formats is a challenge, and the Ingestum framework meets it with an approach that is methodical, reusable, extensible, and scalable.

We're excited about making this project available, and we hope you are, too.

Let's talk about ingestion. More specifically, let's discuss the ingestion of unstructured content: those endless PDFs received from suppliers or retrieved from databases, no two alike, some born digital, others containing an unindexed scanned image.

What about Word® documents—what's in them? XML files with unique tags? How about an audio recording of a Zoom meeting? Email threads? Twitter or RSS feeds? Or even those paper archives that might bring insights about past trends?

Everyone agrees that AI can provide insights from processed text, at scale. But how do you get your text into the AI?

Watch Sorcero's Walter Bender, Martín Abente Lahaye, and Juan Pablo Ugarte present Ingestum at the LibrePlanet 2021 conference.

Ingestum: A new approach to structured and unstructured data

An oft-repeated Gartner figure from 2015 is that 80% of enterprise data is in unstructured documents.[1] McKinsey has said that digitizing the last 20% of documents is particularly challenging.[2] Everyone has encountered today's widely used formats: the ubiquitous Adobe PDF (Portable Document Format), derived from PostScript; the popular Microsoft Office formats (Word, Excel®, PowerPoint®), whether XML-based or the earlier pre-XML versions; HTML web pages (usually containing scripts); and XML, which may be well-formed but by design has arbitrary tags.

Our email accounts, laptops, and the cloud are full of documents in these formats that fulfill different roles: screen and print presentation, authoring, lightweight screen presentation, data transmission or export, and scans of paper documents.

However, most of today's formats were not designed to facilitate the extraction of text, or more precisely, of structured text.

Of course, XML could potentially fulfill this role. But its extensibility poses a challenge: ever try building out a schema for a mysterious XML file that doesn't refer to one? And what about newer formats useful in the enterprise, such as a transcript of the audio from an H.264/AAC encoded .mp4 format video presentation file?

Naturally, you could write a one-off script to, say, pull text out of similar PDFs (invoices from a supplier), or an HTML page, or even a stream like a social media feed. But that's inefficient and doesn't scale.
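To make the point concrete, here is a minimal sketch of what such a one-off script might look like for an HTML page, using only Python's standard library. The class name TextOnly, the helper html_to_text, and the sample markup are illustrative assumptions and are not part of Ingestum.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """One-off extractor: collect visible text, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a <script> or <style> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample page, for illustration only.
sample = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><h1>Invoice 17</h1><p>Total: 40 bolts</p></body></html>"
)
text = html_to_text(sample)
print(text)
```

A script like this handles one kind of page. Each new layout, format, or stream needs its own variant, and none of them share structure or output conventions.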

What if there were a way to model the ingestion process, to easily drop in new models, to update existing ones? What if output were structured and flexible? Best of all, what if such a framework were available to everyone needing to extract text from unstructured documents?

How we built the Ingestum framework

At Sorcero, we built the Ingestum framework ourselves after searching for, and not finding, an ingestion product that suited our needs.

We studied the ingestion market in depth—over 170 potential suppliers—and made some surprising discoveries:

The market is extremely fragmented.
There are very, very few “pure players” focused solely on ingestion.
There are a dozen AI suppliers who market their ingestion platform separately.
There are many dozens of AI suppliers who tie their ingestion to their AI offer, sometimes calling it “data preparation”.
A common approach is to “PDF” everything, then apply optical character recognition (OCR).
There are very, very few AI suppliers who have open-sourced any part of their ingestion code.

It seems the usual case is this: When you choose an AI supplier, you get whatever ingestion they have. If that doesn't suit you, you'll have to look elsewhere. Garbage in, garbage out.

At Sorcero, we want to turn this on its head. How about a truly useful (and free and open) ingestion platform that can integrate any code as a plugin, or be integrated into any workflow as a preprocessor, or parallel processor?

What if the project has a principal corporate sponsor but carries a copyleft license, the GNU Lesser General Public License (LGPL), and is available to anyone to run for any purpose, to study and change, to redistribute, and to distribute copies of their modified versions if they wish?

Ingestion is not Sorcero's core business. Enrichment of ingested content through natural language understanding (NLU) is.

We believe that the promises of data science will be better served by building an open-source ingestion framework with great open-source Python components like Beautiful Soup, Camelot, PDFMiner, pyexcel, Twython, Python-tesseract, DeepSpeech, and so on.

Code in C or C++ modules? No worries, drop them in with a Python binding. If you have proprietary ingestion code you don't want to publish, the LGPL does allow you to add your code as a plugin in your implementation. But we believe all boats will rise if AI firms contribute plugins to the framework.

Why we designed Ingestum

Ingestum, from the Latin word meaning to ingest, to carry in, or to throw in, was designed to address three challenges:

To facilitate the writing of scripts to extract unstructured content from arbitrary sources and formats
To provide a framework for extracting content from the diverse universe of source formats
To allow integration with both Python scripts and services at many levels of granularity

What's the Ingestum approach?

The Ingestum approach, detailed in our documentation [3], has six major concepts:

1. Sources: source files or data streams, converted to Ingestum Documents, extensible JSON-encoded formats for further processing;

2. Documents: intermediaries to which transformations are applied, for example tabular data, free text data, or Collection documents;

3. Transformers: specific operations on all or part of their input, each returning an output Document;

4. Pipes & Pipelines: a Pipe is a sequence of Transformers and a Pipeline is a collection of Pipes, following the tried-and-true UNIX® approach;

5. Conditionals: logical conditions to apply a Transformer selectively; useful with complex unstructured data to e.g. extract only tables, or text without tables;

6. Manifests: expressed in JSON, these describe Sources and Pipelines and their parameters, simplifying command-line invocations of Ingestum.
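To show how these six concepts fit together, here is a toy model in plain Python. It is a sketch for illustration only and does not use Ingestum's actual API: the Document class, the transformer functions, the REGISTRY lookup, and the manifest layout are all hypothetical names invented for this example. See the documentation [3] for the real interfaces.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# NOTE: every name below is hypothetical; this is not Ingestum's API.


@dataclass
class Document:
    """A JSON-serializable intermediary passed between transformation steps."""
    type: str        # "text" or "collection" in this toy model
    content: object  # a string for "text", a list of strings for "collection"

    def to_json(self) -> str:
        return json.dumps({"type": self.type, "content": self.content})


# A Transformer takes a Document and returns a new Document.
Transformer = Callable[[Document], Document]


def split_lines(doc: Document) -> Document:
    """Transformer: turn a text Document into a collection of non-empty lines."""
    lines = [ln.strip() for ln in doc.content.splitlines() if ln.strip()]
    return Document("collection", lines)


def keep_if(predicate: Callable[[str], bool]) -> Transformer:
    """Conditional: build a Transformer that keeps only matching items."""
    def transform(doc: Document) -> Document:
        return Document("collection", [x for x in doc.content if predicate(x)])
    return transform


def looks_like_table_row(item: str) -> bool:
    """Crude stand-in for table detection: a row contains a column separator."""
    return "|" in item


def run_pipe(doc: Document, pipe: List[Transformer]) -> Document:
    """A Pipe applies its Transformers in sequence."""
    for transformer in pipe:
        doc = transformer(doc)
    return doc


# Maps the step names used in a manifest to Transformer functions.
REGISTRY: Dict[str, Transformer] = {
    "split_lines": split_lines,
    "keep_table_rows": keep_if(looks_like_table_row),
    "keep_text": keep_if(lambda item: not looks_like_table_row(item)),
}

# A manifest names the Source and the Pipeline (a collection of named Pipes).
MANIFEST_JSON = r"""
{
  "source": "Quarterly report\nitem | qty\nbolts | 40\nEnd of report",
  "pipeline": {
    "tables": ["split_lines", "keep_table_rows"],
    "prose": ["split_lines", "keep_text"]
  }
}
"""


def run_manifest(manifest_json: str) -> Dict[str, Document]:
    """Convert the Source to a Document, then run every Pipe over it."""
    manifest = json.loads(manifest_json)
    source_doc = Document("text", manifest["source"])
    return {
        name: run_pipe(source_doc, [REGISTRY[step] for step in steps])
        for name, steps in manifest["pipeline"].items()
    }


results = run_manifest(MANIFEST_JSON)
for name, doc in results.items():
    print(name, doc.to_json())
```

In this sketch the two Pipes share the same first step and differ only in the Conditional they apply, which mirrors the "extract only tables, or text without tables" case mentioned above. Because the manifest is plain JSON, changing the source or reordering steps does not require touching the code.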

These components form a unified framework that has been used for the ingestion of sources as diverse as PDF datasheets from a casualty insurance company, research papers from ProQuest and PubMed, XML documents from US government agencies, email threads, Twitter feeds, XLS files from a pharmaceutical company, YouTube and Vimeo videos, and audio recordings of meetings.

Now, you can use the Ingestum framework on your unstructured data.

Ingestum is free and very easy to install, prototype, test, and deploy.

By its nature, processing unstructured data requires some experimentation and iteration. Ingestum shines at this since the process is broken down into small steps. Today, there are dozens of example Pipelines in the Ingestum repository on GitLab covering a broad range of sources.

These provide a starting point for building new, custom pipelines. (The plethora of Pipeline examples simplifies developer onboarding, as it is easier to modify an existing Pipeline than to create a new one.)

The Transformer library has broad coverage and adding new Transformers is typically a matter of just a few hours of effort. Ingestum can be made available through a services framework such as FastAPI and incorporated into low-code and no-code environments.

Of course, Ingestum is not perfect. For example, the default OCR library may struggle with handwritten documents; the default speech-to-text transcoding can always be improved. But it is easy to drop a preferred library in through Ingestum’s plugin mechanism. All that said, Ingestum's modular architecture means Modifiers can be added for all sorts of document and stream types and formats, and the platform will grow as it is deployed.

Our vision for Ingestum is to make unstructured text easily available for natural language processing (NLP), facilitating the creation of a “knowledge fabric” that enables further enrichment by AI technology. Download Ingestum today [4] and join us!

For a more detailed discussion of Ingestum, read our white paper [5].

[1] Organizations Will Need to Tackle Three Challenges to Curb Unstructured Data Glut and Neglect. Published 17 June 2015. ID G00275931 (See https://www.gartner.com/en/documents/3077117/organizations-will-need-to-tackle-three-challenges-to-cu)

[2] Best-in-class digital document processing: A payer perspective (See https://www.mckinsey.com/industries/healthcare-systems-and-services/our-insights/best-in-class-digital-document-processing-a-payer-perspective#)

[3] https://sorcero.gitlab.io/community/ingestum/

[4] git clone https://gitlab.com/sorcero/community/ingestum.git

[5] Ingestum White Paper

Contact Sorcero to learn more about what Language Intelligence can do to empower experts at your enterprise.