August 31, 2021

New Sources Added to Sorcero's Content Ingestion Library

Endless PDFs, no two alike -- some created digitally and others produced as scanned images. How about Word documents or XML files with unique tags? Audio recordings of Zoom meetings are certainly more popular than ever. Email threads, Twitter, RSS feeds, paper archives? Our email accounts, laptops, and the cloud are full of content in these formats.

Unfortunately, most of these formats were not designed to facilitate the extraction of text or, more precisely, structured text. Many people acknowledge the benefits and insights artificial intelligence (AI) can provide from processed text, but how do you get the text into the AI?

What is Ingestum? 

Sorcero's Ingestum is an extensible, scalable, easy-to-use content ingestion framework. Ingestum enables a wide variety of data and document types to be sourced and transformed into a uniform document format. For example, it can take an HTML file, remove all of its HTML tags, and extract only the human-visible text. It can also take a PDF file, again extracting only the human-visible text. The resulting documents are indistinguishable from any other text documents, regardless of the source. This process is called "ingestion."
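To make the HTML case concrete, here is a minimal sketch of the idea using only Python's standard library. It is not Ingestum's API: the `TextDocument` class, the `ingest_html` function, and the lists of hidden and block-level tags are assumptions made up for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class TextDocument:
    """Hypothetical uniform output: plain text plus where it came from."""
    content: str
    origin: str


class VisibleTextParser(HTMLParser):
    # Elements whose contents are not shown as page text (assumed list).
    HIDDEN = {"script", "style", "head", "noscript", "template"}
    # Elements that should separate words from their neighbors (assumed list).
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.HIDDEN:
            self._hidden_depth = max(0, self._hidden_depth - 1)
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_data(self, data):
        # Keep text only when we are outside every hidden element.
        if self._hidden_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces, then collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def ingest_html(html, origin="inline"):
    """Strip tags and hidden elements, returning a uniform text document."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return TextDocument(content=parser.text(), origin=origin)
```

Calling `ingest_html(page_source, origin="example.html")` returns a `TextDocument` whose `content` holds only the text a reader would see. A PDF or audio source would need a different extractor, but it could return the same `TextDocument` shape, which is the point of a uniform format.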

Did we mention Ingestum is free and open source? While Sorcero's core business is the enrichment of ingested content through natural language intelligence, we believe the promises of data science will be better served with an open source content ingestion framework.

Sources and Use Cases

Today, the Ingestum repository contains dozens of example pipelines covering a broad range of sources, and new sources continue to be added. These examples provide a starting point for building new, custom pipelines. Sources beyond those supported out of the box can be added through a plugin mechanism.

In addition to adding new content sources, Ingestum also continues to add output options. Users can now specify a destination for the output, whether that is a file or a data lake in the cloud. Ingestum has been deployed across a wide variety of tasks and use cases, such as:

  • Ingesting ProQuest searches for mentions of medical tests for pharmaceutical companies. The output of the ingestion process is passed to an AI system that monitors the documents for mentions of adverse effects.
  • Monitoring Twitter for mentions of products. The output of the ingestion process is passed to an AI system that does sentiment analysis.
  • Regular ingestion of regulations from PDF and XML sources. The output of the ingestion process is passed to an AI system that does document comparison, surfacing changes in guidelines over time.
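These use cases share one shape: raw content goes in, a series of transformation steps produces a uniform document, and the result is handed to a chosen destination for a downstream AI system to pick up. The sketch below illustrates that shape in plain Python. Every name in it (`IngestedDocument`, `Destination`, `LocalFile`, `run_pipeline`, `collapse_whitespace`) is hypothetical and is not taken from Ingestum itself.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, Protocol, Sequence


@dataclass
class IngestedDocument:
    """Hypothetical uniform document produced by a pipeline."""
    content: str
    origin: str


class Destination(Protocol):
    """Anything that can store a document and report where it went."""
    def store(self, document: IngestedDocument) -> str: ...


@dataclass
class LocalFile:
    """Writes the document as JSON into a directory on disk."""
    directory: Path

    def store(self, document: IngestedDocument) -> str:
        self.directory.mkdir(parents=True, exist_ok=True)
        path = self.directory / "document.json"
        path.write_text(json.dumps(asdict(document)), encoding="utf-8")
        return str(path)


def collapse_whitespace(text: str) -> str:
    """Example transformation step: normalize runs of whitespace."""
    return " ".join(text.split())


def run_pipeline(
    raw: str,
    origin: str,
    steps: Sequence[Callable[[str], str]],
    destination: Destination,
) -> str:
    """Apply each step in order, then store the resulting document."""
    text = raw
    for step in steps:
        text = step(text)
    return destination.store(IngestedDocument(content=text, origin=origin))
```

For example, `run_pipeline(raw_text, "memo.txt", [collapse_whitespace], LocalFile(Path("out")))` writes `out/document.json` and returns its path. A cloud or data lake destination would be another class with the same `store` method, so the pipeline itself would not need to change.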

Our vision for Ingestum is to make unstructured text easily available for natural language processing (NLP), facilitating the creation of a “knowledge fabric” that enables further enrichment by AI technology.


Elizabeth Venafro
Elizabeth is a B2B and B2G modern marketing technologist with experience at small and large organizations across diverse industries. Prior to joining Sorcero, Elizabeth was the Director of Marketing at Omnilert, pioneer of the next generation in emergency management, and at Unison, the leading provider of acquisition management software to government agencies and contractors. She also co-founded and was the managing partner of the Konvergent social engagement platform, and served as the Director of Global Marketing Communications for medical device manufacturer K2M (acquired by Stryker) from start-up through IPO. Elizabeth graduated magna cum laude from James Madison University with a bachelor's degree in Business Administration and is a proud wife and the mother of two girls.

