Endless PDFs, no two alike: some created digitally, others produced as scanned images. How about Word documents or XML files with unique tags? Audio recordings of Zoom meetings are certainly more popular than ever. Email threads, Twitter, RSS feeds, paper archives? Our email accounts, laptops, and the cloud are full of content in these formats.
Unfortunately, most of these formats were not designed to facilitate the extraction of text or, more precisely, structured text. Many people acknowledge the benefits and insights artificial intelligence (AI) can provide from processed text, but how do you get the text into the AI?
What is Ingestum?
Sorcero's Ingestum is an extensible, scalable, easy-to-use content ingestion framework. Ingestum enables a wide variety of data and document types to be sourced and transformed into a uniform document format. For example, it can take an HTML file, remove all of its HTML tags, and extract only the human-visible text. It can also take a PDF file, again extracting only the human-visible text. The resulting documents are indistinguishable from any other text documents, regardless of the source. This process is called "ingestion."
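To make the HTML case concrete, here is a minimal sketch of what "remove the tags, keep the human-visible text" can look like. It uses only Python's standard library and is not Ingestum's API; the names `VisibleTextExtractor` and `html_to_text` are hypothetical, chosen for illustration, and a real ingestion pipeline would handle many more cases (hidden elements, layout, encodings).

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and skips content
    that a browser would not render on the page."""

    # Tags whose contents are not shown to a reader (an assumption for this sketch).
    HIDDEN_TAGS = {"head", "script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0  # > 0 while inside any hidden tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0:
            # Collapse runs of whitespace so the output reads as plain text.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return only the visible text of an HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><title>T</title><style>p{color:red}</style></head>"
    "<body><h1>Report</h1><p>Dose &amp; schedule</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

The same idea carries over to other sources: each one needs its own extraction step, but the output is always plain text in one common shape, which is what lets downstream tools treat every document alike.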
Did we mention Ingestum is free and open source? While Sorcero's core business is the enrichment of ingested content through natural language intelligence, we believe the promises of data science will be better served with an open source content ingestion framework.
Sources and Use Cases
Today, the Ingestum repository contains dozens of example pipelines covering a broad range of sources, and new sources continue to be added. These examples provide a starting point for building new, custom pipelines. Ingestum currently supports the sources listed below, and additional sources can be added through a plugin mechanism:
- Audio
- bioRxiv (New!)
- CSV
- Document
- DOCX
- Europe PMC (New!)
- HTML
- Image
- LitCovid (New!)
- medRxiv (New!)
- ProQuest
- PubMed
- Text
- XLSX
- XML
In addition to new content sources, Ingestum continues to add output options. Recently, users gained the ability to specify a destination for the output, whether a file or a data lake in the cloud. Ingestum has been deployed across a wide variety of tasks and use cases, such as:
- Ingesting ProQuest search results for mentions of medical tests, on behalf of pharmaceutical companies. The output of the ingestion process is passed to an AI system that monitors the documents for mentions of adverse effects.
- Monitoring Twitter for mentions of products. The output of the ingestion process is passed to an AI system that performs sentiment analysis.
- Regular ingestion of regulations from PDF and XML sources. The output of the ingestion process is passed to an AI system that does document comparison, surfacing changes in guidelines over time.
Our vision for Ingestum is to make unstructured text easily available for natural language processing (NLP), facilitating the creation of a “knowledge fabric” that enables further enrichment by AI technology.