Over its lifetime, a new technology tends to go through a brief period of flourishing democratization, followed by a strong push towards centralization. This has occurred in everything from radio to the internet, and we’re seeing it move quickly in AI. Just a few years ago, the best path to solving a task with AI seemed to be building your own frameworks and models, and the race was on to build the best. But along with the radical pace of improvement driven by ever-larger neural networks have come increased barriers to entry for the best-performing software.
The current leader atop the SuperGLUE natural language understanding (NLU) benchmark is T5, from Google, which cost an estimated $10 million to train. At just over a year old, it is already a seasoned model, and will likely be dethroned within the next year by something even more expensive. The sheer amount of compute required to create and train advanced AI models, combined with their relatively short lifespans, puts this strategy out of reach for all but the largest organizations, whether they are governments, corporations, or research institutions.
Finding enough training data to teach AI models is one of the bigger challenges in natural language processing (NLP). Most human-labeled, task-specific data sets are far too small to train modern deep-learning-based NLP models from scratch; those models need to see millions or even billions of training examples.
But with the rise of BERT for NLU, and of analogous image-classification models such as AlexNet and ResNet, the industry has been tackling this problem with transfer learning. Transfer learning starts with models pretrained on vast quantities of openly available content, such as Wikipedia or the open web, to build the largest possible data lake, and then fine-tunes the pretrained general model with domain-specific data. However, this technique still requires meaningful quantities of domain-specific data, making it difficult to tune these models for relatively narrow tasks that lack significant amounts of pre-labeled data. As a result, it is often difficult to achieve the high degree of accuracy most corporations need to apply AI solutions effectively to their problems.
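To make the pattern concrete, here is a minimal sketch of transfer learning using the open-source Hugging Face transformers and datasets libraries. The model name, file paths, and column names are placeholders for illustration, not a reference to any particular production setup.

```python
# A minimal sketch of the transfer-learning pattern: start from a model
# pretrained on large general corpora, then fine-tune it on a small
# domain-specific labeled dataset. File paths and column names are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# 1. Load a general-purpose pretrained model and its tokenizer.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 2. Load a (much smaller) domain-specific dataset with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "domain_train.csv",
                                          "validation": "domain_val.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

# 3. Fine-tune for a few epochs; the general language knowledge is already
#    encoded in the pretrained weights, so far less domain data is needed
#    than training from scratch would require.
args = TrainingArguments(output_dir="finetuned-domain-model",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"])
trainer.train()
```

This is also where the accuracy problem described above shows up: when the labeled domain data set is small and the language is highly specialized, a few epochs of fine-tuning on it may still not be enough.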
It has become a truism in data science that 80% of the total time on most projects goes to cleaning and preprocessing the data. That number is likely inflated, but most data scientists report that it is one of the most time-consuming (not to mention mind-numbing) parts of their job; automating it would go a long way towards making their work more efficient.
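As an illustration only, the kind of cleanup meant here might look like the following sketch: stripping markup, normalizing whitespace, and dropping empty or duplicate records. The function names are ours and purely illustrative.

```python
# Illustrative (and far from exhaustive) text cleanup: decode HTML entities,
# strip leftover tags, collapse whitespace, and de-duplicate records.
import re
from html import unescape

def clean_text(raw: str) -> str:
    text = unescape(raw)                      # decode HTML entities
    text = re.sub(r"<[^>]+>", " ", text)      # strip leftover markup tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

def preprocess(records: list[str]) -> list[str]:
    cleaned = [clean_text(r) for r in records]
    seen, unique = set(), []
    for doc in cleaned:
        if doc and doc not in seen:           # drop empties and duplicates
            seen.add(doc)
            unique.append(doc)
    return unique
```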
Ensuring that AI remains a fairly democratic, decentralized technology requires making the process of fine-tuning pretrained models as simple and inexpensive as possible. To that end, the Sorcero Language Intelligence Platform is designed to be a top-of-the-line workshop for assembling cutting-edge, customizable AI solutions in a way that is an order of magnitude faster, cheaper, and more durable than training a custom AI model.
There are three key elements of Sorcero’s platform that provide this sort of democratization. First, it is powered by the latest pretrained models on the open market, ensuring that our customers are working with the most powerful technology available. Next, our ingestion engine lets customers build custom pipelines from a library of more than 100 transformers, enabling source- and format-agnostic document processing. Finally, our Cognitive Tower enables easy fine-tuning from domain- and customer-specific ontologies, or uses customer content to auto-extract ontologies and tune without pre-labeled data sets. This allows the model to be precision-tuned at an enterprise or even workflow level.
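As a purely hypothetical sketch, and not Sorcero’s actual API, the pipeline idea can be pictured as composable document transformers plus an ontology supplied as a tuning signal; every name below is illustrative.

```python
# Hypothetical illustration of a composable ingestion pipeline: each transformer
# takes a document string and returns a document string, and an ontology is
# applied as one of the steps. All names here are invented for illustration.
from dataclasses import dataclass, field
from typing import Callable

Transformer = Callable[[str], str]

@dataclass
class IngestionPipeline:
    steps: list[Transformer] = field(default_factory=list)

    def add(self, step: Transformer) -> "IngestionPipeline":
        self.steps.append(step)
        return self

    def run(self, document: str) -> str:
        for step in self.steps:
            document = step(document)
        return document

# Example transformers: format extraction, cleanup, and ontology tagging.
def extract_text(doc: str) -> str:
    return doc  # placeholder: PDF/HTML/DOCX extraction would happen here

def normalize(doc: str) -> str:
    return " ".join(doc.split())

def tag_with_ontology(ontology: dict[str, str]) -> Transformer:
    def tagger(doc: str) -> str:
        for term, concept in ontology.items():
            doc = doc.replace(term, f"{term} [{concept}]")
        return doc
    return tagger

domain_ontology = {"myocardial infarction": "CONDITION/heart-attack"}  # toy example
pipeline = (IngestionPipeline()
            .add(extract_text)
            .add(normalize)
            .add(tag_with_ontology(domain_ontology)))
print(pipeline.run("Patient history:  myocardial   infarction in 2019."))
```

A real ingestion engine handles far more formats and transformer types; the point of the sketch is only the composition pattern, where any source-specific step can be swapped out without touching the rest of the pipeline.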
One of our customers, a leading insurer in a European Union country, is employing Sorcero for several use cases, one of which is turning its documents into a question-answering service for its customers. It initially attempted to build the product in-house using the Universal Sentence Encoder and BERT, a model that scored 69.0 on the SuperGLUE benchmark leaderboard.
However, due to the highly technical language of the customer’s content, BERT achieved only 35% accuracy in matching customer queries to the appropriate sections of its documents. By using the Sorcero Language Intelligence Platform to fine-tune the model, performance shot up to 93%, and a remarkable 99% for exact matches, based on the benchmarks for the task provided by our customer.
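For readers who want a feel for the underlying task, here is a minimal sketch of matching a customer query to the most relevant document section with sentence embeddings, using the open-source sentence-transformers library. The model name and the example policy sections are illustrative placeholders; a domain-tuned encoder would be loaded in place of the general-purpose one.

```python
# Minimal sketch of query-to-section matching with sentence embeddings.
# The model and the example sections are placeholders for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # general-purpose encoder

sections = [
    "Coverage begins on the first day of the month after enrollment.",
    "Claims must be submitted within 90 days of treatment.",
    "Premiums are adjusted annually based on the insured's age bracket.",
]
query = "How long do I have to file a claim?"

# Embed the sections and the query, then rank sections by cosine similarity.
section_embeddings = model.encode(sections, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, section_embeddings)[0]

best = int(scores.argmax())
print(f"Best match (score {float(scores[best]):.2f}): {sections[best]}")
```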
As more customers use the Sorcero Language Intelligence Platform, we expect performance to continue to improve as we leverage the general-language models developed by Google, Facebook, OpenAI, and other behemoths, while accommodating the specialized language used by our customers. As these models are continuously replaced by more recent advances, Sorcero’s platform will provide access to constantly improving assets worth tens of millions of dollars.
Contact Sorcero to learn more about what Language Intelligence can do to empower experts at your enterprise.