Measuring Progress in Natural Language Understanding
Walid Saba
Apr 19, 2022

Towards Meaningful Benchmarks for Measuring Progress in Language Understanding

The following piece, written by natural language understanding expert and Sorcero Senior Scientist Dr. Walid Saba, explores the "gap between the claims being made and the reality of actual progress in NLU technology."

The Path Towards Natural Language Understanding

The statistical revolution that took artificial intelligence (AI) by storm in the early 1990s culminated in the early 2010s with the resurgence of neural networks under the new label of Deep Learning (DL). Triggered by the superb performance in the famous ImageNet competition in 2012, DL gained unprecedented acceptance in both industry and academia. The supremacy of DL continues today, and it has reached a point where what was once just an eccentric subfield of AI has become almost synonymous with AI.

The success of DL in pattern recognition tasks (such as voice and image recognition) is understandable, since what these models do is essentially learn to find patterns by training on massive amounts of labeled data. But, as eloquently argued by Kenneth Church, one of the pioneers of using machine learning and other empirical methods in natural language processing (NLP), the pendulum has swung too far [1].

Church argues that the motivation behind employing empirical (statistical, data-driven and machine learning) methods was to pick the low-hanging fruit: until we have a better understanding of how natural language works, and until we develop working models for true language understanding, developing practical applications using empirical methods was better than doing nothing and waiting for computational linguists to crack the language problem.

Unfortunately, however, we are no longer dealing with a pendulum that swung too far, but with a pendulum that is broken. Forget picking the low-hanging fruit. Data-driven empirical methods are now the standard technology in NLP.

Since Google introduced BERT (Bidirectional Encoder Representations from Transformers), large language models (LLMs) have become the de facto technology in NLP. Such models require training on vast amounts of text, and several organizations compete by upping the ante on the amount of data they consume, the size of the networks they build, and thus the amount of hardware required to train them (we are now in the billions of parameters, i.e., weights in massive neural networks). Their hope, of course, is to discover meaning by harvesting a massive amount of text, something that I described elsewhere as “an attempt at chasing infinity”.

Natural languages are the external artifacts we use to express our thoughts, and so natural languages are infinite objects since the thoughts we might want to express are infinite. Thus, and as the prominent linguist and cognitive scientist Noam Chomsky once remarked, speaking of the probability of a sentence is a meaningless notion, under any interpretation of the term. Trying to uncover meaning from processing raw text, in other words, is a futile effort.

But what has made these large language models so popular?

One of the main reasons behind the proclamation of these “successes” is the continuous publication of record-breaking results that (almost always) claim above 90% accuracy on some benchmarks such as SQuAD (Stanford Question Answering Dataset), GLUE (General Language Understanding Evaluation), etc. Such claims have often prompted some media pundits to proclaim that these models have even surpassed human language competency.

However, none of these claims are even remotely close to the truth, as reported by many who tested these systems (for example see [2] and [3]).

The huge gap between the claims being made and the reality of actual progress in NLU technology is a byproduct of inadequate benchmarks.

While existing benchmarks are appropriate to test progress in NLP and text processing (search, text clustering, text categorization, text similarity, etc.), when it comes to measuring progress in developing systems that truly understand ordinary and everyday spoken language, such benchmarks are almost irrelevant.

In his classic Computing Machinery and Intelligence, the computing pioneer Alan Turing introduced “the imitation game” [4] for testing whether a machine is intelligent. What later became known as “the Turing Test” was essentially a dialogue between a human interrogator and two unknown entities A and B, one of which is a human and the other a computer. If the interrogator could not, after a considerable amount of time, tell which of A and B is the machine and which is the human, then that machine was considered to be an intelligent machine.

However, as Levesque et al. [5] pointed out later, systems that have participated in such competitions usually use deception and trickery by throwing in “elaborate wordplay, puns, jokes, quotations, clever asides, emotional outbursts,” while avoiding clear answers to questions that a 5-year-old would be very comfortable answering correctly. Levesque suggested instead what he called the Winograd Schema Challenge, where the following is a template test:

The trophy did not fit in the suitcase because it was too
a. small
b. big

Since ‘big’ and ‘small’ co-occur in similar contexts with the same probability, determining what “it” refers to here requires background knowledge that is not expressed in the words of the sentence, and it is precisely bringing that background knowledge to bear that we informally call thinking.

However, as noted in [6] (and more recently in [7]), the Winograd Schema Challenge still falls short, since the test is a binary choice in which even a random selection of referent yields about 50% accuracy on average. Moreover, the test is one-dimensional in that it only measures competence in reference resolution, while there are many other linguistic challenges that require reasoning with what is called commonsense background knowledge.
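To make the point about the binary format concrete, here is a minimal Python sketch (not taken from any of the cited papers). It simulates a guesser that ignores the sentence entirely and picks a referent at random, and it rescales a raw accuracy so that chance level maps to 0. The function names, the seed, and the simulated gold labels are all hypothetical choices made for illustration.

```python
import random


def random_guess_accuracy(n_items: int, n_choices: int = 2, seed: int = 0) -> float:
    """Accuracy of a guesser that picks a candidate referent uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    correct = 0
    for _ in range(n_items):
        gold = rng.randrange(n_choices)   # simulated correct referent (hypothetical data)
        guess = rng.randrange(n_choices)  # random guess that never looks at the sentence
        correct += guess == gold
    return correct / n_items


def chance_corrected(accuracy: float, n_choices: int = 2) -> float:
    """Rescale accuracy so that chance level maps to 0 and a perfect score maps to 1."""
    chance = 1.0 / n_choices
    return (accuracy - chance) / (1.0 - chance)


if __name__ == "__main__":
    acc = random_guess_accuracy(100_000)
    print(f"random guesser accuracy: {acc:.3f}")
    print(f"chance-corrected score for 90% raw accuracy: {chance_corrected(0.90):.2f}")
```

Under these assumptions the random guesser lands near one half, so a raw score on a two-way test says less than it appears to: only the margin above chance reflects anything the system has actually resolved.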

A better test would be one that shows how the machine can “uncover” all the missing text that is often not explicitly stated in our everyday communication but is implicitly assumed since it is part of our commonsense background knowledge. Here are some examples:

  1. Mary enjoyed the book/movie/sandwich
    (means Mary enjoyed reading the book/watching the movie/eating the sandwich)

  2. BBC has a reporter in every Asian country
    (does not mean that BBC has one particular reporter who is in every Asian country, but that in every Asian country BBC has a reporter, possibly a different one in each)

  3. The White House issued a strong statement against the North Korean nuclear tests  
    (it is not the White House building, but the administration working in the White House that issued the statement)

The above are just examples of the very subtle yet complicated reasoning, drawing on our commonsense knowledge, that we perform in truly “understanding” our everyday spoken language. Even a 5-year-old performs this complicated reasoning and, until we have systems that show that capability, we should tone down our claims of making progress in one of the most difficult challenges in AI: getting machines to fully understand human communication.

AI has historically been awash with hype, and grandiose claims have been all too common in the field. When the hype cycle reached its peak in the early to mid-1980s without anything to show for it, we went into what is now called an AI Winter. No one wants to see another AI winter, so we need to be careful in announcing progress.

Progress should also be measured objectively and not by establishing ad-hoc benchmarks and publishing groundbreaking results that also prompt outrageous claims in the public media. This technology is too important to be left to hackers and media pundits.



  1. Church, Kenneth (2007), A Pendulum Swung Too Far, LILT, Vol. 2, No. 4.
  2. Baeza-Yates, Ricardo (2022), Large Language Models Fail to Say What they Mean and Mean What they Say, VentureBeat, March 29, 2022.
  3. Marcus, Gary (2022), Deep Learning is Hitting a Wall, Nautilus, March 10, 2022.
  4. Turing, Alan (1950), Computing Machinery and Intelligence, Mind, Vol. LIX.
  5. Levesque, Hector, Davis, Ernie, and Morgenstern, L. (2012), The Winograd Schema Challenge, In Proc. of the 13th Int. Conf. on Principles of Knowledge Representation and Reasoning.
  6. Saba, Walid (2019), On the Winograd Schema: Situating Language Understanding in the Data-Information-Knowledge Continuum, FLAIRS Conference, AAAI Press.
  7. Kocijan, Vid et al. (2022), The Defeat of the Winograd Schema Challenge.





Walid Saba Ph.D. is an NLU/AI expert with over 20 years of experience (AT&T Bell Labs, IBM, and AIR) and 30+ publications.