Why neural language models don’t work well in NLG

By Ehud Reiter | September 2, 2021
I am often asked about neural language models such as BERT and GPT-3, and whether Arria uses such models to generate text. I usually explain that while neural models have been very successful in NLP applications such as machine translation (MT), they are much less useful in data-to-text NLG (i.e., the kind of systems Arria builds). But why is this?

One reason is that hallucination (i.e., neural language models producing narratives that are factually wrong) is a much more serious problem in NLG than in MT. In NLG, hallucinations happen on a grand scale. We recently did a study of mistakes in basketball stories produced by neural NLG systems, and found that each story contained, on average, 20 factual errors (hallucinations). This is a huge number, and far beyond what is acceptable in Arria-type systems. Furthermore, many of these hallucinations were fairly subtle (e.g., “scored 8 points” instead of “scored 12 points”) and hence hard for readers to detect. The number of hallucinations would need to drop by at least a factor of 100 before Arria would consider using this technology in its systems.
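To make the point about subtle hallucinations concrete, here is a minimal, purely illustrative sketch (this is a toy data format and a toy checker, not Arria's method or the one used in the study) of how a numeric claim in a generated sentence can be compared against the source data. The key observation is that in data-to-text NLG the ground truth is available, so an error like "scored 8 points" when the data says 12 is mechanically detectable, even though a human reader would sail right past it:

```python
import re

# Hypothetical source data for one player in one game (illustrative only).
box_score = {"LeBron James": {"points": 12, "rebounds": 7}}

# A generated sentence containing a subtle numeric hallucination.
generated = "LeBron James scored 8 points and grabbed 7 rebounds."


def check_claims(sentence, stats):
    """Return (stat, claimed, actual) triples where the sentence disagrees
    with the source data."""
    errors = []
    for player, facts in stats.items():
        if player not in sentence:
            continue
        # Very simple pattern matching; a real checker would need far more
        # robust claim extraction.
        for stat, pattern in [
            ("points", r"scored (\d+) points"),
            ("rebounds", r"grabbed (\d+) rebounds"),
        ]:
            m = re.search(pattern, sentence)
            if m and int(m.group(1)) != facts[stat]:
                errors.append((stat, int(m.group(1)), facts[stat]))
    return errors


print(check_claims(generated, box_score))  # [('points', 8, 12)]
```

The asymmetry this illustrates is exactly why hallucination is fatal here: the sentence reads perfectly fluently, and only a comparison against the data reveals that it is wrong.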

In machine translation, though, hallucination happens less frequently, and when it does occur, it is usually in edge cases or situations where there is a serious mismatch between what the system is asked to translate and what it was trained on. In other words, when an MT system is operating in its “comfort zone”, hallucination is rare. Hallucination does occur when the MT system is pushed beyond its comfort zone, but even here most of the hallucinations result in texts that are obviously wrong to readers, because they contain nonsensical or repeated phrases. Hence, hallucination in MT is mostly a concern to MT vendors who sell systems to users in safety-critical applications (especially when the users don’t understand what the system’s comfort zone is).

To give an analogy, an MT system based on a neural language model is like a human translator who does a good job when translating news articles (which are well written in everyday language), but makes mistakes when translating clinical notes from a doctor (which are very technical and often poorly written). Such a translator is certainly useful provided we understand her limitations. A data-to-text NLG system based on a neural language model, though, is like a journalist who takes the philosophy “Don’t let facts get in the way of a good story” to an extreme, and writes stories that are plausible and well-written but have little connection to reality. This may be acceptable when generating fiction or fake news, but it is not acceptable when NLG texts are used to help people make important decisions!