Finding creative solutions to detect mistakes in neural-NLG narratives

By Ehud Reiter | July 14, 2021

There is a lot of excitement in the NLG world about using neural (deep-learning) techniques to build NLG systems. However, one big problem with current neural NLG systems is that the narratives they generate can contain a lot of factual errors, especially if the narratives are longer than 10-20 words.

I am working with an Aberdeen University PhD student, Craig Thomson, to better understand the number and types of mistakes these systems make, and how they can be detected. After all, if an NLG system can detect when it has produced a flawed narrative, it could try again, or ask a person (“human in the loop”) to fix the narrative.

Craig and I have our own ideas about this, but we’ve been keen to get ideas from other people as well, so we decided to run a “shared task” on detecting mistakes. Shared tasks are basically contests, where the organizers (i.e., Craig and I) propose a well-defined problem, and ask participants to submit solutions (e.g., algorithms). Anyone can participate, so it’s a great way to get ideas from people with very different backgrounds!

We described the shared task in a paper, and full details are available at our GitHub site (warning: the site is pretty technical). Basically, we asked participants to develop algorithms and protocols that can find mistakes in summaries of basketball games produced by neural NLG systems. We manually analyzed the mistakes in 60 such summaries, and the participants used this “training data” to build and tune their algorithms and protocols, which they sent to us on 15 June. We’re now in the process of checking how well these approaches work on a “test set” of 30 additional summaries of basketball games. Results will be announced at the INLG conference in Aberdeen in September.

It’s exciting and I’m looking forward to announcing the results! We have participants from three continents who are exploring very different approaches. It’s fascinating to see which approaches work well and which do not; and if some of the approaches are successful, this will be a huge help in building useful neural NLG systems!

I’d also encourage anyone who is interested in the accuracy of neural NLG to look at the GitHub site for our shared task, especially gsml.csv (the manually produced list of mistakes in the 60 original narratives). I’ve seen a lot of discussion of accuracy problems, much of which is vague and not grounded in real data. Our gsml.csv file lists 1,214 mistakes in 60 narratives, and I think it’s very helpful in giving a concrete understanding of the number and type of errors produced by neural NLG systems in this kind of task.
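To give a feel for what working with a mistake list like this looks like, here is a minimal sketch of counting annotated mistakes per error category and per narrative. The column names (`narrative_id`, `mistake_tokens`, `category`) and the category labels are illustrative assumptions, not the real gsml.csv schema — check the GitHub site for the actual format.

```python
import csv
import io
from collections import Counter

# A hypothetical miniature mistake list in the spirit of gsml.csv.
# Column names and category labels are assumptions for illustration only.
SAMPLE = """\
narrative_id,mistake_tokens,category
summary_01,Jazz,NAME
summary_01,28,NUMBER
summary_02,defeated,WORD
"""

def count_mistakes(csv_text):
    """Return total mistakes, plus counts per category and per narrative."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    by_category = Counter(row["category"] for row in rows)
    by_narrative = Counter(row["narrative_id"] for row in rows)
    return len(rows), by_category, by_narrative

total, by_category, by_narrative = count_mistakes(SAMPLE)
print(total)                        # 3
print(by_category["NUMBER"])        # 1
print(by_narrative["summary_01"])   # 2
```

Even simple tallies like this make it easy to see how errors are distributed across narratives and error types, which is exactly the kind of concrete picture the shared task data is meant to provide.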