Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data

Data collection is a major obstacle to the development of high-quality NLP models.

This problem is especially relevant for tasks that either (1) require a large number of training examples (e.g. grounded language learning) or (2) have a high cost per training example (e.g. semantic parsing). Crowdsourcing platforms like Amazon’s Mechanical Turk are an option (albeit an expensive one) for case 1, but they are often not viable for case 2: semantic parsing, for example, requires annotators with working knowledge of the target language (e.g. SQL), a skill that demands substantial domain expertise.

NLP isn’t the only area that suffers from data scarcity. Robot learning, for example, is notoriously sample-inefficient: a large number of training examples is required to learn even simple tasks. Domains that rely on real-world data (robotics among them) often turn to simulation instead, since simulated data is virtually unlimited in quantity. This practice, however, creates new challenges in domain adaptation, which manifest as a performance gap when the agent is deployed in the wild. The same issue arises in NLP: models trained on synthetic data often fail to generalize to natural language data at test time because of catastrophic distributional shifts in the inputs to the model. The test-time accuracy gap between models trained and evaluated on synthetic data and models trained on synthetic data but evaluated on “real” data can be upwards of 54%.

To address this, practitioners commonly use sim-to-real transfer: they train agents in simulation, then apply additional techniques to “bridge the reality gap” that inevitably arises from discrepancies between simulation and the real world. In our recent paper, we explore how sim-to-real transfer can be adapted to NLP to address data scarcity, introducing the first general-purpose techniques of this kind for language understanding problems.

Let’s start with the basics: what does a simulator for NLP look like?

Generally speaking, a simulator is an automated procedure for generating labeled input-output examples. Consider the case of robot instruction following: here, we require a mapping from natural language inputs to programs (concrete interpretations of those inputs), which are then executed to produce a sequence of low-level actions. Our simulator can therefore be defined as an expert-designed inverse of this mapping: a function that maps each program to a set of distinct input sentences that induce that program.

There are a few things to note about this type of simulator, which we refer to as a synthetic grammar. The first is that the synthetic grammar can be engineered to provide full coverage of the space of target programs by ensuring that there is a synthetic instruction for every target program. The second is that while the grammar can generate large amounts of training data cheaply, that data does not capture the full spectrum of linguistic variation: there are more ways to say the same thing than any grammar engineer can reasonably hope to enumerate, since many different sentences map to the same program (e.g. “find a red door”, “navigate to the red door”, etc.).
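
To make this concrete, here is a minimal sketch of what a synthetic grammar might look like for an instruction-following domain. Everything in it (the templates, the attribute vocabulary, the program syntax) is invented for illustration rather than taken from the paper:

```python
import itertools

# A toy synthetic grammar for instruction following (illustrative names and
# program syntax; not the grammar used in the paper).
COLORS = ["red", "blue", "green"]
OBJECTS = ["door", "ball", "key"]
ACTIONS = {
    "GoTo": ["go to the {color} {obj}"],
    "PickUp": ["pick up the {color} {obj}"],
}

def generate_synthetic_data():
    """Expand the grammar: for every target program, emit one or more
    synthetic instructions that should induce it."""
    for action, color, obj in itertools.product(ACTIONS, COLORS, OBJECTS):
        program = f"{action}(Object({obj!r}, {color!r}))"
        for template in ACTIONS[action]:
            yield template.format(color=color, obj=obj), program

pairs = list(generate_synthetic_data())
print(len(pairs))   # 18 -- every program in the (toy) space is covered
print(pairs[0])     # ('go to the red door', "GoTo(Object('door', 'red'))")
```

Even this toy grammar makes the tradeoff visible: it covers every program in its (tiny) space, but pairs each program with only one fixed phrasing.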

Our goal was to demonstrate that ML models trained exclusively on synthetic data can still generalize successfully to natural language at test time.

Synthetic training data is cheap to acquire and virtually unlimited in quantity. Demonstrating that models trained on synthetic data can still generalize to natural language inputs when deployed in real-world environments would mean getting the best of both worlds. The core challenge is handling the linguistic variation present in natural language inputs, despite training on synthetic data with minimal variation.

An alternative view of this problem is one of finding paraphrases: we have access to a set of synthetic sentences representing the full space of possible programs and are given a natural language input to translate into a program. Our task can therefore be construed as finding a synthetic sentence with the same semantic meaning as the input, then outputting the program corresponding to that synthetic sentence.
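
Stated a bit more formally (with notation introduced just for this post, not taken from the paper): given a natural language input, find a synthetic sentence with the same meaning and return the program attached to it.

```latex
% x          : natural language input at test time
% S          : set of synthetic sentences produced by the grammar
% program(s) : program that the grammar associates with sentence s
s^{*} \in \{\, s \in S \;:\; s \text{ is a paraphrase of } x \,\}, \qquad
\hat{y} = \mathrm{program}(s^{*})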

We propose to project natural language inputs onto the synthetic language distribution as a proxy for paraphrasing, using pretrained sentence embeddings (e.g. from BERT). More concretely, we observe that semantically similar sentences tend to lie closer together in embedding space than semantically dissimilar sentences, so we can rank the synthetic sentences by their similarity to the natural language input in this embedding space. Empirically, we find that the closest synthetic sentence is a paraphrase of the natural language input (again, assuming our grammar fully covers the space of desirable programs), and so feeding the synthetic paraphrase to our model at inference time induces the desired target behavior, since the meaning of the user’s instruction is unchanged. The only difference is that the synthetic utterance comes from the distribution the model was trained on, so we avoid the aforementioned distributional shifts that previously led to performance degradation. See our paper for a more in-depth discussion of different projection techniques and their performance and efficiency profiles!
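
Here is a rough sketch of what that projection step might look like in code. The sentence-transformers library and the model name below are just a convenient stand-in for a pretrained sentence encoder, and the synthetic sentences reuse the toy grammar from the earlier sketch; the paper covers the embedding models and projection variants we actually evaluated.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative encoder choice; any pretrained sentence encoder could be used.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Synthetic sentences from the grammar, each paired with its program.
synthetic_to_program = {
    "go to the red door": "GoTo(Object('door', 'red'))",
    "go to the blue ball": "GoTo(Object('ball', 'blue'))",
    "pick up the green key": "PickUp(Object('key', 'green'))",
}
synthetic_sentences = list(synthetic_to_program)
synthetic_embeddings = encoder.encode(synthetic_sentences, convert_to_tensor=True)

def project(natural_input: str) -> str:
    """Map a natural language input to the program of its nearest
    synthetic sentence in embedding space."""
    query = encoder.encode(natural_input, convert_to_tensor=True)
    scores = util.cos_sim(query, synthetic_embeddings)[0]
    best = int(scores.argmax())
    return synthetic_to_program[synthetic_sentences[best]]

print(project("navigate to the red door"))  # expected: GoTo(Object('door', 'red'))
```

Note that nothing in this sketch touches the downstream model itself: the projection happens entirely at inference time, and the model only ever sees inputs drawn from the synthetic distribution it was trained on.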

How well did we do?

On many of the tasks we evaluated, our approach, trained using only the simulator, was able to meet or exceed the performance of models trained on natural language data.

These results are promising, but there is a lot of work left to do. Improvements in sim-to-real transfer could have wide-ranging positive effects on downstream tasks. Not only does sim-to-real offer a better way to evaluate models trained on synthetic data, but it could also serve as the basis for new natural language interfaces to complex systems that would previously have been impractical to build due to data scarcity and prohibitive engineering costs.

Check out our paper, code, and dataset for more information!