An overview of 2018 language models

Tung Nguyen2019-01-16

I became a deep learning enthusiast the first time I read about Deep Belief Network in 2013. Currently, I am working on NLP for Clova, but I also interested in multimodal learning, reinforcement learning, and both research and development of production system.

Hello, this is Tung from Clova team. Clova is our AI platform that comes with the first generation smart devices as Clova Friends and Clova Wave. You can find more information on the products here (Sorry, the site is in Japanese).

Today, as we start 2019, I want to invite you to reflect on a very hot topic of 2018 — the emergence of multiple competitive language models, ELMo, ULMFit, OpenAI Transformer, and BERT. There are articles on the internet explaining the internals of each model. So I will try to bring you a different perspective by putting all the four models side by side, making comparison and going into detailed aspects when necessary. Are you ready? Let's go!

Brief introduction of language model

What is a language model?

Simply speaking, a language model is a model to calculate probability distribution over sequences of tokens in a (natural) language. Older models such as n-gram language models have achieved good results but are severely limited by the curse of dimension due to many possible ways to combine n tokens. Newer models based on neural networks make use of distributed representations, that is, low-dimension vector representations of tokens that can mitigate the curse of dimension. Word2vec is a perfect example which you may have heard about. It provides a distributed representation of words that can be conveniently plugged in almost every NLP (Natural Language Processing) models with success.

Why should we care about language models?

Because they can empower a very efficient way of learning, called transfer learning. Transfer learning helps you to train, often unsupervisedly on a lot of data, then to apply the pre-trained model to efficiently learn downstream tasks. In real world situations, those downstream tasks are often supervised learning tasks but lacking enough labelled data to train on. Transfer learning can help this by leveraging knowledge learned from vast data in pre-trained model. (You can read more about transfer learning on Sebastian Ruder's blog here). Word2vec which I mentioned earlier is actually one example of transfer learning, a universal one. Researchers since then have been looking for even better way of transfer learning, which finally results in 2018 language models of our main topic; ELMo, ULMFit, OpenAI Transformer, and BERT.

Language models in 2018

How good are the 2018 language models?

They are said to possess univesality, for the fact that they achieved state-of-the-art scores on various NLP tasks. I combined available scores from their original papers and official website to one place for easy comparison.

Scores of various NLP tasks
△: Smaller than previous state-of-the-art at published timing
*: Not comparable exactly due to different setup

Task	Description	ELMo	OpenAI Transformer	BERT_base
GLUE (consists of 8 small tasks below)	A set of 8 tasks of language understanding	71.0	72.8	79.6
⊢ 1. MNLI	Predict entailment/contradiction/neutral of a sentence pair	76.4/76.1	82.1/81.4	84.6/83.4
⊢ 2. QQP	Determine if two questions asked on Quora are semantically equivalent	64.8	70.3	71.2
⊢ 3. QNLI	Binary classification version of SQuAD: whether sentence in (question, sentence) pair contains answer to question	79.9	88.1	90.1
⊢ 4. SST-2	Binary single-sentence classification of movie reviews	90.4	△91.3	93.5
⊢ 5. CoLA	Binary single-sentence classification of linguistical acceptability	36.0	45.4	52.1
⊢ 6. STS-B	Score 1 to 5 for semantic similarity of sentence pairs from news headlines and other resources	73.3	82.0	85.8
⊢ 7. MRPC	Binary classification for semantic similarity of sentence pairs from online news	84.9	△82.3	88.9
⊢ 8. RTE	Binary entailment task similar to MNLI, but with much less training data	56.8	△56.0	66.4
SQuAD	Find answer for a question in a text span from Wikipedia	85.8	-	88.5
CoNLL NER	Named entity recognition	92.2	-	92.4
SWAG	Predict sentence entailment from multiple choice question	59.2	-	*81.x
Coref	Clustering mentions in text that refer to the same underlying real world entities (OntoNotes coreference from CoNLL 2012 shared task)	70.4	-	-
SRL	"Who did what to whom" tagging	84.6	-	-
SNLI	Predict entailment/contradiction/neutral of a sentence pair	88.7	89.9	-
SciTail	Multiple choice question-answering given candidate knowledge sentences	-	88.3	-
RACE	Multiple choice for sentence completion after reading comprehension	-	59.0	-
ROCStories	Binary choice question-answering after reading comprehension of short stories	-	86.5	-
COPA	Common sense binary choice question-answering	-	78.6	-

Architecture comparison

Of the four language models, two are LSTM-based and the other two are Transformer-based.

So why are LSTM and Transformer used for making language models? We will investigate what their linguistic capabilities are in the sections following.

LSTM

LSTM stands for Long Short Term Memory, which is a special type of recurrent neural network. (You can find a very good introduction to LSTM here). Architecture-wise, its recurrency is suitable for sequential processing which seems a natural fit to handle text sequence. Additional mechanisms such as bi-direction and attention have helped to make LSTM even more successful for many NLP tasks. There are recent active researches to un-blackbox LSTM learning capabilities. Today, I want to introduce some key results from papers mentioned in "Trying to Understand Recurrent Neural Networks for Language Processing" (Goldberg, Blackbox NLP workshop EMNLP 2018 - might take a while to be loaded). Please refer to the original papers below for more details.

LSTM can learn basic, simple linguistic features (Conneau et al., ACL 2018)
Unsupervised LSTM can learn syntax-sensitive dependencies (Kuncoro et al., ACL 2018)
There are effective context spans that affects LSTM learning (Khandelwal et al., ACL 2018)
- Nearby and far context
- Different types of tokens need more or less context
- Dropping and replacing target token affect perplexity differently

Practical takeaways

LSTM, especially unsupervised LSTM which is used in language model case, can learn linguistic features
Different context and token types have different impact on LSTM learning. We think this is a very important implication for designing task objectives like BERT (will get back to this later)

Transformer

Despite the catchy title of the original paper, "Attention is all you need" (Vaswani et al., 2017), deeper understanding of attention mechanism in Transformer architecture still requires more researches. There are several papers looking deeper into Transformer with interesting results; I've listed a couple below. Note that these papers are about NMT (Neural Machine Translation), a task that is different from language models, but they still provide useful information for analysis of linguistic learning capability.

In the "Attention is all you need" paper, ability of modeling distant dependency efficiently is considered the key attribute of attention. But does it imply that Transformer can learn longer context than RNN/LSTM?

"Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures" (Tang et al., EMNLP 2018)
- Result: Transformer outperforms RNN-bideep (implemented in Junczys-Dowmunt et al.,2018) for distances 11-12, but RNN-bideep performs equally or better for distance 13 or higher
- Thus, we cannot conclude that Transformer models are particularly stronger than RNN models for long distances

Attention was used extensively in Transformer, but does it have the same effect every where?

"How much attention do you need?" (Tobias, ACL 2018)
- Result: Attention importance is not the same throughout Transformer architecture
- Other techniques like residual connection, layer normalization also play important roles

This leads to the question, can we combine RNN/LSTM with other techniques in Transformer paper for a better result? An affirmative answer was given in the paper below.

"The best of both worlds" (Mia Xu Chen et al., ACL 2018)
- Result: RNMT+ architecture scored even better than Transformer

Another natural question is, given their similarity based on Transformer: what makes OpenAI Transformer and BERT different? It is believed that the bi-directional characteristic in BERT architecture as well as BERT's task objectives, as listed in the table below, make BERT learn context better. (Refer to Google AI blog and code on Github)

You may have recognized by now that the setting task of BERT using masked language model sounds very familiar to the conclusion of "different context" in the LSTM section. We hypothesize that Transformer architecture has a similar dependency on different contexts, which implies alternating context was what helped BERT learn better than OpenAI Transformer. Let's expect more researches on this topic this year.

How to transfer learning

Once you have pre-trained a language model, there are two general ways to apply it to other downstream tasks: one is feature-based, and two, fine-tuning. In principle, both ways can be used for all the four models.

Feature-based

This method utilizes task-specific neural network models. It usually adds a latent output vector, composed of appropriate layers from the pre-trained language model. For example, you can add a latent output vector from ELMo to the CNN model for text classification as shown in the diagram below.

Source: "Convolutional Neural Networks for Sentence Classification" (Kim, 2014)

The pros and cons of this model are:

Pros
- Requires less resources than fine-tuning
Cons
- Requires a customized model for each downstream task
- (Generally) Scores lower than fine-tuning

Fine-tuning

A unified architecture of a pre-trained network plus several simple linear layers can be used for any task like example from OpenAI Transformer as shown below. The pre-trained network will be trained again (fine-tuned) together with extra linear layers on the task specific data set.

Source: "Improving Language Understanding by Generative Pre-Training" (Radford et al., 2018)