LyGuide Series: Multilingual Language Models
Jan 22, 2023 · LyRise Team · 2 min read
This post will walk through how to train, convert, and use multilingual language models in a Google Cloud Translation API project. The examples here focus on English-to-Spanish translation. However, the principles apply equally well to other languages or language pairs.
What are Multilingual Language Models?
A multilingual language model is a single language model trained on text from multiple languages. The idea is to capture what the languages have in common in one probabilistic model, which often yields a performance boost over training separate monolingual models.
The benefits of multilingual modeling are well known: a single model is easier to train and deploy than one model per language when you have data from several languages, and sharing parameters across languages can improve accuracy over separate monolingual models, especially for languages where training data is scarce.
Training a multilingual language model
In this section, we will walk through the training process for a multilingual language model. The first thing to note is that the training process itself is nearly identical to that of single-language models:
- First, combine your input data into a single column (if it isn’t already) and split it into train and test sets.
- Next, run a bag-of-words pass over both sets. For each tokenized word in each sentence of the train set (and each sentence of the test set), calculate its frequency counts in both languages; these value pairs are stored as elements in a per-sentence array called “freq”.
- Then, at training time, iterate over all tokens in all sentences from both languages and update their values using gradient descent on the corresponding freq arrays; that’s why these arrays are called “weighted frequencies”. A sketch of these steps follows below.
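Here is a minimal sketch of those three steps in Python. The tiny parallel corpus, the whitespace tokenizer, the learning rate, and the single-pass “gradient descent” update are all simplifications invented for illustration; a real pipeline would use a proper tokenizer and many training epochs.

```python
import random
from collections import Counter

# Hypothetical parallel corpus: (English, Spanish) sentence pairs.
corpus = [
    ("the cat sleeps", "el gato duerme"),
    ("the dog runs", "el perro corre"),
    ("the cat runs", "el gato corre"),
    ("the dog sleeps", "el perro duerme"),
]

# Step 1: shuffle and split into train and test sets.
random.seed(0)
random.shuffle(corpus)
split = int(0.75 * len(corpus))
train, test = corpus[:split], corpus[split:]

# Step 2: per-language frequency counts over the tokenized train set.
freq = {"en": Counter(), "es": Counter()}
for en_sent, es_sent in train:
    freq["en"].update(en_sent.split())
    freq["es"].update(es_sent.split())

# Step 3: a gradient-descent-style update of per-token weights,
# initialized from the raw counts (the "weighted frequencies").
lr = 0.01
weights = {lang: dict(counts) for lang, counts in freq.items()}
for lang, counts in freq.items():
    total = sum(counts.values())
    for token, count in counts.items():
        # Nudge each weight toward the token's relative frequency.
        target = count / total
        weights[lang][token] -= lr * (weights[lang][token] - target)

print(weights["en"]["the"], weights["es"]["el"])
```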
Converting a multilingual language model to single-language models
The final step in creating a single-language model is to convert the multilingual language model into one. The conversion uses the language model’s probability distribution, i.e., the probability of each word occurring in the corpus. First, calculate this distribution for each language. Then weight each word’s embedding vector by its probability and sum the results to obtain the final single-language word embeddings.
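As code, that step might look like the following sketch. The shared vocabulary, the embedding matrix, and the unigram counts are all made up for illustration; only the probability-weighting of embeddings reflects the description above.

```python
import numpy as np

# Hypothetical shared vocabulary and multilingual embedding matrix
# (one 4-dimensional vector per word, values invented for illustration).
vocab = ["the", "cat", "el", "gato"]
embeddings = np.random.default_rng(0).normal(size=(len(vocab), 4))

# Unigram counts for one language's slice of the corpus.
counts = np.array([10.0, 3.0, 0.0, 0.0])

# Language model probability distribution: P(word) over that corpus.
p = counts / counts.sum()

# Scale each word's embedding by its probability; summing the rows
# collapses them into a single language-level vector.
single_language_embeddings = p[:, None] * embeddings
language_vector = single_language_embeddings.sum(axis=0)
print(language_vector.shape)  # (4,)
```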
Predicting on test data
At this point, you apply the single-language model to your test dataset. The model should generalize from the training data to the held-out test data. In the multilingual setting it can even transfer across languages: a model trained on a collection of French and German documents may make sensible predictions on English text without any additional training.
You can also use this approach to probe cross-lingual transfer in multilingual models directly: train on one language, then test on another (for example, train on English documents only, then test on Spanish documents only), as in the sketch below.
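Here is a toy version of that evaluation, assuming per-language “weighted frequencies” like those produced in the training sketch. The weights, the smoothing constant, and the scoring rule (average log-probability of a sentence) are all illustrative choices, not the only ones possible.

```python
import math

# Toy per-language unigram weights, standing in for the "weighted
# frequencies" produced during training (values invented).
weights = {
    "en": {"the": 4.0, "cat": 1.0, "dog": 1.0, "runs": 1.0, "sleeps": 1.0},
    "es": {"el": 4.0, "gato": 1.0, "perro": 1.0, "corre": 1.0, "duerme": 1.0},
}

def score(sentence, lang, smoothing=1e-6):
    """Average log-probability of a sentence under one language's weights."""
    lang_weights = weights[lang]
    total = sum(lang_weights.values())
    tokens = sentence.split()
    log_prob = sum(
        math.log(lang_weights.get(t, smoothing) / total) for t in tokens
    )
    return log_prob / max(len(tokens), 1)

# Train on English, test on Spanish: the Spanish weights score the
# Spanish sentence far higher than the English weights do.
print(score("el gato duerme", "en"))  # very low: wrong language
print(score("el gato duerme", "es"))  # much higher
```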
Inference using the single-language model
Inference with the single-language model is also pretty simple: you use it to predict on new text just as you did on the test data. The only difference is that you may not know which language the new text is written in, so you apply the same weights used during training and generate a probability distribution over the possible languages.
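Continuing the toy example, one way to sketch that inference step is to score the text under each language’s weights and normalize with a softmax to get a distribution over languages. The weights and the smoothing value are invented placeholders.

```python
import math

# Toy per-language weights again (invented values); in practice these
# would be the same weights produced during training.
weights = {
    "en": {"the": 4.0, "cat": 1.0, "sleeps": 1.0},
    "es": {"el": 4.0, "gato": 1.0, "duerme": 1.0},
}

def language_distribution(text, smoothing=1e-6):
    """Softmax over per-language log-probabilities of the input text."""
    tokens = text.split()
    log_probs = {}
    for lang, lang_weights in weights.items():
        total = sum(lang_weights.values())
        log_probs[lang] = sum(
            math.log(lang_weights.get(t, smoothing) / total) for t in tokens
        )
    # Numerically stable softmax over the per-language scores.
    z = max(log_probs.values())
    exp = {lang: math.exp(lp - z) for lang, lp in log_probs.items()}
    norm = sum(exp.values())
    return {lang: v / norm for lang, v in exp.items()}

print(language_distribution("el gato duerme"))  # almost all mass on "es"
```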
Takeaways:
- Multilingual language models are powerful: one model can cover many languages.
- They can be used for many applications, from translation to language identification.
- They can improve the accuracy and performance of other models.