Researchers propose a new method based on parameter differentiation that automatically determines which parameters should be shared and which should be language-specific


In recent years, Neural Machine Translation (NMT) has attracted considerable attention and achieved great success. While traditional NMT translates a single language pair, training a separate model for each pair is costly, especially given the thousands of languages around the world. Multilingual NMT (MNMT) is therefore designed to handle many language pairs in a single model, greatly reducing the cost of offline training and online deployment. In addition, parameter sharing in multilingual neural machine translation promotes positive knowledge transfer between languages and benefits low-resource translation.

Despite the benefits of joint training with a fully shared model, the MNMT approach suffers from a model capacity problem: shared parameters tend to capture general knowledge while neglecting language-specific knowledge. To increase model capacity, researchers have heuristically designed additional language-specific components, building MNMT models that mix shared and language-specific parameters, such as language-specific attention, lightweight language adapters, or language-specific routing layers.

A research group from the National Laboratory of Pattern Recognition recently proposed a new strategy based on parameter differentiation that allows the model to automatically determine, during training, which parameters should be shared and which should be language-specific. The method is inspired by cellular differentiation, in which a cell changes from a general type to a more specialized one: it allows each parameter shared by multiple tasks to dynamically differentiate into more specialized types.

The model starts out fully shared and periodically identifies shared parameters that should become language-specific. To increase language-specific modeling capacity, these parameters are duplicated and reassigned to different tasks. The differentiation criterion is inter-task gradient similarity, which reflects how consistently the tasks' optimization directions agree on a given parameter. Only parameters with conflicting inter-task gradients are selected for differentiation, while those with similar inter-task gradients remain shared. Overall, without multi-stage training or manually designed language-specific modules, the method lets the MNMT model gradually improve its parameter sharing configuration.

The main contributions of the study can be summarized as follows:

• A method is presented for determining which parameters in an MNMT model should be language-specific without the need for manual design, as well as dynamically changing shared parameters to more specialized types.

• Inter-task gradient similarity is used as the differentiation criterion, which reduces inter-task interference on the shared parameters.

• The method's resulting parameter sharing configurations are closely related to linguistic features such as language families.

The main objective of the research is to identify the shared parameters in an MNMT model that should be language-specific and to dynamically transform them into more specialized types during training. To this end, the team proposes an MNMT technique based on parameter differentiation, inspired by biological differentiation, with inter-task gradient similarity as the differentiation criterion: task-independent parameters are dynamically converted into task-specific ones as training proceeds.

Training starts from a fully shared MNMT model. After several training steps, the model examines each shared parameter and identifies, under the given differentiation criterion, which parameters should become more specialized. It then replicates these flagged parameters and reassigns the replicas to different tasks. After duplication and reassignment, the model creates new connections for the replicas, yielding different computation graphs. Repeating this over many training stages, the model differentiates and specializes dynamically.
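A minimal sketch of the duplicate-and-reassign step, using hypothetical bookkeeping (a table mapping each parameter name to its values and the set of tasks it serves) rather than the paper's actual implementation:

```python
import copy

def differentiate(params, name, group_a, group_b):
    # Split a shared parameter into two replicas, each serving one task
    # group. `params` maps a parameter name to a (values, tasks) pair;
    # both replicas start from the same values and then continue training
    # independently on their own tasks.
    values, tasks = params[name]
    # The two groups must partition the tasks the parameter served.
    assert group_a | group_b == tasks and not (group_a & group_b)
    params[name] = (values, group_a)
    params[name + "/replica"] = (copy.deepcopy(values), group_b)
```

After `differentiate(params, "enc.ffn.w", {"en-de"}, {"en-fr", "en-zh"})` (a made-up parameter name), the German direction keeps the original weights while French and Chinese share a fresh copy, which is what yields the new computation graphs described above.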

A central question in parameter differentiation is the definition of a differentiation criterion that detects which shared parameters should differentiate into more specialized types. The team bases its criterion on the cosine similarity of inter-task gradients: parameters receiving opposing gradients are more likely to be language-specific. Each task's gradient on each shared parameter is evaluated on held-out validation data. The validation data is built with multi-way alignment, i.e. each sentence has translations in all languages, to minimize gradient variation caused by inconsistent sentence semantics across languages.
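To show how per-task validation gradients might be turned into the task groups that receive the replicas, here is a simplified greedy partition (an illustrative simplification, not necessarily the paper's exact grouping rule): seed two groups with the most conflicting pair of tasks, then attach each remaining task to the seed its gradient agrees with more.

```python
import math
from itertools import combinations

def cosine(g1, g2):
    # Cosine similarity between two flattened gradient vectors.
    dot = sum(a * b for a, b in zip(g1, g2))
    return dot / (math.sqrt(sum(a * a for a in g1)) *
                  math.sqrt(sum(b * b for b in g2)))

def partition_tasks(task_grads):
    # Seed the two groups with the most conflicting task pair
    # (lowest gradient cosine similarity) ...
    seed_a, seed_b = min(
        combinations(task_grads, 2),
        key=lambda p: cosine(task_grads[p[0]], task_grads[p[1]]))
    group_a, group_b = {seed_a}, {seed_b}
    # ... then assign every other task to the more similar seed.
    for task in task_grads:
        if task in (seed_a, seed_b):
            continue
        if cosine(task_grads[task], task_grads[seed_a]) >= \
           cosine(task_grads[task], task_grads[seed_b]):
            group_a.add(task)
        else:
            group_b.add(task)
    return group_a, group_b
```

On toy gradients such a split tends to group directions whose gradients point the same way, which is consistent with the reported correlation between the learned sharing configuration and language families.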

The public multilingual datasets OPUS and WMT are used for the many-to-one and one-to-many translation scenarios, and the IWSLT datasets for the many-to-many scenario. The OPUS dataset, built from the original OPUS-100 dataset, pairs English with 12 other languages. The WMT dataset combines data from the WMT’14, WMT’16 and WMT’18 benchmarks with an unbalanced data distribution; five languages were chosen, with data sizes ranging from 0.6M to 39M. The many-to-many scenario is evaluated on the IWSLT’17 dataset, which covers English, German, Italian, Romanian, and Dutch, giving 20 translation directions between the five languages.

The technique consistently outperforms bilingual and multilingual baselines in the one-to-many and many-to-one directions, improving over the multilingual baseline by +1.40 and +1.55 BLEU on average. The method also outperforms existing parameter sharing methods in 20 of 24 translation directions and improves average BLEU by a significant margin. The model size does not depend on the number of languages, allowing greater scalability and flexibility. Since differentiation is applied at multiple granularities rather than on every individual parameter, the final model sizes range from 1.82 to 2.14 times the base model, close to but not exactly the preset 2x.


The authors propose a novel strategy based on parameter differentiation to determine which parameters should be shared and which should be language-specific. During training, shared parameters can dynamically differentiate into more specialized types. Extensive experiments on three multilingual machine translation datasets demonstrate the effectiveness of the method, and the evaluations show that its parameter sharing configurations correlate strongly with linguistic proximity. In the future, the team hopes to let the model learn when to stop differentiation, and to study different differentiation criteria for multilingual scenarios such as zero-shot translation and incremental multilingual translation.

