GreynirSeq Domain Translation Pipeline (22.06)
This is a pipeline for creating GreynirSeq domain-aware translation models. A valid checkpoint of a base translation model based on mBART25 can be fine-tuned into a domain translation model. The resulting model can then be queried with a label for the requested domain.
We recommend the English-Icelandic translation models available at https://repository.clarin.is/repository/xmlui/handle/20.500.12537/125.
The included preprocess script expects a .tsv input file with three fields (domains, english, icelandic); this file is the training corpus. The script finetune.sh can then be run to fine-tune the model until convergence. Finally, evaluate.sh computes BLEU on the Flores development set. See the README file for further details on setting up an environment and fetching data.
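As a rough illustration of the three-field training corpus described above, the following Python sketch formats and re-parses tab-separated rows. It rests on assumptions that should be checked against the README: that the fields appear in the order (domains, english, icelandic), that there is no header row, and that each row holds one sentence pair. The domain labels "medical" and "legal" and the helper names format_row and parse_row are hypothetical and are not part of the pipeline.

```python
# Field order taken from the description above; the header-less layout is an assumption.
FIELDS = ("domains", "english", "icelandic")


def format_row(domain, english, icelandic):
    """Join one training example into a single tab-separated line."""
    cells = (domain, english, icelandic)
    for cell in cells:
        # A tab or line break inside a cell would corrupt the three-column layout.
        if any(ch in cell for ch in "\t\n\r"):
            raise ValueError(f"cell contains a tab or line break: {cell!r}")
    return "\t".join(cells)


def parse_row(line):
    """Split one line back into a dict keyed by field name."""
    cells = line.rstrip("\n").split("\t")
    if len(cells) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(cells)}")
    return dict(zip(FIELDS, cells))


# Hypothetical examples; the domain labels are illustrative only.
rows = [
    ("medical", "The patient has a fever.", "Sjúklingurinn er með hita."),
    ("legal", "The contract is valid.", "Samningurinn er gildur."),
]
lines = [format_row(*row) for row in rows]

if __name__ == "__main__":
    # Write the corpus as UTF-8 so Icelandic characters survive intact.
    with open("train.tsv", "w", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")
```

Rejecting tabs and line breaks up front, rather than quoting them, keeps every row splittable with a plain tab split, which is how many preprocessing scripts read such files. Whether the included preprocess script behaves this way is not stated here.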