419 record(s) found

Search results

  • SloBENCH evaluation framework

    The evaluation framework contains public evaluation scripts. All scripts ship with Dockerfiles that allow for platform-independent evaluation and exact comparison of results. Pre-built Docker images are available in the slobench/eval DockerHub repository. The evaluation framework is used and maintained by the SloBENCH leaderboard website team. SloBENCH submitters can check the compliance of their submissions and evaluate their models on training/validation data prior to submission. The initial version of SloBENCH contains evaluation scripts with examples of training and testing datasets for nine different tasks: named entity recognition, part-of-speech tagging, lemmatization, dependency parsing, semantic role labeling, translation (ENG-SLO, SLO-ENG), summarization, and question answering.
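The evaluation scripts themselves are not reproduced here, but for a task like named entity recognition the standard metric is span-level F1. A minimal sketch, illustrative only and not the actual SloBENCH code:

```python
# Span-level precision/recall/F1 for NER, in the spirit of the
# SloBENCH evaluation scripts (not the actual implementation).

def span_f1(gold, pred):
    """gold, pred: sets of (start, end, label) entity spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact span + label matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PER"), (5, 6, "LOC")}
pred = {(0, 2, "PER"), (5, 6, "ORG")}   # second span has the wrong label
print(span_f1(gold, pred))              # 0.5
```

A compliance check before submission would typically run exactly this kind of comparison between the submitted predictions and the gold annotations.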
  • Trankit model for SST 2.15 1.1

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the SST treebank of spoken Slovenian (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15) featuring transcriptions of spontaneous speech in various everyday settings. It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological feature prediction, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). Please note this model has been published for archiving purposes only. For production use, we recommend using the state-of-the-art Trankit model available here: http://hdl.handle.net/11356/1965 (v1.2 or newer). The latter was trained on both spoken (SST) and written (SSJ) data, and demonstrates significantly higher performance than the model featured in this submission. In comparison with version 1.0, this model was trained on a new train-dev-test split of the SST treebank introduced in release UD v2.15.
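The annotation layers listed above are typically serialized in the 10-column CoNLL-U format of Universal Dependencies. A minimal reader for that format; the sample sentence and its tags below are made up for illustration and are not real Trankit output:

```python
# Minimal reader for the 10-column CoNLL-U format used by Universal
# Dependencies (the serialization produced by tools such as Trankit).
# SAMPLE is an illustrative hand-written fragment, not real model output.

SAMPLE = "1\tne\tne\tPART\tL\t_\t2\tadvmod\t_\t_\n" \
         "2\tvem\tvedeti\tVERB\tGgnspe\t_\t0\troot\t_\t_\n"

def parse_conllu(text):
    tokens = []
    for line in text.strip().splitlines():
        if not line or line.startswith("#"):   # skip comments / blanks
            continue
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],     # universal part-of-speech tag
            "xpos": cols[4],     # language-specific (MULTEXT-East) tag
            "head": int(cols[6]),
            "deprel": cols[7],   # dependency relation to the head
        })
    return tokens

toks = parse_conllu(SAMPLE)
print([(t["form"], t["lemma"], t["deprel"]) for t in toks])
# [('ne', 'ne', 'advmod'), ('vem', 'vedeti', 'root')]
```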
  • Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0

    The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used in training a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. The SloBERTa model is closely related to the French CamemBERT model, https://camembert-model.fr/. The corpora used for training the model have 3.47 billion tokens in total. The subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and training the model are available at https://github.com/clarinsi/Slovene-BERT-Tool. The released model is a PyTorch neural network model, intended for use with the transformers library, https://github.com/huggingface/transformers.
  • Alexia: Lexicon Acquisition Tool for Icelandic (Orðtökutól) 2.0

    The purpose of the lexicon acquisition tool is to facilitate the development and expansion of online dictionaries and glossaries, particularly the Database of Modern Icelandic Inflection (DMII/BÍN) and ISLEX. The tool is designed around the Icelandic Gigaword Corpus (IGC) and the information contained within its TEI-formatted documents; that is, it performs best when using the part-of-speech tags, lemmas and word forms defined in the IGC. The tool can, however, take as input any corpus that uses either the same TEI format as the IGC or a plain-text format, depending on the user's preference. The output files, examples of which are included, are the following:

    - Frequency per word form, with no extra information added. Useful for picking candidates for the online dictionaries and glossaries.
    - Frequency per lemma, with no extra information added. Useful for picking candidates for the online dictionaries and glossaries.
    - Frequency per word form, including all possible lemmas for the given word form. Shows whether the word form can belong to more than one word class, and whether the automatic lemmatization is working correctly.
    - Frequency per lemma, including all possible word forms for the given lemma. Useful for examining whether a certain word form appears much more or less frequently than the others, and thus whether it is used only as part of a certain expression.
    - Frequency per lemma, including the types of text in which the lemma appears. The frequency within each individual text type can also be examined in descending order. Facilitates the creation of specialized glossaries (e.g. a glossary of sport-related words).

    Also included is a list of approximately 60 thousand stopwords, manually chosen from the IGC. These include foreign words, typos, misspelled words, lemmatization errors and acronyms.
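The frequency-list outputs described above can be sketched in a few lines, assuming the corpus has already been reduced to (word form, lemma) pairs; a real run would read the TEI or plain-text input instead:

```python
# Sketch of the tool's frequency lists, computed from (word form, lemma)
# pairs. The token data below is a made-up English toy example.
from collections import Counter, defaultdict

tokens = [("banks", "bank"), ("bank", "bank"), ("banks", "banks"),
          ("bank", "bank"), ("running", "run")]

# 1) frequency per word form, and 2) frequency per lemma
form_freq = Counter(form for form, _ in tokens)
lemma_freq = Counter(lemma for _, lemma in tokens)

# 3) word form -> all lemmas it was assigned; more than one lemma flags
#    word-class ambiguity or a lemmatization error
lemmas_per_form = defaultdict(Counter)
for form, lemma in tokens:
    lemmas_per_form[form][lemma] += 1

print(form_freq["banks"], lemma_freq["bank"], dict(lemmas_per_form["banks"]))
# 2 3 {'bank': 1, 'banks': 1}
```

The per-text-type lists would add one more grouping key (the text type recorded in the TEI header) to the same counting scheme.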
  • GreynirCorrect (3.2.1)

    GreynirCorrect is a Python 3 package and command-line tool for checking and correcting various types of spelling and grammar errors in Icelandic text. GreynirCorrect relies on the Tokenizer package, by the same authors, to tokenize text. More information can be found at https://github.com/mideind/GreynirCorrect, and detailed documentation at https://yfirlestur.is/doc/.
  • GreynirPackage 2.6.1

    GreynirPackage is a Python 3 package for working with Icelandic natural language text. Greynir can parse text into sentence trees, find lemmas, inflect noun phrases, assign part-of-speech tags and much more. Greynir's sentence trees can, inter alia, be used to extract information from text, for instance about people, titles, entities, facts, actions and opinions. Greynir uses the Tokenizer package, by the same authors, to tokenize text. More information at https://github.com/mideind/GreynirPackage and detailed documentation at https://greynir.is/doc/.
  • AnySoftKeyboard with custom autocompletion 22.10

    This is a fork of the open-source Android keyboard AnySoftKeyboard. This version contains a new autocompleter module based on finite-state transducers (FSTs) as implemented in the Apache Lucene library. The autocompleter uses a bigram list from the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/192) to enable next-word suggestions from the start, rather than only after the user has used the keyboard for a certain amount of time, as in the original keyboard. This version still learns from the user, however, enhancing the original list with usage data and boosting frequently used combinations.
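The described behaviour, corpus bigrams providing suggestions from the start while user input boosts frequent combinations, can be sketched as follows. The real module is built on Lucene FSTs, so this is a conceptual toy, not the actual implementation:

```python
# Toy bigram next-word suggester with usage boosting, sketching the
# autocompleter described above (the real module uses Lucene FSTs).
from collections import defaultdict

class BigramSuggester:
    def __init__(self, corpus_bigrams):
        # corpus_bigrams: {(prev, next): count} from a reference corpus,
        # e.g. the IGC bigram list mentioned in the description.
        self.counts = defaultdict(lambda: defaultdict(int))
        for (prev, nxt), c in corpus_bigrams.items():
            self.counts[prev][nxt] = c

    def learn(self, prev, nxt, boost=5):
        # The user typed `nxt` after `prev`: boost that combination.
        self.counts[prev][nxt] += boost

    def suggest(self, prev, k=3):
        ranked = sorted(self.counts[prev].items(), key=lambda x: -x[1])
        return [w for w, _ in ranked[:k]]

s = BigramSuggester({("góðan", "dag"): 10, ("góðan", "daginn"): 4})
print(s.suggest("góðan"))   # corpus order: ['dag', 'daginn']
s.learn("góðan", "daginn")
s.learn("góðan", "daginn")
print(s.suggest("góðan"))   # usage boosts 'daginn' to 14 > 10: ['daginn', 'dag']
```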
  • Neural Machine Translation model for Slovene-English language pair RSDO-DS4-NMT 1.2.6

    This Neural Machine Translation model for the Slovene-English language pair was trained following the NVIDIA NeMo NMT AAYN recipe (for details, see the official NVIDIA NeMo NMT documentation, https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/machine_translation/machine_translation.html, and the NVIDIA NeMo GitHub repository, https://github.com/NVIDIA/NeMo). It provides functionality for translating text from Slovene to English and vice versa. The training corpus was built from publicly available datasets, including Parallel corpus EN-SL RSDO4 1.0 (https://www.clarin.si/repository/xmlui/handle/11356/1457), as well as a small portion of proprietary data. In total, the training corpus consisted of 32,638,758 translation pairs and the validation corpus of 8,163 translation pairs. The model was trained on 64 GPUs and reached a validation SacreBLEU score of 48.3191 (at epoch 37) for translation from Slovene to English and 53.8191 (at epoch 47) for translation from English to Slovene.
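The SacreBLEU scores above measure n-gram overlap between system output and reference translations. A simplified BLEU sketch to show the idea; the real SacreBLEU implementation adds standardized tokenization, corpus-level aggregation and smoothing, so this is not a drop-in replacement:

```python
# Simplified sentence-level BLEU (up to 4-grams, with brevity penalty),
# illustrating how translation quality is scored. SacreBLEU proper adds
# standardized tokenization, corpus aggregation and smoothing.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        clipped = sum((h & r).values())        # clipped n-gram matches
        total = max(sum(h.values()), 1)
        if clipped == 0:                       # no smoothing in this sketch
            return 0.0
        log_prec_sum += math.log(clipped / total)
    # brevity penalty for hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec_sum / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 3))  # 1.0
```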
  • Tokenizer for Icelandic text (3.4.1) (2022-05-31)

    Tokenizer is a compact pure-Python (2.7 and 3) executable program and module for tokenizing Icelandic text. It converts input text to streams of tokens, where each token is a separate word, punctuation mark, number/amount, date, e-mail address, URL/URI, etc. It also segments the token stream into sentences, handling corner cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer
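The abbreviation corner case mentioned above is the classic difficulty in sentence segmentation: a period does not always end a sentence. A toy sketch of the idea; the real Tokenizer package has a far larger abbreviation inventory and richer token classification, and the abbreviation set below is a made-up sample:

```python
# Toy illustration of abbreviation-aware sentence segmentation, one of
# the corner cases the Tokenizer package handles. The abbreviation set
# is a small made-up sample, not the package's actual inventory.
ABBREVIATIONS = {"t.d.", "o.s.frv.", "dr.", "hr."}

def split_sentences(text):
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        # a trailing period ends a sentence only if the word is not
        # a known abbreviation
        if word.endswith(".") and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Hann kom t.d. í gær. Hún fór heim."))
# ['Hann kom t.d. í gær.', 'Hún fór heim.']
```

A naive split on every period would wrongly break the first sentence after "t.d." ("for example"); handling dates mid-sentence requires similar special-casing.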