CLARIN Tool Portal

GreynirSeq Domain Translation Pipeline (22.06)

3 resources

This is a pipeline for creating GreynirSeq domain-aware translation models. A valid checkpoint of a base translation model based on mBART25 can be fine-tuned into a domain translation model, which can then be queried with a label for the requested domain. We recommend the English-Icelandic translation models available at https://repository.clarin.is/repository/xmlui/handle/20.500.12537/125 . The included preprocess script expects a .tsv input file with three fields (domains, english, icelandic); this file is the training corpus. Run finetune.sh to fine-tune the model until convergence, then run evaluate.sh to compute BLEU on the Flores development set. See the README file for further details on setting up an environment and fetching data.
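
The three-field .tsv layout described above can be checked before preprocessing with a short script. The following is only a sketch, not part of the pipeline: the function name read_domain_tsv is hypothetical, and the column order (domain label, English, Icelandic) and the absence of a header row are assumptions taken from the field list in the description, so consult the README for the authoritative format.

```python
import csv
import io

# Assumed column order, taken from the description: domain, English, Icelandic.
EXPECTED_FIELDS = 3


def read_domain_tsv(handle):
    """Yield (domain, english, icelandic) tuples and reject malformed rows.

    Hypothetical helper for illustration; it is not shipped with the pipeline.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if len(row) != EXPECTED_FIELDS:
            raise ValueError(
                f"line {line_no}: expected {EXPECTED_FIELDS} fields, got {len(row)}"
            )
        domain, english, icelandic = (field.strip() for field in row)
        if not (domain and english and icelandic):
            raise ValueError(f"line {line_no}: empty field")
        yield domain, english, icelandic


# Invented sample rows, used only to exercise the check.
sample = (
    "news\tGood morning.\tGóðan daginn.\n"
    "legal\tThe contract is valid.\tSamningurinn er gildur.\n"
)
rows = list(read_domain_tsv(io.StringIO(sample)))
```

For a real corpus, open the file with open(path, encoding="utf-8", newline="") and pass the handle to the same function; a ValueError then points at the first line that does not have exactly three non-empty fields.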

Kaldi L2 Speakers Recipe 22.10

2 resources

This release includes a recipe that shows how to combine the corpus "Samromur L2 22.09" [1] with the "Icelandic Language Models with Pronunciations 22.01" [2] to build automatic speech recognition systems using the Kaldi toolkit. [1] http://hdl.handle.net/20.500.12537/263 [2] http://hdl.handle.net/20.500.12537/172

Grapheme-to-phoneme (g2p) module for Icelandic (22.10)

2 resources

Grapheme-to-phoneme (g2p) module for Icelandic. The module can transcribe Icelandic in four pronunciation variants: standard pronunciation, north Iceland (harðmæli), north-east Iceland (voiced pronunciation), and south Iceland (hv-pronunciation). It supports different levels of detail and four different phonetic alphabets. The default output uses the X-SAMPA phonetic alphabet, without syllabification or stress labeling, according to standard pronunciation. The module transcribes English words using the Icelandic phoneset while staying as close as possible to English transcription rules. A pronunciation dictionary is also part of the package. The package can be installed from PyPI: pip install ice-g2p

Speech Corpora Toolkit (22.06)

2 resources

Speech Corpora Toolkit is a collection of tools for converting audio and scripts into a standardized form that prepares them for segmentation and alignment. The output for each source is standardized.

The Trankit model for linguistic processing of standard Slovenian

2 resources

This is a retrained standard Slovenian model for the Trankit v1.1.1 library (https://pypi.org/project/trankit/). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, feature prediction, and dependency parsing in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained using a dataset published by Universal Dependencies in release 2.12 (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.12). Because its training dataset is larger than that of the original Trankit v1.1.1 model, this version yields superior results and achieves state-of-the-art parsing performance for Slovenian (https://slobench.cjvt.si/leaderboard/view/11). To use this model, follow the instructions in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.

Yfirlestur Docs 22.10

2 resources

Yfirlestur Docs is the source code for a spelling and grammar correction add-on for Icelandic, for use with Google Docs. The plugin provides a user interface that shows errors in a document and lets the user decide how to handle each one, including replacing the flagged text. The source code is intended for third-party development and can be installed and tested locally using Node.js. The plugin requires third-party correction software for its functionality. For development and testing, the open-access Yfirlestur.is API produced by Miðeind was used (see: https://github.com/icelandic-lt/Yfirlestur), but that API is not intended for production use. This software is licensed under the MIT License. More information at https://github.com/icelandic-lt/Yfirlestur-Docs.

GreynirCorrect (3.2.0)

3 resources

GreynirCorrect is a Python 3 package and command-line tool for checking and correcting various types of spelling and grammar errors in Icelandic text. GreynirCorrect relies on the Tokenizer package, by the same authors, to tokenize text. More information can be found at https://github.com/mideind/GreynirCorrect, and detailed documentation (in English) at https://yfirlestur.is/doc/.

Webrice extension (22.01)

2 resources

The Webrice plugin is a software add-on that lets users select text on a web page and listen to it instead of reading it. This Chrome browser extension converts Icelandic text to speech.

The Trankit model for linguistic processing of standard written Slovenian 1.1

2 resources

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the reference SSJ UD treebank featuring fiction, non-fiction, periodical and Wikipedia texts in standard modern Slovenian. It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained using a dataset published by Universal Dependencies in release 2.14 (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14). This is a newer, slightly improved version of the SSJ UD treebank than the one used for the previous version of the model, and the results are similar. To use this model, follow the instructions in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.

OCR Post-Processing Tool for Icelandic 22.10

3 resources

This entry consists of two trained transformer models for correcting OCR errors, along with approximately 50,000 line pairs of OCRed and corrected text. The models were trained on approximately 900,000 lines (~7,000,000 tokens), of which only 50,000 (~400,000 tokens, roughly 6%) came from real OCRed texts. Increasing the amount of such real data can be expected to improve the tool significantly. More information is in README.md.
