Active filters:

  • Metadata provider: DSpace
419 record(s) found

Search results

  • Text classification model fastText-Trendi-Topics 1.0

    The fastText-Trendi-Topics model is a text classification model for categorizing news texts with one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590), such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc. The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. URLs). The set of labels was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf

    The model was trained on the labeled texts using the CLARIN.SI-embed.sl 1.0 word embeddings (http://hdl.handle.net/11356/1204) and validated on a development set of 1,293 texts using the fastText library, 1,000 epochs, and default values for the remaining hyperparameters (see https://github.com/TajaKuzman/FastText-Classification-SLED for the full code). The model achieves a macro-F1 score of 0.85 on a test set of 1,295 texts (best for "vreme" at 0.97, worst for "prosti čas" at 0.67). Note that the SloBERTa-Trendi-Topics 1.0 text classification model (http://hdl.handle.net/11356/1709) is also available; it achieves higher classification accuracy but is slower and computationally more demanding.
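The macro-F1 score cited above is the unweighted mean of per-label F1 scores, so each of the 13 topics counts equally regardless of how frequent it is. A minimal sketch of the metric (the labels in the test below come from the schema above; the function is illustrative and is not the evaluation code from the linked repository):

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over all labels in the gold data."""
    scores = []
    for label in set(gold):
        # per-label true positives, false positives, false negatives
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)
```

Because the mean is unweighted, a rare label such as "vreme" influences the score as much as a frequent one, which is why the per-label extremes (0.97 vs. 0.67) matter.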
  • ABLTagger (PoS) - 3.0.0

    A Part-of-Speech (PoS) tagger for Icelandic. This submission contains pretrained models for ABLTagger v3.0.0 in two versions, small and large, both of which work with the revised tagset and achieve accuracies of ~96.7% and ~97.8%, respectively, on MIM-Gold (cross-validation, excluding "x" and "e" tags). For installation, usage, and other instructions, see https://github.com/icelandic-lt/POS. You should also check there whether a newer version has been released (see README.md - versions). On CLARIN: - Model files
  • Heyra (1.0)

    Heyra is an Android application that provides three loosely coupled components: an implementation of Android's speech recognition interface, an intent handler activity for speech recognition actions from other applications, and an input method service (i.e. a virtual keyboard) that can either be used on its own or launched by supported applications. Heyra can be downloaded from the Google Play Store at https://play.google.com/store/apps/details?id=is.tiro.heyra
  • GreynirCorrect 3.4.4 (22.06)

    GreynirCorrect is a Python 3 package and command-line tool for checking and correcting various types of spelling and grammar errors in Icelandic text. GreynirCorrect relies on the Tokenizer package, by the same authors, to tokenize text. More information can be found at https://github.com/mideind/GreynirCorrect, and detailed documentation at https://yfirlestur.is/doc/.
  • Icelandic GPT-SW3 for spell and grammar checking

    Icelandic GPT-SW3 for spell and grammar checking is a GPT-SW3 model fine-tuned on Icelandic, and particularly on the spell and grammar checking task. The 6.7B GPT-SW3 model (https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b) was pre-trained on Icelandic texts and fine-tuned on Icelandic error corpora. Texts for pre-training included texts from the Icelandic Gigaword Corpus (http://hdl.handle.net/20.500.12537/253) and MÍM (http://hdl.handle.net/20.500.12537/195). For fine-tuning, the following Icelandic error corpora were used: the Icelandic Error Corpus (http://hdl.handle.net/20.500.12537/105), the Icelandic L2 Error Corpus (http://hdl.handle.net/20.500.12537/280), the Icelandic Dyslexia Error Corpus (http://hdl.handle.net/20.500.12537/281), and the Icelandic Child Language Error Corpus (http://hdl.handle.net/20.500.12537/133). The model is fine-tuned on three different tasks:

    - Task 1: The model evaluates one text with regard to e.g. grammar and spelling, and returns all errors in the input text as a list, with their positions in the text and their corrections.
    - Task 2: The model evaluates two texts and chooses which one is better with regard to e.g. grammar and spelling.
    - Task 3: The model evaluates one text with regard to e.g. grammar and spelling, and returns a corrected version of the text.

    For task 1, the model delivers an F0.5 score of 0.28 on the Grammatical Error Correction Test Set (http://hdl.handle.net/20.500.12537/320), and for task 2 it delivers 63.95% accuracy on the same test set. For task 3, the model scores 0.925559 on the GLEU metric (a modification of BLEU for grammatical error correction) and 0.02 in TER (translation error rate).
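The F0.5 score reported for task 1 weights precision twice as heavily as recall, the usual choice in grammatical error correction, where flagging correct text as an error is considered more harmful than missing an error. A minimal sketch of the metric (illustrative, not the evaluation code used for the test set):

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall.
    With beta = 0.5, precision counts roughly twice as much as recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, with beta = 0.5 a precision-heavy system (P = 0.8, R = 0.2) scores higher than a recall-heavy one (P = 0.2, R = 0.8), even though both have the same F1.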
  • WordnetLoom 2.0

    WordnetLoom 2.0 executable files for plWordnet 4.0. Source code is available at https://github.com/CLARIN-PL/WordnetLoom. WordnetLoom is a wordnet editor application built for the construction of plWordNet, the largest Polish wordnet. WordnetLoom provides two means of interaction: a form-based interface, implemented initially, and a visual, graph-based interface introduced more recently. The visual, graph-based interactive presentation of the wordnet structure enables browsing and direct editing of the structure of lexico-semantic relations and synsets. WordnetLoom works in a distributed environment, i.e. several linguists can work simultaneously from different sites on the same central database.
  • Icelandic NER API - Ensemble model (21.09)

    A dockerized Named Entity Recognition (NER) API for Icelandic. It uses the IceBERT language model from Miðeind as its primary model, but it also offers the possibility of using three other transformer language models with it (ELECTRA-base, convbert-small, and multilingual-BERT), combining them with CombiTagger. They were all fine-tuned for NER using MIM-GOLD-NER. IceBERT was the best individual model, achieving an F1-score of ~92.73 on the MIM-GOLD-NER test set, while the combination of the four, in the form of CombiTagger, achieved an F1-score of 93.21. The code for the API is available at https://github.com/icelandic-lt/Icelandic-NER-API and the files for the fine-tuned models are available in this submission.
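CombiTagger combines the per-token outputs of several taggers by voting, which is how four models scoring at most ~92.73 individually can reach 93.21 together. A minimal majority-vote sketch (the tag names in the test are illustrative, and the tie-breaking rule is a simplification of CombiTagger's actual strategies):

```python
from collections import Counter

def combine_tags(predictions):
    """Majority-vote combination of per-token tag sequences.

    predictions: list of tag sequences, one per tagger, all the same length.
    Ties are broken in favour of the first (primary) tagger's output, a
    simple stand-in for CombiTagger's weighted voting schemes.
    """
    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best, best_n = token_tags[0], counts[token_tags[0]]
        for tag, n in counts.items():
            if n > best_n:
                best, best_n = tag, n
        combined.append(best)
    return combined
```

The intuition is that different models tend to make different mistakes, so the majority is right more often than any single model.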
  • NeMo Conformer CTC BPE E2E Automated Speech Recognition service RSDO-DS2-ASR-E2E-API 1.1

    Automated Speech Recognition service for NeMo Conformer CTC BPE E2E models. For more details about building such models, see the official NVIDIA NeMo documentation (https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html) and the NVIDIA NeMo GitHub (https://github.com/NVIDIA/NeMo). A model for automated speech recognition of Slovene speech can be downloaded from http://hdl.handle.net/11356/1740. The service accepts audio files in 16 kHz, 16-bit PCM, mono WAV format as input. The maximum accepted audio duration is 300 s. Note that transcription of one 300 s audio file on CPU will take advantage of all available cores, consume up to 16 GB of RAM, and may take ~180 s (on a system with 24 vCPUs). See the service README.md for further details.
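The input constraints above (16 kHz, 16-bit PCM, mono, at most 300 s) can be verified with Python's standard `wave` module before submitting a file to the service. A minimal sketch; the helper name is illustrative and not part of the service's API:

```python
import wave

def check_asr_input(path):
    """Check that a WAV file meets the service's input requirements:
    16 kHz sample rate, 16-bit PCM samples, mono, and at most 300 s long."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
        return (w.getframerate() == 16000
                and w.getsampwidth() == 2   # 16-bit = 2 bytes per sample
                and w.getnchannels() == 1   # mono
                and duration <= 300.0)
```

Files that fail this check would need resampling or channel downmixing (e.g. with sox or ffmpeg) before being sent to the service.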
  • Annotald 1.0.0 (22.06)

    Annotald is a program for annotating parsed corpora in the Penn Treebank format. For more information on the format (as instantiated by the Penn Parsed Corpora of Historical English), see the documentation by Beatrice Santorini. Annotald was originally written by Anton Ingason as part of the Icelandic Parsed Historical Corpus project. This version of Annotald has been adapted for the parsing schema used in GreynirPackage, Miðeind's rule-based deep parser. Annotald is available under the terms of the GNU General Public License (GPL) version 3 or (at your option) any later version. Please see the LICENSE file included with the source code for more information.
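The Penn Treebank format that Annotald edits encodes parse trees as labeled bracketings. A toy reader for such bracketed strings, assuming whitespace-separated tokens without escaped parentheses (this is an illustration of the format, not Annotald's own parser):

```python
def parse_penn(s):
    """Parse a Penn-Treebank-style bracketed string into nested lists,
    e.g. "(NP (D the) (N dog))" -> ["NP", ["D", "the"], ["N", "dog"]]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def read(i):
        # read one node (subtree or leaf token) starting at position i
        if tokens[i] == "(":
            node, i = [], i + 1
            while tokens[i] != ")":
                child, i = read(i)
                node.append(child)
            return node, i + 1  # skip closing ")"
        return tokens[i], i + 1

    tree, _ = read(0)
    return tree
```

Each node is a label followed by its children, which is the structure Annotald's graphical tree view presents for editing.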
  • MT: Moses-SMT (1.0)

    Moses phrase-based statistical machine translation (Moses PBSMT) is a system used to develop and run machine translation models. It is distributed here as four packages:

    1. Code from a GitHub repository to train and run models
    2. A pretrained is-en system (Docker)
    3. A pretrained en-is system (Docker)
    4. A frontend to pre- and postprocess text for translation (Docker)

    The models here are not (exactly) the same as those used for human evaluation; these models have additionally been trained on open dictionaries to extend their vocabularies.