Active filters:

  • Resource type: Unspecified
419 record(s) found

Search results

  • The CLASSLA-StanfordNLP model for named entity recognition of standard Slovenian 1.0

    This model for named entity recognition of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204).
  • Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur (22.09)

    This Byte-Level Neural Error Correction Model for Icelandic is a fine-tuned byT5-base Transformer model for error correction in natural language. It acts as a machine translation model in that it “translates” from Icelandic text containing errors to correct Icelandic. The model is trained on parallel synthetic error data and real error data from the iceErrorCorpus (IceEC, http://hdl.handle.net/20.500.12537/73) and the three specialised error corpora (L2: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, child language: http://hdl.handle.net/20.500.12537/133). The synthetic error data (35M lines of parallel data) was created by filtering and then scrambling the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/192) with a variety of error patterns to simulate real grammatical and typographical errors. The pretrained byT5 model was trained on the synthetic data and finally fine-tuned on the real error data from IceEC. It can correct a variety of textual errors, even in texts containing many errors, such as those written by people with dyslexia. Measured on the IceEC test data, the model scores 0.862917 on the GLEU metric (modified BLEU for grammatical error correction) and 0.06% in TER (translation error rate).
  • BinPackage 0.4.4 (22.10)

    BinPackage is a Python package that embeds the vocabulary of the DMII, the Database of Modern Icelandic Inflection (BÍN, Beygingarlýsing íslensks nútímamáls, https://bin.arnastofnun.is), and offers various lookups and queries of the data. The database, maintained by The Árni Magnússon Institute for Icelandic Studies, contains over 6.5 million entries, over 3.1 million unique word forms, and about 300,000 distinct lemmas. The database has been encapsulated in an easy-to-install Python package and compressed from a 400+ megabyte CSV file to an ~80 megabyte indexed binary structure. More information at: https://github.com/mideind/BinPackage
  • Service for querying dependency treebanks Drevesnik 1.0

    Drevesnik (https://orodja.cjvt.si/drevesnik/) is an online service for querying syntactically parsed corpora in Slovenian that are annotated with the Universal Dependencies scheme. It combines an easy-to-use query language with user-friendly graph visualizations. It is based on the open-source dep_search tool (https://github.com/TurkuNLP/dep_search), which was localized and modified to also support querying by JOS morphosyntactic tags, random distribution of results, and filtering by sentence length. The source code and the documentation for the search backend and the web user interface are publicly available in the CLARIN.SI GitHub repository https://github.com/clarinsi/drevesnik. This submission corresponds to release 1.0: https://github.com/clarinsi/drevesnik/releases/tag/1.0.
  • GreynirCorrect4LT (1.0)

    This is a slightly adapted version of Miðeind's spell and grammar checker GreynirCorrect (CLARIN link: http://hdl.handle.net/20.500.12537/174). This version is implemented for use in a text-to-speech text pre-processing pipeline, but also includes guidelines for quickly adapting GreynirCorrect to other use cases in language technology applications, where the needs may differ from those of spell and grammar checking for general users.
  • Biaffine-based UD Parser for Icelandic 22.12

    This Universal Dependencies parser for Icelandic was trained with Diaparser [1]. This version was trained on v2.11 of UD_Icelandic-IcePaHC [2] and UD_Icelandic-Modern [3]. (Note that the texts in UD_Icelandic-Modern [3] labeled RUV_TGS_2017 and RUV_ESP_2017 were not included, as these were originally parsed with COMBO-based UD Parser 22.10 [4] and the output subsequently corrected.) The parser utilizes information from an ELECTRA language model [5]. Its UAS (unlabeled attachment score) is 89.58 and its LAS (labeled attachment score) is 86.46. [1] Diaparser: https://github.com/Unipisa/diaparser [2] UD_Icelandic-IcePaHC: https://github.com/UniversalDependencies/UD_Icelandic-IcePaHC/ [3] UD_Icelandic-Modern: https://github.com/UniversalDependencies/UD_Icelandic-Modern/ [4] COMBO-based UD Parser 22.10: http://hdl.handle.net/20.500.12537/272 [5] electra-base-igc-is: https://huggingface.co/jonfd/electra-base-igc-is
  • Yfirlestur 1.0.0 (22.06)

    Yfirlestur.is is a public website where you can enter or submit your Icelandic text and have it checked for spelling and grammar errors. The tool also gives hints on words and structures that might not be appropriate, depending on the intended audience for the text. The core spelling and grammar checking functionality of Yfirlestur.is is provided by the GreynirCorrect engine, by the same authors. This software is licensed under the MIT License. More information at https://github.com/mideind/Yfirlestur.
  • Biaffine-based UD Parser 22.10

    This Universal Dependencies parser for Icelandic was trained with Diaparser [1] on IcePaHC [2] and UD_Icelandic-Modern [3]. The latter was revised before training to remove some duplicate sentences. The parser utilizes information from an ELECTRA language model [4]. Its UAS (unlabeled attachment score) is 89.52 and its LAS (labeled attachment score) is 86.23.
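
The UAS and LAS figures quoted for the two Biaffine-based parsers can be illustrated with a minimal sketch. It assumes each token is represented as a (head index, dependency label) pair and that gold and predicted tokens are already aligned one-to-one. The function name `attachment_scores` and the toy annotations are hypothetical, chosen only for illustration. This is not the evaluation procedure used for the records above, which may treat tokenization or punctuation differently.

```python
from typing import List, Tuple

# Each token is a (head_index, dependency_label) pair; head 0 denotes the root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], predicted: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned gold and predicted tokens."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    # UAS counts tokens whose predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS additionally requires the dependency label to match.
    label_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * label_hits / total


# Toy four-token sentence with hypothetical annotations.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "obl")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

In the toy sentence, three of the four heads are correct and two of the four tokens have both the correct head and the correct label, so LAS can never exceed UAS.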