CLARIN Tool Portal

The CLASSLA-StanfordNLP model for named entity recognition of standard Croatian 1.0

3 resources

This model for named entity recognition of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205).

EdUKate Czech-Ukrainian translation model 2024

2 resources

This package includes a Czech-to-Ukrainian translation model adapted for the educational domain. The model is exported in the TensorFlow Serving format (using Tensor2tensor version 1.6.6), so it can be used in the Charles Translator service (https://translator.cuni.cz) and in the web portal Škola s nadhledem. It was developed within the EdUKate project, which aims to reduce the language barriers that non-Czech-speaking children in the Czech Republic face in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.

GlowTTS models for Talrómur1 (22.10)

6 resources

This release contains GlowTTS models for four different voices from the Talrómur 1 [1] corpus. The models were trained using the Coqui TTS library after it was adapted for Icelandic. For each voice, the release includes the model, the model configuration, the training log, and the recipe used. [1] http://hdl.handle.net/20.500.12537/104

Semi-supervised Icelandic-Polish Translation System (22.09)

8 resources

This bidirectional Icelandic-Polish translation model was trained with fairseq (https://github.com/facebookresearch/fairseq) using semi-supervised translation, starting from the mBART50 model. It was trained with a multi-task curriculum: first to denoise sentences, then to translate using aligned parallel texts. Finally, the model was given monolingual texts in both Icelandic and Polish, from which it iteratively creates back-translations. For the PL-IS direction, the model achieves a BLEU score of 27.60 on held-out true parallel training data and 15.30 on the out-of-domain Flores devset. For the IS-PL direction, it achieves a score of 27.70 on the true data and 13.30 on the Flores devset.

Database of the Western South Slavic Verb HyperVerb -- Derivation

4 resources

The Western South Slavic verbal database (WeSoSlaV) contains the 3000 most frequent Slovenian verbs and the 5300 most frequent BCS verbs, all coded for a number of properties related to verb derivation. The database is a table in which each verb has its own row and the coded properties are organized in columns. The coded properties include root information, whether the verb has prefixes and which ones, whether the verb has suffixes and which ones, and so on. All coded properties are explained in the accompanying PDF file.
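
As a rough illustration of the one-row-per-verb layout described above, the sketch below reads a small tab-separated table with Python's standard csv module and selects the prefixed verbs. The column names (verb, root, prefix, suffix), the file format and the three sample rows are hypothetical placeholders; the actual columns and their values are the ones defined in the accompanying PDF file.

```python
import csv
import io

# Hypothetical sample: the column names and rows are invented for
# illustration and do not reproduce the actual WeSoSlaV coding scheme.
SAMPLE = (
    "verb\troot\tprefix\tsuffix\n"
    "pisati\tpis\t\ta\n"
    "napisati\tpis\tna\ta\n"
    "prepisati\tpis\tpre\ta\n"
)


def load_verbs(text):
    """Parse a tab-separated table into a list of dicts, one per verb."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def prefixed(rows):
    """Return the verbs whose (hypothetical) prefix column is non-empty."""
    return [row["verb"] for row in rows if row["prefix"]]


rows = load_verbs(SAMPLE)
print(prefixed(rows))
```

With a real export, the same pattern would apply after swapping in the actual column headers and opening the file instead of the in-memory sample.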

Open morphology of Finnish

2 resources

Omorfi is a free and open source project containing various tools and data for handling Finnish texts in a linguistically motivated manner. The main components of this repository are: 1) a lexical database containing hundreds of thousands of words (cf. lexical statistics), 2) a collection of scripts to convert the lexical database into formats used by upstream NLP tools (cf. lexical processing), 3) an autotools setup to build and install (or package, or deploy) the scripts, the database, and simple APIs and convenience processing tools, and 4) a collection of relatively simple APIs for a selection of languages and scripts to apply the NLP tools and access the database.

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Croatian

3 resources

This model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.1.
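
To make the three label types concrete, the following sketch pulls UPOS, XPOS and FEATS out of a single token line, assuming the annotations have been exported in the CoNLL-U format, where these labels occupy the fourth, fifth and sixth tab-separated columns. The helper name parse_token and the sample line are illustrative only; the line is hand-written and is not actual output of the model.

```python
def parse_token(line):
    """Split one CoNLL-U token line and return its morphosyntactic labels."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("a CoNLL-U token line has exactly 10 columns")
    feats = {}
    if cols[5] != "_":  # "_" marks an empty FEATS column
        for pair in cols[5].split("|"):
            key, value = pair.split("=", 1)
            feats[key] = value
    return {"form": cols[1], "upos": cols[3], "xpos": cols[4], "feats": feats}


# Hand-written example line, not actual model output.
LINE = "1\tZagreb\tZagreb\tPROPN\tNpmsn\tCase=Nom|Gender=Masc|Number=Sing\t_\t_\t_\t_"

token = parse_token(LINE)
print(token["upos"], token["xpos"], token["feats"])
```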

Smashcima (2025-03-28)

5 resources

Smashcima is a library and framework for synthesizing images of handwritten music, used to create synthetic training data for OMR models. It is primarily intended to be used as part of optical music recognition workflows, especially with domain adaptation in mind. The target user is therefore a machine-learning, document processing, library sciences, or computational musicology researcher with minimal skills in Python programming.

Smashcima is the only tool that simultaneously:
- synthesizes handwritten music notation,
- produces not only raster images but also segmentation masks, classification labels, bounding boxes, and more,
- synthesizes entire pages as well as individual symbols,
- synthesizes background paper textures,
- also synthesizes polyphonic and pianoform music images,
- accepts just MusicXML as input,
- is written in Python, which simplifies its adoption and extensibility.

Smashcima therefore brings a unique new capability to optical music recognition (OMR): synthesizing a near-realistic image of handwritten sheet music from just a MusicXML file. Unlike notation editors, which work with a fixed set of fonts and layout rules, it can adapt handwriting styles from existing OMR datasets to arbitrary music (beyond the music encoded in those datasets) and randomize the layout to simulate the imprecision of handwriting, while guaranteeing the semantic correctness of the rendered output. Crucially, the rendered image also comes with the positions of all the visual elements of music notation, so that both object detection-based and sequence-to-sequence OMR pipelines can use Smashcima as a synthesizer of training data. (In combination with the LMX canonical linearization of MusicXML, one can imagine the endless possibilities of running Smashcima on inputs from a MusicXML generator.)

Word embeddings CLARIN.SI-embed.mk 0.1

3 resources

CLARIN.SI-embed.mk contains word embeddings induced from a large collection of Macedonian texts crawled from the .mk top-level domain. The embeddings are based on the skip-gram model of fastText trained on 323,158,626 tokens of running text for 448,182 lowercased surface forms.
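
A minimal sketch of how such embeddings might be queried is shown below, assuming they are distributed in the plain-text vector format that fastText writes (a header line with the vocabulary size and dimension, then one word and its vector per line). That format is an assumption about this release, and the three toy vectors are invented for illustration, not taken from CLARIN.SI-embed.mk.

```python
import io
import math


def load_vectors(handle):
    """Read word vectors from a text stream with a 'count dim' header line."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy 3-dimensional vectors invented for illustration only.
TOY = "3 3\nград 1.0 0.0 0.0\nсело 0.8 0.6 0.0\nи 0.0 0.0 1.0\n"

vectors = load_vectors(io.StringIO(TOY))
print(round(cosine(vectors["град"], vectors["село"]), 2))
```

Because the vocabulary consists of lowercased surface forms, query words would need to be lowercased before lookup.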

WMT21 Marian translation models (ca-ro,it,oc)

1 resource

Marian multilingual translation model from Catalan into Romanian, Italian and Occitan. Primary CUNI submission for WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task.
