Search results

419 record(s) found

  • Word embeddings CLARIN.SI-embed.mk 2.0

    CLARIN.SI-embed.mk contains word embeddings induced from a large collection of Macedonian texts crawled from the .mk top-level domain. The embeddings are based on the skip-gram model of fastText, trained on 933,231,582 tokens of running text for 986,670 lowercased surface forms. The difference from the previous version of the embeddings is that this version was trained on the original dataset expanded with the MaCoCu-mk web crawl corpus (http://hdl.handle.net/11356/1512).
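Embeddings like these are typically distributed in the word2vec text format (a header line with counts, then one word and its vector per line) and consumed by ranking words with cosine similarity. A minimal stdlib sketch under that format assumption; the three-dimensional toy vectors below are illustrative, not taken from the actual resource:

```python
import math

def load_vec(lines):
    """Parse word2vec text format: first line 'count dim', then 'word v1 v2 ...'."""
    it = iter(lines)
    next(it)  # skip the header line
    vecs = {}
    for line in it:
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = [float(x) for x in parts[1:]]
    return vecs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(vecs, word, k=2):
    """Rank all other words by cosine similarity to `word`."""
    q = vecs[word]
    ranked = sorted(
        ((w, cosine(q, v)) for w, v in vecs.items() if w != word),
        key=lambda p: -p[1],
    )
    return ranked[:k]

# Toy vectors standing in for the real lowercased Macedonian surface forms.
toy = [
    "3 3",
    "куче 0.9 0.1 0.0",
    "мачка 0.8 0.2 0.1",
    "град 0.0 0.1 0.9",
]
vecs = load_vec(toy)
print(nearest(vecs, "куче", k=1)[0][0])  # prints "мачка", the closest toy word
```

The same `load_vec` logic applies line-by-line to a real .vec file opened with `open(path, encoding="utf-8")`, though for 986,670 forms a streaming or memory-mapped loader is the more practical choice.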
  • CroSloEngual BERT 1.1

    Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. A state-of-the-art tool that represents words/tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT consists of neural network weights and configuration files in PyTorch format (i.e. to be used with the PyTorch library). Changes in version 1.1: fixed vocab.txt file, as the previous version had an error causing very bad results during fine-tuning and/or evaluation.
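The vocab.txt fix matters because BERT-style models map text to subword ids with a greedy longest-match tokenizer driven entirely by that file. A stdlib sketch of the mechanism (an illustration of WordPiece-style splitting, not the exact HuggingFace implementation) shows how a corrupted vocabulary silently destroys the input:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, as in BERT-style tokenizers."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # any unmatchable span maps the whole word to [UNK]
        pieces.append(piece)
        start = end
    return pieces

good_vocab = {"embed", "##ding", "##s"}
bad_vocab = {"embed"}  # a truncated vocabulary, standing in for the 1.0 defect

print(wordpiece("embeddings", good_vocab))  # ['embed', '##ding', '##s']
print(wordpiece("embeddings", bad_vocab))   # ['[UNK]'] -- information destroyed
```

With a damaged vocab.txt, whole words collapse to `[UNK]` (or split into wrong ids), which is consistent with the "very bad results during fine-tuning and/or evaluation" the changelog describes.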
  • The CLASSLA-StanfordNLP model for lemmatisation of standard Croatian 1.1

    The model for lemmatisation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the hrLex inflectional lexicon (http://hdl.handle.net/11356/1232). The estimated F1 of the lemma annotations is ~97.6. The difference from the previous version of the model is that it is trained with the lemmatiser padding bug removed, cf. https://github.com/stanfordnlp/stanfordnlp/issues/143.
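CLASSLA-StanfordNLP reads and writes CoNLL-U, where the lemma is the third of ten tab-separated columns on each token line. A stdlib sketch of pulling (form, lemma) pairs out of such output; the short Croatian sample is illustrative, not from hr500k:

```python
def lemmas(conllu):
    """Extract (FORM, LEMMA) pairs from CoNLL-U text: 10 tab-separated columns,
    FORM in column 2 and LEMMA in column 3; '#' lines are comments."""
    out = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():  # skip multiword ranges like 3-4
            out.append((cols[1], cols[2]))
    return out

sample = """# text = Vidimo gradove.
1\tVidimo\tvidjeti\tVERB\t_\t_\t0\troot\t_\t_
2\tgradove\tgrad\tNOUN\t_\t_\t1\tobj\t_\t_
3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_
"""
print(lemmas(sample))  # [('Vidimo', 'vidjeti'), ('gradove', 'grad'), ('.', '.')]
```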
  • PyTorch model for Slovenian Named Entity Recognition SloNER 1.0

    SloNER is a model for Slovenian Named Entity Recognition. It is a PyTorch neural network model, intended for use with the HuggingFace transformers library (https://github.com/huggingface/transformers). The model is based on the Slovenian RoBERTa contextual embeddings model SloBERTa 2.0 (http://hdl.handle.net/11356/1397) and was trained on the SUK 1.0 training corpus (http://hdl.handle.net/11356/1747). The source code of the model is available in the GitHub repository https://github.com/clarinsi/SloNER.
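Token classifiers of this kind emit one BIO label per token, and turning those labels into entity spans is a small deterministic post-processing step. A stdlib sketch of BIO decoding (the tokens and tags below are illustrative, not actual SloNER output):

```python
def bio_to_spans(tokens, tags):
    """Group tokens tagged B-X / I-X into (entity_text, entity_type) spans."""
    spans, cur, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                spans.append((" ".join(cur), cur_type))
            cur, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and cur and tag[2:] == cur_type:
            cur.append(tok)
        else:  # "O" or an inconsistent I- tag closes any open span
            if cur:
                spans.append((" ".join(cur), cur_type))
            cur, cur_type = [], None
    if cur:
        spans.append((" ".join(cur), cur_type))
    return spans

toks = ["Janez", "Novak", "dela", "v", "Ljubljani"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(bio_to_spans(toks, tags))  # [('Janez Novak', 'PER'), ('Ljubljani', 'LOC')]
```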
  • The CLASSLA-Stanza model for UD dependency parsing of standard Slovenian 2.0

    This model for UD dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated LAS of the parser is ~91.11. The difference from the previous version of the model is that it was trained on the SUK training corpus and uses the updated embeddings.
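The quoted LAS (labeled attachment score) is the fraction of tokens whose predicted head and dependency label both match the gold annotation. A stdlib sketch of the metric on a toy sentence (not the actual SUK evaluation):

```python
def las(gold, pred):
    """LAS = share of tokens with correct (head, deprel); inputs are
    parallel per-token lists of (head_index, deprel) pairs."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)

# Toy 4-token sentence: gold vs predicted (head, deprel) pairs.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "punct")]
print(round(las(gold, pred), 2))  # 0.75 -- one mislabelled edge out of four
```

Unlabeled attachment score (UAS) is the same computation comparing only the head indices.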
  • Slavic Forest, Norwegian Wood (models)

    Trained models for UDPipe used to produce our final submission to the VarDial 2017 CLP shared task (https://bitbucket.org/hy-crossNLP/vardial2017). The SK model was trained on CS data, the HR model on SL data, and the SV model on a concatenation of DA and NO data. The scripts and commands used to create the models are part of a separate submission (http://hdl.handle.net/11234/1-1970). The models were trained with UDPipe version 3e65d69 from 3 Jan 2017, obtained from https://github.com/ufal/udpipe; their functionality with newer or older versions of UDPipe is not guaranteed.

    We list here the Bash command sequences that can be used to reproduce our results submitted to VarDial 2017. The input files must be in CoNLL-U format. The models only use the form, UPOS, and universal features fields (SK only uses the form). You must have UDPipe installed. The feats2FEAT.py script, which prunes the universal features, is bundled with this submission.

    SK -- tag and parse with the model:
      udpipe --tag --parse sk-translex.v2.norm.feats07.w2v.trainonpred.udpipe sk-ud-predPoS-test.conllu
    A slightly better after-deadline model (sk-translex.v2.norm.Case-feats07.w2v.trainonpred.udpipe), which we mention in the accompanying paper, is also included and is applied in the same way.

    HR -- prune the features to keep only Case, then parse with the model:
      python3 feats2FEAT.py Case < hr-ud-predPoS-test.conllu | udpipe --parse hr-translex.v2.norm.Case.w2v.trainonpred.udpipe

    NO -- put the UPOS annotation aside, tag the features with the model, merge the set-aside UPOS annotation back in, and parse with the model (this workaround is needed because UDPipe cannot be told to keep UPOS and only change the features):
      cut -f1-4 no-ud-predPoS-test.conllu > tmp
      udpipe --tag no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe no-ud-predPoS-test.conllu | cut -f5- | paste tmp - | sed 's/^\t$//' | udpipe --parse no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe
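The feature-pruning step keeps only chosen universal features in the FEATS column (column 6 of CoNLL-U, encoded as Name=Value pairs joined by "|"). A stdlib sketch of the same idea, under that encoding assumption (an illustration, not the bundled feats2FEAT.py script):

```python
def prune_feats(line, keep=frozenset({"Case"})):
    """Keep only the named features in the FEATS column of one CoNLL-U line;
    comments, multiword ranges, and empty FEATS pass through unchanged."""
    cols = line.split("\t")
    if len(cols) != 10 or cols[5] == "_":
        return line
    kept = [f for f in cols[5].split("|") if f.split("=")[0] in keep]
    cols[5] = "|".join(kept) if kept else "_"
    return "\t".join(cols)

row = "2\tgradove\tgrad\tNOUN\tNcmpa\tCase=Acc|Gender=Masc|Number=Plur\t1\tobj\t_\t_"
print(prune_feats(row).split("\t")[5])  # Case=Acc
```

Applied line-by-line over a file, this reproduces the "keep only Case" preprocessing used in the HR pipeline above.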
  • UPSKILLS Teaching and Learning Content

    This is a collection of modular teaching and learning content created in the UPSKILLS project (UPgrading the SKIlls of Linguistics and Language Students) and downloaded from the Moodle platform in .mbz format. The learning content can be reused and adapted by curriculum designers, lecturers, and instructors of courses in linguistics and language-related subjects. Different blocks, or individual units within a block, can be combined to create new learning paths at the BA and MA levels; some of the learning content is also suitable for the PhD level. Students can also use the content for self-study, although it is not a MOOC (Massive Open Online Course).

    Before downloading the files, it is recommended to:
    - use the project URL to read the descriptions of each learning block on the UPSKILLS project website;
    - use the demo link to preview the learning content on the Moodle platform and decide which learning blocks you would like to download.

    Each learning block in Moodle contains several units on different topics, including presentations, learning activities, assignments, and a final student project. We have also included a short guide explaining how the materials are organised and how they can be used and cited. Please note that the .mbz files can be used exclusively on Moodle systems, version 3.8+; the material can be imported directly in .mbz format without changes. If help is required, please consult the Moodle User Guide > Course Restore: https://docs.moodle.org/402/en/Course_restore. The "Processing Texts and Corpora" and "Introduction to Language Data: Standards and Repositories" blocks contain interactive presentations and quizzes created in H5P, which means that the H5P plugin must be available in your Moodle instance in order to view and reuse this content; the content also makes use of the Tiles format, stashes, and badges, and the badges are given as a separate downloadable file.

    Nevertheless, the H5P content can be downloaded directly from the UPSKILLS Moodle platform and reused outside Moodle. H5P is an HTML5-based format that has become popular for creating interactive learning objects (e.g. presentations, videos, gamified learning activities). It is a free and open format that can be used as a plugin in Learning Management Systems, such as Moodle, Blackboard, Brightspace, and OpenEdX, and in Content Management Systems, such as WordPress, Drupal, and Canvas. See the H5P administrators' guides for more information: https://help.h5p.com/hc/en-us/sections/7556764070429-Guides.

    All UPSKILLS learning content is made available under the CC-BY 4.0 International license. This means you can copy and share it with others in any medium or format, even for commercial purposes, provided that you give appropriate credit to the source, include the license link, and indicate whether any changes were made to the original content.

    To learn more about the UPSKILLS project, please visit the project website and the following guides:
    1. Research-Based Teaching: Guidelines and Best Practices
    2. Integrating Research Infrastructures into Teaching (this guide is especially relevant if you are interested in reusing the learning content created by CLARIN, namely Introduction to Language Data: Standards and Repositories)
    3. Integrating Industry-Based Research into Teaching

    Finally, all project deliverables are accessible in the UPSKILLS Community on Zenodo: https://zenodo.org/communities/upskills/?page=1&size=20.