419 record(s) found

Search results

  • ZRCola 2

    ZRCola is an input system designed mainly, although not exclusively, for linguistic use. It allows the user to combine basic letters with any diacritic marks and insert the resulting complex characters into texts with ease. The system comprises an input program and a font, which can also be installed separately. The font is based on the Unicode standard and includes a vastly enlarged set of Latin, Cyrillic and other characters for Slavic writing systems in the Private Use Area.
  • ELMo embeddings model, Slovenian

    ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on the entire Gigafida 2.0 corpus (https://viri.cjvt.si/gigafida/System/Impressum) for 10 epochs. The 1,364,064 most common tokens were provided as the vocabulary during training. The model can also infer out-of-vocabulary (OOV) words, since the neural network input is at the character level.
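A minimal sketch of why character-level input lets a model handle OOV words: any token, seen or unseen, can be mapped to a fixed-length sequence of character IDs, so no lookup table of known words is required. The IDs, markers and padding scheme below are illustrative assumptions, not the actual bilm-tf implementation.

```python
# Sketch: encode any token as a fixed-length sequence of character IDs,
# the way character-level models (such as ELMo's character CNN) consume
# words never seen during training. All constants here are illustrative.

MAX_CHARS = 10          # fixed width per token (illustrative)
PAD_ID = 0              # padding slot
BOW_ID, EOW_ID = 1, 2   # begin/end-of-word markers

def char_ids(token: str, max_chars: int = MAX_CHARS) -> list[int]:
    """Map a token to a fixed-length list of character IDs.

    Works for any string, so out-of-vocabulary words need no
    special handling: they simply produce a new ID sequence.
    """
    # Offset byte values so they never collide with the marker IDs.
    ids = [BOW_ID] + [3 + b for b in token.encode("utf-8")[: max_chars - 2]] + [EOW_ID]
    return ids + [PAD_ID] * (max_chars - len(ids))

# A frequent word and a made-up OOV word are encoded the same way.
print(char_ids("in"))      # [1, 108, 113, 2, 0, 0, 0, 0, 0, 0]
print(char_ids("zxqvwy"))  # [1, 125, 123, 116, 121, 122, 124, 2, 0, 0]
```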
  • GreynirT2T Serving - En--Is NMT Inference and Pre-trained Models (1.0)

    Code and models required to run the GreynirT2T Transformer NMT system for translation between English and Icelandic. Includes a Docker-Compose file that starts a REST web server making the translation models available to clients.
  • Q-CAT Corpus Annotation Tool 1.3

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), covering named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application which runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CONLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CONLL-U (and then saving the corpus as XML TEI).
  • Long term archive operating system source code

    This submission contains the operating system of the long-term archive built at the Polish-Japanese Academy of Information Technology for the Clarin-PL project. The basic elements of the archive are data nodes equipped with mass storage. The nodes are controlled by embedded low-power computers which are independently powered up only when their storage is about to be accessed. This not only limits overall energy consumption but also lowers environmental demands (no air conditioning is needed). The nodes are grouped in trays. The basic and recommended configuration allows for 30 nodes per tray, but it is possible to extend this limit up to 253. Each tray contains several networks designed for data transport, device state control and power supply. Communication with clients is conducted through buffers, which are the only parts visible from externally connected networks; stored files are therefore completely isolated and cannot be accessed directly. Multiple trays located at a single physical site form a complete archive. It is possible to split the storage space into virtual archives that are separated at the logical level. The operating system of the data network stores 3 to 7 copies of each file in different nodes. Moreover, additional copies of a resource may be stored automatically in remotely located archives; the trays are treated as local parts of a wider, dispersed data-network structure. The archive software not only enables secure read and write operations but also automatically takes care of the stored data: it periodically regenerates the physical state of saved files, and in case of device failure clients are transparently redirected to local or remote redundant copies. A mechanism of "software bots" was implemented, whereby the archive can be supplied with external programs for processing files stored inside the data network.
    This allows for data analysis, indexation, metadata creation, statistical computations or finding associations in unstructured data sets of the Big Data type. Only the output of a software bot can be accessed externally, which makes such operations very secure. Client programs communicate with the archive using a set of simple protocols based on key-value pair strings, making it convenient to build web interfaces for archive access and administration. By automating the supervision of resources, reducing storage requirements and precisely controlling energy consumption, the proposed solution significantly lowers the cost of long-term data storage.
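A protocol based on key-value pair strings, as described for the archive's client interface, can be illustrated with a small encoder/decoder. The wire format below (newline-separated `key=value` lines) and the field names are assumptions for illustration, not the archive's actual protocol.

```python
# Illustrative encoder/decoder for a simple key-value string protocol,
# in the spirit of the archive client interface described above.
# The exact wire format ("key=value" lines) is an assumption.

def encode(message: dict[str, str]) -> str:
    """Serialise a request as newline-separated key=value pairs."""
    return "\n".join(f"{k}={v}" for k, v in message.items())

def decode(raw: str) -> dict[str, str]:
    """Parse a key=value message back into a dict."""
    pairs = (line.split("=", 1) for line in raw.splitlines() if line)
    return {k: v for k, v in pairs}

# Hypothetical read request round-tripped through the wire format.
request = {"op": "read", "file_id": "42", "copies": "3"}
assert decode(encode(request)) == request
```

Because messages are plain strings, a web front-end for archive access or administration only needs to build and parse such pairs, which is what makes this style of protocol convenient for web interfaces.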
  • Icelandic NER API - ELECTRA-base model (21.05)

    A dockerized Named Entity Recognition (NER) API for Icelandic. It uses an ELECTRA-base language model that has been fine-tuned for NER using MIM-GOLD-NER, achieving an F1-score of ~91.9 on the MIM-GOLD-NER test set. The code for the API is available at https://github.com/icelandic-lt/Icelandic-NER-API and the files for the fine-tuned model are available in this submission.
  • Slovene Conformer CTC BPE E2E Automated Speech Recognition model PROTOVERB-ASR-E2E 1.0

    This Conformer CTC BPE E2E Automated Speech Recognition model was trained following the NVIDIA NeMo Conformer-CTC fine-tuning recipe (for details see the official NVIDIA NeMo ASR documentation, https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html, and the NVIDIA NeMo GitHub repository, https://github.com/NVIDIA/NeMo). It provides functionality for transcribing Slovene speech to text. The starting point was the Conformer CTC BPE E2E Automated Speech Recognition model RSDO-DS2-ASR-E2E 2.0, which was fine-tuned on the Protoverb closed dataset. The model was fine-tuned for 20 epochs, which improved performance by 9.8% relative WER on the Protoverb test dataset and by 3.3% relative WER on the Slobench dataset.
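"Relative WER improvement" means the reduction in word error rate expressed as a fraction of the baseline error rate (the standard definition); the WER values below are made-up placeholders, not the model's actual scores.

```python
# Relative WER improvement: the absolute error-rate reduction divided
# by the baseline error rate. The numbers below are placeholders used
# only to illustrate the arithmetic, not the model's actual WER.

def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fractional reduction of WER relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# e.g. a drop from 20.0% to 18.04% WER is a 9.8% relative improvement
print(round(relative_improvement(20.0, 18.04) * 100, 1))  # 9.8
```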
  • Slovene Punctuation and Capitalisation model RSDO-DS2-P&C 3.6

    This Punctuation and Capitalisation model was trained following the NVIDIA NeMo Punctuation and Capitalisation recipe (for details see the official NVIDIA NeMo P&C documentation, https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/punctuation_and_capitalization.html, and the NVIDIA NeMo GitHub repository, https://github.com/NVIDIA/NeMo). It provides functionality for restoring punctuation (,.!?) and capital letters in lowercased, non-punctuated Slovene text. The training corpus was built from publicly available datasets, as well as a small portion of proprietary data. In total, the training corpus consisted of 38,829,529 sentences and the validation corpus of 2,092,497 sentences.
  • Dependency tree extraction tool STARK 3.0

    STARK is a highly customizable tool designed for extracting different types of syntactic structures (trees) from parsed corpora (treebanks), aimed at corpus-driven linguistic investigations of syntactic and lexical phenomena of various kinds. It takes a treebank in the CONLL-U format as input and returns a list of all relevant dependency trees with frequency information and other useful statistics, such as the strength of association between the nodes of a tree, or its significance in comparison to another treebank. For installation, execution and the description of various user-defined parameter settings, see the official project page at: https://github.com/clarinsi/STARK. An online demo version of the tool is available at: https://orodja.cjvt.si/stark/. In comparison to v2, this version introduces several new features and improvements, such as the ability to extract very long trees, ignore irrelevant relations, process multi-root treebanks, or handle special operators when querying.
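As a rough illustration of the input format STARK consumes, each CONLL-U token line carries (among other columns) an ID, a word form and a head pointer, from which a dependency tree can be assembled. This is a minimal stdlib sketch of that format, not STARK's actual implementation.

```python
# Minimal sketch of reading CONLL-U token lines and grouping tokens
# under their syntactic heads (column 7; 0 marks the root). This only
# illustrates the input format; it is not STARK's implementation.

sample = """\
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tbarks\tbark\tVERB\t_\t_\t0\troot\t_\t_"""

def heads(conllu: str) -> dict[int, list[str]]:
    """Map each head ID (0 = root) to the forms of its dependents."""
    tree: dict[int, list[str]] = {}
    for line in conllu.splitlines():
        cols = line.split("\t")
        head = int(cols[6])           # HEAD column points at the parent ID
        tree.setdefault(head, []).append(cols[1])  # FORM column
    return tree

print(heads(sample))  # {2: ['The'], 3: ['dog'], 0: ['barks']}
```

A tool like STARK would then count and compare such head-dependent structures across a whole treebank rather than a single sentence.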