Search results

  • NeMo Punctuation and Capitalisation service RSDO-DS2-P&C-API 1.0

    Punctuation and Capitalisation service for NeMo models. For more details about building such models, see the official NVIDIA NeMo documentation (https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/punctuation_and_capitalization.html) and the NVIDIA NeMo GitHub repository (https://github.com/NVIDIA/NeMo). A model for punctuation and capitalisation restoration in lowercased, non-punctuated Slovene text can be downloaded from http://hdl.handle.net/11356/1735. The service accepts as input either a single string or a list of strings for which punctuation and capitalisation should be restored. The result is returned in the same format as the request: either a single string or a list of strings. The maximum accepted text length is 5000 characters. Note that punctuating and capitalising one 5000-character text block on CPU will use all available cores and may take ~30 s (on a system with 24 vCPUs). See the service README.md for further details.
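
    A minimal sketch of calling such a service from Python, assuming a locally deployed instance; the endpoint URL and payload schema here are illustrative, and the actual interface is described in the service README.md:

      import requests

      # Hypothetical endpoint; check the service README.md for the real URL and schema.
      API_URL = "http://localhost:8000/api/punctuate"

      # The service accepts either a single string or a list of strings
      # (at most 5000 characters per text).
      payload = {"text": ["danes je lep dan kajne"]}

      response = requests.post(API_URL, json=payload, timeout=60)
      response.raise_for_status()
      # The response mirrors the request format: here, a list of restored strings.
      print(response.json())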
  • Polish Speech Services

    This archive contains the source code and configuration of the speech tools web service available at http://mowa.clarin-pl.eu/mowa. The services provided include speech-to-text alignment, speaker diarization, speech transcription, speech activity detection and noise classification, and keyword spotting.
  • IceNeuralParsingPipeline 20.04

    The Icelandic Neural Parsing Pipeline (IceNeuralParsingPipeline) includes all steps necessary for parsing plain Icelandic text, i.e. preprocessing, parsing and post-processing. The preprocessing step consists of tokenization together with both punctuation and matrix-clause splitting. The parsing step uses an Icelandic model of the Berkeley Neural Parser, trained on IcePaHC, which achieves an F1 score of 84.74. The output's annotation scheme is the same as IcePaHC's, except that neither empty phrases, e.g. traces and zero subjects, nor lemmas are shown. The post-processing step includes minor steps for cleaning and formatting the parsed text.
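
    As an illustration of the parsing step, a minimal sketch using the Berkeley Neural Parser's Python API (benepar); the model name below is hypothetical and stands in for the IcePaHC-trained model shipped with the pipeline:

      import benepar

      # Hypothetical model name; the actual IcePaHC-trained model comes with the pipeline.
      parser = benepar.Parser("benepar_is_icepahc")

      # Input is assumed to be preprocessed already (tokenized and clause-split).
      tree = parser.parse("hún las bókina í gær".split())
      # An nltk.Tree in IcePaHC-style bracketing (without empty phrases and lemmas).
      print(tree)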
  • Q-CAT Corpus Annotation Tool 1.5

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CoNLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CoNLL-U (and then saving the corpus as XML TEI). Version 1.4 introduces new features in command-line mode (filtering by sentence ID, multiple link-type visualizations). Version 1.5 supports listening to audio recordings (provided in the # sound_url comment line in CoNLL-U).
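
    For illustration, the audio support in Version 1.5 relies on a per-sentence comment line in the CoNLL-U file; a minimal sketch of collecting these links, assuming the usual "# key = value" CoNLL-U comment convention (the file name and URL are illustrative):

      # Collect the audio link of each sentence from a CoNLL-U file.
      # A sentence block may start with comment lines such as:
      #   # sent_id = s1
      #   # sound_url = https://example.org/audio/s1.wav
      sound_urls = {}
      sent_id = None
      with open("corpus.conllu", encoding="utf-8") as f:
          for line in f:
              line = line.strip()
              if line.startswith("# sent_id"):
                  sent_id = line.split("=", 1)[1].strip()
              elif line.startswith("# sound_url"):
                  sound_urls[sent_id] = line.split("=", 1)[1].strip()
      print(sound_urls)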
  • Parsito

    Parsito is a fast open-source dependency parser written in C++. Parsito is based on greedy transition-based parsing; it has very high accuracy and achieves a throughput of 30K words per second. Parsito can be trained on any input data without feature engineering, because it utilizes an artificial neural-network classifier. Trained models for all treebanks from the Universal Dependencies project are available (37 treebanks as of Dec 2015). Parsito is free software under the Mozilla Public License 2.0 (http://www.mozilla.org/MPL/2.0/), and the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license (http://creativecommons.org/licenses/by-nc-sa/4.0/), although for some models the original data used to create the model may impose additional licensing conditions. The Parsito website (http://ufal.mff.cuni.cz/parsito) contains download links for both the released packages and trained models, hosts the documentation, and offers an online demo. The Parsito development repository (http://github.com/ufal/parsito) is hosted on GitHub.
  • Face-domain-specific automatic speech recognition models

    This entry contains all the files required to implement face-domain-specific automatic speech recognition (ASR) applications using the Kaldi ASR toolkit (https://github.com/kaldi-asr/kaldi), including the acoustic model, language model, and other relevant files. It also includes all the scripts and configuration files needed to use these models for implementing face-domain-specific automatic speech recognition. The acoustic model was trained using the relevant Kaldi ASR tools (https://github.com/kaldi-asr/kaldi) and the Artur speech corpus (http://hdl.handle.net/11356/1776; http://hdl.handle.net/11356/1772). The language model was trained on domain-specific text data involving face descriptions, obtained by translating the Face2Text English dataset (https://github.com/mtanti/face2text-dataset) into Slovenian. These models, combined with other necessary files such as the HCLG.fst and the decoding scripts, enable the implementation of face-domain-specific ASR applications. Two speech corpora ("test" and "obrazi") and two Kaldi ASR models ("graph_splosni" and "graph_obrazi") can be selected for speech recognition tests by setting the variables "graph" and "test_sets" in the "local/test_recognition.sh" script. Acoustic speech features can be extracted and speech recognition tests run using the "local/test_recognition.sh" script. Speech recognition test results can be obtained using the "results.sh" script. The KALDI_ROOT environment variable also needs to be set in the script "path.sh" to point to the Kaldi ASR toolkit installation folder.
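
    A minimal sketch of driving the described test workflow from Python, run from the recipe's root directory; the Kaldi install path is illustrative, and per the description above KALDI_ROOT is actually configured inside "path.sh":

      import subprocess

      # Before running: set KALDI_ROOT in path.sh (e.g. to /opt/kaldi), and choose
      # the model and test set inside local/test_recognition.sh, e.g.
      #   graph=graph_obrazi
      #   test_sets=obrazi
      subprocess.run(["bash", "local/test_recognition.sh"], check=True)  # features + decoding
      subprocess.run(["bash", "results.sh"], check=True)                 # summarize test results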
  • Trankit model for linguistic processing of spoken Slovenian

    This is a retrained Slovenian spoken-language model for the Trankit v1.1.1 library (https://pypi.org/project/trankit/). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, feature prediction, and dependency parsing in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained on a combination of two datasets published in Universal Dependencies release 2.12: the spoken SST treebank (https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.12) and the written SSJ treebank (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.12). Its evaluation on the spoken SST test set yields an F1 score of 97.78 for lemmas, 97.19 for UPOS, 95.05 for XPOS and 81.26 for LAS, significantly better performance than the counterpart model trained on written SSJ data only (http://hdl.handle.net/11356/1870). To use this model, please follow the instructions provided in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.
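
    A minimal sketch of loading the retrained pipeline via Trankit's customized-pipeline mechanism, following the loading instructions linked above; the directory name and pipeline category are illustrative and should be checked against the repository's instructions:

      import trankit

      # Assumption: the unzipped model directory from this entry.
      save_dir = "./slovenian_spoken_model"

      # Register the customized pipeline (see the Trankit docs on loading trained models);
      # xlm-roberta-large weights are also included in the ZIP.
      trankit.verify_customized_pipeline(
          category="customized",
          save_dir=save_dir,
          embedding_name="xlm-roberta-base",
      )

      p = trankit.Pipeline(lang="customized", cache_dir=save_dir)
      doc = p("kako si danes")  # segmentation, lemmas, UPOS/XPOS, features, dependencies
      print(doc)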
  • Q-CAT Corpus Annotation Tool 1.4

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CoNLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CoNLL-U (and then saving the corpus as XML TEI). Version 1.4 introduces new features in command-line mode (filtering by sentence ID, multiple link-type visualizations).
  • Icelandic Gigaword Corpus JSONL Converter

    Icelandic Gigaword Corpus JSONL Converter is a tool for converting the unannotated version of the Icelandic Gigaword Corpus (IGC; http://hdl.handle.net/20.500.12537/253) to the JSONL format. The converter takes in original XML files from the IGC and converts them to JSONL, adding information on the quality and domain of each subcorpus, which is obtained from an attached file created by the Árni Magnússon Institute for Icelandic Studies. For further information on the output format, see the attached README.
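
    For illustration, JSONL stores one JSON object per line; a minimal sketch of writing such a record in Python (the field names are hypothetical, and the converter's actual output schema is documented in the attached README):

      import json

      # Hypothetical record; the real field names are in the converter's README.
      record = {
          "text": "Dæmi um texta.",   # document text
          "quality": "A",             # subcorpus quality, from the attached metadata file
          "domain": "news",           # subcorpus domain, from the attached metadata file
      }

      with open("igc.jsonl", "a", encoding="utf-8") as f:
          f.write(json.dumps(record, ensure_ascii=False) + "\n")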
  • Slowal

    Slowal is a web tool designed for creating, editing and browsing valence dictionaries. So far, it has mainly been used for creating the Polish Valence Dictionary (Walenty). Slowal supports the process of creating the dictionary; it also facilitates access by making it possible to browse the dictionary using an advanced built-in filtering system covering both syntactic and semantic phenomena. Slowal also gives control over the work of the lexicographers involved in creating the dictionary, for instance by using predefined lists of values, which prevents spelling errors and enforces consistency, as well as by imposing strict validation rules. Last but not least, the created dictionary can be exported from Slowal in various formats: plain text, TeX, PDF, and TEI XML.