Active filters:

  • Resource type: Unspecified
419 record(s) found

Search results

  • The CLASSLA-StanfordNLP model for lemmatisation of non-standard Serbian 1.1

    The model for lemmatisation of non-standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200), the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240), the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241), the hr500k training corpus (http://hdl.handle.net/11356/1183) and the RAPUT corpus (https://www.aclweb.org/anthology/L16-1513/), using the srLex inflectional lexicon (http://hdl.handle.net/11356/1233). To handle missing diacritics, these corpora were augmented by repeating parts of them with the diacritics removed. The estimated F1 of the lemma annotations is ~97.62. The difference from the previous version of the lemmatizer is that it now relies solely on XPOS annotations, and not on a combination of UPOS, FEATS (lexicon lookup) and XPOS (lemma prediction) annotations.
  • The CLASSLA-Stanza model for morphosyntactic annotation of standard Macedonian 2.1

    This model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the 1984 training corpus expanded with the Macedonian SETimes corpus (to be published) and using the Macedonian CLARIN.SI word embeddings (http://hdl.handle.net/11356/1788). The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~97.14. The difference from the previous version is that this version was trained using a larger training dataset and the new version of the Macedonian word embeddings.
  • MUSCIMarker

    MUSCIMarker is an open-source tool for annotating visual objects and their relationships in binary images. It is implemented in Python, known to run on Windows, Linux and OS X, and supports working offline. MUSCIMarker is being used for creating a dataset of musical notation symbols, but can support any object set. The user documentation online is currently (12.2016) incomplete, as it is continually changing to reflect annotators' comments and incorporate new features. This version of the software is *not* the final one, and it is under continuous development (we're currently working on adding grayscale image support with auto-binarization, and Android support for touch-based annotation). However, the current version (1.1) has already been used to annotate more than 100 pages of sheet music, over all the major desktop OSes, and I believe it is already in a state where it can be useful beyond my immediate music notation data gathering use case.
  • A lexicographical browser for DjVu

    The program is an indexer and browser for scans of lexicographical paper slips. The slips are presented in DjVu format, and a relational database stores the information about them. Browsing combines three approaches: incremental search, binary search, and so-called occasional indexing, which refines the stored information during searching.
  • The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Macedonian 1.0

    This model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the 1984 training corpus (to be published) and using the Macedonian CLARIN.SI word embeddings (http://hdl.handle.net/11356/1359). The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~97.6.
  • GreynirTranslate - mBART25 NMT models for Translations between Icelandic and English (1.0)

    Provided are general-domain IS-EN and EN-IS translation models developed by Miðeind ehf. They are based on a multilingual BART model (https://arxiv.org/pdf/2001.08210.pdf) and fine-tuned for translation on parallel and backtranslated data. The models were trained with the Fairseq sequence modeling toolkit, which is built on PyTorch. Provided here are the model files, a SentencePiece subword tokenization model, and dictionary files for running the models locally. You can run the scripts infer-enis.sh and infer-isen.sh to test the models by translating sentences on the command line. For translating documents and evaluating results, you will need to binarize the data using fairseq-preprocess and use fairseq-generate for translating. Please refer to the Fairseq documentation for further information on running a pre-trained model: https://fairseq.readthedocs.io/en/latest/
  • Universal Dependencies 2.15 models for UDPipe 2 (2024-11-21)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 147 treebanks of 78 languages of Universal Dependencies 2.15 Treebanks, created solely using UD 2.15 data (https://hdl.handle.net/11234/1-5787). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_215_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
  • Czech image captioning, machine translation, sentiment analysis and summarization (Neural Monkey models)

    This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving four NLP tasks: machine translation, image captioning, sentiment analysis, and summarization. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks. The models are described in the accompanying paper. The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd

    In addition to the models presented in the referenced paper (developed and published in 2018), we include models for automatic news summarization for Czech and English developed in 2019. The Czech models were trained using the SumeCzech dataset (https://www.aclweb.org/anthology/L18-1551.pdf), the English models were trained using the CNN-Daily Mail corpus (https://arxiv.org/pdf/1704.04368.pdf) using the standard recurrent sequence-to-sequence architecture.

    There are several separate ZIP archives here, each containing one model solving one of the tasks for one language. To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey

    To ensure correct functioning of the model, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory. Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to correspond to your filesystem organization). The 'experiment.ini' file, which was used to train the model, is also included. Then there are files containing the model itself, files containing the input and output vocabularies, etc.

    For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/

    For the machine translation, you do not need to tokenize the data, as this is done by the model.

    For image captioning, you need to:
    - download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
    - clone the git repository with TensorFlow models: https://github.com/tensorflow/models
    - preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py) -- you need to specify the path to ResNet and to the TensorFlow models to this script

    The summarization models require input that is tokenized with Moses Tokenizer (https://github.com/alvations/sacremoses) and lower-cased.

    Feel free to contact the authors of this submission in case you run into problems!
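
The Neural Monkey record above asks users to check out the exact commit named in each model directory's 'git_commit' file. The following is a minimal sketch of that step, not part of the submission: the function name `pinned_checkout_command` is hypothetical, and it assumes the file holds a single hexadecimal commit hash (abbreviated or full), which the record does not spell out.

```python
import re
import tempfile
from pathlib import Path
from typing import List


def pinned_checkout_command(model_dir: str) -> List[str]:
    """Build the git command that checks out the commit named in 'git_commit'."""
    commit = (Path(model_dir) / "git_commit").read_text(encoding="utf-8").strip()
    # Assumption: the file contains one hexadecimal hash of 7 to 40 characters.
    if not re.fullmatch(r"[0-9a-fA-F]{7,40}", commit):
        raise ValueError(f"unexpected content in git_commit: {commit!r}")
    return ["git", "checkout", commit]


if __name__ == "__main__":
    # Demo with a throwaway directory and a made-up hash, not a real model.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "git_commit").write_text("0123abc4567def\n", encoding="utf-8")
        print(" ".join(pinned_checkout_command(tmp)))
```

The returned list can be passed to `subprocess.run(cmd, cwd=clone_dir, check=True)`, where `clone_dir` is the local Neural Monkey clone. Returning the command instead of running it keeps the sketch free of side effects.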