Active filters:

  • Metadata provider: DSpace
419 record(s) found

Search results

  • The Orange workflow for observing collocation trends ColTrend 1.0

    ColTrend is a workflow (.OWS file) for Orange Data Mining (open-source machine learning and data visualization software: https://orangedatamining.com/) that allows the user to observe temporal collocation trends in corpora. The workflow consists of a series of Python scripts, data filters, and visualizers. As input, the workflow takes a .CSV file with data on collocations and their relative frequencies by year of publication extracted from a corpus. As output, it provides a .TSV file containing the same data (or a filtered selection thereof) enriched with four measures that indicate the collocation's temporal trend in the corpus: (1) the slope (k) of a linear regression model fitted to the frequency data, which indicates whether the frequency of use of the collocation is increasing or declining; (2) the coefficient of determination (R²) of the linear regression model, which indicates how linear the change in the collocation's use is; (3) the ratio (m) of maximum relative frequency to average relative frequency, which indicates peaks in collocation usage; and (4) the coefficient of recent growth (t), which indicates increased usage of the collocation in the last three years of the observed corpus data. The entry also contains three .CSV files that can be used to test the workflow. The files contain collocation candidates (along with their relative frequencies per year of publication) extracted from the Gigafida 2.0 Corpus of Written Slovene (https://viri.cjvt.si/gigafida/) with three different syntactic structures (as defined in http://hdl.handle.net/11356/1415): 1) p0-s0 (adjective + noun, e.g. rezervni sklad), 2) s0-s2 (noun + noun in the genitive case, e.g. ukinitev lastnine), and 3) gg-s4 (verb + noun in the accusative case, e.g. pripraviti besedilo). Only collocation candidates with an absolute frequency of 15 or above were extracted.
    Please note that the ColTrend workflow requires the Text Mining add-on for Orange to be installed. For installation instructions, a more detailed description of the phases of the workflow, and the measures used to observe collocation trends, please consult the README file.
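The first three measures described above can be illustrated with a minimal sketch in plain Python. This is not the code shipped in the workflow: the function name `trend_measures` and the toy data are hypothetical, the regression is an ordinary least-squares fit of relative frequency on year, and the coefficient of recent growth (t) is left out because its formula is not given in this description.

```python
from statistics import mean


def trend_measures(years, rel_freqs):
    """Return (k, r2, m) for one collocation.

    k  : slope of a least-squares line fitted to (year, relative frequency)
    r2 : coefficient of determination of that line
    m  : maximum relative frequency divided by average relative frequency
    Hypothetical helper for illustration; not part of ColTrend itself.
    """
    if len(years) != len(rel_freqs) or len(years) < 2:
        raise ValueError("need at least two (year, frequency) pairs")

    x_bar = mean(years)
    y_bar = mean(rel_freqs)
    sxx = sum((x - x_bar) ** 2 for x in years)
    syy = sum((y - y_bar) ** 2 for y in rel_freqs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, rel_freqs))

    k = sxy / sxx
    # For simple linear regression, R^2 equals the squared correlation.
    # A constant series has no variance to explain; 0.0 is an assumption here.
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    m = max(rel_freqs) / y_bar if y_bar > 0 else 0.0
    return k, r2, m


# Toy data: a collocation whose relative frequency grows steadily.
k, r2, m = trend_measures([2015, 2016, 2017, 2018, 2019],
                          [1.0, 2.0, 3.0, 4.0, 5.0])
print(k, r2, m)
```

A positive k with R² close to 1 suggests a steady rise, while a large m with a low R² points to a short-lived peak rather than a linear trend.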
  • The Trankit model for linguistic processing of spoken and written Slovenian 1.1

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). In comparison to its counterpart models trained on the SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields significantly better performance on spoken transcripts and almost identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To use this model, please follow the instructions provided in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. In comparison to the previous version, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14, https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14) and a substantially extended and improved version of the SST UD treebank (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/dev), and thus produces significantly better results for spoken data.
  • Trankit model for SST 2.15

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the SST treebank of spoken Slovenian (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/dev) featuring transcriptions of spontaneous speech in various everyday settings. It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). Please note that this model has been published for archiving purposes only. For production use, we recommend the state-of-the-art Trankit model available here: http://hdl.handle.net/11356/1965. The latter was trained on both spoken (SST) and written (SSJ) data, and demonstrates significantly higher performance than the model featured in this submission.
  • Tiro TTS web service (22.10)

    Tiro TTS is a text-to-speech (TTS) API web service that works with various TTS backends. By default, it expects a FastSpeech2+Melgan+IceG2P backend. See the https://github.com/cadia-lvl/fastspeech2 repository for more information on the backend. The service accepts either unnormalized text or an SSML document and responds with audio (MP3, Ogg Vorbis or raw 16-bit PCM) or with speech marks, which indicate the byte and time offset of each synthesized word in the request. The full API documentation in OpenAPI 2 format is available online at tts.tiro.is. The code for the service, along with further information, is available at https://github.com/tiro-is/tiro-tts/releases/tag/M9. Please also check whether a newer version has been released (see README.md).
  • COMBO-based UD Parser 22.10

    This Universal Dependencies parser for Icelandic was trained with COMBO on IcePaHC and UD_Icelandic-Modern; the latter was revised before training to remove some duplicate sentences. It utilizes information from an ELECTRA language model (https://huggingface.co/jonfd/electra-base-igc-is). Its UAS (unlabeled attachment score) is 89.13 and its LAS (labeled attachment score) is 85.97.
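For readers unfamiliar with the two scores quoted above, the following is a minimal sketch of how UAS and LAS are commonly computed. The function name `attachment_scores` and the toy sentence are hypothetical, and the sketch assumes gold and predicted tokens are already aligned one to one; the evaluation behind the reported figures may differ in details such as punctuation handling.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    gold, predicted: equally long lists of (head_index, dependency_label)
    pairs, one pair per token, in the same token order.
    Hypothetical helper for illustration only.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")

    total = len(gold)
    # UAS: the predicted head is correct, regardless of the label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the dependency label are correct.
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * head_hits / total, 100.0 * full_hits / total


# Toy four-token sentence: one wrong label and one wrong head.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "amod"), (3, "obj")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 75.0 50.0
```

Because LAS requires the label to match as well as the head, it can never exceed UAS, which is consistent with the 85.97 and 89.13 reported for this parser.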
  • Text classification model SloBERTa-Trendi-Topics 1.0

    The SloBERTa-Trendi-Topics model is a text classification model for categorizing news texts with one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590) such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc. The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. URLs). The set of labels was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf. The model was trained on the labeled texts using the SloBERTa 2.0 contextual embeddings model (http://hdl.handle.net/11356/1397; also available at HuggingFace: https://huggingface.co/EMBEDDIA/sloberta) and validated on a development set of 1,293 texts using the simpletransformers library and the following hyperparameters: train batch size: 8; learning rate: 1e-5; max. sequence length: 512; number of epochs: 2. The model achieves a macro-F1-score of 0.94 on a test set of 1,295 texts (best for "črna kronika", "politika", "šport", and "vreme" at 0.98, worst for "prosti čas" at 0.83).
    Please note that the fastText-Trendi-Topics 1.0 text classification model is also available (http://hdl.handle.net/11356/1710); it is faster and computationally less demanding, but achieves lower classification accuracy.
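As a reminder of what the macro-F1 figure above measures, here is a minimal sketch in plain Python. The function name `macro_f1` and the toy labels are hypothetical, and the sketch is not the evaluation script used for the model; it only shows the usual definition, in which F1 is computed per label and then averaged with equal weight.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 scores.

    gold, predicted: equally long lists of label strings.
    Hypothetical helper for illustration only.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")

    pairs = list(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); 0.0 if the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three of the 13 topic labels.
gold = ["šport", "šport", "vreme", "politika"]
pred = ["šport", "vreme", "vreme", "politika"]
print(round(macro_f1(gold, pred), 4))  # 0.7778
```

Since every label counts equally, a weaker category such as "prosti čas" (0.83) lowers the macro average as much as a frequent category would, regardless of how many test texts carry that label.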
  • Editor for pronunciation dictionaries

    A web application for editing pronunciation dictionaries. The tool offers detailed annotation of entries, e.g. whether a word is a compound, whether it begins with a prefix, its pronunciation variants (dialects), and its part of speech. Edited dictionaries can be exported in .tsv format for use in speech applications.