
Metadata on LaRS: the in-built scheme

In proper order (© Wilhei, CC BY 3.0, picture detail reproduced with some changes, our title)

ShareTIGR

03/10/2024

Repeatedly on this blog, we have explored the topic of metadata in relation to corpus-based linguistic research and data sharing. In the last few weeks, we began to look into specific solutions for describing the TIGR corpus on the Language Repository of Switzerland (LaRS) @ SWISSUbase. We encountered some problems that took us time to analyse and that we address in this post and the next one.

Let’s start by recalling that LaRS has a hierarchical structure consisting of three levels: study, datasets, and files. Each level can be described by level-specific metadata that vary in scope, degree of detail, and function. The metadata at study level are quite general and form the repository’s catalogue, which is searchable by human users and read (“harvested”) by larger catalogues such as the Virtual Language Observatory (VLO) powered by CLARIN (for a presentation of the catalogue see Van Uytvanck, Stehouwer & Lampen 2012). Some information provided at study level (‘Study title’, ‘Author(s)’, ‘Main discipline(s)’, ‘Period’, ‘Geographical Area’) can be classified as descriptive metadata, whereas other information, such as ‘Version number’ or ‘Version notes’, is of an administrative nature. Descriptive and administrative metadata can also be found at dataset level. Since the dataset is the downloadable unit on SWISSUbase, rights management metadata such as ‘Deposit contract’, ‘Usage license’ or ‘Embargo end date’ become relevant here, in addition to descriptive information such as ‘Resource type’, ‘Keywords’ or ‘DOI’. The most detailed information is found at file level. For each file, which can be a ‘Documentation’, a ‘Single Data File’ or a ‘File Collection’ (zip folder), it is possible to specify general information such as ‘File title’ or ‘Remarks’ concerning the attached document, or to add ‘Linguistic metadata’. The options available in the ‘Linguistic metadata’ tab are the main subject of the following paragraphs.
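To make the hierarchy easier to picture, here is a minimal sketch of the three levels as nested Python dictionaries. The field names are the labels quoted above; the values are illustrative placeholders, not our actual TIGR entries.

```python
# A minimal sketch of the three LaRS levels (study > datasets > files).
# Field names follow the labels quoted above; values are placeholders.
study = {
    "Study title": "...",              # descriptive metadata
    "Author(s)": ["..."],
    "Main discipline(s)": ["..."],
    "Period": "...",
    "Geographical Area": "...",
    "Version number": "...",           # administrative metadata
    "Version notes": "...",
    "datasets": [
        {
            "Resource type": "...",    # descriptive metadata
            "Keywords": ["..."],
            "DOI": "...",
            "Deposit contract": "...", # rights management metadata
            "Usage license": "...",
            "Embargo end date": "...",
            "files": [
                {
                    # a 'Documentation', 'Single Data File'
                    # or 'File Collection' (zip folder)
                    "File title": "...",
                    "Remarks": "...",
                    "Linguistic metadata": {},
                },
            ],
        },
    ],
}
```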

As outlined in the LaRS Metadata Guide for Linguistics Data, the categories of addable linguistic metadata are ‘Language’, ‘Annotation’, ‘Text’, ‘Audio’, ‘Video’, ‘Image’, and ‘Tools’. The guide also lists ‘Resource Metadata’, which is the only category that refers to the dataset level rather than the file level. In the following, we will say a few words about each category and about the possibilities and limitations when it comes to describing the TIGR corpus.

Under ‘Language’, different aspects of the language(s) recorded in a file can be described: among others, name, language status, and modality type. For the TIGR data, it appears wise to include the language name (ita – Italian), the linguality type (monolingual), the country to which the language resource refers (Switzerland), the modality type (spoken language), and the naturality (natural). An additional mandatory field is language status (in our case: living). When compiling the ‘Audio’ and ‘Video’ sections, technical, rather than descriptive or administrative, metadata come into play. It is mandatory to name the media type (e.g., audio/wav or video/mp4); additionally, the user can specify the format (e.g., MPEG-4), the codec and, in the case of video data, resolution and frame rate. Duration can also be indicated. In the ‘Tools’ section, it is possible to provide information about the tools used to generate, process or annotate the data: one can name a software tool (in the TIGR case, the EUDICO Linguistic Annotator ELAN), its role (e.g. transcription, annotation), and give a short description (e.g., “ELAN is an annotation tool for audio and video recordings”). Many of the available metadata categories are directly relevant to the files included in the TIGR corpus.
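Put together, the file-level ‘Linguistic metadata’ for one of our video files might look roughly as follows. This is a sketch: the keys mirror the categories just discussed, but the exact field names and controlled values in the LaRS web form may differ.

```python
# Rough shape of the file-level 'Linguistic metadata' for a TIGR video
# file; keys mirror the categories discussed above, values as in the text.
video_file_metadata = {
    "Language": {
        "Name": "ita – Italian",
        "Language status": "living",       # mandatory field
        "Linguality type": "monolingual",
        "Country": "Switzerland",
        "Modality type": "spoken language",
        "Naturality": "natural",
    },
    "Video": {
        "Media type": "video/mp4",         # mandatory field
        "Format": "MPEG-4",                # optional technical details
        "Codec": "...",
        "Resolution": "...",
        "Frame rate": "...",
        "Duration": "...",
    },
    "Tools": [
        {
            "Software": "EUDICO Linguistic Annotator ELAN",
            "Role": "transcription, annotation",
            "Description": "ELAN is an annotation tool for audio "
                           "and video recordings",
        },
    ],
}
```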

An interesting issue regards the written transcripts of video recordings capturing spoken interaction. LaRS provides two categories of documents that may contain textual data, ‘Text’ and ‘Annotation’, with two different metadata sets. The category ‘Text’ allows one to add a media type (e.g., text/plain or application/msword), a character encoding (UTF-8), a provenance/derivation (the guide gives the example ‘transcribed folk song’) and a genre (e.g., ‘folk song about local customs’). The category ‘Annotation’ offers the possibility to indicate an annotation type (with the recommendation to use standardized media types like those used for text), an annotation format, and annotation tiers, and to refer to a controlled vocabulary if a documented closed tag set was used. We found it hard to decide whether transcripts of spoken interaction should be categorised as ‘Text’ or ‘Annotation’. Transcripts, in general, are similar to both. Like texts, they represent verbal discourse. Like annotations, they are produced on the basis of a preexisting document (a recording, in this case), each element of the transcript relates to a precisely defined portion of that document, and they may contain comments on, or categorizations of, nonverbal information in addition to verbal information. One of us recently raised the question of the relationship between transcripts and annotations in the introductory remarks to a workshop on Database evolution for the study of social interaction: Designing annotations for long-term usability (Miecznikowski 2024). We have kept looking for answers ever since, asking fellow linguists and repository managers for their opinion, but no clear consensus has emerged! In the LaRS guide, too, different clues point in different directions. The possibility to describe a ‘Text’ as a transcript of a folk song, according to the example cited earlier in this paragraph, invites us to place transcripts in the ‘Text’ category. On the other hand, metadata categories of ‘Annotation’ such as annotation tiers seem to be tailored to our EAF transcripts and underline the affinity between transcripts and annotations. To further complicate the matter, we plan to provide several transcript versions, and one of them – the plain-text, movie-script-style transcript we are producing – is less similar to an annotation than the original EAF file. Should we then categorize the two transcript formats differently, one as an ‘Annotation’ and the other as a ‘Text’?
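To summarize the dilemma, here is what the two candidate descriptions of one and the same transcript might look like side by side. The field names paraphrase the guide, and the values are hypothetical.

```python
# One recording, two candidate descriptions. Field names paraphrase the
# LaRS guide; values are hypothetical.
transcript_as_text = {                # plain-text, movie-script style
    "Media type": "text/plain",
    "Character encoding": "UTF-8",
    "Provenance/derivation": "transcribed spoken interaction",
    "Genre": "...",
}
transcript_as_annotation = {          # time-aligned EAF file
    "Annotation type": "...",         # a standardized media type,
                                      # as the guide recommends
    "Annotation format": "EAF",
    "Annotation tiers": ["..."],
    "Controlled vocabulary": "...",   # if a closed tag set was used
}
```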

If we now look back at the metadata discussed so far, it is noticeable that sociolinguistic speaker data have not been mentioned. As discussed in our blog post on the digitisation of participant questionnaires, we registered the age and sex of the speakers as well as their language skills and their places of upbringing and current residence. None of these features can be captured using the categories mentioned so far. It is possible to describe speakers and their gender and age under ‘Resource metadata’ (i.e. metadata relating to an entire dataset). But these fields are geared towards groups of people, as shown by the plural forms used (‘Participants’, ‘Participants Gender’, etc.) and by the ‘Number of persons’ descriptor; they do not relate to the individual speakers participating in an interaction. The LaRS data model does not currently represent individual speakers.
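The mismatch can be stated concisely: the existing fields aggregate over the group, whereas our questionnaire data are records about individual speakers. The second dictionary below is hypothetical and has no counterpart in the current LaRS scheme.

```python
# What LaRS offers: group-level participant description (note the plurals).
resource_metadata = {
    "Participants": "...",
    "Participants Gender": "...",
    "Number of persons": "...",
}

# What our questionnaire data would need: one record per speaker.
# Hypothetical; the current LaRS data model has no such entity.
speaker = {
    "age": "...",
    "sex": "...",
    "language skills": ["..."],
    "place of upbringing": "...",
    "place of current residence": "...",
}
```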

Another piece of information that the current version of LaRS does not allow one to describe in the form of explicit metadata is the kind of relation that holds between different multimedia and transcript files, especially those documenting the same event. By grouping files into a single dataset, it is possible to show the user that they are related in a general way, but no relations between specific pairs of files can be defined using metadata fields. While exploring this topic, further desirable metadata candidates came to mind, for instance a description of the measures taken to de-identify the data.
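For illustration, typed links between specific pairs of files could look like the hypothetical sketch below; nothing of this kind exists in the current scheme, where relatedness is only implied by shared dataset membership. File names and relation labels are invented.

```python
# Hypothetical typed relations between files documenting the same event.
file_relations = [
    {"source": "event01_video.mp4",
     "relation": "records the same event as",
     "target": "event01_audio.wav"},
    {"source": "event01_transcript.eaf",
     "relation": "transcribes",
     "target": "event01_video.mp4"},
]

# Another desirable field: how the data were de-identified.
deidentification = "..."  # e.g. pseudonymization of names in transcripts
```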

Having looked closely at the LaRS metadata scheme, we came to the conclusion that we needed to upload metadata files into our datasets in order to add information that does not fit the repository’s in-built scheme. We did some research, talked to the repository managers, discovered various options and decided on further steps: we will report on these investigations in the next post.

Johanna Miecznikowski & Nina Profazi


References

Miecznikowski, J. (2024, January 16). Introductory remarks [workshop presentation]. Database evolution for the study of social interaction: Designing annotations for long-term usability, Neuchâtel, Switzerland. https://www.youtube.com/watch?v=DdBNoqFA7W8&t=94s

SWISSUbase. Metadata Guide for Linguistics Data. Metadata Documentation (last modified: 21.11.2023). Language Repository of Switzerland (LaRS). https://resources.swissubase.ch/linguistics_metadata-guide_en/

Van Uytvanck, D., Stehouwer, H., & Lampen, L. (2012). Semantic metadata mapping in practice: The Virtual Language Observatory. In N. Calzolari (Ed.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, May 23rd-25th, 2012 (pp. 1029-1034). European Language Resources Association (ELRA).

Virtual Language Observatory (VLO). https://vlo.clarin.eu/?6.