Sharing the TIGR corpus of spoken Italian:
an ORD case study

Why metadata is important for FAIR data sharing and reuse

Metadata

ShareTIGR

05/09/2024

Following the past weeks’ discussions of fieldwork and data collection, we would now like to turn to metadata and its relevance for sharing and reusing (FAIR) research data in the fields of Interactional Linguistics (IL) and Conversation Analysis (CA). This post prepares the readers of our lab blog for future posts dedicated to TIGR’s metadata and to how we intend to share it on SWISSUbase. It addresses two main questions: what is metadata, and why is it important, especially in the context of corpus-based linguistic research? The post closes with some explanations of the relationship between metadata and the FAIR Principles for scientific data management and stewardship.

The term ‘metadata’, which can be used in singular or plural form, is commonly defined as ‘data about data’ or ‘information about information’. It can therefore be found in any discipline that relies on data. By providing additional information about the provenance, composition, processing, and storage of the data it is associated with, metadata facilitates their appropriate interpretation and use. It can thus be described as “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource” (NISO 2004, p. 1). In Interactional Linguistics and Conversation Analysis, the documents that serve as the basis for empirical research are audio/video recordings of spoken interaction and the associated written transcripts. In today’s digital age, where online resources for research are increasingly widespread and sought after, such documents are frequently made available as online corpora, accessible via repositories such as SWISSUbase or corpus platforms such as CLAPI (Corpus de LAngue Parlée en Interaction). How such corpus data can be used, for which research questions, and in which contexts depends on how they are documented, described, categorised, and organised, that is, on the metadata provided (cf. Schmidt 2022, p. 249). With this in mind, we turn to the second question of this post.

Why is metadata important? Metadata can comprise a number of elements, which can be categorised according to the functions they perform. Depending on the type of information resource, the methodological approach, and the research interest pursued, different metadata may be relevant. Metadata can therefore be considered corpus-specific to a certain extent. Overall, however, three broad categories can be identified: descriptive metadata, technical metadata, and administrative metadata.

“Descriptive metadata describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, and keywords” (NISO 2004, p. 1). In the case of spoken corpus data, descriptive metadata is more far-reaching, as it can contain information about the corpus itself (corpus metadata), the interactional settings (situational metadata), and the speakers (speaker metadata). According to Schmidt (2022), probably the most important function of a corpus’ metadata is to map the corpus design. The design of a corpus describes the systematics of its structure and essentially determines which research questions can be addressed with it. The metadata categories used to describe a corpus therefore reflect the essential parameters of its design. Corpus metadata can also contain information about the size of the resource (e.g., in the case of the TIGR corpus, 23.5 hours), the time of origin (e.g., 2021-2022), the place of origin (e.g., Ticino and the Grisons), and the type of data collected (e.g., face-to-face interactions in various kinds of non-experimental settings). Situational metadata provides detailed information about the settings in which a recording was made, e.g., table conversations, food preparation, lessons and tutoring encounters, or interviews. Speaker metadata describes the people who were involved in the interaction and can contain information such as (pseudonymised) names, age, origin, first language, or language skills.

Taken together, this descriptive metadata makes it possible to quantify a corpus (e.g., 30.4% of speech events come from private settings, 69.6% from institutional ones), to search systematically for properties recorded in the metadata, or to discover correlations between linguistic forms and features of the conversations (cf. Schmidt 2022, p. 253; the advantages and limits of quantitative analyses using descriptive metadata were also discussed critically by Deppermann 2023). A note of caution is in order: easily available descriptions should not be used for shallow explanations, since only the detailed analysis of interactional conduct can provide reliable findings.

The next category, technical metadata, “describes the technical processes used to produce, or required to use a digital object” (Higgins 2007). This can include information on the devices used to gather the data (e.g., two camcorders, pocket audio recorders equipped with clip-on microphones) or the formats in which the data are stored (e.g., WAV for audio files, MP4 for video files, or EAF, the XML-based output format of the annotation tool ELAN, for transcripts). Information of this kind helps to manage, understand, and utilise data by documenting their technical characteristics and relationships.
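To make the levels of descriptive metadata a little more tangible, here is a minimal sketch in Python of how situational and speaker metadata might be represented and queried. It is an illustration only and does not reproduce the actual TIGR metadata schema: the class names, field names, and the four toy records are assumptions invented for this example, and the resulting figures have nothing to do with the real proportions in the corpus.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Speaker:
    """Speaker metadata (hypothetical fields)."""
    pseudonym: str
    age: int
    first_language: str


@dataclass
class SpeechEvent:
    """Situational metadata for one recording (hypothetical fields)."""
    event_id: str
    setting: str   # e.g. "table conversation", "lesson"
    domain: str    # "private" or "institutional"
    speakers: List[Speaker] = field(default_factory=list)


def share_by_domain(events: List[SpeechEvent]) -> Dict[str, float]:
    """Return the percentage of speech events per domain."""
    counts = Counter(event.domain for event in events)
    total = sum(counts.values())
    return {domain: round(100 * n / total, 1) for domain, n in counts.items()}


def events_with_first_language(events: List[SpeechEvent], language: str) -> List[str]:
    """Return the IDs of events with at least one speaker of the given first language."""
    return [
        event.event_id
        for event in events
        if any(s.first_language == language for s in event.speakers)
    ]


# Toy records, invented purely for illustration.
events = [
    SpeechEvent("E1", "table conversation", "private",
                [Speaker("ANNA", 34, "Italian"), Speaker("LUCA", 36, "Italian")]),
    SpeechEvent("E2", "lesson", "institutional",
                [Speaker("MARA", 45, "Italian"), Speaker("TOM", 16, "German")]),
    SpeechEvent("E3", "interview", "institutional",
                [Speaker("EVA", 29, "Italian")]),
    SpeechEvent("E4", "food preparation", "private",
                [Speaker("PIA", 52, "Italian")]),
]

print(share_by_domain(events))
print(events_with_first_language(events, "German"))
```

Queries of this kind only locate candidate recordings. As noted above, what the participants actually do in those recordings still has to be established through detailed analysis.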

The final category of metadata, administrative metadata, “is used to manage administrative aspects of the digital object such as intellectual property rights and acquisition. Administrative Metadata also documents information concerning the creation, alteration and version control of the metadata itself. This is sometimes known as meta-metadata!” (Higgins 2007). Administrative metadata also includes subsets that are sometimes listed as separate metadata types: rights management metadata (or use metadata) and preservation metadata. The former deals with intellectual property rights and manages user access; the latter contains the information needed to archive and preserve a resource. Considering all the functions listed above, metadata is so important for corpus-based linguistic research because it facilitates the discovery of relevant information, helps organise electronic resources, fosters interoperability, provides digital identification, and supports archiving and preservation (cf. NISO 2004, p. 1). Higgins (2007) even goes so far as to describe metadata as “the backbone of digital curation. Without it a digital resource may be irretrievable, unidentifiable or unusable.”

If we take a closer look at Higgins’s quote and at the retrievability, identifiability, and usability of a resource, we are reminded of the FAIR Principles for scientific data management and stewardship, which “provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets” (GO FAIR initiative, 2022). A diverse set of stakeholders representing academia, industry, funding agencies, and scholarly publishers designed this set of principles to enhance the reusability of scholarly data, thereby advancing the open science movement (cf. Wilkinson et al. 2016). In the final section of this post, we would therefore like to explain the connection between metadata and the FAIR Principles and how metadata can improve the findability, accessibility, interoperability, and reusability of research data.

For a resource to be (re-)usable, it first needs to be findable by humans and computers. Describing a resource with machine-readable metadata is an essential component of the FAIRification process because it enables the automatic discovery of datasets and services (Principles F1-F4), provided that the data repository interacts with web search engines using an open, free, and universally implementable communications protocol. To guarantee accessibility, the protocol should, where necessary, allow for an authentication or authorisation procedure to grant user-specific access (Principles A1-A2). The ‘I’ in FAIR stands for interoperability, which describes the “ability of multiple systems with different hardware and software platforms, data structures, and interfaces to exchange data with minimal loss of content and functionality” (NISO 2004, p. 2). Describing a resource with adequate metadata in a defined schema allows it to be understood by both humans and machines, which is why Principle I1 suggests using a formal, accessible, and broadly applicable knowledge representation. The last set of principles refers to the reusability of scholarly data. The authors of the principles state: “The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings” (GO FAIR initiative, 2022). Data should therefore be described with a plurality of accurate and relevant attributes (Principle R1) so that their usefulness can be assessed.
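What “machine-readable metadata in a defined schema” can mean in practice may be sketched as follows. The example is hypothetical: it shows neither the SWISSUbase schema nor the way TIGR will actually be described. The field names, the list of required fields, and the placeholder identifier are assumptions made for illustration. Only the corpus facts (size, period, region, settings, file formats) are taken from this post.

```python
import json

# Hypothetical minimal schema: fields a record must fill in to count as complete.
REQUIRED_FIELDS = {"identifier", "title", "description", "license", "formats"}


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty, in alphabetical order."""
    return sorted(name for name in REQUIRED_FIELDS if not record.get(name))


record = {
    "identifier": "doi:10.0000/placeholder",  # placeholder, not a real identifier
    "title": "TIGR corpus of spoken Italian",
    "description": "Face-to-face interactions in non-experimental settings",
    "duration_hours": 23.5,
    "recording_period": "2021-2022",
    "recording_region": "Ticino and the Grisons",
    "formats": ["WAV", "MP4", "EAF"],
    "license": "",  # deliberately left empty to show what the check reports
}

# Serialise to JSON, a format that both humans and programs can read,
# and read it back as another system would.
serialised = json.dumps(record, ensure_ascii=False, indent=2)
restored = json.loads(serialised)

print(serialised)
print(missing_fields(restored))
```

Because the record follows a fixed set of named fields, a program can exchange it without loss and can also flag what is still missing, here the licence information that a potential reuser would need.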

This was only an abridged summary of how rich, structured, and well-defined metadata supports the FAIR Principles. Interested readers are encouraged to consult the principles in their entirety, not least to learn more about how metadata can support proper citation and the acknowledgement of provenance. In next week’s post, we will elaborate on the metadata of the TIGR corpus and how we intend to make it available on SWISSUbase.

Nina Profazi

References

Deppermann, A. (2023, December 7-8). Using large machine-readable corpora for Interactional Linguistics and Conversation Analysis: Potentials and limitations [Workshop presentation]. Designing, building and using data banks of interactional corpora from a conversation analytic perspective, Basel, Switzerland.

GO FAIR initiative (2022). FAIR principles. https://www.go-fair.org/fair-principles/

Higgins, S. (2007). What are Metadata Standards. https://www.dcc.ac.uk/guidance/briefing-papers/standards-watch-papers/what-are-metadata-standards#top

NISO National Information Standards Organization (2004). Understanding Metadata. NISO Press. https://www.lter.uaf.edu/metadata_files/UnderstandingMetadata.pdf

Schmidt, T. (2022). Daten und Metadaten. In M. Beißwenger, L. Lemnitzer & C. Müller-Spitzer (Eds.), Forschen in der Linguistik. Eine Methodeneinführung für das Germanistik-Studium (pp. 249–258). Wilhelm Fink.

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., . . . Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 1–9. https://doi.org/10.1038/sdata.2016.18

Institute of Italian Studies
Università della Svizzera italiana
West Campus, Main Building
Via Buffi 13
6900 Lugano, Switzerland
tel +41 58 666 42 95
e-mail isi@usi.ch
