Corpus
-
Short presentation
The TIGR corpus of spoken Italian was collected in the Swiss cantons of Ticino and Grisons in 2021-2022. It documents 23.5 hours of face-to-face interaction in various non-experimental settings: table conversations, food preparation, lessons and tutoring encounters, and interviews. The interactions were recorded with two camcorders and pocket audio recorders equipped with clip-on microphones, and transcribed in the ELAN editor. This work was carried out within the framework of the InfinIta project (SNSF grant no. 192771). Besides the InfinIta team (Johanna Miecznikowski, Elena Battaglia and Christian Geddo), further collaborators performed specific tasks: Chiara Sbordoni (fieldwork and transcription), Benedetta Scotto di Santolo (transcription) and Costanza Lucchini (transcription). A new phase of data processing started in 2023, with the aim of depositing the TIGR corpus in the LaRS @ SWISSUbase repository and thus making it available to the scientific community. The related tasks are currently being carried out by the InfinIta team and by Nina Profazi within the ShareTIGR project (USI ORD grant).
-
Corpus design
The design of the TIGR corpus responds to the research goals of the InfinIta project, which examines the categorisation of information sources in talk. The corpus was built by varying several event-related parameters: the more or less institutional character of the encounter, the number of participants, their social roles, and the presence or absence of multi-activity. Variation of the first three parameters gives rise to diverse epistemic configurations and dynamics of stance-taking. Multi-activity, when present, increases the probability of encountering certain types of information sources, in particular sources involving direct perception in situ.
Guided by the aim of diversifying interaction types along these lines, the InfinIta team recorded five table conversations, two interactions while preparing food, four tutoring sessions in architecture, six lessons (in theatre, music, restoration, general education and language teaching) and six interviews focused on the then topical issue of the Covid-19 pandemic. The number of participants ranged from two to nineteen. In interactions taking place in an institutional context (mainly lessons and tutoring, but also the interviews conducted by InfinIta team members), participant roles were clearly defined and distributed asymmetrically. In the non-institutional interactions of TIGR, which were recorded at people's homes, the participants' roles tended to be symmetrical, although some asymmetries emerged, for example when different generations were co-present. Multi-activity, finally, characterised not only table conversations and collaborative cooking, but also several interactions in educational settings that were organised as practical workshops. These workshops replaced a different interaction type, student group work, which had been part of the original design of TIGR but was excluded because of difficulties encountered during the pandemic.
All events were recorded in Southern Switzerland, more precisely in Ticino (nineteen events), the Italian Grisons (three events) and a multilingual institution in the Grisons (one event). Eleven municipalities are represented, ensuring a certain degree of geographical variety within Southern Switzerland. Speaker-related parameters (age, gender, information about the person's origin, residence, education, profession and languages) were varied as far as possible, within the limits of a recruitment procedure that partly depended on the social networks of the InfinIta team members (see the section First contact with participants and informed consent).
Lab blog post:
- Blog post of April 4, 2024: La composizione del TIGR (in Italian)
Document:
-
A particular moment in history: the Covid-19 pandemic
Lab blog post:
- Blog post of August 29th, 2024: Raccogliere dati linguistici ai tempi del COVID-19 (in Italian)
-
Places and participants
under construction
Lab blog post:
- Blog post of March 21, 2024: Digitisation of the TIGR participant questionnaires
-
First contact with participants and informed consent
In April 2021, InfinIta launched a campaign to recruit informants willing to be video-recorded in the various settings included in the corpus design. Contact with potential participants was established in three steps: (i) the team drafted a short project presentation, providing basic information and encouraging people to learn more about the project, which was shared on Twitter (now X) and sent by e-mail to individuals and institutions; (ii) interested persons filled in an online questionnaire created with Qualtrics XM; (iii) using the contact details and other information obtained through the questionnaire, the InfinIta team reached out to potential participants by phone and e-mail to offer further clarification and to schedule an event to be recorded on video.
Lab blog posts:
- Blog post of July 18th, 2024: Il lavoro sul campo: ricerca e contatto dei partecipanti (in Italian)
- Blog post of July 25th, 2024: Dichiarazioni di consenso informato (in Italian)
Documents:
- Information sheet about InfinIta/TIGR, April 2021 (in Italian)
- Information about Covid prevention during data collection, April 2021 (in Italian)
- Archived version of the questionnaire used to recruit potential study participants, April 2021 (in Italian)
-
Audio and video recordings
Each TIGR event was recorded from two different angles with Sony HXR-NX80//C camcorders. The sound was recorded on two to six tracks, depending on the number of participants per event. The team used two to four Tentacle Track E pocket-sized audio recorders with clip-on microphones and one external Sony ECM-VG1 microphone mounted on one of the camcorders. When documenting classroom interaction, an additional microphone (Neumann TLM 127 ni-K) was placed in the centre of the room and connected to the other camcorder.
All devices were synchronised before starting the recordings in order to obtain a maximally precise correspondence between sound and image. This was achieved by means of Tentacle timecode generators. Such generators are built into the Tentacle Track E recorders, which register timecode directly as metadata. To synchronise the camcorders, on the other hand, external Tentacle Sync timecode generators were connected to the camcorders' microphone inputs. These generate an acoustic timecode that is registered during the recordings in one of the camcorder's audio channels. A crucial component of the Tentacle system is a mobile application that communicates with all devices via Bluetooth, making it possible to synchronise them remotely and to start and stop recordings.
In post-production, the video files were first processed with the Tentacle Timecode Tool for Windows. This software reads the acoustic signal that encodes temporal information, converts it to metadata timecode and then removes the acoustic track, so that only metadata timecode remains. Subsequently, all video and audio files were imported into an Adobe Premiere project, where they were aligned on the basis of their metadata timecode and cut to equal length.
The technical set-up of each event and further issues related to data collection were described in a field note form, which in some cases contained photographs. The team registered the event's date and place, listed the devices used and all anonymous participant identifiers, reported the participants' wishes regarding de-identification, where applicable, and noted any other peculiarity of the situation judged potentially relevant for interpreting the data. One important function of the form was to associate identifiers with a description of the participants' physical appearance and with the names of the clip-on microphones. Any technical problem encountered while processing the files in the Tentacle Timecode Tool and Adobe Premiere was registered in the field note form as well.
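The field note form itself is a text document with photographs rather than a data structure, but the fields described above can be made concrete with a small sketch. The following dataclass is purely illustrative and uses hypothetical field names; it does not reproduce the project's actual form.

```python
# Purely illustrative: the TIGR field note form is a document, not code.
# This dataclass only mirrors the fields described above; names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FieldNote:
    event_id: str                            # anonymous event identifier
    date: str
    place: str
    devices: List[str]                       # camcorders, recorders, microphones
    participant_ids: List[str]               # anonymous participant identifiers
    deidentification_wishes: Dict[str, str]  # participant id -> requested treatment
    appearance_notes: Dict[str, str]         # participant id -> physical description
    microphone_assignment: Dict[str, str]    # participant id -> clip-on microphone name
    situational_remarks: str = ""
    postproduction_issues: str = ""          # problems noted during post-production
    photos: List[str] = field(default_factory=list)
```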
Lab blog post:
- Blog post of April 11th, 2024: Dall'evento al dataset (in Italian)
-
Transcription in ELAN
The corpus was transcribed with the multimedia annotator ELAN, v. 6.7 (Brugman & Russel 2004), adopting the GAT 2 conventions for fine transcription (Selting et al. 2011) with some adaptations.
Lab blog posts:
- Blog post of May 2nd, 2024: Morfologia delle trascrizioni, parte I: leggibili in che modo? (in Italian)
- Blog post of May 9th, 2024: Morfologia delle trascrizioni, parte II: codificare il tempo (in Italian)
  Associated video: https://www.youtube.com/watch?v=Ileqblg23_o (in Italian)
- Blog post of June 6th, 2024: Morfologia delle trascrizioni, parte IV: allineamento temporale e segmentazione (in Italian)
  Associated video: https://youtu.be/rUkGMdGEZbM (in Italian)
- Vlog post of June 13th, 2024: Morfologia delle trascrizioni, parte V: gestire le sovrapposizioni (in Italian)
  Associated video: https://youtu.be/1sTw4s-9f44 (in Italian)
References:
- Brugman, H., & Russel, A. (2004). Annotating Multimedia/Multi-modal Resources with ELAN. In Proceedings of LREC 2004, Fourth International Conference on Language Resources and Evaluation.
- Selting, M., Auer, P., Barth-Weingarten, D., Bergmann, J., Bergmann, P., Birkner, K., Couper-Kuhlen, E., Deppermann, A., Gilles, P., Günthner, S., Hartung, M., Kern, F., Mertzlufft, C., Meyer, C., Morek, M., Oberzaucher, F., Peters, J., Quasthoff, U., Schütte, W., & Uhmann, S. (2011). A system for transcribing talk-in-interaction: GAT 2 translated and adapted for English by Elizabeth Couper-Kuhlen and Dagmar Barth-Weingarten. Gesprächsforschung, 12, 1-51. http://www.gespraechsforschung-online.de/heft2011/heft2011.html
-
De-identification
Before the video files are made available through the LaRS repository, they will be de-identified in Adobe Premiere by applying video effects (e.g. Gaussian blur, Find edges), according to the wishes expressed by the participants in their declarations of consent. The audio tracks will be de-identified by distorting voices, again according to the participants' wishes, and by replacing certain names and temporal information with noise, in particular the names of persons, institutions and places as well as dates that could lead to the identification of participants. These replacements have been prepared in ELAN by annotating the problematic stretches of talk as "name" in a dedicated tier. A script will read the segments' start and end times and instruct the Praat application to process all audio tracks in the corresponding time intervals, deleting the original sound and inserting stretches of noise. In the transcribed text, personal information was pseudonymised. Participant names were replaced by pseudonyms of similar length. The remaining personal information was replaced by the labels personname / institutionname / placename / datename plus an index allowing for co-reference between multiple mentions of the same entity within one transcript.
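The project's own pipeline drives Praat from a script, as described above; that script is not reproduced here. The following minimal sketch illustrates the same idea in pure Python instead, under the assumption that the dedicated ELAN tier is called "name": it reads the segment boundaries with the pympi library and overwrites the corresponding samples of a WAV track with low-amplitude noise. All file names are hypothetical.

```python
# Minimal illustrative sketch, not the project's actual Praat-based script.
# Assumptions: the de-identification segments are annotated on an ELAN tier
# called "name", and the audio track is a WAV file readable by soundfile.
import numpy as np
import soundfile as sf
from pympi.Elan import Eaf

EAF_PATH = "event01.eaf"                      # hypothetical file names
WAV_PATH = "event01_track01.wav"
OUT_PATH = "event01_track01_deidentified.wav"

# 1. Collect the start and end times (in milliseconds) of all annotated segments.
eaf = Eaf(EAF_PATH)
spans = [(ann[0], ann[1]) for ann in eaf.get_annotation_data_for_tier("name")]

# 2. Overwrite the corresponding samples with low-amplitude white noise.
audio, rate = sf.read(WAV_PATH)
for start_ms, end_ms in spans:
    i0 = int(start_ms / 1000 * rate)
    i1 = int(end_ms / 1000 * rate)
    audio[i0:i1] = np.random.normal(0.0, 0.02, size=audio[i0:i1].shape)

sf.write(OUT_PATH, audio, rate)
```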
-
TXT and XML transcript files
Based on the transcripts produced in ELAN, the ShareTIGR project intends to produce two versions of each transcript:
- A TXT version (plain text). It is designed for human readers, and its layout respects criteria of both theoretical adequacy and readability. It contains timecode stamps at intervals of approximately 10 seconds to facilitate navigation between the text and the associated audio/video files when no software is available to align text and video automatically.
- An XML version, in a format that is still to be defined. This version should be interpretable by corpus query software and, ideally, the text should be tokenised, i.e. divided into words.
In both cases, the ELAN documents must be processed further. To obtain a plain text transcript, a workflow has been defined that alternates automatic processing, manual editing and script-assisted manual revision; a minimal sketch of the automatic export step is given below. The workflow is described in a document that will be part of the TIGR corpus' methodological documentation and is also the topic of the series "Morfologia delle trascrizioni I-VI" (in Italian) published on the ShareTIGR lab blog. The properties of the XML transcripts and the procedures to produce them are currently under discussion.
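The automatic steps of this workflow are carried out by scripts documented in the blog series; they are not reproduced here. As a rough illustration of the export step only, the sketch below assumes one ELAN tier per speaker (with the tier name used as the speaker label) and writes a plain-text transcript with a timecode stamp roughly every 10 seconds. All names and parameters are assumptions, not the project's actual code.

```python
# Rough illustration only (not the project's export script). Assumptions:
# one ELAN tier per speaker, with the tier name used as the speaker label.
from pympi.Elan import Eaf

def eaf_to_txt(eaf_path, txt_path, stamp_every_ms=10_000):
    eaf = Eaf(eaf_path)
    # Gather (start_ms, end_ms, speaker, text) tuples from all tiers.
    rows = []
    for tier in eaf.get_tier_names():
        for ann in eaf.get_annotation_data_for_tier(tier):
            rows.append((ann[0], ann[1], tier, ann[2]))
    rows.sort(key=lambda r: r[0])             # order by start time

    next_stamp = 0
    with open(txt_path, "w", encoding="utf-8") as out:
        for start, end, speaker, text in rows:
            if start >= next_stamp:           # a timecode stamp about every 10 s
                minutes, seconds = divmod(start // 1000, 60)
                out.write(f"[{minutes:02d}:{seconds:02d}]\n")
                next_stamp = start + stamp_every_ms
            out.write(f"{speaker}\t{text}\n")

eaf_to_txt("event01.eaf", "event01.txt")      # hypothetical file names
```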
Lab blog posts:
- Blog post of April 11th, 2024: Dall'evento al dataset (in Italian)
- Blog post of May 2, 2024: Morfologia delle trascrizioni, parte I: leggibili in che modo? (in Italian)
- Blog post of May 9, 2024: Morfologia delle trascrizioni, parte II: codificare il tempo (in Italian)
  Associated video: https://www.youtube.com/watch?v=Ileqblg23_o (in Italian)
- Blog post of May 16, 2024: Morfologia delle trascrizioni, parte III: il primo script (in Italian)
  Associated video: https://www.youtube.com/watch?v=wNyGZJVDbyg (mute)
- Blog post of June 6, 2024: Morfologia delle trascrizioni, parte IV: allineamento temporale e segmentazione (in Italian)
  Associated video: https://youtu.be/rUkGMdGEZbM (in Italian)
- Vlog post of June 13, 2024: Morfologia delle trascrizioni, parte V: gestire le sovrapposizioni (in Italian)
  Associated video: https://www.youtube.com/watch?v=1sTw4s-9f44 (in Italian)
- Blog post of July 11, 2024: Morfologia delle trascrizioni, parte VI: uso di script in fase di impaginazione e di revisione (in Italian)
-
Accessibility on the repository
The TIGR corpus will be deposited in the Language Repository of Switzerland (LaRS), a section of the SWISSUbase repository.
Lab blog posts:
- Blog post of March 28, 2024: "As open as possible, as restricted as necessary"
- Blog post of April 18, 2024: Exploring LaRS @ SWISSUbase
  Associated video: https://www.youtube.com/watch?v=lqU2JPhQjBY (mute)
- Blog post of April 25, 2024: Grouping the TIGR data for reuse
-
Metadata
The SWISSUbase repository organises data as studies, which contain one or more datasets, which in turn contain one or more files, and offers forms to fill in metadata at all three levels. To describe datasets and files, the LaRS section of SWISSUbase provides metadata fields and controlled vocabularies specifically designed for linguistic data. At dataset level, these fields concern, for example, the type of resource and general characteristics of the set of participants involved. At file level, it is possible to describe the languages documented; the properties of texts and annotations; the technical features, length and content of audio, video, text and image files; and, finally, the tools used to process the data (e.g. a transcript editor or annotation software).
To document the TIGR corpus accurately, further metadata will be added in separate documentation files at dataset level. This is particularly important for datasets that represent a recorded event. In this kind of dataset, the event is the entity to which all included files relate and from which they inherit properties such as the region in which the event was recorded, the interaction genre, the set of participants involved or the technical set-up of the recordings. A metadata file makes it possible to model the event entity and its relations to the files contained in the dataset. Moreover, such a file can be used to list the individual participants in an event and to link this information to the sociolinguistic questionnaire data collected during the recordings.
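As an illustration of how such an event-level metadata file might model the event entity and its relations to files and participants, the sketch below shows a hypothetical record as a Python dictionary. All field names and values are assumptions chosen for readability; the actual ShareTIGR metadata files and the SWISSUbase scheme may be organised differently.

```python
# Hypothetical sketch of an event-level metadata record; all field names and
# values are assumptions for illustration, not the actual SWISSUbase scheme.
event_dataset_metadata = {
    "study": "TIGR corpus of spoken Italian",
    "dataset": {
        "event_id": "event01",               # hypothetical identifier
        "interaction_genre": "table conversation",
        "region": "Ticino",
        "recording_setup": {"camcorders": 2, "audio_tracks": 4},
        "participants": [
            # links each participant to the sociolinguistic questionnaire data
            {"participant_id": "P01", "questionnaire_ref": "Q-P01"},
            {"participant_id": "P02", "questionnaire_ref": "Q-P02"},
        ],
    },
    "files": [
        {"name": "event01_cam1.mp4", "type": "video"},
        {"name": "event01_track01.wav", "type": "audio"},
        {"name": "event01.eaf", "type": "annotation", "tool": "ELAN 6.7"},
    ],
}
```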
Lab blog posts:
- Blog post of April 18, 2024: Exploring LaRS @ SWISSUbase
  Associated video: https://www.youtube.com/watch?v=lqU2JPhQjBY (mute)
- Blog post of September 6, 2024: Why metadata is important for FAIR data sharing and reuse
- Blog post of October 3, 2024: Metadata on LaRS: the in-built scheme
- Blog post of October 10, 2024: Metadata on LaRS: designing metadata files for event datasets
Reference:
- SWISSUbase (2023). Metadata Guide for Linguistics Data. Metadata documentation. Version 1.1. https://resources.swissubase.ch/wp-content/uploads/2023/12/Linguistics_Metadata-Guide_en.pdf