Corpus
-
Short presentation
The TIGR corpus of spoken Italian was collected in the Swiss cantons of Ticino and Grisons in 2021-2022. It documents 23.5 hours of face-to-face interaction in various kinds of non-experimental settings: dinner and lunch table conversations, food preparation, lessons and tutoring encounters, and interviews. The interactions were recorded with two camcorders and pocket audio recorders equipped with clip-on microphones and transcribed using the ELAN editor. This work was completed within the framework of the InfinIta project (SNSF grant no. 192771). Besides the InfinIta team (Johanna Miecznikowski, Elena Battaglia and Christian Geddo), further collaborators performed specific tasks: Chiara Sbordoni (fieldwork and transcription), Benedetta Scotto di Santolo (transcription), Costanza Lucchini (transcription), Simona Kaufmann (transcription), Tommaso Barenco (transcription). From 2023 to 2025, the corpus data were processed and uploaded to the LaRS @ SWISSUbase repository to make TIGR available to the scientific community. The related tasks were performed within the ShareTIGR project (USI ORD grant) by the InfinIta team, assisted by Nina Profazi, Simona Kaufmann, Tommaso Barenco, Alessia Blum and Giadamaria Valentino.
-
Corpus design
The design of the TIGR corpus responds to the research goals of the InfinIta project, which examines the categorisation of information sources in talk. The corpus was built varying several event-related parameters, i.e. the more or less institutional character of the encounter, the number of participants, their social roles and the presence or absence of multi-activity. The variation of the first three parameters gives rise to diverse epistemic configurations and dynamics of stance-taking. Multi-activity, when present, enhances the probability of encountering certain types of information sources, especially sources involving direct perception in situ.
Guided by the aim of diversifying interaction types along these lines, the InfinIta team recorded five dinner and lunch table conversations, two interactions while preparing food, four tutoring sessions in architecture, six lessons - in theatre, music, restoration, general education and language teaching - and six interviews focused on the then topical issue of the Covid 19 pandemic. The number of participants ranged from two to nineteen. In those interactions that took place in an institutional context (mainly lessons and tutoring, but also the interviews conducted by InfinIta team members), participant roles were clearly defined and distributed asymmetrically. In the non-institutional interactions of TIGR, which were recorded at people's homes, the participants' roles tended to be symmetrical, but some asymmetries emerged, for example when different generations were co-present. Multi-activity, finally, characterised not only dinner and lunch table conversations and collaborative cooking, but also several interactions in educational settings that were organised as practical workshops. Those workshops replaced a different interaction type, i.e. student group work, which had been part of the original design of TIGR, but was then excluded because of difficulties encountered during the pandemic.
All events were recorded in Southern Switzerland, more precisely in Ticino (nineteen events), the Italian Grisons (three events) and a multilingual institution in the Grisons (one event). Eleven municipalities are represented, ensuring a certain degree of geographical variety within Southern Switzerland. Speaker-related parameters (age, gender, information about the person's origin, residence, education, profession, and languages) were varied as far as possible, within the limits of a recruitment procedure that partly depended on the social networks of the InfinIta team members (see the section First contact with participants and informed consent).
Lab blog post:
- Blog post of April 4, 2024: La composizione del TIGR (in Italian)
-
A particular moment in history: the Covid-19 pandemic
The recordings were planned from late 2020 onwards and carried out between May 2021 and May 2022. The pandemic influenced research in many ways. Most importantly, habits had changed. This constrained fieldwork, for it was difficult to contact potential study participants directly and, during recordings, Covid protection measures needed to be taken. Moreover, interaction genres such as face-to-face student group work or lunch breaks, which were originally part of the corpus design, were not actually practised at the time, and the team therefore had to exclude them. Finally, several social practices that are typical of the pandemic appear in the video recordings, which show face masks here and there, social distancing, participation in lessons via video conference, and interviews outdoors in more or less deserted public spaces. On a different level, various aspects of the pandemic are mentioned in the TIGR conversations, both in the natural settings (where participants refer to vaccination in particular) and in the interviews, where the pandemic is a central topic.
Lab blog post:
- Blog post of August 29th, 2024: Raccogliere dati linguistici ai tempi del COVID-19 (in Italian)
-
Participants
All study participants filled in a short sociolinguistic questionnaire before the start of the recordings. They answered questions about their sex and age, their place of upbringing, education and language repertoire. These data were then aggregated by creating age ranges of 10 years and grouping geographical locations into regions: the Canton for places in Switzerland and the Region for places in Italy. For places in other countries, only the country was recorded.
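The aggregation step described above can be illustrated with a minimal Python sketch. The function names and the label formats (for example "20-29" or "Switzerland: Ticino") are assumptions made for illustration and are not taken from the TIGR metadata tables.

```python
def age_range(age: int, width: int = 10) -> str:
    """Map an exact age to a range label such as '20-29'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def place_label(country: str, region: str = "") -> str:
    """Keep the canton/region for Switzerland and Italy, the country only elsewhere."""
    if country in {"Switzerland", "Italy"} and region:
        return f"{country}: {region}"
    return country


if __name__ == "__main__":
    print(age_range(23))                          # label for a 23-year-old speaker
    print(place_label("Italy", "Lombardy"))       # region is kept
    print(place_label("France", "Brittany"))      # region is dropped
```

Coarsening values in this way reduces the risk of re-identifying individual speakers while keeping the categories needed for sociolinguistic description.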
61 women, 53 men and one school-age child participated in the study. While all age ranges from 10 to 80 years are represented, the speakers who are between 20 and 29 years old are by far the most numerous group (73 out of 115). 51 participants attended primary school in Switzerland, four fifths of them in Ticino or the Grisons; 58 participants went to primary school in Italy, mostly in Lombardy (27 speakers), but also in other regions: Piedmont, Campania, Tuscany, Emilia-Romagna, Trentino - South Tyrol, Sicily, Marche, Liguria, Calabria, Veneto (ordered by the number of TIGR speakers coming from the region). 6 speakers went to primary school in some other country. As to education, the largest group consists of graduates from a secondary school providing general education, followed by graduates from a school providing vocational education and training, bachelor's graduates, master's graduates, participants holding a PhD degree and a small group of participants whose highest degree is a secondary I school diploma.
All TIGR participants work or study in Southern Switzerland and use the Italian language on a daily basis. Among those who were not socialised in Southern Switzerland or in Italy, competence in Italian varies. The TIGR participants are generally multilingual: 45 speakers have competences in four or more languages, 57 reported a repertoire of 2-3 languages, whereas only 13 speakers consider themselves monolingual in Italian. The languages mentioned most often, besides Italian, are English and the other major Swiss national languages.
Lab blog post:
- Blog post of March 21, 2024: Digitisation of the TIGR participant questionnaires
-
First contact with participants and informed consent
In April 2021, InfinIta launched a campaign to recruit informants willing to be video-recorded in the various settings included in the corpus design. Contact with potential participants was established in three steps: (i) the team drafted a short project presentation, providing basic information and motivating people to learn more about the project, which was then shared on X (then Twitter) and sent by e-mail to individuals and institutions; (ii) anyone interested filled in an online questionnaire created with Qualtrics XM; (iii) using contact details and other information obtained through the questionnaire, the InfinIta team reached out to potential participants by phone and e-mail to offer further clarification and schedule an event to be recorded on video.
Lab blog posts:
- Blog post of July 18th, 2024: Il lavoro sul campo: ricerca e contatto dei partecipanti (in Italian)
- Blog post of July 25th, 2024: Dichiarazioni di consenso informato (in Italian)
Documents:
- Information sheet about InfinIta/TIGR, April 2021 (in Italian)
- Information about Covid prevention during data collection, April 2021 (in Italian)
- Archived version of the questionnaire used to recruit potential study participants, April 2021 (in Italian)
-
Audio and video recordings
Each event of TIGR was recorded from two different angles with Sony HXR-NX80//C camcorders. The sound was recorded on two to six tracks, depending on the number of participants per event. The team used two to four Tentacle Track E pocket-sized audio recorders with clip-on microphones and one external Sony ECM-VG1 microphone mounted on one of the camcorders. When documenting classroom interaction, an additional microphone (Neumann TLM 127 ni-K) was placed in the center of the room and connected to the other camcorder.
All devices were synchronised before starting the recordings in order to obtain a maximally precise correspondence between sound and image. This was achieved by means of Tentacle timecode generators. Such generators are built into Tentacle Track E recorders, which record timecode directly as metadata. To synchronise the camcorders, on the other hand, external Tentacle Sync timecode generators were connected to the camcorders' microphone inputs. These generate acoustic timecode that is recorded in one of the camcorder's audio channels. A crucial component of the Tentacle system is a mobile application that communicates with all devices via Bluetooth, making it possible to synchronise them remotely and to start and stop recordings.
In post-production, video files were first processed by means of the Tentacle Timecode Tool for Windows. The software reads the acoustic signal that encodes temporal information, converts it to metadata timecode and removes the acoustic signal, keeping only the metadata timecode. Subsequently, all video and audio files were imported into an Adobe Premiere project, where they were aligned based on their metadata timecode and cut to equal length. De-identification measures were applied (see the relevant section below). For each event, to complete the set of audio recordings, two mute videos were extracted and one split-screen video with mixed sound was produced.
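The logic of aligning files by timecode and cutting them to equal length can be sketched in Python as follows. This is only an illustration of the arithmetic involved, not the procedure carried out in Adobe Premiere; the frame rate of 25 fps, the HH:MM:SS:FF timecode layout and all names are assumptions.

```python
from dataclasses import dataclass

FPS = 25  # assumed frame rate; the actual recording settings may differ


def timecode_to_seconds(tc: str, fps: int = FPS) -> float:
    """Convert an 'HH:MM:SS:FF' timecode to seconds since midnight."""
    hours, minutes, seconds, frames = (int(part) for part in tc.split(":"))
    return hours * 3600 + minutes * 60 + seconds + frames / fps


@dataclass
class Recording:
    name: str          # hypothetical file label
    start_tc: str      # timecode of the first frame or sample
    duration: float    # length in seconds


def common_window(recordings, fps: int = FPS):
    """Return per-file trim offsets (seconds) and the shared duration."""
    starts = [timecode_to_seconds(r.start_tc, fps) for r in recordings]
    ends = [start + r.duration for start, r in zip(starts, recordings)]
    window_start, window_end = max(starts), min(ends)
    if window_end <= window_start:
        raise ValueError("recordings do not overlap in time")
    trims = {r.name: window_start - start for start, r in zip(starts, recordings)}
    return trims, window_end - window_start


if __name__ == "__main__":
    files = [
        Recording("cam1", "10:00:00:00", 600.0),
        Recording("cam2", "10:00:05:00", 600.0),
        Recording("track1", "09:59:58:00", 590.0),
    ]
    trims, length = common_window(files)
    print(trims, length)
```

Each file is trimmed at its beginning by the offset returned for it and then cut to the shared duration, so that all files start and end at the same moment.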
The technical set-up of each event and further issues related to data collection were described in a field note form, sometimes supplemented with photographs. The team listed the event's date and place, the devices used, any technical problems encountered, all anonymous participant identifiers, the participants' wishes for de-identification, where applicable, and any other peculiarity of the situation that was judged potentially relevant to interpret the data. One important function of the form was to associate identifiers with a description of the participants' physical appearance and with the names of clip-on microphones. The form was used to look up information at various stages, from multimedia post-production to transcription, from local archiving to the compilation of metadata in view of data sharing.
Lab blog post:
- Blog post of April 11th, 2024: Dall'evento al dataset (in Italian)
-
Transcription in ELAN
The corpus was transcribed by means of the multimedia annotator ELAN, v. 6.7. One tier per speaker was used; in some classroom settings, a tier for the class was added to be able to attribute certain phenomena - such as whispering, laughter, cheering, musical practice - to a collective agent. A further tier was dedicated to “ambient noises”. In the transcript, actions by the class and “noises” were included in double round brackets and either quantified in seconds or, when simultaneous with transcribed text, anchored to the latter by means of special characters, adopting conventions similar to those proposed by Mondada (2018) for multimodal transcription. Besides the tiers for transcription and for the description of noises, one tier was added to annotate segments containing names and dates to be silenced later.
A subset of the GAT 2 conventions for fine transcription (Selting et al. 2011) was adopted, to which one sign was added: the tilde (~) for word interruptions.
Segmentation in ELAN aimed at maintaining transcription efficiency while paying attention to the precision of temporal information. In the case of overlapping discourse, when more than two speakers were involved, the transcribers were instructed to segment at overlap boundaries, so as to obtain a maximum of temporal information and a neat vertical alignment of the transcribed discourse in a tier-based visualization. This was an advantage when subsequently deriving transcripts in playscript style, a process that required extensive manual revision and during which the original transcripts were often read in parallel in ELAN. On the other hand, this practice produced many segment boundaries within words, which posed a challenge for automatic tokenization at a later stage.
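What segmenting at overlap boundaries means for the time intervals can be sketched in Python. In TIGR this segmentation was done manually by the transcribers in ELAN; the function below is only a hypothetical illustration that cuts every interval at the start and end points of other speakers' intervals falling inside it, regardless of the number of speakers involved.

```python
def split_at_overlap_boundaries(segments):
    """Cut each (speaker, start, end) segment at every boundary of another
    speaker's segment that falls strictly inside it."""
    result = []
    for speaker, start, end in segments:
        cuts = {start, end}
        for other, other_start, other_end in segments:
            if other == speaker:
                continue
            for boundary in (other_start, other_end):
                if start < boundary < end:
                    cuts.add(boundary)
        points = sorted(cuts)
        # Consecutive cut points delimit the resulting sub-segments.
        result.extend((speaker, a, b) for a, b in zip(points, points[1:]))
    return result


if __name__ == "__main__":
    turns = [("A", 0.0, 4.0), ("B", 2.0, 5.0)]
    for segment in split_at_overlap_boundaries(turns):
        print(segment)
```

With the two turns in the usage example, speaker A's interval is cut where B starts and B's interval is cut where A ends, so the overlapping stretch has identical boundaries on both tiers. Since the cuts follow time rather than word boundaries, they can fall inside a word, which is the tokenization difficulty mentioned above.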
Lab blog posts:
- Blog post of May 2nd, 2024: Morfologia delle trascrizioni, parte I: leggibili in che modo?
- Blog post of May 9th, 2024: Morfologia delle trascrizioni, parte II: codificare il tempo
Associated video: https://www.youtube.com/watch?v=Ileqblg23_o (in Italian)
- Blog post of June 6th, 2024: Morfologia delle trascrizioni, parte IV: allineamento temporale e segmentazione
Associated video: https://youtu.be/rUkGMdGEZbM (in Italian)
- Vlog post of June 13th, 2024: Morfologia delle trascrizioni, parte V: gestire le sovrapposizioni
https://youtu.be/1sTw4s-9f44
References:
- Brugman, H., Russel, A. (2004). Annotating Multimedia/ Multi-modal resources with ELAN. In: Proceedings of LREC 2004, Fourth International Conference on Language Resources and Evaluation.
- Mondada, L. (2018). Multiple Temporalities of Language and Body in Interaction: Challenges for Transcribing Multimodality. Research on Language and Social Interaction, 51:1, 85-106. https://www.lorenzamondada.net/multimodal-transcription
- Selting, M., Auer, P., Barth-Weingarten, D., Bergmann, J., Bergmann, P., Birkner, K., Couper-Kuhlen, E., Deppermann, A., Gilles, P., Günthner, S., Hartung, M., Kern, F., Mertzlufft, C., Meyer, C., Morek, M., Oberzaucher, F., Peters, J., Quasthoff, U., Schütte, W., & Uhmann, S. (2011). A system for transcribing talk-in-interaction: GAT 2 translated and adapted for English by Elizabeth Couper-Kuhlen and Dagmar Barth-Weingarten. Gesprächsforschung, 12, 1-51. http://www.gespraechsforschung-online.de/heft2011/heft2011.html
-
De-identification
Video files were de-identified in Adobe Premiere by means of Gaussian blur effects, masking participants who expressed that wish in their declarations of consent and, in a few events, people passing by. Audio tracks were de-identified in several ways. In a few cases, voices were distorted according to the wishes expressed by participants, either by applying audio effects to the entire recording (one event) or, when few non-overlapping turns were concerned, by selectively distorting the corresponding segments (two events). A few conversations with people passing by were muted. Segments containing names of persons, institutions and places as well as dates that could lead to the identification of participants were replaced by silence through a procedure prepared in ELAN and performed in Praat. In ELAN, the problematic stretches of talk were annotated as names in a dedicated tier. The tier content was then exported as a CSV table, and a script authored by Francesco Cangemi, whom we thank, was used to read the segments' start and end times from the table and instruct Praat to process all audio tracks in the corresponding time intervals, removing the original sound and inserting stretches of silence. As to the transcripts, personal information was pseudonymised. Participant names were replaced by pseudonyms of similar length. The remaining personal information was replaced by the labels personname / institutionname / placename / datename + an index allowing for co-reference between multiple mentions of the same entity within one transcript.
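The principle of the silencing step (read start and end times from a table, then replace the audio in those intervals with silence) can be sketched in Python. This is not the Praat script used for TIGR; the column names start and end, the use of seconds as the unit, and the representation of audio as a plain list of samples are assumptions made for illustration.

```python
import csv
import io


def read_intervals(csv_text):
    """Read (start, end) pairs in seconds from CSV text with 'start' and 'end' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(row["start"]), float(row["end"])) for row in reader]


def silence(samples, sample_rate, intervals):
    """Return a copy of the samples with every listed interval set to zero.
    The total length is preserved, so alignment with the video is not affected."""
    out = list(samples)
    for start, end in intervals:
        first = max(0, int(round(start * sample_rate)))
        last = min(len(out), int(round(end * sample_rate)))
        for index in range(first, last):
            out[index] = 0
    return out


if __name__ == "__main__":
    table = "start,end\n0.2,0.5\n"    # hypothetical export of the names tier
    audio = [1] * 10                   # toy signal: one second at 10 samples/s
    print(silence(audio, 10, read_intervals(table)))
```

Replacing the sound with silence of the same duration, rather than cutting it out, keeps every track the same length as before.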
Document:
- Guida uso script Praat (in Italian)
-
TXT and XML transcript files
Based on the transcripts produced in ELAN, two further transcript versions have been produced.
In a first phase, a plain text version in playscript style was derived, adopting a workflow that alternated automatic processing, manual editing and script-assisted manual revision. It is mainly designed for human readers. Its layout respects criteria of both theoretical adequacy and readability. It contains timecode stamps at intervals of approximately 10 seconds to facilitate navigating between the text and the associated audio/video files in situations in which no software is available to automatically align text and video. Moreover, the structural simplicity of this format makes it compatible with a wide range of applications.
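How timecode stamps at regular intervals can be interleaved with a playscript can be sketched as follows. The stamp format in square brackets, the speaker labels and the function names are assumptions and do not reproduce the actual TIGR layout or scripts.

```python
def format_stamp(seconds: float) -> str:
    """Render seconds as a '[HH:MM:SS]' stamp (format assumed for illustration)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def playscript(turns, interval: float = 10.0):
    """Turn (start_seconds, speaker, text) tuples into playscript lines,
    inserting a stamp when at least `interval` seconds have passed since the last one."""
    lines = []
    last_stamp = None
    for start, speaker, text in turns:
        if last_stamp is None or start - last_stamp >= interval:
            lines.append(format_stamp(start))
            last_stamp = start
        lines.append(f"{speaker}: {text}")
    return lines


if __name__ == "__main__":
    example = [
        (0.0, "ANNA", "ciao"),
        (4.2, "LUCA", "ciao"),
        (11.5, "ANNA", "come va"),
        (13.0, "LUCA", "bene"),
    ]
    print("\n".join(playscript(example)))
```

Because stamps are placed only at turn beginnings, the distance between them is approximately rather than exactly ten seconds.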
In a later phase, an XML version was created, which is tokenized at the level of words, intonation units and speaker contributions and conforms to the well-documented ISO/TEI 24624:2016 standard for the transcription of spoken language (see Hedeland and Schmidt 2022). The format is suitable as an input to databases and corpus platforms. Conversion was performed by Thomas Schmidt (https://linguisticbits.de/), while problematic cases during tokenization were back-checked manually by the project team. The conversion pipeline is documented in the private GitHub repository https://github.com/berndmoos/linguisticbits; access can be requested by writing to the author.
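To give an idea of how a word-tokenized TEI transcript can be queried, the following Python sketch counts word tokens per speaker. The inline sample is hand-written and only assumes the general shape of TEI-based spoken-language transcripts (annotationBlock, u, seg and w elements in the TEI namespace); the actual TIGR files may use additional elements and attributes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hand-written sample, not an excerpt from TIGR.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <annotationBlock who="#ANNA" start="#T0" end="#T1">
      <u><seg type="contribution"><w>ciao</w><w>a</w><w>tutti</w></seg></u>
    </annotationBlock>
    <annotationBlock who="#LUCA" start="#T1" end="#T2">
      <u><seg type="contribution"><w>buongiorno</w></seg></u>
    </annotationBlock>
  </body></text>
</TEI>"""


def words_per_speaker(xml_text):
    """Count <w> tokens inside each <annotationBlock>, grouped by the 'who' attribute."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for block in root.iter(f"{TEI}annotationBlock"):
        speaker = block.get("who", "").lstrip("#")
        counts[speaker] += sum(1 for _ in block.iter(f"{TEI}w"))
    return counts


if __name__ == "__main__":
    print(words_per_speaker(SAMPLE))
```

Only the Python standard library is used here; a corpus platform or an XML database would typically replace this kind of ad hoc query.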
Lab blog posts:
- Blog post of April 11th, 2024: Dall'evento al dataset (in Italian)
- Blog post of May 2, 2024: Morfologia delle trascrizioni, parte I: leggibili in che modo? (in Italian)
- Blog post of May 9, 2024: Morfologia delle trascrizioni, parte II: codificare il tempo (in Italian)
Associated video: https://www.youtube.com/watch?v=Ileqblg23_o (in Italian)
- Blog post of May 16, 2024: Morfologia delle trascrizioni, parte III: il primo script (in Italian)
Associated video: https://www.youtube.com/watch?v=wNyGZJVDbyg (mute)
- Blog post of June 6, 2024: Morfologia delle trascrizioni, parte IV: allineamento temporale e segmentazione (in Italian)
Associated video: https://youtu.be/rUkGMdGEZbM (in Italian)
- Vlog post of June 13, 2024: Morfologia delle trascrizioni, parte V: gestire le sovrapposizioni
https://www.youtube.com/watch?v=1sTw4s-9f44 (in Italian)
- Blog post of July 11, 2024: Morfologia delle trascrizioni, parte VI: uso di script in fase di impaginazione e di revisione (in Italian)
References:
- Hedeland, H. & T. Schmidt (2022). The TEI-based ISO Standard ‘Transcription of spoken language’ as an Exchange Format within CLARIN and beyond. Selected papers from the CLARIN Annual Conference 2021. Ed. M. Monachini & M. Eskevich. Linköping Electronic Conference Proceedings 189, pp. 34–45. DOI: https://doi.org/10.3384/9789179294441
- TIGRformat 2025. https://github.com/JohannaMiecznikowski/Infinitaplus/releases/tag/v.1.0.0
-
Accessibility on repository
The TIGR corpus is archived in the Language Repository of Switzerland LaRS, a section of the SWISSUbase repository (project no. 20902). Besides general documentation and a dataset with all transcripts, TIGR on LaRS includes two datasets per event, which differ mainly by their multimedia documents. The "light" version of an event contains one integrated split-screen video with mixed sound, whereas the "full" version adds to that the single audio tracks and two mute video files. After signing a licensing agreement, the user can download datasets and reuse them for scientific purposes. Datasets are citable thanks to DOIs and bibliographic references provided by the repository.
Lab blog posts:
- Blog post of March 28, 2024: "As open as possible, as restricted as necessary"
- Blog post of April 18, 2024: Exploring LaRS @ SWISSUbase
Associated video: https://www.youtube.com/watch?v=lqU2JPhQjBY (mute)
- Blog post of April 25, 2024: Grouping the TIGR data for reuse
Reference:
- Miecznikowski-Fuenfschilling, J., Battaglia, E., & Geddo, C. (2025). General TIGR documentation (Version 1.0) [Data set]. LaRS - Language Repository of Switzerland. https://doi.org/10.48656/mgq4-7p77
-
Metadata
The SWISSUbase repository hosts studies, which contain one or more datasets, which in turn contain one or more files. It offers forms to fill in metadata at all three levels. To describe datasets and files, the LaRS section of SWISSUbase offers fields and controlled vocabulary that are specifically designed for linguistic data. At dataset level, these fields regard for example the type of resource and general characteristics of the set of involved participants. At file level, it is possible to describe the languages documented; properties of texts and annotations; technical features, length, and content of audio, video, text and image files; and, finally, the tools used to process data (e.g. a transcript editor or annotation software).
Additionally, in order to accurately document the TIGR corpus and facilitate local data management, metadata have been provided in the form of separate CSV tables. The metadata categories have been defined following CLARIN recommendations (CMDI Best Practices Guide, 2017) and most categories refer to concepts listed in the CLARIN concept registry.
Event datasets contain the following tables: (1) event-related properties; (2) a list of the participants involved; (3) basic properties of the corpus (identical for all events); (4) video metadata; (5) audio metadata (full version only); (6) transcript metadata. The dataset General TIGR documentation contains several overview tables: an overview of all TIGR events, geographical information about the recording locations, and a description of the TIGR speakers based on questionnaire data. The location and speaker tables assign IDs to each listed entity, which are referenced in the event-related metadata tables (1) and (2).
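The way IDs link the event-level tables to the overview tables can be illustrated with a small Python sketch that joins a participant list with a speaker overview. The column names (event_id, speaker_id, age_range) and the sample values are invented for illustration; the actual TIGR tables define their own categories.

```python
import csv
import io


def load_table(csv_text, key):
    """Index the rows of a CSV table by the value of one column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def participants_with_details(participants_csv, speakers_csv, key="speaker_id"):
    """Add speaker overview information to each row of an event's participant list."""
    speakers = load_table(speakers_csv, key)
    rows = []
    for row in csv.DictReader(io.StringIO(participants_csv)):
        details = speakers.get(row[key], {})  # unknown IDs keep only the event-level data
        rows.append({**row, **details})
    return rows


if __name__ == "__main__":
    participants = "event_id,speaker_id\nE01,S001\nE01,S002\n"   # hypothetical table (2)
    speakers = "speaker_id,age_range\nS001,20-29\nS002,30-39\n"  # hypothetical overview table
    for row in participants_with_details(participants, speakers):
        print(row)
```

Keeping speaker and location descriptions in separate overview tables and referencing them by ID avoids repeating the same information in every event dataset.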
The metadata contained in the CSV tables have been recategorized according to TEI conventions and inserted into suitable slots of the header of the ISO/TEI 24624:2016 transcripts.
Lab blog posts:
- Blog post of April 18, 2024: Exploring LaRS @ SWISSUbase
Associated video: https://www.youtube.com/watch?v=lqU2JPhQjBY (mute)
- Blog post of September 6, 2024: Why metadata is important for FAIR data sharing and reuse.
- Blog post of October 3, 2024: Metadata on LaRS: the in-built scheme
- Blog post of October 10, 2024: Metadata on LaRS: designing metadata files for event datasets
References:
- CMDI and Metadata Curation task forces of the Standing Committee on CLARIN Technical Centres (2017). CMDI Best Practices Guide, v. 1.2.0. https://www.clarin.eu/content/cmdi-best-practices-guide
- Miecznikowski-Fuenfschilling, J., Battaglia, E., & Geddo, C. (2025). General TIGR documentation (Version 1.0) [Data set]. LaRS - Language Repository of Switzerland. https://doi.org/10.48656/mgq4-7p77
- SWISSUbase (2023). Metadata Guide for Linguistics Data. Metadata documentation. Version 1.1. https://resources.swissubase.ch/wp-content/uploads/2023/12/Linguistics_Metadata-Guide_en.pdf