Corpus

  • Short presentation

    The TIGR corpus of spoken Italian was collected in the Swiss cantons of Ticino and Grisons in 2021-2022. It documents face-to-face interactions in various kinds of non-experimental settings: dinner and lunch table conversations, food preparation, lessons and tutoring encounters, and interviews, totalling 23.5 hours. The interactions were recorded with two camcorders and pocket audio recorders equipped with clip-on microphones and transcribed using the ELAN editor. This work was completed within the framework of the InfinIta project (SNSF grant no. 192771). Besides the InfinIta team (Johanna Miecznikowski, Elena Battaglia and Christian Geddo), further collaborators carried out specific tasks: Chiara Sbordoni (fieldwork and transcription), Benedetta Scotto di Santolo (transcription), Costanza Lucchini (transcription), Simona Kaufmann (transcription), Tommaso Barenco (transcription). From 2023 to 2025, the corpus data were processed and uploaded to the LaRS @ SWISSUbase repository to make TIGR available to the scientific community. The related tasks were performed within the ShareTIGR project (USI ORD grant) by the InfinIta team, assisted by Nina Profazi, Simona Kaufmann, Tommaso Barenco, Alessia Blum and Giadamaria Valentino.

  • Corpus design

    The design of the TIGR corpus responds to the research goals of the InfinIta project, which examines the categorisation of information sources in talk. The corpus was built by varying several event-related parameters, i.e. the more or less institutional character of the encounter, the number of participants, their social roles and the presence or absence of multi-activity. The variation of the first three parameters gives rise to diverse epistemic configurations and dynamics of stance-taking. Multi-activity, when present, increases the probability of encountering certain types of information sources, especially sources involving direct perception in situ.

    Guided by the aim of diversifying interaction types along these lines, the InfinIta team recorded five dinner and lunch table conversations, two interactions while preparing food, four tutoring sessions in architecture, six lessons (in theatre, music, restoration, general education and language teaching) and six interviews focused on the then topical issue of the Covid-19 pandemic. The number of participants ranged from two to nineteen. In those interactions that took place in an institutional context (mainly lessons and tutoring, but also the interviews conducted by InfinIta team members), participant roles were clearly defined and distributed asymmetrically. In the non-institutional interactions of TIGR, which were recorded at people's homes, the participants' roles tended to be symmetrical, but some asymmetries emerged, for example when different generations were co-present. Multi-activity, finally, characterised not only dinner and lunch table conversations and collaborative cooking, but also several interactions in educational settings that were organised as practical workshops. Those workshops replaced a different interaction type, i.e. student group work, which had been part of the original design of TIGR, but was then excluded because of difficulties encountered during the pandemic.

    All events were recorded in Southern Switzerland, more precisely in Ticino (nineteen events), the Italian Grisons (three events) and a multilingual institution in the Grisons (one event). Eleven municipalities are represented, ensuring a certain degree of geographical variety within Southern Switzerland. Speaker-related parameters (age, gender, information about the person's origin, residence, education, profession, and languages) were varied as far as possible, within the limits of a recruitment procedure that partly depended on the social networks of the InfinIta team members (see the section First contact with participants and informed consent).

  • A particular moment in history: the Covid-19 pandemic

    The recordings were planned from late 2020 onwards and carried out between May 2021 and May 2022. The pandemic influenced research in many ways. Most importantly, habits had changed. This constrained fieldwork, for it was difficult to contact potential study participants directly and, during recordings, Covid protection measures needed to be taken. Moreover, interaction genres such as face-to-face student group work or lunch breaks, which were originally part of the corpus design, were not actually practised at the time, and the team therefore had to exclude them. Finally, several social practices that are typical of the pandemic appear in the video recordings, which show face masks here and there, social distancing, participation in lessons via video conference, and interviews outdoors in more or less deserted public spaces. On a different level, various aspects of the pandemic are mentioned in the TIGR conversations, both in the natural settings (where participants refer to vaccination in particular) and in the interviews, where the pandemic is a central topic.

  • Participants

    All study participants filled in a short sociolinguistic questionnaire before the start of the recordings. They answered questions about their sex and age, their place of upbringing, education and language repertoire. These data were then aggregated, creating age ranges of 10 years and grouping geographical locations into regions: the Canton for places in Switzerland and the Region for places in Italy. Places in other countries were registered retaining only the information about the country.
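
    The aggregation rule described above can be illustrated with a minimal Python sketch. It is not the procedure the team actually used; the function names, the argument names and the decade boundaries (e.g. 20-29) are assumptions made for illustration.

```python
from typing import Optional


def age_range(age: int, width: int = 10) -> str:
    """Return the age range that contains `age`, e.g. 23 -> '20-29'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def aggregate_place(country: str,
                    canton: Optional[str] = None,
                    region: Optional[str] = None) -> str:
    """Coarsen a place of upbringing (hypothetical field names).

    Swiss places are reduced to the canton, Italian places to the region,
    and places elsewhere to the country only.
    """
    if country == "Switzerland" and canton:
        return canton
    if country == "Italy" and region:
        return region
    return country
```

    Under these assumptions, a participant aged 23 who grew up in a Lombard town would be recorded as 20-29 and Lombardy, so that neither the exact age nor the municipality is retained.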

    61 women, 53 men and one school-age child participated in the study. While all age ranges from 10 to 80 years are represented, the speakers who are between 20 and 29 years old are by far the most numerous group (73 out of 115). 51 participants went to primary school in Switzerland, four fifths of them in Ticino or the Grisons; 58 participants went to primary school in Italy, mostly in Lombardy (27 speakers), but also in other regions: Piedmont, Campania, Tuscany, Emilia-Romagna, Trentino - South Tyrol, Sicily, Marche, Liguria, Calabria, Veneto (ordered by the number of TIGR speakers coming from the region). 6 speakers went to primary school in some other country. As to education, the largest group consists of graduates from a secondary school providing general education, followed by graduates from a school providing vocational education and training, bachelor's graduates, master's graduates, participants holding a PhD degree and a small group of participants whose highest degree is a secondary I school diploma.

    All TIGR participants work or study in Southern Switzerland and use the Italian language on a daily basis. Among those who were not socialised in Southern Switzerland or in Italy, competence in Italian is variable. The TIGR participants are generally multilingual: 45 speakers have competences in four or more languages, 57 reported a repertoire of 2-3 languages, whereas only 13 speakers consider themselves monolingual in Italian. The languages mentioned most often, besides Italian, are English and the other major Swiss national languages.

  • Audio and video recordings

    Each event of TIGR was recorded from two different angles with Sony HXR-NX80//C camcorders. The sound was recorded on two to six tracks, depending on the number of participants per event. The team used two to four Tentacle Track E pocket-sized audio recorders with clip-on microphones and one external Sony ECM-VG1 microphone mounted on one of the camcorders. When documenting classroom interaction, an additional microphone (Neumann TLM 127 ni-K) was placed in the center of the room and connected to the other camcorder.

    All devices were synchronised before starting the recordings in order to obtain a maximally precise correspondence between sound and image. This was achieved by means of Tentacle timecode generators. Such generators are a built-in part of Tentacle Track E recorders, which directly register timecode as metadata. To synchronise camcorders, on the other hand, external Tentacle Sync timecode generators were connected to the camcorders' microphone inputs. These generate acoustic timecode that is registered during the recordings in one of the camcorder's audio channels. A crucial component of the Tentacle system is a mobile application that communicates with all devices through Bluetooth, making it possible to synchronise them remotely and to start and interrupt recordings.

    In post-production, as a first step, video files were processed by means of the Tentacle Timecode Tool for Windows. The software reads the acoustic signal that encodes temporal information, converts it to metadata timecode and removes the acoustic signal, maintaining only metadata timecode. Subsequently, all video and audio files were imported into an Adobe Premiere project, where they were aligned based on their metadata timecode and cut to equal length. De-identification measures were applied (see the relevant section below). For each event, to complete the set of audio recordings, two videos without sound were extracted and one split-screen video with mixed sound was produced.
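
    The logic of aligning files on their timecode and cutting them to equal length can be sketched in a few lines of Python. This is only an illustration of the arithmetic involved, not the Adobe Premiere workflow used for TIGR; the `Recording` class, the `HH:MM:SS:FF` timecode layout and the frame rate of 25 fps are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Recording:
    name: str          # hypothetical file label
    start_tc: str      # start timecode, assumed layout "HH:MM:SS:FF"
    duration_s: float  # duration in seconds


def timecode_to_seconds(tc: str, fps: int = 25) -> float:
    """Convert an 'HH:MM:SS:FF' timecode to seconds (fps is an assumption)."""
    hours, minutes, seconds, frames = (int(part) for part in tc.split(":"))
    return hours * 3600 + minutes * 60 + seconds + frames / fps


def common_window(recordings: List[Recording],
                  fps: int = 25) -> Dict[str, Tuple[float, float]]:
    """For each file, return (seconds to trim at its start, common length).

    The common window runs from the latest start to the earliest end,
    so every trimmed file covers exactly the same stretch of time.
    """
    starts = {r.name: timecode_to_seconds(r.start_tc, fps) for r in recordings}
    window_start = max(starts.values())
    window_end = min(starts[r.name] + r.duration_s for r in recordings)
    if window_end <= window_start:
        raise ValueError("recordings do not overlap in time")
    length = window_end - window_start
    return {name: (window_start - start, length) for name, start in starts.items()}


files = [
    Recording("camera_1", "10:00:00:00", 600.0),
    Recording("recorder_1", "10:00:05:00", 600.0),
]
trims = common_window(files)
```

    In this invented example the camera started five seconds before the audio recorder, so five seconds are trimmed from the start of the camera file and both files end up 595 seconds long.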

    The technical set-up of each event and further issues related to data collection were described in a field note form, sometimes supplemented with photographs. The team listed the event's date and place, the devices used, any technical problems encountered, all anonymous participant identifiers, the participants' wishes for de-identification, where applicable, and any other peculiarity of the situation that was judged potentially relevant to interpret the data. One important function of the form was to associate identifiers with a description of the participants' physical appearance and with the names of clip-on microphones. The form was used to look up information at various stages, from multimedia post-production to transcription, from local archiving to the compilation of metadata in view of data sharing.

  • Transcription in ELAN

    The corpus was transcribed by means of the multimedia annotator ELAN, v. 6.7. One tier per speaker was used; in some classroom settings, a tier for the class was added to be able to attribute certain phenomena (such as whispering, laughter, cheering, musical practice) to a collective agent. A further tier was dedicated to “ambient noises”. In the transcript, actions by the class and “noises” were included in double round brackets and either quantified in seconds or, when simultaneous with transcribed text, anchored to the latter by means of special characters, adopting conventions similar to those proposed by Mondada (2018) for multimodal transcription. Besides the tiers for transcription and for the description of noises, one tier was added to annotate segments containing names and dates to be silenced later.

    A subset of the GAT 2 conventions for fine transcription (Selting et al. 2011) was adopted, to which one sign was added: the tilde (~) for word interruptions.

    Segmentation in ELAN aimed at maintaining transcription efficiency while paying attention to the precision of temporal information. In the case of overlapping discourse, when more than two speakers were involved, the transcribers were instructed to segment at overlap boundaries, so as to obtain a maximum of temporal information and a neat vertical alignment of the transcribed discourse in a tier-based visualization. This was an advantage when subsequently deriving transcripts in playscript style, a process that required extensive manual revision and during which the original transcripts were often read in parallel in ELAN. On the other hand, this practice produced many segment boundaries within words, which represented a challenge for automatic tokenization at a later stage.
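
    In TIGR this segmentation was done by hand by the transcribers. Purely to illustrate what "segmenting at overlap boundaries" yields, the following Python sketch splits each speaker's segment wherever another speaker's segment starts or ends inside it. The tuple layout (speaker, start, end) and the function name are assumptions.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (speaker, start in s, end in s)


def split_at_overlap_boundaries(segments: List[Segment]) -> List[Segment]:
    """Split every segment at the boundaries of other speakers' segments."""
    result: List[Segment] = []
    for speaker, start, end in segments:
        cuts = {start, end}
        for other, other_start, other_end in segments:
            if other == speaker:
                continue
            # Keep only boundaries that fall strictly inside this segment.
            for boundary in (other_start, other_end):
                if start < boundary < end:
                    cuts.add(boundary)
        ordered = sorted(cuts)
        result.extend((speaker, a, b) for a, b in zip(ordered, ordered[1:]))
    return result


example = [("A", 0.0, 5.0), ("B", 3.0, 7.0)]
pieces = split_at_overlap_boundaries(example)
```

    Here speaker A's segment is cut at 3.0 s and speaker B's at 5.0 s, so the overlapping stretch (3.0 to 5.0 s) has identical boundaries on both tiers. The sketch also shows why a cut may land inside a word: it is placed where the other speaker starts or stops, regardless of word boundaries.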

    References: 

    • Brugman, H., & Russel, A. (2004). Annotating Multimedia/Multi-modal resources with ELAN. In: Proceedings of LREC 2004, Fourth International Conference on Language Resources and Evaluation.
    • Mondada, L. (2018). Multiple Temporalities of Language and Body in Interaction: Challenges for Transcribing Multimodality. Research on Language and Social Interaction, 51(1), 85-106. https://www.lorenzamondada.net/multimodal-transcription
    • Selting, M., Auer, P., Barth-Weingarten, D., Bergmann, J., Bergmann, P., Birkner, K., Couper-Kuhlen, E., Deppermann, A., Gilles, P., Günthner, S., Hartung, M., Kern, F., Mertzlufft, C., Meyer, C., Morek, M., Oberzaucher, F., Peters, J., Quasthoff, U., Schütte, W., & Uhmann, S. (2011). A system for transcribing talk-in-interaction: GAT 2 translated and adapted for English by Elizabeth Couper-Kuhlen and Dagmar Barth-Weingarten. Gesprächsforschung, 12, 1-51. http://www.gespraechsforschung-online.de/heft2011/heft2011.html

  • De-identification

    Video files were de-identified in Adobe Premiere by means of Gaussian blur effects, masking participants who expressed that wish in their declarations of consent and, in a few events, people passing by. Audio tracks were de-identified in several ways. In a few cases, voices were distorted according to the wishes expressed by participants, either by applying audio effects to the entire recording (one event) or, when only a few non-overlapping turns were concerned, by selectively distorting the corresponding segments (two events). A few conversations with people passing by were muted. Segments containing names of persons, institutions and places, as well as dates that could lead to the identification of participants, were replaced by silence through a procedure prepared in ELAN and performed in Praat. In ELAN, the problematic stretches of talk were annotated as names in a dedicated tier. The tier content was then exported as a CSV table, and a script authored by Francesco Cangemi, whom we thank, was used to read the segments' start and end times from the table and instruct Praat to process all audio tracks in the corresponding time intervals, removing the original sound and inserting stretches of silence. As to the transcripts, personal information was pseudonymised. Participant names were replaced by pseudonyms of similar length. The remaining personal information was replaced by the labels personname / institutionname / placename / datename + an index allowing for co-reference between multiple mentions of the same entity within one transcript.
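
    The Praat script used for TIGR is not reproduced here. As a stand-in, the following Python sketch shows the same idea under simplifying assumptions: start and end times are read from a CSV table and the samples in those intervals are set to zero. The column names `start_s` and `end_s`, the function names and the toy signal are all hypothetical.

```python
import csv
import io
from typing import List, Sequence, Tuple


def read_intervals(csv_text: str,
                   start_col: str = "start_s",
                   end_col: str = "end_s") -> List[Tuple[float, float]]:
    """Read (start, end) times in seconds from CSV text (assumed columns)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(row[start_col]), float(row[end_col])) for row in reader]


def silence_intervals(samples: Sequence[int],
                      sample_rate: int,
                      intervals: List[Tuple[float, float]]) -> List[int]:
    """Return a copy of `samples` with every listed interval set to zero."""
    out = list(samples)
    for start_s, end_s in intervals:
        first = max(0, int(round(start_s * sample_rate)))
        last = min(len(out), int(round(end_s * sample_rate)))
        for i in range(first, last):
            out[i] = 0
    return out


# Toy example: one second of "audio" at 10 samples per second.
table = "start_s,end_s\n0.2,0.5\n"
signal = [1] * 10
cleaned = silence_intervals(signal, 10, read_intervals(table))
```

    Applying the same interval table to every audio track of an event, as described above, keeps all tracks equally long and aligned, because samples are overwritten rather than cut out.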

  • Metadata

    The SWISSUbase repository hosts studies, which contain one or more datasets, which in turn contain one or more files. It offers forms to fill in metadata at all three levels. To describe datasets and files, the LaRS section of SWISSUbase offers fields and controlled vocabulary that are specifically designed for linguistic data. At dataset level, these fields concern, for example, the type of resource and general characteristics of the set of participants involved. At file level, it is possible to describe the languages documented; properties of texts and annotations; technical features, length, and content of audio, video, text and image files; and, finally, the tools used to process data (e.g. a transcript editor or annotation software).

    Additionally, in order to accurately document the TIGR corpus and facilitate local data management, metadata have been provided in the form of separate CSV tables. The metadata categories have been defined following CLARIN recommendations (CMDI Best Practices Guide, 2017) and most categories refer to concepts listed in the CLARIN concept registry.

    Event datasets contain the following tables: (1) event-related properties; (2) a list of the participants involved; (3) basic properties of the corpus (identical for all events); (4) video metadata; (5) audio metadata (full version only); (6) transcript metadata. The dataset General TIGR documentation contains several overview tables: an overview of all TIGR events, geographical information about the recording locations, and a description of the TIGR speakers based on questionnaire data. The location and speaker tables assign IDs to each listed entity, which are referenced in the event-related metadata tables (1) and (2).  
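
    Because the event tables reference speakers by ID, the two kinds of tables can be combined with a simple lookup. The following Python sketch illustrates this; the column names (`speaker_id`, `age_range`, `event_id`, `role`) and the sample rows are invented for the example and do not reproduce the actual TIGR table headers.

```python
import csv
import io
from typing import Dict, List


def read_table(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def resolve_participants(event_participants: List[Dict[str, str]],
                         speakers: List[Dict[str, str]],
                         id_col: str = "speaker_id") -> List[Dict[str, str]]:
    """Attach the speaker description to each participant row of an event."""
    by_id = {row[id_col]: row for row in speakers}
    resolved = []
    for participant in event_participants:
        speaker_id = participant[id_col]
        if speaker_id not in by_id:
            raise KeyError(f"unknown speaker ID: {speaker_id}")
        # Event-specific columns take precedence over the overview table.
        resolved.append({**by_id[speaker_id], **participant})
    return resolved


# Invented sample data standing in for an overview table and an event table.
speakers_csv = "speaker_id,age_range\nSPK001,20-29\nSPK002,50-59\n"
participants_csv = "event_id,speaker_id,role\nEV01,SPK002,teacher\n"
merged = resolve_participants(read_table(participants_csv),
                              read_table(speakers_csv))
```

    Raising an error for an unknown ID, rather than silently skipping the row, makes inconsistencies between the event tables and the overview tables visible early.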

    The metadata contained in the CSV tables have been recategorized according to TEI conventions and inserted into suitable slots of the header of TEI/ISO 24624:2016 transcripts.
