Corpus

  • Short presentation

    The TIGR corpus of spoken Italian was gathered in the Swiss cantons of Ticino and Grisons in 2021-2022. It documents 23.5 hours of face-to-face interaction in various non-experimental settings: table conversations, food preparation, lessons and tutoring encounters, and interviews. The interactions were recorded with two camcorders and pocket audio recorders equipped with clip-on microphones, and transcribed using the ELAN editor. This work was carried out within the InfinIta project (SNSF grant no. 192771). Besides the InfinIta team (Johanna Miecznikowski, Elena Battaglia and Christian Geddo), further collaborators performed specific tasks: Chiara Sbordoni (fieldwork and transcription), Benedetta Scotto di Santolo (transcription) and Costanza Lucchini (transcription). A new phase of data processing started in 2023, aiming to deposit the TIGR corpus in the LaRS @ SWISSUbase repository and thus make it available to the scientific community. The related tasks are currently being performed by the InfinIta team and by Nina Profazi within the ShareTIGR project (USI ORD grant).

  • Corpus design

    The design of the TIGR corpus responds to the research goals of the InfinIta project, which examines the categorisation of information sources in talk. The corpus was built by varying several event-related parameters, i.e. the more or less institutional character of the encounter, the number of participants, their social roles, and the presence or absence of multi-activity. The variation of the first three parameters gives rise to diverse epistemic configurations and dynamics of stance-taking. Multi-activity, when present, increases the probability of encountering certain types of information sources, in particular sources involving direct perception in situ.

    Guided by the aim of diversifying interaction types along these lines, the InfinIta team recorded five table conversations, two interactions while preparing food, four tutoring sessions in architecture, six lessons - in theatre, music, restoration, general education and language teaching - and six interviews focused on the then topical issue of the Covid-19 pandemic. The number of participants ranged from two to nineteen. In interactions taking place in an institutional context (mainly lessons and tutoring, but also the interviews conducted by InfinIta team members), participant roles were clearly defined and distributed asymmetrically. In the non-institutional interactions of TIGR, which were recorded at people's homes, the participants' roles tended to be symmetrical, although some asymmetries emerged, for example when different generations were co-present. Multi-activity, finally, characterised not only table conversations and collaborative cooking, but also several interactions in educational settings that were organised as practical workshops. Note that these workshops replaced a different interaction type, student group work, which had been part of the original design of TIGR but was excluded because of difficulties encountered during the pandemic.

    All events were recorded in Southern Switzerland, more precisely in Ticino (nineteen events), the Italian Grisons (three events) and a multilingual institution in the Grisons (one event). Eleven municipalities are represented, ensuring a certain degree of geographical variety within Southern Switzerland. Speaker-related parameters (age, gender, information about the person's origin, residence, education, profession, and languages) were varied as far as possible, within the limits of a recruitment procedure that partly depended on the social networks of the InfinIta team members (see the section First contact with participants and informed consent).

    Lab blog post:

    Document:

  • Audio and video recordings

    Each event of TIGR was recorded from two different angles with Sony HXR-NX80//C camcorders. The sound was recorded on two to six tracks, depending on the number of participants per event. The team used two to four Tentacle Track E pocket-sized audio recorders with clip-on microphones and one external Sony ECM-VG1 microphone mounted on one of the camcorders. When documenting classroom interaction, an additional microphone (Neumann TLM 127 ni-K) was placed in the centre of the room and connected to the other camcorder.

    All devices were synchronised before the recordings started in order to obtain a maximally precise correspondence between sound and image. This was achieved by means of Tentacle timecode generators. Such generators are built into the Tentacle Track E recorders, which register timecode directly as metadata. To synchronise the camcorders, on the other hand, external Tentacle Sync timecode generators were connected to the camcorders' microphone inputs. These generate an acoustic timecode signal that is registered in one of the camcorder's audio channels during recording. A crucial component of the Tentacle system is a mobile application that communicates with all devices via Bluetooth, allowing them to be synchronised remotely and allowing recordings to be started and stopped.

    In post-production, the video files were first processed with the Tentacle Timecode Tool for Windows. This software reads the acoustic signal that encodes temporal information, converts it to metadata timecode and then removes it, so that only the metadata timecode remains. Subsequently, all video and audio files were imported into an Adobe Premiere project, where they were aligned on the basis of their metadata timecode and cut to equal length.
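    The underlying alignment logic can be illustrated with a small sketch (clip names and timecode values are hypothetical; in the project, the actual cutting was done in Adobe Premiere):

    ```python
    from dataclasses import dataclass

    @dataclass
    class Clip:
        name: str
        start_tc: float   # start timecode in seconds since midnight
        duration: float   # clip length in seconds

    def common_window(clips):
        """Return the (start, end) time window covered by every clip."""
        start = max(c.start_tc for c in clips)
        end = min(c.start_tc + c.duration for c in clips)
        if end <= start:
            raise ValueError("clips do not overlap")
        return start, end

    def trim_offsets(clips):
        """For each clip, the seconds to cut at its head and tail so that
        all clips cover exactly the common window (i.e. equal length)."""
        start, end = common_window(clips)
        return {
            c.name: (start - c.start_tc, (c.start_tc + c.duration) - end)
            for c in clips
        }
    ```

    For example, a camcorder started 2.5 s after the earliest device would have 2.5 s trimmed from the head of the other clips, so that all files begin and end at the same timecode.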

    The technical set-up of each event and further issues related to data collection were described in a field note form, which in some cases contained photographs. The team registered the event's date and place, listed the devices used and all anonymous participant identifiers, reported the participants' wishes for de-identification, where applicable, and noted any other peculiarity of the situation that was judged potentially relevant to interpreting the data. One important function of the form was to associate identifiers with a description of the participants' physical appearance and with the names of the clip-on microphones. After the files had been processed in TTT and Adobe Premiere, any technical problems encountered in that phase were registered in the field note form as well.
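    A structured version of such a field note record might look as follows (the field names are illustrative, not the project's actual form template):

    ```python
    from dataclasses import dataclass, field

    # Hypothetical structured rendering of the field note form described above.
    @dataclass
    class FieldNote:
        event_date: str
        place: str
        devices: list                 # e.g. camcorders, recorders, microphones
        participant_ids: list         # anonymous participant identifiers
        mic_assignment: dict          # participant identifier -> clip-on microphone name
        deidentification_wishes: dict = field(default_factory=dict)
        appearance_notes: dict = field(default_factory=dict)  # identifier -> description
        remarks: list = field(default_factory=list)  # incl. post-production problems
    ```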

    Lab blog post: 

  • De-identification

    Before the video files are made available through the LaRS repository, they will be de-identified in Adobe Premiere by applying video effects (e.g. Gaussian blur, Find edges), according to the wishes expressed by the participants in their declarations of consent. The audio tracks will be de-identified by distorting voices, again following the participants' wishes, and by replacing certain names and temporal information with noise, especially the names of persons, institutions and places, as well as dates that could lead to the identification of participants. These replacements have been prepared in ELAN by annotating the problematic stretches of talk as name in a dedicated tier. A script will read the segments' start and end times and instruct the Praat application to process all audio tracks in the corresponding time intervals, deleting the original sound and inserting stretches of noise. In the transcribed text, personal information was pseudonymised: participant names were replaced by pseudonyms of similar length, and the remaining personal information was replaced by the labels personname / institutionname / placename / datename plus an index allowing for co-reference between multiple mentions of the same entity within one transcript.
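    The audio-masking step can be sketched in Python using only the standard library (a minimal sketch assuming a tier named name and 16-bit mono WAV files; in the project this processing is carried out by Praat, driven by a script):

    ```python
    import random
    import struct
    import wave
    import xml.etree.ElementTree as ET

    def name_tier_intervals(eaf_path, tier_id="name"):
        """Collect (start_ms, end_ms) pairs from one tier of an ELAN .eaf file."""
        root = ET.parse(eaf_path).getroot()
        slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
                 for ts in root.iter("TIME_SLOT")}
        intervals = []
        for tier in root.iter("TIER"):
            if tier.get("TIER_ID") != tier_id:
                continue
            for ann in tier.iter("ALIGNABLE_ANNOTATION"):
                intervals.append((slots[ann.get("TIME_SLOT_REF1")],
                                  slots[ann.get("TIME_SLOT_REF2")]))
        return intervals

    def mask_with_noise(in_wav, out_wav, intervals_ms, seed=0):
        """Overwrite the given millisecond intervals of a 16-bit mono WAV with noise."""
        rng = random.Random(seed)
        with wave.open(in_wav, "rb") as w:
            params = w.getparams()
            frames = bytearray(w.readframes(w.getnframes()))
        rate = params.framerate
        for start_ms, end_ms in intervals_ms:
            a = int(start_ms * rate / 1000) * 2   # 2 bytes per 16-bit sample
            b = int(end_ms * rate / 1000) * 2
            for i in range(a, min(b, len(frames)), 2):
                frames[i:i + 2] = struct.pack("<h", rng.randint(-8000, 8000))
        with wave.open(out_wav, "wb") as w:
            w.setparams(params)
            w.writeframes(bytes(frames))
    ```

    Each annotated stretch is thus replaced sample by sample, so that the original speech in those intervals is unrecoverable while the rest of the track is untouched.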

  • Metadata

    The SWISSUbase repository organises data as studies, which contain one or more datasets, which in turn contain one or more files. It offers forms for filling in metadata at all three levels. To describe datasets and files, the LaRS section of SWISSUbase offers metadata fields and controlled vocabularies that are specifically designed for linguistic data. At dataset level, these fields cover, for example, the type of resource and general characteristics of the set of participants involved. At file level, it is possible to describe the languages documented; properties of texts and annotations; the technical features, length and content of audio, video, text and image files; and, finally, the tools used to process the data (e.g. a transcript editor or annotation software).

    To document the TIGR corpus accurately, further metadata will be added in separate documentation files at dataset level. This is particularly important for datasets that represent a recorded event. In that kind of dataset, the event is the entity to which all included files relate and from which they inherit properties such as the region in which the event was recorded, the interaction genre, the set of participants involved or the technical set-up of the recordings. A metadata file makes it possible to model the event entity and its relations to the files contained in the dataset. Moreover, such a file can be used to list the individual participants in an event and to link this information to the sociolinguistic questionnaire data collected during the recordings.
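    An event-level metadata record of this kind might be sketched as follows (all identifiers, field names and file names are illustrative, not the project's actual schema):

    ```python
    import json

    # Hypothetical event-level metadata record: the event entity, the files
    # that inherit its properties, and links to questionnaire data.
    event = {
        "event_id": "tigr-example-01",
        "region": "Ticino",
        "interaction_genre": "table conversation",
        "participants": [
            {"id": "P01", "questionnaire": "questionnaires/P01.json"},
            {"id": "P02", "questionnaire": "questionnaires/P02.json"},
        ],
        "files": [
            {"name": "cam_a.mp4", "type": "video"},
            {"name": "track_p01.wav", "type": "audio", "participant": "P01"},
            {"name": "transcript.eaf", "type": "annotation"},
        ],
    }

    print(json.dumps(event, indent=2))
    ```

    A machine-readable file like this lets every file in the dataset be traced back to the event and to the individual participants it documents.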

    Lab blog posts:

    Reference: