Grouping the TIGR data for reuse
ShareTIGR
25/04/2024
Last week, we looked into the general file and metadata structure provided by the Language Repository of Switzerland (LaRS) at SWISSUbase. We will now outline how we intend to transfer the data of the TIGR corpus into this structure. We will report the reasoning that led us to the structure currently envisaged and indicate some open questions that still require further reflection.
The unit at the top of LaRS’ folder hierarchy is the study. Each study consists of one or more datasets, into which single files or file collections (zip folders) can be uploaded. Each hierarchical level has its own metadata. Given this design, various ways of structuring a multimedia corpus come to mind. TIGR basically consists of 23 speech events with two video tracks, two to six audio tracks and three transcript versions each (see our overview here). One possibility is to simply create a study containing all data in one dataset. We might also consider creating an individual study for each speech event, summing up to 23 studies with one dataset each. Still another option is to create one study with all multimedia files and another with all transcript versions. There are certainly more possible groupings, but we can anticipate that, as for now, we decided to create one TIGR study consisting of 49 datasets. How did we reach this conclusion?
One reason was the wish to underline the unity of the TIGR data, which were all collected within the same research project, in the same time period and geographical area, and adopting the same methodological procedures. On the basis of this criterion, we decided to create one single study, since the LaRS metadata categories on study level are best suited to show a connection of data across datasets.
The remaining decisions depend a lot on the expected usage scenarios.
A quite general consideration regards the quantity of data future users might want to download at once. In LaRS, the only downloadable unit is the dataset, which means that the dataset must group files that are likely to be used jointly. How many datasets are needed and what should they contain? On the one hand, it seems reasonable to limit the number of datasets in order to reduce the number of downloads a user needs to perform. One of the labels that LaRS offers to categorise datasets as resource types (a categorisation that is mandatory) is corpus, which in our case seems to suggest to even pack all data into one single dataset. On the other hand, users could wish to select certain data and not others, depending on their research interest and equipment; accordingly, data sets should not be too large and complex, in order not to force users to store locally more data than they actually intend to handle. To serve differentiated user needs we decided against the one-study-one-dataset design.
Rather, we plan to upload the TIGR events as individual datasets, expecting that some researchers might be interested in looking at subsets of events – for example, only classroom interaction or only data recorded in the Grisons – rather than at all events at once. The only eligible predefined resource type is corpus, anyway, which is somewhat unsettling because not quite congruous with the nature of a single recorded event. As an alternative, LaRS offers the possibility to avoid its predefined labels and declare the resource type at dataset level to be other.
The event datasets will be available in two versions, a 'light' and a 'full' version. This requires duplicating a significant number of files (the light version is contained entirely in the full version) but has the advantage to create datasets tailored to two different usage scenarios.
The 'light' version of each dataset will contain one compact multimedia file with a split screen view of the two video angles and one mixed audio track. It will also hold the two transcripts we intend to produce for each event, namely a text version with timecodes at intervals of 10 seconds, which is easily readable to the human eye, and a xml version that is interpretable by query programmes. For users to be able to interpret the uploaded data appropriately, we will provide additional materials documenting the specific circumstances of the individual speech event. The documentation will include the technical sheet (scheda tecnica) produced in preparation of the recording, i.e., field notes explaining the situational and technical details of the recording (e.g., interaction type, location, date, number of participants, recording devices, presence of the researcher). We will also provide the conventions according to which the written transcripts have been produced and explanations on the corpus and its methodological design, e.g., workflows, de-identification measures etc. It appears that the structure of LaRS requires us to multiply this kind of general corpus documentation at the dataset level. This 'light' dataset version is aimed at users with little available storage and/or software, who intend to view the data without processing them further and are happy to play back the media files in a simple video player.
The 'full' version of each speech event is planned to contain the same compact video as the light version, but will additionally feature the single multimedia files, comprising two video recordings and two to six separate audio tracks. Like the light version, it will hold documentation files and a text and xml transcript. It will be completed by an .eaf transcript version, which is the raw xml output of the annotation tool ELAN. The full dataset version is aimed at users who wish to further process the data to make them fit a specific research design or presentational requirements. Further processing could include retranscription or further annotation in ELAN (starting from the .eaf file) or in other applications. For this task, it can be beneficial to be able to listen selectively to audio tracks to take full advantage of the technical quality of the recordings made with clip-on microphones. Another possible scenario is the re-arrangement or remixing of video and audio files in multimedia editors, which might be useful to perform certain types of analyses or prepare example clips to be included in presentations or scientific publications.
Up to now, we mentioned 46 datasets, composed of two versions of each of the 23 TIGR events. To that we intend to add two more datasets for researchers who are mainly interested in querying or annotating the written transcripts. The first transcript dataset will contain the .txt transcripts of all 23 events, the second one will include the machine-readable xml transcripts. The last dataset, to reach the pre-announced number of 49, will be a file containing participant-related data based on the sociolinguistic questionnaires. As described here, these data were digitised and are currently available in the form of an excel table; we will need to decide how to further process the data before depositing it on the repository.
Further open questions regard the bundling of files within datasets. LaRS offers the option of uploading either single data files or file collections (zip folders) within a dataset. For the full versions of datasets, is it helpful to zip the multimedia files? Would it improve the file handling for the users, especially with regard to the metadata we intend to provide within LaRS’ metadata scheme? We will keep these questions in mind in ShareTIGR and use the months to come to plan the corpus structure on LaRS in more detail.
Nina Profazi & Johanna Miecznikowski