"As open as possible, as restricted as necessary"
ShareTIGR
28/03/2024
Finding a suitable repository for the TIGR corpus has been one of the core tasks in our ongoing corpus sharing project. The InfinIta team had started to explore the existing research infrastructure some time ago and discussed this issue with colleagues in the context of the CHORD-talk-in-interaction project, which had started in April 2023 and generated many valuable insights. In ShareTIGR we built on that input to finally come to a conclusion: we decided to store TIGR in the Language Repository of Switzerland (LaRS). In this post, I would like to illustrate the reasons for our choice by highlighting some of LaRS’ features and by comparing them with those of Zenodo, a generalist repository that the project leader had taken into consideration five years ago when writing the InfinIta project proposal.
What should a suitable repository look like in our case? It should allow for sufficiently large data deposits (TIGR takes up 4 TB in uncompressed form), enable the upload of diverse data types (text, audio, video), allow rich (linguistic) metadata and corpus documentation, preferably store the data in Switzerland and hence be compliant with Swiss regulations on data protection and copyright, and provide different layers of access control for data affected by data protection laws. Lastly and perhaps most importantly, the repository should comply with the FAIR principles for scientific data management and stewardship (Wilkinson et al. 2016). Considering these requirements, not many repositories are eligible. To get an idea of existing repositories used in academic contexts, the first port of call could be the website of the Swiss National Science Foundation (SNSF), where repositories are listed that fulfil the SNSF’s Open Research Data criteria. When you search for discipline-specific, linguistic repositories, you will find only one entry: SWISSUbase, or rather the Language Repository of Switzerland (LaRS) at SWISSUbase.
As its name reveals, LaRS is a domain-specific national Swiss platform for the publication of linguistic research data. It is a new platform that went online in September 2022. It uses SWISSUbase as repository system, which is why some of the features I will describe in this post are not LaRS-specific but part of the SWISSUbase infrastructure. SWISSUbase is a multidisciplinary platform that offers free data deposition and access and was built with the aim to “answer the call for a flexible, national, multi-disciplinary data repository and archiving solution” (Buerli, 2023). Its goal is to facilitate data sharing and data preservation for future reuse by providing services and support to the Swiss academic community. It is powered by a consortium consisting of FORS (the Swiss Centre of Expertise in the Social Sciences) and the universities of Lausanne, Neuchâtel, and Zurich. These institutions are responsible for the so-called Data Service Units, which offer discipline-specific or cross-disciplinary support and services. The Data Service Unit that targets the Swiss Linguistics community and is in charge of LaRS is located at the University of Zurich, while FORS is at the service of the Social Sciences community and the universities of Lausanne and Neuchâtel cross-disciplinarily support their own researchers. To describe the services that a linguist can expect when storing their research data in LaRS, I will refer to the repository as LaRS for the sake of brevity and not distinguish between the support provided specifically by the Zurich Data Service Unit and by SWISSUbase in general.
LaRS accepts data of different kinds, meaning texts, audio, video, images, and programming scripts. It is more restrictive, however, when it comes to the formats of these files. Since the repository strives to be compliant with the FAIR principles, it wants its data to be interoperable and reusable, meaning the data formats used should be non-proprietary and suitable for long-term archiving. The data should also be complete and facilitate data reuse through sufficient (metadata) documentation, e.g., by means of codebooks, readme files, or technical instructions. LaRS furthermore requires the data to conform to current Swiss data protection laws, copyright laws, and ethics regulations.
As mentioned earlier, LaRS is tailored to the academic community of Switzerland. On its About us page, SWISSUbase declares that its goal is “to serve the Swiss scientific community”. Accordingly, there are some restrictions with regard to who can deposit and access the data on the platform. Data storage is only possible for members of a Swiss institution of higher education, and the deposited data must support a scientific publication and/or be of high reuse value for third parties. To ensure high data quality, LaRS provides data curation and quality assurance checks. Data access, on the other hand, is possible for everyone in and outside of Switzerland who has a SWITCH edu-ID account.
If we now take a closer look at Zenodo, a generalist repository that also fulfils the SNSF’s criteria for Open Research Data, no such restrictions can be found. Zenodo advertises itself as an “effective catch-all repository, that eliminates barriers to adopting data sharing practices” (Zenodo, n.d.). To do so, it does not impose any requirements on format, size, access conditions or licence. It also allows anyone (after prior registration) to deposit data independently of their field of study or the status of the data in the data lifecycle. The only requirement is that the depositor possess the appropriate rights and that the content does “not violate privacy or copyright, or breach confidentiality or non-disclosure agreements for data collected from human subjects” (Zenodo, n.d.).
From an open research data perspective, what Zenodo promotes as “open in every sense” (OpenAIRE, n.d.) in fact has some shortcomings. First, there is no quality control. Accepting all data does help preserve them, but also harbours the risk of the repository becoming cluttered and the stored data becoming of little value for reuse. Second, allowing any type of data format, including proprietary ones, collides with the principles of interoperability and reusability. Third, although unrestricted openness honours the principle of accessibility, it may conflict with data protection or ethics regulations. Taking the example of spoken language data, it is virtually impossible to remove all personal information while maintaining the data's research value. In our field, researchers are dependent on declarations of consent given by study participants and access restrictions implemented by the repository or platform that hosts the data (Miecznikowski & Profazi 2023). A FAIR repository suitable for spoken language data should therefore ideally provide a graded access system, so that anonymised data can be shared openly, while data that contain some personal information can be shared with selected user groups only.
Zenodo and LaRS @ SWISSUbase both offer possibilities to restrict data access. While Zenodo allows depositors to upload restricted (or embargoed) data and to specify the conditions under which they may be accessed, LaRS uses role-based access control. The default scenario is that users can authenticate themselves as trusted users affiliated to a research institution by using their SWITCH edu-ID, an identifier that is widely used in Swiss educational institutions. People without a SWITCH edu-ID may also access data if the data depositor has explicitly allowed for this on a special permission basis. The user is then required to send a written request that specifies and justifies the intended usage and needs to sign a download contract in which they indicate the duration of use (either 3, 6, 12, or 24 months). When the contract runs out, the user receives a notification and needs to confirm the deletion of all downloaded data. Because of these possibilities of managing and restricting access, LaRS advertises its platform with the slogan “As open as possible, as restricted as necessary”. Two more features both platforms share are the possibility to assign a Creative Commons usage licence to the uploaded records and to provide them with Digital Object Identifiers (DOI).
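For readers who prefer to see the access logic spelled out, the graded access scenario described above can be sketched in a few lines of Python. This is only a simplified reading of the rules as summarised in this post, not LaRS' or SWISSUbase's actual implementation: the names `AccessRequest`, `decide_access`, and `contract_end` are hypothetical, and the assumption that a contract ends on the same day of the month it started is ours.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

# Contract durations mentioned in the post (in months).
ALLOWED_CONTRACT_MONTHS = (3, 6, 12, 24)


@dataclass
class AccessRequest:
    """Hypothetical model of a data access request (illustration only)."""
    has_edu_id: bool
    depositor_allows_special_permission: bool = False
    written_justification: str = ""
    contract_months: Optional[int] = None


def decide_access(request: AccessRequest) -> Tuple[bool, str]:
    """Return (granted, reason) following the rules summarised in the post."""
    # Default scenario: trusted users authenticate with a SWITCH edu-ID.
    if request.has_edu_id:
        return True, "authenticated via SWITCH edu-ID"
    # Without an edu-ID, the depositor must have allowed special permission.
    if not request.depositor_allows_special_permission:
        return False, "depositor has not allowed special permission"
    # A written request must specify and justify the intended usage.
    if not request.written_justification.strip():
        return False, "written request missing"
    # The download contract must state one of the allowed durations.
    if request.contract_months not in ALLOWED_CONTRACT_MONTHS:
        return False, "contract duration must be 3, 6, 12, or 24 months"
    return True, f"download contract for {request.contract_months} months"


def contract_end(start: date, months: int) -> date:
    """Date on which the contract runs out (assumed: same day of month,
    clamped to the last day of shorter months)."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


if __name__ == "__main__":
    request = AccessRequest(
        has_edu_id=False,
        depositor_allows_special_permission=True,
        written_justification="Replication study on turn-taking",
        contract_months=6,
    )
    print(decide_access(request))
    print(contract_end(date(2024, 3, 28), 6))
```

In this sketch, the end date is the point at which the user would be notified and asked to confirm the deletion of all downloaded data.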
While some of the above-listed features can be found on both platforms, LaRS and Zenodo, and possibly on others, in Switzerland only LaRS offers tailored, discipline-specific services. Alongside the repository's attention to access control, this is the main reason why we opted for this infrastructure. First of all, LaRS has designed a specific metadata schema that allows for long-term preservation, access, and reuse of linguistic data. LaRS stresses the importance of metadata – not only as a resource to adequately interpret the stored data, but also as a means to ensure they can be discovered by other researchers. The LaRS metadata schema allows linguists to describe their data extensively and, in case categories are missing, the schema can be expanded to accommodate the needs of the researcher. A second interesting service provided by LaRS is its support, which covers all stages of a research project, be it project design and data collection (e.g., help with consent forms), data curation and preservation (e.g., file migration to preservation formats), access management and reuse, or data curation and quality assurance checks. Lastly, LaRS offers a folder structure that helps organise the data and constrains the future users' download options. We think that it will be possible to make the quite complex TIGR datasets fit LaRS’ folder hierarchy. Our next blog posts will describe the composition of the TIGR corpus and how we intend to group our data in subsets in LaRS, thereby allowing scholars and students to download those parts that are most relevant to them and technically most compatible with their research workflows.
Nina Profazi
References
Buerli, S. (2023). SWISSUbase. Welcome to SWISSUbase [PowerPoint slides]. https://resources.swissubase.ch/wp-content/uploads/2023/02/LaRS-SWISSUbase-webinar-FINAL-14-Feb-2023.pdf
Miecznikowski, J., & Profazi, N. (2023). Social interaction is among people. Legal, technical, and ethical explorations about personal information and its removal in talk-in-interaction as data. https://www.chord-talk-in-interaction.usi.ch/news/feeds/36387
OpenAIRE. Zenodo – A universal repository for all research outcomes. https://www.openaire.eu/zenodo-guide (retrieved March 27, 2024)
SWISSUbase. About us. https://resources.swissubase.ch/about-us/ (retrieved March 27, 2024)
Swiss National Science Foundation. Which data repositories can be used? https://www.snf.ch/en/WtezJ6qxuTRnSYgF/topic/open-research-data-which-data-repositories-can-be-used (retrieved March 27, 2024)
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J. W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Waagmeester, A., Wittenburg, P., Wolstencroft, K., ... Velterop, J. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18
Zenodo. About Zenodo. https://about.zenodo.org/ (retrieved March 27, 2024)
Zenodo. General Policies. https://about.zenodo.org/policies/ (retrieved March 27, 2024)