Publication of resources

If you are interested in depositing linguistic data with the HZSK Repository, please read the following information and contact us with any questions!

General requirements

The HZSK is a CLARIN centre that accepts corpora and other linguistic resources from research projects and other contexts in order to make them available, mainly to the academic community, for research and teaching purposes. The focus of the HZSK is on spoken, multilingual and multimodal corpora, and on (spoken) corpora in languages other than German, especially lesser-resourced or endangered languages. For corpora, we expect certain quality standards to be met regarding completeness and consistency of data and metadata. Apart from basic requirements regarding the legal situation and the general quality and documentation of the data, there are certain technical requirements, which are necessary for integrating the data into the HZSK Repository and the digital research infrastructures the repository is part of. We also accept thoroughly documented data sets with high-quality primary data on a case-by-case basis, to preserve these in their current state for future curation. A detailed assessment of the curation effort and reusability of existing data can be made using our guideline (in German):

Leitfaden zur Beurteilung von Aufbereitungsaufwand und Nachnutzbarkeit von Korpora gesprochener Sprache (PDF)

Guidelines and best practices

Please consider the following best-practice guidelines on linguistic research data from the German Research Foundation (DFG) (in German):

Empfehlungen zu datentechnischen Standards und Tools bei der Erhebung von Sprachkorpora (PDF | 290 KB)
Informationen zu rechtlichen Aspekten bei der Handhabung von Sprachkorpora (PDF | 173 KB)
Empfehlungen zur gesicherten Aufbewahrung und Bereitstellung digitaler Forschungsprimärdaten (PDF | 61 KB)

Additionally, the University of Hamburg provides information on good scientific practice (in German):

Satzung zur Sicherung Guter wissenschaftlicher Praxis und zur Vermeidung wissenschaftlichen Fehlverhaltens an der Universität Hamburg (PDF | 248 KB)

Access restrictions

As defined by CLARIN, data hosted at the HZSK can be distributed with different access restrictions corresponding to three different distribution types: non-restricted access for publicly available data (PUB), access for all academic users via single sign-on (ACA), or access restricted to individual accounts upon request (RES). The most suitable distribution type for a resource depends on the legal situation, the resource type and the distribution channel. Clarifying the legal situation is a prerequisite for data deposit and the responsibility of the data depositor.

For corpus data and datasets distributed via the HZSK Repository that should not be made publicly available, access restrictions can be implemented in one of two ways: as access for all academic users, who log in via single sign-on and acknowledge our general Terms of Use, or as access for individual users, who need to describe their usage scenarios and agree to corpus-specific Terms of Use, and who are granted access if they fulfil all corpus-specific requirements. For corpus data distributed via the ANNIS platform and for other web resources, access restrictions are only possible as single sign-on for academic users. The access restriction and the corpus-specific Terms of Use are part of the depositor agreement.
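The relationship between distribution channel and possible access restrictions can be summarised in a small sketch. The channel names ("repository", "annis", "web") and the function name are hypothetical labels chosen for illustration; only the ACA and RES distribution types and the rule itself come from the text above.

```python
from enum import Enum


class Restriction(Enum):
    ACA = "all academic users via single sign-on"
    RES = "individual accounts upon request"


def restriction_options(channel: str) -> list[Restriction]:
    """Return the access restrictions that can be applied on a channel.

    The channel names are illustrative assumptions, not HZSK identifiers.
    """
    if channel == "repository":
        # Repository data can be restricted to academic users or to individuals.
        return [Restriction.ACA, Restriction.RES]
    if channel in ("annis", "web"):
        # ANNIS and other web resources only support academic single sign-on.
        return [Restriction.ACA]
    raise ValueError(f"unknown distribution channel: {channel!r}")


print([r.name for r in restriction_options("repository")])
print([r.name for r in restriction_options("annis")])
```

A depositor who needs individual, request-based access (RES) would therefore have to distribute the data via the repository rather than via ANNIS.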

Accepted standards and formats

The workflows at the HZSK and the search and browsing functionality of the repository have been implemented for specific formats, which is why certain formats are preferred. Generally, only best practice formats are considered acceptable for deposit.

Transcription formats

Preferred: EXB (EXMARaLDA transcriptions without structure or segmentation errors)
Unproblematic: FOLKER, ELAN, ISO-TEI Spoken, FLEX, PRAAT, Transcriber, ANVIL
Acceptable: other thoroughly documented XML and text formats (e.g. CHAT/CLAN or CSV formats) that can be losslessly converted into preferred or unproblematic formats
Problematic: proprietary formats (e.g. Microsoft Word), Rich Text Formats (with information encoded as formatting) and legacy data (analogue or deprecated digital formats)

Metadata formats

Preferred: CMDI using the HZSK profiles (SpokenCorpusProfile, where applicable complemented with CommunicationProfile, for spoken corpora; TextCorpusProfile for text corpora; and ToolProfile for linguistic tools)
Unproblematic: other CMDI profiles, consistent metadata in the EXMARaLDA Corpus Manager (Coma XML) format
Acceptable: consistent metadata in other commonly used XML formats or other structured formats
Problematic: analogue and proprietary formats, legacy formats that can't be automatically processed, inconsistent metadata

Media formats

Audio recordings should be provided as WAV (uncompressed PCM, 16 bit, 48 kHz). For archiving of video recordings, we require open, lossless formats, preferably MPEG-4. For reliable use with the EXMARaLDA tools, MPEG-1 is additionally recommended. Please consider our detailed information on audio and video file properties.
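Depositors may want to check their audio files against these properties before submission. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav` and the wording of the messages are assumptions made for illustration, and this is not an official HZSK validation tool.

```python
import wave


def check_wav(path: str) -> list[str]:
    """Return a list of deviations from 16 bit / 48 kHz PCM WAV.

    An empty list means the file matches the stated audio properties.
    """
    try:
        with wave.open(path, "rb") as wav:
            sample_width = wav.getsampwidth()  # bytes per sample
            frame_rate = wav.getframerate()    # samples per second
    except (wave.Error, EOFError) as exc:
        # The wave module rejects compressed or malformed WAV files.
        return [f"not readable as PCM WAV: {exc}"]

    problems = []
    if sample_width != 2:
        problems.append(f"sample width is {sample_width * 8} bit, expected 16 bit")
    if frame_rate != 48000:
        problems.append(f"sample rate is {frame_rate} Hz, expected 48000 Hz")
    return problems
```

Calling `check_wav("recording.wav")` on a conforming file returns an empty list; any other result lists what would need to be fixed or re-exported.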

Documentation formats

Each corpus needs to be documented thoroughly. Apart from structured metadata, we also expect a description of the original corpus design and of the transcription/annotation guidelines used in the creation process. These can be provided in commonly used formats for text documents (docx, doc, odt, rtf, txt) and will be converted into PDF for dissemination.

Data curation

Before deposited data can be ingested into the HZSK Repository, it is integrated into the university's backup system and placed under version control, so that any alterations necessary for the publication process, possibly including a curation process, remain transparent. The reason for including a corpus curation process is not only to create a consistent digital resource useful for as many scenarios and methods as possible, but also to harmonize all hosted data as far as possible regarding technical aspects. Only data which complies with the technical standards used at the HZSK and for which the semantics, e.g. transcription conventions and metadata elements, have been thoroughly documented can be preserved independently of its current data formats. The effort required for the curation of a corpus depends heavily on the formats used initially and can be minimized by considering recommended formats and standards at an early stage.

Preservation plan

The preservation plan of the HZSK is based on harmonization of hosted resources, which will all be converted into standard best practice formats, currently EXMARaLDA, CMDI and the others stated above. This allows for efficient and consistent conversion of all data should these formats be superseded by newer ones. The HZSK follows the recommendations issued by standards bodies and by the CLARIN infrastructure to which the HZSK belongs. The current decisions on formats and the hosted resources will undergo a yearly review to assess and document the need for curation. The HZSK will continue to host resources that were deposited in non-preferred or problematic formats and require curation, but the HZSK cannot carry out the necessary curation for such data with its own means. Should the HZSK at some point not be able to continue hosting the deposited resources, data owners can choose between relocating the data to another suitable CLARIN centre or any other research data centre specialising in the relevant resource type, and having the resource ingested into the generic research data repository of the University of Hamburg.

Data publication

For publication, the deposited resource is either converted into current preferred formats and fully integrated into the HZSK Repository, or ingested into the HZSK Repository as a data set for download. In either case, publicly available metadata is created and distributed via a standard protocol (OAI-PMH) to be harvested by various portals such as the CLARIN Virtual Language Observatory. Resources made available through the HZSK Repository receive a persistent identifier, a Handle PID, to make sure they can be easily and reliably referenced, e.g. in research articles based on the resource.
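To illustrate how a portal harvests metadata over OAI-PMH, the sketch below builds a `ListRecords` request URL. The base URL is a placeholder and not the actual HZSK endpoint, and the function name is hypothetical. `oai_dc` is the metadata format every OAI-PMH repository must support; which other prefixes a given repository offers (for example one for CMDI) has to be looked up with its `ListMetadataFormats` verb.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is the repository's OAI-PMH endpoint (a placeholder here).
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional: restrict harvesting to one set defined by the repository.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, for illustration only.
print(list_records_url("https://repository.example.org/oai"))
```

A harvester would send an HTTP GET request to this URL and parse the XML response, following any resumption token the repository returns for long result lists.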

Update policy

A persistent identifier, such as the Handle PIDs issued by the HZSK for published versions of resources, is a special kind of link that always resolves to the resource for which it was issued, regardless of its current storage location. We issue a new top-level PID for each version of a corpus made available through the repository, which makes reliable referencing in research papers etc. possible; a new version usually comprises a number of minor revisions or corrections. PIDs are also issued for all files distributed as parts of a complex resource, e.g. the transcriptions and recordings of a corpus. A strict interpretation of versioning using PIDs would imply that a new PID must be issued for any file that has changed in any way. Bearing in mind the reason for using PIDs (reliable referencing that allows others to comprehend and assess research and its results), we at the HZSK reserve the right to change individual files without changing the PID if the changes do not affect the resource as a basis for research. Such changes are, however, strictly limited to the structure or mark-up of the file, or to purely administrative data, e.g. the new address of the unchanged affiliation of the data owner. Superseded versions of hosted resources will be made available upon request unless there are legally relevant reasons for not doing so.
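As an illustration of how a Handle PID is turned into a resolvable link, the sketch below converts a handle into a URL for the public Handle.Net proxy (hdl.handle.net). The function name and the example handle are made up for demonstration and do not refer to an actual HZSK resource.

```python
HANDLE_PROXY = "https://hdl.handle.net/"


def handle_to_url(pid: str) -> str:
    """Turn a handle such as 'hdl:12345/abc' into a proxy URL."""
    pid = pid.strip()
    # Accept the bare handle, the 'hdl:' form, or an existing proxy URL.
    for prefix in ("hdl:", "https://hdl.handle.net/", "http://hdl.handle.net/"):
        if pid.startswith(prefix):
            pid = pid[len(prefix):]
            break
    if "/" not in pid:
        # A handle always consists of a prefix and a suffix separated by '/'.
        raise ValueError(f"not a handle: {pid!r}")
    return HANDLE_PROXY + pid


# Hypothetical handle, for illustration only.
print(handle_to_url("hdl:12345/example-corpus-1.0"))
```

Because the proxy redirects to wherever the resource is currently stored, a citation that uses this URL stays valid even if the repository moves the underlying files.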