Publication of resources
If you are interested in depositing linguistic data with the HZSK Repository, please read the following information and contact us with any questions.
The HZSK is a CLARIN centre that accepts corpora and other linguistic resources from research projects and other contexts in order to make them available, mainly to the academic community, for research and teaching purposes. The focus of the HZSK is on spoken, multilingual and multimodal corpora, and on (spoken) corpora in languages other than German, especially lesser-resourced or endangered languages. For corpora, we expect certain quality standards to be met regarding the completeness and consistency of data and metadata. Apart from basic requirements regarding the legal situation and the general quality and documentation of the data, there are certain technical requirements, which are necessary for integrating the data into the HZSK Repository and the digital research infrastructures the repository is part of. We also accept thoroughly documented data sets with high-quality primary data on a case-by-case basis, to preserve them in their current state for future curation. A detailed assessment of existing data can be made by consulting our guideline (in German):
Guidelines and Best practices
Please consider the following best practices guidelines on linguistic research data from the German Research Foundation (DFG) (in German):
Additionally, the University of Hamburg provides information on good scientific practice (in German):
As defined by CLARIN, data hosted at the HZSK can be distributed with different access restrictions corresponding to three different distribution types: non-restricted access for publicly available data (PUB), access for all academic users via single sign-on (ACA), or access restricted to individual accounts upon request (RES). The most suitable distribution type for a resource depends on the legal situation, the resource type and the distribution channel. Clarifying the legal situation is a prerequisite for data deposit and the responsibility of the data depositor.
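The three distribution types can be read as a simple access rule. The following is a minimal sketch of that rule, not the repository's actual implementation; the names `Distribution`, `may_access`, `academic_login` and `individually_approved` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Distribution(Enum):
    """The three CLARIN distribution types named above."""
    PUB = "public"       # non-restricted access
    ACA = "academic"     # academic users via single sign-on
    RES = "restricted"   # individual accounts upon request


def may_access(distribution, *, academic_login=False, individually_approved=False):
    """Return True if a user with the given (hypothetical) flags may access a resource."""
    if distribution is Distribution.PUB:
        return True
    if distribution is Distribution.ACA:
        return academic_login
    # RES: access only after an individual request has been granted.
    return individually_approved
```

For example, `may_access(Distribution.ACA, academic_login=True)` is true, while an academic login alone is not sufficient for a RES resource in this sketch.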
Accepted standards and formats
The workflows at the HZSK and the search and browsing functionality of the repository have been implemented for specific formats, which is why certain formats are preferred. Generally, only best practice formats are considered acceptable for deposit.
|Preferred:||EXB (EXMARaLDA transcriptions without structure or segmentation errors)|
|Unproblematic:||FOLKER, ELAN, ISO-TEI Spoken, FLEX, PRAAT, Transcriber, ANVIL|
|Acceptable:||other thoroughly documented XML and text formats (e.g. CHAT/CLAN or CSV formats) that can be losslessly converted into preferred or unproblematic formats|
|Problematic:||proprietary formats (e.g. Microsoft Word), Rich Text Formats (with information encoded as formatting) and legacy data (analogue or deprecated digital formats)|
|Preferred:||CMDI using the HZSK profiles (SpokenCorpusProfile (where applicable complemented with CommunicationProfile) for spoken corpora, TextCorpusProfile for text corpora and ToolProfile for linguistic tools)|
|Unproblematic:||other CMDI profiles, consistent metadata in the EXMARaLDA Corpus Manager (Coma XML) format|
|Acceptable:||consistent metadata in other commonly used XML formats or other structured formats|
|Problematic:||analogue and proprietary formats, legacy formats that can't be automatically processed, inconsistent metadata|
|Audio recordings should be provided as WAV (uncompressed PCM, 16 bit, 48 kHz). For archiving of video recordings, we require open, lossless formats, preferably MPEG-4. For reliable use with the EXMARaLDA tools, MPEG-1 is additionally recommended. Please consider our detailed information on audio and video file properties.|
|Each corpus needs to be documented thoroughly. Apart from structured metadata, we also expect a description of the original corpus design and of the transcription/annotation guidelines used in the creation process. These can be provided in commonly used formats for text documents (docx, doc, odt, rtf, txt) and will be converted into PDF for dissemination.|
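Depositors may want to check the audio requirement stated above (uncompressed PCM WAV, 16 bit, 48 kHz) before submitting. The following is a minimal sketch using Python's standard `wave` module, which only reads uncompressed PCM files; the function names `check_wav` and `make_test_wav` are hypothetical, and the check covers only the header properties named in this document, not the more detailed file properties referred to above.

```python
import io
import wave


def check_wav(source):
    """Return a list of problems; an empty list means the header matches.

    `source` is a file path or a binary file-like object.
    """
    problems = []
    try:
        with wave.open(source, "rb") as wav:
            if wav.getsampwidth() != 2:  # 2 bytes per sample = 16 bit
                problems.append(
                    f"sample width is {wav.getsampwidth() * 8} bit, expected 16"
                )
            if wav.getframerate() != 48000:
                problems.append(
                    f"sample rate is {wav.getframerate()} Hz, expected 48000"
                )
    except wave.Error as error:
        # Raised for files that are not readable as uncompressed PCM WAV.
        problems.append(f"not readable as uncompressed PCM WAV: {error}")
    return problems


def make_test_wav(rate, width):
    """Build a short silent mono WAV in memory (for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * width * 100)
    buffer.seek(0)
    return buffer
```

Calling `check_wav("recording.wav")` on a real file (the file name is a placeholder) returns an empty list if the sample width and sample rate match, and a list of human-readable problems otherwise.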
Before deposited data can be ingested into the HZSK Repository, it is integrated into the university's backup system and placed under version control to allow for transparency of any alterations necessary for the publication process, possibly including a curation process. The reason for including a corpus curation process is not only to create a consistent digital resource useful for as many scenarios and methods as possible, but also to harmonize all hosted data as far as possible regarding technical aspects. Only data which complies with the technical standards used at the HZSK, and for which the semantics (e.g. transcription conventions and metadata elements) have been thoroughly documented, can be preserved independently of its current data formats. The effort required for the curation of a corpus depends heavily on the formats used initially and can be minimized by considering recommended formats and standards at an early stage.
The preservation plan of the HZSK is based on the harmonization of hosted resources, which will all be converted into standard best practice formats, currently EXMARaLDA, CMDI and the others stated above. This allows for efficient and consistent conversion of all data should these formats be superseded by newer ones. The HZSK follows the recommendations issued by standards bodies and by the CLARIN infrastructure to which it belongs. The current decisions on formats and the hosted resources will undergo a yearly review to assess and document the need for curation. The HZSK will also host resources deposited in non-preferred or problematic formats that require curation, but it cannot carry out the necessary curation for such data with its own means. Should the HZSK at some point not be able to continue hosting the deposited resources, data owners can choose between relocating the data to another suitable CLARIN centre or any other research data centre specialising in the relevant resource type, or having the resource ingested into the generic research data repository of the University of Hamburg.
For publication, the deposited resource is either converted into the current preferred formats and fully integrated into the HZSK Repository, or ingested into the HZSK Repository as a data set for download. In either case, publicly available metadata is created and distributed via a standard protocol (OAI-PMH) to be harvested by various portals such as the CLARIN Virtual Language Observatory. Resources made available through the HZSK Repository receive a persistent identifier, a Handle PID, so that they can be easily and reliably referenced, e.g. in research articles based on the resource.
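To illustrate what harvesting via OAI-PMH involves, the following minimal sketch builds a `ListIdentifiers` request and extracts record identifiers from a response. The verb, the `metadataPrefix` argument, the `oai_dc` prefix and the XML namespace come from the OAI-PMH 2.0 specification; the base URL, the sample response and the function names are hypothetical and do not describe the actual HZSK endpoint.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListIdentifiers request URL for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_identifiers(response_xml):
    """Return the record identifiers contained in a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Hypothetical, shortened response used only to demonstrate the parsing step.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:repository.example.org:corpus-1</identifier>
      <datestamp>2020-01-01</datestamp>
    </header>
    <header>
      <identifier>oai:repository.example.org:corpus-2</identifier>
      <datestamp>2020-02-01</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""
```

A harvester would fetch the URL returned by `list_identifiers_url`, pass the response body to `extract_identifiers`, and then request the full metadata for each identifier.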
A persistent identifier (such as the Handle PIDs issued by the HZSK for published versions of resources) is a special kind of link that always resolves to the resource for which it was issued, regardless of its current storage location. We issue a new top-level PID for each version of a corpus made available through the repository, which makes reliable referencing in research papers etc. possible; a new version usually comprises a number of minor revisions or corrections. PIDs are also issued for all files distributed as parts of a complex resource, e.g. the transcriptions and recordings of a corpus. A strict interpretation of versioning using PIDs would imply that a new PID must be issued for a file that has changed in any way. Bearing in mind the reason for using PIDs (reliable referencing that allows others to comprehend and assess research and its results), at the HZSK we reserve the right to change individual files without changing the PID if the changes do not influence the resource as the basis for research. Such changes are, however, strictly limited to the structure or mark-up contents of the file, or to purely administrative data, e.g. the new address of the unchanged affiliation of the data owner. Superseded versions of hosted resources will be made available upon request unless there are legally relevant reasons for not doing so.
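A Handle PID can be turned into a resolvable link by appending it to the public Handle proxy at `https://hdl.handle.net/`. The following is a minimal sketch of that step; the function name `handle_to_url` and the handle `12345/example-corpus-v1` are hypothetical and are not an actual HZSK identifier.

```python
HANDLE_PROXY = "https://hdl.handle.net/"

# Common ways a handle is written in citations.
KNOWN_PREFIXES = ("hdl:", "http://hdl.handle.net/", "https://hdl.handle.net/")


def handle_to_url(handle):
    """Normalize a handle written in one of several forms to a proxy URL."""
    handle = handle.strip()
    for prefix in KNOWN_PREFIXES:
        if handle.lower().startswith(prefix):
            handle = handle[len(prefix):]
            break
    # A handle has the form "<prefix>/<suffix>"; both parts must be present.
    naming_authority, separator, local_name = handle.partition("/")
    if not separator or not naming_authority or not local_name:
        raise ValueError(f"not a valid handle: {handle!r}")
    return HANDLE_PROXY + handle
```

Because the proxy redirects to wherever the resource is currently stored, a citation that uses the resulting URL stays valid even if the repository moves the underlying files.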