About the HZSK Repository

The digital repository of the HZSK was created to support archiving, maintenance, distribution, and exploitation of spoken language corpora. These corpora usually include audio and/or video recordings, transcriptions, additional data and structured metadata.

Primarily focussing on the topic 'Multilingualism', this collection of corpora is made freely available for scientific research and teaching. However, depending on the corpus, individual registration might be necessary.

The EXMARaLDA Demo Corpus does not require registration. To access any of the other corpora in the present repository, please consult the corpus release guidelines.

Belonging to the CLARIN group, the repository meets the criteria of the CLARIN Center Assessment as well as the Data Seal of Approval. This means:

  • The data of the HZSK repository can be identified and cited unambiguously and persistently (handle system); older versions remain accessible (see guidelines for versioning).
  • The individual corpora can be searched via the Federated Content Search of the Virtual Language Observatory (VLO)
  • Single sign-on via Shibboleth (CLARIN IdP) is possible
  • The metadata of every HZSK corpus listed in the present repository is made searchable via the OAI-PMH metadata harvesting of the CLARIN language resources catalog
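
As a rough illustration of two of the points above, the following minimal sketch builds an OAI-PMH harvesting request and turns a handle into a resolvable link. The base URL and the example handle are placeholders, not the repository's actual values; only the `ListRecords` verb, the `oai_dc` metadata prefix (which every OAI-PMH repository must support) and the global Handle proxy at hdl.handle.net are taken from the respective standards.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the repository's real OAI-PMH base URL is not given here.
BASE_URL = "https://repository.example.org/oai"


def build_oai_request(base_url, verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and its arguments."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


def handle_to_url(handle):
    """Return the global Handle proxy URL for a handle of the form 'prefix/suffix'."""
    return f"https://hdl.handle.net/{handle}"


# Request all records in Dublin Core, the format every OAI-PMH repository must offer.
list_records_url = build_oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(list_records_url)

# "12345/example" is a made-up handle used only for illustration.
print(handle_to_url("12345/example"))
```

Fetching the first URL from a real endpoint returns an XML document with the harvested metadata records; the second URL redirects to whatever location is currently registered for the handle.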

You can find the precise Technical Documentation here.

The HZSK repository emerged between 2011 and 2013 at the University of Hamburg as part of the projects “CLARIN” (funded by the BMBF) and “LIS” (funded by the DFG).

The HZSK repository is mainly based on the open source technologies Fedora Commons, Islandora and Drupal.

Corpora of the SFB 538 "Multilingualism"

At the SFB 538 ‘Multilingualism’, a variety of corpora were created, documenting multilingual communication (e.g. interpreting), language development of multilingual speakers (e.g. language acquisition, language attrition) and aspects of social, individual and historical multilingualism.

The corpora are available in the EXMARaLDA data format and can be displayed/exported in different formats via the HZSK Repository.

Corpora of the SFB 632 "Information structure"

Many corpora of the Collaborative Research Center 632 (Sonderforschungsbereich / SFB 632) "Information Structure: The Linguistic Means of Structuring Utterances, Sentences and Texts" (funded by the DFG between July 2003 and June 2015) have been incorporated into the HZSK Repository.


The following internal documents contain information about the technical implementation and guidelines of the HZSK:


Hedeland, Hanna; Jettka, Daniel & Lehmberg, Timm (2014). Vernetzung statt Vereinheitlichung. Digitale Forschungsinfrastrukturen in den Geisteswissenschaften. In b.i.t. online. Vol. 17, No. 5.

Jettka, Daniel & Stein, Daniel (2014). The HZSK Repository: Implementation, Features, and Use Cases of a Repository for Spoken Language Corpora. In D-Lib Magazine. Vol. 20, No. 9/10. DOI: 10.1045/september2014-jettka

Windhouwer, Menzo; Kemps-Snijders, Marc; Trilsbeek, Paul; Moreira, André; van der Veen, Bas; Silva, Guilherme & von Reihn, Daniel (2016). FLAT: Constructing a CLARIN Compatible Home for Language Resources. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk and Stelios Piperidis (eds.). Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), 23.-28.05.2016. Portorož, Slovenia. ISBN: 978-2-9517408-9-1

Yeh, Shea-Tinn; Reyes, Fernando; Rynhart, Jeff & Bain, Philip (2016). Deploying Islandora as a Digital Repository Platform: a Multifaceted Experience at the University of Denver Libraries. In D-Lib Magazine. Vol. 22, No. 7/8. DOI: 10.1045/july2016-yeh


What are the corpora used for?

Where do the users come from?