Linguistic Corpora at the HZSK Repository
euroWiss - Linguistic Profiling of European Academic Education (Subcorpus 1)
Subcorpus 1 presents part of the euroWiss corpus, covering communication in teaching and learning discourse at German and Italian universities, in the humanities as well as the technical and natural sciences. It offers access to transcriptions of lectures and seminars aligned with audio recordings, together with the text types used for instruction. The corpus comprises 18 communications, 24 audio recordings, 24 transcriptions, 140,000 transcribed words, 19 identified speakers, 18 students' notes, 2 lecture scripts, 24 chalkboard presentations, 2 PowerPoint presentations, 3 overhead slides, 3 handouts, and 14 schedules/descriptions of recorded lectures/seminars.
Language: German, Italian
License: HZSK-ACA (academic)
The Hamburg MapTask Corpus (HAMATAC)
Audio recordings of map tasks with adult L2 users of German. The speakers' L1 and their L2 proficiencies vary. The maps used for the tasks are available.
Language: German
License: HZSK-ACA (academic)
The Hamburg MapTask Corpus (HAMATAC)
Audio and two video recordings of map tasks with adult L2 users of German and one L1 speaker. The speakers' L1 and their L2 proficiencies vary. The maps used for the tasks are available.
Language: German
License: HZSK-ACA (academic)
Hamburg Modern Times Corpus (HaMoTiC)
Audio recordings of a film retelling task with adult L2 users of German. The speakers' L1 and their L2 proficiencies vary. The corpus contains 24 communications plus 1 German reference communication, each lasting between 2 and 16 minutes. For each speaker, a language learner biography (audio, freely transcribed) is available.
Language: German
License: HZSK-ACA (academic)
Covert translation: Business Communication (new)
Translation corpora of original texts with translations and comparable texts from the genre of external business communication.
Language: German, English
License: HZSK-ACA (academic)
Covert translation: Business Communication (old)
Translation corpora of original texts with translations and comparable texts from the genre of external business communication.
Language: German, English
License: HZSK-ACA (academic)
B1 Foodo
The data sets for each language consist of a small number of mini-dialogues, chosen out of the 189 entries within the Focus Translation Task (cf. Skopeteas et al. 2006: 209ff.) in order to get a basic set of utterances for comparison between the languages dealt with in the project.
Language: Foodo
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B1 Fon
The data sets for each language consist of a small number of mini-dialogues, chosen out of the 189 entries within the Focus Translation Task (cf. Skopeteas et al. 2006: 209ff.) in order to get a basic set of utterances for comparison between the languages dealt with in the project.
Language: Fon
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B2 Marghi
Full set: all focus-related experiments, status: work in progress, large parts elicited, most of the data transcribed, partly annotated.
Language: Marghi
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B1 Yom
The data sets for each language consist of a small number of mini-dialogues, chosen out of the 189 entries within the Focus Translation Task (cf. Skopeteas et al. 2006: 209ff.) in order to get a basic set of utterances for comparison between the languages dealt with in the project.
Language: Yom
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B2 Tangale
Tangale sample, status: final, manually transcribed, glossed and translated into English, annotated for morphology, parts of speech, syntax, grammatical function, semantic roles, focus and focus position (e.g. ex situ) in EXMARaLDA.
Language: Tangale
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B1 Aja
The data sets for each language consist of a small number of mini-dialogues, chosen out of the 189 entries within the Focus Translation Task (cf. Skopeteas et al. 2006: 209ff.) in order to get a basic set of utterances for comparison between the languages dealt with in the project.
Language: Aja
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B2 Bura
Full set: all focus-related experiments, status: work in progress, large parts elicited, most of the data transcribed, partly annotated.
Language: Bura
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
Türkisch-Englisch-Deutsch bei Herkunftssprechern (TEDH)
The TEDH corpus was created as part of the project "Foreign Language Acquisition in German-Turkish bilinguals". It contains interviews in three languages (Turkish, English and German) and comprises 74 communications from 25 different speakers. The bulk of the language material to be integrated, glossed and annotated was collected by several researchers and is available in audio format. The transcription data and the metadata of the corpus are processed and stored in EXMARaLDA format.
Language: German, Turkish, English
License: HZSK-ACA (academic)
B2 Hausa
Hausa: complete set, status: final, manually transcribed, glossed and translated into English, annotated for morphology, parts of speech, syntax, grammatical function, semantic roles, focus and focus position (e.g. ex situ) in EXMARaLDA.
Language: Hausa
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B7 Wolof (Wikipedia)
The corpus consists of a collection of texts from the Wolof Wikipedia, randomly chosen from texts with near-standard orthography and language, and covering different topics. The texts were translated manually by a native speaker and tagged automatically by a part-of-speech tagger. No further annotation is provided.
Language: Wolof
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B2 Guruntum
Guruntum sample, status: final, manually transcribed, glossed and translated into English, annotated for morphology, parts of speech, syntax, grammatical function, semantic roles, focus and focus position (e.g. ex situ) in EXMARaLDA.
Language: Guruntum
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
Das Kiezdeutschkorpus (KiDKo)
A multi-modal digital corpus of spontaneous discourse data from informal, oral peer-group settings in multi- and monoethnic speech communities.
Language: German, Turkish, Kurdish, Arabic
License: HZSK-ACA (academic)
B7 Wolof (web)
The corpus consists of a collection of texts from web discussion forums, randomly chosen from texts with near-standard orthography and language, and covering different topics. The texts were translated manually by a native speaker and tagged automatically by a part-of-speech tagger. No further annotation is provided.
Language: Wolof
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B4 Sächsische Weltchronik
The corpus contains a 13th-century chronicle in Middle Low German.
Language: Middle Low German
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B4 Ludolf
The text of this corpus, Ludolf von Sudheims Reise ins Heilige Land (Ludolf of Sudheim's Journey to the Holy Land), is a travel diary describing the adventures of a group of pilgrims, written in Middle Low German and dated to 1350. For information on the properties of the text, including the manuscripts, see Blust-Thiele (1985). This corpus uses the text edition by Stapelmohr (1937), the first 20 pages of which are tagged for clause type and grammatical function. The corpus includes 6,690 tokens.
Language: Middle Low German
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B4 Historisches Predigtenkorpus zum Nachfeld
HIPKON is the first corpus based on only one text type (sermons) and one dialect area, Upper German (Bavarian-Alemannic). The sermons cover the period from Middle High German to the beginning of the New High German period. They were carefully selected so that each is representative of one century. The annotation covers, among other things, syntax, information structure and discourse structure.
Language: New High German
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B4 Muspilli
Complete text, status: work in progress, digitization, translation into English, manually annotated with parts of speech, syntactic category, grammatical function, clause status, number of syllables (per constituent), information status, topic/comment, position of constituent in sentence, definiteness, focus/background, focus marker, comments, source (bibliography).
Language: Old High German
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B4 Heliand
Heliand 1, 4 and 5: complete text, status: final, digitization, translation into Modern German, manually annotated with parts of speech, syntactic categories, grammatical functions, clause status, number of syllables (per constituent), alliteration, information status, topic/comment, position of phrase in sentence, definiteness, focus/background, focus marker, comments on context, source (bibliography).
Language: Old High German
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
TraCES
Corpus of the Classical Ethiopic language (Ge'ez), produced by the TraCES project (https://www.traces.uni-hamburg.de/en/about.html) from 2014 to 2019. The corpus is morphologically annotated and freely accessible for online search. The current corpus is a beta test run and should be treated as work in progress, as annotation has been carried out to varying degrees of detail.
Language: Ethiopic
License: BY-NC-ND 4.0 (academic)
B4 Otfrid
The Reference Corpus of Old German contains (annotated) data from the oldest surviving texts of German, from the beginning of the written record around 750 until 1050, comprising approx. 650,000 text words.
Language: Old High German
License: Creative Commons Attribution 3.0 Unported License (academic)