Linguistic Corpora at the HZSK Repository
Hamburg Dependency Treebank
The Hamburg Dependency Treebank is, to our knowledge, the largest dependency treebank currently available. Its annotations are genuine dependency annotations, i.e. they were not converted from phrase structures.
Language: German
License: HZSK-ACA (Text) / CC-by-sa-4.0 (Annotation) (academic)
Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200-1650)
The reference corpus of Middle Low German and Low Rhenish texts is based on manuscripts, prints and inscriptions. It is intended to provide insight into the culture of speech and writing in the Middle Low German and Low Rhenish regions. This spectrum of text types can be used to trace linguistic development on the basis of diatopic and diachronic subcategorisation. The aim of the project is to publish diplomatically transcribed, lemmatised and grammatically annotated texts. The processed data, especially at the grammatical level, enable linguistic analyses of Middle Low German and Low Rhenish that go far beyond what was previously possible.
Language: Undefined
License: CC-BY 4.0 (public)
Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200-1650)
The reference corpus of Middle Low German and Low Rhenish texts is based on manuscripts, prints and inscriptions. It is intended to provide insight into the culture of speech and writing in the Middle Low German and Low Rhenish regions. This spectrum of text types can be used to trace linguistic development on the basis of diatopic and diachronic subcategorisation. The aim of the project is to publish diplomatically transcribed, lemmatised and grammatically annotated texts. The processed data, especially at the grammatical level, enable linguistic analyses of Middle Low German and Low Rhenish that go far beyond what was previously possible.
Language: Middle Low German, Low Rhenish
License: CC-BY 4.0 (public)
Commented Learner Corpus Academic Writing
Authentic texts written by students of the University of Hamburg as part of their studies. The students have various first languages (L1) and study various subjects. All texts were the subject of writing counselling at the Writing Center Multilingualism (Schreibwerkstatt Mehrsprachigkeit). For some texts, comments by peer tutors and several versions are available.
Language: German
License: HZSK-ACA (academic)
Covert translation: Business Communication (new)
Translation corpora of original texts with translations and comparable texts from the genre external business communication.
Language: German, English
License: HZSK-ACA (academic)
Covert translation: Business Communication (old)
Translation corpora of original texts with translations and comparable texts from the genre external business communication.
Language: German, English
License: HZSK-ACA (academic)
Covert translation: popular science
Translation corpora of original texts with translations and comparable texts from the genre popular scientific prose.
Language: German, English
License: HZSK-ACA (academic)
B4 Historisches Predigtenkorpus zum Nachfeld
HIPKON is the first corpus based on only one text type (sermons) and one dialect area, Upper German (Bavarian-Alemannic). The sermons span the period from Middle High German to the beginning of New High German. They were carefully selected so that each is representative of one century. The annotation covers, among other things, syntax, information structure and discourse structure.
Language: New High German
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
Commented Learner Corpus Academic Writing (KoLaS 1.1)
Authentic texts written by students of the University of Hamburg as part of their studies. The students have various first languages (L1) and study various subjects. All texts were the subject of writing counselling at the Writing Center Multilingualism (Schreibwerkstatt Mehrsprachigkeit). For some texts, comments by peer tutors and several versions are available.
Language: German
License: HZSK-ACA (academic)
B2 Hausa
Hausa: complete set, status: final. Manually transcribed, glossed and translated into English; annotated in EXMARaLDA for morphology, parts of speech, syntax, grammatical function, semantic roles, focus and focus position (e.g. ex situ).
Language: Hausa
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B4 Heliand
Heliand 1, 4 and 5: complete text, status: final. Digitized and translated into Modern German; manually annotated for parts of speech, syntactic categories, grammatical functions, clause status, number of syllables (per constituent), alliteration, information status, topic/comment, position of phrase in sentence, definiteness, focus/background, focus marker, comments on context and source (bibliography).
Language: Old Saxon
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B4 Muspilli
Complete text, status: work in progress. Digitized and translated into English; manually annotated for parts of speech, syntactic category, grammatical function, clause status, number of syllables (per constituent), information status, topic/comment, position of constituent in sentence, definiteness, focus/background, focus marker, comments and source (bibliography).
Language: Old High German
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B4 Ludolf
The text of this corpus, Ludolf von Sudheims Reise ins Heilige Land (Ludolf of Sudheim's Journey to the Holy Land), is a travel diary describing the adventures of a group of pilgrims, written in Middle Low German and dated to 1350. For information on the properties of the text, including the manuscripts, see Blust-Thiele (1985). This corpus uses the text edition by Stapelmohr (1937). The first 20 pages of the edition are tagged for clause type and grammatical function. The corpus comprises 6,690 tokens.
Language: Middle Low German
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B4 Tatian Corpus of Deviating Examples 2.1
The Tatian Corpus of Deviating Examples (T-CODEX 2.1) provides morpho-syntactic and information-structural annotation of parts of the Old High German translation attested in MS St. Gallen Cod. 56, traditionally called the OHG Tatian, one of the largest prose texts from the classical OHG period. The corpus was designed and annotated by Project B4 of the Collaborative Research Center on Information Structure at Humboldt University Berlin. It compiles ca. 2,000 deviating examples found in the text portions of the scribes α, β, γ and ε. Each clause structure is stored in a separate file, annotated with the annotation tool EXMARaLDA and searchable via ANNIS, a general-purpose tool for the publication, visualisation and querying of linguistic data collections, developed by Project D1 of the Collaborative Research Center on Information Structure at Potsdam University.
Language: Latin, Old High German
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (public)
Commented Learner Corpus Academic Writing (KoLaS 1.0)
Authentic texts written by students of the University of Hamburg as part of their studies. The students have various first languages (L1) and study various subjects. All texts were the subject of writing counselling at the Writing Center Multilingualism (Schreibwerkstatt Mehrsprachigkeit). For some texts, comments by peer tutors and several versions are available.
Language: German
License: HZSK-ACA (academic)
B7 Wolof (Wikipedia)
The corpus comprises a collection of texts from the Wolof Wikipedia, randomly chosen for their near-standard orthography and language and covering different topics. The texts were translated manually by a native speaker and tagged automatically by a part-of-speech tagger. No further annotation is provided.
Language: Wolof
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B2 Guruntum
Guruntum: sample, status: final. Manually transcribed, glossed and translated into English; annotated in EXMARaLDA for morphology, parts of speech, syntax, grammatical function, semantic roles, focus and focus position (e.g. ex situ).
Language: Guruntum
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
B7 Wolof (web)
The corpus comprises a collection of texts from web discussion forums, randomly chosen for their near-standard orthography and language and covering different topics. The texts were translated manually by a native speaker and tagged automatically by a part-of-speech tagger. No further annotation is provided.
Language: Wolof
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
Commented Learner Corpus Academic Writing (KoLaS 2.0)
Authentic texts written by students of the University of Hamburg as part of their studies. The students have various first languages (L1) and study various subjects. All texts were the subject of writing counselling at the Writing Center Multilingualism (Schreibwerkstatt Mehrsprachigkeit). For some texts, comments by peer tutors and several versions are available.
Language: German
License: HZSK-ACA (academic)
B4 Sächsische Weltchronik
The corpus contains a 13th-century chronicle written in Middle Low German.
Language: Middle Low German
License: Creative Commons Attribution-NonCommercial 3.0 Unported License (academic)
Hamburg Corpus of Old Swedish with Syntactic Annotations (HaCOSSA)
Religious and secular prose, law texts, non-fiction literature (geographical, theological, historical, natural science), diplomas.
Language: English, German, Latin, Old Swedish, Swedish
License: FID-AKA (restricted)
Hamburg Old Scandinavian Text Collection (HOSTCol)
Law texts, chapbooks and miscellaneous literature in Old Swedish and Old Danish.
Language: Old Swedish, Old Danish
License: FID-AKA (restricted)
TraCES
Corpus of the Classical Ethiopic language (Ge'ez), produced by the TraCES project (https://www.traces.uni-hamburg.de/en/about.html) in 2014-2019. The corpus is morphologically annotated and freely accessible for online search. The current corpus is a beta release and should be treated as work in progress, as annotation has been carried out to varying degrees of detail.
Language: Ethiopic
License: BY-NC-ND 4.0 (academic)
B4 Otfrid
The Old German reference corpus contains annotated data from the oldest written records of German, spanning the period from the beginning of continuous written transmission around 750 to 1050, with approx. 650,000 word tokens.
Language: Old High German
License: Creative Commons Attribution 3.0 Unported License (academic)