The Hamburg Dependency Treebank

Overview

Type: treebank
Description: The Hamburg Dependency Treebank (HDT) is, to our knowledge, the largest dependency treebank available at the time of its publication. Its dependency annotations are genuine, i.e. they were not converted from phrase structures.
All sentences were sourced from the German news site heise.de, from articles published between 1996 and 2001. The articles range from formulaic periodic updates on new BIOS revisions, processor models and quarterly earnings of tech companies, through features on general trends in the hardware and software market, to general coverage of social, legal and political issues in cyberspace, sometimes in the form of extensive weekly editorial comments. The mapping from sentences to articles and authors is retained, which allows, for example, the analysis of individual style. The manual annotation of the treebank was largely interleaved with the development of a standard for the morphological and syntactic annotation of sentences and with the development of a constraint-based parser.

The HDT consists of three parts:
1. manually annotated and checked for consistency with DECCA (part A, 101,999 sentences)
2. manually annotated but not checked with DECCA (part B, 104,795 sentences)
3. automatically parsed with WCDG (part C, 55,027 sentences)
Data owner: Wolfgang Menzel, menzel@informatik.uni-hamburg.de
Languages: German (deu)
Size: 261,821 sentences
License: HZSK-ACA (Text) / CC-by-sa-4.0 (Annotation)
Persistent Identifier: http://hdl.handle.net/11022/0000-0000-7FC7-2

Download

HDT v.1.0.1 (tar-xz archive, 45.7 MB)
Handle PID: http://hdl.handle.net/11022/0000-0000-91AF-7; MD5 checksum: ab18005793a5c8c1ade21941ac939d27
SHA512 checksum: c3d2d354bea01c25df23489312be3780eb914a06db5062d32298f438df177d0148b60a04f92e9623ab395d6edf650c55e34b761c56b9e454d7880997319edaa0

HDT v.1.0.1 in CoNLL format (tar-xz archive, 22.1 MB)
Handle PID: http://hdl.handle.net/11022/0000-0000-91AE-8; MD5 checksum: 3ec910ef5854bd83837406216d8eccb1
SHA512 checksum: 50c38068e63487845dfc98e3414bddfae3e6e463b8cdb97a91f30d64c37637893342ac5bc8af584749397039c00287c19eaa14262b7abe62b2ca7bd53b14bcd0

HDT v.1.0 (tar-xz archive, 45.7 MB)
Handle PID: http://hdl.handle.net/11022/0000-0000-7FCD-C; MD5 checksum: 224bab002dbbee9c5a233f99dd84b473
SHA512 checksum: 9ad19a8b38f9c7c081f8f91e5a9dfd6e57060ee8cccc191e8080e15a4172197ec39387c18bd445105caab3e4bffc0216ecfd3082016d007c755979f2747f736f

HDT v.1.0 in CoNLL format (tar-xz archive, 22.1 MB)
Handle PID: http://hdl.handle.net/11022/0000-0000-8494-3; MD5 checksum: 84b6d463ef23983776d486ecad01e2b5
SHA512 checksum: 1fd6e8ca8d05138080e2cbb5cea717dce2675d4741197ace4cd0005dc2fe0b16837f3e6c3e96f5afa792523d422c0b5153a2392f9d7119e8e1abc1c15d1189ab
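
A downloaded archive can be checked against the checksums listed above before it is unpacked. The following is a minimal sketch using only the Python standard library; the function names `file_digest` and `verify` are illustrative and not part of any HDT tooling.

```python
import hashlib


def file_digest(path, algorithm="sha512", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path, expected_hex, algorithm="sha512"):
    """Return True if the file's digest matches the published checksum."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `verify("hdt.tar.xz", "<SHA512 value from this page>")` should return `True` for an intact download, and `verify("hdt.tar.xz", "<MD5 value>", algorithm="md5")` does the same for the MD5 value. The file name `hdt.tar.xz` is a placeholder for whatever name the downloaded archive has locally.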

Interfaces

The HDT v.1.0 is connected to central CLARIN interfaces (protected by single sign-on):

  • TÜNDRA: sentences can be visualized and exported to SVG, PNG and JPG (currently only a single sentence can be transferred at a time)
  • WebLicht: the annotation layers of a sentence can be visualized (incl. export to Excel and CSV), and the sentence can be processed further automatically


Publications

The paper describing the HDT: Because Size Does Matter: The Hamburg Dependency Treebank

The annotation guidelines: Eine umfassende Constraint-Dependenz-Grammatik des Deutschen

Software

the toolbox, a collection of helper scripts (e.g. for converting from the cda format to CoNLL-X)

cda_parse, a Python library for parsing cda files

cobacose, a web-based treebank search system

jwcdg, the successor of the parser used for initial automatic annotation
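
For readers who work with the CoNLL-X files rather than the cda format, the sketch below shows one way to read such a file in plain Python. It assumes the standard CoNLL-X layout (ten tab-separated columns per token, a blank line between sentences, numeric HEAD values with 0 for the root); whether the HDT files deviate from this in any detail should be checked against the distribution. `Token` and `read_conllx` are illustrative names, not part of the toolbox or of cda_parse, and the sample sentence is invented for the example rather than taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    id: int       # 1-based position in the sentence
    form: str     # surface form
    lemma: str
    cpostag: str  # coarse part-of-speech tag
    postag: str   # fine-grained part-of-speech tag
    feats: str    # morphological features, "_" if empty
    head: int     # id of the governing token, 0 for the root
    deprel: str   # dependency label


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-X formatted lines."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        # Columns 9 and 10 (PHEAD, PDEPREL) are ignored in this sketch.
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


# Invented two-token sample; tags and labels are illustrative only.
SAMPLE = (
    "1\tDas\tdas\tPDS\tPDS\t_\t2\tSUBJ\t_\t_\n"
    "2\tfunktioniert\tfunktionieren\tVVFIN\tVVFIN\t_\t0\tS\t_\t_\n"
    "\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
root = next(tok for tok in sentences[0] if tok.head == 0)
print(root.form)  # prints "funktioniert"
```

To read an actual file, the same function can be given an open file object, e.g. `with open(path, encoding="utf-8") as f: for sent in read_conllx(f): ...`, where `path` points to one of the unpacked CoNLL files.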