The CoNLL Shared Task 2007 has been announced. The extended data of PADT will be used in the competition, and we provide their rough characteristics:
Source  Tokens   Trees  Files  Description
Total   116,800  3,044  378    annotated on the levels of analytical syntax and morphology
AEP       9,500    242   29    Arabic English Parallel News
AFE      13,000    411   48    Arabic 10K-word English Translation
ALH      14,500    312   41    Arabic Gigaword
ANN      12,500    209   17    Arabic Gigaword
HYT      25,500    457   47    Arabic Gigaword
XIA      26,500    888  111    Arabic Gigaword
XNH      15,000    525   85    Arabic Gigaword
This year's data differ from last year's set in two important respects:
- The extent and the quality of annotations have improved. We added new data sources, esp. AEP and AFE (with paragraph-aligned translations available). Other data sources are the newspaper texts published by Al Hayat, An Nahar, Ummah Press Service, and Xinhua.
- The morphology of the former data has been reannotated using MorphoTrees, so that the format of all data is consistent now and the informativity of the morphological tags is considerably higher. Lemmas and glosses based on the Buckwalter lexicon are also provided.
The morphological class identifiers consist of the part-of-speech category and its refinement, and their meanings read:
VI VP VC    imperfect, perfect, and imperative verb forms
N- A- D-    nouns, adjectives, and adverbs
C- P- I-    conjunctions, prepositions, interjections
G- Q- Y-    graphical symbols, numbers, abbreviations
F- FN FI    particles, esp. negative and interrogative
S- SD SR    pronouns, esp. demonstrative and relative
--          isolated definite articles
Z-          proper names
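As a quick reference, the class identifiers listed above can be turned into a small lookup. This is only a sketch: the function name `describe_class` is hypothetical and not part of the PADT tools, and the wording of each description is paraphrased from the list.

```shell
# Hypothetical helper: print a short description of a two-character
# morphological class identifier, following the list above.
describe_class() {
    case "$1" in
        VI) printf '%s\n' "imperfect verb" ;;
        VP) printf '%s\n' "perfect verb" ;;
        VC) printf '%s\n' "imperative verb" ;;
        N-) printf '%s\n' "noun" ;;
        A-) printf '%s\n' "adjective" ;;
        D-) printf '%s\n' "adverb" ;;
        C-) printf '%s\n' "conjunction" ;;
        P-) printf '%s\n' "preposition" ;;
        I-) printf '%s\n' "interjection" ;;
        G-) printf '%s\n' "graphical symbol" ;;
        Q-) printf '%s\n' "number" ;;
        Y-) printf '%s\n' "abbreviation" ;;
        F-) printf '%s\n' "particle" ;;
        FN) printf '%s\n' "negative particle" ;;
        FI) printf '%s\n' "interrogative particle" ;;
        S-) printf '%s\n' "pronoun" ;;
        SD) printf '%s\n' "demonstrative pronoun" ;;
        SR) printf '%s\n' "relative pronoun" ;;
        --) printf '%s\n' "isolated definite article" ;;
        Z-) printf '%s\n' "proper name" ;;
        *)  printf '%s\n' "unknown" ;;   # identifier not in the list above
    esac
}
```

For example, `describe_class VP` prints `perfect verb`.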
The attributes and morphosyntactic features associated with individual tokens, i.e. the nodes in the dependency tree, include the following kinds of information. A feature can be linguistically applicable but unresolved by the annotation, in which case it is not listed with the token:
Mood        Indicative (I), Subjunctive (S), or Jussive (J) of imperfect verbs, with D if undecided between S and J
Voice       Active (A) or Passive (P)
Person      1 speaker, 2 addressee, 3 others
Gender      morphologically overt 'gender', Masculine (M) or Feminine (F)
Number      morphologically overt 'number', Singular (S), Dual (D), or Plural (P)
Case        1 nominative, 2 genitive, 4 accusative
Defin       morphological 'definiteness', Indefinite (I), Definite (D), Reduced (R), or Complex (C)
MemberOf    member of syntactic Coordination (Co) or Apposition (Ap)
ClauseHead  the token is the head of the given type of a subordinate clause
GramCoref   the pronoun S- is a grammatical coreferent, unlike other pronouns that are textual coreferents
InputForm   the token is the first of all that analyze the given orthographical word in the original input
TokenGloss  a clue to the morphological structure of the token
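To inspect which of these features are actually listed with each token, one could split the feature column of the converted data. The sketch below assumes the CoNLL-X column layout (word form in the 2nd tab-separated column, features in the 6th, `_` when empty) and assumes that the features are written as `Name=Value` pairs separated by `|`; the function name `list_feats` is hypothetical.

```shell
# Hypothetical helper: print "FORM<TAB>Name=Value" for every feature of every token.
# Assumes tab-separated CoNLL-X columns with FORM in column 2 and FEATS in column 6.
list_feats() {
    awk -F'\t' 'NF >= 6 && $6 != "_" {
        n = split($6, feats, "|")          # one array entry per feature
        for (i = 1; i <= n; i++)
            print $2 "\t" feats[i]
    }' "$@"
}
```

It reads the files named as arguments, or standard input if none are given, so it can be placed at the end of a pipeline.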
The inventory of analytical dependency functions is further explained in one document or another:
Pred   verbal predicate                    Coord  coordination
Pnom   nominal predicate                   Apos   apposition
PredE  existential predicate               Ante   anteposition
PredC  conjunction as the clause's head    AuxC   conjunction
PredP  preposition as the clause's head    AuxP   preposition
Sb     subject                             AuxE   emphasizing expression
Obj    object                              AuxM   modifying expression
Adv    adverbial                           AuxY   auxiliary, part of compound
Atr    attribute                           AuxG   graphical symbol
Atv    complement                          AuxK   sentence separator
ExD    ellipsis, no actual dependency      _      excessive token, esp. due to typo
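To see how these functions are distributed in the converted data, a frequency count over the dependency relation column is often enough. The sketch below assumes the CoNLL-X column layout, in which the dependency relation is the 8th tab-separated field and sentences are separated by blank lines; the function name `count_deprels` is hypothetical.

```shell
# Hypothetical helper: count how often each dependency function occurs.
# Assumes tab-separated CoNLL-X columns with DEPREL in column 8;
# blank separator lines have fewer than 8 fields and are skipped.
count_deprels() {
    awk -F'\t' 'NF >= 8 { count[$8]++ }
                END { for (label in count) print count[label] "\t" label }' "$@" |
        LC_ALL=C sort -k1,1nr -k2,2    # most frequent first, then by label
}
```

It could be run as `count_deprels conll/*.conll`, or fed from standard input.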
The conversion script from the original FS format to the CoNLL format produces files with the .conll extension. The script is run as follows:
    btred -Qm padt-conll.btred syntax/*.syntax.fs
    mkdir conll
    mv syntax/*.syntax.fs.conll conll/
The data use the UTF-8 encoding as required. It may, however, be preferable to view the data in the Buckwalter transliteration if rendering the Arabic script poses problems. We recommend using the Encode Arabic libraries in Perl or Haskell to convert the data easily.
To use the Perl library from the command line, code like this would do:
    # calling the module's functions in a one-liner
    cat PADT-data-in-CoNLL-format | \
        perl -MEncode::Arabic -pe '$_ = encode "buckwalter", decode "utf8", $_'

    # running the scripts installed with the module
    cat PADT-data-in-CoNLL-format | encode "buckwalter"
To use the module for reducing the vocalization, or to choose the XML-compliant variant of the Buckwalter transliteration, one can set the modes of conversion easily. Consider e.g. the following script, which removes any vocalization marks from the tokenized word forms supplied in the second column of the CoNLL data:
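    use Encode::Arabic ':modes';

    # Set the conversion modes for encoding and decoding (XML-compliant Buckwalter).
    enmode "buckwalter", 'full', 'xml';
    demode "buckwalter", 'noneplus', 'xml';

    while ($line = <>) {
        @cols = split /\t/, decode "utf8", $line;

        # Pass through lines without a word-form column, e.g. sentence separators.
        if (@cols < 2) {
            print $line;
            next;
        }

        # Leave forms containing ASCII characters alone; round-trip the rest
        # through the Buckwalter transliteration to reduce the vocalization.
        unless ($cols[1] =~ /[\x{20}-\x{7F}]/) {
            $in_buck = encode "buckwalter", $cols[1];
            $cols[1] = decode "buckwalter", $in_buck;
            warn $in_buck . "\n";    # report the transliterated form on STDERR
        }

        print encode "utf8", join "\t", @cols;
    }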
More examples are available in the CPAN documentation.
The link to the last year's CoNLL-X and the proceedings ...