Prague Arabic Dependency Treebank ++: January 2007

The CoNLL Shared Task 2007 has been announced. The extended data of PADT will be used in the competition, and we provide their rough characteristics:

Total	116,800 tokens	3,044 trees	378 files	annotated on the levels of analytical syntax and morphology
AEP	9,500 tokens	242 trees	29 files	Arabic English Parallel News
AFE	13,000 tokens	411 trees	48 files	Arabic 10K-word English Translation
ALH	14,500 tokens	312 trees	41 files	Arabic Gigaword
ANN	12,500 tokens	209 trees	17 files	Arabic Gigaword
HYT	25,500 tokens	457 trees	47 files	Arabic Gigaword
XIA	26,500 tokens	888 trees	111 files	Arabic Gigaword
XNH	15,000 tokens	525 trees	85 files	Arabic Gigaword
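
As a quick consistency check, the per-source figures can be summed and compared with the Total row. The sketch below hard-codes the numbers from the table; the function names `padt_sources` and `padt_totals` are ours and purely illustrative.

```shell
# Per-source figures copied from the table: name, tokens, trees, files.
padt_sources() {
cat <<'EOF'
AEP 9500 242 29
AFE 13000 411 48
ALH 14500 312 41
ANN 12500 209 17
HYT 25500 457 47
XIA 26500 888 111
XNH 15000 525 85
EOF
}

# Sum each numeric column and print "tokens trees files" on one line.
padt_totals() {
    padt_sources | awk '{ tok += $2; tree += $3; file += $4 }
                        END { print tok, tree, file }'
}

padt_totals
```

The trees and files add up exactly to 3,044 and 378. The tokens add up to 116,500, slightly below the stated 116,800, which would be consistent with the per-source token counts being rounded; that reading is our assumption.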

This year's data differ from last year's set in two important respects:

The extent and the quality of the annotations have improved. We added new data sources, notably AEP and AFE (with paragraph-aligned translations available). The other data sources are newspaper texts published by Al Hayat, An Nahar, Ummah Press Service, and Xinhua.
The morphology of the earlier data has been reannotated using MorphoTrees, so the format of all data is now consistent and the morphological tags are considerably more informative. Lemmas and glosses based on the Buckwalter lexicon are also provided.

The morphological class identifiers consist of the part-of-speech category and its refinement, and their meanings read:

`VI VP VC`	imperfect, perfect, and imperative verb forms	`N- A- D-`	nouns, adjectives, and adverbs
`C- P- I-`	conjunctions, prepositions, interjections	`G- Q- Y-`	graphical symbols, numbers, abbreviations
`F- FN FI`	particles, esp. negative and interrogative	`S- SD SR`	pronouns, esp. demonstrative and relative
`--`	isolated definite articles	`Z-`	proper names

The attributes and morphosyntactic features associated with individual tokens, i.e. the nodes in the dependency tree, carry the kinds of information listed below. A feature that is linguistically applicable but left unresolved by the annotation is simply not listed with the token:

Mood	`I`ndicative, `S`ubjunctive, or `J`ussive of imperfect verbs, with `D` if undecided between `S` and `J`
Voice	`A`ctive or `P`assive
Person	`1` speaker, `2` addressee, `3` others
Gender	morphologically overt 'gender', `M`asculine or `F`eminine
Number	morphologically overt 'number', `S`ingular, `D`ual, or `P`lural
Case	`1` nominative, `2` genitive, `4` accusative
Defin	morphological 'definiteness', `I`ndefinite, `D`efinite, `R`educed, or `C`omplex
MemberOf	member of syntactic `Co`ordination or `Ap`position
ClauseHead	the token is the head of the given type of a subordinate clause
GramCoref	the pronoun `S-` is a grammatical coreferent, unlike other pronouns that are textual coreferents
InputForm	the token is the first of all that analyze the given orthographical word in the original input
TokenGloss	a clue to the morphological structure of the token
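
Because unresolved features are omitted rather than marked, it can be useful to see how many tokens actually carry each feature. The sketch below assumes the CoNLL-formatted data keep the features in the sixth, tab-separated column as `name=value` pairs joined by `|`, with `_` for an empty set; that layout and the function name `conll_feature_counts` are assumptions, not something stated here.

```shell
# Count, for each feature name, the tokens that list it.
# Assumes column 6 holds name=value pairs joined by "|" and "_" when empty.
# Arguments are input files; standard input is read if none are given.
conll_feature_counts() {
    awk -F '\t' 'NF >= 6 && $6 != "_" {
            n = split($6, feats, "|")
            for (i = 1; i <= n; i++) {
                split(feats[i], kv, "=")   # kv[1] is the feature name
                count[kv[1]]++
            }
        }
        END { for (name in count) print count[name] "\t" name }' "$@" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

The output has one line per feature name, most frequent first, so a feature that is rarely resolved shows up near the bottom.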

The inventory of analytical dependency functions is further explained in one document or another:

Pred	verbal predicate	Coord	coordination
Pnom	nominal predicate	Apos	apposition
PredE	existential predicate	Ante	anteposition
PredC	conjunction as the clause's head	AuxC	conjunction
PredP	preposition as the clause's head	AuxP	preposition
Sb	subject	AuxE	emphasizing expression
Obj	object	AuxM	modifying expression
Adv	adverbial	AuxY	auxiliary, part of compound
Atr	attribute	AuxG	graphical symbol
Atv	complement	AuxK	sentence separator
ExD	ellipsis, no actual dependency	_	excessive token, esp. due to typo

The conversion script from the original FS format to the CoNLL format produces files with the .conll extension. The script is run as follows:

  btred -Qm padt-conll.btred syntax/*.syntax.fs
  mkdir conll
  mv syntax/*.syntax.fs.conll conll/
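
Once the converted files are in place, a quick tally of one column is an easy way to check that the conversion produced the expected labels. The sketch below assumes the usual ten-column, tab-separated CoNLL layout, in which column 5 is the fine-grained tag and column 8 the dependency function; the function name `conll_tally` is ours.

```shell
# Count how often each value occurs in one tab-separated column.
# $1 is the column number; the remaining arguments are input files
# (standard input is read if none are given). Blank lines between
# sentences have too few fields and are skipped.
conll_tally() {
    col=$1
    shift
    awk -F '\t' -v col="$col" 'NF >= col { n[$col]++ }
        END { for (v in n) print n[v] "\t" v }' "$@" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

For instance, `conll_tally 8 conll/*.conll` would list the analytical functions by frequency, and `conll_tally 5 conll/*.conll` the morphological class identifiers, provided the columns are laid out as assumed.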

The data use the UTF-8 encoding, as required. If rendering the Arabic script poses problems, however, it may be preferable to view the data in the Buckwalter transliteration. We recommend the Encode Arabic libraries in Perl or Haskell for converting the data easily.

To use the Perl library from the command line, code like this would do:

  # calling the module's functions in a one-liner

  cat PADT-data-in-CoNLL-format | \
      perl -MEncode::Arabic -pe '$_ = encode "buckwalter", decode "utf8", $_'

  # running the scripts installed with the module

  cat PADT-data-in-CoNLL-format | encode "buckwalter"

To use the module for reducing the vocalization, or to choose the XML-compliant variant of the Buckwalter transliteration, one can set the modes of conversion easily. Consider e.g. the following script, which removes any vocalization marks from the tokenized word forms supplied in the second column of the CoNLL data:

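  use Encode::Arabic ':modes';

  # set the conversion modes for encoding into and decoding from Buckwalter
  enmode "buckwalter", 'full', 'xml';
  demode "buckwalter", 'noneplus', 'xml';

  while (my $line = <>) {

      my @cols = split /\t/, decode "utf8", $line;

      # pass through blank lines and anything without a word-form column
      if (@cols < 2) {
          print $line;
          next;
      }

      # convert only word forms with no characters in U+0020..U+007F
      unless ($cols[1] =~ /[\x{20}-\x{7F}]/) {
          # encode into Buckwalter, then decode back under the reduced mode
          my $in_buck = encode "buckwalter", $cols[1];
          $cols[1] = decode "buckwalter", $in_buck;

          # report the transliterated form on standard error
          warn $in_buck . "\n";
      }

      print encode "utf8", join "\t", @cols;
  }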

More examples are available in the CPAN documentation.

The link to the last year's CoNLL-X and the proceedings ...

Wednesday, January 17, 2007
