Wednesday, January 17, 2007

CoNLL Shared Task 2007

The CoNLL Shared Task 2007 has been announced. An extended version of the PADT data will be used in the competition; its rough characteristics are as follows:

Source   Tokens  Trees  Files  Description
Total   116,800  3,044    378  annotated on the levels of analytical syntax and morphology
AEP       9,500    242     29  Arabic English Parallel News
AFE      13,000    411     48  Arabic 10K-word English Translation
ALH      14,500    312     41  Arabic Gigaword
ANN      12,500    209     17  Arabic Gigaword
HYT      25,500    457     47  Arabic Gigaword
XIA      26,500    888    111  Arabic Gigaword
XNH      15,000    525     85  Arabic Gigaword

This year's data differ from last year's set in two important respects:

  1. The extent and the quality of annotations have improved. We added new data sources, esp. AEP and AFE (with paragraph-aligned translations available). Other data sources are the newspaper texts published by Al Hayat, An Nahar, Ummah Press Service, and Xinhua.
  2. The morphology of the earlier data has been reannotated using MorphoTrees, so that the format of all data is now consistent and the morphological tags are considerably more informative. Lemmas and glosses based on the Buckwalter lexicon are also provided.

The morphological class identifiers consist of the part-of-speech category and its refinement, with the following meanings:

VI VP VC  imperfect, perfect, and imperative verb forms  N- A- D-  nouns, adjectives, and adverbs
C- P- I-  conjunctions, prepositions, interjections      G- Q- Y-  graphical symbols, numbers, abbreviations
F- FN FI  particles, esp. negative and interrogative     S- SD SR  pronouns, esp. demonstrative and relative
--        isolated definite articles                     Z-        proper names
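
To illustrate how the two-character identifiers above split into a category and its refinement, here is a minimal Perl sketch. The descriptions are copied from the table; the names `%POS_CLASS` and `describe_tag` are made up for this example and do not come from any PADT tool.

```perl
use strict;
use warnings;

# Descriptions taken from the table above. The hash and the function
# are hypothetical helpers, not part of the PADT distribution.
my %POS_CLASS = (
    'VI' => 'imperfect verb form',
    'VP' => 'perfect verb form',
    'VC' => 'imperative verb form',
    'N-' => 'noun',
    'A-' => 'adjective',
    'D-' => 'adverb',
    'C-' => 'conjunction',
    'P-' => 'preposition',
    'I-' => 'interjection',
    'G-' => 'graphical symbol',
    'Q-' => 'number',
    'Y-' => 'abbreviation',
    'F-' => 'particle',
    'FN' => 'negative particle',
    'FI' => 'interrogative particle',
    'S-' => 'pronoun',
    'SD' => 'demonstrative pronoun',
    'SR' => 'relative pronoun',
    '--' => 'isolated definite article',
    'Z-' => 'proper name',
);

# Return the category character, the refinement character,
# and a description of the whole identifier.
sub describe_tag {
    my ($tag) = @_;
    my ($category, $refinement) = split //, $tag, 2;
    my $description = exists $POS_CLASS{$tag} ? $POS_CLASS{$tag} : 'unknown class';
    return ($category, $refinement, $description);
}

printf "%s: category %s, refinement %s, %s\n", $_, describe_tag($_)
    for qw(VP N- SR);
```

A `-` in the second position simply means that the category is not refined any further.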

The attributes and morphosyntactic features associated with individual tokens, i.e. the nodes in the dependency tree, include the following kinds of information. A feature can be linguistically applicable but unresolved by the annotation, in which case it is not listed with the token:

Mood        Indicative, Subjunctive, or Jussive of imperfect verbs, with D if undecided between S and J
Voice       Active or Passive
Person      1 speaker, 2 addressee, 3 others
Gender      morphologically overt 'gender', Masculine or Feminine
Number      morphologically overt 'number', Singular, Dual, or Plural
Case        1 nominative, 2 genitive, 4 accusative
Defin       morphological 'definiteness', Indefinite, Definite, Reduced, or Complex
MemberOf    member of syntactic Coordination or Apposition
ClauseHead  the token is the head of the given type of a subordinate clause
GramCoref   the pronoun S- is a grammatical coreferent, unlike other pronouns that are textual coreferents
InputForm   the token is the first of all that analyze the given orthographical word in the original input
TokenGloss  a clue to the morphological structure of the token
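
In CoNLL-X style files, the features of a token conventionally sit in a single column as items separated by `|`, with `_` standing for an empty set. The sketch below assumes that each item has the form `Name=Value` with the names listed above; that syntax, the sample string, and the helper `parse_feats` are assumptions made for illustration and should be checked against the actual data.

```perl
use strict;
use warnings;

# Parse a feature string such as "Case=2|Defin=D" into a hash reference.
# The Name=Value syntax is an assumption, not confirmed by the text above.
sub parse_feats {
    my ($feats) = @_;
    return {} if !defined $feats || $feats eq '' || $feats eq '_';
    my %feat;
    for my $item (split /\|/, $feats) {
        my ($name, $value) = split /=/, $item, 2;
        $feat{$name} = defined $value ? $value : '';
    }
    return \%feat;
}

# Case values as listed in the table above.
my %CASE = (1 => 'nominative', 2 => 'genitive', 4 => 'accusative');

my $feat = parse_feats('Case=2|Defin=D');   # hypothetical token features

print "Case: ", (exists $feat->{Case} ? $CASE{ $feat->{Case} } : 'not listed'), "\n";
print "Mood: ", (exists $feat->{Mood} ? $feat->{Mood}          : 'not listed'), "\n";
```

Since unresolved features are simply not listed, a missing key cannot by itself tell whether the feature is inapplicable to the token or was left undecided by the annotation.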

The inventory of analytical dependency functions is further explained in one document or another:

Pred   verbal predicate                  Coord  coordination
Pnom   nominal predicate                 Apos   apposition
PredE  existential predicate             Ante   anteposition
PredC  conjunction as the clause's head  AuxC   conjunction
PredP  preposition as the clause's head  AuxP   preposition
Sb     subject                           AuxE   emphasizing expression
Obj    object                            AuxM   modifying expression
Adv    adverbial                         AuxY   auxiliary, part of compound
Atr    attribute                         AuxG   graphical symbol
Atv    complement                        AuxK   sentence separator
ExD    ellipsis, no actual dependency    _      excessive token, esp. due to typo

The conversion script from the original FS format to the CoNLL format produces files with the .conll extension. The script is run as follows:

  btred -Qm padt-conll.btred syntax/*.syntax.fs
  mkdir -p conll
  mv syntax/*.syntax.fs.conll conll/
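
To sanity-check the converted files, one could read them back and count trees and tokens. The following Perl sketch assumes the usual CoNLL-X layout: one token per line, ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), and a blank line after each sentence. The function `read_conll` and the sample rows are invented for illustration; the placeholder forms and tag values are not real PADT data.

```perl
use strict;
use warnings;

# Column names following the CoNLL-X convention (assumed here).
my @FIELDS = qw(id form lemma cpostag postag feats head deprel phead pdeprel);

# Parse CoNLL text into a reference to a list of sentences; each sentence
# is a reference to a list of token hashes keyed by the names in @FIELDS.
sub read_conll {
    my ($text) = @_;
    my (@sentences, @current);
    for my $line (split /\n/, $text) {
        $line =~ s/\r$//;
        if ($line =~ /^\s*$/) {              # a blank line closes a sentence
            push @sentences, [@current] if @current;
            @current = ();
            next;
        }
        my %token;
        @token{@FIELDS} = split /\t/, $line; # missing columns stay undefined
        push @current, \%token;
    }
    push @sentences, [@current] if @current; # last sentence without blank line
    return \@sentences;
}

# Made-up sample: two sentences with placeholder forms and lemmas.
my $sample = join "\n",
    join("\t", 1, 'form1', 'lemma1', 'V', 'VP', '_',      0, 'Pred', '_', '_'),
    join("\t", 2, 'form2', 'lemma2', 'N', 'N-', 'Case=1', 1, 'Sb',   '_', '_'),
    '',
    join("\t", 1, 'form3', 'lemma3', 'N', 'N-', '_',      0, 'ExD',  '_', '_');

my $sentences = read_conll($sample);
my $tokens = 0;
$tokens += @$_ for @$sentences;
printf "%d trees, %d tokens\n", scalar @$sentences, $tokens;
```

To process a real file, the whole content can be slurped into a string and passed to `read_conll`; decoding from UTF-8 is left out here to keep the sketch short.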

The data use the UTF-8 encoding, as required. If rendering the Arabic script poses problems, however, it may be preferable to view the data in the Buckwalter transliteration. We recommend the Encode Arabic libraries in Perl or Haskell for converting the data easily.

To use the Perl library from the command line, code like this would do:

  # calling the module's functions in a one-liner

  cat PADT-data-in-CoNLL-format | \
      perl -MEncode::Arabic -pe '$_ = encode "buckwalter", decode "utf8", $_'

  # running the scripts installed with the module

  cat PADT-data-in-CoNLL-format | encode "buckwalter"

To use the module for reducing the vocalization, or to choose the XML-compliant variant of the Buckwalter transliteration, one can set the modes of conversion easily. Consider e.g. the following script, which removes any vocalization marks from the tokenized word forms supplied in the second column of the CoNLL data:

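  use strict;
  use warnings;

  use Encode::Arabic ':modes';

  # set the conversion modes for encoding to and decoding from Buckwalter
  enmode "buckwalter", 'full', 'xml';
  demode "buckwalter", 'noneplus', 'xml';

  while (my $line = <>) {

      my @cols = split /\t/, decode "utf8", $line;

      # blank lines and lines without a word form column pass through as is
      if (@cols < 2) {

          print $line;
          next;
      }

      # skip word forms that contain any ASCII characters
      unless ($cols[1] =~ /[\x{20}-\x{7F}]/) {

          # round trip through Buckwalter with the modes set above
          my $in_buck = encode "buckwalter", $cols[1];
          $cols[1] = decode "buckwalter", $in_buck;

          # report the transliterated form on STDERR
          warn $in_buck . "\n";
      }

      print encode "utf8", join "\t", @cols;
  }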

More examples are available in the CPAN documentation.

The link to the last year's CoNLL-X and the proceedings ...