HLT-NAACL 2003 Workshop: Building and Using Parallel Texts
Data Driven Machine Translation and Beyond, pp. 36-39
Edmonton, May-June 2003
TREQ-AL: A word alignment system with limited language resources
Dan Tufiş, Ana-Maria Barbu, Radu Ion
Romanian Academy Institute for Artificial Intelligence
13, “13 Septembrie”, 74311, Bucharest 5, Romania
{tufis,abarbu,radu}@racai.ro
Abstract
We provide a rather informal presentation of a prototype system for word alignment based on our previous translation equivalence approach, discuss the problems encountered in the shared task on word-aligning a parallel Romanian-English text, present the preliminary evaluation results and suggest further ways of improving the alignment accuracy.
1 Introduction
In (Tufiş and Barbu, 2002; Tufiş, 2002) we described our extractor of translation equivalents, called TREQ. It was aimed at building translation dictionaries from parallel corpora. We described in (Ide et al. 2002) how this program is used in word clustering and in checking the validity of the cross-lingual links between the monolingual wordnets of the multilingual Balkanet lexical ontology (Stamou et al. 2002). In this paper we describe the TREQ-AL system, which builds on TREQ and aims at generating a word-alignment map for a parallel text (a bitext). TREQ-AL was built in less than two weeks for the Shared Task proposed by the organizers of the workshop on “Building and Using Parallel Texts: Data Driven Machine Translation and Beyond” at the HLT-NAACL 2003 conference (http://www.cs.unt.edu/~rada/wpt/index.html#shared). It can be improved in several ways that became conspicuous when we analyzed the evaluation results. TREQ-AL has no need for an a priori bilingual dictionary, as this will be automatically extracted by TREQ. However, if such a dictionary is available, both TREQ and TREQ-AL can make use of it. This ability allows both systems to work in a bootstrapping mode and to produce larger dictionaries and better alignments as they are used.
Word alignment, as defined in the shared task, is different from, and harder than, the problem of translation equivalence as previously addressed. In a dictionary extraction task a translation pair is considered correct if there is at least one context in which it has been rightly observed. A pair occurring multiple times counts only once for the final dictionary. This is in sharp contrast with the alignment task, where each occurrence of the same pair counts equally.
Another feature differentiating the two tasks is the status of function word links. In extracting translation equivalents one is usually interested only in the major categories (open classes). In our case (because of the WordNet-centered approach of our current projects) we were especially interested in POS-preserving translation equivalents. However, since in EuroWordNet and Balkanet one can define cross-POS links, translation equivalents with different POS also became of interest (provided these categories are major ones).
The word alignment task requires that each word (irrespective of its POS) or punctuation mark in both parts of the bitext be assigned a translation in the other part (or the null translation, if that is the case).
Finally, the evaluations of the two tasks, even if both use the same measures (precision and recall), have to be judged differently. The null alignments in a dictionary extraction task have no significance, while in a word alignment task they play an important role (in the Romanian-English gold standard data the null alignments represent 13.35% of the total number of links).
2 The preliminary data processing
The TREQ system requires sentence-aligned parallel text that is tokenized, tagged and lemmatized. The first problem we had with the training and test data was related to the tokenization. In the training data there were several occurrences of glued words (probably due to a problem in the text export of the initial data files) plus an unprintable character (hexadecimal code A0) that generated several tagging errors due to the imperfect performance of the guesser (about 70% accurate).
To remedy these inconveniences we wrote a script that automatically split the glued words and eliminated the unprintable characters occurring in the training data. The set of splitting rules, learnt from the training data, was posted on the site of the shared task. The set of rules is likely to be incomplete (some glued words might have survived in the training data) and also might
produce wrong splitting in some cases (e.g. turnover always being split into turn over).
The text tokenization, as considered by the evaluation protocol, was the simplest possible one, with white spaces and punctuation marks taken as separators. The hyphen (‘-’) was always considered a separator and consequently was always taken to be a token by itself. However, in Romanian, the hyphen is more frequently used as an elision marker (as in “intr-o” = “intru o” / in a), a clitic separator (as in “da-mi-l” = “da -mi -l” = “da mie el” / give to me it/him) or a compound marker (as in “terchea-berchea” / (approx.) loafer) than as a separator. In such cases the hyphen cannot be considered a token. A similar problem appeared in English with respect to the apostrophe, which was dealt with in three different ways: it was sometimes split off as a distinct token (we’ll = we + ’ + ll), sometimes attached to the string (a contracted positive form or a genitive marker) immediately following it (I’m = I + ’m, you’ve = you + ’ve, man’s = man + ’s, etc.) and systematically left untouched in the negative contracted forms (couldn’t, wasn’t, etc.).
Since our processing tools (especially the tokeniser) were built with a different segmentation strategy in mind, we generated the alignments based on our own tokenization and, at the end, we “re-tokenised” the text according to the test data model (and consequently re-indexed all the linking pairs).
For tagging the Romanian side of the training bitext we used the tiered-tagging approach (Tufiş, 1999), but we had to construct a new language model since our standard model was created from texts containing diacritics. As the Romanian training data did not contain diacritical characters, this was by no means a trivial task in the short period of time at our disposal (actually it took most of the training time). The lack of diacritics in the training data and the test data induced spurious ambiguities that degraded the tagging accuracy by at least 1%. That is, we estimate that on normal Romanian text (containing the diacritical characters) the performance of our system would have been better. The English training data was tagged by Eric Gaussier, warmly acknowledged here. As the tagsets used for the two languages in the parallel training corpus were quite different, we defined a tagset mapping and translated the tagging of the English part into a tagging closer to the Romanian one. This mapping introduced some ambiguities that were solved by hand. Based on the training data (both Romanian and English texts), tagged with similar tagsets, we built the language models used for the test data alignment.
POS-preserving translation equivalence is too restrictive a condition for the present task, so we defined a meta-tagset, common to both languages, that takes frequent POS alternations into account. For instance, the verb, noun and adjective tags in both languages were prefixed with a common symbol, given that verb-adjective, noun-verb, noun-adjective and the other combinations are typical for Romanian-English translation equivalents that do not preserve the POS. With these prefixes, the initial algorithm for extracting POS-preserving translation equivalents could be used without any further modifications. Using the tag prefixes seems to be a good idea not only for legitimate POS-alternating translations, but also for overcoming some typical tagging errors, such as participles versus adjectives. In both languages, this is by far the most frequent tagging error made by our tagger.
The last preprocessing phase is encoding the corpus in the XCES-Align-ana format used in the MULTEXT-EAST corpus (see http://nl.ijs.si/ME/V2/), which is the standard input for the TREQ translation equivalents extraction program. Since the description of TREQ is given extensively elsewhere, we will not go into further details, except to say that the resulting translation dictionary extracted from the training data contains 49283 entries (lemma-form). The filtering of the translation equivalent candidates (Tufiş and Barbu, 2002) was based on the log-likelihood and the cognate scores, with threshold values set to 15 and 0.43 respectively. We roughly estimated the accuracy of this dictionary based on the aligned gold standard: precision is about 85% and recall is about 78% (remember, the dictionary is evaluated in terms of lemma entries, and the non-matching meta-category links are excluded).
3 The TREQ-AL linking program
This program takes as input the dictionary created by TREQ and the parallel text to be word-aligned. The alignment procedure is a greedy one and considers each aligned translation unit independently of the other translation units in the parallel corpus. It has 4 steps:
1. left-to-right pre-alignment
2. right-to-left adjustment of the pre-alignment
3. determining alignment zones and filtering out suspicious links
4. the word-alignment inside the alignment zones
3.1 The left-to-right pre-alignment
For each sentence-alignment unit, this step scans the words from the first to the last in the source-language part (Romanian). The word under consideration is initially linked to all the words in the target-language part (English) of the current sentence-alignment unit that are found in the translation dictionary as its potential translations. If no translations of the source word are identified in the target part of the translation unit, control advances to the next source word. The cognate score and the relative distance are the decision criteria for choosing among the possible links. When consecutive words in the source part are associated with consecutive or close to each
other words in the target part, these are taken as forming an “alignment chain” and, out of the possible links, those considered are the ones that correspond to the densest grouping of words in each language. High cognate scores in an alignment chain reinforce the alignment.
One should note that at the end of this step it is possible to have 1-to-many association links if multiple translations of one or more source words are found in the target part of the current translation unit (and, obviously, they satisfy the selection criteria).
3.2 The right-to-left adjustment of the pre-alignment
This step tries to correct the pre-alignment errors (when possible) and makes a 1-1 choice in the case of the 1-m links generated before. The alignment chains (found in the previous step) are given the highest priority in alignment disambiguation. That is, if for one word in the source language there are several alignment possibilities, the one that belongs to an alignment chain is always selected. Then, if among the competing alignments one has a cognate score higher than the others, it is the preferred one (this heuristic is particularly useful when several proper names occur in the same translation unit). Finally, the relative position of words in the competing links is taken into account to minimize the distance to the surrounding, already aligned words.
The first two phases result in a 1-1 word mapping. The next two steps use general linguistic knowledge to try to align the words that remain unaligned after the previous steps (either because no translation equivalents were found or because the alignment criteria were not met). This could result in n-m word alignments, but also in unlinking two previously linked words, since a wrong translation pair existing in the extracted dictionary might license a wrong link.
3.3 Alignment zones and filtering suspicious links out
An alignment zone (in our approach) is a piece of text that begins with a conjunction, a preposition, or a punctuation mark and ends with the token preceding the next conjunction, preposition, punctuation mark or end of sentence. A source-language alignment zone is mapped to one or more target-language alignment zones via the links assigned in the previous steps (based on the translation equivalents). One has to note that the mapping of the alignment zones is not symmetric. An alignment zone that contains no link is called a virgin zone.
In most cases the words in the source alignment zone (starting zone) are linked to words in the target alignment zone(s) (ending zone(s)). The links with either side outside the alignment zones are suspicious and they are deleted. This filtering proved to be almost 100% correct when the outlier resides in a zone non-adjacent to the starting or ending zones. The failures of this filtering were in the majority of cases due to a wrong use of punctuation in one or the other part of the translation unit (such as an omitted comma, or a comma between the subject and predicate).
3.4 The word-alignment inside the alignment zones
For each unlinked word in the starting zone the algorithm looks for a word of the same category (not meta-category) in the ending zone(s). If such a mapping is not possible, the algorithm tries to link the source word to a target word of the same meta-category, thus resulting in a cross-POS alignment. The possible meta-category mappings are specified by the user in an external mapping file. Any word in the source or target language that is not assigned a link after the four processing steps described above is automatically assigned a null link.
4 Post-processing
As said in the second section, our tokenization was different from the tokenization in the training and test data. To comply with the evaluation protocol, we had to re-tokenize the aligned text and re-compute the indexes of the links. Re-tokenizing the text meant splitting compounds and contracted future forms and gluing together the previously split negative contracted forms (do + n’t = don’t). Although the re-tokenization was a post-processing phase, transparent for the task itself, it caused some links to be missed for the negative contracted forms. In our linking the English “n’t” was always linked to the Romanian negation, and the English auxiliary/modal plus the main verb were linked to the Romanian translation equivalent found for the main verb. Some multi-word expressions recognized by the tokenizer as one token, such as dates (25 Ianuarie, 2001), compound prepositions (de la, pina la), conjunctions (pentru ca, de cind, pina cind) or adverbs (de jur imprejur, in fata), as well as the hyphen-separated nominal compounds (mass-media, prim-ministru), were split, their positions were re-indexed and the initial single link of a split compound was replaced with the set obtained by adding one link from each constituent of the compound to the target English word. If the English word was also a compound, the number of links generated for one aligned multiword expression was equal to N*M, where N is the number of words in the source compound and M the number of words in the target compound.
5 Evaluation
The results of the evaluation of TREQ-AL performance are shown in Table 1. In our submission file, sentence no. 221 was left out by (our) mistake. We used
the official evaluation program to re-evaluate our submission with the omitted sentence included: precision improved by 0.09%, recall by 0.45%, and F-measure and AER by 0.33%. The figures in the first and second columns of Table 1 are those considered by the official evaluation. The last column contains the evaluation of the result that was our main target. Since TREQ-AL produces only “sure” links, AER (alignment error rate; see the Shared Task web page for further details) reduces to 1 - F-measure.
TREQ-AL uses no external bilingual resources. A machine-readable bilingual dictionary would certainly improve the overall performance. The present version of the system (which is far from being finalized) seems to work pretty well on the non-null assignments. This is not surprising, because these links are supposed to be relevant for a translation dictionary extraction system, and this was the very reason we developed TREQ. Moreover, if we consider only the content words (main categories: nouns, verbs, adjectives and general adverbs), which are the most relevant with respect to our immediate goals (multilingual wordnet interlinking and word sense disambiguation), we think TREQ-AL performs reasonably well and is worth improving further.
                Non-null     Null links   Dictionary
                links only   included     entries
Precision       81.38%       60.43%       84.42%
Recall          60.71%       62.80%       77.72%
F-measure       69.54%       61.59%       80.93%
AER             30.46%       38.41%       -
Table 1. Evaluation results
6 Conclusions and further work
TREQ-AL was developed in a short period of time and is not completely tested and debugged. At the time of writing we had already noticed two errors that were responsible for several wrong or missed links. There are also some conceptual limitations which, when removed, are likely to further improve the performance. For instance, all the words in virgin alignment zones are automatically given null links, but the algorithm could be modified to assign all the links in the Cartesian product of the words in the corresponding virgin zones. The typical example for such a case is represented by the idiomatic expressions (tanda pe manda = the list that sum up). A bilingual dictionary of idioms as an external resource would certainly improve the results significantly. Also, with an additional preprocessing phase for collocation recognition, many missing links could be recovered. At present only those collocations that represent 1-2 or 2-1 alignments are recovered.
A major improvement will be to make the algorithm symmetric. There are many cases where, by reversing the source and target languages, new links can be established. This can be explained by the different degrees of polysemy of the translation equivalent words and by the way we associate alignment zones.
The word order in Romanian and English is similar to some extent, but in the present version of TREQ-AL this is not explicitly used. One obvious and easy improvement of TREQ-AL performance would be to take advantage of the similarity in word order and map the virgin zones and, afterwards, the words in the virgin zones.
Finally, we noticed some wrong alignments in the gold standard. One example is the following:
“… a XI – a …” = “… eleventh…”
Our program aligned all 4 tokens in Romanian (a, XI, –, a) to the English token (eleventh), while the gold standard assigned only “XI” to “eleventh” and the other three Romanian tokens were given a null link. We also noticed some alignments that are very hard to achieve (anaphoric links).
7 References
Tufiş, D., Barbu, A.M.: “Revealing translators knowledge: statistical methods in constructing practical translation lexicons for language and speech processing”, in International Journal of Speech Technology, Kluwer Academic Publishers, no. 5, pp. 199-209, 2002.
Tufiş, D.: “A cheap and fast way to build useful translation lexicons”, in Proceedings of the 19th International Conference on Computational Linguistics, COLING2002, Taipei, 25-30 August, 2002, pp. 1030-1036.
Ide, N., Erjavec, T., Tufiş, D.: “Sense Discrimination with Parallel Corpora”, in Proceedings of the SIGLEX Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, ACL2002, July, Philadelphia, 2002, pp. 56-60.
Stamou, S., Oflazer K., Pala K., Christoudoulakis D., Cristea D., Tufiş D., Koeva S., Totkov G., Dutoit D., Grigoriadou M.: “BALKANET A Multilingual Semantic Network for the Balkan Languages”, in Proceedings of the International Wordnet Conference, Mysore, India, 21-25 January 2002.
Tufiş, D.: “Tiered Tagging and Combined Classifiers”, in F. Jelinek, E. Nöth (eds), Text, Speech and Dialogue, Lecture Notes in Artificial Intelligence 1692, Springer, 1999, pp. 28-33.
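As a worked check of the evaluation figures, the following minimal Python sketch recomputes the F-measure and AER values of Table 1 from the reported precision and recall. It assumes, as stated in Section 5, that all links are “sure” links, so that AER reduces to 1 - F-measure; it also assumes F-measure is the usual harmonic mean of precision and recall. The function names are illustrative only and are not part of TREQ-AL or of the official evaluation program.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def aer_sure_only(precision: float, recall: float) -> float:
    """AER in percent, assuming every link is a "sure" link: 100 - F."""
    return 100.0 - f_measure(precision, recall)


# (precision, recall) in percent, as reported in Table 1.
ALIGNMENT_SETTINGS = {
    "non-null links only": (81.38, 60.71),
    "null links included": (60.43, 62.80),
}
DICTIONARY_ENTRIES = (84.42, 77.72)  # Table 1 reports no AER for this column

for name, (p, r) in ALIGNMENT_SETTINGS.items():
    print(f"{name}: F = {f_measure(p, r):.2f}%, AER = {aer_sure_only(p, r):.2f}%")
print(f"dictionary entries: F = {f_measure(*DICTIONARY_ENTRIES):.2f}%")
```

Under these assumptions the recomputed values agree with the F-measure and AER rows of Table 1 to two decimal places.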