Word Alignment in Bilingual Parallel Corpora





















Mohith Julapalli
Sameer Dhond
Natural Language Processing
Final Project Spring 2003

Introduction

This project attempts the task of word alignment using parallel text in different languages.
Specifically, it tries to build a bilingual dictionary given a set of French and English parallel texts
obtained from the Hansard's corpora. The utility of this word alignment model is as a part of a
larger machine translation system.

Initially, we intended on implementing algorithms by Kay and Roscheisen (1993) and Stanley F.
Chen (1997), and comparing the results. Both are Sentence-Alignment algorithms that produce
as a byproduct Word-Alignment information. However, after developing the method from Kay
and Roscheisen, we found that while it performed reasonably well on its intended task of
Sentence-Alignment, it did not produce as comprehensive a Word-Alignment model as we
hoped. And although there were various threshold parameters we could tune to produce more
words, the accuracy would then degrade considerably. In addition, the algorithm did not scale
well to larger sets of data (which made it even harder to generate a large bilingual dictionary).

We then decided to use an algorithm described by Dan Melamed (1997) specifically for aligning
words. This method performed better than Kay and Roscheisen in accuracy (perhaps because it
receives aligned sentences as input rather than generate the alignments), and also in the size of
the dictionary it produced. We could successfully run this algorithm on 10,000 sentences in each
language, whereas the Kay method was limited to some 1,000.

Both algorithms make a one-to-one assumption—Kay and Roscheisen at the sentence and word
level, and Melamed at the word level. While this may not be a completely accurate assumption,
it makes the problem tractable and produces decent results. Later, we discuss some of the erred
translations our algorithms produced as a result of this assumption.

Dictionary Evaluation

We wrote our algorithms so that they would output tab-delimited bilingual dictionaries. Initially,
we intended on calculating accuracy automatically using an English-French text dictionary and
supplementing it with some online web dictionaries (which we save into text files to avoid
overloading web servers). However, this did not work too well because most of the Hansard data
is conversational. In addition words appear in different morphological forms, making it difficult
to automate the verification. So instead, we manually checked our dictionaries, randomly
sampling a few hundred words when the data became too large.

Algorithm 1: Kay and Roscheisen

In this algorithm, words in the two languages are aligned based on a similarity score—e.g. how
similar their distributions in the sentences are.

Data structures:
Alignable Sentences Table
This is a 2D chart indicating which sentences in one language could possibly correspond to
sentences in the other language. It is initialized with fixed points at 0,0 and m,n (where m,n is
the number of sentences in each language). The possible correspondences are interpolated
between these points, becoming wider (on the order of sqrt(N)) towards the middle of the text.


Figure 1. Initial Alignable Sentence Table


Word Alignment Table
The WAT is basically a list of pairs of words and a score indicating how similar the distributions
in their corresponding sentences are. For our purpose, the final dictionary is pulled from the
WAT after the last iteration.

Sentence Alignment Table
The SAT contains sentence pairs that are deemed to be aligned. These pairs are a subset of
points in the AST, and are used to regenerate the AST after each iteration. (See the red points in
figures 4 and 5 below).

Figure 2. Algorithm flow



Alignable
Word

Sen tence
Cartesian
Alignment
pr
Table
oduct
of
all Table
words
in
all
sentences
of
AST
Relax
SAT




Use correspondence sets
using




to compute new SAT
interpolation
Sentence

Alignment

Table

Summary of the algorithm:
1. Initialize the AST (as described above).
2. Compute a WAT from the AST by taking a cartesian product of the words and
calculating their similarity.

Similarity = 2 * c / (Na(v) + Nb(w))


c is the size of the largest correspondence set. A correspondence set is the set of
sentences in which both words appear, with no overlap.


Na(v) and Nb(w) are the frequencies of word v in text A and w in B, respectively.

3. Threshold and sort the WAT. We have to partition the WAT into several segments based
on frequency because we give more value to the words that appear with high frequency
and high similarity.
4. Compute a SAT from the WAT by taking the maximum probability alignment. We
iterate through the WAT in order. For any words that are paired, pair their corresponding
sentences for the SAT. Throw out any WAT pairs that cause overlap in the sentence
alignments. For example, if sentence 1 in English is aligned to sentence 3 in French, then
2 in English cannot align to 1 in French.

Figure 3. Example of an invalid (overlapping) sentence alignment

English French
Sentence 1

Sentence 1

Sentence 2

Sentence 2

Sentence 3

Sentence 3


5. Use the entries from the SAT as fixed points in the AST, and interpolate between these
fixed points to produce a new AST. Interpolation is done in the same manner as the
initial AST, except now we interpolate between each pair of entries in the SAT.
6. Iterate steps 2-4 until the SAT converges. The SAT is the final aligned sentences, and the
final WAT is the aligned words.















Figure 4. AST after 1 iteration (with 12 entries in the SAT).



Figure 5. AST after convergence (with 75 entries in the SAT)


Optimizations
There are a few optimizations to the algorithm to help it run faster and also throw out any
unlikely data:

1. After computing the WAT, throw out any pairs whose similarity score is below a certain
threshold.

2. After computing the SAT, throw out any pairs that do not have a large number of
correspondences.

These two thresholds can be tuned to tradeoff between accuracy and recall; the lower the
threshold, the more words/sentences in the alignment, but the more likely chance for error. In
addition, the thresholds are not constant per iteration. Each iteration they are slightly lowered.

We decrease the thresholds in each iteration for the WAT and SAT because early in the process
we want high accuracy, particularly because the entries in the SAT persist between iterations.
Later in the process, we decrease them to allow more entries. This is valid because in later
iterations the space in the AST is much smaller (Figure 5. versus Figure 4), causing less error.

Testing
To test the algorithm, we used the Hansard data. Since the Hansard data already has aligned the
sentences, we threw out those alignments and tried to generate them with the algorithm.

We tried sets of 100, 200, 500, and 1000 sentences. The algorithm uses a cartesian product of
words in the sentences, so memory became an issue with the Word Alignment Table for any
larger sets of sentences. In their paper, Kay and Roscheisen report a similar limitation on data
size.

Results
The algorithm did fairly well with sentence alignment.

Number of sentences
Accuracy (number correct/total)
Recall (number found/total)
100 1.0
0.75
200 0.87
0.91
500 0.90
0.90
1000 0.93
0.91

We verified the resulting dictionaries by hand, and found that they were quite accurate, but
small. We could only get a few hundred word correspondences from the WAT.

Number of
Dictionary Accuracy
Number of Entries
Dictionary (sorted by
sentences
(approximate)
score descending)
100 0.76
38
link
200 0.69
83
link
500 0.82
131
link
1000 0.86
261
link
We suspect that if we spent a lot of time tuning the thresholds we could achieve even higher
accuracy and recall with sentence alignment. Changing these thresholds per iteration adds a few
variables: how much to decrement them, and how low they should go.

Algorithm 2: Melamed

The second algorithm we implemented came from a Melamed paper about parallel word
alignments.


Figure 6. Algorithm flow


Word Pair
Linked Tokens

Likelihoods

Competitive


Linking


Algorithm


Use model



Hill climb to maximize
to recalculate Mode l

P(links|model)
likelihoods
Parameters





Summary of Algorithm

1. Initialize likelihoods of each word pair (probability that these words are mutual
translations). We use as a score:

Initial Likelihood score = N(uv) / (Nu + Nv)

N(uv) is the number of times words u and v co-occur (two words u and v co-occur if u
appears in a segment in one text that is aligned with a segment in which v appears in the
other text). Since the Hansard corpus has aligned sentences, we use these sentences as
our “segments”.
Nu is the number of times u co-occurs with any v in the second language
Nv is the number of times v co-occurs with any u in the second language

2. Use the competitive linking algorithm to link tokens. For each segment, we find the
words u and v from the two texts that have the highest likelihood score among the pairs
of words in the segment. We mark tokens of these words as linked in the current
segment, and we increment a count that these words were linked. We continue to link
tokens in the current segment until either all words are linked or no more pairs have
likelihood scores (the remaining unlinked words are dropped).

3. Estimate parameters lambda-plus (true positive rate – the probability of a link given a
correct translation) and lambda-minus (false positive rate – probability of a link given an
incorrect translation). We find the parameters that maximize the expression below:

P(links|model) = Product for all pairs u,v of P(Kuv | Nuv, lambda-plus, lambda-minus)

Kuv is the number of links for words of type u and v.
Nuv is the number of co-occurrences for words of type u and v (these counts do not
change over the course of the algorithm)


The equation assumes a binomial distribution for Kuv. For more details, please see
Melamed 1997.

We use a hill-climbing search to find the maximum of this probability. To begin, we start
with a guess for lambda-plus and lambda-minus, find the probability with these
parameters, and then slightly move the lambdas in whichever direction produces a higher
probability. Hill-climbing could get stuck on local optima, so we choose a step size large
enough to leap over most of these local solutions (as Melamed suggests).

4. Use the lambda parameters to recalculate likelihoods.

L(u,v) = B(Kuv | Nuv, lambda-plus) / B(Kuv | Nuv, lambda-minus)

(where B represents a binomial)

5. Repeat steps 2-4 until convergence. We define convergence as when P(links|model)
stays relatively constant from the last iteration.

Optimizations
We made a number of optimizations to the above algorithm. Some of these are referred to in
Melamed’s assorted papers, and others we implemented on our own.

First, we made a series of memory optimizations in our code. For example, instead of
duplicating words, we refer to them by a unique code and keep a table of code to word mappings.

To initialize our likelihood scores (step 1), we tried using the words’ marginal frequencies
(instead of the counts we described above). The results did not change after that modification,
suggesting that the algorithm is not heavily weighted on the initial distribution (within reason).
Also, initially, if words appeared multiple times within a segment, their co-occurrence would be
described as the product of their respective frequencies (for example if “the” appeared 4 times
and “la” 3 times in corresponding segments, their co-occurrence would be 12). Later we
modified our definition of co-occurrence to be the minimum of their respective frequencies. Our
reason for this is because the previous formula implies that each word token has multiple
translations, whereas the latter stays true to our one-to-one assumption.

In addition, instead of using the strict definition of a binomial ((n choose k) * p^n * (1-p)^(n-k)),
we dropped the (n choose K) term. We dropped it because it was extremely computationally
expensive (we needed to convert our data types and do all computations with BigDecimal and
BigInteger in Java).

The effect of dropping this term was to drop the links of common words, like "the", "and", and
"or". Because the words are so common, their respective k and n values are in the hundreds.
Effectively, their probability portion (p^n * (1-p)^(n-k)) becomes extremely small. Without the
(n choose k) term, the number becomes negligible.

In practice we found that the dictionary did not suffer too much from these exclusions; often we
would find incorrect mappings (like "of" to "la" or "the" to "de") at intermediate iterations.


Finally, we noticed that we could considerably improve the speed of the hill-climbing search for
optimal parameters by using the values we found from the previous iteration for our initial guess
(instead of random values).

One of the key positive features of Melamed's algorithm is how few adjustable parameters there
are. As he states, there is only a single threshold, used to throw out word pairs on each iteration,
that affects the tradeoff between accuracy and recall.

Testing
We ran the algorithm on 100, 200, 500, 1000, 5000 aligned Hansard sentences.

Results

Progression of Model parameters on 2000 iterations
Iteration Lambda-plus
Lambda-minus
Log(P(links|model))
1 0.933531
0.030358
-76380.1
2 0.948599
0.011616
-18962.1
3 0.928099
0.010116
-18773.3
4 0.904099
0.007116
-15267.1
5 0.915599
1.16E-04
-9305.58
6 0.978099
1.16E-04
-6354.32
7 0.980099
1.16E-04
-6270.09
8 0.980099
1.16E-04
-6270.09

Resulting dictionary data
Number of
Dictionary
Number of Entries
Dictionary sorted Dictionary sorted
Sentences
Accuracy
alphabetically
by score
(approximate)
100 0.79 97
link
link
200
0.89
148
link
link
500 0.91 339
link
link
2000 0.93
1190
link
link
5000 0.91
2263
link
link

Figure 7 shows how the probability of the links given the model increases. Theoretically, as the
translations improve, lambda-plus approaches 1, and lambda-minus approaches 0.

Figures 8 and 9 show how the recall and accuracy change depending on the number of sentences
compared. Clearly the limitation of the Kay and Roscheisen algorithm is the entry count is not
very large, and the memory usage prevents us from trying larger data sets.








Figure 7. Model probability change over time
Probability change over tim e (2000 sentences)
Iteration num ber
10000
0 0
1
2
3
4
5
6
7
8
9
-10000
-20000
)
)

-30000
odel
m
-40000
ks|
l
i
n
-50000
P( -60000
l
og(

-70000
-80000
-90000
log(P(links|model))

Figure 8. Dictionary size versus data size
Dictionary size com parison
2500
2000
1500
r
y count
1000
Ent
Melamed
500
Kay-Roscheisen
0
0
500
1000 1500 2000 2500 3000 3500 4000 4500 5000 5500
Num ber of sentences compared



Figure 9. Dictionary accuracy versus data size
Accuracy comparison
1
0.9
0.8
0.7
0.6
acy 0.5
ccur 0.4
A
0.3
Melamed
0.2
Kay-Roscheisen
0.1
0
0
500
1000 1500
2000
2500
3000
3500
4000 4500
5000
5500
Number of sentences compared



Melamed's algorithm uses the one-to-one assumption at the word level. As we've discussed in
class, this assumption that each English word maps to a single French word (and vice versa) is
not entirely accurate but significantly improves algorithm formulation and efficiency. We did
notice the effects of the assumption, and it caused many incorrect translations in our dictionary.

For example, "if" mapped to the French word "l'affirmative", rather than "si". Upon further
inspection, we found that “if” appears many times in the phrase "if so" in the corpus, and that
phrase corresponds to "l'affirmative". In that sense, our results are not completely wrong.

Another common mistake was to swap definitions in phrases. For example, we incorrectly
mapped "throne" to "discours" and "speech" to "trone" (the definitions are reversed). In
actuality, the set of sentences in the Hansard corpus almost always had the phrase "Speech of the
Throne" in English, so there is really no way to distinguish which words correspond using purely
internal (and no syntactic) information.

Conclusions

For our purposes, the Melamed algorithm performed significantly better than the Sentence-
Alignment method we initially tried. However, it has the advantage that it uses aligned data at
the sentence level whereas that is the output of the sentence-alignment algorithm.

Perhaps an interesting extension of this project would be to combine the two algorithms to start
with parallel texts, and use Kay and Roscheisen or another sentence alignment algorithm to align
sentences and then Melamed's method to generate a dictionary of aligned words.

One of the biggest problems with the algorithms was the complexity. While Melamed's
algorithm performed better, both still took significant computation to analyze a few hundred or

thousand sentences. One idea we had was to subdivide the sentences into several sets and run
the computation in parallel. Then, using the dictionary produced by each instance we could have
a voting scheme generate the final dictionary.

Another extension would be to use morphological information to combine words like “cats”,
“cat’s”, and “cat” to a single dictionary entry.

Overall, both algorithms performed reasonably and as the authors described. The results met our
expectations for a word-alignment model that only uses information contained within parallel
texts.



References

Chen, Stanley F. “Aligning Sentences in Bilingual Corpora using lexical information”, In
Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics,
1993.

Kay, Martin and Roscheisen, Martin. “Text-Translation Alignment”, Computational Linguistics,
19:121-142, 1993.

Melamed, Dan. “A Word-to-Word Model of Translational Equivalence”. In Procs. of the ACL97.
pp 490–497. Madrid Spain, 1997.

Melamed, Dan. “Models of Co-occurrence”, ICRS Technical Report 98-05, University of
Pennsylvania, 1998.

Melamed, Dan. “A Portable Algorithm for Mapping Bitext Correspondence”, In ACL97, 1997.