198k views
0 votes
Imagine that you are given a very large corpus containing pairs of sentences that are translations of each other. For example, here are some English-French pairs from Web pages (with the sentences tokenized):

Offered to sell: imported cans containing CFCs.
Pour avoir offert: vendre de les cannettes importées contenants de les CFC.

You wish to sell your refrigerator, your game console or your brother-in-law ?
Vous souhaitez vendre votre frigo, votre console de jeux ou votre beau-frère ?

Let me know if you are looking to buy or sell ?
Est-ce que vous cherchez acheter ou vendre ?
Suppose you are given the task of analyzing a corpus to create a scored translation lexicon, that is, triples (e, f, score) where e and f are English and French words, respectively, and score is higher when you're more confident that they are translations of each other. For example, training on a large corpus, one might expect to produce (you, vous, s1) and (sell, vendre, s2) with high values for the scores s1 and s2, since vous and you are translations of each other, and similarly sell and vendre are translations of each other.
One approach to discovering words that are translations of each other, given a parallel corpus like this, is to identify word pairs (e, f) that are "sticky" across the two languages, in the same way that we discover "sticky bigrams" (w1, w2), e.g. united states, within a single language. This can be done, for example, using pointwise mutual information (PMI).
(a) Briefly explain the definition of pointwise mutual information and why the value of PMI(x,y) tends to be higher when x and y co-occur in meaningful ways.
(b) Assume the above corpus is normalized by converting all tokens to lowercase and throwing away tokens that are only punctuation. Compute PMI(you, vous) and PMI(sell, vendre), where count(e, f) for a word pair is defined as the number of paired sentences in which the two words co-occur. Despite all my reminders that zeroes are bad, you should not do smoothing when estimating probabilities for this problem.
(c) Briefly discuss advantages and disadvantages of using this method to discover words that are translations of each other.

by Farhana (7.6k points)

1 Answer

4 votes

Final Answer:

(a) Pointwise Mutual Information (PMI) is a measure used to assess the association between two words in a corpus. It is defined as \[ PMI(x, y) = \log \left( \frac{P(x, y)}{P(x)P(y)} \right) \], where \( P(x, y) \) is the probability of the co-occurrence of words \( x \) and \( y \), and \( P(x) \) and \( P(y) \) are their individual probabilities. The value of PMI tends to be higher when \( x \) and \( y \) co-occur in meaningful ways because it compares the observed co-occurrence with the expected co-occurrence under independence. Higher values indicate a stronger association than would be expected by chance.
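
For a quick illustration with made-up probabilities (not taken from the corpus): if \( P(x) = P(y) = 0.1 \) and the two words are seen together with \( P(x, y) = 0.05 \), then \( \mathrm{PMI}(x, y) = \log_2 \frac{0.05}{0.1 \times 0.1} = \log_2 5 \approx 2.32 \) bits, whereas if they were independent, \( P(x, y) \) would be \( 0.01 \) and the PMI would be \( \log_2 1 = 0 \).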

(b) Assuming the corpus is normalized and tokenized as described, PMI is computed for the pairs (you, vous) and (sell, vendre) by counting, over the three example sentence pairs, how many pairs contain each word and how many contain both, then estimating the probabilities \( P(x, y) \), \( P(x) \), and \( P(y) \) directly from these counts with no smoothing. This gives \( \mathrm{PMI}(\text{you}, \text{vous}) = \log_2 \frac{3}{2} \approx 0.58 \) bits, while \( \mathrm{PMI}(\text{sell}, \text{vendre}) = \log_2 1 = 0 \) bits, because sell and vendre occur in every sentence pair.
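
These numbers can be verified directly, under the assumption that the three sentence pairs shown in the question make up the entire corpus (so \( N = 3 \)), with probabilities estimated as sentence-pair relative frequencies and logarithms taken base 2:

\[
\begin{aligned}
\mathrm{PMI}(\text{you}, \text{vous}) &= \log_2 \frac{P(\text{you}, \text{vous})}{P(\text{you})\,P(\text{vous})} = \log_2 \frac{2/3}{(2/3)(2/3)} = \log_2 \frac{3}{2} \approx 0.58 \text{ bits}, \\
\mathrm{PMI}(\text{sell}, \text{vendre}) &= \log_2 \frac{P(\text{sell}, \text{vendre})}{P(\text{sell})\,P(\text{vendre})} = \log_2 \frac{1}{1 \cdot 1} = 0 \text{ bits},
\end{aligned}
\]

since you and vous each appear in two of the three sentence pairs and co-occur in both, while sell and vendre appear in all three. Note the perverse result: the perfectly good translation pair (sell, vendre) scores 0 precisely because both words are so common in this tiny corpus.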

(c) The advantage of using PMI to discover translations is that it identifies word pairs that co-occur far more often than chance would predict, capturing strong associations. However, it is unreliable for rare words, whose PMI estimates are noisy and often inflated, and it conflates the senses of words with multiple meanings. It also relies heavily on the quality and representativeness of the parallel corpus, and because it looks only at sentence-level co-occurrence, it ignores word order and alignment entirely, so non-translation pairs that habitually appear together can score as highly as true translations; a lexicon learned this way may also not generalize well to other domains. Despite these limitations, PMI offers a valuable approach to extracting translation lexicons from large parallel corpora.

Step-by-step explanation:

(a) Pointwise Mutual Information (PMI) is a measure that assesses the association between two words in a corpus. It is defined as the logarithm of the ratio of the observed co-occurrence of words \( x \) and \( y \) (\( P(x, y) \)) to the expected co-occurrence under independence (\( P(x)P(y) \)). In other words, it quantifies the deviation of the observed co-occurrence from what would be expected by chance. Higher PMI values indicate a stronger association between words, suggesting they are more likely to be translations.

(b) In the context of the given corpus, assuming it has been normalized and tokenized, the computation of PMI reduces to counting: for each word pair, the number of sentence pairs in which the two words co-occur, and for each word, the number of sentence pairs containing it on its own side. Probabilities are then estimated directly from these counts by maximum likelihood; note that, as the problem instructs, no smoothing is applied here even though unsmoothed zero counts are normally dangerous. For example, \( \text{PMI(you, vous)} \) uses count(you) = 2, count(vous) = 2, and count(you, vous) = 2 over the \( N = 3 \) sentence pairs, while sell, vendre, and the pair (sell, vendre) each have count 3.
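
A minimal Python sketch of this computation, assuming the three example pairs form the entire corpus, that count(e) and count(f) are likewise counted at the sentence-pair level, and that normalization and tokenization reduce to lowercasing and whitespace splitting here:

```python
import math
from collections import Counter

# Toy parallel corpus: the three sentence pairs from the question,
# already lowercased with punctuation-only tokens removed.
pairs = [
    ("offered to sell imported cans containing cfcs",
     "pour avoir offert vendre de les cannettes importées contenants de les cfc"),
    ("you wish to sell your refrigerator your game console or your brother-in-law",
     "vous souhaitez vendre votre frigo votre console de jeux ou votre beau-frère"),
    ("let me know if you are looking to buy or sell",
     "est-ce que vous cherchez acheter ou vendre"),
]

n = len(pairs)  # number of paired sentences, N = 3
e_counts, f_counts, ef_counts = Counter(), Counter(), Counter()

for en, fr in pairs:
    e_set, f_set = set(en.split()), set(fr.split())
    e_counts.update(e_set)  # sentence pairs whose English side contains e
    f_counts.update(f_set)  # sentence pairs whose French side contains f
    ef_counts.update((e, f) for e in e_set for f in f_set)  # pairs where e and f co-occur

def pmi(e, f):
    """PMI in bits, from unsmoothed MLE probabilities (zero joint count -> -inf)."""
    if ef_counts[(e, f)] == 0:
        return float("-inf")
    p_ef = ef_counts[(e, f)] / n
    return math.log2(p_ef / ((e_counts[e] / n) * (f_counts[f] / n)))

print(pmi("you", "vous"))       # log2(3/2) ≈ 0.585
print(pmi("sell", "vendre"))    # log2(1)   = 0.0
print(pmi("you", "souhaitez"))  # also ≈ 0.585, despite not being translations
```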

(c) The advantages of using PMI for discovering translations lie in its simplicity and its ability to surface word pairs whose co-occurrence greatly exceeds what chance would predict. Its limitations include unstable estimates for rare words, confusion from polysemous terms, reliance on the quality of the parallel corpus, and the fact that sentence-level co-occurrence ignores word order and alignment, so words that merely tend to appear in the same sentence pairs can score as highly as genuine translations. Despite these drawbacks, PMI serves as a valuable tool for extracting translation lexicons, offering insight into the relationships between words in different languages based on their co-occurrence patterns.
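
That last failure mode is visible even in the three example pairs: counting the same way as in part (b), souhaitez appears only in the second pair while you appears in the second and third, so \( \mathrm{PMI}(\text{you}, \text{souhaitez}) = \log_2 \frac{1/3}{(2/3)(1/3)} = \log_2 \frac{3}{2} \approx 0.58 \) bits, exactly the score of the true translation pair (you, vous), even though souhaitez ("wish") is not a translation of you.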

by Iamdanchiv (7.8k points)