Langmap

A COS 597A Project
Powered by Pinecone


This page explores how text in different languages affects OpenAI's ada-002 text embedding model. Code for both the research and this site is freely available on GitHub under an MIT License.

  1. Research Questions
  2. Approach
  3. Implementation
  4. Results
  5. Conclusion

Research Questions

How does the language of an input text affect its ada-002 embedding? In particular, how much of the cosine similarity between two embeddings reflects the languages involved, rather than the content of the sentences?

Approach

I sourced 6696 sentences from Tatoeba, a digital repository of sentences and translations. Each of these sentences has translations in 8 different languages:

French (fr), Spanish (es), German (de), Chinese (zh), Japanese (ja), Russian (ru), Portuguese (pt), and English (en)

I chose these languages because they are among the most prevalent languages on the internet. Because they are widely spoken, well-documented languages, it was more likely that ada-002 could embed them accurately. Tatoeba had translations for 6696 sentences across all 8 of these languages, for a total of 53568 sentences.

A typical collection of 8 sentences looks like:

[ "Il faut que j'aille dormir.",
"Tengo que irme a dormir.",
"Ich muss jetzt schlafen.",
"我该去睡觉了。",
"私は眠らなければなりません。",
"Мне пора идти спать.",
"Preciso ir dormir.",
"I have to go to sleep." ]

These 53568 sentences were embedded using ada-002 and stored in a Pinecone vector database.

For each collection of 8 sentences, I compared each sentence in that collection to every other sentence using cosine similarity. This generated a symmetric "similarity matrix", which looks something like:

[[1.0, 0.889, 0.883, 0.874, 0.841, 0.832, 0.889, 0.908],
[0.889, 1.0, 0.866, 0.879, 0.823, 0.829, 0.907, 0.921],
[0.883, 0.866, 1.0, 0.868, 0.83, 0.825, 0.865, 0.894],
[0.874, 0.879, 0.868, 1.0, 0.853, 0.854, 0.861, 0.885],
[0.841, 0.823, 0.83, 0.853, 1.0, 0.81, 0.813, 0.843],
[0.832, 0.829, 0.825, 0.854, 0.81, 1.0, 0.831, 0.847],
[0.889, 0.907, 0.865, 0.861, 0.813, 0.831, 1.0, 0.901],
[0.908, 0.921, 0.894, 0.885, 0.843, 0.847, 0.901, 1.0]]

Here, entry [i][j] is the cosine similarity between the sentence in language i and the sentence in language j, with the languages ordered as in the example collection above (fr, es, de, zh, ja, ru, pt, en).
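
For reference, cosine similarity between two embedding vectors can be computed as in the minimal NumPy sketch below; the helper name is illustrative and not necessarily the one used in the project code.

    import numpy as np

    def cosine_similarity(u, v):
        # Cosine of the angle between two embedding vectors:
        # dot(u, v) / (||u|| * ||v||), ranging from -1 to 1.
        u, v = np.asarray(u), np.asarray(v)
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))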

All 6696 of these matrices were averaged to produce a single mean similarity matrix:

          fr     es     de     zh     ja     ru     pt     en
    fr    1.0    0.882  0.88   0.843  0.841  0.84   0.877  0.9
    es    0.882  1.0    0.874  0.84   0.838  0.843  0.904  0.902
    de    0.88   0.874  1.0    0.847  0.844  0.845  0.871  0.901
    zh    0.843  0.84   0.847  1.0    0.878  0.834  0.843  0.873
    ja    0.841  0.838  0.844  0.878  1.0    0.833  0.838  0.863
    ru    0.84   0.843  0.845  0.834  0.833  1.0    0.844  0.863
    pt    0.877  0.904  0.871  0.843  0.838  0.844  1.0    0.899
    en    0.9    0.902  0.901  0.873  0.863  0.863  0.899  1.0

Download the data here.

This shows how similar, on average, a sentence in one language is to a sentence with the same meaning in another language, according to ada-002.

In addition, each sentence embedding was used to query the Pinecone index with top_k=8. Matches from these queries that were not translations of the query sentence were recorded as outliers.

Below, rows are the language of the sentence used as the query and columns are the languages of the sentences that were "wrongly" returned, in the order fr, es, de, zh, ja, ru, pt, en. Values are the absolute number of false matches out of the 53568 results returned for each query language (6696 queries × 8 matches each). Note that this matrix is not symmetric: for example, the number of English-language false positives from French-language queries is 432, while the number of French-language false positives from English-language queries is 249.

fresdezhjarupten
fr277889051135757432
es8626284141852704330
de5221300363827429
zh916535877668213110
ja343434409053227
ru01012342016135
pt596551152427793205
en2492222545119612017365

Download the data here.

Finally, I used PCA projection, as discussed in class, to visualize how the languages cluster in three dimensions. Below is a 3D scatter plot of 250 sentences randomly selected from the 53568, with a maximum of 1 from each of the 6696 groups of translations. Scroll down to the analysis of the 3D scatter plot here.

[Interactive 3D scatter plot of the PCA projection, color-coded by language]

Download the data here.

Interpretations of these results are discussed in the Results section.

Implementation

After downloading all the data from Tatoeba, run create_pairs.py (Link). This script processes multiple tab-separated values (TSV) files containing translation data. It reads each file, which corresponds to a different language dataset, and builds a dictionary where each key represents a unique identifier and each value is a list of translations of a phrase into various languages. The script ensures that each phrase has a translation in all the specified languages. It removes duplicate translations across languages and retains only those phrases with a complete set of translations. Finally, the script saves the filtered data as a JSON file, where each entry contains translations of a phrase across the specified languages and the unique identifier for that phrase.
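
A minimal sketch of this filtering step is below. It assumes one Tatoeba-style TSV per language sharing a common group identifier per row; the file names, column layout, and output file name are illustrative assumptions, not necessarily what create_pairs.py actually uses.

    import json

    # Assumed layout: one TSV per language, each row "<sentence_id>\t<group_id>\t<text>".
    LANGS = ["fr", "es", "de", "zh", "ja", "ru", "pt", "en"]

    groups = {}  # group_id -> {language: translation}
    for lang in LANGS:
        with open(f"{lang}.tsv", encoding="utf-8") as f:
            for line in f:
                _, group_id, text = line.rstrip("\n").split("\t")
                # Keep a single translation per language (drops duplicates).
                groups.setdefault(group_id, {})[lang] = text

    # Retain only phrases that have a translation in every language.
    complete = {gid: t for gid, t in groups.items() if len(t) == len(LANGS)}

    with open("pairs.json", "w", encoding="utf-8") as f:
        json.dump(complete, f, ensure_ascii=False, indent=2)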

The next script, embed_and_store.py (Link), processes a JSON file containing sentences and generates vector embeddings for them using OpenAI's embedding model. It then stores these embeddings in a Pinecone index for efficient similarity search and retrieval; a rough sketch of the core loop follows the step list below.

Key steps in embed_and_store.py include:

  1. Loading environment variables, including API keys for OpenAI and Pinecone.
  2. Initializing Pinecone with the provided API key and environment settings.
  3. Reading sentences from a specified JSON file.
  4. Checking if a Pinecone index with a given name already exists. If it doesn't, the script creates a new index with specified dimensions. If the index already exists, the script prompts for confirmation to proceed.
  5. Connecting to the Pinecone index.
  6. Iterating over the sentences, generating embeddings using the specified OpenAI model, and storing these embeddings along with metadata in the Pinecone index.
  7. Saving all generated embeddings in a local JSON file for future use.
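
As a rough illustration of steps 4-7 (not the exact script): the sketch below assumes the pre-1.0 openai client and the v2 pinecone-client interface (with an environment setting), and the index name, input file, and metadata fields are placeholders.

    import json, os
    import openai
    import pinecone

    openai.api_key = os.environ["OPENAI_API_KEY"]
    pinecone.init(api_key=os.environ["PINECONE_API_KEY"],
                  environment=os.environ["PINECONE_ENVIRONMENT"])

    INDEX_NAME = "langmap"  # placeholder index name
    if INDEX_NAME not in pinecone.list_indexes():
        # ada-002 embeddings have 1536 dimensions.
        pinecone.create_index(INDEX_NAME, dimension=1536, metric="cosine")
    index = pinecone.Index(INDEX_NAME)

    with open("pairs.json", encoding="utf-8") as f:
        groups = json.load(f)

    all_embeddings = {}
    for group_id, translations in groups.items():
        for lang, sentence in translations.items():
            resp = openai.Embedding.create(model="text-embedding-ada-002", input=sentence)
            vector = resp["data"][0]["embedding"]
            # The vector id encodes the translation group and language so that query
            # results can later be checked against the correct group.
            index.upsert([(f"{group_id}-{lang}", vector,
                           {"group": group_id, "lang": lang, "text": sentence})])
            all_embeddings[f"{group_id}-{lang}"] = vector

    # Keep a local copy of the embeddings for later analysis (step 7).
    with open("embeddings.json", "w") as f:
        json.dump(all_embeddings, f)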

The final script is langmap.py (Link), which processes and visualizes linguistic data to analyze the relationship between sentences in different languages based on their vector representations. It performs three key functions:

  1. Similarity Analysis: The script calculates cosine similarities between all pairs of sentence vectors for each of the 6696 groups, generating a matrix of similarity scores for each. This matrix reflects the relationships between the same sentence in different languages according to ada-002.
    def calculate_all_similarities(vectors):
        # Build the full symmetric similarity matrix, computing each pair only once.
        # cosine_similarity(u, v) is assumed to be defined elsewhere in the script.
        similarities = [[] for _ in range(len(vectors))]
        for i in range(len(vectors)):
            for j in range(i, len(vectors)):
                similarity = cosine_similarity(vectors[i], vectors[j])
                similarities[i].append(similarity)
                if i != j:
                    similarities[j].append(similarity)
        return similarities

    The 6696 matrices produced using the function above are averaged to produce a single mean similarity matrix.

  2. Outlier Detection: The script queries the Pinecone index to find outliers, i.e., instances where sentences that are translations of each other in different languages do not closely match in vector space. This helps identify pairs of languages where translations may not align well in vector representations (a sketch of this query step follows the list).
  3. PCA Visualization: It performs Principal Component Analysis (PCA) to reduce the dimensionality of the vectors for visualization. The script then plots these reduced vectors in a 3D space, color-coded by language, allowing for a visual inspection of how sentences from different languages cluster together (also sketched after the list). To display the plot in 3D when the script runs, uncomment the following line:
    # plt.show()
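
The following sketch illustrates how the outlier query (step 2) and the PCA projection (step 3) might look. It reuses the assumed id/metadata scheme and v2 pinecone-client interface from the embedding sketch above, together with scikit-learn and matplotlib; function and variable names are placeholders, not taken from langmap.py.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    def tally_outliers(index, vector, group_id):
        # Ask Pinecone for the 8 nearest neighbours of this embedding.
        result = index.query(vector=vector, top_k=8, include_metadata=True)
        counts = {}
        for match in result["matches"]:
            # Anything outside this sentence's translation group is a false positive;
            # tally it under the language of the wrongly returned sentence.
            if match["metadata"]["group"] != group_id:
                lang = match["metadata"]["lang"]
                counts[lang] = counts.get(lang, 0) + 1
        return counts

    def plot_pca(vectors, labels):
        # Project the 1536-dimensional embeddings onto their first 3 principal components.
        points = PCA(n_components=3).fit_transform(np.array(vectors))
        fig = plt.figure()
        ax = fig.add_subplot(projection="3d")
        for lang in sorted(set(labels)):
            mask = np.array(labels) == lang
            ax.scatter(points[mask, 0], points[mask, 1], points[mask, 2], label=lang)
        ax.legend()
        plt.show()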

Finally, langmap.py writes the computed similarity matrices, outlier tallies, and PCA points to JSON files.

To try it for yourself, download the code and follow the instructions in README.md.

Results

Mean Similarities

          fr     es     de     zh     ja     ru     pt     en
    fr    1.0    0.882  0.88   0.843  0.841  0.84   0.877  0.9
    es    0.882  1.0    0.874  0.84   0.838  0.843  0.904  0.902
    de    0.88   0.874  1.0    0.847  0.844  0.845  0.871  0.901
    zh    0.843  0.84   0.847  1.0    0.878  0.834  0.843  0.873
    ja    0.841  0.838  0.844  0.878  1.0    0.833  0.838  0.863
    ru    0.84   0.843  0.845  0.834  0.833  1.0    0.844  0.863
    pt    0.877  0.904  0.871  0.843  0.838  0.844  1.0    0.899
    en    0.9    0.902  0.901  0.873  0.863  0.863  0.899  1.0

Consistently, English sentences have a higher cosine similarity to their translations than sentences in the other languages do. (You can also see this in the PCA visualization, where English is close to the center of the projection.) My hypothesis is that because English is likely the most well-documented language in ada-002's training dataset, the embedding model is more sophisticated at associating English sentences with their translations.

The two most similar languages, unsurprisingly, are Spanish and Portuguese. Spanish and Portuguese are both Romance languages, descended directly from spoken Latin. They share much of their vocabulary and grammar, so it is not surprising that their ada-002 similarity would be high.

Chinese and Japanese are generally dissimilar to other languages. Their highest similarities are to each other, at 0.878. Although they are not linguistically related, Chinese and Japanese partially share a writing system due to Japanese kanji, which are adapted versions of Chinese hanzi.

Outlier Tally

fresdezhjarupten
fr277889051135757432
es8626284141852704330
de5221300363827429
zh916535877668213110
ja343434409053227
ru01012342016135
pt596551152427793205
en2492222545119612017365

The patterns in the mean similarities can also be seen in the number of false positives from the Pinecone queries for each sentence in each language.

ada-002 returns false positives in the same language far more often than false positives in any other language, especially for Chinese, Japanese, and Russian, the three languages written in non-Latin scripts.

English-language queries are least likely to produce false positives. I suspect this result is also because it is the most well-documented language in ada-002's training dataset.

Across all languages, English sentences are returned as mistaken matches most frequently. However, Spanish and Portuguese, as well as Chinese and Japanese, are often returned in the top 8 most similar sentences to each other.

For example, the number of Japanese-language false positives from Chinese-language queries was 668. This is far more than the 9 outlier French sentences, 16 Spanish, 5 German, 21 Russian, and 3 Portuguese.

(An anecdote related to the elevated similarity scores between Russian and Chinese: when I was learning Mandarin, we used Soviet textbooks which had been translated from Russian because they were considered higher quality.)

PCA Projection

Scroll up to the 3D scatter plot here.

English sentences are closer to the center of the projection, reflecting their elevated ada-002 similarities to other languages. Spanish and Portuguese overlap, as do Chinese and Japanese. Russian, the only language of the 8 written in Cyrillic script, is isolated to one side. These observations support the expectation that ada-002 is influenced more by the writing system than by the linguistic or grammatical structure of the languages, though it is influenced by both, as the similarity matrix demonstrates.

Conclusion

Overall, this research suggests that between 10% and 15% of a cosine similarity score computed from ada-002 embeddings is attributable to language, with the remainder driven by the content and length of the input. ada-002 is OpenAI's foremost text embedding model, used widely in industry, so this result has practical implications for applications like semantic search, where understanding the balance between language-related and content-related similarity is important for accurate information retrieval.