Have you ever wondered how computers understand text, images, or even sounds? The secret lies in something called embeddings. If you’re new to this, don’t worry — we’re going to explain embeddings in a way that’s easy to understand.
What Are Embeddings?
Imagine you have a bunch of words like “apple,” “banana,” and “orange.” For a computer, these are just letters and don’t mean much. To help the computer understand these words better, we need to turn them into numbers.
Embeddings are a way of representing things (like words, images, or sounds) as numbers, usually in the form of a list. These lists capture the meaning or features of the thing they represent. For example:
- Apple: [1.2, 0.8, -0.5]
- Banana: [1.1, 0.9, -0.4]
These numbers are placed in a special “space” that helps the computer compare them and find relationships.
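If you like to see things in code, here is the idea in its simplest form. This is only a toy sketch: the numbers are the made-up values from above, whereas real embeddings are produced by a trained model.
# A tiny, hand-made "embedding table": each word maps to a list of numbers.
# The values are invented for illustration; a real model learns them from data.
embeddings = {
    "apple":  [1.2, 0.8, -0.5],
    "banana": [1.1, 0.9, -0.4],
}

print(embeddings["apple"])  # the computer now works with numbers, not letters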
Why Do We Need Embeddings?
Computers are great at working with numbers, not text. By turning things into numbers (embeddings), we help computers perform tasks like:
- Finding Similarities: Are two words or images similar? Embeddings can tell us how close or far apart they are.
- Making Predictions: Embeddings help models guess what word might come next in a sentence.
- Organising Information: They help group similar things together, like categorising emails as “work” or “personal.”
How Do Embeddings Work?
Think of a map. Each city has coordinates (latitude and longitude) that tell us where it is. Similarly, embeddings give coordinates to words or images. For example:
- Words like “dog” and “cat” might have similar coordinates because they’re related.
- Words like “dog” and “car” will be further apart because they’re less related.
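To make the map analogy concrete, here is a toy sketch with made-up 2D coordinates. Real embeddings use hundreds of dimensions, and the numbers below are invented purely for illustration:
# Made-up 2D "coordinates" for a few words (illustrative values only)
dog = (2.0, 3.0)
cat = (2.5, 3.2)
car = (8.0, 1.0)

def straight_line_distance(a, b):
    # the familiar distance-between-two-points formula from the map analogy
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

print(straight_line_distance(dog, cat))  # ~0.54 -> close together, so related
print(straight_line_distance(dog, car))  # ~6.32 -> far apart, so less related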
One amazing thing about embeddings is how they learn relationships. For example, in word embeddings:
- “King” − “Man” + “Woman” ≈ “Queen”
This means embeddings understand gender relationships, synonyms, and even context!
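Here is that arithmetic with tiny, hand-made vectors. The numbers are invented so the pattern works out exactly; in real embeddings it only holds approximately, which is why we write ≈.
# Made-up 3-number vectors; imagine the first number captures "royalty"
# and the second captures "masculinity" (real models learn this from data).
man   = [0.0,  1.0, 0.1]
woman = [0.0, -1.0, 0.1]
king  = [1.0,  1.0, 0.2]
queen = [1.0, -1.0, 0.2]

# "King" - "Man" + "Woman", computed one number at a time
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # [1.0, -1.0, 0.2] -> exactly our "queen" vector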
Generating Embeddings
We will use the OpenAI API to generate embeddings. The API is a paid service, so we need to set up billing with OpenAI before we can call it. To get everything ready, follow these simple steps:
1. Create an OpenAI account
If you don’t already have an OpenAI account, you need to create one by following the steps on the OpenAI website.
2. Generate the OpenAI API key
Once you’ve created your OpenAI account or logged into an existing one, you’ll see your initials and profile icon at the top-right corner of the OpenAI dashboard. To generate an OpenAI API key, click your name to open the dropdown menu, then click the ‘View API keys’ option.
At this stage, you’ll see a window with the option ‘Create new secret key’ near the center. If you don’t have an OpenAI API key yet, click this option to create one. Make sure you save the newly generated key right away, because you won’t be able to see the full API key again once the window closes.
3. Billing Setup
OpenAI charges for API usage based on consumption. Ensure you have set up a payment method for billing by clicking on “Billing” in the left menu and then “Payment methods”. Enter your credit card and billing details, then click submit.
4. Security Reminder
Keep your API key secure and do not share it with anyone.
We will use Python to send requests to the OpenAI API endpoint and get back our embeddings. For this we need the following (the install command is given in brackets):
- openai (pip install openai)
- urllib: nothing to install here, it ships with Python’s standard library (urllib3 on PyPI is a different package and is not what we use).
Now import these modules in your code:
import openai
import urllib.request
import urllib.error
Configuring the API key
We will save the API key in a file, then load it from the file and set it as an environment variable. I saved my API key inside “openai_key.txt”, and the following code will help you do the same.
with open("../openai_key.txt", "r") as file:
    openai_key = file.read().strip()  # strip() removes any trailing newline

import os
os.environ["OPENAI_KEY"] = openai_key
Requesting OpenAI
The endpoint for text embeddings is https://api.openai.com/v1/embeddings, and the model we are using is ‘text-embedding-ada-002’. The following code requests OpenAI to embed a piece of text.
import json

url = 'https://api.openai.com/v1/embeddings'

def get_openai_embedding(prompt):
    # authenticate with our API key and tell the server we are sending JSON
    headers = {
        'Authorization': f'Bearer {openai_key}',
        'Content-Type': 'application/json',
    }
    data = {
        "input": prompt,
        "model": "text-embedding-ada-002"
    }
    data = json.dumps(data).encode('utf-8')
    req = urllib.request.Request(url, data=data, headers=headers, method='POST')
    try:
        response = urllib.request.urlopen(req)
        response_data = json.loads(response.read().decode('utf-8'))
        # the embedding is a list of floats in the first item of 'data'
        return response_data['data'][0]['embedding']
    except urllib.error.HTTPError as e:
        print(f"HTTP Error: {e}")
        return None
dog = get_openai_embedding("dog")
cat = get_openai_embedding("cat")
car = get_openai_embedding("car")
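Before comparing them, it helps to peek at what came back: each embedding is just a long list of floating-point numbers. For ‘text-embedding-ada-002’ that list has 1,536 entries.
print(len(dog))  # 1536 dimensions for text-embedding-ada-002
print(dog[:5])   # the first few numbers of the list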
We embedded ‘dog’, ‘cat’, and ‘car’ and stored them in variables. Now we will find out the distance between them, in other words how similar these words are.
Using Pythagoras Theorem or Euclidean distance
We already discussed the formula for the Pythagoras Theorem or Euclidean distance in our previous blog, and we will use it here. If you haven’t gone through that blog yet, here is the link: Pythagoras Theorem blog.
def cal_dist(a, b):
    # square the difference in every dimension, add them up, then take the square root
    dist_nd = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return dist_nd
print(cal_dist(dog, cat))
print(cal_dist(dog, car))
0.5235309855313564
0.5779158112105844
As we can see, the distance between dog and cat is less than the distance between dog and car, which implies that dog and cat are more similar than dog and car. Hence we have successfully used embeddings to let the computer capture the meaning behind text.
But when we compute the Pythagoras Theorem or Euclidean distance we square a difference for every dimension and then take a square root, and these embeddings are very high-dimensional (1,500+ dimensions), so this method requires more computational power. For that we already have a method called cosine similarity.
Using Cosine Similarity
OpenAI documents that it normalises the vectors (embeddings) to unit vectors, which means every embedding has a length of 1 from the origin. We can check that using the code below:
print((sum(x**2 for x in dog))**0.5)
print((sum(x**2 for x in cat))**0.5)
print((sum(x**2 for x in car))**0.5)
1.0000000351643519
0.9999999077861369
1.0000000341730662
We can see that the lengths of all of these are very close to 1, so we can use cosine similarity directly, as we already discussed in this blog: cosine similarity.
def cosine_similarity(a, b):
    # multiply matching dimensions and add them up (the dot product)
    similarity = 0
    for i, j in zip(a, b):
        similarity += i * j
    return similarity
print(cosine_similarity(dog,cat))
print(cosine_similarity(dog,car))
0.862957596544778
0.8330067269138243
We can see that the cosine similarity for dog and cat is higher than for dog and car, which again implies that dog and cat are more similar than dog and car.
Here we can see that cosine similarity uses only addition and multiplication, which requires much less computational power compared with the Pythagoras Theorem or Euclidean distance, as we discussed in the cosine similarity blog.
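In fact, for unit vectors the two measures are directly linked: the squared Euclidean distance equals 2 minus twice the cosine similarity, so both will always agree on which pair of words is closer. We can quickly verify this with the embeddings and functions we already have:
# For unit-length vectors: cal_dist(a, b)**2 == 2 - 2 * cosine_similarity(a, b)
print(cal_dist(dog, cat) ** 2)              # ~0.2741
print(2 - 2 * cosine_similarity(dog, cat))  # ~0.2741 (same value)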
Next, we will use this embedding concept to build something relevant to a real-world problem or scenario.