Text Embeddings

Text embeddings are one of the most transformative concepts in modern artificial intelligence. While traditional methods of processing text rely on exact word matches, embeddings allow computers to understand the meaning behind the words. At its core, an embedding is a way of converting a word, a sentence, or even an entire document into a long list of numbers—a vector. These numbers aren't random; they are carefully learned coordinates in a multi-dimensional mathematical space. The magic of embeddings is that text with similar meanings will end up close together in this space, while unrelated text will be far apart. This allows a computer to "know" that a "car" and a "vehicle" are nearly the same thing, even though the words don't share any letters.

Visualizing the Vector Space

To understand how embeddings work, imagine a giant map where every word has its own specific location. In this map, words that describe similar concepts, like "Happy," "Joyful," and "Cheerful," would all be clustered in the same neighborhood. Words that represent opposites or completely different categories, like "Freezing" or "Bicycle," would be located far away. This spatial relationship is what gives embeddings their power. By measuring the distance between two points in this mathematical map, we can calculate exactly how semantically similar two pieces of text are. This is the foundation for modern search engines, recommendation systems, and the "brains" of large language models.

[Figure: a 2-D map of the vector space plotting "King," "Man," "Queen," "Woman," and "Apple"]

Word Arithmetic: King - Man + Woman = Queen

One of the most famous properties of high-quality embeddings is that they capture logical relationships through "vector arithmetic." If you take the vector for "King," subtract the vector for "Man," and add the vector for "Woman," the resulting point in space is incredibly close to the vector for "Queen." This proved that the model had learned not just individual words, but the underlying concepts of gender and royalty. Similar arithmetic works for verb tenses (e.g., "Walking" - "Walk" + "Swim" = "Swimming") and even geography (e.g., "Paris" - "France" + "Japan" = "Tokyo").
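The arithmetic above can be sketched with hand-made toy vectors. The numbers below are invented purely for illustration (real models learn coordinates with hundreds of dimensions), but the mechanics are the same: subtract, add, then find the nearest word in the vocabulary.

```python
import numpy as np

# Toy 4-dimensional "embeddings", invented for illustration only.
# Real models learn these coordinates from large text corpora.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
    "apple": np.array([0.0, 0.0, 0.1, 0.9]),
}

def nearest(vec, exclude=()):
    """Return the vocabulary word whose vector is closest (by cosine) to vec."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in vocab.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(vec, candidates[w]))

# King - Man + Woman lands nearest to Queen.
result = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```

Excluding the input words from the search mirrors how this experiment is usually run in practice, since the result vector otherwise tends to sit closest to "king" itself.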

Measuring Similarity with Cosine Similarity

Because these embedding vectors can have hundreds or even thousands of dimensions, we need a reliable way to compare them. The industry-standard method is Cosine Similarity. Instead of measuring the straight-line distance between two points, cosine similarity measures the angle between two vectors.

  • A score of 1.0 means the vectors are pointing in exactly the same direction (identical meaning).
  • A score of 0.0 means they are perpendicular (unrelated).
  • A score of -1.0 means they are pointing in opposite directions (opposite meanings).

This method is preferred because it focuses on the "direction" of the meaning rather than the "length" of the vector, making it much more robust for comparing texts of different lengths.
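These three cases can be checked directly with a small NumPy implementation of cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    """Angle-based similarity: the dot product of the two vectors
    divided by the product of their lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])

print(cosine_similarity(a, a))                            # ≈ 1.0  (same direction)
print(cosine_similarity(a, np.array([-2.0, 1.0, 0.0])))   # ≈ 0.0  (perpendicular)
print(cosine_similarity(a, -a))                           # ≈ -1.0 (opposite direction)
```

Note that the length of `a` never affects the score: scaling a vector by any positive constant leaves its cosine similarity to every other vector unchanged.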

Practical Applications: Semantic Search and RAG

Today, embeddings are used in almost every advanced AI application.

  • Semantic Search: Unlike traditional search that looks for keywords, semantic search finds documents that are about the same topic, even if they use different words.
  • Retrieval-Augmented Generation (RAG): This is the technique used to give LLMs like ChatGPT access to private or up-to-date data. We convert a huge database into embeddings, and when a user asks a question, we find the most relevant facts using vector similarity and feed them to the AI as context.
  • Clustering: We can use embeddings to group millions of documents into categories automatically without ever needing to read them ourselves.

The snippet below generates embeddings for a few sentences using the open-source sentence-transformers library:
# Using a pre-trained model to generate sentence embeddings
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "The cat sits outside",
    "A man is playing guitar",
    "The new movie is awesome"
]

# Generate the embeddings (vectors)
embeddings = model.encode(sentences)

print(f"Shape of one embedding: {embeddings[0].shape}")  # (384,) for this model
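To turn embeddings like these into a semantic search, you embed the query the same way and rank the documents by cosine similarity. The sketch below is self-contained: the 3-dimensional vectors are made up to stand in for real model output (in practice both documents and query would come from a call like `model.encode`), with each dimension loosely representing a topic.

```python
import numpy as np

docs = [
    "The cat sits outside",
    "A man is playing guitar",
    "The new movie is awesome",
]

# Stand-ins for real embeddings, invented for illustration:
# dimension 0 ~ animals, dimension 1 ~ music, dimension 2 ~ film.
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.9, 0.1],
    [0.1, 0.1, 0.9],
])

# A query like "A kitten naps on the porch" would land near the animal axis.
query_vector = np.array([0.8, 0.2, 0.1])

# Rank documents by cosine similarity to the query.
norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
scores = doc_vectors @ query_vector / norms
best = int(np.argmax(scores))
print(docs[best])  # the cat sentence ranks first, despite sharing no keywords
```

Notice that the query and the top result share no words at all; the match happens entirely in vector space.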

Vector Databases: The Storage for AI

As you scale your AI applications to handle millions of documents, you cannot simply store embeddings in a regular list and scan it on every query. You need a Vector Database (like Pinecone, Milvus, or Weaviate). These specialized databases build approximate nearest-neighbor indexes that allow lightning-fast similarity searches across billions of vectors. By mastering embeddings and vector databases, you are learning the core technology that allows machines to navigate the infinite complexity of human knowledge and language.
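At its core, the operation a vector database performs is "return the top-k vectors most similar to a query." The toy class below (a hypothetical name, not a real library) does this with an exact linear scan; production systems replace the scan with approximate nearest-neighbor indexes such as HNSW to stay fast at scale.

```python
import numpy as np

class TinyVectorStore:
    """A toy in-memory vector store with exact (brute-force) top-k search.
    Real vector databases swap the linear scan for an approximate index."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim))
        self.payloads = []

    def add(self, vector, payload):
        v = np.asarray(vector, dtype=float)
        v = v / np.linalg.norm(v)            # normalize so dot product == cosine
        self.vectors = np.vstack([self.vectors, v])
        self.payloads.append(payload)

    def search(self, query, k=3):
        q = np.asarray(query, dtype=float)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q            # cosine similarity to every stored vector
        top = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
        return [(self.payloads[i], float(scores[i])) for i in top]

store = TinyVectorStore(dim=3)
store.add([1.0, 0.0, 0.0], "doc about cats")
store.add([0.0, 1.0, 0.0], "doc about guitars")
store.add([0.9, 0.1, 0.0], "doc about kittens")

results = store.search([1.0, 0.1, 0.0], k=2)
print(results)  # the kitten and cat docs rank highest
```

Normalizing vectors at insert time is a common design choice: it turns every later cosine computation into a single dot product, which is exactly the operation ANN indexes are optimized for.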