As a marketer at an AI startup, I often hear terms like "text embeddings" that sound highly technical. I didn't understand why they were so important to AI models. The concepts seemed vague and abstract to me at first.
Eventually, I unlocked the key role text embeddings play in natural language processing (NLP). It turns out they had been behind the scenes all along in many applications I used daily - powering the chatbots I conversed with and the search engines delivering amazingly relevant results.
So what changed? I finally grasped that text embeddings map words into a vector space that preserves the relationships between them. It clicked when I thought about words as stars, and embeddings connect them into constellations representing concepts. This numerical representation allows machines to analyze the underlying semantics in text on a much more nuanced level.
I want to guide you through this realization as well, so you can also harness text embeddings in your AI projects. In this post, I'll explain exactly what text embeddings are, list popular open source text embedding models, and show why they're so fundamental to AI today.
What is a Text Embedding Model?
Have you ever tried to explain the plot of a movie to a friend, but struggled to capture the magic and emotion you experienced? That frustration stems from the vast gap between human language and machine understanding. We express concepts through words and context that computer algorithms cannot easily interpret.
Text embedding models are the key to bridging that gap.
These models work like a translator, converting words and sentences into a numeric representation that retains the original meaning as much as possible. Imagine turning a book passage into a set of coordinates in space - the distance between points conveys the relationships between the words.
Instead of processing language at face value, text embeddings allow machines to analyze the underlying semantics.
Several techniques exist, from simpler count-based methods like TF-IDF to sophisticated neural networks like BERT. While count-based methods treat each word in isolation, learned embeddings like Word2Vec leverage surrounding context so that related terms cluster together in the vector space. This enables a more nuanced understanding of natural language.
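To make this concrete, here is a minimal sketch of what generating and comparing embeddings looks like in practice. It assumes the open source sentence-transformers library and uses gte-base (covered later in this post); the example sentences are purely illustrative.

```python
# Minimal sketch: embed a few sentences and measure their semantic similarity.
# Assumes sentence-transformers is installed (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-base")

sentences = [
    "The movie was a thrilling adventure through space.",
    "An exciting sci-fi film set among the stars.",
    "The quarterly earnings report missed expectations.",
]

embeddings = model.encode(sentences)           # one 768-dimensional vector per sentence
scores = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarities

print(float(scores[0][1]))  # high: both sentences describe a space movie
print(float(scores[0][2]))  # low: unrelated topics
```

Notice that the first two sentences share almost no words, yet their vectors end up close together because they mean similar things.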
Text embeddings power a wide range of AI applications today:
- Search engines optimize results by mapping queries and documents into a common space. This allows matching documents with similar embeddings even if the exact search term doesn't appear (see the sketch after this list).
- Machine translation services like Google Translate rely on embeddings to translate between languages. The model maps words and phrases to vectors in one language and finds the closest equivalent term in the target language.
- Sentiment analysis tools classify emotions in text by locating words in the vector space relative to points associated with positive, negative, or neutral sentiment.
- Chatbots, especially those built on retrieval-augmented generation (RAG), use embeddings to interpret user inputs and retrieve the context needed for appropriate responses, facilitating more natural conversations.
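The search use case is a good one to see end to end. Below is a hedged sketch of embedding-based retrieval, again assuming sentence-transformers and gte-base; the documents and query are made up for illustration.

```python
# Sketch of semantic search: rank documents against a query by embedding similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-base")

documents = [
    "How to reset your password in the account settings.",
    "Shipping usually takes 3-5 business days.",
    "Our refund policy allows returns within 30 days.",
]
doc_embeddings = model.encode(documents)

query = "I forgot my login credentials"
query_embedding = model.encode(query)

# The best match shares no exact keywords with the query - the embeddings
# capture that "login credentials" and "password" are related concepts.
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=3)[0]
for hit in hits:
    print(round(hit["score"], 3), documents[hit["corpus_id"]])
```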
The smart assistant that understands your movie plot summary? Text embeddings get us closer to that future. They transform the intricacies of human language so machines can really comprehend meaning.
So if you're working on any application involving natural language processing, from search to recommendations to analytics, start by integrating text embeddings. They provide the missing link to transform words into insight.
15 Open Source Text Embedding Models (updated April 2024)
To provide the full landscape of text embedding options, I consulted with Dan Woolridge, Machine Learning Engineer at Graft, to compile this list of 15 popular open source text embedding models.
His expert perspective shed light on the diverse capabilities and best uses for each one. Whether you need a blazing fast general purpose embedding or one tailored to scientific text, there’s a model here for you.
"Open source text embedding models offer visibility and control, letting me see their training data and inner workings. They evolve with collective AI research, and because they’re open, they are easy to retrain with the latest data. Plus, I can fine-tune them for specific datasets, ensuring both flexibility and trust in my AI systems."
Dan Woolridge, Machine Learning Engineer at Graft
Let’s explore!
- GTE-Base (Graft Default)
- GTE-Large
- GTE-Small
- E5-Small
- MultiLingual
- RoBERTa (2022)
- MPNet V2
- Scibert Science-Vocabulary Uncased
- Longformer Base 4096
- Distilbert Base Uncased
- Bert Base Uncased
- MultiLingual BERT
- E5-Base
- LED 16K
- voyage-lite-02-instruct
*note: these are all available in Graft today.
1. GTE-Base (Graft Default)
- Model Name: gte-base
- Description: A good general model for similarity search or downstream enrichments
- Use for: General text blobs
- Limitations: Text longer than 512 tokens will be truncated
- Source: thenlper/gte-base · Hugging Face
- Trained on: (refer to the paper for details)
- Paper: Towards General Text Embeddings with Multi-stage Contrastive Learning
- Embedding Dimension: 768
- Model Size: 219 MB
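If you want to try gte-base outside of Graft, the snippet below is one way to do it with sentence-transformers. The 512-token limit means longer documents need to be chunked first; the 400-word chunk size here is an illustrative choice, not a recommendation from the model authors.

```python
# Sketch: embed a long document with gte-base by chunking it under the 512-token limit.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-base")

long_text = " ".join(["word"] * 5000)  # stand-in for a long document
words = long_text.split()
chunks = [" ".join(words[i:i + 400]) for i in range(0, len(words), 400)]

chunk_embeddings = model.encode(chunks)  # one 768-dimensional vector per chunk
print(chunk_embeddings.shape)            # (number_of_chunks, 768)
```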
2. GTE-Large
- Model Name: gte-large
- Description: A higher quality general model for similarity search or downstream enrichments
- Use for: General text blobs
- Limitations: Text longer than 512 tokens will be truncated
- Source: thenlper/gte-large · Hugging Face
- Trained on: (refer to the paper for details)
- Paper: Towards General Text Embeddings with Multi-stage Contrastive Learning
- Embedding Dimension: 1024
- Model Size: 670 MB
3. GTE-Small
- Model Name: gte-small
- Description: A good general model and faster for similarity search or downstream enrichments
- Use for: General text blobs
- Limitations: Text longer than 512 tokens will be truncated
- Source: thenlper/gte-small · Hugging Face
- Trained on: (refer to the paper for details)
- Paper: Towards General Text Embeddings with Multi-stage Contrastive Learning
- Embedding Dimension: 384
- Model Size: 67 MB
4. E5-Small
- Model Name: e5-small-v2
- Description: A good small and fast general model for similarity search or downstream enrichments
- Use for: General text blobs
- Limitations: Text longer than 512 tokens will be truncated
- Source: intfloat/e5-small-v2 · Hugging Face
- Trained on: CCPairs, a large curated collection of text pairs mined from web sources (see the paper for details)
- Paper: Text Embeddings by Weakly-Supervised Contrastive Pre-training
- Embedding Dimension: 384
- Model Size: 128 MB
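One practical detail worth knowing about the E5 family: the model card asks you to prepend "query: " or "passage: " to your text so the model knows which side of a retrieval pair it is embedding. A small sketch, assuming sentence-transformers and illustrative text:

```python
# Sketch of the E5 prefix convention described on the model card.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-small-v2")

query = "query: how do text embeddings work"
passages = [
    "passage: Text embeddings map words and sentences into numeric vectors.",
    "passage: The store opens at nine o'clock on weekdays.",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(util.cos_sim(q_emb, p_emb))  # the first passage should score higher
```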
5. MultiLingual
- Model Name: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- Description: A good general model to deal with multilingual datasets
- Use for: General text blobs
- Limitations: Text longer than 512 tokens will be truncated
- Source: sentence-transformers/paraphrase-multilingual-mpnet-base-v2 · Hugging Face
- Trained on: Multilingual paraphrase data (see the model card for details)
- Paper: Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
- Embedding Dimension: 768
- Model Size: 1.04 GB
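The payoff of a multilingual model is that translations of the same sentence land close together in the shared vector space, so you can match text across languages. A brief sketch, with illustrative sentences:

```python
# Sketch of cross-lingual similarity with the multilingual MPNet model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

sentences = [
    "Where is the nearest train station?",            # English
    "¿Dónde está la estación de tren más cercana?",   # Spanish, same meaning
    "I would like to order a pizza.",                 # English, different meaning
]
embeddings = model.encode(sentences)

print(float(util.cos_sim(embeddings[0], embeddings[1])))  # high: translations of each other
print(float(util.cos_sim(embeddings[0], embeddings[2])))  # lower: unrelated meaning
```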
6. RoBERTa (2022)
- Model Name: olm/olm-roberta-base-dec-2022
- Description: A RoBERTa model trained on data up to December 2022, for users who are familiar with the BERT model family and want to use it in Graft
- Use for: General text blobs
- Limitations: Text longer than 512 tokens will be truncated
- Source: olm/olm-roberta-base-dec-2022 · Hugging Face
- Trained on: A cleaned December 2022 snapshot of Common Crawl and Wikipedia (see the model card for details)
- Paper: RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Embedding Dimension: 768
- Model Size: 476 MB
7. MPNet V2
- Model Name: sentence-transformers/all-mpnet-base-v2
- Description: An MPNet model with a Siamese architecture, trained for text similarity
- Use for: Similarity search for text
- Limitations: Text longer than 512 tokens will be truncated
- Source: all-mpnet-base-v2 · Hugging Face
- Trained on: A concatenation of multiple datasets used for fine-tuning, totaling over 1 billion sentence pairs
- Paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Embedding Dimension: 768
- Model Size: 420 MB
8. Scibert Science-Vocabulary Uncased
- Model Name: allenai/scibert_scivocab_uncased
- Description: A BERT model pretrained on scientific text with a science-focused vocabulary
- Use for: Scientific text blobs
- Limitations: Text longer than 512 tokens will be truncated
- Source: allenai/scibert_scivocab_uncased · Hugging Face
- Trained on: Semantic Scholar - 1.14M scientific papers
- Paper: SCIBERT: A Pretrained Language Model for Scientific Text
- Embedding Dimension: 768
- Model Size: 442 MB
9. Longformer Base 4096
- Model Name: allenai/longformer-base-4096
- Description: A transformer model for long text, based on RoBERTa
- Use for: Text up to 4096 tokens
- Limitations: Text longer than 4096 tokens will be truncated
- Source: allenai/longformer-base-4096 · Hugging Face
- Trained on: English Wikipedia and “BookCorpus”
- Paper: Longformer: The Long-Document Transformer
- Embedding Dimension: 768
- Model Size: 597 MB
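Longformer isn't packaged as a sentence-transformers model, so producing a single document vector takes a little more plumbing. Below is a hedged sketch using Hugging Face transformers with mean pooling over the last hidden state; this is one common pooling choice, not an official recommendation from the model authors.

```python
# Sketch: embed a long document with Longformer and mean pooling.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

long_text = "A very long report about quarterly results. " * 500  # stand-in document

inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Average the token vectors (ignoring padding) into one 768-dimensional embedding.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```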
10. Distilbert Base Uncased
- Model Name: distilbert-base-uncased
- Description: A relatively fast and small model, with performance close to BERT
- Use for: General text blobs
- Limitations: Text longer than 512 tokens will be truncated
- Source: distilbert-base-uncased · Hugging Face
- Trained on: English Wikipedia and “BookCorpus”
- Paper: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Embedding Dimension: 768
- Model Size: 268 MB
11. Bert Base Uncased
- Model Name: bert-base-uncased
- Description: The original BERT language model, trained on English text via masked language modeling and next sentence prediction
- Use for: General text blobs
- Limitations: Text longer than 512 tokens will be truncated
- Source: bert-base-uncased · Hugging Face
- Trained on: English Wikipedia and “BookCorpus”
- Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Embedding Dimension: 768
- Model Size: 440 MB
12. MultiLingual BERT
- Model Name: bert-base-multilingual-cased
- Description: A multilingual version of BERT trained on 102 languages
- Use for: Scenarios where text in various languages exists
- Limitations: 512 max sequence length
- Source: bert-base-multilingual-cased · Hugging Face
- Trained on: Wikipedia
- Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Embedding Dimension: 768
- Model Size: 714 MB
13. E5-Base
- Model Name: e5-base
- Description: A good general model for similarity search or downstream enrichments
- Use for: General text blobs
- Limitations: Text longer than 512 tokens will be truncated
- Source: intfloat/e5-base · Hugging Face
- Trained on: CCPairs, a large curated collection of text pairs mined from web sources (see the paper for details)
- Paper: Text Embeddings by Weakly-Supervised Contrastive Pre-training
- Embedding Dimension: 768
- Model Size: 418 MB
14. LED 16K
- Model Name: allenai/led-base-16384
- Description: A transformer model for very long text, based on BART
- Use for: Text up to 16384 tokens
- Limitations: Compressing roughly 16K tokens into 768 dimensions is inherently lossy
- Source: allenai/led-base-16384 · Hugging Face
- Trained on: English Wikipedia and “BookCorpus”
- Paper: Longformer: The Long-Document Transformer
- Embedding Dimension: 768
- Model Size: 648 MB
15. voyage-lite-02-instruct
- Model Name: voyage-lite-02-instruct
- Description: An instruction-tuned model from the first generation of the Voyage family
- Use for: Classification, clustering, and sentence textual similarity tasks, which are the only recommended use cases
- Limitations: The smallest text embedding model in the second generation of the Voyage family
- Source: Embeddings Docs - Voyage
- Trained on: N/A
- Paper: N/A
- Embedding Dimension: 1024
- Model Size: 1220 MB
Check out our Comprehensive Guide to the Best Open Source Vector Databases
Massive Text Embedding Benchmark (MTEB) Leaderboard
With rapid innovation in the field, the race for the best text embedding model is tighter than ever. On the Massive Text Embedding Benchmark (MTEB) leaderboard hosted on Hugging Face, the top 10 open source text embedding models are separated by barely 3 points in average score. That's how competitive it is at the summit!
This intense jockeying for the pole position means you have an embarrassment of riches when selecting a text embedding model. You can feel confident picking from the leading options knowing that they are all operating at the cutting edge. It's a great time to integrate text embeddings, with multiple excellent models vying for first place and pushing each other to new heights.
The minuscule gaps between these elite models also highlight the importance of testing them for your specific use case. Certain datasets or downstream tasks may favor one model over another by a slim margin. With Graft's platform, you can easily compare top contenders side-by-side to find the ideal fit.
So rest assured that the top open source text embeddings are all performing at an elite level, separated by the narrowest of margins. Pick the one tailored to your needs and start reaping the benefits of these incredible models!
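If you want to see where a particular model lands before committing, you can also run an MTEB task yourself. A hedged sketch, assuming the open source mteb package is installed (the exact API may differ between versions) and running only a single small task rather than the full benchmark:

```python
# Sketch: evaluate one model on a single MTEB task.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-base")

# One small classification task; the full benchmark covers dozens of tasks
# and takes many hours to run.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/gte-base")
print(results)
```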
How to Compare the Performance of Multiple Text Embeddings
I explored 15 incredible open source text embedding models that are all available in Graft. With so many options, you're probably brimming with ideas for integrating embeddings into your NLP applications.
But before diving headfirst into implementation, consider this - open source flexibility comes at the cost of complexity. Integrating and comparing multiple models involves decisions, customization, and troubleshooting that can quickly become a labyrinth.
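To give a sense of what that do-it-yourself path looks like, here is a rough sketch of manually comparing a few candidate models on your own labeled query-passage pairs. The models and examples are illustrative, and a real evaluation would use a much larger, task-specific dataset.

```python
# Rough sketch: spot-check several embedding models on hand-labeled pairs.
from sentence_transformers import SentenceTransformer, util

candidates = [
    "thenlper/gte-base",
    "sentence-transformers/all-mpnet-base-v2",
    "sentence-transformers/all-MiniLM-L6-v2",
]

pairs = [  # (query, passage that should match it)
    ("reset my password", "Go to account settings to change your password."),
    ("delivery time", "Orders typically arrive within 3-5 business days."),
]

for name in candidates:
    model = SentenceTransformer(name)
    queries = model.encode([q for q, _ in pairs])
    passages = model.encode([p for _, p in pairs])
    # Average cosine similarity of the correct pairs as a crude quality signal.
    scores = [float(util.cos_sim(q, p)) for q, p in zip(queries, passages)]
    print(name, round(sum(scores) / len(scores), 3))
```

Every model download, pooling choice, and scoring decision here is on you, which is exactly the complexity a managed platform abstracts away.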
That's why Graft's AI platform offers a faster, simpler solution purpose-built for production. Here's how Graft gives you an efficient onramp to advanced text embeddings:
- Experiment faster with one-click access to pre-integrated open source and commercial (OpenAI & Cohere) embedding models - no manual tinkering required.
- Side-by-side comparison for multiple models, so you can choose the optimal one for your use case.
- Seamless integration with your downstream AI tasks through a robust API.
- Scalability to production workloads while maintaining speed and cost-efficiency.
- Expert guidance from our team if you need help selecting and fine-tuning models.
With Graft, you get the versatility of open source models without the building and maintenance hassles. Now you can hit the ground running and capitalize on text embeddings for your AI applications.
Don't settle for duct-taped solutions. Choose Graft and unlock the true power of text embeddings today!
Check out the 3 Ways to Optimize Your Semantic Search Engine With Graft
From Confusion to Clarity: Key Takeaways
When I first heard the term "text embeddings," I glazed over like it was just more AI jargon. But after unpacking the concept in this post, I'm amazed by the quiet revolution embeddings have driven behind the scenes.
By mapping words into vector spaces capturing semantic relationships, text embeddings enable machines to truly comprehend language. Techniques like Word2Vec and BERT are the missing link powering today's magical NLP applications.
While open source models allow incredible innovation, platforms like Graft simplify production deployment. One-click access and comparison help you find the perfect text embedding for your use case.
My journey today sparked an enthusiasm to keep learning more about this field. I hope you feel empowered to start building the next generation of intelligent applications.
Text embeddings have already changed the AI landscape. Now it's your turn to harness their potential and create some magic!