Generating LLM embeddings with open source models in PostgresML

How to use the pgml.embed(...) function to generate embeddings with free and open source models in your own database.
By Montana Low
04/21/2023

PostgresML makes it easy to generate embeddings from text in your database using a large selection of state-of-the-art models with one simple call to pgml.embed(model_name, text). Prove the results in this series to your own satisfaction, for free, by signing up for a GPU-accelerated database.

This article is the first in a multipart series that will show you how to build a modern semantic search and recommendation engine, including personalization, using open source models.

  1. Generating LLM Embeddings with HuggingFace models
  2. Tuning vector recall with pgvector
  3. Personalizing embedding results with application data
  4. Optimizing semantic results with an XGBoost ranking model - coming soon!

Introduction

In recent years, embeddings have become an increasingly popular technique in machine learning and data analysis. They are essentially vector representations of data points that capture their underlying characteristics or features. In most programming environments, vectors can be efficiently represented as native array datatypes. They can be used for a wide range of applications, from natural language processing to image recognition and recommendation systems.

They can also turn natural language into quantitative features for downstream machine learning models and applications.

Embeddings show us the relationships between rows in the database.

A popular use case driving the adoption of "vector databases" is doing similarity search on embeddings, often referred to as "Semantic Search". This is a powerful technique that allows you to find similar items in large datasets by comparing their vectors. For example, you could use it to find similar products in an e-commerce site, similar songs in a music streaming service, or similar documents given a text query.

Postgres is a good candidate for this type of application because it's a general purpose database that can store both the embeddings and the metadata in the same place, and has a rich set of features for querying and analyzing them, including fast vector indexes used for search.
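
To make this concrete, here's a minimal sketch of what that looks like, using the pgvector extension we'll enable later in this article. The table, column names and toy 3-dimensional vectors are hypothetical, purely for illustration; real models output hundreds of dimensions:

-- Hypothetical table: metadata and the embedding live in the same row
CREATE TABLE products (
    id bigserial PRIMARY KEY,
    title text,
    category text,
    title_embedding vector(3)  -- 3 dimensions only for illustration
);

-- Semantic search: order rows by cosine distance to a query embedding
SELECT title, category
FROM products
ORDER BY title_embedding <=> '[0.1, 0.2, 0.3]'::vector
LIMIT 10;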

This chapter is the first in a multipart series that will show you how to build a modern semantic search and recommendation engine, including personalization, using PostgresML and open source models. We'll show you how to use the pgml.embed function to generate embeddings from text in your database using an open source pretrained model. Further chapters will expand on how to implement many of the different use cases for embeddings in Postgres, like similarity search, personalization, recommendations and fine-tuned models.

It always starts with data

Most general purpose databases are full of all sorts of great data for machine learning use cases. Text data has historically been more difficult to deal with using complex Natural Language Processing techniques, but embeddings created from open source models can effectively turn unstructured text into structured features, perfect for more straightforward implementations.

In this example, we'll demonstrate how to generate embeddings for products on an e-commerce site. We'll use a public dataset of millions of product reviews from the Amazon US Reviews dataset. It includes the product title, a text review written by a customer, and some additional metadata about the product, like category. With just a few pieces of data, we can create a full-featured and personalized product search and recommendation engine, using both generic embeddings and later, additional fine-tuned models trained with PostgresML.

PostgresML includes a convenience function for loading public datasets from HuggingFace directly into your database. To load the DVD subset of the Amazon US Reviews dataset into your database, run the following command:

SELECT *
FROM pgml.load_dataset('amazon_us_reviews', 'Video_DVD_v1_00');

It took about 23 minutes to download the 7.1GB raw dataset with 5,069,140 rows into a table within the pgml schema (where all PostgresML functionality is namespaced). Once it's done, you can see the table structure with the following command:

\d pgml.amazon_us_reviews

Column            | Type    | Collation | Nullable | Default
------------------+---------+-----------+----------+---------
marketplace       | text    |           |          |
customer_id       | text    |           |          |
review_id         | text    |           |          |
product_id        | text    |           |          |
product_parent    | text    |           |          |
product_title     | text    |           |          |
product_category  | text    |           |          |
star_rating       | integer |           |          |
helpful_votes     | integer |           |          |
total_votes       | integer |           |          |
vine              | bigint  |           |          |
verified_purchase | bigint  |           |          |
review_headline   | text    |           |          |
review_body       | text    |           |          |
review_date       | text    |           |          |
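
It's also worth a quick sanity check that the full dataset made it in; the expected count matches the 5,069,140 rows mentioned above:

SELECT count(*) FROM pgml.amazon_us_reviews;
-- 5069140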

Let's take a peek at the first 5 rows of data:

SELECT *
FROM pgml.amazon_us_reviews
LIMIT 5;
marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date
US | 27288431 | R33UPQQUZQEM8 | B005T4ND06 | 400024643 | Yoga for Movement Disorders DVD: Rebuilding Strength, Balance, and Flexibility for Parkinson's Disease and Dystonia | Video DVD | 5 | 3 | 3 | 0 | 1 | This was a gift for my aunt who has Parkinson's ... | This was a gift for my aunt who has Parkinson's. While I have not previewed it myself, I also have not gotten any complaints. My prior experiences with yoga tell me this should be just what the doctor ordered. | 2015-08-31
US | 13722556 | R3IKTNQQPD9662 | B004EPZ070 | 685335564 | Something Borrowed | Video DVD | 5 | 0 | 0 | 0 | 1 | Five Stars | Teats my heart out. | 2015-08-31
US | 20381037 | R3U27V5QMCP27T | B005S9EKCW | 922008804 | Les Miserables (2012) [Blu-ray] | Video DVD | 5 | 1 | 1 | 0 | 1 | Great movie! | Great movie. | 2015-08-31
US | 24852644 | R2TOH2QKNK4IOC | B00FC1ZCB4 | 326560548 | Alien Anthology and Prometheus Bundle [Blu-ray] | Video DVD | 5 | 0 | 1 | 0 | 1 | Amazing | My husband was so excited to receive these as a gift! Great picture quality and great value! | 2015-08-31
US | 15556113 | R2XQG5NJ59UFMY | B002ZG98Z0 | 637495038 | Sex and the City 2 | Video DVD | 5 | 0 | 0 | 0 | 1 | Five Stars | Love this series. | 2015-08-31

Generating embeddings from natural language text

PostgresML provides a simple interface to generate embeddings from text in your database. You can use the pgml.embed function to generate embeddings for a column of text. The function takes a transformer name and a text value. The transformer will automatically be downloaded and cached on your connection process for reuse. You can find good candidate models for generating embeddings on the Massive Text Embedding Benchmark (MTEB) leaderboard.

Since the documents in our corpus (movie reviews) are all relatively short and similar in style, we don't need a large model. Alibaba-NLP/gte-base-en-v1.5 will be a good first attempt. The great thing about PostgresML is you can always regenerate your embeddings later to experiment with different embedding models.

It takes a couple of minutes to download and cache the Alibaba-NLP/gte-base-en-v1.5 model to generate the first embedding. After that, it's pretty fast.

Note how we prefix the text we want to embed with either passage: or query:. The e5 family of models requires us to prefix our data with passage: when generating embeddings for our corpus and query: when embedding a search string to find semantically similar content.

SELECT pgml.embed('Alibaba-NLP/gte-base-en-v1.5', 'passage: hi mom');
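
The query side is the same call with the other prefix; the search string here is just an illustration:

SELECT pgml.embed('Alibaba-NLP/gte-base-en-v1.5', 'query: yoga dvds for beginners');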

This is a pretty powerful function, because we can pass any arbitrary text to any open source model, and it will generate an embedding for us. We can benchmark how long it takes to generate an embedding for a single review, using psql's client-side timing:

\timing on

Aside from using this function with strings passed from a client, we can use it on strings already present in our database tables by calling pgml.embed on columns. For example, we can generate an embedding for the first review using a pretty simple query:

SELECT
review_body,
pgml.embed('Alibaba-NLP/gte-base-en-v1.5', 'passage: ' || review_body)
FROM pgml.amazon_us_reviews
LIMIT 1;
timer 54.820 ms

Time to generate an embedding increases with the length of the input text, and varies widely between different models. If we up our batch size (controlled by LIMIT), we can see the average time to compute an embedding on the first 1000 reviews is about 17ms per review:

SELECT
review_body,
pgml.embed('Alibaba-NLP/gte-base-en-v1.5', 'passage: ' || review_body) AS embedding
FROM pgml.amazon_us_reviews
LIMIT 1000;
timer 17955.026 ms

Comparing different models and hardware performance

This database is using a single GPU with 32GB RAM and 8 vCPUs with 16GB RAM. Running these benchmarks while watching the database processes with htop and nvidia-smi makes it clear that the bottleneck in this case is actually tokenizing the strings, which happens in a single thread on the CPU, not computing the embeddings on the GPU, which was only 20% utilized during the query.

We can also do a quick sanity check to make sure we're really getting value out of our GPU by passing the device to our embedding function:

SELECT
review_body,
pgml.embed(
'Alibaba-NLP/gte-base-en-v1.5',
'passage: ' || review_body,
'{"device": "cpu"}'
) AS embedding
FROM pgml.amazon_us_reviews
LIMIT 1000;
timer 30421.491 ms

Forcing the embedding function to use the CPU is almost 2x slower than cuda, which is the default when GPUs are available.
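
You can also make that default explicit by passing cuda as the device. This is a sketch mirroring the CPU query above, assuming the device kwarg accepts "cuda" symmetrically:

SELECT
review_body,
pgml.embed(
'Alibaba-NLP/gte-base-en-v1.5',
'passage: ' || review_body,
'{"device": "cuda"}'
) AS embedding
FROM pgml.amazon_us_reviews
LIMIT 1000;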

If you're managing dedicated hardware, there's always a decision to be made about resource utilization. If this is a multi-workload database with other queries using the GPU, it's probably good that we're not completely hogging it with our multi-decade-Amazon-scale data import process. But if this is a machine we've spun up just for this task, we can increase utilization to 4 concurrent connections, each running on a subset of the data, to more completely use our CPU, GPU and RAM.

Another consideration is that GPUs are much more expensive right now than CPUs, and if we're primarily interested in backfilling a dataset like this, high concurrency across many CPU cores might just be the price-competitive winner.

With 4x concurrency and a GPU, it'll take about 6 hours to compute all 5 million embeddings, which will cost $72 on PostgresML Cloud. If we use the CPU instead of the GPU, we'll probably want more cores and higher concurrency to churn through the job faster. A 96-core CPU machine could complete the job in half the time our single GPU would take, and at a lower hourly cost as well, for a total cost of $24. It's overall more cost-effective and faster in parallel, but keep in mind that if you're interactively generating embeddings for a user-facing application, it will add nearly double the latency: 30ms on CPU vs 17ms on GPU.

For comparison, it would cost about $299 to use OpenAI's cheapest embedding model to process this dataset. Their API calls average about 300ms, although they have high variability (200-400ms) and a p99 greater than 1,000ms in our measurements. They also have a default rate limit of 200 tokens per minute, which means it would take 1,425 years to process this dataset. You better call ahead.

Processor | Latency | Cost | Time
----------+---------+------+----------
CPU       | 30ms    | $24  | 3 hours
GPU       | 17ms    | $72  | 6 hours
OpenAI    | 300ms   | $299 | millennia

You can also find embedding models that outperform OpenAI's text-embedding-ada-002 model across many different tests on the leaderboard. It's always best to do your own benchmarking with your data, models, and hardware to find the best fit for your use case.

HTTP requests to a different datacenter cost more time and money for lower reliability than co-located compute and storage.

Instructor embedding models

At the time this was written, the leading model was hkunlp/instructor-xl. Instructor models take an additional instruction parameter which includes context for the embedding's use case, similar to prompts before text generation tasks.

Note: "Alibaba-NLP/gte-base-en-v1.5" has since surpassed the quality of instructor-xl and should be used instead, but we've left this documentation available for existing users.

Instructions can provide a "classification" or "topic" for the text:

Classification

SELECT pgml.embed(
transformer => 'hkunlp/instructor-xl',
text => 'The Federal Reserve on Wednesday raised its benchmark interest rate.',
kwargs => '{"instruction": "Represent the Financial statement:"}'
);
timer 17.912 ms

They can also specify particular use cases for the embedding:

Querying

SELECT pgml.embed(
transformer => 'hkunlp/instructor-xl',
text => 'where is the food stored in a yam plant',
kwargs => '{
"instruction": "Represent the Wikipedia question for retrieving supporting documents:"
}'
);
timer 24.263 ms

Indexing

SELECT pgml.embed(
transformer => 'hkunlp/instructor-xl',
text => 'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.',
kwargs => '{"instruction": "Represent the Wikipedia document for retrieval:"}'
);
timer 30.571 ms

Clustering

SELECT pgml.embed(
transformer => 'hkunlp/instructor-xl',
text => 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity',
kwargs => '{"instruction": "Represent the Medicine sentence for clustering:"}'
);
timer 18.986 ms

Performance remains relatively good, even with the most advanced models.

Generating embeddings for a large dataset

For our use case, we want to generate an embedding for every single review in the dataset. We'll use the vector datatype available from the pgvector extension to store (and later index) embeddings efficiently. All PostgresML cloud installations include pgvector. To enable this extension in your database, you can run:

CREATE EXTENSION vector;
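
If you're not sure whether pgvector is already installed, a standard catalog query will tell you:

SELECT extname, extversion
FROM pg_extension
WHERE extname = 'vector';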

Then we can add a vector column for our review embeddings, with 1,024 dimensions (the size of e5-large embeddings); whichever model you choose, the column dimension must match the size of the vectors it outputs:

ALTER TABLE pgml.amazon_us_reviews
ADD COLUMN review_embedding_e5_large vector(1024);
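
Embedding sizes vary by model, so before committing to a column size it's worth confirming what your model actually emits. Assuming pgml.embed returns a native Postgres array, a quick check is:

SELECT array_length(
    pgml.embed('Alibaba-NLP/gte-base-en-v1.5', 'passage: test'),
    1
) AS dimensions;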

It's best practice to keep running queries on a production database relatively short, so rather than trying to update all 5M rows in one multi-hour query, we should write a function to issue the updates in smaller batches. To make iterating over the rows easier and more efficient, we'll add an id column with an index to our table:

ALTER TABLE pgml.amazon_us_reviews
ADD COLUMN id SERIAL PRIMARY KEY;

Every language/framework/codebase has its own preferred method for backfilling data in a table. The two most important considerations are:

  1. Keep the number of rows per query small enough that the queries take less than a second
  2. More concurrency will get the job done faster, but keep in mind the other workloads on your database

Here's an example of a very simple back-fill job implemented in pure PL/pgSQL, but I'd also love to see example PRs opened with your techniques in your language of choice for tasks like this.

DO $$
BEGIN
    FOR i IN 1..(SELECT max(id) FROM pgml.amazon_us_reviews) BY 10 LOOP
        RAISE NOTICE 'updating % to %', i, i + 9;

        -- Batches of 10 rows; the IS NULL guard skips rows that are already
        -- done, so the job can be safely stopped and restarted.
        UPDATE pgml.amazon_us_reviews
        SET review_embedding_e5_large = pgml.embed(
            'Alibaba-NLP/gte-base-en-v1.5',
            'passage: ' || review_body
        )
        WHERE id BETWEEN i AND i + 9
        AND review_embedding_e5_large IS NULL;

        -- Commit each batch so progress is durable and locks are released.
        COMMIT;
    END LOOP;
END;
$$;
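
To get the 4x concurrency discussed above on a machine dedicated to this job, one simple approach is to give each worker a disjoint slice of the id space and run the same statement from separate connections. This is a hedged sketch, not a PostgresML feature; it shows one of four workers, with the others using remainders 1, 2 and 3:

-- Worker 0 of 4: handles ids where id % 4 = 0
UPDATE pgml.amazon_us_reviews
SET review_embedding_e5_large = pgml.embed(
    'Alibaba-NLP/gte-base-en-v1.5',
    'passage: ' || review_body
)
WHERE id % 4 = 0
AND review_embedding_e5_large IS NULL
AND id BETWEEN 1 AND 10000; -- still batch by id range to keep each query short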

What's next?

That's it for now. We've got an Amazon-scale table with state-of-the-art machine learning embeddings. As a premature optimization, we'll go ahead and build an index on our new column to make our future vector similarity queries faster. The lists = 2000 setting below follows the pgvector guidance of roughly sqrt(rows) for tables with more than a million rows. For the full documentation on vector indexes in Postgres, see the pgvector docs.

CREATE INDEX CONCURRENTLY index_amazon_us_reviews_on_review_embedding_e5_large
ON pgml.amazon_us_reviews
USING ivfflat (review_embedding_e5_large vector_cosine_ops)
WITH (lists = 2000);
timer 4068909.269 ms (01:07:48.909)
Tip: Create indexes CONCURRENTLY to avoid locking your table for other queries.

Building a vector index on a table with this many entries takes a while, so this is a good time to take a coffee break. In the next article we'll look at how to query these embeddings to find the best products and make personalized recommendations for users. We'll also cover updating an index in real time as new data comes in.