Montana Low
May 3, 2023
PostgresML makes it easy to generate embeddings using open source models from Huggingface and perform complex queries with vector indexes and application data unlike any other database. The full expressive power of SQL as a query language is available to seamlessly combine semantic, geospatial, and full text search, along with filtering, boosting, aggregation, and ML reranking in low latency use cases. You can do all of this faster, simpler and with higher quality compared to applications built on disjoint APIs like OpenAI + Pinecone. Prove the results in this series to your own satisfaction, for free, by signing up for a GPU accelerated database.
This article is the third in a multipart series that will show you how to build a post-modern semantic search and recommendation engine, including personalization, using open source models. You may want to start with the previous articles in the series if you aren't familiar with PostgresML's capabilities.
- Generating LLM Embeddings with HuggingFace models
- Tuning vector recall with pgvector
- Personalizing embedding results with application data
- Optimizing semantic results with an XGBoost ranking model - coming soon!
In the era of big data and advanced machine learning algorithms, personalization has become a critical component in many modern technologies. One application of personalization is in search and recommendation systems, where the goal is to provide users with relevant and personalized experiences. Embedding vectors have become a popular tool for achieving this goal, as they can represent items and users in a compact and meaningful way. However, standard embedding vectors have limitations, as they do not take into account the unique preferences and behaviors of individual users. To address this, a promising approach is to use aggregates of user data to personalize embedding vectors. This article will explore the concept of using aggregates to create new embedding vectors and provide a step-by-step guide to implementation.
We'll continue working with the same dataset from the previous articles. 5M+ customer reviews about movies from amazon over a decade. We've already generated embeddings for each review, and aggregated them to build a consensus view of the reviews for each movie. You'll recall that our reviews also include a customer_id as well.
content_copy
\d pgml.amazon_us_reviews
Column |
Type |
Collation |
Nullable |
Default |
marketplace |
text |
|
|
|
customer_id |
text |
|
|
|
review_id |
text |
|
|
|
product_id |
text |
|
|
|
product_parent |
text |
|
|
|
product_title |
text |
|
|
|
product_category |
text |
|
|
|
star_rating |
integer |
|
|
|
helpful_votes |
integer |
|
|
|
total_votes |
integer |
|
|
|
vine |
bigint |
|
|
|
verified_purchase |
bigint |
|
|
|
review_headline |
text |
|
|
|
review_body |
text |
|
|
|
review_date |
text |
|
|
|
In the previous article, we saw that we could aggregate all the review embeddings to create a consensus view of each movie. Now we can take that a step further, and aggregate all the movie embeddings that each customer has reviewed, to create an embedding for every customer in terms of the movies they've reviewed. We're not going to worry about if they liked the movie or not just yet based on their star rating. Simply the fact that they've chosen to review a movie indicates they chose to purchase the DVD, and reveals something about their preferences. It's always easy to create more tables and indexes related to other tables in our database.
content_copy
CREATE TABLE customers AS
SELECT
customer_id AS id,
count(*) AS total_reviews,
avg(star_rating) AS star_rating_avg,
pgml.sum(movies.review_embedding_e5_large)::vector(1024) AS movie_embedding_e5_large
FROM pgml.amazon_us_reviews
JOIN movies
ON movies.id = amazon_us_reviews.product_id
GROUP BY customer_id;
timer
458838.918 ms (07:38.839)
We've just created a table aggregating our 5M+ reviews into 2M+ customers, with mostly vanilla SQL. The query includes a JOIN between the pgml.amazon_us_reviews
we started with, and the movies
table we created to hold the movie embeddings. We're using pgml.sum()
again, this time to sum up all the movies a customer has reviewed, to create an embedding for the customer. We will want to be able to quickly recall a customers embedding by their ID whenever they visit the site, so we'll create a standard Postgres index on their ID. This isn't just a vector database, it's a full AI application database.
content_copy
CREATE INDEX customers_id_idx ON customers (id);
timer
2709.506 ms (00:02.710)
content_copy
CREATE INDEX
Now we can incorporate a customer embedding to personalize the results whenever they search. Normally, we'd have the customers.id
in our application already because they'd be searching and browsing our site, but we don't have an actual application or customers for this article, so we'll have to find one for our example. Let's find a customer that loves the movie Empire Strikes Back. No Star Wars made our original list, so we have a good opportunity to improve our previous results with personalization.
Now that we have customer embeddings around movies they've reviewed, we can incorporate those to personalize the results whenever they search. Normally, we'd have the customers.id
handy in our application because they'd be searching and browsing our app, but we don't have an actual application or customers for this article, so we'll have to find one for our example. Let's find a customer that loves the movie "Empire Strikes Back". No "Star Wars" made our original list of "Best 1980's scifi movie", so we have a good opportunity to improve our previous results with personalization.
We can find a customer that our embeddings model feels is close to the sentiment "I love all Star Wars, but Empire Strikes Back is particularly amazing". Keep in mind, we didn't want to take the time to build a vector index for queries against the customers table, so this is going to be slower than it could be, but that's fine because it's just a one-off exploration, not some frequently executed query in our application. We can still do vector searches, just without the speed boost an index provides.
content_copy
WITH request AS (
SELECT pgml.embed(
'Alibaba-NLP/gte-base-en-v1.5',
'query: I love all Star Wars, but Empire Strikes Back is particularly amazing'
)::vector(1024) AS embedding
)
SELECT
id,
total_reviews,
star_rating_avg,
1 - (
movie_embedding_e5_large <=> (SELECT embedding FROM request)
) AS cosine_similarity
FROM customers
ORDER BY cosine_similarity DESC
LIMIT 1;
timer
9098.883 ms (00:09.099)
id |
total_reviews |
star_rating_avg |
cosine_similarity |
44366773 |
1 |
2.0000000000000000 |
0.8831349398621555 |
Searching without indexes is slower (9s), but creating a vector index can take a very long time (remember indexing all the reviews took more than an hour). For frequently executed application queries, we always want to make sure we have at least 1 index available to improve speed. Anyway, it turns out we have a customer with a very similar embedding to our desired personalization. Semantic search is wonderfully powerful. Once you've generated embeddings, you can find all the things that are similar to other things, even if they don't share any of the same words. Whether this customer has actually ever even seen Star Wars, the model thinks their embedding is pretty close to a review like that...
It turns out we have a customer with a very similar embedding to our desired personalization. Semantic search is wonderfully powerful. Once you've generated embeddings, you can find all the things that are similar to other things, even if they don't share any of the same words. Whether this customer has actually ever even seen Star Wars, the model thinks their embedding is pretty close to a review like that... They seem a little picky though with 2-star rating average. I'm curious what the 1 review they've actually written looks like:
content_copy
SELECT product_title, star_rating, review_body
FROM pgml.amazon_us_reviews
WHERE customer_id = '44366773';
timer
25156.945 ms (00:25.157)
product_title |
star_rating |
review_body |
Star Wars, Episode V: The Empire Strikes Back (Widescreen Edition) |
2 |
The item was listed as new. The box was opened and had damage to the outside. |
This is odd at first glance. The review doesn't mention anything thing about Star Wars, and the sentiment is actually negative, even the star_rating
is bad. How did they end up with an embedding so close to our desired sentiment of "I love all Star Wars, but Empire Strikes Back is particularly amazing"? Remember we didn't generate embeddings from their review text directly. We generated customer embeddings from the movies they had bothered to review. This customer has only ever reviewed 1 movie, and that happens to be the movie closest to our sentiment. Exactly what we were going for!
If someone only ever bothered to write 1 review, and they are upset about the physical DVD, it's likely they are a big fan of the movie, and they are upset about the physical DVD because they wanted to keep it for a long time. This is a great example of how stacking and relating embeddings carefully can generate insights at a scale that is otherwise impossible, revealing the signal in the noise.
Now we can write our personalized SQL query. It's nearly the same as our query from the previous article, but we're going to include an additional CTE to fetch the customers embedding by id, and then tweak our final_score
. Here comes personalized query results, using that customer 44366773's embedding. Instead of the generic popularity boost we've been using, we'll calculate the cosine similarity of the customer embedding to all the movies in the results, and use that as a boost. This will push movies that are similar to the customer's embedding to the top of the results.
Now we can write our personalized SQL query. It's nearly the same as our query from the previous article, but we're going to include an additional CTE to fetch the customers embedding by id, and then tweak our final_score
. Instead of the generic popularity boost we've been using, we'll calculate the cosine similarity of the customer embedding to all the movies in the results, and use that as a boost. This will push movies that are similar to the customer's embedding to the top of the results. Here comes personalized query results, using that customer 44366773's embedding:
content_copy
-- create a request embedding on the fly
WITH request AS (
SELECT pgml.embed(
'Alibaba-NLP/gte-base-en-v1.5',
'query: Best 1980''s scifi movie'
)::vector(1024) AS embedding
),
-- retrieve the customers embedding by id
customer AS (
SELECT movie_embedding_e5_large AS embedding
FROM customers
WHERE id = '44366773'
),
-- vector similarity search for movies and calculate a customer_cosine_similarity at the same time
first_pass AS (
SELECT
title,
total_reviews,
star_rating_avg,
1 - (
review_embedding_e5_large <=> (SELECT embedding FROM request)
) AS request_cosine_similarity,
(1 - (
review_embedding_e5_large <=> (SELECT embedding FROM customer)
) - 0.9) * 10 AS customer_cosine_similarity,
star_rating_avg / 5 AS star_rating_score
FROM movies
WHERE total_reviews > 10
ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
LIMIT 1000
)
-- grab the top 10 results, re-ranked using a combination of request similarity and customer similarity
SELECT
title,
total_reviews,
round(star_rating_avg, 2) as star_rating_avg,
star_rating_score,
request_cosine_similarity,
customer_cosine_similarity,
request_cosine_similarity + customer_cosine_similarity + star_rating_score AS final_score
FROM first_pass
ORDER BY final_score DESC
LIMIT 10;
timer
127.639 ms (00:00.128)
title |
total_reviews |
star_rating_avg |
star_rating_score |
request_cosine_similarity |
customer_cosine_similarity |
final_score |
Star Wars, Episode V: The Empire Strikes Back (Widescreen Edition) |
78 |
4.44 |
0.88717948717948718000 |
0.8295302273865711 |
0.9999999999999998 |
2.716709714566058 |
Star Wars, Episode IV: A New Hope (Widescreen Edition) |
80 |
4.36 |
0.87250000000000000000 |
0.8339361274771777 |
0.9336656923446551 |
2.640101819821833 |
Forbidden Planet (Two-Disc 50th Anniversary Edition) |
255 |
4.82 |
0.96392156862745098000 |
0.8577616472530644 |
0.6676592605840725 |
2.489342476464588 |
The Day the Earth Stood Still |
589 |
4.76 |
0.95212224108658744000 |
0.8555529952535671 |
0.6733939449212423 |
2.4810691812613967 |
Forbidden Planet [Blu-ray] |
223 |
4.79 |
0.95874439461883408000 |
0.8479982398847651 |
0.6536320269646467 |
2.4603746614682462 |
John Carter (Four-Disc Combo: Blu-ray 3D/Blu-ray/DVD + Digital Copy) |
559 |
4.65 |
0.93059033989266548000 |
0.8338600628541288 |
0.6700415876545052 |
2.4344919904012996 |
The Terminator |
430 |
4.59 |
0.91813953488372094000 |
0.8428833221752442 |
0.6638043064287047 |
2.4248271634876697 |
The Day the Earth Stood Still (Two-Disc Special Edition) |
37 |
4.57 |
0.91351351351351352000 |
0.8419118958433142 |
0.6636373066510914 |
2.419062716007919 |
The Thing from Another World |
501 |
4.71 |
0.94291417165668662000 |
0.8511107698234265 |
0.6231913893834695 |
2.4172163308635826 |
The War of the Worlds (Special Collector's Edition) |
171 |
4.67 |
0.93333333333333334000 |
0.8460163011246516 |
0.6371641286728591 |
2.416513763130844 |
Bingo. Now we're boosting movies by (customer_cosine_similarity - 0.9) * 10
, and we've kept our previous boost for movies with a high average star rating. Not only does Episode V top the list as expected, Episode IV is a close second. This query has gotten fairly complex! But the results are perfect for me, I mean our hypothetical customer who is searching for "Best 1980's scifi movie" but has already revealed to us with their one movie review that they think like the comment "I love all Star Wars, but Empire Strikes Back is particularly amazing". I promise I'm not just doing all of this to find a new movie to watch tonight.
You can compare this to our non-personalized results from the previous article for reference Forbidden Planet used to be the top result, but now it's #3.
title |
total_reviews |
star_rating_avg |
final_score |
star_rating_score |
cosine_similarity |
Forbidden Planet (Two-Disc 50th Anniversary Edition) |
255 |
4.82 |
1.8216832158805154 |
0.96392156862745098000 |
0.8577616472530644 |
Back to the Future |
31 |
4.94 |
1.82090702765472 |
0.98709677419354838000 |
0.8338102534611714 |
Warning Sign |
17 |
4.82 |
1.8136734057737756 |
0.96470588235294118000 |
0.8489675234208343 |
Plan 9 From Outer Space/Robot Monster |
13 |
4.92 |
1.8126103400815046 |
0.98461538461538462000 |
0.8279949554661198 |
Blade Runner: The Final Cut (BD) [Blu-ray] |
11 |
4.82 |
1.8120690455673043 |
0.96363636363636364000 |
0.8484326819309408 |
The Day the Earth Stood Still |
589 |
4.76 |
1.8076752363401547 |
0.95212224108658744000 |
0.8555529952535671 |
Forbidden Planet [Blu-ray] |
223 |
4.79 |
1.8067426345035993 |
0.95874439461883408000 |
0.8479982398847651 |
Aliens (Special Edition) |
25 |
4.76 |
1.803194119705901 |
0.95200000000000000000 |
0.851194119705901 |
Night of the Comet |
22 |
4.82 |
1.802469182369724 |
0.96363636363636364000 |
0.8388328187333605 |
Forbidden Planet |
19 |
4.68 |
1.795573710000297 |
0.93684210526315790000 |
0.8587316047371392 |
timer
124.119 ms
Big improvement! We're doing a lot now to achieve filtering, boosting, and personalized re-ranking, but you'll notice that this extra work only takes a couple more milliseconds in PostgresML. Remember in the previous article when took over 100ms to just retrieve 5 embedding vectors in no particular order. All this embedding magic is pretty much free when it's done inside the database. Imagine how slow a service would be if it had to load 1000 embedding vectors (not 5) like our similarity search is doing, and then passing those to some HTTP API where some ML black box lives, and then fetching a different customer embedding from a different database, and then trying to combine that with the thousand results from the first query... This is why machine learning microservices break down at scale, and it's what makes PostgresML one step ahead of less mature vector databases.
We've got personalized results now, but (... - 0.9) * 10
is a bit of a hack I used to scale the personalization score to have a larger impact on the final score. Hacks and heuristics are frequently injected like this when a Product Manager tells an engineer to "just make it work", but oh no! Back To The Future is now nowhere to be found on my personalized list. We can do better! Those magic numbers are intended to optimize something our Product Manager is going for as a business metric. There's a way out of infinite customer complaints and one off hacks like this, and it's called machine learning.
Finding the optimal set of magic numbers that "just make it work" is what modern machine learning is all about from one point of view. In the next article, we'll look at building a real personalized ranking model using XGBoost on top of our personalized embeddings, that predicts how our customer will rate a movie on our 5-star review scale. Then we can rank results based on a much more sophisticated model's predicted star rating score instead of just using cosine similarity and made up numbers. With all the savings we're accruing in terms of latency and infrastructure simplicity, our ability to layer additional models, refinements and techniques will put us another step ahead of the alternatives.