Semantic Search Using Instructor Model

This tutorial demonstrates using the pgml SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished. It uses hkunlp/instructor-base, a more advanced embedding model that accepts an instruction parameter both when embedding documents and when embedding queries at recall time.

Link to full JavaScript implementation

Link to full Python implementation
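
Concretely, "takes parameters" means every embedding call is paired with a natural-language instruction. This tutorial uses one instruction when embedding stored documents and a different one when embedding questions at recall time; both strings are shown in the small sketch below (the constant names are only illustrative) and reappear in the pipeline and query code later on.

Python
# The two instructions used throughout this tutorial. They are passed to the
# model via the "parameters" field of the pipeline (for documents) and of the
# vector search query (for questions). The constant names are illustrative.
DOCUMENT_INSTRUCTION = "Represent the Wikipedia document for retrieval: "
QUERY_INSTRUCTION = "Represent the Wikipedia question for retrieving supporting documents: "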

Imports and Setup

The SDK is imported and environment variables are loaded.

JavaScript
const pgml = require("pgml");
require("dotenv").config();

Python
from pgml import Collection, Pipeline
from datasets import load_dataset
from time import time
from dotenv import load_dotenv
from rich.console import Console
import asyncio
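
The SDK connects to PostgresML using a connection string taken from the environment, which is why dotenv is loaded before anything else. A minimal sketch of checking that setup, assuming the variable is named PGML_DATABASE_URL and is defined in a local .env file:

Python
# Minimal sketch (assumption): confirm the connection string is available
# before creating any collections. The pgml SDK is assumed to read
# PGML_DATABASE_URL from the environment; load_dotenv() pulls it in from .env.
import os
from dotenv import load_dotenv

load_dotenv()
assert os.environ.get("PGML_DATABASE_URL"), "Set PGML_DATABASE_URL in your .env file"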

Initialize Collection

A collection object is created to represent the search collection.

JavaScript
const main = async () => { // Open the main function, we close it at the bottom
  // Initialize the collection
  const collection = pgml.newCollection("qa_collection");

Python
async def main(): # Start the main function, we end it after archiving
    load_dotenv()
    console = Console()
    # Initialize collection
    collection = Collection("squad_collection")

Create Pipeline

A pipeline encapsulating a splitter and the instructor embedding model is created and added to the collection. The instruction used when embedding stored documents is passed through the pipeline's parameters field.

JavaScript
  // Add a pipeline
  const pipeline = pgml.newPipeline("qa_pipeline", {
    text: {
      splitter: { model: "recursive_character" },
      semantic_search: {
        model: "hkunlp/instructor-base",
        parameters: {
          instruction: "Represent the Wikipedia document for retrieval: ",
        },
      },
    },
  });
  await collection.add_pipeline(pipeline);

Python
    # Create and add pipeline
    pipeline = Pipeline(
        "squadv1",
        {
            "text": {
                "splitter": {"model": "recursive_character"},
                "semantic_search": {
                    "model": "hkunlp/instructor-base",
                    "parameters": {
                        "instruction": "Represent the Wikipedia document for retrieval: "
                    },
                },
            }
        },
    )
    await collection.add_pipeline(pipeline)

Upsert Documents

Documents are upserted into the collection and automatically split into chunks and embedded by the pipeline. The JavaScript example upserts two small hand-written documents; the Python example loads the SQuAD dataset, drops duplicate contexts, and upserts the first 200 of them.

JavaScript
  // Upsert documents, these documents are automatically split into chunks and embedded by our pipeline
  const documents = [
    {
      id: "Document One",
      text: "PostgresML is the best tool for machine learning applications!",
    },
    {
      id: "Document Two",
      text: "PostgresML is open source and available to everyone!",
    },
  ];
  await collection.upsert_documents(documents);

Python
    # Prep documents for upserting
    data = load_dataset("squad", split="train")
    data = data.to_pandas()
    data = data.drop_duplicates(subset=["context"])
    documents = [
        {"id": r["id"], "text": r["context"], "title": r["title"]}
        for r in data.to_dict(orient="records")
    ]
    # Upsert documents
    await collection.upsert_documents(documents[:200])
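
The Python example stops at the first 200 documents to keep the run short. To index more of the dataset, one option is simply to reuse the same upsert_documents call over slices of the list, as in this sketch (the batch size is arbitrary):

Python
    # Hedged sketch (not part of the tutorial): upsert the full de-duplicated
    # dataset in slices, reusing only the upsert_documents call shown above.
    batch_size = 200
    for start in range(0, len(documents), batch_size):
        await collection.upsert_documents(documents[start : start + batch_size])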

Query

A vector similarity search is run against the collection. Because the instructor model is instruction-tuned, the query passes its own recall-time instruction through the parameters field.

JavaScript
  // Perform vector search
  const query = "What is the best tool for building machine learning applications?";
  const queryResults = await collection.vector_search(
    {
      query: {
        fields: {
          text: {
            query: query,
            parameters: {
              instruction: "Represent the Wikipedia question for retrieving supporting documents: ",
            },
          },
        },
      },
      limit: 1,
    },
    pipeline,
  );
  console.log(queryResults);

Python
    # Query for answer
    query = "Who won more than 20 grammy awards?"
    console.print("Querying for context ...")
    start = time()
    results = await collection.vector_search(
        {
            "query": {
                "fields": {
                    "text": {
                        "query": query,
                        "parameters": {
                            "instruction": "Represent the Wikipedia question for retrieving supporting documents: "
                        },
                    },
                }
            },
            "limit": 5,
        },
        pipeline,
    )
    end = time()
    console.print("\n Results for '%s' " % (query), style="bold")
    console.print(results)
    console.print("Query time = %0.3f" % (end - start))

Archive Collection

The collection is archived when finished.

JavaScript
  await collection.archive();
}; // Close the main function

Python
    await collection.archive()
    # The end of the main function

Main

Boilerplate to call the main() async function.

JavaScript
main().then(() => console.log("Done!"));

Python
if __name__ == "__main__":
    asyncio.run(main())