Semantic Search
This tutorial shows how to use the pgml SDK to create a collection, add a pipeline for vector search, upsert documents, run a sample query, and archive the collection when finished.
Link to full JavaScript implementation
Link to full Python implementation
Imports and Setup
The SDK is imported and environment variables are loaded.
const pgml = require("pgml");
require("dotenv").config();
from pgml import Collection, Pipeline
from datasets import load_dataset
from time import time
from dotenv import load_dotenv
from rich.console import Console
import asyncio
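The environment is loaded first because the SDK needs a database connection string before it can do anything. As a minimal sketch, a script can check for that variable up front and fail with a clear message. The helper `require_env` is hypothetical, and the variable name `DATABASE_URL` is an assumption used for illustration; check the SDK documentation for the name your version actually reads.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper, not part of the pgml SDK.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell"
        )
    return value


if __name__ == "__main__":
    # DATABASE_URL is an assumed variable name, used here for illustration only.
    try:
        require_env("DATABASE_URL")
        print("Connection string found.")
    except RuntimeError as err:
        print(f"Configuration problem: {err}")
```

Calling a check like this right after `load_dotenv()` surfaces a missing setting before any collection or pipeline call is made.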
Initialize Collection
A collection object is created to represent the search collection.
const main = async () => { // Open the main function; it is closed at the bottom
  // Initialize the collection
  const collection = pgml.newCollection("semantic_search_collection");
async def main():  # Start the main function; it ends after archiving
    load_dotenv()
    console = Console()
    # Initialize collection
    collection = Collection("quora_collection")
Create Pipeline
A pipeline encapsulating a model and splitter is created and added to the collection.
  // Add a pipeline
  const pipeline = pgml.newPipeline("semantic_search_pipeline", {
    text: {
      splitter: { model: "recursive_character" },
      semantic_search: {
        model: "intfloat/e5-small",
      },
    },
  });
  await collection.add_pipeline(pipeline);
    # Create and add pipeline
    pipeline = Pipeline(
        "quorav1",
        {
            "text": {
                "splitter": {"model": "recursive_character"},
                "semantic_search": {"model": "intfloat/e5-small"},
            }
        },
    )
    await collection.add_pipeline(pipeline)
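The splitter breaks each document's text into chunks before the model embeds them. The sketch below illustrates the general idea behind recursive character splitting: try a coarse separator first, then fall back to finer ones for pieces that are still too long. It is a simplified illustration and not the SDK's implementation; `split_recursive`, `max_chars`, and the separator list are all hypothetical.

```python
from typing import List, Sequence


def split_recursive(
    text: str,
    max_chars: int = 40,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split text into chunks of at most max_chars characters.

    Separators that fall on a chunk boundary are dropped.
    """
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every max_chars characters.
        return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

    head, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(head):
        candidate = f"{current}{head}{piece}" if current else piece
        if len(candidate) <= max_chars:
            # Keep merging pieces while the chunk still fits.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(split_recursive(piece, max_chars, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = (
        "First sentence here. Second sentence follows. "
        "Third one is a bit longer than the others."
    )
    for chunk in split_recursive(sample, max_chars=40):
        print(repr(chunk))
```

Each chunk is embedded separately, so chunk size affects how precisely a query can match a passage in a long document.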
Upsert Documents
Documents are upserted into the collection and indexed by the pipeline.
  // Upsert documents. The pipeline automatically splits them into chunks and embeds them.
  const documents = [
    {
      id: "Document One",
      text: "document one contents...",
    },
    {
      id: "Document Two",
      text: "document two contents...",
    },
  ];
  await collection.upsert_documents(documents);
    # Prep documents for upserting
    dataset = load_dataset("quora", split="train")
    questions = []
    for record in dataset["questions"]:
        questions.extend(record["text"])
    # Remove duplicates and add id
    documents = []
    for i, question in enumerate(list(set(questions))):
        if question:
            documents.append({"id": i, "text": question})
    # Upsert documents
    await collection.upsert_documents(documents[:2000])
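Iterating over a set of strings does not guarantee the same order in every Python process, so the enumerated ids above can attach to different questions from one run to the next. Assuming upserts match documents by `id`, re-running the script could then overwrite a document with unrelated text. A minimal sketch of an order-preserving alternative, using a hypothetical helper `build_documents`:

```python
from typing import Dict, Iterable, List, Union

Document = Dict[str, Union[int, str]]


def build_documents(questions: Iterable[str]) -> List[Document]:
    """Drop empty and duplicate questions, keeping first-seen order."""
    # dict.fromkeys preserves insertion order, unlike set().
    unique = dict.fromkeys(q for q in questions if q)
    return [{"id": i, "text": text} for i, text in enumerate(unique)]


if __name__ == "__main__":
    for doc in build_documents(["b", "a", "", "b", "c"]):
        print(doc)
```

With stable ids, the same question maps to the same document on every run, as long as the input order itself does not change.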
Query
A vector similarity search query is made on the collection.
  // Perform vector search
  const query = "Something that will match document one first";
  const queryResults = await collection.vector_search(
    {
      query: {
        fields: {
          text: { query: query },
        },
      },
      limit: 2,
    },
    pipeline
  );
  console.log("The results");
  console.log(queryResults);
    # Query
    query = "What is a good mobile os?"
    console.print("Querying for %s..." % query)
    start = time()
    results = await collection.vector_search(
        {"query": {"fields": {"text": {"query": query}}}, "limit": 5}, pipeline
    )
    end = time()
    console.print("\n Results for '%s' " % (query), style="bold")
    console.print(results)
    console.print("Query time = %0.3f" % (end - start))
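Conceptually, a vector search embeds the query with the pipeline's model and returns the chunks whose embeddings are most similar to it. The sketch below shows only that ranking step, using cosine similarity on hand-written toy vectors. It is an illustration and not the SDK's implementation; the vectors, `cosine_similarity`, and `top_k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunks: Dict[str, Sequence[float]],
    limit: int,
) -> List[Tuple[str, float]]:
    """Return the `limit` chunks most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


if __name__ == "__main__":
    # Toy three-dimensional "embeddings"; real models produce hundreds of dimensions.
    toy_chunks = {
        "android vs ios": [0.9, 0.1, 0.0],
        "best pizza": [0.0, 0.2, 0.9],
        "phone battery": [0.6, 0.5, 0.1],
    }
    for text, score in top_k([1.0, 0.0, 0.0], toy_chunks, limit=2):
        print(f"{score:.3f}  {text}")
```

The `limit` value in the tutorial's query plays the same role as `limit` here: it caps how many of the best-scoring chunks are returned.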
Archive Collection
The collection is archived when finished.
  await collection.archive();
}; // Close the main function
    await collection.archive()
# The end of the main function
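If an earlier step raises, an `archive()` call placed at the end of `main()` is never reached. One option, when the collection should always be cleaned up, is to wrap the work in `try`/`finally`. The sketch below uses a hypothetical stand-in class, `FakeCollection`, so it runs without a database; only the `archive()` method name comes from the tutorial.

```python
import asyncio


class FakeCollection:
    """Hypothetical stand-in for a collection; records whether it was archived."""

    def __init__(self) -> None:
        self.archived = False

    async def archive(self) -> None:
        self.archived = True


async def run_with_cleanup(collection: FakeCollection, fail: bool = False) -> None:
    try:
        if fail:
            # Simulate a failure in an earlier step such as upserting or querying.
            raise RuntimeError("simulated failure")
    finally:
        # Runs whether or not the body raised.
        await collection.archive()


if __name__ == "__main__":
    collection = FakeCollection()
    asyncio.run(run_with_cleanup(collection))
    print("archived:", collection.archived)
```

Whether to archive after a failure is a judgment call: keeping the collection around can help with debugging, while archiving keeps repeated tutorial runs from piling up.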
Main
Boilerplate to call the async main() function.
main().then(() => console.log("Done!"));
if __name__ == "__main__":
    asyncio.run(main())