Semantic Search Using Instructor Model

This tutorial demonstrates using the pgml SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished. It uses hkunlp/instructor-base, an instruction-tuned embedding model that accepts an instruction parameter at both embedding and recall time.

Link to full JavaScript implementation

Link to full Python implementation

Imports and Setup

The SDK is imported and environment variables are loaded.

JavaScript

// Load environment variables, then the pgml SDK
require("dotenv").config();
const pgml = require("pgml");

Python
from pgml import Collection, Pipeline
from datasets import load_dataset
from time import time
from dotenv import load_dotenv
from rich.console import Console
import asyncio
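The SDK reads the PostgresML connection string from the environment; the dotenv imports above load it from a local `.env` file, typically via the `PGML_DATABASE_URL` variable. A sketch of that file (the host, credentials, and database name below are placeholders, not real values):

```
# .env — replace with your own PostgresML connection string
PGML_DATABASE_URL=postgres://user:password@localhost:5432/pgml_database
```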

Initialize Collection

A collection object is created to represent the search collection.

JavaScript

const main = async () => { // Open the main function; we close it at the bottom
  // Initialize the collection
  const collection = pgml.newCollection("qa_collection");

Python

async def main(): # Start the main function; we end it after archiving
    load_dotenv()
    console = Console()
    # Initialize collection
    collection = Collection("squad_collection")

Create Pipeline

A pipeline encapsulating a model and splitter is created and added to the collection.

JavaScript

// Add a pipeline
const pipeline = pgml.newPipeline("qa_pipeline", {
  text: {
    splitter: { model: "recursive_character" },
    semantic_search: {
      model: "hkunlp/instructor-base",
      parameters: {
        instruction: "Represent the document for retrieval: ",
      },
    },
  },
});
await collection.add_pipeline(pipeline);

Python

# Create and add pipeline
pipeline = Pipeline(
    "squad_pipeline",
    {
        "text": {
            "splitter": {"model": "recursive_character"},
            "semantic_search": {
                "model": "hkunlp/instructor-base",
                "parameters": {
                    "instruction": "Represent the Wikipedia document for retrieval: "
                },
            },
        }
    },
)
await collection.add_pipeline(pipeline)
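The `instruction` parameter is what sets instructor-style models apart from plain embedding models: the instruction is consumed together with the text, steering the embedding toward a task, so documents and queries can use different instructions yet land in a shared vector space. A toy sketch of the input-pairing idea only (`embed_input` is a made-up helper, not part of the SDK or the model):

```python
def embed_input(instruction: str, text: str) -> str:
    # Instructor-style models condition on instruction + text together;
    # this shows only the input pairing, not the actual model math.
    return instruction + text

# Documents and queries get different (asymmetric) instructions
doc = embed_input("Represent the Wikipedia document for retrieval: ",
                  "Beyoncé rose to fame in the late 1990s.")
query = embed_input("Represent the Wikipedia question for retrieving supporting documents: ",
                    "Who won more than 20 grammy awards?")
```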

Upsert Documents

Documents are upserted into the collection and indexed by the pipeline.

JavaScript

// Upsert documents; they are automatically split into chunks and embedded by our pipeline
const documents = [
  {
    id: "Document One",
    text: "PostgresML is the best tool for machine learning applications!",
  },
  {
    id: "Document Two",
    text: "PostgresML is open source and available to everyone!",
  },
];
await collection.upsert_documents(documents);

Python

# Prep documents for upserting
data = load_dataset("squad", split="train")
data = data.to_pandas()
data = data.drop_duplicates(subset=["context"])
documents = [
    {"id": r["id"], "text": r["context"], "title": r["title"]}
    for r in data.to_dict(orient="records")
]
# Upsert documents
await collection.upsert_documents(documents[:200])
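The prep step above boils down to two operations: drop rows that share a context, then reshape each row into the document dict the SDK expects. A minimal pure-Python sketch of that dedup-and-reshape pattern (the sample rows are made up for illustration):

```python
rows = [
    {"id": "1", "context": "PostgresML is open source.", "title": "PostgresML"},
    {"id": "2", "context": "PostgresML is open source.", "title": "PostgresML"},  # duplicate context
    {"id": "3", "context": "Embeddings map text to vectors.", "title": "Embeddings"},
]

# Drop duplicates by "context", keeping the first occurrence
# (the same behavior as pandas drop_duplicates(subset=["context"]))
seen = set()
deduped = []
for r in rows:
    if r["context"] not in seen:
        seen.add(r["context"])
        deduped.append(r)

# Reshape each row into the document dict passed to upsert_documents
documents = [{"id": r["id"], "text": r["context"], "title": r["title"]} for r in deduped]
print(len(documents))  # → 2
```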


Query

A vector similarity search query is run against the collection.

JavaScript

// Perform vector search
const query = "What is the best tool for building machine learning applications?";
const queryResults = await collection.vector_search(
  {
    query: {
      fields: {
        text: {
          query: query,
          parameters: { instruction: "Represent the question for retrieving supporting documents: " },
        },
      },
    },
    limit: 1,
  },
  pipeline,
);

Python

# Query for answer
query = "Who won more than 20 grammy awards?"
console.print("Querying for context ...")
start = time()
results = await collection.vector_search(
    {
        "query": {
            "fields": {
                "text": {
                    "query": query,
                    "parameters": {
                        "instruction": "Represent the Wikipedia question for retrieving supporting documents: "
                    },
                }
            }
        },
        "limit": 5,
    },
    pipeline,
)
end = time()
console.print("\n Results for '%s' " % (query), style="bold")
console.print("Query time = %0.3f" % (end - start))
console.print(results)

Archive Collection

The collection is archived when finished.

JavaScript

  await collection.archive();
}; // Close the main function

Python

    await collection.archive()
    # The end of the main function


Boilerplate to call the async main() function.

JavaScript

main().then(() => console.log("Done!"));

Python

if __name__ == "__main__":
    asyncio.run(main())