Pipelines

Pipelines are composed of a Model, a Splitter, and additional optional arguments. Collections can have any number of Pipelines. Each Pipeline runs every time documents are upserted.
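
To make the moving pieces concrete, here is a minimal end-to-end sketch in Python. The collection name and document contents are hypothetical, and it assumes the pgml package is installed and the SDK can reach your database via its connection-string environment variable:

Python:

import asyncio
from pgml import Collection, Model, Splitter, Pipeline

async def main():
    collection = Collection("test_collection")
    model = Model()        # defaults to "intfloat/e5-small"
    splitter = Splitter()  # defaults to "recursive_character"
    pipeline = Pipeline("test_pipeline", model, splitter)
    await collection.add_pipeline(pipeline)

    # Upserting documents triggers every Pipeline on the Collection
    await collection.upsert_documents([
        {"id": "document_one", "text": "PostgresML is an extension for PostgreSQL."}
    ])

asyncio.run(main())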

Models

Models are used to embed chunked documents. We support nearly every open source model on Hugging Face, as well as OpenAI's embedding models.

Create a default Model, "intfloat/e5-small", with default parameters ({}):

JavaScript:

const model = pgml.newModel()

Python:

model = Model()

Create a Model with custom parameters

JavaScript:

const model = pgml.newModel(
    "hkunlp/instructor-base",
    "pgml",
    { instruction: "Represent the Wikipedia document for retrieval: " }
)

Python:

model = Model(
    name="hkunlp/instructor-base",
    parameters={"instruction": "Represent the Wikipedia document for retrieval: "}
)

Use an OpenAI model

JavaScript:

const model = pgml.newModel("text-embedding-ada-002", "openai")

Python:

model = Model(name="text-embedding-ada-002", source="openai")
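
Note that OpenAI models are remote: embedding requests are sent to OpenAI's API, so a valid OpenAI API key must be available to the SDK (typically via the OPENAI_API_KEY environment variable).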

Splitters

Splitters are used to split documents into chunks before embedding them. We support the splitters found in LangChain.

Create a default Splitter, "recursive_character", with default parameters ({}):

JavaScript:

const splitter = pgml.newSplitter()

Python:

splitter = Splitter()

Create a Splitter with custom parameters

JavaScript:

const splitter = pgml.newSplitter(
    "recursive_character",
    { chunk_size: 1500, chunk_overlap: 40 }
)

Python:

splitter = Splitter(
    name="recursive_character",
    parameters={"chunk_size": 1500, "chunk_overlap": 40}
)
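
For the recursive_character splitter, chunk_size caps the length of each chunk (measured in characters by default) and chunk_overlap controls how many characters adjacent chunks share; both parameters are passed through to the underlying LangChain splitter.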

Adding Pipelines to a Collection

When adding a Pipeline to a Collection, the Pipeline must have both a Model and a Splitter.

The first time a Pipeline is added to a Collection it will automatically chunk and embed any documents already in that Collection.

JavaScript:

const model = pgml.newModel()
const splitter = pgml.newSplitter()
const pipeline = pgml.newPipeline("test_pipeline", model, splitter)
await collection.add_pipeline(pipeline)

Python:

model = Model()
splitter = Splitter()
pipeline = Pipeline("test_pipeline", model, splitter)
await collection.add_pipeline(pipeline)
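
A hedged sketch of that backfill behavior in Python (hypothetical document contents, inside an async context):

Python:

collection = Collection("test_collection")

# Upsert a document while the Collection has no Pipelines yet
await collection.upsert_documents([
    {"id": "early_document", "text": "Upserted before any Pipeline exists."}
])

# Adding the Pipeline chunks and embeds early_document as part of the backfill
pipeline = Pipeline("test_pipeline", Model(), Splitter())
await collection.add_pipeline(pipeline)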

Pipelines can take additional arguments that enable full text search. When full text search is enabled, the Pipeline will, in addition to automatically chunking and embedding documents, create the tsvectors necessary to perform full text search.

For more information on full text search, see Postgres Full Text Search.

JavaScript:

const model = pgml.newModel()
const splitter = pgml.newSplitter()
const pipeline = pgml.newPipeline("test_pipeline", model, splitter, {
    full_text_search: {
        active: true,
        configuration: "english"
    }
})
await collection.add_pipeline(pipeline)

Python:

model = Model()
splitter = Splitter()
pipeline = Pipeline("test_pipeline", model, splitter, {
    "full_text_search": {
        "active": True,
        "configuration": "english"
    }
})
await collection.add_pipeline(pipeline)
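
Once full text search is enabled, queries can combine vector recall with a full text filter. The filter shape below is a hedged sketch of the query builder's filter method; consult the vector search documentation for the exact syntax.

Python:

results = (
    await collection.query()
    .vector_recall("Why is PostgresML the best?", pipeline)
    .filter({
        "full_text": {
            "configuration": "english",
            "text": "PostgresML"
        }
    })
    .fetch_all()
)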

Customizing the HNSW Index

By default, the SDK uses HNSW indexes to efficiently perform vector recall. The default HNSW index sets m to 16 and ef_construction to 64. These defaults can be customized when the Pipeline is created; higher values of m and ef_construction generally improve recall at the cost of slower index construction and more memory.

JavaScript:

const model = pgml.newModel()
const splitter = pgml.newSplitter()
const pipeline = pgml.newPipeline("test_pipeline", model, splitter, {
    hnsw: {
        m: 16,
        ef_construction: 64
    }
})
await collection.add_pipeline(pipeline)

Python:

model = Model()
splitter = Splitter()
pipeline = Pipeline("test_pipeline", model, splitter, {
    "hnsw": {
        "m": 16,
        "ef_construction": 64
    }
})
await collection.add_pipeline(pipeline)

Searching with Pipelines

Pipelines are a required argument when performing vector search. Once a Pipeline has been added to a Collection, it can be instantiated by name alone; the Model and Splitter can be omitted.

JavaScript:

const pipeline = pgml.newPipeline("test_pipeline")
const collection = pgml.newCollection("test_collection")
const results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()

Python:

pipeline = Pipeline("test_pipeline")
collection = Collection("test_collection")
results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
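
The query builder can also cap how many matches are returned. A brief sketch, assuming the builder's limit method:

Python:

results = (
    await collection.query()
    .vector_recall("Why is PostgresML the best?", pipeline)
    .limit(10)
    .fetch_all()
)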

Disable a Pipeline

Pipelines can be disabled or removed to prevent them from running automatically when documents are upserted.

JavaScript:

const pipeline = pgml.newPipeline("test_pipeline")
const collection = pgml.newCollection("test_collection")
await collection.disable_pipeline(pipeline)

Python:

pipeline = Pipeline("test_pipeline")
collection = Collection("test_collection")
await collection.disable_pipeline(pipeline)

Disabling a Pipeline prevents it from running automatically, but leaves all chunks and embeddings already created by that Pipeline in the database.

Enable a Pipeline

Disabled Pipelines can be re-enabled.

JavaScript:

const pipeline = pgml.newPipeline("test_pipeline")
const collection = pgml.newCollection("test_collection")
await collection.enable_pipeline(pipeline)

Python:

pipeline = Pipeline("test_pipeline")
collection = Collection("test_collection")
await collection.enable_pipeline(pipeline)

Enabling a Pipeline causes it to automatically run, chunking and embedding any documents it missed while it was disabled.

Remove a Pipeline

JavaScript:

const pipeline = pgml.newPipeline("test_pipeline")
const collection = pgml.newCollection("test_collection")
await collection.remove_pipeline(pipeline)

Python:

pipeline = Pipeline("test_pipeline")
collection = Collection("test_collection")
await collection.remove_pipeline(pipeline)

Removing a Pipeline deletes it and all associated data from the database. Removed Pipelines cannot be re-enabled but can be recreated.