Pipelines are a powerful feature for processing and preparing documents for efficient search and retrieval. They define a series of transformations applied to your data, enabling operations like text splitting, semantic embedding, and full-text search preparation. This guide will walk you through the process of constructing Pipeline schemas, allowing you to customize how your documents are processed and indexed.
If you are looking for information on how to work with Pipelines and Collections, review the Pipelines API.
Pipelines are specified as JSON. If you are working in Python or JavaScript, they are objects. For this guide we will write everything in Python, but it can easily be translated to JavaScript, Rust, or C.
For this guide, we'll use a simple document structure as an example. Understanding your document structure is crucial for creating an effective Pipeline, as it determines which fields you'll process:
```python
example_document = {
    "id": "doc_001",  # Unique identifier for the document
    "title": "Introduction to Machine Learning",  # Document title
    "text": "Machine learning is a branch of artificial intelligence...",  # Main content
}
```
Your Pipeline will define how to process these fields.
Pipelines can apply three different transformations:
- Splitting
- Embedding
- Creating tsvectors
Here is an example Pipeline that will split, embed, and generate tsvectors for the `text` key of documents.
```python
from pgml import Pipeline  # assuming the pgml Python SDK

pipeline = Pipeline(
    "v0",
    {
        "text": {
            "splitter": {"model": "recursive_character"},
            "semantic_search": {
                "model": "Alibaba-NLP/gte-base-en-v1.5",
            },
            "full_text_search": {
                "configuration": "english"
            }
        },
    },
)
```
The first argument to the `Pipeline` constructor is the name; the second is the schema.
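Once constructed, a `Pipeline` is registered with a `Collection`, which then applies it to documents as they are upserted. Here is a minimal sketch, assuming the pgml Python SDK and a hypothetical collection name (see the Pipelines API guide for the full workflow):

```python
import asyncio

from pgml import Collection  # assuming the pgml Python SDK

async def main():
    # "demo_collection" is a hypothetical name for this sketch
    collection = Collection("demo_collection")
    # Register the Pipeline; documents upserted from now on are
    # split, embedded, and indexed according to its schema
    await collection.add_pipeline(pipeline)
    await collection.upsert_documents([example_document])

asyncio.run(main())
```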
Let's break the schema down.
First, we specify the `text` key. This means the transformation object applies only to the `text` key of the document.
The `text` object contains three different keys:
- `splitter`
- `semantic_search`
- `full_text_search`
Let's break each down individually.
The `splitter` object takes two keys:
- `model`: the string name of the model to use for splitting.
- `parameters`: an optional object specifying the parameters to pass to the splitter model.
It is common to adjust the max chunk size and overlap for the `recursive_character` splitter. An example Pipeline doing this:
```python
pipeline = Pipeline(
    "v0",
    {
        "text": {
            "splitter": {
                "model": "recursive_character",
                "parameters": {
                    "chunk_size": 1500,
                    "chunk_overlap": 40
                }
            },
            "semantic_search": {
                "model": "Alibaba-NLP/gte-base-en-v1.5",
            },
            "full_text_search": {
                "configuration": "english"
            }
        },
    },
)
```
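Choosing these values is a tradeoff: larger chunks preserve more context in each embedding but make retrieval less granular, while a small overlap helps keep sentences that straddle a chunk boundary retrievable from either side.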
The `semantic_search` object takes two keys:
- `model`: the string name of the model to use for embedding.
- `parameters`: an optional object specifying the parameters to pass to the embedding model.
It is common for embedding models to require some kind of prompt when generating embeddings. For example, the popular `intfloat/e5-small-v2` requires that text embedded for storage be prefixed with `passage: `. This can be done with the following `Pipeline`:
```python
pipeline = Pipeline(
    "v0",
    {
        "text": {
            "splitter": {"model": "recursive_character"},
            "semantic_search": {
                "model": "intfloat/e5-small-v2",
                "parameters": {
                    "prompt": "passage: "
                }
            },
            "full_text_search": {
                "configuration": "english"
            }
        },
    },
)
```
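Note that e5-style models expect the complementary `query: ` prefix at search time; that prompt is supplied through the query-side `parameters` object covered in the vector search guide.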
The `full_text_search` object takes only one key: `configuration`. The `configuration` value is passed directly to PostgreSQL's `to_tsvector` function, and will most likely be the language you want to enable full text search for. A common choice is `english`.
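The value can be any text search configuration your PostgreSQL installation provides. As a sketch, a Pipeline preparing full text search for a Spanish-language corpus (other transformations omitted for brevity):

```python
# A sketch: "spanish" is one of PostgreSQL's built-in text search configurations
pipeline = Pipeline(
    "v0",
    {
        "text": {
            "splitter": {"model": "recursive_character"},
            "full_text_search": {"configuration": "spanish"},
        },
    },
)
```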
If you want to perform hybrid search, you must supply the `full_text_search` key.
It is common to search over more than one field of a document. Every key you plan to search over must be specified in the Pipeline schema.
```python
pipeline = Pipeline(
    "v0",
    {
        "abstract": {
            "semantic_search": {
                "model": "Alibaba-NLP/gte-base-en-v1.5",
            },
            "full_text_search": {
                "configuration": "english"
            }
        },
        "text": {
            "splitter": {"model": "recursive_character"},
            "semantic_search": {
                "model": "Alibaba-NLP/gte-base-en-v1.5",
            },
            "full_text_search": {
                "configuration": "english"
            }
        },
    },
)
```
The `Pipeline` above generates embeddings and tsvectors for the `abstract` key, and splits, embeds, and generates tsvectors for the `text` key.
We can now search over both the `text` and `abstract` keys of our documents. See the vector search guide for more information on how to do this.
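As a rough sketch of what that search looks like, assuming the pgml SDK's `vector_search` method and the hypothetical collection from earlier (the exact query format is documented in the vector search guide):

```python
import asyncio

from pgml import Collection  # assuming the pgml Python SDK

async def search():
    collection = Collection("demo_collection")  # hypothetical collection name
    # Query both embedded keys with the same query text
    results = await collection.vector_search(
        {
            "query": {
                "fields": {
                    "abstract": {"query": "What is machine learning?"},
                    "text": {"query": "What is machine learning?"},
                }
            },
            "limit": 5,
        },
        pipeline,
    )
    return results

asyncio.run(search())
```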