pgml.chunk()

Chunks are pieces of a document produced by a specified splitter. Splitting is typically done before embedding.

API

pgml.chunk(
    splitter TEXT, -- splitter name
    text TEXT, -- text to split into chunks
    kwargs JSON -- optional arguments (see below)
)

Example

SELECT pgml.chunk('recursive_character', 'test');
SELECT pgml.chunk('recursive_character', 'test', '{"chunk_size": 1000, "chunk_overlap": 40}'::jsonb);
SELECT pgml.chunk('markdown', '# Some test');

Note that the input text in these examples is so short that the splitters do not split it at all. A real-world example would look more like:

SELECT pgml.chunk('recursive_character', content) FROM documents;

Here, documents is a table with a text column called content.
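To make the query above concrete, here is a minimal sketch of such a table together with a chunking query that passes the optional arguments from the earlier example. Only the table name documents and the column content come from the text; the id column and the sample row are assumptions added for illustration, and the chunk_size and chunk_overlap values are simply reused from the example above rather than recommended settings.

```sql
-- Hypothetical schema: only "documents" and "content" are named in the text;
-- the id column is an assumption so each chunk can be traced to its source row.
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL
);

-- Illustrative sample row; real documents would be much longer.
INSERT INTO documents (content)
VALUES ('A long document body goes here.');

-- Chunk every document, keeping the source id next to the chunk output.
-- The kwargs values are copied from the example above.
SELECT
    id,
    pgml.chunk(
        'recursive_character',
        content,
        '{"chunk_size": 1000, "chunk_overlap": 40}'::jsonb
    )
FROM documents;
```

Selecting id alongside the call keeps each chunk tied to the document it came from, which is usually needed when the chunks are embedded in a later step.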

Supported Splitters

We support the following splitters:

  • recursive_character
  • latex
  • markdown
  • nltk
  • python
  • spacy

For more information on splitters, see LangChain's docs.