pgml.chunk()
Chunks are pieces of documents split using some specified splitter. This is typically done before embedding.
API
pgml.chunk(
splitter TEXT, -- splitter name
text TEXT, -- text to embed
kwargs JSON -- optional arguments (see below)
)
Example
SELECT pgml.chunk('recursive_character', 'test');
SELECT pgml.chunk('recursive_character', 'test', '{"chunk_size": 1000, "chunk_overlap": 40}'::jsonb);
SELECT pgml.chunk('markdown', '# Some test');
Note that the input text for those splitters is so small it isn't splitting it at all, a real world example would look more like:
SELECT pgml.chunk('recursive_character', content) FROM documents;
Where documents
is some table that has a text
column called content
Supported Splitters
We support the following splitters:
recursive_character
latex
markdown
ntlk
python
spacy
For more information on splitters see LangChain's docs