pgml.chunk()

pgml.chunk() splits a document into smaller pieces, called chunks, using the splitter you specify. Chunking is typically done before embedding.

API


pgml.chunk(
    splitter TEXT,    -- splitter name
    text TEXT,        -- text to split into chunks
    kwargs JSONB      -- optional splitter arguments, such as chunk_size (see the examples below)
)

Examples


SELECT pgml.chunk('recursive_character', 'test');


SELECT pgml.chunk('recursive_character', 'test', '{"chunk_size": 1000, "chunk_overlap": 40}'::jsonb);


SELECT pgml.chunk('markdown', '# Some test');

Note that the input text in these examples is so short that the splitters do not split it at all. A real-world example would look more like:


SELECT pgml.chunk('recursive_character', content) FROM documents;

Here, documents is a table that has a text column called content.
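To trace each chunk back to its source row, the query above can also select a key column and pass explicit splitter arguments. This is a minimal sketch: the id column is hypothetical (it is not mentioned above), and the chunk_size and chunk_overlap values are simply the ones from the earlier example.

```sql
-- Assumes documents has a primary key column named id (hypothetical)
-- in addition to the content column described above.
SELECT
    id,
    pgml.chunk(
        'recursive_character',
        content,
        '{"chunk_size": 1000, "chunk_overlap": 40}'::jsonb
    )
FROM documents;
```

Keeping the row key next to each chunk makes it possible to join chunks back to their documents in later steps, such as embedding.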

Supported Splitters

We support the following splitters:

recursive_character
latex
markdown
nltk
python
spacy

For more information on splitters, see LangChain's docs.
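The other splitters in the list are selected by name in the same way. As a rough sketch, the python splitter could be applied to a short code snippet as follows. The sample text is made up for illustration, and it is an assumption that this splitter accepts the same chunk_size and chunk_overlap arguments as recursive_character.

```sql
-- E'...' enables backslash escapes, so \n becomes a real newline.
-- The chunk_size and chunk_overlap keys are assumed to be accepted
-- by the python splitter as they are by recursive_character.
SELECT pgml.chunk(
    'python',
    E'def add(a, b):\n    return a + b\n\n\nclass Greeter:\n    def hello(self):\n        return "hi"\n',
    '{"chunk_size": 40, "chunk_overlap": 0}'::jsonb
);
```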
