SQL extension

PostgresML is a PostgreSQL extension which adds SQL functions to the database. Those functions provide access to AI models downloaded from Hugging Face, and classical machine learning algorithms like XGBoost and LightGBM.

Our SQL API is stable and safe to use in your applications, while the models and algorithms we support continue to evolve and improve.

Open-source LLMs

PostgresML defines two SQL functions which use 🤗 Hugging Face transformers and embeddings models, running directly in the database:

Function Description
pgml.embed() Generate embeddings using latest sentence transformers from Hugging Face.
pgml.transform() Text generation using LLMs like Llama, Mixtral, and many more, with models downloaded from Hugging Face.
pgml.transform_stream() Streaming version of pgml.transform(), which fetches partial responses as they are being generated by the model, substantially decreasing time to first token.
pgml.tune() Perform fine tuning tasks on Hugging Face models, using data stored in the database.

Example

Using a SQL function for interacting with open-source models makes things really easy:

content_copy link edit
SELECT pgml.embed(
'intfloat/e5-small',
'This text will be embedded using the intfloat/e5-small model.'
) AS embedding;

content_copy link edit
embedding
-------------------------------------------
{-0.028478337,-0.06275077,-0.04322059, [...]

Using the pgml SQL functions inside regular queries, it's possible to add embeddings and LLM-generated text inside any query, without the data ever leaving the database, removing the cost of a remote network call.

Classical machine learning

PostgresML defines four SQL functions which allow training regression, classification, and clustering models on tabular data:

Function Description
pgml.train() Train a model on PostgreSQL tables or views using any algorithm from Scikit-learn, with the additional support for XGBoost, LightGBM and Catboost.
pgml.predict() Run inference on live application data using a model trained with pgml.train().
pgml.deploy() Deploy a specific version of a model trained with pgml.train(), using your own accuracy metrics.
pgml.load_dataset() Load any of the toy datasets from Scikit-learn or any dataset from Hugging Face.

Example

Load data

Using pgml.load_dataset(), we can load an example classification dataset from Scikit-learn:

content_copy link edit
SELECT *
FROM pgml.load_dataset('digits');

content_copy link edit
table_name | rows
-------------+------
pgml.digits | 1797
(1 row)

Train a model

Once we have some data, we can train a model on this data using pgml.train():

content_copy link edit
SELECT *
FROM pgml.train(
project_name => 'My project name',
task => 'classification',
relation_name =>'pgml.digits',
y_column_name => 'target',
algorithm => 'xgboost',
);

content_copy link edit
INFO: Metrics: {
"f1": 0.8755124,
"precision": 0.87670505,
"recall": 0.88005465,
"accuracy": 0.87750554,
"mcc": 0.8645154,
"fit_time": 0.33504912,
"score_time": 0.001842427
}
project | task | algorithm | deployed
-----------------+----------------+-----------+----------
My project name | classification | xgboost | t
(1 row)

pgml.train() reads data from the table, using the target column as the label, automatically splits the dataset into test and train sets, and trains an XGBoost model. Our extension supports more than 50 machine learning algorithms, and you can train a model using any of them by just changing the name of the algorithm argument.

Real time inference

Now that we have a model, we can use it to predict new data points, in real time, on live application data:

content_copy link edit
SELECT
target,
pgml.predict(
'My project name',
image
) AS prediction
FROM
pgml.digits
LIMIT 1;

content_copy link edit
target | prediction
--------+------------
0 | 0
(1 row)

Change model version

The train function automatically deploys the best model into production, using the precision score relevant to the type of the model. If you prefer to deploy models using your own accuracy metrics, the pgml.deploy() function can manually change which model version is used for subsequent database queries:

content_copy link edit
SELECT *
FROM
pgml.deploy(
'My project name',
strategy => 'most_recent',
algorithm => 'xgboost'
);

content_copy link edit
project | strategy | algorithm
-----------------+-------------+-----------
My project name | most_recent | xgboost
(1 row)