# Vector database

Vectors are lists of numbers representing a measurement in multidimensional space. There are many types of vectors, e.g. embeddings used for semantic search, but ultimately they are all just arrays of floating points.

In Postgres, a vector is just another data type that can be stored in regular tables and queried together with other columns. At PostgresML, we're using *pgvector*, a Postgres extension that implements the *vector* data type, and many vector operations like inner product, cosine distance, and approximate nearest neighbor (ANN) search.

### Installing pgvector

If you're using our cloud or our Docker image, your database has *pgvector* installed already. If you're self-hosting PostgresML, take a look at our Self-hosting documentation.

### Working with vectors

Vectors can be stored in columns, just like any other data type. To add a vector column to your table, you need to specify the size of the vector. All vectors in a single column must be the same size, since there are no useful operations between vectors of different sizes.

#### Adding a vector column

Using the example from Tabular data, let's add a vector column to our USA House Prices table:

```
ALTER TABLE
usa_house_prices
ADD COLUMN
embedding VECTOR(384);
```

```
ALTER TABLE
```

#### Generating embeddings

At first, the column is empty. To generate embeddings, we can use the PostgresML pgml.embed() function and generate an embedding of another column in the same (or different) table. This is where machine learning inside the database really shines:

```
UPDATE
usa_house_prices
SET embedding = pgml.embed(
'Alibaba-NLP/gte-base-en-v1.5',
address
);
```

```
UPDATE 5000
```

That's it. We just created 5,000 embeddings of the values stored in the address column, all with just one SQL query. Let's take a look at what we created:

```
SELECT
address,
(embedding::real[])[1:5]
FROM usa_house_prices
WHERE
address = '1 Infinite Loop, Cupertino, California';
```

```
address | embedding
----------------------------------------+----------------------------------------------------------------
1 Infinite Loop, Cupertino, California | {-0.009034249,-0.055827666,-0.09911688,0.005093358,0.04053181}
(1 row)
```

The vectors contain 384 values each, but that won't fit on our screen, so we selected the first 5 values using the Postgres array slice notation `[1:5]`

(Postgres array indices start at one, not zero).

### Searching vectors

If your dataset is small enough, searching vectors doesn't require approximation. You can find the exact nearest neighbor match using any of the distance functions supported by *pgvector*: L2, cosine distance, inner product and cosine similarity.

Each distance function is implemented with its own operator and can be used as part of all SQL queries:

Distance function | Operator | Index operator |
---|---|---|

L2 | `<->` |
`vector_in_ops` |

Inner product | `<#>` |
`vector_l2_ops` |

Cosine distance | `<=>` |
`vector_cosine_ops` |

Cosine similarity | `1 - (a <=> b)` |
`vector_cosine_ops` |

For example, if we wanted to find three closest matching addresses to `1 Infinite Loop`

using cosine distance:

```
SELECT
address
FROM usa_house_prices
ORDER BY
embedding <=> pgml.embed(
'Alibaba-NLP/gte-base-en-v1.5',
'1 Infinite Loop'
)::vector(384)
LIMIT 3;
```

```
address
----------------------------------------
1 Infinite Loop, Cupertino, California
615 Larry Loop
Warrenberg, PR 37943
(5 rows)
```

This query uses pgml.embed() to generate an embedding on the fly and finds the exact closest neighbors to that embedding in the entire dataset.

### Approximate nearest neighbors

This example dataset only has 5,000 rows which, for Postgres, is really easy to scan. In the real world, these datasets grow to become very large and searching the entire table becomes too slow to be practical. When that happens, we can get closest matches using approximation. Approximate nearest neighbors, or ANN, is a commonly used technique to organize vectors to find results that are "close enough".

*pgvector* implements two ANN algorithms: IVFFlat and HNSW. Both have their pros and cons and can be used in production to search millions of vectors.

### IVFFlat

IVFFlat splits the list of vectors into roughly equal parts, grouped around centroids calculated using k-nearest neighbors (KNN). The lists are stored in a B-tree index, ordered by the centroid.

When searching for nearest neighbors, *pgvector* picks the list with the closest centroid to the candidate vector, fetches all the vectors from that list, sorts them, and returns the closest neighbors. Since the list represents only a fraction of all vectors, using an IVFFlat index is considerably faster than scanning the entire table.

The number of lists in an IVFFlat index is configurable on index creation. The more lists, the faster you can search them, but the nearest neighbor approximation becomes less precise. The best number of lists for a dataset is typically its square root, e.g. if a dataset has 5,000,000 vectors, the number of lists should be:

```
SELECT round(sqrt(5000000)) AS lists;
```

```
lists
-------
2236
```

#### Creating an IVFFlat index

You can create an IVFFlat index with just one query:

```
CREATE INDEX ON
usa_house_prices
USING ivfflat(embedding vector_cosine_ops)
WITH (lists = 71);
```

```
CREATE INDEX
```

71 is the approximate square root of 5,000 rows we have in that table. With the index created, if we `EXPLAIN`

the query we just ran, we'll get an index scan on the cosine distance index:

```
EXPLAIN SELECT
address
FROM usa_house_prices
ORDER BY
embedding <=> pgml.embed(
'Alibaba-NLP/gte-base-en-v1.5',
'1 Infinite Loop'
)::vector(384)
LIMIT 3;
```

```
Limit (cost=38.03..38.32 rows=3 width=55)
-> Index Scan using usa_house_prices_embedding_idx on usa_house_prices (cost=38.03..327.23 rows=5001 width=55)
Order By: (embedding <=> '[-0.033770584,-0.033374995, ...])
```

It's important to create an IVFFlat index after you have added a representative sample of vectors into your table. Without a representative sample, the calculated centroids will be incorrect and the approximation of nearest neighbors inaccurate.

#### Maintaining an IVFFlat index

IVFFlat is a simple algorithm and constructs an index quickly. Splitting, sorting and solving KNN is optimized using the Postgres query engine and vectorized CPU operations (e.g. AVX512 on modern CPUs) built into *pgvector*. When queried, the index provides good performance and approximation for most use cases.

On the other hand, because of the nature of centroids, if the dataset changes in a statistically significant way, the original KNN calculation becomes inaccurate. In that case, an IVFFlat index should be rebuilt which Postgres makes pretty easy:

```
REINDEX INDEX CONCURRENTLY
usa_house_prices_embedding_idx;
```

```
REINDEX
```

As of this writing, *pgvector* doesn't provide monitoring tools for index degradation. The user should monitor recall from their vector search operations, and if it starts dropping, run a reindex.

### HNSW

Home Navigable Small Worlds, or HNSW, is a modern ANN algorithm that constructs a multilayer graph using a greedy search with local minimums. Constructing HNSW requires multiple passes over the same data, so the time and memory cost of building it are higher, but it does have faster and better recall than IVFFlat.

#### Creating an HNSW index

You can create an HNSW index with just one query:

```
CREATE INDEX ON
usa_house_prices
USING
hnsw(embedding vector_cosine_ops);
```

```
CREATE INDEX
```

#### Maintaining an HNSW index

HNSW requires little to no maintenance. When new vectors are added, they are automatically inserted at the optimal place in the graph. However, as the graph gets bigger, rebalancing it becomes more expensive, and inserting new rows becomes slower. We address this trade-off and how to solve this problem in Partitioning.