Thread: vector search support

vector search support

From
Nathan Bossart
Date:
Attached is a proof-of-concept/work-in-progress patch set that adds
functions for "vectors" represented with one-dimensional float8 arrays.
These functions may be used in a variety of applications, but I am
proposing them with the AI/ML use-cases in mind.  I am posting this early
in the v17 cycle in hopes of gathering feedback prior to PGCon.

With the accessibility of AI/ML tools such as large language models (LLMs),
there has been a demand for storing and manipulating high-dimensional
vectors in PostgreSQL, particularly around nearest-neighbor queries.  Many
of these vectors have more than 1500 dimensions.  The cube extension [0]
provides some of the distance functionality (e.g., taxicab, Euclidean, and
Chebyshev), but it is missing some popular functions (e.g., cosine
similarity, dot product), and it is limited to 100 dimensions.  We could
extend cube to support more dimensions, but this would require reworking
its indexing code and filling in gaps between the cube data type and the
array types.  For some previous discussion about using the cube extension
for this kind of data, see [1].

float8[] is well-supported and allows for effectively unlimited dimensions
of data.  float8 matches the common output format of many AI embeddings,
and it allows us or extensions to implement indexing methods around these
functions.  This patch set does not yet contain indexing support, but we
are exploring using GiST or GIN for the use-cases in question.  It might
also be desirable to add support for other linear algebra operations (e.g.,
operations on matrices).  The attached patches likely only scratch the
surface of the "vector search" use-case.

The patch set is broken up as follows:

 * 0001 does some minor refactoring of dsqrt() in preparation for 0002.
 * 0002 adds several vector-related functions, including distance functions
   and a kmeans++ implementation.
 * 0003 adds support for optionally using the OpenBLAS library, which is an
   implementation of the Basic Linear Algebra Subprograms [2]
   specification.  Basic testing with this library showed a small
   performance boost, although perhaps not enough to justify giving this
   patch serious consideration.
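As background for 0002's kmeans++ piece, the seeding step can be sketched as follows (a toy illustration over one-dimensional points; the actual patch works on float8[] vectors and its implementation may differ):

```c
#include <math.h>
#include <stdlib.h>
#include <stddef.h>

/* Toy sketch of k-means++ seeding over one-dimensional points.  Each
 * new center is drawn with probability proportional to D(x)^2, the
 * squared distance to the nearest already-chosen center. */
static void
kmeanspp_seed(const double *points, size_t n, size_t k,
              unsigned int seed, double *centers)
{
    double *dist = malloc(n * sizeof(double));

    srand(seed);

    /* First center: chosen uniformly at random. */
    centers[0] = points[(size_t) rand() % n];

    for (size_t c = 1; c < k; c++)
    {
        double total = 0.0;

        for (size_t i = 0; i < n; i++)
        {
            double best = -1.0;

            for (size_t j = 0; j < c; j++)
            {
                double d = points[i] - centers[j];

                if (best < 0.0 || d * d < best)
                    best = d * d;
            }
            dist[i] = best;
            total += best;
        }

        /* Draw r in (0, total] so zero-distance points are never picked. */
        double r = ((double) rand() + 1.0) / ((double) RAND_MAX + 1.0) * total;
        size_t pick = n - 1;

        for (size_t i = 0; i < n; i++)
        {
            r -= dist[i];
            if (r <= 0.0)
            {
                pick = i;
                break;
            }
        }
        centers[c] = points[pick];
    }
    free(dist);
}
```

The point of the D(x)^2 weighting is to spread the initial centers out, which is what gives kmeans++ its approximation guarantee over plain random seeding.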

Of course, there are many open questions.  For example, should PostgreSQL
support this stuff out-of-the-box in the first place?  And should we
introduce a vector data type or SQL domains for treating float8[] as
vectors?  IMHO these vector search use-cases are an exciting opportunity
for the PostgreSQL project, so I am eager to hear what folks think.

[0] https://www.postgresql.org/docs/current/cube.html
[1] https://postgr.es/m/2271927.1593097400%40sss.pgh.pa.us
[2] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


Re: vector search support

From
Giuseppe Broccolo
Date:
Hi Nathan, 

I find the patches really interesting. Personally, as a Data/MLOps Engineer, I'm involved in a project where we use embedding techniques to generate vectors from documents, and use clustering and kNN searches to find similar documents based on the spatial neighbourhood of the generated vectors.

We eventually opted for ElasticSearch as the search engine, considering that it provided what we needed:

* support for storing dense vectors
* support for kNN searches (the latest version of ElasticSearch allows this)

An internal benchmark showed that we were able to achieve the expected performance, although we are still missing a few pieces:

* clustering of vectors (this has to be done outside the search engine; we use DBSCAN for our use case)
* concurrency in updating the ElasticSearch indexes that store the dense vectors

I found these patches really interesting, considering that they would solve some of the open issues around storing dense vectors. Index support would help a lot with searches, though.

I'm not sure whether it's best to include this in PostgreSQL core, but it would be fantastic to have as an extension.

All the best,
Giuseppe.

On Sat, 22 Apr 2023, 01:07 Nathan Bossart, <nathandbossart@gmail.com> wrote:

Re: vector search support

From
Oliver Rice
Date:
Hi Nathan,

A nice side effect of using float8[] to represent vectors is that it allows vectors of different sizes to coexist in the same column.

We most frequently see (pgvector) vector columns being used for storing ML embeddings. Given that different models produce embeddings with a different number of dimensions, the need to specify a vector’s size in DDL tightly couples the schema to a single model. Support for variable length vectors would be a great way to decouple those concepts. It would also be a differentiating feature from existing vector stores.

One drawback is that variable-length vectors complicate indexing for similarity search, because similarity measures require vectors of consistent length. Partial indexes are a possible solution to that challenge.

-------
Oliver Rice


On Apr 21, 2023, at 7:07 PM, Nathan Bossart <nathandbossart@gmail.com> wrote:


Re: vector search support

From
"Jonathan S. Katz"
Date:
Hi,

On 4/21/23 8:07 PM, Nathan Bossart wrote:
> Attached is a proof-of-concept/work-in-progress patch set that adds
> functions for "vectors" represented with one-dimensional float8 arrays.
> These functions may be used in a variety of applications, but I am
> proposing them with the AI/ML use-cases in mind.  I am posting this early
> in the v17 cycle in hopes of gathering feedback prior to PGCon.

Thanks for proposing this. Looking forward to discussing more in person. 
There's definitely demand to use PostgreSQL to store / search over 
vector data, and I do think we need to improve upon this in core.

> With the accessibility of AI/ML tools such as large language models (LLMs),
> there has been a demand for storing and manipulating high-dimensional
> vectors in PostgreSQL, particularly around nearest-neighbor queries.  Many
> of these vectors have more than 1500 dimensions. 

1536 seems to be a popular one from LLMs, but I've been seeing much 
higher dimensionality (8K, 16K etc). My hunch is that at a practical 
level, apps are going to favor data sets / sources that use a reduced 
dimensionality, but I wouldn't be shocked if we see vectors of all sizes.

> The cube extension [0]
> provides some of the distance functionality (e.g., taxicab, Euclidean, and
> Chebyshev), but it is missing some popular functions (e.g., cosine
> similarity, dot product), and it is limited to 100 dimensions.  We could
> extend cube to support more dimensions, but this would require reworking
> its indexing code and filling in gaps between the cube data type and the
> array types.  For some previous discussion about using the cube extension
> for this kind of data, see [1].

I've stared at the cube code quite a bit over the past few months. There 
are definitely some clever methods in it for handling searches over 
(now) lower dimensionality data, but I generally agree we should add 
functionality that's specific to ARRAY types.

I'll start making specific comments on the patches below.


> float8[] is well-supported and allows for effectively unlimited dimensions
> of data.  float8 matches the common output format of many AI embeddings,
> and it allows us or extensions to implement indexing methods around these
> functions.  This patch set does not yet contain indexing support, but we
> are exploring using GiST or GIN for the use-cases in question.  It might
> also be desirable to add support for other linear algebra operations (e.g.,
> operations on matrices).  The attached patches likely only scratch the
> surface of the "vector search" use-case.
> 
> The patch set is broken up as follows:
> 
>   * 0001 does some minor refactoring of dsqrt() in preparation for 0002.

This seems pretty benign and we may as well do it anyway, though we may 
need to expand on it based on comments on the next patch. A question on:

+static inline float8
+float8_sqrt(const float8 val)
+{
+    float8        result;
+
+    if (unlikely(val < 0))

Should this be:

   if (unlikely(float8_lt(val, 0)))

Similarly:

+    if (unlikely(result == 0.0) && val != 0.0)

   if (unlikely(float8_eq(result, 0.0)) && float8_ne(val, 0.0))


>   * 0002 adds several vector-related functions, including distance functions
>     and a kmeans++ implementation.

Nice. Generally I like this patch. The functions seem to match the most 
commonly used vector distance functions I'm seeing, and it includes a 
function that can let a user specify a constraint on an ARRAY column so 
they can ensure it contains valid vectors.

While I think supporting float8 is useful, I've been seeing a mix of 
data types in the different AI/ML vector embeddings, i.e. float4 / 
float8. Additionally, it could be helpful to support integers as well, 
particularly based on some of the dimensionality reduction techniques 
I've seen. I think this holds double true for kmeans, which is often 
used in those calculations.

I'd suggest ensuring these functions support:

* float4, float8
* int2, int4, int8

There's probably some nuance to how we document this too, given our 
docs[1] specify real / double precision, and smallint, int, bigint.

(Separately, we mention the int2/int4/int8 aliases in [1], but not 
float4/float8, which seems like a small addition we should make).

If you agree, I'd be happy to review more closely once that's implemented.

Other things:

* kmeans -- we're using kmeans++; should the function name reflect that? 
Do you think we could end up with a different kmeans algorithm in the 
future? Maybe we allow the user to specify the kmeans algorithm via the 
function name (with the default / only option today being kmeans++)?

>   * 0003 adds support for optionally using the OpenBLAS library, which is an
>     implementation of the Basic Linear Algebra Subprograms [2]
>     specification.  Basic testing with this library showed a small
>     performance boost, although perhaps not enough to justify giving this
>     patch serious consideration.

It'd be good to see what else we could use OpenBLAS with. Maybe that's a 
discussion for PGCon.

> Of course, there are many open questions.  For example, should PostgreSQL
> support this stuff out-of-the-box in the first place?  

Yes :) One can argue an extension (and pgvector[2] already does a lot 
here), but I think native support would be generally helpful for users. 
It does remove the friction of starting out.

There's also an interesting use-case downthread (I'll comment on that 
there) that demonstrates why it's helpful to have variability in vector 
size in an ARRAY column, which is an argument for supporting it there.

> And should we
> introduce a vector data type or SQL domains for treating float8[] as
> vectors?  IMHO these vector search use-cases are an exciting opportunity
> for the PostgreSQL project, so I am eager to hear what folks think.

Having a vector type could give us some advantages in how we 
store/search over the data. For example, we could perform validation checks up 
front, normalize the vector, etc., and any index implementations will 
have less work to do on that front. We may also be able to give more 
options to tune how the vector is stored, e.g. perform inversion on 
insert/update.

Again, it's a fair argument that this can be done in an extension, but 
historically we've seen reduced friction when we add support in core. 
It'd also make building additional functionality easier, whether in core 
or an extension.

Thanks,

Jonathan

[1] https://www.postgresql.org/docs/current/datatype-numeric.html
[2] https://github.com/pgvector/pgvector


Re: vector search support

From
"Jonathan S. Katz"
Date:
On 5/25/23 1:48 PM, Oliver Rice wrote:

> A nice side effect of using the float8[] to represent vectors is that it 
> allows for vectors of different sizes to coexist in the same column.
> 
> We most frequently see (pgvector) vector columns being used for storing 
> ML embeddings. Given that different models produce embeddings with a 
> different number of dimensions, the need to specify a vector’s size in 
> DDL tightly couples the schema to a single model. Support for variable 
> length vectors would be a great way to decouple those concepts. It would 
> also be a differentiating feature from existing vector stores.

I hadn't thought of that, given most of what I've seen (or at least my 
personal bias in designing systems) is that you keep vectors of one 
dimensionality in a column. But this sounds like a case where having 
native support in a variable-length array would help.

> One drawback is that variable length vectors complicates indexing for 
> similarity search because similarity measures require vectors of 
> consistent length. Partial indexes are a possible solution to that challenge

Yeah, that presents a challenge. This may also be an argument for a 
vector data type, since that would eliminate the need to check for 
consistent dimensionality on the indexing.

Jonathan


Re: vector search support

From
"Jonathan S. Katz"
Date:
On 4/26/23 9:31 AM, Giuseppe Broccolo wrote:
> Hi Nathan,
> 
> I find the patches really interesting. Personally, as Data/MLOps 
> Engineer, I'm involved in a project where we use embedding techniques to 
> generate vectors from documents, and use clustering and kNN searches to 
> find similar documents basing on spatial neighbourhood of generated 
> vectors.

Thanks! This seems to be a pretty common use-case these days.

> We finally opted for ElasticSearch as search engine, considering that it 
> was providing what we needed:
> 
> * support to store dense vectors
> * support for kNN searches (last version of ElasticSearch allows this)

I do want to note that we can implement indexing techniques with GiST 
that perform K-NN searches with the "distance" support function[1], so 
adding the fundamental functions to help with this around known vector 
search techniques could add this functionality. We already have this 
today with "cube", but as Nathan mentioned, it's limited to 100 dims.

> An internal benchmark showed us that we were able to achieve the 
> expected performance, although we are still lacking some points:
> 
> * clustering of vectors (this has to be done outside the search engine, 
> using DBScan for our use case)

From your experience, have you found any particular clustering 
algorithms better at driving a good performance/recall tradeoff?

> * concurrency in updating the ElasticSearch indexes storing the dense 
> vectors

I do think concurrent updates of vector-based indexes is one area 
PostgreSQL can ultimately be pretty good at, whether in core or in an 
extension.

> I found these patches really interesting, considering that they would 
> solve some of open issues when storing dense vectors. Index support 
> would help a lot with searches though.

Great -- thanks for the feedback,

Jonathan

[1] https://www.postgresql.org/docs/devel/gist-extensibility.html


Re: vector search support

From
Giuseppe Broccolo
Date:
Hi Jonathan,

On 5/26/23 3:38 PM, Jonathan S. Katz <jkatz@postgresql.org> wrote:
On 4/26/23 9:31 AM, Giuseppe Broccolo wrote:
> We finally opted for ElasticSearch as search engine, considering that it
> was providing what we needed:
>
> * support to store dense vectors
> * support for kNN searches (last version of ElasticSearch allows this)

I do want to note that we can implement indexing techniques with GiST
that perform K-NN searches with the "distance" support function[1], so
adding the fundamental functions to help with this around known vector
search techniques could add this functionality. We already have this
today with "cube", but as Nathan mentioned, it's limited to 100 dims.

Yes, I was aware of this. It would be enough to define the required support functions for GiST
indexing (I was somewhat in the loop when adding presorting support to GiST indexing in PostGIS
was attempted for PG14[1]). That would be really helpful indeed. I was just mentioning it because
I know of other teams using ElasticSearch solely as a store for dense vectors.
 
> An internal benchmark showed us that we were able to achieve the
> expected performance, although we are still lacking some points:
>
> * clustering of vectors (this has to be done outside the search engine,
> using DBScan for our use case)

From your experience, have you found any particular clustering
algorithms better at driving a good performance/recall tradeoff?

Nope, it really depends on the use case: the main reason for using DBSCAN above
was that it clusters without knowing a priori the number of clusters the
algorithm should retrieve, which is a parameter k-means does require. Depending
on the use case, DBSCAN can perform better on noisy datasets (i.e., entries that
do not really belong to any particular cluster). Noise in vectors obtained with
embedding models is quite normal, especially when the embedding model is not
properly tuned/trained.

In our use case, DBSCAN was more or less the best choice, and it did not bias
the expected clusters.

PostGIS also includes an implementation of DBSCAN for its geometries[2].
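For readers unfamiliar with it, the heart of DBSCAN is small; a toy one-dimensional sketch (illustrative only, nothing like a production implementation) shows why no cluster count k is needed up front:

```c
#include <math.h>
#include <stdlib.h>
#include <stddef.h>

#define NOISE     (-1)
#define UNVISITED (-2)

/* Toy one-dimensional DBSCAN, illustrative only.  Clusters emerge from
 * density alone (eps, min_pts); no cluster count is given up front, and
 * sparse points end up labeled as noise (-1). */
static void
dbscan_1d(const double *pts, size_t n, double eps, size_t min_pts,
          int *label)
{
    size_t *queue = malloc(n * sizeof(size_t));
    int cluster = 0;

    for (size_t i = 0; i < n; i++)
        label[i] = UNVISITED;

    for (size_t i = 0; i < n; i++)
    {
        size_t cnt = 0;

        if (label[i] != UNVISITED)
            continue;

        for (size_t j = 0; j < n; j++)
            if (fabs(pts[i] - pts[j]) <= eps)
                cnt++;

        if (cnt < min_pts)
        {
            label[i] = NOISE;   /* may later become a border point */
            continue;
        }

        /* Grow a new cluster outward from core point i. */
        size_t head = 0, tail = 0;

        label[i] = cluster;
        queue[tail++] = i;
        while (head < tail)
        {
            size_t p = queue[head++];
            size_t pcnt = 0;

            for (size_t j = 0; j < n; j++)
                if (fabs(pts[p] - pts[j]) <= eps)
                    pcnt++;
            if (pcnt < min_pts)
                continue;       /* border point: keep it, don't expand */

            for (size_t j = 0; j < n; j++)
                if (fabs(pts[p] - pts[j]) <= eps &&
                    (label[j] == UNVISITED || label[j] == NOISE))
                {
                    label[j] = cluster;
                    queue[tail++] = j;
                }
        }
        cluster++;
    }
    free(queue);
}
```

The trade-off versus k-means is that you tune density parameters (eps, min_pts) instead of a cluster count, and outliers are explicitly labeled as noise rather than forced into a cluster.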
 
> * concurrency in updating the ElasticSearch indexes storing the dense
> vectors

I do think concurrent updates of vector-based indexes is one area
PostgreSQL can ultimately be pretty good at, whether in core or in an
extension.

Oh, it would save a lot of overhead in updating indexed vectors! That is needed
whenever embedding models are re-trained, vectors are re-generated, and indexes
need to be updated.

Regards,
Giuseppe.
 

Re: vector search support

From
Giuseppe Broccolo
Date:
Hi Nathan,

I noticed you implemented a closest_vector function that returns the closest vector to a given one using
Euclidean distance: would it make sense to extend the implementation to support distance definitions
other than the Euclidean one (for instance, cosine similarity)? Depending on the use case, some
metrics can make more sense than others.

Giuseppe.

On 4/22/23 1:07 AM, Nathan Bossart <nathandbossart@gmail.com> wrote: