Re: Slow "not in array" operation - Mailing list pgsql-performance

From Rick Otten
Subject Re: Slow "not in array" operation
Date
Msg-id CAMAYy4Jsyo4_Gs1hRvWd1vYnpPftAGWzx=OdhLAZC_WiU5CuTQ@mail.gmail.com
Whole thread Raw
In response to Re: Slow "not in array" operation  (Morris de Oryx <morrisdeoryx@gmail.com>)
List pgsql-performance


On Wed, Nov 13, 2019 at 5:47 AM Morris de Oryx <morrisdeoryx@gmail.com> wrote:
Disclaimer: Out over my skis again.

From what you say here, and over on SO, it sounds like you've got two problems:

* Matching on huge numbers of records because of common tags.

* A dynamic collection of tags as they're customer driven/configured.

An "ideal" solution might look like a bit-index for each tag+tuple, but Postgres does not have such a structure. The closest I've seen are Bloom filter based indexes. That's likely not going to work here as you don't know the collection of tags at any one time. If, however, you create your own frequency count estimates for tags, you may well find that there are a small number of common tags, and a large number of rare tags. That would be good to find out. If you do have some super common (non selective) tags, then perhaps a Bloom index based on that collection could be effective. Or expression indexes on the very common tags. In your SaaS setup, you might need counts/indexes tied to some kind of customer/tenancy distinction ID, understood. But, for simplicity, I'm just saying a single set of frequency counts, etc.

Here's a recent article on Bloom filter based indexes in Postgres that looks decent:

One other question might be whether you are always querying for a specific tag or small set of tags, or if your queries are for relatively random tags.  ie, if you are always looking for the same 2 or 3 tags, then maybe you could use a functional index or trigger-populate a new column on insert/update that indicates whether those tags are present.

It is possible that you want a Graph model for this data instead of a Relational model.  ie, if you are finding a bunch of users with common features, you may find traversing a graph (such as Neo4j - or if you _have_ to stay with a PG backend, something like Cayley.io) to be much more efficient and flexible.

 

pgsql-performance by date:

Previous
From: Morris de Oryx
Date:
Subject: Re: Slow "not in array" operation
Next
From: Jeff Janes
Date:
Subject: Re: Slow "not in array" operation