Re: Slow "not in array" operation - Mailing list pgsql-performance

From Marco Colli
Subject Re: Slow "not in array" operation
Date
Msg-id CAFvCgN5eATzgQcx2TtqM4yXzksxNVPVs_CtnDTpt1OvDXCnQJA@mail.gmail.com
Whole thread Raw
In response to Re: Slow "not in array" operation  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
I am not a PostgreSQL expert, however I think that the following algorithm should be possible and fast:

1. find the bitmap of all subscriptions in a project that are not trashed (it can use the index and takes only ~500ms)
2. find the bitmap of all subscriptions that match the above condition and HAVE the tag (~7s)
3. calculate [bitmap 1] - [bitmap 2] to find the subscriptions of the project that DON'T HAVE the tag



On Tue, Nov 12, 2019 at 9:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Marco Colli <collimarco91@gmail.com> writes:
> 3) Here's the query plan that I get after disabling the seq scan:

>  Finalize Aggregate  (cost=2183938.89..2183938.90 rows=1 width=8) (actual
> time=94972.253..94972.254 rows=1 loops=1)

So, this is slower than the seqscan, which means the planner made the
right choice.

You seem to be imagining that there's some way the index could be used
with the NOT clause, but there isn't.  Indexable clauses are of the form
        indexed_column indexable_operator constant
and there's no provision for a NOT in that.  If we had a "not contained
in" array operator, the NOT could be folded to be of this syntactic form,
but it's highly unlikely that any index operator class would consider such
an operator to be a supported indexable operator.  It doesn't lend itself
to searching an index.

So the planner is doing the best it can, which in this case is a
full-table scan.

A conceivable solution, if the tags array is a lot smaller than
the table as a whole and the table is fairly static, is that you could
make a btree index on the tags array and let the planner fall back
to an index-only scan that is just using the index as a cheaper
source of the array data.  (This doesn't work for your existing GIST
index because GIST can't reconstruct the original arrays on-demand.)
I suspect though that this wouldn't win much, even if you disregard
the maintenance costs for the extra index.  The really fundamental
problem here is that a large fraction of the table satisfies the
NOT-in condition, and no index is going to beat a seqscan by all that
much when that's true.  Indexes are good at retrieving small portions
of tables.

                        regards, tom lane

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Slow "not in array" operation
Next
From: Jeff Janes
Date:
Subject: Re: Slow "not in array" operation