Thread: BUG #18831: Particular queries using gin-indexes are not interruptible, resulting in resource usage concerns.

The following bug has been logged on the website:

Bug reference:      18831
Logged by:          Niek
Email address:      niek.brasa@hitachienergy.com
PostgreSQL version: 17.4
Operating system:   Windows/Linux
Description:

As part of investigating why some of our queries were not interruptible (even
though statement_timeout was set), I was able to reduce the problem to
something generic: the use of a GIN index together with the ?| operator.
I have tested this in PostgreSQL 13.20, 16.4 and 17.4.

Although performance is certainly a real issue as well, that is not the main
reason for filing this bug report. As you can imagine, it is relatively easy
to execute several of these statements concurrently and bring the server to a
halt, which is what happened in our case. We have a statement timeout of 120
seconds, and because it was not being honored, our services hit their own
timeout, reconnected, opened another connection, and executed the same query
again.

Output for table size 1000 (table size / array length):

NOTICE:  Query for  1000/     10 took      0.001004 s
NOTICE:  Query for  1000/    100 took      0.000983 s
NOTICE:  Query for  1000/   1000 took      0.029361 s
NOTICE:  Query for  1000/  10000 took      3.364018 s
NOTICE:  Query for  1000/  50000 took     84.101790 s
NOTICE:  Query for  1000/ 100000 took    373.761158 s
NOTICE:  Query for  1000/ 150000 took    846.428113 s
NOTICE:  Query for  1000/ 200000 took   1486.085538 s
NOTICE:  Query for  1000/ 300000 took   3382.042434 s
NOTICE:  Query for  1000/ 400000 took   6249.934149 s
NOTICE:  Query for  1000/ 500000 took   9731.607982 s
NOTICE:  Query for  1000/ 600000 took  14113.382917 s
NOTICE:  Query for  1000/ 700000 took  19276.934712 s
NOTICE:  Query for  1000/ 800000 took  24991.138558 s

And here is a run for table size 1000, canceled at the 50000 point (note that
the exception is only raised after the query has finished executing):

NOTICE:  Query for  1000/  50000 took     93.362066 s, before being
canceled.

Below you will find a reproducible block. Let it run, and once you see that
you have to wait a long time for a result, use pg_terminate_backend or
pg_cancel_backend to stop the statement. You will notice that nothing
happens; the statement continues to execute, and only afterwards do you get a
query_canceled exception.

DO
$$
DECLARE
  _dummy int;
  _start timestamp;
  _a_t   int[] := ARRAY[1,10,100,1000,10000,100000];
  _a_l   int[] := ARRAY[10,100,1000,10000,50000,100000,150000,200000,300000,400000,500000,600000,700000,800000,900000,1000000];
  _i_l   int;
  _i_t   int;
BEGIN
    FOREACH _i_t IN ARRAY _a_t
    LOOP
        -- create a table with _i_t rows that contains a jsonb column 'doc', with
        -- one attribute 'a' containing an array of two random uuids as text.
        -- using the gin index here will cause the statement to be non-interruptible
        DROP TABLE IF EXISTS public.t1;
        CREATE TABLE public.t1 AS (
            SELECT jsonb_build_object('a', jsonb_build_array(gen_random_uuid()::text, gen_random_uuid()::text)) doc
              FROM generate_series(1,_i_t));
        CREATE INDEX t1_idx ON public.t1 USING gin ((doc->'a') jsonb_ops);
        ANALYZE public.t1;
    
        FOREACH _i_l IN ARRAY _a_l
        LOOP
          BEGIN
            _start := clock_timestamp();
            
            -- the following query cannot be cancelled or terminated and the postgres
            -- process will keep running; one can only hard-kill the process, which
            -- causes the database to go into recovery mode. If the statement takes
            -- longer than the statement timeout, it will return with a
            -- canceled-statement exception.
            -- Although performance is horrendous with larger table contents, the
            -- fact that it is not interruptible is the main concern raised here.
            --
            --  - Using an in-memory table will cause the statement to be
            --    interruptible:
            --    ,t1 AS (SELECT jsonb_build_object('a', jsonb_build_array(gen_random_uuid()::text, gen_random_uuid()::text)) doc FROM generate_series(1,_i_t))
            --  - Using an actual table without the gin index will cause the
            --    statement to be interruptible.
            --  - Using a gin index will cause the statement to be non-interruptible.
            --
            WITH l1 AS (SELECT gen_random_uuid()::text uuid FROM generate_series(1,_i_l))
            SELECT 1 INTO _dummy
              FROM t1
             WHERE t1.doc->'a' ?| (SELECT array_agg(uuid) FROM l1);
          EXCEPTION WHEN query_canceled THEN
            RAISE NOTICE 'Query for %/% took % s, before being canceled.'
                        ,LPAD(_i_t::text,5,' ')
                        ,LPAD(_i_l::text,7,' ')
                        ,LPAD(EXTRACT(epoch FROM (clock_timestamp() - _start))::text,13,' ');
            RETURN;
          END;
          RAISE NOTICE 'Query for %/% took % s'
                      ,LPAD(_i_t::text,5,' ')
                      ,LPAD(_i_l::text,7,' ')
                      ,LPAD(EXTRACT(epoch FROM (clock_timestamp() - _start))::text,13,' ');
        END LOOP;
    END LOOP;
END;
$$;


PG Bug reporting form <noreply@postgresql.org> writes:
> As part of checking why some of the queries were not interruptible (even
> though statement_timeout was set), I was able to reduce it to something
> generic, as the usage of a gin index and the ?| operator. 

Thanks for the report.  It looks like the issue boils down to trying
to do a GIN index search with a very large array on the RHS of the
?| operator.  extractQuery extracts all the array elements as search
keys, and then we spend O(N^2) time in a very stupid de-duplication
loop in ginFillScanEntry.  Also, if we get past that, there's O(N^2)
work in startScanKey.

It's not hard to remove the O(N^2) behavior in ginFillScanEntry:
the de-duplication is just an optimization AFAICS, and we can stop
doing it once there are too many keys.  A more principled approach
would be to sort the keys and then just compare adjacent ones to
de-dup, but I seriously doubt it's worth the extra code that
would be needed.
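[Editor's note: the two de-duplication strategies contrasted above can be
modeled in Python. This is an illustrative sketch only; the real code is C in
src/backend/access/gin/ginscan.c.]

```python
def dedup_linear(keys):
    """Mimics ginFillScanEntry's existing loop: each new key is
    compared against every key collected so far, O(N^2) overall."""
    out = []
    for k in keys:
        if k not in out:  # O(len(out)) membership scan per key
            out.append(k)
    return out

def dedup_sorted(keys):
    """The 'more principled' alternative: sort, then compare adjacent
    elements, O(N log N) overall (at the cost of losing key order)."""
    s = sorted(keys)
    out = s[:1]
    for k in s[1:]:
        if k != out[-1]:
            out.append(k)
    return out
```

The patch below takes a third route: keep the linear scan but stop
de-duplicating past 1000 keys, trading a few redundant scan entries for a
bounded setup cost.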

Half of the problem in startScanKey is just wasted work: it
re-initializes the entire entryRes[] array for each required-key
probe, when it actually would take less code to incrementally
update the array.  However, the other half is that we are doing
up to nentries calls of the triConsistentFn, which may be doing
O(nentries) work itself per call --- and indeed is doing so in
this specific case.  I don't see any solution for that that
doesn't involve changes in the API spec for GIN opclasses.
It does seem like a pretty bad way to be doing things for
operators that have simple AND or OR semantics, where we could
know the answer in constant time if we only had an API for it.
So maybe it's worth trying to do something about that, but don't
hold your breath.
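[Editor's note: the before/after shape of that startScanKey probe loop can be
modeled in Python, under the simplifying assumption that entryIndexes is the
identity permutation; the real code is C and first sorts entries by
frequency.]

```python
GIN_FALSE, GIN_MAYBE = 0, 2

def last_required_old(nentries, tri_consistent):
    """Old coding: rebuild the whole entryRes array for each probe,
    O(nentries) extra work per iteration on top of the consistent call."""
    for i in range(nentries - 1):
        entry_res = [GIN_FALSE] * (i + 1) + [GIN_MAYBE] * (nentries - i - 1)
        if tri_consistent(entry_res) == GIN_FALSE:
            return i
    return nentries - 1

def last_required_new(nentries, tri_consistent):
    """Patched coding: flip exactly one slot per probe; the sequence of
    states passed to tri_consistent is identical to the old coding."""
    entry_res = [GIN_MAYBE] * nentries
    for i in range(nentries - 1):
        entry_res[i] = GIN_FALSE
        if tri_consistent(entry_res) == GIN_FALSE:
            return i
    return nentries - 1
```

Both versions probe the same sequence of entryRes states, so they find the
same "last required entry"; only the bookkeeping cost differs.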

For now I propose the attached, which bounds ginFillScanEntry's
de-dup efforts at 1000 keys, fixes the silly coding in startScanKey,
and adds a CHECK_FOR_INTERRUPTS in the startScanKey loop to cover
the fact that it's still O(N^2) in the worst case.  With this
I don't notice any objectionable delay in response to cancel
interrupts.  Also, while it's still O(N^2) in the end,
there's a pretty nice speedup for mid-size problems:
NOTICE:  Query for 10000/ 200000 took      6.957140 s
vs
NOTICE:  Query for 10000/ 200000 took    649.409895 s

            regards, tom lane

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index f3b19d280c3..4a56f19390d 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -558,16 +558,18 @@ startScanKey(GinState *ginstate, GinScanOpaque so, GinScanKey key)
         qsort_arg(entryIndexes, key->nentries, sizeof(int),
                   entryIndexByFrequencyCmp, key);

+        for (i = 1; i < key->nentries; i++)
+            key->entryRes[entryIndexes[i]] = GIN_MAYBE;
         for (i = 0; i < key->nentries - 1; i++)
         {
             /* Pass all entries <= i as FALSE, and the rest as MAYBE */
-            for (j = 0; j <= i; j++)
-                key->entryRes[entryIndexes[j]] = GIN_FALSE;
-            for (j = i + 1; j < key->nentries; j++)
-                key->entryRes[entryIndexes[j]] = GIN_MAYBE;
+            key->entryRes[entryIndexes[i]] = GIN_FALSE;

             if (key->triConsistentFn(key) == GIN_FALSE)
                 break;
+
+            /* Make this loop interruptible in case there are many keys */
+            CHECK_FOR_INTERRUPTS();
         }
         /* i is now the last required entry. */

diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index 63ded6301e2..a50694f6776 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -68,8 +68,12 @@ ginFillScanEntry(GinScanOpaque so, OffsetNumber attnum,
      *
      * Entries with non-null extra_data are never considered identical, since
      * we can't know exactly what the opclass might be doing with that.
+     *
+     * Also, give up de-duplication once we have 1000 entries.  That avoids
+     * spending O(N^2) time on probably-fruitless de-duplication of large
+     * search-key sets.
      */
-    if (extra_data == NULL)
+    if (extra_data == NULL && so->totalentries < 1000)
     {
         for (i = 0; i < so->totalentries; i++)
         {

Hello,
I believe there may be an issue with the above patch - specifically in
the triConsistent logic for opclasses that only provide the boolean
consistent function. In shimTriConsistentFn, the key's entryRes values
are directly mutated to GIN_FALSE (when there are fewer than
MAX_MAYBE_ENTRIES maybe-entries), and at the end of that function they
are not reset back to GIN_MAYBE. Consequently, the new incremental
loop in startScanKey may incorrectly see the remaining entries as
GIN_FALSE when they started as GIN_MAYBE; previously they were
hard-reset on every iteration of that loop. The attached patch seems
to fix it by resetting any mutated values before returning. However,
this would also reset them during the regular triConsistent check per
tuple.

Would appreciate your thoughts on this (and open to other ways to
resolve this too).
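[Editor's note: the failure mode can be modeled in Python. This is a loose
sketch, not the real C code in ginlogic.c: the shim answers a tri-state query
by probing the boolean consistentFn over assignments to the MAYBE slots, and
the bug is that it leaves those slots mutated unless it restores them.]

```python
GIN_FALSE, GIN_TRUE, GIN_MAYBE = 0, 1, 2
MAX_MAYBE_ENTRIES = 4  # matches the constant in the real code

def shim_tri_consistent(entry_res, bool_consistent, restore=True):
    """Synthesize a tri-state result from a boolean consistentFn by
    enumerating TRUE/FALSE assignments for the GIN_MAYBE slots.
    With restore=False (modeling the bug), the MAYBE slots stay mutated."""
    maybes = [i for i, v in enumerate(entry_res) if v == GIN_MAYBE]
    if len(maybes) > MAX_MAYBE_ENTRIES:
        return GIN_MAYBE  # too many combinations; punt
    results = set()
    for mask in range(2 ** len(maybes)):
        for bit, i in enumerate(maybes):
            entry_res[i] = GIN_TRUE if (mask >> bit) & 1 else GIN_FALSE
        results.add(bool_consistent(entry_res))
    if restore:  # the fix: put the caller's array back the way it was
        for i in maybes:
            entry_res[i] = GIN_MAYBE
    return results.pop() if len(results) == 1 else GIN_MAYBE
```

With restore=True the caller's entryRes array is unchanged by the probe,
which is the contract the incremental startScanKey loop relies on.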

Thanks,
Vinod




On Fri, 11 Apr 2025 at 15:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> PG Bug reporting form <noreply@postgresql.org> writes:
> > As part of checking why some of the queries were not interruptible (even
> > though statement_timeout was set), I was able to reduce it to something
> > generic, as the usage of a gin index and the ?| operator.
>
> Thanks for the report.  It looks like the issue boils down to trying
> to do a GIN index search with a very large array on the RHS of the
> ?| operator.  extractQuery extracts all the array elements as search
> keys, and then we spend O(N^2) time in a very stupid de-duplication
> loop in ginFillScanEntry.  Also, if we get past that, there's O(N^2)
> work in startScanKey.
>
> It's not hard to remove the O(N^2) behavior in ginFillScanEntry:
> the de-duplication is just an optimization AFAICS, and we can stop
> doing it once there are too many keys.  A more principled approach
> would be to sort the keys and then just compare adjacent ones to
> de-dup, but I seriously doubt it's worth the extra code that
> would be needed.
>
> Half of the problem in startScanKey is just wasted work: it
> re-initializes the entire entryRes[] array for each required-key
> probe, when it actually would take less code to incrementally
> update the array.  However, the other half is that we are doing
> up to nentries calls of the triConsistentFn, which may be doing
> O(nentries) work itself per call --- and indeed is doing so in
> this specific case.  I don't see any solution for that that
> doesn't involve changes in the API spec for GIN opclasses.
> It does seem like a pretty bad way to be doing things for
> operators that have simple AND or OR semantics, where we could
> know the answer in constant time if we only had an API for it.
> So maybe it's worth trying to do something about that, but don't
> hold your breath.
>
> For now I propose the attached, which bounds ginFillScanEntry's
> de-dup efforts at 1000 keys, fixes the silly coding in startScanKey,
> and adds a CHECK_FOR_INTERRUPTS in the startScanKey loop to cover
> the fact that it's still O(N^2) in the worst case.  With this
> I don't notice any objectionable delay in response to cancel
> interrupts.  Also, while it's still O(N^2) in the end,
> there's a pretty nice speedup for mid-size problems:
> NOTICE:  Query for 10000/ 200000 took      6.957140 s
> vs
> NOTICE:  Query for 10000/ 200000 took    649.409895 s
>
>                         regards, tom lane
>

Vinod Sridharan <vsridh90@gmail.com> writes:
> I believe there may be an issue with the above patch - specifically in
> the case of the triConsistent logic dealing with op-classes that use
> the consistent function. In the shimTriConsistentFn, the key's
> entryRes values are directly mutated to set to GIN_FALSE (if there's
> fewer than MAX_MAYBE_ENTRIES entries). At the end of this method,
> they're not reset back to GIN_MAYBE. Consequently, the next loop of
> the ginFillScanEntry now may incorrectly assume that the remaining
> entries are GIN_FALSE when they started as GIN_MAYBE: previously they
> were hard reset for every iteration of the ginFillScanEntry loop. The
> attached patch seems to fix it by resetting any mutated values back
> before returning. However, this would also reset it during the regular
> triConsistent check per tuple.

This patch would be more convincing with a test case demonstrating
that there's a problem.

            regards, tom lane



 -- create textops operator class without triconsistent
 CREATE OPERATOR CLASS mytsvector_ops FOR TYPE tsvector USING gin AS
     OPERATOR 1 @@(tsvector,tsquery),
     OPERATOR 2 @@@(tsvector,tsquery),
     FUNCTION        1       gin_cmp_tslexeme,
     FUNCTION        2       gin_extract_tsvector(tsvector,internal,internal),
     FUNCTION        3       gin_extract_tsquery(tsvector,internal,int2,internal,internal,internal,internal),
     FUNCTION        4       gin_tsquery_consistent(internal,int2,tsvector,int4,internal,internal,internal,internal),
     FUNCTION        5       gin_cmp_prefix,
 STORAGE text;
CREATE OPERATOR CLASS

 -- create a table with a tsvector column
 CREATE TABLE mytexttable (value tsvector);
CREATE TABLE

 -- insert a single row that has one field in it
 INSERT INTO mytexttable VALUES ('second'::tsvector);
INSERT 0 1

 -- create index with the opclass above
 CREATE INDEX ON mytexttable USING gin(value mytsvector_ops);
CREATE INDEX

 -- now query (this should return second)
 SELECT * FROM mytexttable WHERE value @@ '(second | third) & !first'::tsquery;
  value
----------
 'second'
(1 row)


 -- this no longer returns the row
set enable_seqscan to off;
SET
 SELECT * FROM mytexttable WHERE value @@ '(second | third) & !first'::tsquery;
 value
-------
(0 rows)

Not sure what the best place would be to add such a test but happy to
add one wherever you recommend.

Thanks,
Vinod

On Fri, 11 Apr 2025 at 16:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Vinod Sridharan <vsridh90@gmail.com> writes:
> > I believe there may be an issue with the above patch - specifically in
> > the case of the triConsistent logic dealing with op-classes that use
> > the consistent function. In the shimTriConsistentFn, the key's
> > entryRes values are directly mutated to set to GIN_FALSE (if there's
> > fewer than MAX_MAYBE_ENTRIES entries). At the end of this method,
> > they're not reset back to GIN_MAYBE. Consequently, the next loop of
> > the ginFillScanEntry now may incorrectly assume that the remaining
> > entries are GIN_FALSE when they started as GIN_MAYBE: previously they
> > were hard reset for every iteration of the ginFillScanEntry loop. The
> > attached patch seems to fix it by resetting any mutated values back
> > before returning. However, this would also reset it during the regular
> > triConsistent check per tuple.
>
> This patch would be more convincing with a test case demonstrating
> that there's a problem.
>
>                         regards, tom lane



Vinod Sridharan <vsridh90@gmail.com> writes:
>  -- create textops operator class without triconsistent

OK, got it, and I concur that we need to make shimTriConsistentFn()
restore the state of the entryRes array before it returns.  But
I don't understand why you're concerned about "However, this would
also reset it during the regular triConsistent check per tuple"?
I think the point is basically that this function is violating
the expectation that triconsistent functions not change the state
of that array.  That expectation doesn't depend on what the call
site is.

            regards, tom lane



Hey Tom,
I concur - from a correctness standpoint this is what we should do.
Since shimTriConsistentFn is also called on the regular consistent
path, this would add work there, where the caller currently happens to
reset the array on every invocation anyway.

If we're okay with that, I would prefer to keep this contract clean
too - which would also make things better if the consistent path were
to change in the future.

And in that case, I believe my patch is sufficient.

-Vinod

On Fri, 11 Apr 2025 at 20:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Vinod Sridharan <vsridh90@gmail.com> writes:
> >  -- create textops operator class without triconsistent
>
> OK, got it, and I concur that we need to make shimTriConsistentFn()
> restore the state of the entryRes array before it returns.  But
> I don't understand why you're concerned about "However, this would
> also reset it during the regular triConsistent check per tuple"?
> I think the point is basically that this function is violating
> the expectation that triconsistent functions not change the state
> of that array.  That expectation doesn't depend on what the call
> site is.
>
>                         regards, tom lane



Vinod Sridharan <vsridh90@gmail.com> writes:
> Since this is also called in the regular consistent function, this
> would be adding work in the regular consistent path - where the caller
> happens to reset the array for every invocation currently.

Ah, so you're just worried about the extra work in that path?
But MAX_MAYBE_ENTRIES is only 4, so I can't believe it'd amount
to much.

I am wondering about a test case.  I'm not thrilled about building
a specialized opclass just to test this.  contrib/hstore and
contrib/intarray already have opclasses with no triconsistent
function, so they should (and do) exercise shimTriConsistentFn
already.  But their tests failed to expose this bug.  I spent
a bit of time trying to add an example that would show the bug
in one or the other of those, and failed so far.  Any ideas?

            regards, tom lane



Hey Tom,
I managed to get a test that fails without my patch, and added it to
the intarray tests. Without the patch, the query returns 0 records.
With my patch, the results are correct (12 records).

Please find attached the updated patch with a fix & a test in the
intarray contrib. Thanks for the pointer to the contrib extensions.

-Vinod

On Fri, 11 Apr 2025 at 22:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Vinod Sridharan <vsridh90@gmail.com> writes:
> > Since this is also called in the regular consistent function, this
> > would be adding work in the regular consistent path - where the caller
> > happens to reset the array for every invocation currently.
>
> Ah, so you're just worried about the extra work in that path?
> But MAX_MAYBE_ENTRIES is only 4, so I can't believe it'd amount
> to much.
>
> I am wondering about a test case.  I'm not thrilled about building
> a specialized opclass just to test this.  contrib/hstore and
> contrib/intarray already have opclasses with no triconsistent
> function, so they should (and do) exercise shimTriConsistentFn
> already.  But their tests failed to expose this bug.  I spent
> a bit of time trying to add an example that would show the bug
> in one or the other of those, and failed so far.  Any ideas?
>
>                         regards, tom lane

Vinod Sridharan <vsridh90@gmail.com> writes:
> I managed to get a test that fails without my patch, and added it to
> the intarray tests. Without the patch, the query returns 0 records.
> With my patch, the results are correct (12 records).

Cool, thanks.

> Please find attached the updated patch with a fix & a test in the
> intarray contrib. Thanks for the pointer to the contrib extensions.

Pushed with light editorialization.  I noticed one other bug in the
same area: shimTriConsistentFn wasn't paying attention to the recheck
flag returned by its first call to the underlying consistentFn.
I think this is probably just a latent issue, because if any of the
later calls return recheck = true we'll end with the correct
result anyway.  And at least for the contrib modules that depend
on this, the recheck flag doesn't depend on the key-presence inputs.
But it's theoretically possible we'd get the wrong answer.

            regards, tom lane