Thread: Fix brin_form_tuple to properly detoast data

Fix brin_form_tuple to properly detoast data

From: Tomas Vondra
Hi,

As pointed out in [1], BRIN is not properly handling toasted data, which
may easily lead to index tuples referencing TOAST-ed values. That's
clearly wrong - it's trivial to trigger failures after a DELETE.

Attached is a patch that aims to fix this - AFAIK brin_form_tuple was
simply missing the TOAST_INDEX_HACK stuff from index_form_tuple, which
ensures the data is detoasted and (possibly) re-compressed. The code is
mostly the same, with some BRIN-specific tweaks (looking at oi_typcache
instead of the index descriptor, etc.).

I also attach a simple SQL script that I used to trigger the issue. This
needs to be turned into a regression test, I'll work on that tomorrow.
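
For the archives, the gist of it looks something like this (a
simplified sketch, not the attached script - table and column names
are made up):

     create table t (a text);
     -- keep the value out-of-line and uncompressed, so the datum the
     -- index summary ends up with is a plain TOAST pointer
     alter table t alter column a set storage external;
     insert into t
     select repeat(md5(i::text), 200) from generate_series(1, 10) s(i);
     create index t_a_brin_idx on t using brin (a);
     delete from t;
     vacuum t;   -- reclaims the TOAST values the index still references
     -- with the bug, this fails with something like "ERROR: missing
     -- chunk number 0 for toast value ..." (assuming the BRIN index
     -- is actually used)
     set enable_seqscan = off;
     select * from t where a = repeat(md5('1'), 200);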


A separate question is what to do about existing indexes - ISTM the only
thing we can do is to tell the users to reindex all BRIN indexes on
varlena values. Something like this:

     select * from pg_class
      where relam = (select oid from pg_am where amname = 'brin')
        and oid in (select attrelid from pg_attribute where attlen = -1
                                     and attstorage in ('e', 'x'));
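
If needed, that can be wrapped to generate the REINDEX commands
directly (just a sketch, adjust as needed):

     select format('REINDEX INDEX %s;', oid::regclass)
       from pg_class
      where relam = (select oid from pg_am where amname = 'brin')
        and oid in (select attrelid from pg_attribute where attlen = -1
                                     and attstorage in ('e', 'x'));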


regards


[1] https://www.postgresql.org/message-id/20201001184133.oq5uq75sb45pu3aw%40development

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Fix brin_form_tuple to properly detoast data

From: Tomas Vondra
On 11/4/20 2:05 AM, Tomas Vondra wrote:
> Hi,
> 
> As pointed out in [1], BRIN is not properly handling toasted data, which
> may easily lead to index tuples referencing TOAST-ed values. That's
> clearly wrong - it's trivial to trigger failures after a DELETE.
> 
> Attached is a patch that aims to fix this - AFAIK brin_form_tuple was
> simply missing the TOAST_INDEX_HACK stuff from index_form_tuple, which
> ensures the data is detoasted and (possibly) re-compressed. The code is
> mostly the same, with some BRIN-specific tweaks (looking at oi_typcache
> instead of the index descriptor, etc.).
> 
> I also attach a simple SQL script that I used to trigger the issue. This
> needs to be turned into a regression test, I'll work on that tomorrow.
> 

OK, so here's an improved version of the fix - aside from the code (same 
as in v1), there are two patches with regression tests. Ultimately those 
should be merged into the fix, but keeping them separate makes it 
possible to apply just the regression tests and reproduce the issue.

The first test is fairly trivial - it simply builds an index on toasted 
data and then shows how an insert and a select fail. There's a caveat: 
this requires a DELETE + VACUUM, and the VACUUM actually has to clean up 
the rows. So there must be no concurrent transactions that might still 
need the rows, which is hard to guarantee in regression tests. So this 
requires waiting for all running transactions to finish - I did that by 
building an index concurrently. It's a bit strange, but it's better than 
any other solution I could think of (a timeout, or some custom wait for 
xacts).
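
For reference, the flow of the test is roughly this (a simplified
sketch reusing the table from the script in my previous message, not
the actual test):

     delete from t;
     -- the index built here is thrown away - CREATE INDEX CONCURRENTLY
     -- is used only because it waits for all transactions that might
     -- still need the deleted rows
     create index concurrently t_wait_idx on t using brin (a);
     drop index t_wait_idx;
     vacuum t;   -- now guaranteed to actually remove the rows
     -- before the fix, this trips over the dangling TOAST pointer
     -- left in the BRIN summary
     insert into t values (repeat(md5('42'), 200));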

The second test is a bit redundant - it merely checks that both CREATE 
INDEX and INSERT INTO fail the same way when the index tuple gets too 
large. Before the fix there were some inconsistencies - the CREATE INDEX 
succeeded because it stored just the small TOAST pointer instead of the 
full value. So ultimately this tests the same thing, but from a different 
perspective.
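
It boils down to something like this (again just a sketch - note the
exact limit, and thus the error message, depends on the block size):

     create table brin_large (a text);
     -- md5 output is mostly incompressible, so the value stays too
     -- large for a single index tuple even after re-compression
     insert into brin_large
     select string_agg(md5(i::text), '') from generate_series(1, 2000) s(i);
     -- with the fix this fails with "index row size ... exceeds
     -- maximum ..."; before the fix it succeeded, storing just the
     -- TOAST pointer
     create index brin_large_idx on brin_large using brin (a);
     -- and the insert path now fails the same way:
     truncate brin_large;
     create index brin_large_idx on brin_large using brin (a);
     insert into brin_large
     select string_agg(md5(i::text), '') from generate_series(1, 2000) s(i);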


regards


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Fix brin_form_tuple to properly detoast data

From: Alvaro Herrera
On 2020-Nov-04, Tomas Vondra wrote:

> The first test is fairly trivial - it simply builds an index on toasted
> data and then shows how an insert and a select fail. There's a caveat:
> this requires a DELETE + VACUUM, and the VACUUM actually has to clean up
> the rows. So there must be no concurrent transactions that might still
> need the rows, which is hard to guarantee in regression tests. So this
> requires waiting for all running transactions to finish - I did that by
> building an index concurrently. It's a bit strange, but it's better than
> any other solution I could think of (a timeout, or some custom wait for
> xacts).

There are recent changes in vacuum for temp tables (commit 94bc27b57680?)
that would maybe make this stable enough, without having to have the CIC
there.  At least, I tried it locally a few times and it appears to work well.
This won't work for older releases though, just master.  This is patch
0001 attached here.
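
That is, on master something like this ought to be stable (sketch,
using made-up names):

     create temp table t (a text);
     alter table t alter column a set storage external;
     insert into t
     select repeat(md5(i::text), 200) from generate_series(1, 10) s(i);
     create index t_a_brin_idx on t using brin (a);
     delete from t;
     -- for a temp table, vacuum can remove the deleted rows regardless
     -- of concurrent transactions, so no CIC dance is needed
     vacuum t;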

> The second test is a bit redundant - it merely checks that both CREATE
> INDEX and INSERT INTO fail the same way when the index tuple gets too
> large. Before the fix there were some inconsistencies - the CREATE INDEX
> succeeded because it stored just the small TOAST pointer instead of the
> full value. So ultimately this tests the same thing, but from a different
> perspective.

Hmm.  This one shows the page size in the error messages, so it'll fail
on builds with a nonstandard block size.  I think we try to stay away
from introducing such tests, so I'd leave this one out.

The code fix looks all right -- I'd just move the #include lines to
their place.  Patch 0002.

You add this comment:

> +            /*
> +             * Do nothing if value is not of varlena type. We don't need to
> +             * care about NULL values here, thanks to bv_allnulls above.
> +             *
> +             * If value is stored EXTERNAL, must fetch it so we are not
> +             * depending on outside storage.
> +             *
> +             * XXX Is this actually true? Could it be that the summary is
> +             * NULL even for range with non-NULL data? E.g. degenerate bloom
> +             * filter may be thrown away, etc.
> +             */

I think the XXX comment points to a bug that we don't have right now,
since neither minmax nor inclusion can end up with a NULL summary
starting from non-NULL data.  But if the comment about bloom is correct,
then surely it'll become a bug when bloom is added.

I don't think we need the second part of this comment:

> +/*
> + * This enables de-toasting of index entries.  Needed until VACUUM is
> + * smart enough to rebuild indexes from scratch.
> + */

... because, surely, we're never going to work on having VACUUM rebuild
indexes from scratch at this point.  In fact, I wonder if we need the
#define at all.  I propose to remove all those #ifdef lines in your patch.

The fix looks good to me.  I just added a comment in 0003.


Attachment

Re: Fix brin_form_tuple to properly detoast data

From: Tomas Vondra

On 11/5/20 6:17 PM, Alvaro Herrera wrote:
> On 2020-Nov-04, Tomas Vondra wrote:
> 
>> The first test is fairly trivial - it simply builds an index on toasted
>> data and then shows how an insert and a select fail. There's a caveat:
>> this requires a DELETE + VACUUM, and the VACUUM actually has to clean up
>> the rows. So there must be no concurrent transactions that might still
>> need the rows, which is hard to guarantee in regression tests. So this
>> requires waiting for all running transactions to finish - I did that by
>> building an index concurrently. It's a bit strange, but it's better than
>> any other solution I could think of (a timeout, or some custom wait for
>> xacts).
> 
> There are recent changes in vacuum for temp tables (commit 94bc27b57680?)
> that would maybe make this stable enough, without having to have the CIC
> there.  At least, I tried it locally a few times and it appears to work well.
> This won't work for older releases though, just master.  This is patch
> 0001 attached here.
> 

IIUC you're suggesting to use a temporary table in the test? 
Unfortunately, that does not work on older releases, and the test 
should be backpatched too. IMHO the CIC "hack" is acceptable, unless 
there's a better solution that I'm not aware of.

>> The second test is a bit redundant - it merely checks that both CREATE
>> INDEX and INSERT INTO fail the same way when the index tuple gets too
>> large. Before the fix there were some inconsistencies - the CREATE INDEX
>> succeeded because it stored just the small TOAST pointer instead of the
>> full value. So ultimately this tests the same thing, but from a different
>> perspective.
> 
> Hmm.  This one shows the page size in the error messages, so it'll fail
> on builds with a nonstandard block size.  I think we try to stay away
> from introducing such tests, so I'd leave this one out.

Hmm, OK. I don't think having a "redundant" test is a big deal, but I 
hadn't thought about builds with different block sizes. I'll leave this 
one out.

> 
> The code fix looks all right -- I'd just move the #include lines to
> their place.  Patch 0002.
> 

OK

> You add this comment:
> 
>> +            /*
>> +             * Do nothing if value is not of varlena type. We don't need to
>> +             * care about NULL values here, thanks to bv_allnulls above.
>> +             *
>> +             * If value is stored EXTERNAL, must fetch it so we are not
>> +             * depending on outside storage.
>> +             *
>> +             * XXX Is this actually true? Could it be that the summary is
>> +             * NULL even for range with non-NULL data? E.g. degenerate bloom
>> +             * filter may be thrown away, etc.
>> +             */
> 
> I think the XXX comment points to a bug that we don't have right now,
> since neither minmax nor inclusion can end up with a NULL summary
> starting from non-NULL data.  But if the comment about bloom is correct,
> then surely it'll become a bug when bloom is added.
> 

Yeah, but that'd be something for the bloom patch to fix, I think.

This got me thinking though - wouldn't it be better to handle too-large 
values by treating the range as "degenerate" (i.e. store NULL and 
consider it as matching all queries), instead of failing the CREATE 
INDEX or the DML? I find the current behavior rather annoying, because 
it depends on the other rows in the page range, not just on the one row 
the user deals with. Perhaps this might even be considered an 
information leak about the other data. Of course, that's not something 
this patch should deal with.

> I don't think we need the second part of this comment:
> 
>> +/*
>> + * This enables de-toasting of index entries.  Needed until VACUUM is
>> + * smart enough to rebuild indexes from scratch.
>> + */
> 
> ... because, surely, we're never going to work on having VACUUM rebuild
> indexes from scratch at this point.  In fact, I wonder if we need the
> #define at all.  I propose to remove all those #ifdef lines in your patch.
> 

That's a verbatim copy of a comment from indextuple.c. IMHO we should 
keep it the same in both places.

> The fix looks good to me.  I just added a comment in 0003.
> 

Thanks. Any opinions on fixing this in existing clusters? Any better 
ideas than just giving users the SQL query to list possibly-affected 
indexes?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Fix brin_form_tuple to properly detoast data

From: Tomas Vondra
On Thu, 5 Nov 2020 18:16:04 -0300
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> On 2020-Nov-05, Tomas Vondra wrote:
> 
> > On 11/5/20 6:17 PM, Alvaro Herrera wrote:  
> 
> > > There are recent changes in vacuum for temp tables (commit
> > > 94bc27b57680?) that would maybe make this stable enough, without
> > > having to have the CIC there.  At least, I tried it locally a few
> > > times and it appears to work well. This won't work for older
> > > releases though, just master.  This is patch 0001 attached here.  
> > 
> > IIUC you're suggesting to use a temporary table in the test?
> > Unfortunately, that does not work on older releases, and the test
> > should be backpatched too. IMHO the CIC "hack" is acceptable, unless
> > there's a better solution that I'm not aware of.  
> 
> Oh, sure, the CIC hack is acceptable for the older branches.  I'm just
> saying that you can use a temp table everywhere, and keep CIC in the
> old branches and no CIC in master.
> 
> > This got me thinking though - wouldn't it be better to handle
> > too-large values by treating the range as "degenerate" (i.e. store
> > NULL and consider it as matching all queries), instead of failing
> > the CREATE INDEX or the DML? I find the current behavior rather
> > annoying, because it depends on the other rows in the page range,
> > not just on the one row the user deals with. Perhaps this might even
> > be considered an information leak about the other data. Of course,
> > that's not something this patch should deal with.  
> 
> Hmm.  Regarding text I remember thinking we could just truncate values
> (as we do for LIKE, as I recall).  I suppose that strategy would work
> even for bytea.
> 
> 
> > > > +/*
> > > > + * This enables de-toasting of index entries.  Needed until
> > > > + * VACUUM is smart enough to rebuild indexes from scratch.
> > > > + */  
> > > 
> > > ... because, surely, we're never going to work on having VACUUM
> > > rebuild indexes from scratch at this point.  In fact, I wonder if
> > > we need the #define at all.  I propose to remove all those #ifdef
> > > lines in your patch.  
> > 
> > That's a verbatim copy of a comment from indextuple.c. IMHO we
> > should keep it the same in both places.  
> 
> Sure, if you want to.
> 
> > > The fix looks good to me.  I just added a comment in 0003.  
> > 
> > Thanks. Any opinions on fixing this in existing clusters? Any
> > better ideas than just giving users the SQL query to list
> > possibly-affected indexes?  
> 
> None here.
> 

OK, pushed and backpatched all the way back to 9.5. I decided not to
use the temporary table - I'd still need the CIC trick on older
releases, and there were enough differences between the branches already.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company