Thread: [HACKERS] Valgrind-detected bug in partitioning code

[HACKERS] Valgrind-detected bug in partitioning code

From
Tom Lane
Date:
skink has been unhappy since commit d26fa4f went in, but I think
that just exposed a pre-existing bug.  Running valgrind here
duplicates the failure:

==00:00:02:01.653 16626== Conditional jump or move depends on uninitialised value(s)
==00:00:02:01.653 16626==    at 0x4BDF6B: btint4cmp (nbtcompare.c:97)
==00:00:02:01.653 16626==    by 0x81D6BE: FunctionCall2Coll (fmgr.c:1318)
==00:00:02:01.653 16626==    by 0x52D584: partition_bounds_equal (partition.c:627)
==00:00:02:01.653 16626==    by 0x80CF8E: RelationClearRelation (relcache.c:1203)
==00:00:02:01.653 16626==    by 0x80E601: RelationCacheInvalidateEntry (relcache.c:2662)
==00:00:02:01.653 16626==    by 0x803DD6: LocalExecuteInvalidationMessage (inval.c:568)
==00:00:02:01.653 16626==    by 0x803F53: ProcessInvalidationMessages.clone.0 (inval.c:444)
==00:00:02:01.653 16626==    by 0x8040C8: CommandEndInvalidationMessages (inval.c:1056)
==00:00:02:01.653 16626==    by 0x80C719: RelationSetNewRelfilenode (relcache.c:3490)
==00:00:02:01.653 16626==    by 0x5CD50A: ExecuteTruncate (tablecmds.c:1393)
==00:00:02:01.653 16626==    by 0x721AC7: standard_ProcessUtility (utility.c:532)
==00:00:02:01.653 16626==    by 0x71D943: PortalRunUtility (pquery.c:1163)

IOW, partition_bounds_equal() is testing uninitialized memory during
a TRUNCATE on a partitioned table.
        regards, tom lane



Re: [HACKERS] Valgrind-detected bug in partitioning code

From
Robert Haas
Date:
On Fri, Jan 20, 2017 at 8:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> skink has been unhappy since commit d26fa4f went in, but I think
> that just exposed a pre-existing bug.  Running valgrind here
> duplicates the failure:
>
> ==00:00:02:01.653 16626== Conditional jump or move depends on uninitialised value(s)
> ==00:00:02:01.653 16626==    at 0x4BDF6B: btint4cmp (nbtcompare.c:97)
> ==00:00:02:01.653 16626==    by 0x81D6BE: FunctionCall2Coll (fmgr.c:1318)
> ==00:00:02:01.653 16626==    by 0x52D584: partition_bounds_equal (partition.c:627)
> ==00:00:02:01.653 16626==    by 0x80CF8E: RelationClearRelation (relcache.c:1203)
> ==00:00:02:01.653 16626==    by 0x80E601: RelationCacheInvalidateEntry (relcache.c:2662)
> ==00:00:02:01.653 16626==    by 0x803DD6: LocalExecuteInvalidationMessage (inval.c:568)
> ==00:00:02:01.653 16626==    by 0x803F53: ProcessInvalidationMessages.clone.0 (inval.c:444)
> ==00:00:02:01.653 16626==    by 0x8040C8: CommandEndInvalidationMessages (inval.c:1056)
> ==00:00:02:01.653 16626==    by 0x80C719: RelationSetNewRelfilenode (relcache.c:3490)
> ==00:00:02:01.653 16626==    by 0x5CD50A: ExecuteTruncate (tablecmds.c:1393)
> ==00:00:02:01.653 16626==    by 0x721AC7: standard_ProcessUtility (utility.c:532)
> ==00:00:02:01.653 16626==    by 0x71D943: PortalRunUtility (pquery.c:1163)
>
> IOW, partition_bounds_equal() is testing uninitialized memory during
> a TRUNCATE on a partitioned table.

Hmm.  That's bad.  I kind of wonder how sane it is to think that we
can invoke SQL-callable functions during a relcache reload, because
couldn't we be processing an invalidation in the context of an aborted
transaction?  And I wonder why we really need or want to do that
anyway.  For purposes of equalPartitionDescs(), it seems like the
relevant test is datumIsEqual(), not the equality operator derived
from the partition opclass.

But I think the immediate problem here is some fuzzy thinking about
the handling of the values taken from b1->content and b2->content.
Those have to be checked before examining values from b1->datums
and/or b2->datums, and the latter should be inspected only if the
former are both identical and both RANGE_DATUM_FINITE.  I'll push a
fix for that.
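
The ordering described above can be sketched in isolation as follows. This is a minimal illustration, not the committed fix: the types BoundContent and RangeBound, the BOUND_* constants, and the flat long array are hypothetical stand-ins for the real b1->content / b1->datums structures and RANGE_DATUM_FINITE, chosen so the example compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the bound representation discussed above. */
typedef enum
{
    BOUND_FINITE,        /* the matching datum slot holds a real value */
    BOUND_NEG_INFINITE,  /* the matching datum slot was never written */
    BOUND_POS_INFINITE   /* the matching datum slot was never written */
} BoundContent;

typedef struct
{
    int           nkeys;
    BoundContent *content;  /* content[j] describes datums[j] */
    long         *datums;   /* meaningful only where content[j] == BOUND_FINITE */
} RangeBound;

/*
 * Compare two bounds key by key.  The content flags are always compared
 * first; a datum slot is read only when both flags are identical and both
 * are BOUND_FINITE, so unset slots are never inspected.
 */
bool
range_bounds_equal(const RangeBound *b1, const RangeBound *b2)
{
    assert(b1 != NULL && b2 != NULL);

    if (b1->nkeys != b2->nkeys)
        return false;

    for (int j = 0; j < b1->nkeys; j++)
    {
        if (b1->content[j] != b2->content[j])
            return false;       /* different kinds of bound */
        if (b1->content[j] != BOUND_FINITE)
            continue;           /* both infinite: no datum to compare */
        if (b1->datums[j] != b2->datums[j])
            return false;
    }
    return true;
}
```

With this ordering, two bounds whose second key is BOUND_POS_INFINITE compare equal no matter what garbage sits in the corresponding datum slots, which is the property the valgrind report says was missing.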

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Valgrind-detected bug in partitioning code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Hmm.  That's bad.  I kind of wonder how sane it is to think that we
> can invoke SQL-callable functions during a relcache reload, because
> couldn't we be processing an invalidation in the context of an aborted
> transaction?

You're doing WHAT?
        regards, tom lane



Re: [HACKERS] Valgrind-detected bug in partitioning code

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Hmm.  That's bad.  I kind of wonder how sane it is to think that we
> > can invoke SQL-callable functions during a relcache reload, because
> > couldn't we be processing an invalidation in the context of an aborted
> > transaction?
>
> You're doing WHAT?

Uh.  +1.

Thanks!

Stephen

Re: [HACKERS] Valgrind-detected bug in partitioning code

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Hmm.  That's bad.  I kind of wonder how sane it is to think that we
>>> can invoke SQL-callable functions during a relcache reload,

>> You're doing WHAT?

> Uh.  +1.

Now that I've calmed down a bit: the right way to do this sort of thing is
simply to flush the invalidated data during reload, and recompute it when
it is next requested, which necessarily will be inside a valid
transaction.  Compare e.g. the handling of the lists of a relation's
indexes.
        regards, tom lane
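
A minimal sketch of the flush-and-recompute pattern described above, assuming a toy cache entry rather than the real relcache: CacheEntry, cache_entry_invalidate, cache_entry_get_items, and the rebuild_count field are all hypothetical names, and the "recomputation" is a stand-in for whatever catalog work the real code would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical cache entry holding one piece of derived data. */
typedef struct
{
    int   nitems;
    int  *items;          /* derived data, NULL while invalid */
    bool  items_valid;    /* false means "rebuild before next use" */
    int   rebuild_count;  /* instrumentation for this example only */
} CacheEntry;

/*
 * Invalidation path.  It only discards state: no lookups, no comparisons,
 * no callbacks, so it is safe to run regardless of transaction state.
 */
void
cache_entry_invalidate(CacheEntry *entry)
{
    assert(entry != NULL);
    free(entry->items);
    entry->items = NULL;
    entry->nitems = 0;
    entry->items_valid = false;
}

/*
 * Access path.  The expensive work happens here, on demand, which in the
 * real system would mean inside a valid transaction.
 */
const int *
cache_entry_get_items(CacheEntry *entry, int *nitems)
{
    assert(entry != NULL && nitems != NULL);

    if (!entry->items_valid)
    {
        /* Stand-in for the real recomputation. */
        entry->nitems = 3;
        entry->items = malloc(sizeof(int) * (size_t) entry->nitems);
        if (entry->items == NULL)
            abort();
        for (int i = 0; i < entry->nitems; i++)
            entry->items[i] = i + 1;
        entry->items_valid = true;
        entry->rebuild_count++;
    }
    *nitems = entry->nitems;
    return entry->items;
}
```

The trade-off in this sketch is that a caller must not hold the returned pointer across an invalidation, since the array is freed and later rebuilt at a possibly different address.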



Re: [HACKERS] Valgrind-detected bug in partitioning code

From
Robert Haas
Date:
On Fri, Jan 20, 2017 at 4:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>> Hmm.  That's bad.  I kind of wonder how sane it is to think that we
>>>> can invoke SQL-callable functions during a relcache reload,
>
>>> You're doing WHAT?
>
>> Uh.  +1.
>
> Now that I've calmed down a bit: the right way to do this sort of thing is
> simply to flush the invalidated data during reload, and recompute it when
> it is next requested, which necessarily will be inside a valid
> transaction.  Compare e.g. the handling of the lists of a relation's
> indexes.

The existing handling of partition descriptors is modeled on and very
similar to the existing handling for other types of objects:

               keep_tupdesc = equalTupleDescs(relation->rd_att, newrel->rd_att);
               keep_rules = equalRuleLocks(relation->rd_rules, newrel->rd_rules);
               keep_policies = equalRSDesc(relation->rd_rsdesc, newrel->rd_rsdesc);
               keep_partkey = (relation->rd_partkey != NULL);
               keep_partdesc = equalPartitionDescs(relation->rd_partkey,
                                                   relation->rd_partdesc,
                                                   newrel->rd_partdesc);

And I think the reason is the same too, namely, if we've got a pointer
into a partition descriptor in the relcache, we don't want that to
suddenly get swapped out and replaced with a pointer to an equivalent
data structure at a different address, because then our pointer will
be dangling.  That seems fine as far as it goes.
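
The keep-the-old-structure-if-equal idea can be sketched like this. It is only an illustration under simplifying assumptions: PartDesc here is a hypothetical flat struct of int bounds, and make_part_desc, part_desc_equal, part_desc_free, and install_part_desc are invented helper names, not relcache functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much simplified partition descriptor. */
typedef struct
{
    int  nparts;
    int *bounds;
} PartDesc;

PartDesc *
make_part_desc(const int *bounds, int nparts)
{
    PartDesc *d = malloc(sizeof(PartDesc));

    if (d == NULL)
        abort();
    d->nparts = nparts;
    d->bounds = malloc(sizeof(int) * (size_t) (nparts > 0 ? nparts : 1));
    if (d->bounds == NULL)
        abort();
    if (nparts > 0)
        memcpy(d->bounds, bounds, sizeof(int) * (size_t) nparts);
    return d;
}

void
part_desc_free(PartDesc *d)
{
    if (d != NULL)
    {
        free(d->bounds);
        free(d);
    }
}

bool
part_desc_equal(const PartDesc *a, const PartDesc *b)
{
    if (a == NULL || b == NULL)
        return a == b;
    if (a->nparts != b->nparts)
        return false;
    return a->nparts == 0 ||
        memcmp(a->bounds, b->bounds, sizeof(int) * (size_t) a->nparts) == 0;
}

/*
 * Install a freshly built descriptor into *slot.  If it is equal to the one
 * already there, the old one is kept and the new one discarded, so pointers
 * that callers already hold remain valid.  Returns true if the old one was kept.
 */
bool
install_part_desc(PartDesc **slot, PartDesc *fresh)
{
    assert(slot != NULL);

    if (*slot == fresh)
        return true;
    if (part_desc_equal(*slot, fresh))
    {
        part_desc_free(fresh);
        return true;
    }
    part_desc_free(*slot);
    *slot = fresh;
    return false;
}
```

The point of the sketch is that the equality test sits on the invalidation path, so whatever it calls has to be safe to run there.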

The difference is that those other equalBLAH functions call a
carefully limited amount of code whereas, in looking over the
backtrace you sent, I realized that equalPartitionDescs is calling
partition_bounds_equal which does this:
                      cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[j],
                                                               key->partcollation[j],
                                                               b1->datums[i][j],
                                                               b2->datums[i][j]))


That's of course opening up a much bigger can of worms.  But apart
from the fact that it's unsafe, I think it's also wrong, as I said
upthread.  I think calling datumIsEqual() there should be better all
around.  Do you think that's unsafe here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Valgrind-detected bug in partitioning code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> The difference is that those other equalBLAH functions call a
> carefully limited amount of code whereas, in looking over the
> backtrace you sent, I realized that equalPartitionDescs is calling
> partition_bounds_equal which does this:
>                        cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[j],
>                                                                 key->partcollation[j],
>                                                                 b1->datums[i][j],
>                                                                 b2->datums[i][j]))

Ah, gotcha.

> That's of course opening up a much bigger can of worms.  But apart
> from the fact that it's unsafe, I think it's also wrong, as I said
> upthread.  I think calling datumIsEqual() there should be better all
> around.  Do you think that's unsafe here?

That sounds like a plausible solution.  It is safe in the sense of
being a bounded amount of code.  It would return "false" in various
interesting cases like toast pointer versus detoasted equivalent,
but I think that would be fine in this application.

It would probably be a good idea to add something to datumIsEqual's
comment to the effect that trying to make it smarter would be a bad idea,
because some callers rely on it being stupid.
        regards, tom lane



Re: [HACKERS] Valgrind-detected bug in partitioning code

From
Amit Langote
Date:
On 2017/01/21 9:01, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> The difference is that those other equalBLAH functions call a
>> carefully limited amount of code whereas, in looking over the
>> backtrace you sent, I realized that equalPartitionDescs is calling
>> partition_bounds_equal which does this:
>>                        cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[j],
>>                                                                 key->partcollation[j],
>>                                                                 b1->datums[i][j],
>>                                                                 b2->datums[i][j]))
> 
> Ah, gotcha.
> 
>> That's of course opening up a much bigger can of worms.  But apart
>> from the fact that it's unsafe, I think it's also wrong, as I said
>> upthread.  I think calling datumIsEqual() there should be better all
>> around.  Do you think that's unsafe here?
> 
> That sounds like a plausible solution.  It is safe in the sense of
> being a bounded amount of code.  It would return "false" in various
> interesting cases like toast pointer versus detoasted equivalent,
> but I think that would be fine in this application.

Sorry for jumping in late.  The attached patch replaces the call to the
partitioning-specific comparison function with a call to datumIsEqual().
I wonder if it is safe to assume that datumIsEqual() would return true for
a datum and a copy of it made using datumCopy().  The latter is used to copy
a single datum from a bound's Const node (what is stored in the catalog
for every bound).

> It would probably be a good idea to add something to datumIsEqual's
> comment to the effect that trying to make it smarter would be a bad idea,
> because some callers rely on it being stupid.

I assume "making datumIsEqual() smarter" here means to make it account for
toasting of varlena datums, which is not a good idea because some of its
callers may be working in the context of an aborted transaction.  I tried
to update the header comment along these lines, though please feel free to
rewrite it.

Thanks,
Amit


Re: [HACKERS] Valgrind-detected bug in partitioning code

From
Robert Haas
Date:
On Mon, Jan 23, 2017 at 12:45 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Sorry for jumping in late.  The attached patch replaces the call to the
> partitioning-specific comparison function with a call to datumIsEqual().
> I wonder if it is safe to assume that datumIsEqual() would return true for
> a datum and a copy of it made using datumCopy().  The latter is used to copy
> a single datum from a bound's Const node (what is stored in the catalog
> for every bound).

Thanks, committed.  I expanded the comment in partition.c because I
think you missed the other rationale for doing it this way, which is
that the partitioning operator might ignore some "unimportant" changes
(e.g. for numeric, the difference between 1.0 and 1.00) but for this
purpose it's better to update the relcache if there is *any* change.
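
The distinction between operator equality and image equality can be illustrated with a toy decimal type. This is a sketch under stated assumptions: ToyDecimal and both comparison functions are hypothetical and are not how numeric or datumIsEqual() are implemented. They only mimic the behaviour described above, where 1.0 and 1.00 are equal to the operator but differ in their stored bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical toy decimal: value = digits / 10^scale.
 * Two int32_t fields, so the struct has no padding bytes and a
 * byte-wise comparison sees only meaningful data.
 */
typedef struct
{
    int32_t digits;
    int32_t scale;
} ToyDecimal;

/* Operator-style equality: compares values, ignoring display scale. */
bool
toy_decimal_op_equal(const ToyDecimal *a, const ToyDecimal *b)
{
    int64_t x = a->digits;
    int64_t y = b->digits;

    /* This sketch assumes small, non-negative scales (no overflow handling). */
    assert(a->scale >= 0 && b->scale >= 0);

    for (int32_t s = a->scale; s < b->scale; s++)
        x *= 10;
    for (int32_t s = b->scale; s < a->scale; s++)
        y *= 10;
    return x == y;
}

/* Image-style equality: any difference in the stored bytes counts. */
bool
toy_decimal_image_equal(const ToyDecimal *a, const ToyDecimal *b)
{
    return memcmp(a, b, sizeof(ToyDecimal)) == 0;
}
```

For the pair {10, 1} (1.0) and {100, 2} (1.00), the operator-style test reports equality while the image-style test does not, so a cache keyed on the image-style test would refresh its copy after such a change.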

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company