Thread: Don't overwrite scan key in systable_beginscan()

Don't overwrite scan key in systable_beginscan()

From
Peter Eisentraut
Date:
When systable_beginscan() and systable_beginscan_ordered() choose an 
index scan, they remap the attribute numbers in the passed-in scan keys 
to the attribute numbers of the index, and then write those remapped 
attribute numbers back into the scan key passed by the caller.  This 
second part is surprising and gratuitous.  It means that a scan key 
cannot safely be used more than once (but it might sometimes work, 
depending on circumstances).  Also, there is no value in providing these 
remapped attribute numbers back to the caller, since they can't do 
anything with that.

I propose to fix that by making a copy of the scan keys passed by the 
caller and make the modifications there.

In order to prove to myself that there are no other cases where 
caller-provided scan keys are modified, I went through and 
const-qualified all the APIs.  This works out correctly.  Several levels 
down in the stack, the access methods make their own copy of the scan 
keys that they store in their scan descriptors, and they use those in 
non-const-clean ways, but that's ok, that's their business.  As far as 
the top-level callers are concerned, they can rely on their scan keys to 
be const after this.

I'm not proposing this second patch for committing at this time, since 
that would modify the public access method APIs in an incompatible way. 
I've made a proposal of a similar nature in [0].  At some point, it 
might be worth batching these and other changes together and make the 
change.  I might come back to that later.

[0]: 
https://www.postgresql.org/message-id/flat/14c31f4a-0347-0805-dce8-93a9072c05a5%40eisentraut.org

While researching how the scan keys get copied around, I noticed that 
the index access methods all use memmove() to make the above-mentioned 
copy into their own scan descriptor.  This is fine, but memmove() is 
usually only used when something special is going on that would prevent 
memcpy() from working, which is not the case there.  So to avoid the 
confusion for future readers, I changed those to memcpy().  I suspect 
that this code has been copied between the different index AM over time. 
  (The nbtree version of this code is literally unchanged since July 1996.)
Attachment

Re: Don't overwrite scan key in systable_beginscan()

From
Robert Haas
Date:
On Thu, Aug 8, 2024 at 2:46 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> When systable_beginscan() and systable_beginscan_ordered() choose an
> index scan, they remap the attribute numbers in the passed-in scan keys
> to the attribute numbers of the index, and then write those remapped
> attribute numbers back into the scan key passed by the caller.  This
> second part is surprising and gratuitous.  It means that a scan key
> cannot safely be used more than once (but it might sometimes work,
> depending on circumstances).  Also, there is no value in providing these
> remapped attribute numbers back to the caller, since they can't do
> anything with that.
>
> I propose to fix that by making a copy of the scan keys passed by the
> caller and make the modifications there.

This does have the disadvantage of adding more palloc overhead.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Don't overwrite scan key in systable_beginscan()

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Aug 8, 2024 at 2:46 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>> I propose to fix that by making a copy of the scan keys passed by the
>> caller and make the modifications there.

> This does have the disadvantage of adding more palloc overhead.

It seems hard to believe that one more palloc + memcpy is going to be
noticeable in the context of an index scan.

            regards, tom lane



Re: Don't overwrite scan key in systable_beginscan()

From
Noah Misch
Date:
On Thu, Aug 08, 2024 at 08:46:35AM +0200, Peter Eisentraut wrote:
> When systable_beginscan() and systable_beginscan_ordered() choose an index
> scan, they remap the attribute numbers in the passed-in scan keys to the
> attribute numbers of the index, and then write those remapped attribute
> numbers back into the scan key passed by the caller.  This second part is
> surprising and gratuitous.  It means that a scan key cannot safely be used
> more than once (but it might sometimes work, depending on circumstances).
> Also, there is no value in providing these remapped attribute numbers back
> to the caller, since they can't do anything with that.
> 
> I propose to fix that by making a copy of the scan keys passed by the caller
> and make the modifications there.

No objection, but this would obsolete at least some of these comments (the
catcache.c ones if nothing else):

$ git grep -in scankey | grep -i copy
src/backend/access/gist/gistscan.c:257:                  * Copy consistent support function to ScanKey structure
instead
src/backend/access/gist/gistscan.c:330:                  * Copy distance support function to ScanKey structure instead
of
src/backend/access/nbtree/nbtutils.c:281:       ScanKey         arrayKeyData;   /* modified copy of scan->keyData */
src/backend/access/nbtree/nbtutils.c:3303: * We copy the appropriate indoption value into the scankey sk_flags
src/backend/access/nbtree/nbtutils.c:3318: * It's a bit ugly to modify the caller's copy of the scankey but in
practice
src/backend/access/spgist/spgscan.c:385:        /* copy scankeys into local storage */
src/backend/utils/cache/catcache.c:1474:                 * Ok, need to make a lookup in the relation, copy the scankey
and
src/backend/utils/cache/catcache.c:1794:                         * Ok, need to make a lookup in the relation, copy the
scankeyand
 
src/backend/utils/cache/relfilenumbermap.c:192:         /* copy scankey to local copy, it will be modified during the
scan*/
 

> In order to prove to myself that there are no other cases where
> caller-provided scan keys are modified, I went through and const-qualified
> all the APIs.  This works out correctly.  Several levels down in the stack,
> the access methods make their own copy of the scan keys that they store in
> their scan descriptors, and they use those in non-const-clean ways, but
> that's ok, that's their business.  As far as the top-level callers are
> concerned, they can rely on their scan keys to be const after this.

We do still have code mutating IndexScanDescData.keyData, but that's fine.  We
must be copying function args to form IndexScanDescData.keyData, or else your
proof patch would have noticed an assignment of const to non-const.



Re: Don't overwrite scan key in systable_beginscan()

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Thu, Aug 08, 2024 at 08:46:35AM +0200, Peter Eisentraut wrote:
>> I propose to fix that by making a copy of the scan keys passed by the caller
>> and make the modifications there.

> No objection, but this would obsolete at least some of these comments (the
> catcache.c ones if nothing else):

That ties into something I forgot to ask: aren't there any copy steps
or other overhead that we could remove, given this new API constraint?
That would help address Robert's concern.

            regards, tom lane



Re: Don't overwrite scan key in systable_beginscan()

From
Peter Eisentraut
Date:
On 09.08.24 06:55, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Thu, Aug 08, 2024 at 08:46:35AM +0200, Peter Eisentraut wrote:
>>> I propose to fix that by making a copy of the scan keys passed by the caller
>>> and make the modifications there.
> 
>> No objection, but this would obsolete at least some of these comments (the
>> catcache.c ones if nothing else):
> 
> That ties into something I forgot to ask: aren't there any copy steps
> or other overhead that we could remove, given this new API constraint?
> That would help address Robert's concern.

I added two more patches to the series here.

First (or 0004), some additional cleanup for code that had to workaround 
systable_beginscan() overwriting the scan keys, along the lines 
suggested by Noah.

Second (or 0005), an alternative to palloc is to make the converted scan 
keys a normal local variable.  Then it's just a question of whether a 
smaller palloc is preferred over an over-allocated local variable.  I 
think I still prefer the palloc version, but it doesn't matter much 
either way I think.

Attachment

Re: Don't overwrite scan key in systable_beginscan()

From
Peter Eisentraut
Date:
On 12.08.24 09:44, Peter Eisentraut wrote:
> On 09.08.24 06:55, Tom Lane wrote:
>> Noah Misch <noah@leadboat.com> writes:
>>> On Thu, Aug 08, 2024 at 08:46:35AM +0200, Peter Eisentraut wrote:
>>>> I propose to fix that by making a copy of the scan keys passed by 
>>>> the caller
>>>> and make the modifications there.
>>
>>> No objection, but this would obsolete at least some of these comments 
>>> (the
>>> catcache.c ones if nothing else):
>>
>> That ties into something I forgot to ask: aren't there any copy steps
>> or other overhead that we could remove, given this new API constraint?
>> That would help address Robert's concern.
> 
> I added two more patches to the series here.
> 
> First (or 0004), some additional cleanup for code that had to workaround 
> systable_beginscan() overwriting the scan keys, along the lines 
> suggested by Noah.
> 
> Second (or 0005), an alternative to palloc is to make the converted scan 
> keys a normal local variable.  Then it's just a question of whether a 
> smaller palloc is preferred over an over-allocated local variable.  I 
> think I still prefer the palloc version, but it doesn't matter much 
> either way I think.

Looks like the discussion here has settled, so I plan to go head with 
patches

[PATCH v2 1/5] Don't overwrite scan key in systable_beginscan()
[PATCH v2 3/5] Replace gratuitous memmove() with memcpy()
[PATCH v2 4/5] Update some code that handled systable_beginscan()
  overwriting scan key

(folding patch 4 into patch 1)




Re: Don't overwrite scan key in systable_beginscan()

From
Justin Pryzby
Date:
On Mon, Aug 12, 2024 at 09:44:02AM +0200, Peter Eisentraut wrote:
> Second (or 0005), an alternative to palloc is to make the converted scan
> keys a normal local variable.  Then it's just a question of whether a
> smaller palloc is preferred over an over-allocated local variable.  I think
> I still prefer the palloc version, but it doesn't matter much either way I
> think.

Since 811af9786b, the palloc'd idxkey's seem to be leaking/accumulating
throughout the command.

I noticed this on the master branch while running ANALYZE on partitioned
table with 600 attributes, even though only 6 were being analyzed.

LOG:  level: 3; BuildRelationExtStatistics: 1239963512 total in 278 blocks; 5082984 free (296 chunks); 1234880528 used

Several indexes are being scanned many thousands of times.

 310690 beginscan 2659
 310689 beginscan 2662
  77672 beginscan 2678

postgres=# SELECT 2659::REGCLASS;
regclass | pg_attribute_relid_attnum_index

postgres=# SELECT 2662::REGCLASS;
regclass | pg_class_oid_index

postgres=# SELECT 2678::REGCLASS;
regclass | pg_index_indrelid_index

> From 2f84e4d21ce611a4e6f428f72a18518e22776400 Mon Sep 17 00:00:00 2001
> From: Peter Eisentraut <peter@eisentraut.org>
> Date: Thu, 8 Aug 2024 08:27:26 +0200
> Subject: [PATCH v2 1/5] Don't overwrite scan key in systable_beginscan()
...
> Fix that by making a copy of the scan keys passed by the caller and
> make the modifications there.
> 
> diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
> index de751e8e4a3..b5eba549d3e 100644
> --- a/src/backend/access/index/genam.c
> +++ b/src/backend/access/index/genam.c
> @@ -419,17 +419,22 @@ systable_beginscan(Relation heapRelation,
>      if (irel)
>      {
>          int            i;
> +        ScanKey        idxkey;
>  
> -        /* Change attribute numbers to be index column numbers. */
> +        idxkey = palloc_array(ScanKeyData, nkeys);
> +
> +        /* Convert attribute numbers to be index column numbers. */
>          for (i = 0; i < nkeys; i++)



Re: Don't overwrite scan key in systable_beginscan()

From
Peter Eisentraut
Date:
On 26.11.24 14:56, Justin Pryzby wrote:
> Since 811af9786b, the palloc'd idxkey's seem to be leaking/accumulating
> throughout the command.
> 
> I noticed this on the master branch while running ANALYZE on partitioned
> table with 600 attributes, even though only 6 were being analyzed.
> 
> LOG:  level: 3; BuildRelationExtStatistics: 1239963512 total in 278 blocks; 5082984 free (296 chunks); 1234880528
used
> 
> Several indexes are being scanned many thousands of times.

Hmm, this patch inserts one additional palloc() call per 
systable_beginscan().  So it won't have much of an impact for isolated 
calls, but for thousands of scans you get thousands of small chunks of 
memory.

Does your test case get better if you insert corresponding pfree() calls?

A higher-level question would be whether there should be better memory 
management in the caller.  But it would be okay to address it with a 
pfree() as well, I think.




Re: Don't overwrite scan key in systable_beginscan()

From
Justin Pryzby
Date:
On Wed, Nov 27, 2024 at 04:33:25PM +0100, Peter Eisentraut wrote:
> On 26.11.24 14:56, Justin Pryzby wrote:
> > Since 811af9786b, the palloc'd idxkey's seem to be leaking/accumulating
> > throughout the command.
> > 
> > I noticed this on the master branch while running ANALYZE on partitioned
> > table with 600 attributes, even though only 6 were being analyzed.
> > 
> > LOG:  level: 3; BuildRelationExtStatistics: 1239963512 total in 278 blocks; 5082984 free (296 chunks); 1234880528
used
> > 
> > Several indexes are being scanned many thousands of times.
> 
> Hmm, this patch inserts one additional palloc() call per
> systable_beginscan().  So it won't have much of an impact for isolated
> calls, but for thousands of scans you get thousands of small chunks of
> memory.
> 
> Does your test case get better if you insert corresponding pfree() calls?

Yes -- I'd already checked.

-- 
Justin



Re: Don't overwrite scan key in systable_beginscan()

From
Peter Eisentraut
Date:
On 27.11.24 16:35, Justin Pryzby wrote:
> On Wed, Nov 27, 2024 at 04:33:25PM +0100, Peter Eisentraut wrote:
>> On 26.11.24 14:56, Justin Pryzby wrote:
>>> Since 811af9786b, the palloc'd idxkey's seem to be leaking/accumulating
>>> throughout the command.
>>>
>>> I noticed this on the master branch while running ANALYZE on partitioned
>>> table with 600 attributes, even though only 6 were being analyzed.
>>>
>>> LOG:  level: 3; BuildRelationExtStatistics: 1239963512 total in 278 blocks; 5082984 free (296 chunks); 1234880528
used
>>>
>>> Several indexes are being scanned many thousands of times.
>>
>> Hmm, this patch inserts one additional palloc() call per
>> systable_beginscan().  So it won't have much of an impact for isolated
>> calls, but for thousands of scans you get thousands of small chunks of
>> memory.
>>
>> Does your test case get better if you insert corresponding pfree() calls?
> 
> Yes -- I'd already checked.

Ok, I committed a fix that inserts some pfree() calls.  Thanks for the 
report.