On 10/16/25 20:12, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> The reason is pretty simple - ResourceOwner tracks the resources in a
>> very simple hash table, with O(n^2) behavior with duplicates. This
>> happens with COPY, because COPY creates an array of 1000 tuple slots,
>> and each slot references the same tuple descriptor. And the descriptor
>> is added to ResourceOwner for each slot.
>> ...
>> There's an easy way to improve this by allowing a single hash entry to
>> represent multiple references to the same resource. The attached patch
>> adds a "count" to the ResourceElem, tracking how many times that
>> resource was added. So if you add 1000 tuple slots, the descriptor will
>> have just one ResourceElem entry with count=1000.
>
> Hmm. I don't love the 50% increase in sizeof(ResourceElem) ... maybe
> that's negligible, or maybe it isn't.
I agree the +50% is not great. My plan was to use a smaller data type,
say uint8 or uint16, and reduce the overhead that way (while still
massively reducing the number of entries). But I hadn't realized the
struct will always end up with extra padding, which makes that pointless.
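Just to illustrate - a rough sketch, assuming ResourceElem is still just
the item Datum plus the kind pointer, on a typical 64-bit build:

typedef struct ResourceElem
{
	Datum		item;				/* 8 bytes */
	const ResourceOwnerDesc *kind;	/* 8 bytes, NULL means a free slot */
	uint16		count;				/* 2 bytes */
} ResourceElem;

/*
 * sizeof(ResourceElem) stays at 24 either way - "item" forces 8-byte
 * alignment, so a uint16 counter just leaves 6 bytes of trailing padding
 * where a uint32 would leave 4.
 */
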
Maybe we could make the struct __packed__ and put the counter last
(perhaps as uint16)? But I don't see any other packed structs, so I
guess there's a reason not to have them.
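Something like this, I mean (untested, just to show the idea):

typedef struct ResourceElem
{
	Datum		item;
	const ResourceOwnerDesc *kind;
	uint16		count;
} __attribute__((packed)) ResourceElem;

/*
 * That would get sizeof() down to 18, but then elements past the first
 * one in the hash array are no longer 8-byte aligned, which I assume is
 * exactly the reason we avoid packed structs for in-memory stuff.
 */
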
I'm not sure whether this is negligible or not. In cases with duplicates
it actually saves a lot of memory, so it's a clear win. I'm less sure
about cases without duplicates - those will use more memory, and whether
that's negligible probably depends on how many resource owners there
usually are, and how many entries they contain.
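To put rough numbers on it (assuming 16 bytes per entry now and 24 with
the counter): the COPY case goes from ~1000 duplicate entries for a
single descriptor, i.e. ~16kB of hash entries, down to a single 24B
entry, while a duplicate-free owner with, say, 128 entries grows from
~2kB to ~3kB.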
My intuition was there wouldn't be that many, and that the typical
number of elements is fairly low (if not, why would the initial capacity
be set to 64/128?). But maybe my intuition is wrong.
> Can you find evidence of this change being helpful for anything
> except this specific scenario in COPY?
I went through the ResourceOwnerRemember() calls, looking for other
cases that might create a lot of duplicates, similar to the tuple
descriptors, but I haven't found anything obvious. Other resources seem
to be either naturally unique or limited to very few duplicates.
> Because we could probably find some way to avoid registering all the
> doppelganger slots, if that's the only culprit.
I'm not against other approaches.
I didn't want to rework the foundations of the resource owner
management, and this approach seemed simple, with minimal impact on
cases without duplicates (except for the struct getting larger).
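For reference, the gist of the patch is roughly this (a simplified
sketch, not the actual code - RememberResource, hash_lookup and
hash_insert are made-up names standing in for the real resowner.c
routines):

static void
RememberResource(ResourceOwner owner, Datum item,
				 const ResourceOwnerDesc *kind)
{
	/* look for an existing entry for the same (item, kind) pair */
	ResourceElem *elem = hash_lookup(owner, item, kind);

	if (elem != NULL)
		elem->count++;			/* duplicate - bump the count, no new entry */
	else
		hash_insert(owner, item, kind, 1);	/* first reference, count = 1 */
}

/*
 * Forget does the opposite - decrement the count and only remove the
 * entry once it drops to zero.
 */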
thanks
--
Tomas Vondra