
From Tomas Vondra
Subject Optimizing ResourceOwner to speed up COPY
Date
Msg-id 84f20db7-7d57-4dc0-8144-7e38e0bbe75d@vondra.me
List pgsql-hackers
Hi,

While reviewing and testing a nearby patch (using COPY for batching in
postgres_fdw), I noticed that some of the COPY queries spend a
substantial amount of time in ResourceOwnerAddToHash(). The exact figure
depends on the amount of data in the COPY, but it was often close to 50%
(according to perf record/report).

The reason is pretty simple - ResourceOwner tracks resources in a very
simple hash table, with O(n^2) behavior for duplicates. This happens
with COPY because COPY creates an array of 1000 tuple slots, each slot
references the same tuple descriptor, and the descriptor is added to the
ResourceOwner once per slot.

The trouble is that ResourceOwner uses a simple open-addressing hash
table, and the duplicate values end up in a long dense "train" of
entries, slowing down both additions and deletions. Starting with an
empty hash table, adding the first entry is fine - the first slot is
available. But each following addition has to skip over the existing
entries - the n-th addition needs to skip over n-1 of them.

With N entries that adds up to N * (N-1) / 2 probes. And then the same
thing happens for deletions, in reverse.
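
To illustrate the arithmetic, here's a simplified stand-alone model of
linear probing with duplicate keys. It's just a sketch to show where the
N * (N-1) / 2 comes from - the table size, names and hash value are made
up, it's not the actual ResourceOwnerAddToHash() code:

    #include <stdint.h>
    #include <stdio.h>

    #define CAPACITY 2048            /* table size, power of two */

    static uintptr_t table[CAPACITY];    /* 0 means "empty slot" */

    /* insert with plain linear probing, return the number of skipped slots */
    static unsigned
    insert(uintptr_t key, uint32_t hash)
    {
        unsigned    probes = 0;
        uint32_t    idx = hash & (CAPACITY - 1);

        while (table[idx] != 0)      /* skip over occupied slots */
        {
            probes++;
            idx = (idx + 1) & (CAPACITY - 1);
        }
        table[idx] = key;
        return probes;
    }

    int
    main(void)
    {
        unsigned    total = 0;

        /* add the same "descriptor" 1000 times, like the COPY slots do */
        for (int i = 0; i < 1000; i++)
            total += insert((uintptr_t) 0xdeadbeef, 42);

        /* prints 499500 = 1000 * 999 / 2 */
        printf("total probes: %u\n", total);
        return 0;
    }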

This is not the first time I noticed ResourceOwner in profiles, but I
never investigated it. I don't know if all those cases were caused by
this same behavior, but it's likely at least some were ...

There's an easy way to improve this by allowing a single hash entry to
represent multiple references to the same resource. The attached patch
adds a "count" to ResourceElem, tracking how many times that resource
was added. So if you add 1000 tuple slots, the descriptor will have just
one ResourceElem entry with count=1000.

This reduces the insert/delete complexity - not quite to O(1), but much
lower than O(n^2). It also reduces the memory usage with duplicates, and
postpones when the hash table needs to be enlarged. Of course, if there
are no duplicates, it'll use a bit more memory.

The deduplication is "opportunistic" in the sense that you may still
end up with multiple entries for the same value/kind. If the insertion
finds an empty slot before hitting the existing entry, it'll use the
empty slot. Also, only the hash table is deduplicated, not the initial
array (which is, however, quite small).

This is a reasonable trade-off because it means things don't get more
expensive for cases with no duplicates, while still improving cases with
many duplicate values.
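
For illustration, here's a minimal sketch of the insert path with such a
count - the names (DedupElem, dedup_insert) are made up and the layout
is simplified, it's not the actual patch or the real ResourceElem:

    #include <stdint.h>

    typedef struct DedupElem
    {
        uintptr_t   item;       /* the tracked resource (0 = empty slot) */
        uint32_t    count;      /* how many times it was remembered */
    } DedupElem;

    void
    dedup_insert(DedupElem *table, uint32_t mask, uintptr_t item,
                 uint32_t hash)
    {
        uint32_t    idx = hash & mask;

        for (;;)
        {
            if (table[idx].item == 0)
            {
                /* hit an empty slot first - start a new entry */
                table[idx].item = item;
                table[idx].count = 1;
                return;
            }
            if (table[idx].item == item)
            {
                /*
                 * Same resource already present - just bump the count. If
                 * an empty slot had come first, we'd have created a second
                 * entry instead, which is the "opportunistic" part.
                 */
                table[idx].count++;
                return;
            }
            idx = (idx + 1) & mask;  /* keep probing */
        }
    }

The delete path would then decrement the count and only free the slot
once it drops to zero, which is where the savings on deletions come
from.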

To demonstrate the benefits, here's the throughput (tps) from a COPY of
a certain number of rows from a file into an UNLOGGED table, with 1, 4
and 8 clients.

                           master                patched
    rows    |       1       4       8  |       1       4       8
   ---------|--------------------------|-------------------------
    10      |  112090  418421  594905  |  112871  419043  589646
    100     |   35265  121903  168238  |   42536  138578  183740
    1000    |    1450    5700   10725  |    5847   21594   30713
    10000   |     531    1943    2988  |     743    2619    3557
    100000  |      72     244     324  |      76     256     332

Or, as relative throughput:

                    patched / master
    rows    |         1        4        8
   ---------------------------------------
    10      |      101%     100%      99%
    100     |      121%     114%     109%
    1000    |      403%     379%     286%
    10000   |      140%     135%     119%
    100000  |      106%     105%     102%

Those are some nice improvements, especially with 1000 rows. That makes
sense, because 1000 is the internal batch size in COPY. So the overhead
gradually increases up to 1000 rows, and then it amortizes over many
batches. It has very little impact for large COPY.

I've done the same test with a logged table. The results significantly
depend on the storage (how fast it can handle commits) - not surprising.
But with good storage the overall behavior is almost exactly the same.

I haven't yet done any benchmarking focused on cases with no/few
duplicates (although the COPY with 10 rows could qualify). But as I
explained earlier, the behavior should not change at all. There's an
extra uint32, but that's it.


regards

-- 
Tomas Vondra
