Thread: Speeding up COPY TO for uuids and arrays
While analyzing a customer's performance problem, I noticed that
the performance of pg_dump for large arrays is terrible.

As a test case, I created a table with 10000 rows, each of which
had an array of 10000 uuids. The table resided in shared buffers.

The following took 24.5 seconds:

  COPY mytab TO '/dev/null';

Most of the time was spent in array_out and uuid_out.

I tried binary copy, which took 4.4 seconds:

  COPY mytab TO '/dev/null' (FORMAT 'binary');

Here, a lot of time was spent in pq_begintypsend.

So I looked for low-hanging fruit, and the result is the attached
patch series.

- Patch 0001 speeds up pq_begintypsend with a custom macro.
  That brought the binary copy down to 3.7 seconds, which is a
  speed gain of 15%.

- Patch 0002 speeds up uuid_out by avoiding the overhead of
  a StringInfo. This brings text mode COPY to 19.4 seconds,
  which is a speed gain of 21%.

- Patch 0003 speeds up array_out a bit by avoiding some zero
  byte writes. The measured speed gain is under 2%.

Yours,
Laurenz Albe
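To picture the pq_begintypsend pattern being optimized above, here is a minimal standalone model (an editorial sketch, not the PostgreSQL source; the names Buf, buf_begintypsend, buf_append, and buf_endtypsend are made up for illustration): a send function reserves 4 bytes at the head of the buffer for the message length, appends the payload, and patches the length in at the end. Since the reserved bytes are always overwritten, zero-initializing them up front is avoidable work, which is the point the thread discusses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A toy output buffer standing in for a StringInfo. */
typedef struct
{
    uint8_t data[64];
    int     len;
} Buf;

/*
 * Reserve the 4-byte length header.  Simply advancing the length is
 * enough; the bytes are patched in by buf_endtypsend() below, so there
 * is no need to zero them first.
 */
static void
buf_begintypsend(Buf *b)
{
    b->len = 4;
}

static void
buf_append(Buf *b, const void *p, int n)
{
    memcpy(b->data + b->len, p, n);
    b->len += n;
}

/* Patch the payload length (big-endian) into the reserved header. */
static void
buf_endtypsend(Buf *b)
{
    uint32_t n = (uint32_t) (b->len - 4);

    b->data[0] = (uint8_t) (n >> 24);
    b->data[1] = (uint8_t) (n >> 16);
    b->data[2] = (uint8_t) (n >> 8);
    b->data[3] = (uint8_t) n;
}
```

The interesting design point is that the begin step is nothing but a pointer bump, which is why shaving per-call overhead there pays off when it runs once per attribute per row.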
Hi,

On 2024-02-17 17:48:23 +0100, Laurenz Albe wrote:
> - Patch 0001 speeds up pq_begintypsend with a custom macro.
>   That brought the binary copy down to 3.7 seconds, which is a
>   speed gain of 15%.

Nice win, but I think we can do a bit better than this. Certainly it
shouldn't be a macro, given the multiple evaluation risks. I don't
think we actually need to zero those bytes; in fact, that seems more
likely to hide bugs than prevent them.

I have an old patch series improving performance in this area. A big
win is to simply not allocate as much memory for a new stringinfo
when we already know the upper bound, as we do in many of the send
functions.

> - Patch 0002 speeds up uuid_out by avoiding the overhead of
>   a StringInfo. This brings text mode COPY to 19.4 seconds,
>   which is a speed gain of 21%.

Nice!

I wonder if we should move the core part for converting to hex to
numutils.c; we already have code for the inverse. There does seem to
be further optimization potential in the conversion, and that seems
better done somewhere central rather than in one type's output
function. OTOH, it might not be worth it, given the need to add the
dashes.

> - Patch 0003 speeds up array_out a bit by avoiding some zero
>   byte writes. The measured speed gain is under 2%.

Makes sense.

Greetings,

Andres Freund
On Sat, Feb 17, 2024 at 12:24:33PM -0800, Andres Freund wrote:
> I wonder if we should move the core part for converting to hex to
> numutils.c; we already have code for the inverse. There does seem to
> be further optimization potential in the conversion, and that seems
> better done somewhere central rather than in one type's output
> function. OTOH, it might not be worth it, given the need to add the
> dashes.

I'd tend to live with the current location of the code, but I don't
feel strongly about it, so I'm OK with what Laurenz is proposing.

>> - Patch 0003 speeds up array_out a bit by avoiding some zero
>>   byte writes. The measured speed gain is under 2%.
>
> Makes sense.

+1.
--
Michael
On Sat, 2024-02-17 at 12:24 -0800, Andres Freund wrote:
> On 2024-02-17 17:48:23 +0100, Laurenz Albe wrote:
> > - Patch 0001 speeds up pq_begintypsend with a custom macro.
> >   That brought the binary copy down to 3.7 seconds, which is a
> >   speed gain of 15%.
>
> Nice win, but I think we can do a bit better than this. Certainly it
> shouldn't be a macro, given the multiple evaluation risks. I don't
> think we actually need to zero those bytes; in fact, that seems more
> likely to hide bugs than prevent them.
>
> I have an old patch series improving performance in this area.

The multiple evaluation danger could be worked around with an automatic
variable, or the macro could just carry a warning (like
appendStringInfoCharMacro). But considering the thread in [1], I guess
I'll retract that patch; I'm sure the more invasive improvement you
have in mind will do better.

> > - Patch 0002 speeds up uuid_out by avoiding the overhead of
> >   a StringInfo. This brings text mode COPY to 19.4 seconds,
> >   which is a speed gain of 21%.
>
> Nice!
>
> I wonder if we should move the core part for converting to hex to
> numutils.c; we already have code for the inverse. There does seem to
> be further optimization potential in the conversion, and that seems
> better done somewhere central rather than in one type's output
> function. OTOH, it might not be worth it, given the need to add the
> dashes.

I thought about it, but then decided against it.
Calling a function that converts the bytes to hex and then adding the
hyphens would slow down processing, and I think the code savings would
be minimal at best.

> > - Patch 0003 speeds up array_out a bit by avoiding some zero
> >   byte writes. The measured speed gain is under 2%.
>
> Makes sense.

Thanks for your interest and approval. The attached patches are the
renamed, but unmodified, patches 2 and 3 from before.

Yours,
Laurenz Albe

[1]: https://postgr.es/m/CAMkU%3D1whbRDUwa4eayD9%2B59K-coxO9senDkPRbTn3cg0pUz4AQ%40mail.gmail.com
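The "need to add the dashes" point above is the crux of keeping the conversion local: the UUID text form interleaves hyphens with the hex digits, so a generic bytes-to-hex helper would have to be followed by a second pass. As an illustration only (a standalone sketch, not the actual uuid_out patch; uuid_to_text and hex_digits are made-up names), the direct approach writes digits and dashes in one pass into a caller-supplied buffer, avoiding any StringInfo overhead:

```c
#include <assert.h>
#include <string.h>

static const char hex_digits[] = "0123456789abcdef";

/*
 * Format a 16-byte UUID in the standard 8-4-4-4-12 text form.
 * out must have room for 36 characters plus a terminating NUL.
 */
static void
uuid_to_text(const unsigned char uuid[16], char *out)
{
    /* A dash follows the bytes at these indexes. */
    static const int dash_after[] = {3, 5, 7, 9};
    int         i;
    int         d = 0;

    for (i = 0; i < 16; i++)
    {
        *out++ = hex_digits[uuid[i] >> 4];
        *out++ = hex_digits[uuid[i] & 0xF];
        if (d < 4 && i == dash_after[d])
        {
            *out++ = '-';
            d++;
        }
    }
    *out = '\0';
}
```

Because the hyphen positions are fixed, the one-pass loop costs nothing extra, which is presumably why splitting the hex conversion out into a shared helper was judged not worth it here.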
On Mon, Feb 19, 2024 at 03:08:45PM +0100, Laurenz Albe wrote:
> On Sat, 2024-02-17 at 12:24 -0800, Andres Freund wrote:
>> I wonder if we should move the core part for converting to hex to
>> numutils.c; we already have code for the inverse. There does seem to
>> be further optimization potential in the conversion, and that seems
>> better done somewhere central rather than in one type's output
>> function. OTOH, it might not be worth it, given the need to add the
>> dashes.
>
> I thought about it, but then decided against it.
> Calling a function that converts the bytes to hex and then adding the
> hyphens would slow down processing, and I think the code savings would
> be minimal at best.

Yeah, I'm not sure either if that's worth doing; the current conversion
code is simple enough. I'd be curious to hear about ideas to optimize
that more locally, though.

I was curious about the UUID one, and COPYing out 10M rows with a
single UUID attribute brings down the runtime of a COPY from 3.8s to
2.3s here on a simple benchmark, with uuid_out showing up easily at the
top of profiles on HEAD.

Some references for HEAD:
    31.63%     5300  postgres  postgres  [.] uuid_out
    19.79%     3316  postgres  postgres  [.] appendStringInfoChar
    11.27%     1887  postgres  postgres  [.] CopyAttributeOutText
     6.36%     1065  postgres  postgres  [.] pgstat_progress_update_param

And with the patch for uuid_out:
    22.66%     2147  postgres  postgres  [.] CopyAttributeOutText
    12.99%     1231  postgres  postgres  [.] uuid_out
    11.41%     1081  postgres  postgres  [.] pgstat_progress_update_param
     4.79%      454  postgres  postgres  [.] CopyOneRowTo

That's very good, so I'd like to apply this part. Let me know if there
are any objections.
--
Michael
On Wed, Feb 21, 2024 at 02:29:05PM +0900, Michael Paquier wrote:
> That's very good, so I'd like to apply this part. Let me know if
> there are any objections.

This part is done as of 011d60c4352c. I did not evaluate the rest yet.
--
Michael
On Thu, 2024-02-22 at 10:34 +0900, Michael Paquier wrote:
> This part is done as of 011d60c4352c. I did not evaluate the rest
> yet.

Thanks!

I'm attaching the remaining patch for the July commitfest, if you
don't get inspired before that.

Yours,
Laurenz Albe
On Thu, Feb 22, 2024 at 08:16:09AM +0100, Laurenz Albe wrote:
> I'm attaching the remaining patch for the July commitfest, if you
> don't get inspired before that.

There are good chances that I get inspired some time next week.
We'll see ;)
--
Michael
Hi,
On 2024-02-17 17:48:23 +0100, Laurenz Albe wrote:
> As a test case, I created a table with 10000 rows, each of which
> had an array of 10000 uuids. The table resided in shared buffers.
Can you share the exact script used to create the table?
best regards,
Ranier Vilela
On Thu, Feb 22, 2024 at 04:42:37PM -0300, Ranier Vilela wrote:
> Can you share the exact script used to create the table?

Stressing the internals of array_out() for the area of the patch is
not that difficult, as we want to quote each element that's returned
in output. The trick is to have the following to stress the second
quoting loop a maximum:
- a high number of rows.
- a high number of items in the arrays.
- a *minimum* number of characters in each element of the array, with
  characters that require quoting.

The best test case I can think of to demonstrate the patch would be
something like that (adjust rows and elts as you see fit):

-- Number of rows
\set rows 6
-- Number of elements
\set elts 4
create table tab as
  with data as (
    select array_agg(a) as array
      from (
        select '{'::text
          from generate_series(1, :elts) as int(a)) as index(a))
  select data.array from data, generate_series(1, :rows);

Then I get:
       array
-------------------
 {"{","{","{","{"}
 {"{","{","{","{"}
 {"{","{","{","{"}
 {"{","{","{","{"}
 {"{","{","{","{"}
 {"{","{","{","{"}
(6 rows)

With "\set rows 100000" and "\set elts 10000", giving 100MB of data
with 100k rows with 10k elements each, I get for HEAD when the data is
in shared buffers:
=# copy tab to '/dev/null';
COPY 100000
Time: 48620.927 ms (00:48.621)

And with v3:
=# copy tab to '/dev/null';
COPY 100000
Time: 47993.183 ms (00:47.993)

Profiles don't fundamentally change much; array_out() goes from 30.76%
to 29.72% in self runtime, which looks like a limited impact to me.
With 1k rows and 1M elements, COPY TO gets reduced from 54338.436 ms
to 54129.978 ms, a 29.51% -> 29.12% change (looks like noise).

Perhaps I've hit some noise while running this set of tests, but the
impact of the proposed patch looks very limited to me. If you have a
better set of tests and/or ideas, feel free of course.
--
Michael
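The quoting loop that the test case above hammers can be pictured with a simplified standalone model (an editorial sketch of the behavior being benchmarked, not the actual array_out() code; quote_array_element is a made-up name, and the real rules also depend on the type's delimiter and NULL handling): an element containing special characters is wrapped in double quotes, with embedded quotes and backslashes escaped. Using '{' as every element forces this slow path for the entire array.

```c
#include <assert.h>
#include <string.h>

/*
 * Write elem into out in array-literal element form, quoting it if it
 * contains characters that are special in array syntax.  Returns the
 * number of bytes written (not counting the NUL terminator).
 */
static int
quote_array_element(const char *elem, char *out)
{
    /* An empty string, or any of these characters, forces quoting. */
    int         needs_quote = (*elem == '\0' ||
                               strpbrk(elem, "{},\"\\ ") != NULL);
    char       *p = out;

    if (!needs_quote)
    {
        strcpy(p, elem);
        return (int) strlen(out);
    }

    *p++ = '"';
    for (; *elem; elem++)
    {
        if (*elem == '"' || *elem == '\\')
            *p++ = '\\';        /* backslash-escape inside quotes */
        *p++ = *elem;
    }
    *p++ = '"';
    *p = '\0';
    return (int) (p - out);
}
```

With one-character elements like "{", the quoting overhead (two quote bytes per one payload byte) dominates, which is what makes the script above a worst case for this code path.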
On Mon, Feb 26, 2024 at 02:28, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Feb 22, 2024 at 04:42:37PM -0300, Ranier Vilela wrote:
>> Can you share the exact script used to create the table?
>
> Stressing the internals of array_out() for the area of the patch is
> not that difficult, as we want to quote each element that's returned
> in output. The trick is to have the following to stress the second
> quoting loop a maximum:
> - a high number of rows.
> - a high number of items in the arrays.
> - a *minimum* number of characters in each element of the array, with
>   characters that require quoting.
>
> The best test case I can think of to demonstrate the patch would be
> something like that (adjust rows and elts as you see fit):
>
> -- Number of rows
> \set rows 6
> -- Number of elements
> \set elts 4
> create table tab as
>   with data as (
>     select array_agg(a) as array
>       from (
>         select '{'::text
>           from generate_series(1, :elts) as int(a)) as index(a))
>   select data.array from data, generate_series(1, :rows);
>
> Then I get:
>        array
> -------------------
>  {"{","{","{","{"}
>  {"{","{","{","{"}
>  {"{","{","{","{"}
>  {"{","{","{","{"}
>  {"{","{","{","{"}
>  {"{","{","{","{"}
> (6 rows)
>
> With "\set rows 100000" and "\set elts 10000", giving 100MB of data
> with 100k rows with 10k elements each, I get for HEAD when the data
> is in shared buffers:
> =# copy tab to '/dev/null';
> COPY 100000
> Time: 48620.927 ms (00:48.621)
>
> And with v3:
> =# copy tab to '/dev/null';
> COPY 100000
> Time: 47993.183 ms (00:47.993)
Thanks, Michael, for the script.
It makes comparisons easier when everyone uses the exact same script.
best regards,
Ranier Vilela