Thread: Use COPY for populating all pgbench tables

Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
Hello,

We (Neon) have noticed that pgbench can be quite slow to populate data
over higher-latency connections, and higher scale factors exacerbate the
problem. Some employees work on a different continent from the databases
they might be benchmarking. By moving pgbench to use COPY for populating
all tables, we can reduce some of the time pgbench takes for this
particular step.
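
The gain is mostly about round trips: COPY FROM STDIN streams every row
over a single protocol exchange, while per-row INSERTs pay one network
round trip apiece. As a rough sketch of the libpq mechanics involved
(not the patch itself; the table, row shape, and nbranches count are
illustrative):

    /* A minimal sketch, assuming #include <libpq-fe.h>, <stdint.h>,
     * <stdio.h>, <stdlib.h>, an open PGconn *con, and a row count
     * nbranches; error reporting elided. */
    PGresult *res = PQexec(con, "COPY pgbench_branches FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN)
        exit(1);
    PQclear(res);

    for (int64_t i = 0; i < nbranches; i++)
    {
        char    row[64];
        /* bid, bbalance, filler; \N spells SQL NULL in COPY text format */
        int     len = snprintf(row, sizeof(row),
                               "%lld\t0\t\\N\n", (long long) (i + 1));
        PQputCopyData(con, row, len);   /* buffered; no per-row round trip */
    }
    PQputCopyEnd(con, NULL);            /* a single flush ends the COPY */
    PQclear(PQgetResult(con));          /* collect the final command status */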

I wanted to come with benchmarks, but unfortunately I won't have them
until next month. I can follow up in a future email.

--
Tristan Partin
Neon (https://neon.tech)

Attachment

Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
On Tue May 23, 2023 at 12:33 PM CDT, Tristan Partin wrote:
> I wanted to come with benchmarks, but unfortunately I won't have them
> until next month. I can follow up in a future email.

I finally got around to benchmarking.

master:

$ ./build/src/bin/pgbench/pgbench -i -s 500 CONNECTION_STRING
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
50000000 of 50000000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 s))
vacuuming...
creating primary keys...
done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s).

patchset:

$ ./build/src/bin/pgbench/pgbench -i -s 500 CONNECTION_STRING
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
50000000 of 50000000 tuples (100%) of pgbench_accounts done (elapsed 243.82 s, remaining 0.00 s))
vacuuming...
creating primary keys...
done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s).

I am located in Austin, Texas. The server was located in Ireland. Just
wanted to put some distance between the client and server. To summarize,
that is about an 80% reduction in client-side data generation times.

Note that I used Neon to collect these results.

--
Tristan Partin
Neon (https://neon.tech)



Re: Use COPY for populating all pgbench tables

From
David Rowley
Date:
On Thu, 8 Jun 2023 at 07:16, Tristan Partin <tristan@neon.tech> wrote:
>
> master:
>
> 50000000 of 50000000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 s))
> vacuuming...
> creating primary keys...
> done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s).
>
> patchset:
>
> 50000000 of 50000000 tuples (100%) of pgbench_accounts done (elapsed 243.82 s, remaining 0.00 s))
> vacuuming...
> creating primary keys...
> done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s).

I've also previously found pgbench -i to be slow.  It was a while ago,
and IIRC, it was due to the printfPQExpBuffer() being a bottleneck
inside pgbench.

On seeing your email, it makes me wonder if PG16's hex integer
literals might help here.  These should be much faster to generate in
pgbench and also parse on the postgres side.
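
For anyone unfamiliar with the feature: PG16 accepts non-decimal
integer literals and underscore digit separators, which should be
cheaper to format on the client (shift/mask instead of repeated
division) and to parse on the server. For example:

    SELECT 0xFF, 0o777, 0b1010;   -- 255, 511, 10
    SELECT 1_000_000;             -- underscores as digit separators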

I wrote a quick and dirty patch to try that and I'm not really getting
the same performance increases as I'd have expected. I also tested
with your patch too and it does not look that impressive either when
running pgbench on the same machine as postgres.

pgbench copy speedup

** master
drowley@amd3990x:~$ pgbench -i -s 1000 postgres
100000000 of 100000000 tuples (100%) done (elapsed 74.15 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 95.71 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 74.45 s, vacuum 0.12 s, primary keys 21.13 s).

** David's Patched
drowley@amd3990x:~$ pgbench -i -s 1000 postgres
100000000 of 100000000 tuples (100%) done (elapsed 69.64 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 90.22 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 69.91 s, vacuum 0.12 s, primary keys 20.18 s).

** Tristan's patch
drowley@amd3990x:~$ pgbench -i -s 1000 postgres
100000000 of 100000000 tuples (100%) of pgbench_accounts done (elapsed
77.44 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 98.64 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 77.47 s, vacuum 0.12 s, primary keys 21.04 s).

I'm interested to see what numbers you get.  You'd need to test on
PG16 however. I left the old code in place to generate the decimal
numbers for versions < 16.

David

Attachment

Re: Use COPY for populating all pgbench tables

From
Hannu Krosing
Date:
I guess that COPY will still be slower than generating the data
server-side (--init-steps=...G...)?

What I'd really like to see is providing all the pgbench functions
also on the server, specifically the various random(...) functions -
random_exponential(...), random_gaussian(...), random_zipfian(...) - so
that custom data generation could also be done easily server-side with
matching distributions.
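
For context, the server-side G step generates each table with a single
INSERT ... SELECT over generate_series(), so only the statement itself
crosses the network. A sketch of the shape pgbench issues (the literals
are interpolated from the scale factor, here 500):

    insert into pgbench_accounts (aid, bid, abalance, filler)
      select aid, (aid - 1) / 100000 + 1, 0, ''
      from generate_series(1, 100000 * 500) as aid;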

On Thu, Jun 8, 2023 at 7:34 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 8 Jun 2023 at 07:16, Tristan Partin <tristan@neon.tech> wrote:
> >
> > master:
> >
> > 50000000 of 50000000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 s))
> > vacuuming...
> > creating primary keys...
> > done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s).
> >
> > patchset:
> >
> > 50000000 of 50000000 tuples (100%) of pgbench_accounts done (elapsed 243.82 s, remaining 0.00 s))
> > vacuuming...
> > creating primary keys...
> > done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s).
>
> I've also previously found pgbench -i to be slow.  It was a while ago,
> and IIRC, it was due to the printfPQExpBuffer() being a bottleneck
> inside pgbench.
>
> On seeing your email, it makes me wonder if PG16's hex integer
> literals might help here.  These should be much faster to generate in
> pgbench and also parse on the postgres side.
>
> I wrote a quick and dirty patch to try that and I'm not really getting
> the same performance increases as I'd have expected. I also tested
> with your patch too and it does not look that impressive either when
> running pgbench on the same machine as postgres.
>
> pgbench copy speedup
>
> ** master
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 100000000 of 100000000 tuples (100%) done (elapsed 74.15 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 95.71 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 74.45 s, vacuum 0.12 s, primary keys 21.13 s).
>
> ** David's Patched
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 100000000 of 100000000 tuples (100%) done (elapsed 69.64 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 90.22 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 69.91 s, vacuum 0.12 s, primary keys 20.18 s).
>
> ** Tristan's patch
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 100000000 of 100000000 tuples (100%) of pgbench_accounts done (elapsed
> 77.44 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 98.64 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 77.47 s, vacuum 0.12 s, primary keys 21.04 s).
>
> I'm interested to see what numbers you get.  You'd need to test on
> PG16 however. I left the old code in place to generate the decimal
> numbers for versions < 16.
>
> David



Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
On Thu Jun 8, 2023 at 12:33 AM CDT, David Rowley wrote:
> On Thu, 8 Jun 2023 at 07:16, Tristan Partin <tristan@neon.tech> wrote:
> >
> > master:
> >
> > 50000000 of 50000000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 s))
> > vacuuming...
> > creating primary keys...
> > done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s).
> >
> > patchset:
> >
> > 50000000 of 50000000 tuples (100%) of pgbench_accounts done (elapsed 243.82 s, remaining 0.00 s))
> > vacuuming...
> > creating primary keys...
> > done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s).
>
> I've also previously found pgbench -i to be slow.  It was a while ago,
> and IIRC, it was due to the printfPQExpBuffer() being a bottleneck
> inside pgbench.
>
> On seeing your email, it makes me wonder if PG16's hex integer
> literals might help here.  These should be much faster to generate in
> pgbench and also parse on the postgres side.

Do you have a link to some docs? I have not heard of the feature.
Definitely feels like a worthy cause.

> I wrote a quick and dirty patch to try that and I'm not really getting
> the same performance increases as I'd have expected. I also tested
> with your patch too and it does not look that impressive either when
> running pgbench on the same machine as postgres.

I didn't expect my patch to increase performance in all workloads. I was
mainly aiming to fix high-latency connections. Based on your results
that looks like a 4% reduction in performance of client-side data
generation. I had thought maybe it is worth having a flag to keep the
old way too, but I am not sure a 4% hit is really that big of a deal.

> pgbench copy speedup
>
> ** master
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 100000000 of 100000000 tuples (100%) done (elapsed 74.15 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 95.71 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 74.45 s, vacuum 0.12 s, primary keys 21.13 s).
>
> ** David's Patched
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 100000000 of 100000000 tuples (100%) done (elapsed 69.64 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 90.22 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 69.91 s, vacuum 0.12 s, primary keys 20.18 s).
>
> ** Tristan's patch
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 100000000 of 100000000 tuples (100%) of pgbench_accounts done (elapsed
> 77.44 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 98.64 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 77.47 s, vacuum 0.12 s, primary keys 21.04 s).
>
> I'm interested to see what numbers you get.  You'd need to test on
> PG16 however. I left the old code in place to generate the decimal
> numbers for versions < 16.

I will try to test this soon and follow up on the thread. I definitely
see no problems with your patch as is though. I would be more than happy
to rebase my patches on yours.

--
Tristan Partin
Neon (https://neon.tech)



Re: Use COPY for populating all pgbench tables

From
Gregory Smith
Date:
On Tue, May 23, 2023 at 1:33 PM Tristan Partin <tristan@neon.tech> wrote:
> We (Neon) have noticed that pgbench can be quite slow to populate data
> over higher-latency connections, and higher scale factors exacerbate the
> problem. Some employees work on a different continent from the databases
> they might be benchmarking. By moving pgbench to use COPY for populating
> all tables, we can reduce some of the time pgbench takes for this
> particular step.

When latency is continent-sized, pgbench should be run with server-side table generation instead of using COPY at all, for any table.  The default COPY-based pgbench generation is only intended for testing where the client and server are very close on the network.

Unfortunately there's no simple command line option to change just that one thing about how pgbench runs.  You have to construct a command line that documents each and every step you want instead.  You probably just want this form:

    $ pgbench -i -I dtGvp -s 500

That's server-side table generation with all the usual steps.  I use this instead of COPY in pgbench-tools so much now, basically whenever I'm talking to a cloud system, that I have a simple 0/1 config option to switch between the modes, and this long weird one is the default now.

Try that out, and once you see the numbers my bet is you'll see extending which tables get COPY isn't needed by your use case anymore.  Basically, if you are close enough to use COPY instead of server-side generation, you are close enough that every table besides accounts will not add up to enough time to worry about optimizing the little ones.
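
For readers following along, the letters accepted by -I are documented
in the pgbench manual; in brief:

    d  drop any existing pgbench tables
    t  create the tables
    g  generate data client-side (the COPY/INSERT path discussed here)
    G  generate data server-side (INSERT ... SELECT on the server)
    v  run VACUUM on the tables
    p  create primary-key indexes
    f  create foreign-key constraints (not part of the default "dtgvp")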

--
Greg Smith  greg.smith@crunchydata.com
Director of Open Source Strategy
 

Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
On Fri Jun 9, 2023 at 8:24 AM CDT, Gregory Smith wrote:
> On Tue, May 23, 2023 at 1:33 PM Tristan Partin <tristan@neon.tech> wrote:
>
> > We (Neon) have noticed that pgbench can be quite slow to populate data
> > over higher-latency connections, and higher scale factors exacerbate the
> > problem. Some employees work on a different continent from the databases
> > they might be benchmarking. By moving pgbench to use COPY for populating
> > all tables, we can reduce some of the time pgbench takes for this
> > particular step.
> >
>
> When latency is continent-sized, pgbench should be run with server-side
> table generation instead of using COPY at all, for any table.  The default
> COPY-based pgbench generation is only intended for testing where the client
> and server are very close on the network.
>
> Unfortunately there's no simple command line option to change just that one
> thing about how pgbench runs.  You have to construct a command line that
> documents each and every step you want instead.  You probably just want
> this form:
>
>     $ pgbench -i -I dtGvp -s 500
>
> That's server-side table generation with all the usual steps.  I use this
> instead of COPY in pgbench-tools so much now, basically whenever I'm
> talking to a cloud system, that I have a simple 0/1 config option to switch
> between the modes, and this long weird one is the default now.
>
> Try that out, and once you see the numbers my bet is you'll see extending
> which tables get COPY isn't needed by your use case anymore.  Basically, if
> you are close enough to use COPY instead of server-side generation, you are
> close enough that every table besides accounts will not add up to enough
> time to worry about optimizing the little ones.

Thanks for your input Greg. I'm sure you're correct that server-side data
generation would probably fix the problem. I guess I am trying to
understand if there are any downsides to just committing this anyway. I
haven't done any more testing, but David's email only showed a 4%
performance drop in the workload he tested. Combine this with his hex
patch and we would see an overall performance improvement when
everything is said and done.

It seems like this patch would still be good for client-side high scale
factor data generation (when server and client are close), but I would
need to do testing to confirm.

--
Tristan Partin
Neon (https://neon.tech)



Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
David,

I think you should submit this patch standalone. I don't see any reason
this shouldn't be reviewed and committed when fully fleshed out.

--
Tristan Partin
Neon (https://neon.tech)



Re: Use COPY for populating all pgbench tables

From
Gurjeet Singh
Date:
On Fri, Jun 9, 2023 at 6:24 AM Gregory Smith <gregsmithpgsql@gmail.com> wrote:
>
> Unfortunately there's no simple command line option to change just that one thing about how pgbench runs.  You have to construct a command line that documents each and every step you want instead.  You probably just want this form:
>
>     $ pgbench -i -I dtGvp -s 500

The steps are severely under-documented in pgbench --help output.
Grepping that output I could not find any explanation of these steps,
so I dug through the code and found them in runInitSteps(). Just as I
was thinking of writing a patch to remedy that, just to be sure, I
checked online docs and sure enough, they are well-documented under
pgbench [1].

I think at least a pointer to the pgbench docs should be mentioned
in the pgbench --help output; an average user may not rush to read the
code to find the explanation, but a hint to where to find more details
about what the letters in --init-steps mean would save them a lot of
time.

[1]: https://www.postgresql.org/docs/15/pgbench.html

Best regards,
Gurjeet
http://Gurje.et



Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
On Fri Jun 9, 2023 at 12:25 PM CDT, Gurjeet Singh wrote:
> On Fri, Jun 9, 2023 at 6:24 AM Gregory Smith <gregsmithpgsql@gmail.com> wrote:
> >
> > Unfortunately there's no simple command line option to change just that one thing about how pgbench runs.  You have to construct a command line that documents each and every step you want instead.  You probably just want this form:
> >
> >     $ pgbench -i -I dtGvp -s 500
>
> The steps are severely under-documented in pgbench --help output.
> Grepping that output I could not find any explanation of these steps,
> so I dug through the code and found them in runInitSteps(). Just as I
> was thinking of writing a patch to remedy that, just to be sure, I
> checked online docs and sure enough, they are well-documented under
> pgbench [1].
>
> I think at least a pointer to the pgbench docs should be mentioned
> in the pgbench --help output; an average user may not rush to read the
> code to find the explanation, but a hint to where to find more details
> about what the letters in --init-steps mean would save them a lot of
> time.
>
> [1]: https://www.postgresql.org/docs/15/pgbench.html
>
> Best regards,
> Gurjeet
> http://Gurje.et

I think this would be nice to have if you wanted to submit a patch.

--
Tristan Partin
Neon (https://neon.tech)



Re: Use COPY for populating all pgbench tables

From
Gregory Smith
Date:
On Fri, Jun 9, 2023 at 1:25 PM Gurjeet Singh <gurjeet@singh.im> wrote:
>     $ pgbench -i -I dtGvp -s 500

> The steps are severely under-documented in pgbench --help output.

I agree it's not easy to find information.  I just went through double checking I had the order recently enough to remember what I did.  The man pages have this:

> Each step is invoked in the specified order. The default is dtgvp.

Which was what I wanted to read.  Meanwhile the --help output says:

>  -I, --init-steps=[dtgGvpf]+ (default "dtgvp")

Which has the information without a lot of context for what it's used for.  I'd welcome some text expansion that adds a minimal but functional improvement to the existing help output; I don't have such text.

Re: Use COPY for populating all pgbench tables

From
Gurjeet Singh
Date:
On Fri, Jun 9, 2023 at 5:42 PM Gregory Smith <gregsmithpgsql@gmail.com> wrote:
>
> On Fri, Jun 9, 2023 at 1:25 PM Gurjeet Singh <gurjeet@singh.im> wrote:
>>
>> >     $ pgbench -i -I dtGvp -s 500
>>
>> The steps are severely under-documented in pgbench --help output.
>
>
> I agree it's not easy to find information.  I just went through double checking I had the order recently enough to remember what I did.  The man pages have this:
>
> > Each step is invoked in the specified order. The default is dtgvp.
>
> Which was what I wanted to read.  Meanwhile the --help output says:
>
> >  -I, --init-steps=[dtgGvpf]+ (default "dtgvp")
>
> Which has the information without a lot of context for what it's used for.  I'd welcome some text expansion that adds a minimal but functional improvement to the existing help output; I don't have such text.

Please see the 2 attached variants of the patch. Variant 1 simply tells
the reader to consult the pgbench documentation. The second variant
provides a description for each of the letters, as the documentation
does. The committer can pick the one they find suitable.

Best regards,
Gurjeet
http://Gurje.et

Attachment

Re: Use COPY for populating all pgbench tables

From
Gurjeet Singh
Date:
On Fri, Jun 9, 2023 at 6:20 PM Gurjeet Singh <gurjeet@singh.im> wrote:
> On Fri, Jun 9, 2023 at 5:42 PM Gregory Smith <gregsmithpgsql@gmail.com> wrote:
> > On Fri, Jun 9, 2023 at 1:25 PM Gurjeet Singh <gurjeet@singh.im> wrote:
> >>
> >> >     $ pgbench -i -I dtGvp -s 500
> >>
> >> The steps are severely under-documented in pgbench --help output.
> >
> > I agree it's not easy to find information.  I just went through double checking I had the order recently enough to remember what I did.  The man pages have this:
> >
> > > Each step is invoked in the specified order. The default is dtgvp.
> >
> > Which was what I wanted to read.  Meanwhile the --help output says:
> >
> > >  -I, --init-steps=[dtgGvpf]+ (default "dtgvp")
> >
> > Which has the information without a lot of context for what it's used for.  I'd welcome some text expansion that adds a minimal but functional improvement to the existing help output; I don't have such text.
>
> Please see the 2 attached variants of the patch. Variant 1 simply tells
> the reader to consult the pgbench documentation. The second variant
> provides a description for each of the letters, as the documentation
> does. The committer can pick the one they find suitable.

(I was short on time, so had to keep it short in the above email. To
justify this additional email, I have added 2 more options to choose
from. :)

The text ", in the specified order" is an important detail, that
should be included irrespective of the rest of the patch.

My preference would be to use the first variant, since the second one
feels too wordy for --help output. I believe we'll have to choose
between these two; the alternatives will not make anyone happy.

These two variants show the two extremes; bare minimum vs. everything
but the kitchen sink. So one may feel the need to find a middle ground
and provide a "sufficient, but not too much" variant. I have attempted
that in variants 3 and 4; see attached.

The third variant does away with the list of steps, and uses a
paragraph to describe the letters. And the fourth variant makes that
paragraph terse.

In the order of preference I'd choose variant 1, then 2. Variants 3
and 4 feel like a significant degradation from variant 2.

Attached samples.txt shows the snippets of --help output of each of
the variants/patches, for comparison.
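
(For readers without the attachment, a variant-1-style snippet would
look something like the following; the exact wording here is
hypothetical, not the patch text:)

      -I, --init-steps=[dtgGvpf]+ (default "dtgvp")
                               run selected initialization steps, in the
                               specified order (see the pgbench
                               documentation for each step's meaning)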

Best regards,
Gurjeet
http://Gurje.et

Attachment

Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
I think I am partial to number 2. Removing a context switch always leads
to more productivity.

--
Tristan Partin
Neon (https://neon.tech)



Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
On Thu Jun 8, 2023 at 11:38 AM CDT, Tristan Partin wrote:
> On Thu Jun 8, 2023 at 12:33 AM CDT, David Rowley wrote:
> > On Thu, 8 Jun 2023 at 07:16, Tristan Partin <tristan@neon.tech> wrote:
> > >
> > > master:
> > >
> > > 50000000 of 50000000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 s))
> > > vacuuming...
> > > creating primary keys...
> > > done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s).
> > >
> > > patchset:
> > >
> > > 50000000 of 50000000 tuples (100%) of pgbench_accounts done (elapsed 243.82 s, remaining 0.00 s))
> > > vacuuming...
> > > creating primary keys...
> > > done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s).
> >
> > I've also previously found pgbench -i to be slow.  It was a while ago,
> > and IIRC, it was due to the printfPQExpBuffer() being a bottleneck
> > inside pgbench.
> >
> > On seeing your email, it makes me wonder if PG16's hex integer
> > literals might help here.  These should be much faster to generate in
> > pgbench and also parse on the postgres side.
> >
> > I wrote a quick and dirty patch to try that and I'm not really getting
> > the same performance increases as I'd have expected. I also tested
> > with your patch too and it does not look that impressive either when
> > running pgbench on the same machine as postgres.
>
> I didn't expect my patch to increase performance in all workloads. I was
> mainly aiming to fix high-latency connections. Based on your results
> that looks like a 4% reduction in performance of client-side data
> generation. I had thought maybe it is worth having a flag to keep the
> old way too, but I am not sure a 4% hit is really that big of a deal.
>
> > pgbench copy speedup
> >
> > ** master
> > drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> > 100000000 of 100000000 tuples (100%) done (elapsed 74.15 s, remaining 0.00 s)
> > vacuuming...
> > creating primary keys...
> > done in 95.71 s (drop tables 0.00 s, create tables 0.01 s, client-side
> > generate 74.45 s, vacuum 0.12 s, primary keys 21.13 s).
> >
> > ** David's Patched
> > drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> > 100000000 of 100000000 tuples (100%) done (elapsed 69.64 s, remaining 0.00 s)
> > vacuuming...
> > creating primary keys...
> > done in 90.22 s (drop tables 0.00 s, create tables 0.01 s, client-side
> > generate 69.91 s, vacuum 0.12 s, primary keys 20.18 s).
> >
> > ** Tristan's patch
> > drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> > 100000000 of 100000000 tuples (100%) of pgbench_accounts done (elapsed
> > 77.44 s, remaining 0.00 s)
> > vacuuming...
> > creating primary keys...
> > done in 98.64 s (drop tables 0.00 s, create tables 0.01 s, client-side
> > generate 77.47 s, vacuum 0.12 s, primary keys 21.04 s).
> >
> > I'm interested to see what numbers you get.  You'd need to test on
> > PG16 however. I left the old code in place to generate the decimal
> > numbers for versions < 16.
>
> I will try to test this soon and follow up on the thread. I definitely
> see no problems with your patch as is though. I would be more than happy
> to rebase my patches on yours.

Finally got around to doing more benchmarking. Using an EC2 instance
hosted in Ireland, and my client laptop in Austin, Texas.

Workload: pgbench -i -s 500

master (9aee26a491)
done in 1369.41 s (drop tables 0.21 s, create tables 0.72 s, client-side generate 1336.44 s, vacuum 1.02 s, primary keys 31.03 s).
done in 1318.31 s (drop tables 0.21 s, create tables 0.72 s, client-side generate 1282.67 s, vacuum 1.02 s, primary keys 33.69 s).

copy
done in 307.42 s (drop tables 0.21 s, create tables 0.82 s, client-side generate 270.95 s, vacuum 1.02 s, primary keys 34.42 s).

david
done in 1311.14 s (drop tables 0.72 s, create tables 0.72 s, client-side generate 1274.98 s, vacuum 0.94 s, primary keys 33.79 s).
done in 1340.18 s (drop tables 0.14 s, create tables 0.59 s, client-side generate 1304.78 s, vacuum 0.92 s, primary keys 33.75 s).

copy + david
done in 348.70 s (drop tables 0.23 s, create tables 0.72 s, client-side generate 312.94 s, vacuum 0.92 s, primary keys 33.90 s).

I ran two tests for master and for your patch, David. For the last test,
I adapted your patch onto mine. I am still seeing the huge performance
gains on my branch.

--
Tristan Partin
Neon (https://neon.tech)



Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
Here is a v2. It cleans up the output when printing to a tty. The
last "x of y tuples" line gets overwritten now, so the final output
looks like:

dropping old tables...
creating tables...
generating data (client-side)...
vacuuming...
creating primary keys...
done in 0.14 s (drop tables 0.01 s, create tables 0.01 s, client-side generate 0.05 s, vacuum 0.03 s, primary keys 0.03 s).
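
The overwrite is the usual carriage-return trick; a minimal sketch of
the idea (not the patch's exact code; names are illustrative):

    /* A sketch, assuming <stdio.h>, <stdbool.h>, <unistd.h>. */
    static void
    report_progress(long long done, long long total)
    {
        bool    tty = isatty(fileno(stderr));

        /* '\r' rewinds so the next report overwrites this one on a tty;
         * non-tty output (logs, pipes) keeps one line per report */
        fprintf(stderr, "%lld of %lld tuples (%d%%) done%c",
                done, total, (int) (done * 100 / total),
                tty ? '\r' : '\n');
    }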

--
Tristan Partin
Neon (https://neon.tech)



Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
Again, I forgot to actually attach. Holy guacamole.

--
Tristan Partin
Neon (https://neon.tech)

Attachment

Re: Use COPY for populating all pgbench tables

From
Michael Paquier
Date:
On Wed, Jun 14, 2023 at 10:58:06AM -0500, Tristan Partin wrote:
> Again, I forget to actually attach. Holy guacamole.

Looks rather OK seen from here, applied 0001 while browsing the
series.  I have a few comments about 0002.

 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+       /* "filler" column defaults to NULL */
+       printfPQExpBuffer(sql,
+                                         INT64_FORMAT "\t0\t\n",
+                                         curr + 1);
+}
+
+static void
+initTeller(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+       /* "filler" column defaults to NULL */
+       printfPQExpBuffer(sql,
+                                         INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+                                         curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+       /* "filler" column defaults to blank padded empty string */
+       printfPQExpBuffer(sql,
+                                         INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+                                         curr + 1, curr / naccounts + 1);
+}

I was looking at that, and while this strikes me as a duplication for
the second and third ones, I'm OK with the use of a callback to fill
in the data, even if naccounts and ntellers are equivalent to the
"base" number given to initPopulateTable(), because the "filler"
column has different expectations for each table.  These routines
don't need a PGconn argument, by the way.

    /* use COPY with FREEZE on v14 and later without partitioning */
    if (partitions == 0 && PQserverVersion(con) >= 140000)
-       copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
+       copy_statement_fmt = "copy %s from stdin with (freeze on)";
    else
-       copy_statement = "copy pgbench_accounts from stdin";
+       copy_statement_fmt = "copy %s from stdin";

This seems a bit incorrect because partitioning only applies to
pgbench_accounts, no?  This change means that the teller and branch
tables would not benefit from FREEZE under a pgbench compiled with
this patch and a backend version of 14 or newer if --partitions is
used.  So, perhaps "partitions" should be an argument of
initPopulateTable() specified for each table?
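
For reference, FREEZE is only accepted when the table was created or
truncated in the same transaction that runs the COPY, which pgbench's
client-side path arranges by truncating the tables inside its loading
transaction; the benefit is that rows are written pre-frozen, avoiding
later freezing work by VACUUM. Schematically:

    BEGIN;
    TRUNCATE pgbench_tellers;
    COPY pgbench_tellers FROM stdin WITH (FREEZE ON);
    COMMIT;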
--
Michael

Attachment

Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
On Tue Jul 11, 2023 at 12:03 AM CDT, Michael Paquier wrote:
> On Wed, Jun 14, 2023 at 10:58:06AM -0500, Tristan Partin wrote:
>  static void
> -initGenerateDataClientSide(PGconn *con)
> +initBranch(PGconn *con, PQExpBufferData *sql, int64 curr)
> +{
> +       /* "filler" column defaults to NULL */
> +       printfPQExpBuffer(sql,
> +                                         INT64_FORMAT "\t0\t\n",
> +                                         curr + 1);
> +}
> +
> +static void
> +initTeller(PGconn *con, PQExpBufferData *sql, int64 curr)
> +{
> +       /* "filler" column defaults to NULL */
> +       printfPQExpBuffer(sql,
> +                                         INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
> +                                         curr + 1, curr / ntellers + 1);
> +}
> +
> +static void
> +initAccount(PGconn *con, PQExpBufferData *sql, int64 curr)
> +{
> +       /* "filler" column defaults to blank padded empty string */
> +       printfPQExpBuffer(sql,
> +                                         INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
> +                                         curr + 1, curr / naccounts + 1);
> +}
>
> These routines don't need a PGconn argument, by the way.

Nice catch!

>     /* use COPY with FREEZE on v14 and later without partitioning */
>     if (partitions == 0 && PQserverVersion(con) >= 140000)
> -       copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
> +       copy_statement_fmt = "copy %s from stdin with (freeze on)";
>     else
> -       copy_statement = "copy pgbench_accounts from stdin";
> +       copy_statement_fmt = "copy %s from stdin";
>
> This seems a bit incorrect because partitioning only applies to
> pgbench_accounts, no?  This change means that the teller and branch
> tables would not benefit from FREEZE under a pgbench compiled with
> this patch and a backend version of 14 or newer if --partitions is
> used.  So, perhaps "partitions" should be an argument of
> initPopulateTable() specified for each table?

Whoops, looks like I didn't read the comment for what the partitions
variable means. It only applies to pgbench_accounts as you said. I don't
think passing it in as an argument would make much sense. Let me know
what you think about the change in this new version, which only hits the
first portion of the `if` statement if the table is pgbench_accounts.

--
Tristan Partin
Neon (https://neon.tech)

Attachment

Re: Use COPY for populating all pgbench tables

From
Michael Paquier
Date:
On Tue, Jul 11, 2023 at 09:46:43AM -0500, Tristan Partin wrote:
> On Tue Jul 11, 2023 at 12:03 AM CDT, Michael Paquier wrote:
>> This seems a bit incorrect because partitioning only applies to
>> pgbench_accounts, no?  This change means that the teller and branch
>> tables would not benefit from FREEZE under a pgbench compiled with
>> this patch and a backend version of 14 or newer if --partitions is
>> used.  So, perhaps "partitions" should be an argument of
>> initPopulateTable() specified for each table?
>
> Whoops, looks like I didn't read the comment for what the partitions
> variable means. It only applies to pgbench_accounts as you said. I don't
> think passing it in as an argument would make much sense. Let me know
> what you think about the change in this new version, which only hits the
> first portion of the `if` statement if the table is pgbench_accounts.

-   /* use COPY with FREEZE on v14 and later without partitioning */
-   if (partitions == 0 && PQserverVersion(con) >= 140000)
-       copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
+   if (partitions == 0 && strcmp(table, "pgbench_accounts") == 0 && PQserverVersion(con) >= 140000)
+       copy_statement_fmt = "copy %s from stdin with (freeze on)";

This would use the freeze option only on pgbench_accounts when no
partitioning is defined, but my point was a bit different.  We could
use the FREEZE option on the teller and branch tables as well, no?
Okay, the impact is limited compared to accounts in terms of amount of
data loaded, but perhaps some people like playing with large scaling
factors where this could show a benefit in the initial data loading.

In passing, I have noticed the following sentence in pgbench.sgml:
       Using <literal>g</literal> causes logging to print one message
       every 100,000 rows while generating data for the
       <structname>pgbench_accounts</structname> table.
It would become incorrect as the same code path would be used for the
teller and branch tables.
--
Michael

Attachment

Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
On Wed Jul 12, 2023 at 1:06 AM CDT, Michael Paquier wrote:
> On Tue, Jul 11, 2023 at 09:46:43AM -0500, Tristan Partin wrote:
> > On Tue Jul 11, 2023 at 12:03 AM CDT, Michael Paquier wrote:
> >> This seems a bit incorrect because partitioning only applies to
> >> pgbench_accounts, no?  This change means that the teller and branch
> >> tables would not benefit from FREEZE under a pgbench compiled with
> >> this patch and a backend version of 14 or newer if --partitions is
> >> used.  So, perhaps "partitions" should be an argument of
> >> initPopulateTable() specified for each table?
> >
> > Whoops, looks like I didn't read the comment for what the partitions
> > variable means. It only applies to pgbench_accounts as you said. I don't
> > think passing it in as an argument would make much sense. Let me know
> > what you think about the change in this new version, which only hits the
> > first portion of the `if` statement if the table is pgbench_accounts.
>
> -   /* use COPY with FREEZE on v14 and later without partitioning */
> -   if (partitions == 0 && PQserverVersion(con) >= 140000)
> -       copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
> +   if (partitions == 0 && strcmp(table, "pgbench_accounts") == 0 && PQserverVersion(con) >= 140000)
> +       copy_statement_fmt = "copy %s from stdin with (freeze on)";
>
> This would use the freeze option only on pgbench_accounts when no
> partitioning is defined, but my point was a bit different.  We could
> use the FREEZE option on the teller and branch tables as well, no?
> Okay, the impact is limited compared to accounts in terms of amount of
> data loaded, but perhaps some people like playing with large scaling
> factors where this could show a benefit in the initial data loading.

Perhaps they should all be keyed off the same option? It seemed like in
your previous comment you wanted multiple options. Sorry for not reading
your comment correctly.

> In passing, I have noticed the following sentence in pgbench.sgml:
>        Using <literal>g</literal> causes logging to print one message
>        every 100,000 rows while generating data for the
>        <structname>pgbench_accounts</structname> table.
> It would become incorrect as the same code path would be used for the
> teller and branch tables.

I will amend the documentation to mention all tables rather than being
specific to pgbench_accounts in the next patch revision, pending what
you want to do about the partition CLI argument.

--
Tristan Partin
Neon (https://neon.tech)



Re: Use COPY for populating all pgbench tables

From
Michael Paquier
Date:
On Wed, Jul 12, 2023 at 09:29:35AM -0500, Tristan Partin wrote:
> On Wed Jul 12, 2023 at 1:06 AM CDT, Michael Paquier wrote:
>> This would use the freeze option only on pgbench_accounts when no
>> partitioning is defined, but my point was a bit different.  We could
>> use the FREEZE option on the teller and branch tables as well, no?
>> Okay, the impact is limited compared to accounts in terms of amount of
>> data loaded, but perhaps some people like playing with large scaling
>> factors where this could show a benefit in the initial data loading.
>
> Perhaps, should they all be keyed off the same option? Seemed like in
> your previous comment you wanted multiple options. Sorry for not reading
> your comment correctly.

I would have thought that --partitions should only apply to the
pgbench_accounts table, while FREEZE should apply where it is possible
to use it, aka all the COPY queries except when pgbench_accounts is a
partition.  Would you do something different, like not applying FREEZE
to pgbench_tellers and pgbench_branches as these have far fewer tuples
than pgbench_accounts?
--
Michael

Attachment

Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
On Wed Jul 12, 2023 at 10:52 PM CDT, Michael Paquier wrote:
> On Wed, Jul 12, 2023 at 09:29:35AM -0500, Tristan Partin wrote:
> > On Wed Jul 12, 2023 at 1:06 AM CDT, Michael Paquier wrote:
> >> This would use the freeze option only on pgbench_accounts when no
> >> partitioning is defined, but my point was a bit different.  We could
> >> use the FREEZE option on the teller and branch tables as well, no?
> >> Okay, the impact is limited compared to accounts in terms of amount of
> >> data loaded, but perhaps some people like playing with large scaling
> >> factors where this could show a benefit in the initial data loading.
> >
> > Perhaps, should they all be keyed off the same option? Seemed like in
> > your previous comment you wanted multiple options. Sorry for not reading
> > your comment correctly.
>
> I would have thought that --partitions should only apply to the
> pgbench_accounts table, while FREEZE should apply where it is possible
> to use it, aka all the COPY queries except when pgbench_accounts is a
> partition.  Would you do something different, like not applying FREEZE
> to pgbench_tellers and pgbench_branches as these have far fewer tuples
> than pgbench_accounts?

Michael,

I think I completely misunderstood you. From what I can tell, you want
to use FREEZE wherever possible. I think the new patch covers your
comment and fixes the documentation. I am hoping that I have finally
gotten what you are looking for.

--
Tristan Partin
Neon (https://neon.tech)

Attachment

Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
Didn't actually include the changes in the previous patch.

--
Tristan Partin
Neon (https://neon.tech)

Attachment

Re: Use COPY for populating all pgbench tables

From
Michael Paquier
Date:
On Wed, Jul 19, 2023 at 01:03:21PM -0500, Tristan Partin wrote:
> Didn't actually include the changes in the previous patch.

-initGenerateDataClientSide(PGconn *con)
+initBranch(PQExpBufferData *sql, int64 curr)
 {
-   PQExpBufferData sql;
+   /* "filler" column defaults to NULL */
+   printfPQExpBuffer(sql,
+                     INT64_FORMAT "\t0\t\n",
+                     curr + 1);
+}
+
+static void
+initTeller(PQExpBufferData *sql, int64 curr)
+{
+   /* "filler" column defaults to NULL */
+   printfPQExpBuffer(sql,
+                     INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+                     curr + 1, curr / ntellers + 1);

Hmm.  Something's not right here, see:
=# select count(*) has_nulls from pgbench_accounts where filler is null;
 has_nulls
-----------
         0
(1 row)
=# select count(*) > 0 has_nulls from pgbench_tellers where filler is null;
 has_nulls
-----------
 f
(1 row)
=# select count(*) > 0 has_nulls from pgbench_branches where filler is null;
 has_nulls
-----------
 f
(1 row)

Note as well this comment in initCreateTables():
    /*
     * Note: TPC-B requires at least 100 bytes per row, and the "filler"
     * fields in these table declarations were intended to comply with that.
     * The pgbench_accounts table complies with that because the "filler"
     * column is set to blank-padded empty string. But for all other tables
     * the columns default to NULL and so don't actually take any space.  We
     * could fix that by giving them non-null default values.  However, that
     * would completely break comparability of pgbench results with prior
     * versions. Since pgbench has never pretended to be fully TPC-B compliant
     * anyway, we stick with the historical behavior.
     */

So this patch causes pgbench to not stick with its historical
behavior, and the change is incompatible with the comments because the
tellers and branches tables don't use NULL for their filler attribute
anymore.
--
Michael

Attachment

Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
On Wed Jul 19, 2023 at 10:07 PM CDT, Michael Paquier wrote:
> So this patch causes pgbench to not stick with its historical
> behavior, and the change is incompatible with the comments because the
> tellers and branches tables don't use NULL for their filler attribute
> anymore.

Great find. This was a problem of me just not understanding the COPY
command properly. Relevant documentation snippet:

> Specifies the string that represents a null value. The default is \N
> (backslash-N) in text format, and an unquoted empty string in CSV
> format. You might prefer an empty string even in text format for cases
> where you don't want to distinguish nulls from empty strings. This
> option is not allowed when using binary format.

This new revision populates the column with the NULL value.
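
Concretely, in COPY text format an empty field is an empty string, so a
NULL filler has to be spelled \N on the data line. A sketch of the
distinction, in terms of the initBranch() helper quoted earlier (the
second line shows the shape of the fix):

    printfPQExpBuffer(sql, INT64_FORMAT "\t0\t\n", curr + 1);    /* filler = '' */
    printfPQExpBuffer(sql, INT64_FORMAT "\t0\t\\N\n", curr + 1); /* filler = NULL */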

> psql (17devel)
> Type "help" for help.
>
> tristan957=# select count(1) from pgbench_branches;
>  count
> -------
>      1
> (1 row)
>
> tristan957=# select count(1) from pgbench_branches where filler is null;
>  count
> -------
>      1
> (1 row)

Thanks for your testing Michael. I went ahead and added a test to make
sure that this behavior doesn't regress accidentally, but I am
struggling to get the test to fail using the previous version of this
patch. Do you have any advice? This is my first time writing a test for
Postgres. I can recreate the issue outside of the test script, but not
within it for whatever reason.

--
Tristan Partin
Neon (https://neon.tech)

Attachment

Re: Use COPY for populating all pgbench tables

From
Michael Paquier
Date:
On Thu, Jul 20, 2023 at 02:22:51PM -0500, Tristan Partin wrote:
> Thanks for your testing Michael. I went ahead and added a test to make sure
> that this behavior doesn't regress accidentally, but I am struggling to get
> the test to fail using the previous version of this patch. Do you have any
> advice? This is my first time writing a test for Postgres. I can recreate
> the issue outside of the test script, but not within it for whatever reason.

We're all here to learn, and writing TAP tests is important these days
for a lot of patches.

+# Check that the pgbench_branches and pgbench_tellers filler columns are filled
+# with NULLs
+foreach my $table ('pgbench_branches', 'pgbench_tellers') {
+   my ($ret, $out, $err) = $node->psql(
+       'postgres',
+       "SELECT COUNT(1) FROM $table;
+        SELECT COUNT(1) FROM $table WHERE filler is NULL;",
+       extra_params => ['--no-align', '--tuples-only']);
+
+   is($ret, 0, "psql $table counts status is 0");
+   is($err, '', "psql $table counts stderr is empty");
+   if ($out =~ m/^(\d+)\n(\d+)$/g) {
+       is($1, $2, "psql $table filler column filled with NULLs");
+   } else {
+       fail("psql $table stdout m/^(\\d+)\n(\\d+)$/g");
+   }
+}

This is IMO hard to parse, and I'd rather add some checks for the
accounts and history tables as well.  Let's use four simple SQL
queries like what I posted upthread (the history table has no data
after initialization), as of the attached.  I'd be tempted to apply
that first as a separate patch, because we've never added coverage for
that and we have specific expectations in the code from this filler
column for tpc-b.  And this needs to cover both client-side and
server-side data generation.

Note that the indentation was a bit incorrect, so fixed while on it.

Attached is a v7, with these tests (should be a patch on its own but
I'm lazy to split this morning) and some more adjustments that I have
done while going through the patch.  What do you think?
--
Michael

Attachment

Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
On Thu Jul 20, 2023 at 9:14 PM CDT, Michael Paquier wrote:
> Attached is a v7, with these tests (should be a patch on its own but
> I'm lazy to split this morning) and some more adjustments that I have
> done while going through the patch.  What do you think?

v7 looks good from my perspective. Thanks for working through this patch
with me. Much appreciated.

--
Tristan Partin
Neon (https://neon.tech)



Re: Use COPY for populating all pgbench tables

From
Michael Paquier
Date:
On Fri, Jul 21, 2023 at 12:22:06PM -0500, Tristan Partin wrote:
> v7 looks good from my perspective. Thanks for working through this patch
> with me. Much appreciated.

Cool.  I have applied the new tests for now to move on with this
thread.
--
Michael

Attachment

Re: Use COPY for populating all pgbench tables

From
Michael Paquier
Date:
On Sun, Jul 23, 2023 at 08:21:51PM +0900, Michael Paquier wrote:
> Cool.  I have applied the new tests for now to move on with this
> thread.

I have done a few more things on this patch today, including
measurements with a local host and large scaling numbers.  One of my
hosts, for instance, took around 36.8s with scale=500 using the INSERTs
and 36.5s with the COPY when loading data (average of 4 runs) with
-I dtg.

Greg's point upthread is also true: for high latency, server-side
generation is the most suitable option.  But since the client-side path
is just one choice hidden behind -I, I don't see a reason not to switch
it to COPY for the sake of improving the default option set.  So, at
the end, applied.
--
Michael

Attachment

Re: Use COPY for populating all pgbench tables

From
"Tristan Partin"
Date:
Michael,

Once again I appreciate your patience with this patchset. Thanks for
your help and reviews.

--
Tristan Partin
Neon (https://neon.tech)