Re: Use COPY for populating all pgbench tables - Mailing list pgsql-hackers

From Tristan Partin
Subject Re: Use COPY for populating all pgbench tables
Date
Msg-id CT87G1JPVKU7.KGLR2XD5Z5OU@gonk
Whole thread Raw
In response to Re: Use COPY for populating all pgbench tables  (Gregory Smith <gregsmithpgsql@gmail.com>)
List pgsql-hackers
On Fri Jun 9, 2023 at 8:24 AM CDT, Gregory Smith wrote:
> On Tue, May 23, 2023 at 1:33 PM Tristan Partin <tristan@neon.tech> wrote:
>
> > We (Neon) have noticed that pgbench can be quite slow to populate data
> > in regard to higher latency connections. Higher scale factors exacerbate
> > this problem. Some employees work on a different continent than the
> > databases they might be benchmarking. By moving pgbench to use COPY for
> > populating all tables, we can reduce some of the time pgbench takes for
> > this particular step.
> >
>
> When latency is continent size high, pgbench should be run with server-side
> table generation instead of using COPY at all, for any table.  The default
> COPY based pgbench generation is only intended for testing where the client
> and server are very close on the network.
>
> Unfortunately there's no simple command line option to change just that one
> thing about how pgbench runs.  You have to construct a command line that
> documents each and every step you want instead.  You probably just want
> this form:
>
>     $ pgbench -i -I dtGvp -s 500
>
> That's server-side table generation with all the usual steps.  I use this
> instead of COPY in pgbench-tools so much now, basically whenever I'm
> talking to a cloud system, that I have a simple 0/1 config option to switch
> between the modes, and this long weird one is the default now.
>
> Try that out, and once you see the numbers my bet is you'll see extending
> which tables get COPY isn't needed by your use case anymore.  Basically, if
> you are close enough to use COPY instead of server-side generation, you are
> close enough that every table besides accounts will not add up to enough
> time to worry about optimizing the little ones.

Thanks for your input Greg. I'm sure you're correct that server-side data
generation would probably fix the problem. I guess I am trying to
understand if there are any downsides to just committing this anyway. I
haven't done any more testing, but David's email only showed a 4%
performance drop in the workload he tested. Combine this with his hex
patch and we would see an overall performance improvement when
everything is said and done.

It seems like this patch would still be good for client-side high scale
factor data generation (when server and client are close), but I would
need to do testing to confirm.

--
Tristan Partin
Neon (https://neon.tech)



pgsql-hackers by date:

Previous
From: David Steele
Date:
Subject: Re: Views no longer in rangeTabls?
Next
From: "Tristan Partin"
Date:
Subject: Re: Use COPY for populating all pgbench tables