Re: Benchmark Data requested --- pgloader CE design ideas - Mailing list pgsql-performance

From Dimitri Fontaine
Subject Re: Benchmark Data requested --- pgloader CE design ideas
Date
Msg-id 200802061336.53363.dfontaine@hi-media.com
In response to Re: Benchmark Data requested --- pgloader CE design ideas  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-performance
On Wednesday 06 February 2008, Simon Riggs wrote:
> For me, it would be good to see a --parallel=n parameter that would
> allow pg_loader to distribute rows in "round-robin" manner to "n"
> different concurrent COPY statements. i.e. a non-routing version.

What happens when you want at most N parallel threads and have several
sections configured: do you want pgloader to serialize section loading (often
there's one section per table, sometimes different sections target the same
table) but parallelise the loading within each section?

I'm thinking we should have a global max_threads knob *and* a per-section
one if we want to go this way, but then multi-threaded sections will compete
with other sections (multi-threaded or not) for threads to use.

So I'll also add a parameter to configure how many (max) sections to load in
parallel at any time.

We'll then have (default values presented):
max_threads = 1
max_parallel_sections = 1
section_threads = -1

The section_threads parameter could be overridden at section level but would
need to stay <= max_threads (if not, the value is discarded and a warning is
issued). When section_threads is -1, pgloader tries to use as many threads as
possible, still within the global max_threads limit.
If max_parallel_sections is -1, pgloader starts a new thread for each new
section, maxing out at max_threads, then waits for a thread to finish
before launching a new section load.
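
As a rough illustration of the section_threads rule above, here is a minimal
sketch. The function name and signature are hypothetical, not pgloader code,
and it assumes that a discarded section-level value falls back to the global
setting:

```python
import warnings


def resolve_section_threads(section_value, global_value, max_threads):
    """Hypothetical helper: how many threads one section may use.

    section_value -- section-level section_threads, or None if not set
    global_value  -- global section_threads (-1 means "as many as possible")
    max_threads   -- global hard limit on threads
    """
    value = global_value if section_value is None else section_value
    if value != -1 and value > max_threads:
        # Over the global limit: discard the value and warn.
        warnings.warn(
            "section_threads=%d exceeds max_threads=%d, ignoring it"
            % (value, max_threads)
        )
        value = global_value  # assumption: fall back to the global setting
    if value == -1:
        # -1 means "as many threads as the global limit allows".
        return max_threads
    # Also clamp, in case the global setting itself is too high.
    return min(value, max_threads)
```

With the defaults shown above (max_threads = 1, section_threads = -1) every
section resolves to a single thread, so nothing changes unless the user raises
max_threads.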

If you have max_threads = N and max_parallel_sections = section_threads = -1,
then we'll see some kind of fight between new section threads and in-section
threads (the parallel non-routing COPY behaviour). But then it's the user's
choice.
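
The max_parallel_sections behaviour could be sketched as below. This is only
an illustration under assumptions: load_sections and its load_section callback
are hypothetical names, and the sketch ignores the in-section threads, so it
does not model the fight described above.

```python
import threading


def load_sections(sections, max_parallel_sections, max_threads, load_section):
    """Run load_section(name) for each section, a few at a time."""
    if max_parallel_sections == -1:
        # -1: one thread per new section, maxing out at max_threads.
        limit = max_threads
    else:
        limit = min(max_parallel_sections, max_threads)
    slots = threading.BoundedSemaphore(limit)
    threads = []

    def worker(name):
        try:
            load_section(name)
        finally:
            slots.release()  # free the slot so the next section can start

    for name in sections:
        slots.acquire()  # blocks until a running section has finished
        t = threading.Thread(target=worker, args=(name,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
```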

Adding Constraint_Exclusion support on top of this would not break anything,
but it would only be of interest when section_threads != 1 and max_threads > 1.

> Making
> that work well, whilst continuing to do error-handling seems like a
> challenge, but a very useful goal.

Quick tests showed me that the Python threading model makes it easy to share
objects between several threads, so I don't think I'll need to adjust my
reject code when going per-section multi-threaded. I just have to use a
semaphore object so that rows keep being rejected one line at a time. Not
that complex, if reliable.
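
To make the semaphore idea concrete, here is a minimal sketch of a reject log
shared between threads. RejectLog is a hypothetical class for illustration,
not the actual pgloader reject code:

```python
import threading


class RejectLog:
    """Hypothetical shared reject log, written one line at a time."""

    def __init__(self, out):
        self._out = out  # any file-like object with a write() method
        self._guard = threading.Semaphore(1)  # one writer at a time
        self.count = 0

    def reject(self, line):
        # The semaphore serializes both the write and the counter update,
        # so rejected lines from different threads never interleave.
        with self._guard:
            self._out.write(line.rstrip("\n") + "\n")
            self.count += 1
```

Every loader thread of a section would hold a reference to the same RejectLog
instance and call reject() whenever a row fails to load.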

> Adding intelligence to the row distribution may be technically hard but
> may also simply move the bottleneck onto pg_loader. We may need multiple
> threads in pg_loader, or we may just need multiple sessions from
> pg_loader. Experience from doing the non-routing parallel version may
> help in deciding whether to go for the routing version.

If non-routing per-section multi-threading is a user request and not that hard
to implement (thanks to Python), that sounds like a good enough reason for me
to provide it :)
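
For the record, the non-routing round-robin distribution could look roughly
like this. It is a sketch under assumptions: round_robin_load and the
copy_rows callback are hypothetical, and copy_rows stands in for whatever
feeds one COPY statement on its own connection. The queues are unbounded to
keep the example short; a real loader would want to bound them.

```python
import itertools
import queue
import threading

_DONE = object()  # sentinel telling a worker that its stream has ended


def round_robin_load(rows, n, copy_rows):
    """Deal rows to n workers in turn; each worker calls copy_rows(stream)."""
    queues = [queue.Queue() for _ in range(n)]

    def worker(q):
        def stream():
            while True:
                row = q.get()
                if row is _DONE:
                    return
                yield row
        copy_rows(stream())  # e.g. one COPY per worker

    threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for t in threads:
        t.start()
    # Non-routing: row i simply goes to worker i modulo n.
    for q, row in zip(itertools.cycle(queues), rows):
        q.put(row)
    for q in queues:
        q.put(_DONE)
    for t in threads:
        t.join()
```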

I'll keep you (and the list) informed as soon as I have code to play with.
--
dim
