Re: Benchmark Data requested --- pgloader CE design ideas - Mailing list pgsql-performance

From Jignesh K. Shah
Subject Re: Benchmark Data requested --- pgloader CE design ideas
Date
Msg-id 47A9E186.2050009@sun.com
In response to Re: Benchmark Data requested --- pgloader CE design ideas  (Greg Smith <gsmith@gregsmith.com>)
List pgsql-performance

Greg Smith wrote:
> On Wed, 6 Feb 2008, Simon Riggs wrote:
>
>> For me, it would be good to see a --parallel=n parameter that would
>> allow pg_loader to distribute rows in "round-robin" manner to "n"
>> different concurrent COPY statements. i.e. a non-routing version.
>
> Let me expand on this.  In many of these giant COPY situations the
> bottleneck is plain old sequential I/O to a single process.  You can
> almost predict how fast the rows will load using dd.  Having a process
> that pulls rows in and distributes them round-robin is good, but it
> won't crack that bottleneck.  The useful approaches I've seen for
> other databases all presume that the data files involved are large
> enough that on big hardware, you can start multiple processes running
> at different points in the file and beat anything possible with a
> single reader.
>
> If I'm loading a TB file, odds are good I can split that into 4 or
> more vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start
> 4 loaders at once, and get way more than 1 disk worth of throughput
> reading.  You have to play with the exact number because if you push
> the split too far you introduce seek slowdown instead of improvements,
> but that's the basic design I'd like to see one day.  It's not
> parallel loading that's useful for the cases I'm thinking about until
> something like this comes around.
>
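
For reference, the contiguous-split scheme quoted above looks roughly like
this (just a sketch; the helper name and the newline alignment are my own
assumptions, not anything pgloader does today):

import os

def split_offsets(path, n):
    # Byte ranges for n contiguous readers, each starting on a row boundary.
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, 'rb') as f:
        for i in range(1, n):
            f.seek(i * size // n)   # jump to roughly i/n of the file
            f.readline()            # skip the partial row at the cut point
            cuts.append(f.tell())
    cuts.append(size)
    # reader i feeds bytes [cuts[i], cuts[i+1]) to its own COPY
    return list(zip(cuts[:-1], cuts[1:]))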

Some food for thought here: most BI-type applications that do data
conversion/cleansing also tend to sort the data before it is loaded into the
database, so starting parallel loaders at totally different points in the
file ruins that effort. A more pragmatic approach is to have the streams read
adjacent rows from the input file: with N parallel streams, each one is
offset by one row from the next and jumps ahead by N rows, so the seeks are
narrowed down to a few rows (ideally one) instead of the disk jumping 1/Nth
of the file every time a different stream issues a read.
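
In code terms the interleaving would look something like this (a hypothetical
generator, not existing pgloader code): each of the N streams walks the same
file but keeps only every Nth row, so at any moment all the readers are
within a few rows of each other.

def interleaved_rows(path, stream_id, n_streams):
    # Yields rows stream_id, stream_id + n_streams, stream_id + 2*n_streams, ...
    with open(path, 'r') as f:
        for row_no, line in enumerate(f):
            if row_no % n_streams == stream_id:
                yield line   # hand this row to this stream's COPY

Each stream still scans the whole file, but because the streams stay close
together, the blocks one of them pulls in are usually already cached for the
others.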

For example, to replicate this with dd and see the impact, use a big file and
play with the skip/seek options and block sizes. Somebody out here once did
that test and showed that the "seek time" on the file being read drops
significantly, and if the file system does intelligent prefetching (which,
unfortunately, UFS on Solaris does not do well by default), all the reads for
the next stream will already be in memory.
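
Since dd is a shell tool, a rough stand-in for that measurement in Python
(file name, block size, and counts are placeholders) would be:

import time

def timed_read(path, bs, skip, count):
    # Roughly what `dd if=path bs=bs skip=skip count=count` reports.
    start = time.time()
    with open(path, 'rb') as f:
        f.seek(skip * bs)
        for _ in range(count):
            if not f.read(bs):
                break            # hit end of file
    return time.time() - start

With overlapping or adjacent offsets the second call is largely served from
the page cache (readahead permitting); with offsets 1/Nth of the file apart,
both calls go to disk.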



Regards,
Jignesh

