Re: Benchmark Data requested --- pgloader CE design ideas - Mailing list pgsql-performance

From Greg Smith
Subject Re: Benchmark Data requested --- pgloader CE design ideas
Date
Msg-id Pine.GSO.4.64.0802061801540.11463@westnet.com
In response to Re: Benchmark Data requested --- pgloader CE design ideas  (Dimitri Fontaine <dfontaine@hi-media.com>)
Responses Re: Benchmark Data requested --- pgloader CE design ideas  (Dimitri Fontaine <dfontaine@hi-media.com>)
List pgsql-performance
On Wed, 6 Feb 2008, Dimitri Fontaine wrote:

> In fact, the -F option works by having pgloader read the given number of lines
> but skip processing them, which is not at all what Greg is talking about here
> I think.

Yeah, that's not useful.

> Greg, what would you think of a pgloader which will separate file reading
> based on file size as given by stat (os.stat(file)[ST_SIZE]) and number of
> threads: we split into as many pieces as section_threads section config
> value.

Now you're talking.  Find a couple of split points that way, fine-tune the
boundaries a bit so they rest on line termination points, and off you go.
Don't forget that the basic principle here implies you'll never know until
you're done just how many lines were really in the file.  When thread#1 is
running against chunk#1, it will never have any idea what line chunk#2
really started at until it reaches there, at which point it's done and
that information isn't helpful anymore.
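
As a rough illustration of the "find split points, then nudge them onto line terminations" idea, here is a minimal Python sketch. It is not pgloader code; the function name chunk_boundaries and its n_chunks parameter are made up for the example, and it assumes rows are terminated by plain newlines (no newlines embedded inside quoted fields).

```python
import os

def chunk_boundaries(path, n_chunks):
    """Return (start, end) byte ranges covering the file, each ending on a line break.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    size = os.stat(path).st_size
    if size == 0:
        return []
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            guess = size * i // n_chunks      # naive split point from the file size
            if guess <= offsets[-1]:
                continue                      # file too small for this many pieces
            f.seek(guess)
            f.readline()                      # advance to the next line start after the guess
            pos = f.tell()
            if pos >= size:
                break                         # ran off the end; the last chunk absorbs the rest
            if pos > offsets[-1]:
                offsets.append(pos)
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))
```

Only one short read per split point is needed, so the cost does not depend on the file size. Note that nothing here knows how many lines any chunk contains.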

You have to stop thinking in terms of lines for this splitting; all you
can do is split the file into useful byte sections and then count the
lines within them as you go.  Anything else requires a counting scan of
the file, and such a sequential read is exactly what can't happen
(especially not more than once); it just takes too long.
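
A sketch of "count the lines within them as you go", again in Python and again with invented names (load_chunk, load_file, handle_row) rather than anything from pgloader. It assumes the byte ranges already start and end on line boundaries and that handle_row is safe to call from several threads.

```python
from concurrent.futures import ThreadPoolExecutor

def load_chunk(path, start, end, handle_row):
    """Process the lines in bytes [start, end) and return how many were seen."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break                 # hit end of file early
            handle_row(line)          # hypothetical per-row callback
            count += 1                # the only line counting that ever happens
    return count

def load_file(path, ranges, handle_row):
    """Run one worker per byte range; return the per-chunk line counts in order."""
    with ThreadPoolExecutor(max_workers=max(1, len(ranges))) as pool:
        return list(pool.map(
            lambda r: load_chunk(path, r[0], r[1], handle_row), ranges))
```

The total line count is just the sum of the returned counts, and the starting line number of chunk k is the sum of the counts before it. Both are only available once the earlier chunks have finished, which is the point made above.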

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
