Re: Benchmark Data requested --- pgloader CE design ideas - Mailing list pgsql-performance

From	Greg Smith
Subject	Re: Benchmark Data requested --- pgloader CE design ideas
Date	February 6, 2008 19:36:55
Msg-id	Pine.GSO.4.64.0802061801540.11463@westnet.com
In response to	Re: Benchmark Data requested --- pgloader CE design ideas (Dimitri Fontaine <dfontaine@hi-media.com>)
Responses	Re: Benchmark Data requested --- pgloader CE design ideas
List	pgsql-performance

On Wed, 6 Feb 2008, Dimitri Fontaine wrote:

> In fact, the -F option works by having pgloader read the given number of lines
> but skip processing them, which is not at all what Greg is talking about here
> I think.

Yeah, that's not useful.

> Greg, what would you think of a pgloader which will separate file reading
> based on file size as given by stat (os.stat(file)[ST_SIZE]) and number of
> threads: we split into as many pieces as section_threads section config
> value.

Now you're talking.  Find a couple of split points that way, fine-tune the
boundaries a bit so they rest on line termination points, and off you go.
Don't forget that the basic principle here implies you'll never know until
you're done just how many lines were really in the file.  When thread#1 is
running against chunk#1, it will never have any idea what line chunk#2
really started at until it reaches there, at which point it's done and
that information isn't helpful anymore.
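
The boundary fine-tuning described above can be sketched in a few lines of
Python.  This is only an illustration under stated assumptions, not pgloader
code: the function name chunk_boundaries and its n_chunks parameter are made
up here, and it assumes newline-terminated records with no embedded newlines
inside quoted fields.

```python
import os

def chunk_boundaries(path, n_chunks):
    """Return (start, end) byte ranges whose starts sit on line boundaries.

    Hypothetical helper: assumes "\n" line endings and no newlines
    embedded inside a record.
    """
    size = os.stat(path).st_size
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            guess = size * i // n_chunks      # naive split point by size
            if guess <= cuts[-1]:
                continue                      # an earlier cut already passed it
            f.seek(guess - 1)
            f.readline()                      # move just past the next "\n"
            pos = f.tell()
            if cuts[-1] < pos < size:
                cuts.append(pos)
    cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))
```

Seeking to guess - 1 rather than guess means a guess that already lands on
the first byte of a line is kept as is.  Only a handful of short reads are
needed, so finding the split points never requires scanning the file.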

You have to stop thinking in terms of lines for this splitting; all you
can do is split the file into useful byte sections and then count the
lines within them as you go.  Anything else requires a counting scan of
the file, and such a sequential read is exactly what can't happen
(especially not more than once); it just takes too long.
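
To make the "count as you go" idea concrete, here is a rough Python sketch.
It assumes the byte ranges have already been chosen so that each one starts
on a line boundary; the names count_lines and first_line_numbers are invented
for the example, and a real loader would process each line where this one
only counts it.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate

def count_lines(path, start, end):
    """Count the lines in bytes [start, end) of the file."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break                 # hit end of file early
            count += 1                # a real loader would load `line` here
    return count

def first_line_numbers(path, ranges):
    """Run one worker per byte range (ranges must be non-empty)."""
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        counts = list(pool.map(lambda r: count_lines(path, *r), ranges))
    # Only now that every chunk is finished can absolute line numbers
    # be assigned to the start of each chunk.
    starts = [1] + [1 + n for n in accumulate(counts)][:-1]
    return counts, starts
```

No worker knows its starting line number while it runs; the per-chunk counts
are only turned into absolute line numbers after all of them have finished,
which is fine for a final report but no use for error messages emitted
mid-load.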

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
