On Wed, 6 Feb 2008, Dimitri Fontaine wrote:
> In fact, the -F option works by having pgloader read the given number of lines
> but skip processing them, which is not at all what Greg is talking about here
> I think.
Yeah, that's not useful.
> Greg, what would you think of a pgloader which will separate file reading
> based on file size as given by stat (os.stat(file)[ST_SIZE]) and number of
> threads: we split into as many pieces as the section's section_threads
> config value.
Now you're talking. Find a couple of split points that way, fine-tune the
boundaries a bit so they rest on line termination points, and off you go.
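
A rough Python sketch of that boundary hunt, just to make the idea concrete. This is not pgloader code; chunk_boundaries and n_chunks are names made up for illustration, and it assumes rows end in "\n" with no newlines embedded inside quoted fields.

```python
import os

def chunk_boundaries(path, n_chunks):
    """Return (start, end) byte ranges; every start sits on a line start."""
    size = os.stat(path).st_size
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            guess = size * i // n_chunks      # naive split point from file size
            if guess <= offsets[-1]:
                continue                      # file too small for this split
            # Step back one byte so a guess that already sits right after
            # a newline is kept as-is, then skip to the end of that line.
            f.seek(guess - 1)
            f.readline()
            pos = f.tell()
            if pos >= size:
                break                         # ran off the end of the file
            if pos > offsets[-1]:
                offsets.append(pos)
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))
```

Each split costs one seek and one short read, so the file is never scanned just to find the boundaries. A small file can come back with fewer ranges than requested.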
Don't forget that the basic principle here implies you'll never know until
you're done just how many lines were really in the file. When thread #1 is
running against chunk #1, it has no idea what line number chunk #2
really starts at until it reaches the end of its own chunk, at which point
it's done and that information isn't helpful anymore.
You have to stop thinking in terms of lines for this splitting; all you
can do is split the file into useful byte sections and then count the
lines within them as you go. Anything else requires a counting scan of
the file, and such a sequential read is exactly what can't happen
(especially not more than once); it just takes too long.
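
To illustrate the counting-as-you-go part, here is a minimal sketch with one thread per byte range. The names load_chunk and load_in_parallel are hypothetical, the ranges are assumed to start on line boundaries already, and the actual row processing is left as a comment.

```python
import threading

def load_chunk(path, start, end, results, index):
    """Read the lines in [start, end) and record how many were seen."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break                  # unexpected end of file
            count += 1                 # a real loader would process the row here
    results[index] = count

def load_in_parallel(path, chunks):
    """Run one thread per (start, end) range; return per-chunk line counts."""
    results = [0] * len(chunks)
    threads = [
        threading.Thread(target=load_chunk, args=(path, s, e, results, i))
        for i, (s, e) in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The total line count is only sum(results) after every thread has joined, and the absolute line number where chunk k begins is sum(results[:k]), which is not known until all the earlier chunks have finished.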
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD