Re: parallel pg_restore - WIP patch - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: parallel pg_restore - WIP patch
Msg-id 48DCD590.4000207@dunslane.net
In response to parallel pg_restore - WIP patch  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: parallel pg_restore - WIP patch  (Russell Smith <mr-russ@pws.com.au>)
List pgsql-hackers

Russell Smith wrote:
> Hi,
>
> As I'm interested in this topic, I thought I'd take a look at the
> patch.  I have no capability to test it on high end hardware but did
> some basic testing on my workstation and basic review of the patch.
>
> I somehow had the impression that instead of creating a new connection
> for each restore item we would create the processes at the start and
> then send them the dumpId's they should be restoring.  That would allow
> the controller to batch dumpId's together and expect the worker to
> process them in a transaction.  But this is probably just an idea I
> created in my head.
>   

Yes it is. To do that I would have to invent a protocol for talking to 
the workers, etc, and there is not the slightest chance I would get that 
done by November.
And I don't see the virtue in processing them all in a transaction. I've 
provided a much simpler means of avoiding WAL logging of the COPY.

> Do we know why we experience "tuple concurrently updated" errors if we
> spawn threads too fast?
>   

No. That's an open item.

> I completed some test restores using the pg_restore from head with the
> patch applied.  The dump was a custom dump created with pg 8.2 and
> restored to an 8.2 database.  To confirm this would work, I completed a
> restore using the standard single threaded mode.   The schema restored
> successfully.  The only errors reported involved non-existent roles.
>
> When I attempt to restore using parallel restore I get out of memory
> errors reported from _PrintData.   The code returning the error is;
>
> _PrintData(...
>     while (blkLen != 0)
>     {
>         if (blkLen + 1 > ctx->inSize)
>         {
>             free(ctx->zlibIn);
>             ctx->zlibIn = NULL;
>             ctx->zlibIn = (char *) malloc(blkLen + 1);
>             if (!ctx->zlibIn)
>                 die_horribly(AH, modulename, " out of memory\n");
>
>             ctx->inSize = blkLen + 1;
>             in = ctx->zlibIn;
>         }
>
>
> It appears from my debugging and looking at the code that in _PrintData;
>     lclContext *ctx = (lclContext *) AH->formatData;
>
> the memory context is shared across all threads.  Which means that it's
> possible the memory contexts are stomping on each other.  My GDB skills
> are not up to being able to reproduce this in a gdb session as there are
> forks going on all over the place.  And if you process them in a serial
> fashion, there aren't any errors.  I'm not sure of the fix for this. 
> But in a parallel environment it doesn't seem possible to store the
> memory context in the AH.
>   


There are no threads, hence nothing is shared. fork() creates a new 
process, not a new thread, and all they share are file descriptors.
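
To illustrate the difference, here is a minimal sketch. It is not code 
from the patch: in_size, parent_value_after_child_write() and 
parent_offset_after_child_read() are names made up for this example, and 
it assumes a POSIX system. The first function shows that a write made by 
a forked child lands in the child's private copy of memory. The second 
shows what an inherited file descriptor does share: the file offset.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for per-archive state such as ctx->inSize. */
static int in_size = 1;

/*
 * Fork a child that changes in_size, wait for it, and return the value
 * the parent sees afterwards.  Returns -1 if fork() or waitpid() fails.
 */
int parent_value_after_child_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        in_size = 4096;     /* changes the child's private copy only */
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    assert(WIFEXITED(status));
    return in_size;         /* the parent's copy is untouched */
}

/*
 * Open a 10-byte scratch file, fork a child that reads 4 bytes from the
 * inherited descriptor, and return the parent's file offset afterwards.
 * Returns -1 on any failure.
 */
long parent_offset_after_child_read(void)
{
    char path[] = "/tmp/forkdemoXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);           /* the file disappears once fd is closed */
    if (write(fd, "0123456789", 10) != 10 || lseek(fd, 0, SEEK_SET) != 0)
    {
        close(fd);
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0)
    {
        close(fd);
        return -1;
    }
    if (pid == 0)
    {
        char buf[4];
        _exit(read(fd, buf, sizeof buf) == 4 ? 0 : 1);
    }
    int status;
    off_t off = -1;
    if (waitpid(pid, &status, 0) >= 0 && WIFEXITED(status)
        && WEXITSTATUS(status) == 0)
        off = lseek(fd, 0, SEEK_CUR);   /* offset is shared with the child */
    close(fd);
    return (long) off;
}
```

Called from a small main(), the first function returns 1 (the child's 
assignment is invisible to the parent) and the second returns 4 (the 
child's read moved the offset that the parent also uses). This only 
demonstrates fork() semantics; it is not a diagnosis of the reported 
errors.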


> I also receive messages saying "pg_restore: [custom archiver] could not
> read from input file: end of file".  I have not investigated these
> further as my current guess is they are linked to the out of memory error.
>
> Given I ran into this error at my first testing attempt  I haven't
> evaluated much else at this point in time.  Now all this could be
> because I'm using the 8.2 archive, but it works fine in single restore
> mode.  The dump file is about 400M compressed and an entire archive
> schema was removed from the restore path with a custom restore list.
>
> Command line used;  PGPORT=5432 ./pg_restore -h /var/run/postgresql -m4
> --truncate-before-load -v -d tt2 -L tt.list
> /home/mr-russ/pg-index-test/timetable.pgdump 2> log.txt
>
> I've attached the log.txt file so you can review the errors that I saw. 
> I have adjusted the "out of memory" error to include a number to work
> out which one was being triggered.  So you'll see "5 out of memory" in
> the log file, which corresponds to the code above.
>   

However, there does seem to be something odd happening with the 
compression lib, which I will investigate. Thanks for the report.

cheers

andrew


