Thread: patch for parallel pg_dump

patch for parallel pg_dump

From
Joachim Wieland
Date:
So this is the parallel pg_dump patch, generalizing the existing
parallel restore and allowing parallel dumps for the directory archive
format. The patch works on Windows and Unix.

In the first phase of a parallel pg_dump/pg_restore, the catalog is
backed up/restored in a single process, which then forks off worker
processes that are connected to the master process by pipes (on
Windows, the pgpipe implementation is used). These pipes are only used
for a few commands and status messages. The worker processes then work
on the items that the master assigns to them; in other words, the
workers do not terminate after each item but stay around until the end
of the parallel part of the dump/restore. Once a worker finishes its
current item and sends the status back to the master, it is assigned
the next item, and so forth.
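
In rough code terms, the master's loop looks about like this (a
sketch; apart from getMessageFromWorker(), the helper names are
invented):

/* Sketch of the master's dispatch loop; most helper names invented. */
while (!allItemsDone(items))
{
    int         worker = findIdleWorker(pstate);

    if (worker >= 0 && hasUnassignedItems(items))
        dispatchItemToWorker(pstate, worker, nextLargestItem(items));
    else
    {
        /* all workers busy, or nothing left to hand out: block */
        char       *msg = getMessageFromWorker(pstate, true);

        markItemDone(pstate, items, msg);
        free(msg);
    }
}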

In parallel restore, the master closes its own connection to the
database before forking off worker processes, just as it does now. In
parallel dump however, we need to hold the master's connection open so
that we can detect deadlocks. The issue is that somebody could have
requested an exclusive lock after the master has initially requested a
shared lock on all tables. Therefore, the worker process also requests
a shared lock on the table with NOWAIT, and if this fails, we know that
a conflicting lock was queued in the meantime and that we need to abort
the dump.
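
In libpq terms, the worker's check is essentially this (a sketch
close to, but not literally, the patch's code; te stands for the TOC
entry being dumped):

/*
 * Re-request the table's shared lock with NOWAIT.  If it cannot be
 * granted immediately, a conflicting lock was queued after the master's
 * initial ACCESS SHARE lock, so waiting would risk a deadlock against
 * the master's own lock -- give up right away instead.
 */
PQExpBuffer query = createPQExpBuffer();
PGresult   *res;

appendPQExpBuffer(query, "LOCK TABLE %s IN ACCESS SHARE MODE NOWAIT",
                  fmtQualifiedId(fout, te->namespace, te->tag));
res = PQexec(AH->connection, query->data);
if (!res || PQresultStatus(res) != PGRES_COMMAND_OK)
    die_horribly(AH, modulename,
                 "could not obtain lock on relation \"%s\"\n", te->tag);
PQclear(res);
destroyPQExpBuffer(query);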

Parallel pg_dump sorts the tables and indexes by their sizes so that
it can start with the largest items first.

The connections of the parallel dump use the synchronized snapshot
feature. However, there's also an option --no-synchronized-snapshots,
which can be used to dump from an older PostgreSQL version.
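
For reference, the mechanics behind that feature (new in 9.2) boil
down to the following; sync_snapshot_id is just my name here for
wherever the exported snapshot identifier is stored:

/* master, once its transaction is open (error checking omitted): */
res = PQexec(AH->connection, "SELECT pg_export_snapshot()");
AH->sync_snapshot_id = strdup(PQgetvalue(res, 0, 0));
PQclear(res);

/* each worker, immediately after connecting: */
PQclear(PQexec(conn, "BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ"));
query = createPQExpBuffer();
appendPQExpBuffer(query, "SET TRANSACTION SNAPSHOT '%s'",
                  AH->sync_snapshot_id);
PQclear(PQexec(conn, query->data));
destroyPQExpBuffer(query);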

I'm also attaching another use case for parallel backup as a
separate patch: a new archive format that I named "pg_backup_mirror".
It's basically the parallel version of "pg_dump | psql", i.e. it does a
parallel dump and restore of a database from one host to another. The
patch for this is fairly small, but it's still a bit rough and needs
some more work and discussion. Depending on how quickly (or not) we get
done with the review of the main patch, we can then include this one as
well or postpone it.


Joachim

Attachment

Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Sun, Jan 15, 2012 at 1:01 PM, Joachim Wieland <joe@mcknight.de> wrote:
> So this is the parallel pg_dump patch, generalizing the existing
> parallel restore and allowing parallel dumps for the directory archive
> format. The patch works on Windows and Unix.

This patch introduces a large amount of notational churn, as perhaps
well-illustrated by this hunk:
 static int
! dumpBlobs(Archive *AHX, void *arg)
 {
+       /*
+        * This is a data dumper routine, executed in a child for parallel backup,
+        * so it must not access the global g_conn but AH->connection instead.
+        */
+       ArchiveHandle *AH = (ArchiveHandle *) AHX;

It seems pretty grotty to have a static function that gets an argument
of the wrong type, and then just immediately turns around and casts it
to something else.  It's not exactly obvious that that's even safe,
especially if you don't know that ArchiveHandle is a struct whose
first element is an Archive.  But even if you do know that subclassing
is intended, that doesn't prove that the particular Archive object is
always going to be an ArchiveHandle under the hood.  If it is, why not
just pass it as an ArchiveHandle to begin with?  I think this needs to
be revised in some way.  At least in the few cases I checked, the only
point of getting at the ArchiveHandle this way was to find
AH->connection, which suggests to me either that AH->connection should
be in the "public" section, inside Archive rather than ArchiveHandle,
or else that we ought to just pass the connection object to this
function (and all of its friends who have similar issues) as a
separate argument.  Either way, I think that would make this patch
both cleaner and less-invasive.  In fact we might want to pull out
just that change and commit it separately to simplify review of the
remaining work.
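
To spell the two options out (declarations for illustration only;
remoteVersion is an existing Archive field, the rest is sketched):

/* Option 1: expose the connection in the "public" struct, Archive. */
typedef struct Archive
{
    int         remoteVersion;  /* server version of the DB being dumped */
    PGconn     *connection;     /* moved up from ArchiveHandle */
    /* ... other public fields as today ... */
} Archive;

/* Option 2: hand the connection to the data dumper routines explicitly. */
static int  dumpBlobs(Archive *fout, PGconn *conn, void *arg);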

It's not clear to me why fmtQualifiedId needs to move to dumputils.c.
The way you have it, fmtQualifiedId() is now with fmtId(), but no
longer with fmtCopyColumnList(), the only other similarly named
function in that directory.  That may be more logical, or it may not,
but rearranging the code like this makes it a lot harder to review,
and I would prefer that we make such changes as separate commits if
we're going to do them, so that diff can do something sensible with
the changes that are integral to the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Fri, Jan 27, 2012 at 10:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jan 15, 2012 at 1:01 PM, Joachim Wieland <joe@mcknight.de> wrote:
>> So this is the parallel pg_dump patch, generalizing the existing
>> parallel restore and allowing parallel dumps for the directory archive
>> format. The patch works on Windows and Unix.
>
> This patch introduces a large amount of notational churn, as perhaps
> well-illustrated by this hunk:
>
>  static int
> ! dumpBlobs(Archive *AHX, void *arg)
>  {
> +       /*
> +        * This is a data dumper routine, executed in a child for parallel backup,
> +        * so it must not access the global g_conn but AH->connection instead.
> +        */
> +       ArchiveHandle *AH = (ArchiveHandle *) AHX;
>
> It seems pretty grotty to have a static function that gets an argument
> of the wrong type, and then just immediately turns around and casts it
> to something else.  It's not exactly obvious that that's even safe,
> especially if you don't know that ArchiveHandle is a struct whose
> first element is an Archive.  But even if you do know that subclassing
> is intended, that doesn't prove that the particular Archive object is
> always going to be an ArchiveHandle under the hood.  If it is, why not
> just pass it as an ArchiveHandle to begin with?  I think this needs to
> be revised in some way.  At least in the few cases I checked, the only
> point of getting at the ArchiveHandle this way was to find
> AH->connection, which suggests to me either that AH->connection should
> be in the "public" section, inside Archive rather than ArchiveHandle,
> or else that we ought to just pass the connection object to this
> function (and all of its friends who have similar issues) as a
> separate argument.  Either way, I think that would make this patch
> both cleaner and less-invasive.  In fact we might want to pull out
> just that change and commit it separately to simplify review of the
> remaining work.
>
> It's not clear to me why fmtQualifiedId needs to move to dumputils.c.
> The way you have it, fmtQualifiedId() is now with fmtId(), but no
> longer with fmtCopyColumnList(), the only other similarly named
> function in that directory.  That may be more logical, or it may not,
> but rearranging the code like this makes it a lot harder to review,
> and I would prefer that we make such changes as separate commits if
> we're going to do them, so that diff can do something sensible with
> the changes that are integral to the patch.

Woops, hit send a little too soon there.  I'll try to make some more
substantive comments after looking at this more.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Fri, Jan 27, 2012 at 10:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It's not clear to me why fmtQualifiedId needs to move to dumputils.c.
>> The way you have it, fmtQualifiedId() is now with fmtId(), but no
>> longer with fmtCopyColumnList(), the only other similarly named
>> function in that directory.  That may be more logical, or it may not,
>> but rearranging the code like this makes it a lot harder to review,
>> and I would prefer that we make such changes as separate commits if
>> we're going to do them, so that diff can do something sensible with
>> the changes that are integral to the patch.
>
> Woops, hit send a little too soon there.  I'll try to make some more
> substantive comments after looking at this more.

And maybe retract some of the bogus ones, like the above: I see why
you moved this, now - parallel.c needs it.

Still looking...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Sun, Jan 15, 2012 at 1:01 PM, Joachim Wieland <joe@mcknight.de> wrote:
> So this is the parallel pg_dump patch, generalizing the existing
> parallel restore and allowing parallel dumps for the directory archive
> format. The patch works on Windows and Unix.

It seems a little unfortunate that we are using threads here on
Windows and processes on Linux.  Maybe it's too late to revisit that
decision, but it seems like a recipe for subtle bugs.

> In parallel restore, the master closes its own connection to the
> database before forking off worker processes, just as it does now. In
> parallel dump however, we need to hold the master's connection open so
> that we can detect deadlocks. The issue is that somebody could have
> requested an exclusive lock after the master has initially requested a
> shared lock on all tables. Therefore, the worker process also requests
> a shared lock on the table with NOWAIT, and if this fails, we know that
> a conflicting lock was queued in the meantime and that we need to abort
> the dump.

I think this is an acceptable limitation, but the window where it can
happen seems awfully wide right now.  As things stand, it seems like
we don't try to lock the table in the child until we're about to
access it, which means that, on a large database, we could dump out
99% of the database and then be forced to abort the dump because of a
conflicting lock on the very last table.  We could fix that by having
every child lock every table right at the beginning, so that all
possible failures of this type would happen before we do any work, but
that will eat up a lot of lock table space.  It would be nice if the
children could somehow piggyback on the parent's locks, but I don't
see any obvious way to make that work.  Maybe we just have to live
with it the way it is, but I worry that people whose dumps fail 10
hours into a 12 hour parallel dump are going to be grumpy.

> The connections of the parallel dump use the synchronized snapshot
> feature. However, there's also an option --no-synchronized-snapshots,
> which can be used to dump from an older PostgreSQL version.

Right now, you have it set up so that --no-synchronized-snapshots is
ignored even if synchronized snapshots are supported, which doesn't
make much sense to me.  I think you should allow
--no-synchronized-snapshots with any release, and error out if it's
not specified and the version is too old to work without it.  It's
probably not a good idea to run with --no-synchronized-snapshots ever,
and doubly so if they're not available, but I'd rather leave that
decision to the user.  (Imagine, for example, that we discovered a bug
in our synchronized snapshot implementation.)
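
In other words, roughly this (the flag variable and the 9.2 version
cutoff here are my guesses, not code from the patch):

if (numWorkers > 1 && !no_synchronized_snapshots &&
    fout->remoteVersion < 90200)
    exit_horribly(NULL,
                  "parallel dumps from this server version require "
                  "--no-synchronized-snapshots\n");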

I am tempted to advocate calling the flag --unsynchronized-snapshots,
because to me that underscores the danger a little more clearly, but
perhaps that is too clever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Heikki Linnakangas
Date:
On 27.01.2012 18:46, Robert Haas wrote:
> On Sun, Jan 15, 2012 at 1:01 PM, Joachim Wieland <joe@mcknight.de> wrote:
>> In parallel restore, the master closes its own connection to the
>> database before forking off worker processes, just as it does now. In
>> parallel dump however, we need to hold the master's connection open so
>> that we can detect deadlocks. The issue is that somebody could have
>> requested an exclusive lock after the master has initially requested a
>> shared lock on all tables. Therefore, the worker process also requests
>> a shared lock on the table with NOWAIT, and if this fails, we know that
>> a conflicting lock was queued in the meantime and that we need to abort
>> the dump.
>
> I think this is an acceptable limitation, but the window where it can
> happen seems awfully wide right now.  As things stand, it seems like
> we don't try to lock the table in the child until we're about to
> access it, which means that, on a large database, we could dump out
> 99% of the database and then be forced to abort the dump because of a
> conflicting lock on the very last table.  We could fix that by having
> every child lock every table right at the beginning, so that all
> possible failures of this type would happen before we do any work, but
> that will eat up a lot of lock table space.  It would be nice if the
> children could somehow piggyback on the parent's locks, but I don't
> see any obvious way to make that work.  Maybe we just have to live
> with it the way it is, but I worry that people whose dumps fail 10
> hours into a 12 hour parallel dump are going to be grumpy.

If the master process keeps the locks it acquires in the beginning, you
could fall back to using the master connection to dump those tables for
which the child's lock request fails.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Fri, Jan 27, 2012 at 11:53 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> If the master process keeps the locks it acquires in the beginning, you
> could fall back to using the master connection to dump those tables for
> which the child's lock request fails.

Hmm, that's a thought.

Another idea I just had is to allow a transaction that has imported a
snapshot to jump the queue when attempting to acquire a lock that the
backend from which the snapshot was imported already holds.  We don't
want to allow queue-jumping in general because there are numerous
places in the code where we rely on the current behavior to avoid
starving strong lockers, but it seems like it might be reasonable to
allow it in this special case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Fri, Jan 27, 2012 at 4:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> But even if you do know that subclassing
> is intended, that doesn't prove that the particular Archive object is
> always going to be an ArchiveHandle under the hood.  If it is, why not
> just pass it as an ArchiveHandle to begin with?

I know that you took back some of your comments, but I'm with you
here. Archive is allocated as an ArchiveHandle and then cast back to
Archive*, so you always know that an Archive is an ArchiveHandle. I'm
all for getting rid of Archive and just using ArchiveHandle throughout
pg_dump, which would get rid of these useless casts. I admit that I
might have made it a bit worse by adding a few more of these casts, but
the fundamental issue was already there, and there is precedent for
casting between them in both directions :-)

Joachim


Re: patch for parallel pg_dump

From
Tom Lane
Date:
Joachim Wieland <joe@mcknight.de> writes:
> I know that you took back some of your comments, but I'm with you
> here. Archive is allocated as an ArchiveHandle and then cast back to
> Archive*, so you always know that an Archive is an ArchiveHandle. I'm
> all for getting rid of Archive and just using ArchiveHandle throughout
> pg_dump, which would get rid of these useless casts.

I'd like to see a more thoroughgoing look at the basic structure of
pg_dump.  Everybody who's ever looked at that code has found it
confusing, with the possible exception of the original author who is
long gone from the project anyway.  I don't know exactly what would make
it better, but the useless distinction between Archive and ArchiveHandle
seems like a minor annoyance, not the core disease.

Not that there'd be anything wrong with starting with that.
        regards, tom lane


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Sun, Jan 29, 2012 at 12:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joachim Wieland <joe@mcknight.de> writes:
>> I know that you took back some of your comments, but I'm with you
>> here. Archive is allocated as an ArchiveHandle and then cast back to
>> Archive*, so you always know that an Archive is an ArchiveHandle. I'm
>> all for getting rid of Archive and just using ArchiveHandle throughout
>> pg_dump, which would get rid of these useless casts.
>
> I'd like to see a more thoroughgoing look at the basic structure of
> pg_dump.  Everybody who's ever looked at that code has found it
> confusing, with the possible exception of the original author who is
> long gone from the project anyway.  I don't know exactly what would make
> it better, but the useless distinction between Archive and ArchiveHandle
> seems like a minor annoyance, not the core disease.
>
> Not that there'd be anything wrong with starting with that.

After some study, I'm reluctant to completely abandon the
Archive/ArchiveHandle distinction because it seems to me that it does
serve some purpose: right now, nothing in pg_dump.c - where all the
code to actually dump stuff lives - knows anything about what's inside
the ArchiveHandle, just the Archive.  So having two data structures
really does serve a purpose: if it's part of the Archive, you need it
in order to query the system catalogs and generate SQL.  If it isn't,
you don't.  Considering how much more crap there is inside an
ArchiveHandle than an Archive, I don't think we should lightly abandon
that distinction.

Now, that having been said, there are probably better ways of making
that distinction than what we have here; Archive and ArchiveHandle
could be better named, perhaps, and we could have pointers from one
structure to the other instead of magically embedding them inside each
other.  All the function pointers that are part of the ArchiveHandle
could be separated out into a static, constant structure with a name
like ArchiveMethod, and we could keep a pointer to the appropriate
ArchiveMethod inside each ArchiveHandle instead of copying all the
pointers into it.  I think that'd be a lot less confusing.
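
For illustration, with invented names (the callbacks mirror a few of
the function pointers ArchiveHandle already carries):

/* One constant dispatch table per archive format. */
typedef struct ArchiveMethod
{
    void        (*ArchiveEntry) (ArchiveHandle *AH, TocEntry *te);
    void        (*StartData) (ArchiveHandle *AH, TocEntry *te);
    size_t      (*WriteData) (ArchiveHandle *AH, const void *data, size_t dLen);
    void        (*EndData) (ArchiveHandle *AH, TocEntry *te);
    void        (*CloseArchive) (ArchiveHandle *AH);
} ArchiveMethod;

static const ArchiveMethod directoryMethod = {
    _ArchiveEntry, _StartData, _WriteData, _EndData, _CloseArchive
};

/* ArchiveHandle would then carry just: const ArchiveMethod *method; */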

But the immediate problem is that pg_dump.c is heavily reliant on
global variables, which isn't going to fly if we want this code to use
threads on Windows (or anywhere else).  It's also bad style.  So I
suggest that we refactor pg_dump.c to get rid of g_conn and g_fout.
Getting rid of g_fout should be fairly easy: in many places, we're
already passing Archive *fout as a parameter.  If we pass it as a
parameter to the functions that need it but aren't yet getting it as a
parameter, then it can cease to exist as a global variable and become
local to main().
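
The transformation is mostly mechanical; for example:

/* Before: every dumper routine consults the global. */
static Archive *g_fout;
static void dumpNamespace(NamespaceInfo *nspinfo);

/*
 * After: the Archive travels as an explicit parameter and the global
 * disappears; main() holds the only reference.
 */
static void dumpNamespace(Archive *fout, NamespaceInfo *nspinfo);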

Getting rid of g_conn is a little harder.  Joachim's patch takes the
view that we can safely overload the existing ArchiveHandle.connection
member.  Currently, that member is the connection to which we are
doing a direct to database restore; what he's proposing to do is also
use it to store the connection from which are doing the dump.  But it
seems possible that we might someday want to dump from one database
and restore into another database at the same time, so maybe we ought
to play it safe and use different variables for those things.  So I'm
inclined to think we could just add a "PGconn *remote_connection"
argument to the Archive structure (to go with all of the similarly
named "remote" fields, all of which describe the DB to be dumped) and
then that would be available everywhere that the Archive itself is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> But the immediate problem is that pg_dump.c is heavily reliant on
> global variables, which isn't going to fly if we want this code to use
> threads on Windows (or anywhere else).  It's also bad style.  So I
> suggest that we refactor pg_dump.c to get rid of g_conn and g_fout.

I've looked at that a bit in the past and decided that the notational
overhead would be too much.  OTOH, if we're going to be forced into it
in order to support parallel pg_dump, we might as well do it first in a
separate patch.

> ...  But it
> seems possible that we might someday want to dump from one database
> and restore into another database at the same time, so maybe we ought
> to play it safe and use different variables for those things.  So I'm
> inclined to think we could just add a "PGconn *remote_connection"
> field to the Archive structure (to go with all of the similarly
> named "remote" fields, all of which describe the DB to be dumped) and
> then that would be available everywhere that the Archive itself is.

I always thought that the "remote" terminology was singularly unhelpful.
It implies there's a "local" connection floating around somewhere, which
of course there is not, and it does nothing to remind you of whether the
connection leads to a database being dumped or a database being restored
into.  If we are going to have two fields, could we name them something
less opaque, perhaps "src_connection" and "dest_connection"?
        regards, tom lane


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Mon, Jan 30, 2012 at 12:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> But the immediate problem is that pg_dump.c is heavily reliant on
> global variables, which isn't going to fly if we want this code to use
> threads on Windows (or anywhere else).  It's also bad style.

Technically, since most of pg_dump.c dumps the catalog, and since this
isn't done in parallel but only in the master process, most functions
need not be changed for the parallel case. Only those that are
called from the worker threads need to be changed; this has been done
in e.g. dumpBlobs(), the change that you quoted upthread.

> But it
> seems possible that we might someday want to dump from one database
> and restore into another database at the same time, so maybe we ought
> to play it safe and use different variables for those things.

Actually I've tried that but in the end concluded that it's best to
have at most one database connection in an ArchiveHandle if you don't
want to do a lot more refactoring. Besides the normal connection
parameters like host, port, ... there's also std_strings, encoding,
savedPassword, currUser/currSchema, lo_buf, remoteVersion, ... where it
wouldn't be obvious which connection they belong to.

Speaking about refactoring, I'm happy to also throw in the idea to
make the dump and restore more symmetrical than they are now. I kinda
disliked RestoreOptions* being a member of the ArchiveHandle without
having something similar for the dump. Ideally I'd say there should be
a DumpOptions and a RestoreOptions field (with a "struct Connection"
being part of them containing all the different connection
parameters). They could be a union if you wanted to allow only one
connection, or not if you want more than one.


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Tue, Jan 31, 2012 at 12:55 AM, Joachim Wieland <joe@mcknight.de> wrote:
> On Mon, Jan 30, 2012 at 12:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> But the immediate problem is that pg_dump.c is heavily reliant on
>> global variables, which isn't going to fly if we want this code to use
>> threads on Windows (or anywhere else).  It's also bad style.
>
> Technically, since most of pg_dump.c dumps the catalog, and since this
> isn't done in parallel but only in the master process, most functions
> need not be changed for the parallel case. Only those that are
> called from the worker threads need to be changed; this has been done
> in e.g. dumpBlobs(), the change that you quoted upthread.

If we're going to go to the trouble of cleaning this up, I'd prefer to
rationalize it rather than doing just the absolute bare minimum amount
of stuff needed to make it appear to work.

>> But it
>> seems possible that we might someday want to dump from one database
>> and restore into another database at the same time, so maybe we ought
>> to play it safe and use different variables for those things.
>
> Actually I've tried that but in the end concluded that it's best to
> have at most one database connection in an ArchiveHandle if you don't
> want to do a lot more refactoring. Besides the normal connection
> parameters like host, port, ... there's also std_strings, encoding,
> savedPassword, currUser/currSchema, lo_buf, remoteVersion, ... where it
> wouldn't be obvious which connection they belong to.

And just for added fun and excitement, they all have inconsistent
naming conventions and inadequate documentation.

I think if we need more refactoring in order to support multiple
database connections, we should go do that refactoring.  The current
situation is not serving anyone well.

> Speaking about refactoring, I'm happy to also throw in the idea to
> make the dump and restore more symmetrical than they are now. I kinda
> disliked RestoreOptions* being a member of the ArchiveHandle without
> having something similar for the dump. Ideally I'd say there should be
> a DumpOptions and a RestoreOptions field (with a "struct Connection"
> being part of them containing all the different connection
> parameters). They could be a union if you wanted to allow only one
> connection, or not if you want more than one.

I like the idea of a struct Connection.  I think that could make a lot of sense.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Tue, Jan 31, 2012 at 9:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> And just for added fun and excitement, they all have inconsistent
> naming conventions and inadequate documentation.
>
> I think if we need more refactoring in order to support multiple
> database connections, we should go do that refactoring.  The current
> situation is not serving anyone well.

I guess I'd find it cleaner to have just one connection per Archive
(or ArchiveHandle). If you need two connections, why not just have two
Archive objects, as they would have different characteristics anyway,
one for dumping data, one to restore.


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Tue, Jan 31, 2012 at 4:46 PM, Joachim Wieland <joe@mcknight.de> wrote:
> On Tue, Jan 31, 2012 at 9:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> And just for added fun and excitement, they all have inconsistent
>> naming conventions and inadequate documentation.
>>
>> I think if we need more refactoring in order to support multiple
>> database connections, we should go do that refactoring.  The current
>> situation is not serving anyone well.
>
> I guess I'd find it cleaner to have just one connection per Archive
> (or ArchiveHandle). If you need two connections, why not just have two
> Archive objects, as they would have different characteristics anyway,
> one for dumping data, one to restore.

I think we're more-or-less proposing to rename "Archive" to
"Connection", aren't we?

And then ArchiveHandle can store all the things that aren't related to
a specific connection.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Wed, Feb 1, 2012 at 12:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think we're more-or-less proposing to rename "Archive" to
> "Connection", aren't we?
>
> And then ArchiveHandle can store all the things that aren't related to
> a specific connection.

How about something like this:
(Hopefully you'll come up with better names...)

StateHandle {
   Connection
   Schema
   Archive
   Methods
   union {
      DumpOptions
      RestoreOptions
   }
}

Dumping would mean to do:
 Connection -> Schema -> Archive using DumpOptions through the
specified Methods

Restore:
  Archive -> Schema -> Connection using RestoreOptions through the
specified Methods

Dumping from one database and restoring into another one would be two
StateHandles with different Connections, Archive == NULL, Schema
pointing to the same Schema, Methods most likely also pointing to the
same function pointer table and each with different Options in the
union of course.

Granted, you could say that above I've only grouped the elements of
the ArchiveHandle, but I don't really see that breaking it up into
several objects makes it any better or easier. There could be some
accessor functions to hide the details of the whole object from the
different consumers.


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Thu, Feb 2, 2012 at 8:31 PM, Joachim Wieland <joe@mcknight.de> wrote:
> On Wed, Feb 1, 2012 at 12:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think we're more-or-less proposing to rename "Archive" to
>> "Connection", aren't we?
>>
>> And then ArchiveHandle can store all the things that aren't related to
>> a specific connection.
>
> How about something like that:
> (Hopefully you'll come up with better names...)
>
> StateHandle {
>   Connection
>   Schema
>   Archive
>   Methods
>   union {
>      DumpOptions
>      RestoreOptions
>   }
> }
>
> Dumping would mean to do:
>
>  Connection -> Schema -> Archive using DumpOptions through the
> specified Methods
>
> Restore:
>
>   Archive -> Schema -> Connection using RestoreOptions through the
> specified Methods
>
> Dumping from one database and restoring into another one would be two
> StateHandles with different Connections, Archive == NULL, Schema
> pointing to the same Schema, Methods most likely also pointing to the
> same function pointer table and each with different Options in the
> union of course.
>
> Granted, you could say that above I've only grouped the elements of
> the ArchiveHandle, but I don't really see that breaking it up into
> several objects makes it any better or easier. There could be some
> accessor functions to hide the details of the whole object from the
> different consumers.

I'm not sure I understand what the various structures would contain.

My gut feeling for how to begin grinding through this is to go through
and do the following:

1. Rename Archive to Connection.
2. Add a PGconn object to it.
3. Change the declaration inside ArchiveHandle from "Archive public"
to "Connection source_connection".

I think that'd get us significantly closer to sanity and be pretty
simple to understand, and then we can take additional passes over it
until we're happy with what we've got.

If you're OK with that much I'll go do it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Fri, Feb 3, 2012 at 7:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> If you're OK with that much I'll go do it.

Sure, go ahead!


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Fri, Feb 3, 2012 at 10:43 AM, Joachim Wieland <joe@mcknight.de> wrote:
> On Fri, Feb 3, 2012 at 7:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> If you're OK with that much I'll go do it.
>
> Sure, go ahead!

It turns out that (as you anticipated) there are some problems with my
previous proposal.  For one thing, an Archive isn't really just a
connection, because it's also used as a data sink by passing it to
functions like ArchiveEntry().  For another, as you probably
realized but I failed to, ConnectDatabase() is already setting
AH->connection to the same PGconn it returns, so the idea that we can
potentially have multiple connection objects using the existing
framework is not really true; or at least it's going to require more
work than I originally thought.  I think it might still be worth doing
at some point, but I think we probably need to clean up some of the
rest of this mess first.

I've now rejiggered things so that the Archive is passed down to
everything that needs it, so the global variable g_fout is gone.  I've
also added a couple of additional accessors to pg_backup_db.c so that
most places that issue queries can simply use those routines without
needing to peek under the hood into the ArchiveHandle.  This is not
quite enough to get rid of g_conn, but it's close: the major stumbling
block at this point is probably exit_nicely().  The gyrations we're
going through to make sure that AH->connection gets closed before
exiting are fairly annoying; maybe we should invent something in
dumputils.c along the line of the backend's on_shmem_exit().
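
I'm imagining something along these lines (only a sketch of the idea):

#define MAX_ON_EXIT_NICELY  20

typedef void (*on_exit_nicely_callback) (int code, void *arg);

static struct
{
    on_exit_nicely_callback function;
    void       *arg;
}           on_exit_nicely_list[MAX_ON_EXIT_NICELY];
static int  on_exit_nicely_index = 0;

/* Register a callback to be run by exit_nicely(), last-in first-out. */
void
on_exit_nicely(on_exit_nicely_callback function, void *arg)
{
    if (on_exit_nicely_index >= MAX_ON_EXIT_NICELY)
        abort();                /* sketch: report the error properly */
    on_exit_nicely_list[on_exit_nicely_index].function = function;
    on_exit_nicely_list[on_exit_nicely_index].arg = arg;
    on_exit_nicely_index++;
}

void
exit_nicely(int code)
{
    while (--on_exit_nicely_index >= 0)
        on_exit_nicely_list[on_exit_nicely_index].function(code,
                            on_exit_nicely_list[on_exit_nicely_index].arg);
    exit(code);
}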

I'm starting to think it might make sense to press on with this
refactoring just a bit further and eliminate the distinction between
Archive and ArchiveHandle.  Given a few more accessors (and it really
looks like it would only be a few), pg_dump.c could treat an
ArchiveHandle as an opaque struct, which would accomplish more or less
the same thing as the current design, but in a less confusing fashion
- i.e. without giving the reader the idea that the author desperately
wishes he were coding in C++.  That would allow simplification in a
number of other places; just to take one example, we wouldn't need both
appendStringLiteralAH and appendStringLiteralAHX.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> most places that issue queries can simply use those routines without
> needing to peek under the hood into the ArchiveHandle.  This is not
> quite enough to get rid of g_conn, but it's close: the major stumbling
> block at this point is probably exit_nicely().  The gyrations we're
> going through to make sure that AH->connection gets closed before
> exiting are fairly annoying; maybe we should invent something in
> dumputils.c along the line of the backend's on_shmem_exit().

Do we actually care about closing the connection?  Worst case is that
the backend logs an "unexpected EOF" message.  But yeah, an atexit hook
might be the easiest solution.
        regards, tom lane


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Tue, Feb 7, 2012 at 4:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> It turns out that (as you anticipated) there are some problems with my
> previous proposal.

I assume you're talking to Tom, as my powers of anticipation are
actually quite limited... :-)

> This is not
> quite enough to get rid of g_conn, but it's close: the major stumbling
> block at this point is probably exit_nicely().  The gyrations we're
> going through to make sure that AH->connection gets closed before
> exiting are fairly annoying; maybe we should invent something in
> dumputils.c along the line of the backend's on_shmem_exit().

Yeah, this becomes even more important with parallel jobs, where you
want all worker processes to die once the parent exits. Otherwise some
ten already-started processes would continue to dump your ten largest
tables for the next few hours with the master process long dead... all
while you're about to start up the next master process...

In my patch I dealt with exactly the same problem for the error
handler by creating a separate function that has a static variable (a
pointer to the ParallelState). The value is set and retrieved through
the same function, so yes, it's kinda global but then again it can
only be accessed from this function, which is only called from the
error handler.


> I'm starting to think it might make sense to press on with this
> refactoring just a bit further and eliminate the distinction between
> Archive and ArchiveHandle.

How about doing more refactoring after applying the patch? You'd then
see what is really needed, and then we'd also have an actual use case
for more than one connection. (You might have already guessed that this
proposal is heavily influenced by my self-interest in avoiding too
much work to make my patch match your refactoring...)


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Tue, Feb 7, 2012 at 10:21 PM, Joachim Wieland <joe@mcknight.de> wrote:
> On Tue, Feb 7, 2012 at 4:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It turns out that (as you anticipated) there are some problems with my
>> previous proposal.
>
> I assume you're talking to Tom, as my powers of anticipation are
> actually quite limited... :-)

No, I was talking to you, actually.  You earlier suggested that an
ArchiveHandle was only meant to contain a single PGconn, and it seems
you're right.  We can refactor that assumption away, but not
trivially.

> In my patch I dealt with exactly the same problem for the error
> handler by creating a separate function that has a static variable (a
> pointer to the ParallelState). The value is set and retrieved through
> the same function, so yes, it's kinda global but then again it can
> only be accessed from this function, which is only called from the
> error handler.

How did you make this thread-safe?

>> I'm starting to think it might make sense to press on with this
>> refactoring just a bit further and eliminate the distinction between
>> Archive and ArchiveHandle.
>
> How about doing more refactoring after applying the patch? You'd then
> see what is really needed, and then we'd also have an actual use case
> for more than one connection. (You might have already guessed that this
> proposal is heavily influenced by my self-interest in avoiding too
> much work to make my patch match your refactoring...)

Well, I don't think your patch is going to be too heavily affected
either way, because most of your changes were not in pg_dump.c, and
the need for any changes at all in pg_dump.c should rapidly be going
away.  At any rate, the problem with stopping here is that g_conn is
still floating around, and one way or the other we've got to get rid
of it if you want to have more than one ArchiveHandle floating around
at a time.  So we at least need to press on far enough to get to that
point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Wed, Feb 8, 2012 at 1:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> In my patch I dealt with exactly the same problem for the error
>> handler by creating a separate function that has a static variable (a
>> pointer to the ParallelState). The value is set and retrieved through
>> the same function, so yes, it's kinda global but then again it can
>> only be accessed from this function, which is only called from the
>> error handler.
>
> How did you make this thread-safe?

The ParallelState has a ParallelSlot for each worker process which
contains that worker process's thread id. So a worker process just
walks through the slots until it finds its own thread id and then goes
from there.
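
In code it's just a linear scan, roughly (member names approximate):

static ParallelSlot *
GetMyPSlot(ParallelState *pstate)
{
    int         i;

    for (i = 0; i < pstate->numWorkers; i++)
#ifdef WIN32
        if (pstate->parallelSlot[i].threadId == GetCurrentThreadId())
#else
        if (pstate->parallelSlot[i].pid == getpid())
#endif
            return &(pstate->parallelSlot[i]);

    return NULL;
}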

In particular, it only gets the file descriptors to and from the
parent from this structure, to communicate the error it encountered,
but it doesn't get the PGconn. This is because that error handler is
only called when a worker calls die_horribly(AH, ...) in which case
the connection is already known through AH.

Termination via a signal just sets a variable that is checked in the
I/O routines and there we also have AH to shut down the connection
(actually to call die_horribly()).


> So we at least need to press on far enough to get to that point.

Sure, let me know if I can help you with something.


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Wed, Feb 8, 2012 at 7:56 PM, Joachim Wieland <joe@mcknight.de> wrote:
>> So we at least need to press on far enough to get to that point.
>
> Sure, let me know if I can help you with something.

Alright.  I think (hope) that I've pushed this far enough to serve the
needs of parallel pg_dump.  The error handling is still pretty grotty
and might need a bit more surgery to handle the parallel case, but I
think that making this actually not ugly will require eliminating the
Archive/ArchiveHandle distinction, which is probably a good thing to
do but, as you point out, maybe not the first priority right now.

Can you provide an updated patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Thu, Feb 16, 2012 at 1:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Can you provide an updated patch?

Robert, updated patch is attached.

Attachment

Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Thu, Feb 23, 2012 at 11:37 PM, Joachim Wieland <joe@mcknight.de> wrote:
> On Thu, Feb 16, 2012 at 1:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Can you provide an updated patch?
>
> Robert, updated patch is attached.

Well, I was hoping someone else would do some work on this, but here
we are.  Some more comments from me:

+                       /*
+                        * If the version is lower and we don't have synchronized snapshots
+                        * yet, we will error out earlier already. So either we have the
+                        * feature or the user has given the explicit command not to use it.
+                        * Note: If we have it, we always use it, you cannot switch it off
+                        * then.
+                        */

I don't agree with this design decision.  I think that
--no-synchronized-snapshots should be available even on server
versions that support it.

+       if (archiveFormat != archDirectory && numWorkers > 1) {

Style.

-                        const char *owner, bool withOids,
+                        const char *owner,
+                        unsigned long int relpages, bool withOids,

The new argument to ArchiveEntry() is unused.  Removing it would
declutter things a good bit.

+#include "../../backend/port/pipe.c"

This seems really ugly.  I suggest that we commit a separate patch to
move src/backend/port/pipe.c to src/port and refactor the interface so
that it has a signature something like this:

int win32_setup_pipe(int handles[2], char **error_string, int *error_code);

The backend can have a wrapper function around this that calls ereport
using the error_string and error_code, and any front-end code that
wants to use this can do so directly.
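
In other words, the backend wrapper could be as thin as this sketch:

/* Backend-side wrapper; frontend callers use win32_setup_pipe() directly. */
void
pgpipe(int handles[2])
{
    char       *error_string;
    int         error_code;

    if (win32_setup_pipe(handles, &error_string, &error_code) < 0)
        ereport(FATAL,
                (errmsg_internal("%s: error code %d",
                                 error_string, error_code)));
}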

+/*
+ * The parallel error handler is called for any die_horribly() in a child or master process.
+ * It then takes control over shutting down the rest of the gang.
+ */

I think this needs to be revised to take control in exit_nicely(),
maybe by using on_exit_nicely().  Trapping die_horribly() won't catch
everything.

+               if (msgsize == 0 && !do_wait) {
+                       setnonblocking(fd);
+               }

Style.

+               if (msg[msgsize] == '\0') {
+                       return msg;
+               }

Style.

I find myself somewhat uncomfortable with the extent to which this is
relying on being able to set blocking and nonblocking flags on and off
repeatedly.  Maybe there's no real problem with it except for being
inefficient, but the way I'd expect readMessageFromPipe() to
be written is: Always leave the pipe in blocking mode.  If you want a
non-blocking read, then use select() first to check whether it's ready
for read; if not, just do the read directly.  Actually, looking
further, it seems that you already have such logic in
getMessageFromWorker(), so I'm unclear why do_wait needs to be passed
down to readMessageFromPipe() at all.

+ * Hence this function returns an (unsigned) int.
+ */
+static int

It doesn't look unsigned?

+extern const char *fmtQualifiedId(struct Archive *fout,
+                                  const char *schema, const char *id);

I don't think we want to expose struct Archive to dumputils.h.  Can we
find a different place to put this?

+enum escrow_action { GET, SET };
+static void
+parallel_error_handler_escrow_data(enum escrow_action act,
ParallelState *pstate)
+{
+       static ParallelState *s_pstate = NULL;
+
+       if (act == SET)
+               s_pstate = pstate;
+       else
+               *pstate = *s_pstate;
+}

This seems like a mighty complicated way to implement a global variable.

+#ifdef HAVE_SETSID
+                       /*
+                        * If we can, we try to make each process the leader of its own
+                        * process group. The reason is that if you hit Ctrl-C and they are
+                        * all in the same process group, any termination sequence is
+                        * possible, because every process will receive the signal. What
+                        * often happens is that a worker receives the signal, terminates
+                        * and the master detects that one of the workers had a problem,
+                        * even before acting on its own signal. That's still okay because
+                        * everyone still terminates but it looks a bit weird.
+                        *
+                        * With setsid() however, a Ctrl-C is only sent to the master and
+                        * he can then cascade it to the worker processes.
+                        */
+                       setsid();
+#endif

This doesn't seem like a very good idea, because if the master fails
to propagate the ^C to all the workers for any reason, then the user
will have to hunt them down manually and kill them.  Or imagine that
someone hits ^Z, for example.  I think the right thing to do is to
have the workers catch the signal and handle it by terminating,
passing along a notification to the master that they were terminated
by user action; then the master can DTRT.

+typedef struct {

Style.

There's probably more, but the non-PostgreSQL part of my life is calling me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Sat, Mar 10, 2012 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> -                        const char *owner, bool withOids,
> +                        const char *owner,
> +                        unsigned long int relpages, bool withOids,
>
> The new argument to ArchiveEntry() is unused.  Removing it would
> declutter things a good bit.

How do you mean it's unused? pg_dump_sort.c uses relpages to dump the
largest tables first. What you don't want to see in a parallel dump is
a worker starting to dump a large table while everybody else is
already idle...


> The backend can have a wrapper function around this that calls ereport
> using the error_string and error_code, and any front-end code that
> wants to use this can do so directly.

I tried this actually (patch attached) but then I wanted to test it
and couldn't find anything that used pgpipe() on Windows.

pg_basebackup/pg_basebackup.c is using it but it's in an #ifndef WIN32
and the same is true for postmaster/syslogger.c. Am I missing
something or has this Windows implementation become stale by now? I'll
append the patch but haven't adapted the pg_dump patch yet to use it.
Should we still go forward the way you proposed?


> +/*
> + * The parallel error handler is called for any die_horribly() in a child or master process.
> + * It then takes control over shutting down the rest of the gang.
> + */
>
> I think this needs to be revised to take control in exit_nicely(),
> maybe by using on_exit_nicely().  Trapping die_horribly() won't catch
> everything.

It's actually not designed to catch everything. This whole error
handler thing is only there to report a single error to the user which
is hopefully the root cause of why everybody is shutting down. Assume
for example that we cannot get a lock on one table in a worker. Then
the worker would die_horribly() saying that it cannot get a lock. The
master would receive that message and shut down. Shutting down for the
master means killing all the other workers.

The master terminates because a worker died. And all the other workers
die because the master killed them. Yet the root cause for the
termination was the fact that one of the workers couldn't get a lock,
and this is the one and only message that the user should see.

If a child terminates without leaving a message, the master will still
detect it and just say "a worker process died unexpectedly" (this part
was actually broken, but now it's fixed :-) )


> Actually, looking
> further, it seems that you already have such logic in
> getMessageFromWorker(), so I'm unclear why do_wait needs to be passed
> down to readMessageFromPipe() at all.

That was a very good catch! Thanks!


> +extern const char *fmtQualifiedId(struct Archive *fout,
> +                                  const char *schema, const char *id);
>
> I don't think we want to expose struct Archive to dumputils.h.  Can we
> find a different place to put this?

Right now it's there because fmtId() is there.
fmtQualifiedId is used by dumputils.c, parallel.c and pg_dump.c,
making the headers dumputils.h, parallel.h and pg_dump.h obvious
candidates for the prototype. parallel.h doesn't make much sense. We
can put it in pg_dump.h if you think that's better, but then we should
also move fmtId() and fmtQualifiedId() into pg_dump.c...

Or we change fmtQualifiedId to take an int, so that we always pass the
archive version instead of the Archive*?
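
The declaration would then presumably be just:

extern const char *fmtQualifiedId(int remoteVersion,
                                  const char *schema, const char *id);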


> +enum escrow_action { GET, SET };
> +static void
> +parallel_error_handler_escrow_data(enum escrow_action act,
> ParallelState *pstate)
> +{
> +       static ParallelState *s_pstate = NULL;
> +
> +       if (act == SET)
> +               s_pstate = pstate;
> +       else
> +               *pstate = *s_pstate;
> +}
>
> This seems like a mighty complicated way to implement a global variable.

Well, we talked about that before, when you complained that you
couldn't get rid of the global g_conn because of the exit handler.
You're right that in fact it is an indirect global variable here but
it's clearly limited to the use of the error handler and you can be
sure that nobody other than this function writes to it or accesses it
without calling this function.

If you want to make the ParallelState global then the immediate next
question is why would you still pass it around as an argument to the
various functions and not just access the global variable instead from
everywhere...


(I have accepted and implemented all other proposals that I didn't
comment on here)


Joachim

Attachment

Re: patch for parallel pg_dump

From
Tom Lane
Date:
Joachim Wieland <joe@mcknight.de> writes:
> On Sat, Mar 10, 2012 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> -                        const char *owner, bool withOids,
>> +                        const char *owner,
>> +                        unsigned long int relpages, bool withOids,
>> 
>> The new argument to ArchiveEntry() is unused.  Removing it would
>> declutter things a good bit.

> How do you mean it's unused? pg_dump_sort.c uses relpages to dump the
> largest tables first. What you don't want to see in a parallel dump is
> a worker starting to dump a large table while everybody else is
> already idle...

Used or not, I think you could find a less ugly and less invasive way to
pass that around than this.  We should try to avoid adding arguments to
ArchiveEntry that apply to only one object type.

(I'm also unconvinced that sorting by relation size is a good idea
anyway.  Anything that makes the dump order less predictable gets
push-back, IME.)
        regards, tom lane


Re: patch for parallel pg_dump

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (I'm also unconvinced that sorting by relation size is a good idea
> anyway.  Anything that makes the dump order less predictable gets
> push-back, IME.)

Given that people often use diff on files from pg_dump,
unpredictable ordering can be a bad thing.  On the other hand, that
is not something you would probably want to do with the output of a
*parallel* dump, so if it only affects that, it probably makes sense.
It seems like a reasonable heuristic for avoiding the case where all
but one big table are done and we have to wait for that table while
the other processors are sitting idle.

-Kevin


Re: patch for parallel pg_dump

From
Andres Freund
Date:
On Tuesday, March 13, 2012 02:48:11 PM Tom Lane wrote:
> (I'm also unconvinced that sorting by relation size is a good idea
> anyway.  Anything that makes the dump order less predictable gets
> push-back, IME.)
Why? Especially in the directory format - which is a prerequisite for parallel
dump if I remember this correctly - I don't really see a negative point in a
slightly changing dump order, given it's not deterministic anyway.

Andres


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Mon, Mar 12, 2012 at 11:35 PM, Joachim Wieland <joe@mcknight.de> wrote:
> How do you mean it's unused? pg_dump_sort.c uses relpages to dump the
> largest tables first. What you don't want to see in a parallel dump is
> a worker starting to dump a large table while everybody else is
> already idle...

What I mean is that the function ArchiveEntry() is defined in
pg_backup_archiver.c, and it takes an argument called relpages, and
the string "relpages" does not appear anywhere else in that file.

>> The backend can have a wrapper function around this that calls ereport
>> using the error_string and error_code, and any front-end code that
>> wants to use this can do so directly.
>
> I tried this actually (patch attached) but then I wanted to test it
> and couldn't find anything that used pgpipe() on Windows.
>
> pg_basebackup/pg_basebackup.c is using it but it's in an #ifndef WIN32
> and the same is true for postmaster/syslogger.c. Am I missing
> something or has this Windows implementation become stale by now? I'll
> append the patch but haven't adapted the pg_dump patch yet to use it.
> Should we still go forward the way you proposed?

Dunno.  Can we get an opinion on that from one of the Windows guys?
Andrew, Magnus?

>> +/*
>> + * The parallel error handler is called for any die_horribly() in a child or master process.
>> + * It then takes control over shutting down the rest of the gang.
>> + */
>>
>> I think this needs to be revised to take control in exit_nicely(),
>> maybe by using on_exit_nicely().  Trapping die_horribly() won't catch
>> everything.
>
> It's actually not designed to catch everything. This whole error
> handler thing is only there to report a single error to the user which
> is hopefully the root cause of why everybody is shutting down. Assume
> for example that we cannot get a lock on one table in a worker. Then
> the worker would die_horribly() saying that it cannot get a lock. The
> master would receive that message and shut down. Shutting down for the
> master means killing all the other workers.
>
> The master terminates because a worker died. And all the other workers
> die because the master killed them. Yet the root cause for the
> termination was the fact that one of the workers couldn't get a lock,
> and this is the one and only message that the user should see.
>
> If a child terminates without leaving a message, the master will still
> detect it and just say "a worker process died unexpectedly" (this part
> was actually broken, but now it's fixed :-) )

All that may be true, but I still don't see why it's right for this to
apply in the cases where the worker thread says die_horribly(), but
not in the cases where the worker says exit_horribly().

> Or we change fmtQualifiedId to take an int, so that we always pass the
> archive version instead of the Archive*?

Hmm, I think that might make sense.

>> +enum escrow_action { GET, SET };
>> +static void
>> +parallel_error_handler_escrow_data(enum escrow_action act,
>> ParallelState *pstate)
>> +{
>> +       static ParallelState *s_pstate = NULL;
>> +
>> +       if (act == SET)
>> +               s_pstate = pstate;
>> +       else
>> +               *pstate = *s_pstate;
>> +}
>>
>> This seems like a mighty complicated way to implement a global variable.
>
> Well, we talked about that before, when you complained that you
> couldn't get rid of the global g_conn because of the exit handler.
> You're right that in fact it is an indirect global variable here but
> it's clearly limited to the use of the error handler and you can be
> sure that nobody other than this function writes to it or accesses it
> without calling this function.

Sure, but since all the function does is write to it or access it,
what good does that do me?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Andrew Dunstan
Date:

On 03/13/2012 01:53 PM, Robert Haas wrote:
>>
>> I tried this actually (patch attached) but then I wanted to test it
>> and couldn't find anything that used pgpipe() on Windows.
>>
>> pg_basebackup/pg_basebackup.c is using it but it's in an #ifndef WIN32
>> and the same is true for postmaster/syslogger.c. Am I missing
>> something or has this Windows implementation become stale by now? I'll
>> append the patch but haven't adapted the pg_dump patch yet to use it.
>> Should we still go forward the way you proposed?
> Dunno.  Can we get an opinion on that from one of the Windows guys?
> Andrew, Magnus?
>
>

I haven't had time to review this patch or even follow all the 
discussion as I was hoping. I'll try to review the whole thing shortly.

cheers

andrew


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Tue, Mar 13, 2012 at 9:59 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> (I'm also unconvinced that sorting by relation size is a good idea
>> anyway.  Anything that makes the dump order less predictable gets
>> push-back, IME.)
>
> Given that people often use diff on files from pg_dump,
> unpredictable ordering can be a bad thing.  On the other hand, that
> is not something you would probably want to do with the output of a
> *parallel* dump, so if it only affects that, it probably makes sense.
> It seems like a reasonable heuristic to avoid the case where everything
> but some big table is done, and we have to wait for it while the other
> processors sit idle.

Yeah, I think it's a good heuristic.  Finishing the dump in the
minimum possible time is sufficiently similar to the knapsack problem
as to make me suspect that there is no easy way to be certain of
getting the optimal dump order (and we don't even have perfect
information, since we know little about the characteristics of the
underlying storage).  But dumping tables in descending order figures
to get many easy cases right - e.g. suppose there are 200 tables of
size X and 1 table of size 100*X, and we have 3 workers to play with.
If we dump the tables in an essentially random order (relative to
size) then the overall time will get longer the more little tables we
dump before we start the big one.

Now, if we have tables of sizes 10*X, 9*X, 8*X, 6*X, and 5*X and two
workers, then the first worker will get the 10*X table, the second
worker will get the 9*X table, then the second worker will start the
8*X table, then the first worker will get the 6*X and 5*X tables and,
assuming dump time is a uniform function of table size, we'll finish
after 21 time units.  Had we been smarter, we could have assigned the
9*X, 6*X, and 5*X tables to one worker and the 10*X and 8*X tables to
the other and finished in just 20 time units.  There's probably a way
to construct a more extreme example of this, but I think in practice
if there's any loss due to this kind of effect it will be small, and
descending-size order certainly seems more likely to be right than
leaving it to chance.
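
To make that arithmetic easy to check, here's a tiny stand-alone
simulation (illustrative code only, not from the patch) of the greedy
give-the-next-table-to-the-first-free-worker strategy on that example:

#include <stdio.h>

#define MAX_WORKERS 8

/* Finish time when each table goes to the currently least-loaded worker. */
static int
greedy_makespan(const int *sizes, int ntables, int nworkers)
{
    int     load[MAX_WORKERS] = {0};
    int     i, j, makespan = 0;

    for (i = 0; i < ntables; i++)
    {
        int     w = 0;

        for (j = 1; j < nworkers; j++)  /* find the least-loaded worker */
            if (load[j] < load[w])
                w = j;
        load[w] += sizes[i];            /* sizes come sorted descending */
    }
    for (j = 0; j < nworkers; j++)
        if (load[j] > makespan)
            makespan = load[j];
    return makespan;
}

int
main(void)
{
    int     sizes[] = {10, 9, 8, 6, 5};

    /* prints 21; the hand-picked split {10,8} / {9,6,5} finishes at 20 */
    printf("greedy makespan: %d\n", greedy_makespan(sizes, 5, 2));
    return 0;
}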

A bigger problem is that dumping relations A and B at the same time might
involve a lot more I/O contention than dumping relations A and C at
the same time if, say, A and B are on the same tablespace and C is
not.  I have no idea what to do about that in general, but for a first
version of this feature I think it's fine to punt.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Tue, Mar 13, 2012 at 1:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> What I mean is that the function ArchiveEntry() is defined in
> pg_backup_archiver.c, and it takes an argument called relpages, and
> the string "relpages" does not appear anywhere else in that file.

Uhm, that's kinda concerning, isn't it... fixed...


[...pgpipe...]
> Dunno.  Can we get an opinion on that from one of the Windows guys?
> Andrew, Magnus?

Waiting for the verdict here...


>> If a child terminates without leaving a message, the master will still
>> detect it and just say "a worker process died unexpectedly" (this part
>> was actually broken, but now it's fixed :-) )
>
> All that may be true, but I still don't see why it's right for this to
> apply in the cases where the worker thread says die_horribly(), but
> not in the cases where the worker says exit_horribly().

Hm, I'm not calling the error handler from exit_horribly because it
doesn't have the AH. It looks like the code assumes that
die_horribly() is called whenever AH is available; if not,
exit_horribly() should be called, which eventually calls the
preregistered exit hooks via exit_nicely() to clean up the connection.

I think we should somehow unify both functions; the code is not very
consistent in this respect, as it also calls exit_horribly() even when
it has AH available. See for example pg_backup_tar.c.

Or is there another distinction between them? The question of how to
clean this up basically brings us back to the question of what to do
about global variables in general, and for error handlers in particular.


>> Or we change fmtQualifiedId to take an int and then we always pass the
>> archive version instead of the Archive* ?
>
> Hmm, I think that might make sense.

Done.
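
For the record, the new signature looks roughly like this (a sketch of
the idea, not necessarily the patch's exact code; the PQExpBuffer
routines come from pqexpbuffer.h):

const char *
fmtQualifiedId(int remoteVersion, const char *schema, const char *id)
{
    static PQExpBuffer id_return = NULL;

    if (id_return)              /* reuse the buffer across calls */
        resetPQExpBuffer(id_return);
    else
        id_return = createPQExpBuffer();

    /* suppress the schema name when dumping from a pre-7.3 server */
    if (remoteVersion >= 70300 && schema && *schema)
        appendPQExpBuffer(id_return, "%s.", fmtId(schema));
    appendPQExpBuffer(id_return, "%s", fmtId(id));

    return id_return->data;
}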


>>> +enum escrow_action { GET, SET };
>>> +static void
>>> +parallel_error_handler_escrow_data(enum escrow_action act,
>>> ParallelState *pstate)
>>> +{
>>> +       static ParallelState *s_pstate = NULL;
>>> +
>>> +       if (act == SET)
>>> +               s_pstate = pstate;
>>> +       else
>>> +               *pstate = *s_pstate;
>>> +}
>>>
>>> This seems like a mighty complicated way to implement a global variable.
>>
>> Well, we talked about that before, when you complained that you
>> couldn't get rid of the global g_conn because of the exit handler.
>> You're right that in fact it is an indirect global variable here but
>> it's clearly limited to the use of the error handler and you can be
>> sure that nobody other than this function writes to it or accesses it
>> without calling this function.
>
> Sure, but since all the function does is write to it or access it,
> what good does that do me?

It encapsulates the variable so that it can only be used for one
specific use case.

Attaching a new version.

Attachment

Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Wed, Mar 14, 2012 at 12:34 AM, Joachim Wieland <joe@mcknight.de> wrote:
>>> If a child terminates without leaving a message, the master will still
>>> detect it and just say "a worker process died unexpectedly" (this part
>>> was actually broken, but now it's fixed :-) )
>>
>> All that may be true, but I still don't see why it's right for this to
>> apply in the cases where the worker thread says die_horribly(), but
>> not in the cases where the worker says exit_horribly().
>
> Hm, I'm not calling the error handler from exit_horribly because it
> doesn't have the AH. It looks like the code assumes that
> die_horribly() is called whenever AH is available and if not,
> exit_horribly() should be called which eventually calls these
> preregistered exit-hooks via exit_nicely() to clean up the connection.
>
> I think we should somehow unify both functions, the code is not very
> consistent in this respect, it also calls exit_horribly() when it has
> AH available. See for example pg_backup_tar.c

I think we should get rid of die_horribly(), and instead arrange
to always clean up AH via an on_exit_nicely hook.
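
Sketch of what I mean, assuming the stacking API in dumputils.c (an
on_exit_nicely() that takes a callback plus a void * argument):

/* cleanup hook: close the archive's connection on any exit path */
static void
archive_close_connection(int code, void *arg)
{
    ArchiveHandle *AH = (ArchiveHandle *) arg;

    if (AH->connection)
        PQfinish(AH->connection);
}

/* registered once, instead of threading AH through die_horribly() */
on_exit_nicely(archive_close_connection, AH);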

>> Sure, but since all the function does is write to it or access it,
>> what good does that do me?
>
> It encapsulates the variable so that it can only be used for one
> specific use case.

Seems pointless to me.

+       /*
+        * This is a data dumper routine, executed in a child for parallel backup,
+        * so it must not access the global g_conn but AH->connection instead.
+        */

There's no g_conn any more.  This and several other references to it
should be updated or expunged.

+       {
+               write_msg(NULL, "parallel backup only supported by the directory
+               exit(1);
+       }

I think this should exit_horribly() with that message.  It definitely
can't use exit() rather than exit_nicely(); more generally, every copy
of exit() that you've added here should call exit_nicely() instead, or use
some higher-level routine like exit_horribly().

+                       write_msg(NULL, "No synchronized snapshots available in
+                                                "You might have to run with --n
+                       exit(1);

In addition to the previous problem, what do you mean by "might"?  The
real problem is that on pre-9.2 versions multiple jobs are not OK
unless that option is used; I think we should say that more directly.

/*
 * The sequence is the following (for dump, similar for restore):
 *
 * Master                                   Worker
 *
 *                                          enters WaitForCommands()
 * DispatchJobForTocEntry(...te...)
 *
 * [ Worker is IDLE ]
 *
 * arg = (MasterStartParallelItemPtr)()
 * send: DUMP arg
 *                                          receive: DUMP arg
 *                                          str = (WorkerJobDumpPtr)(arg)
 * [ Worker is WORKING ]                    ... gets te from arg ...
 *                                          ... dump te ...
 *                                          send: OK DUMP info
 *
 * In ListenToWorkers():
 *
 * [ Worker is FINISHED ]
 * receive: OK DUMP info
 * status = (MasterEndParallelItemPtr)(info)
 *
 * In ReapWorkerStatus(&ptr):
 * *ptr = status;
 * [ Worker is IDLE ]
 */

I don't find this comment very clear, and would suggest rewriting it
using prose rather than an ASCII diagram.  Note also that any sort of
thing that does look like an ASCII diagram must be surrounded by lines
of dashes within the comment block, or pgindent will make hash of it.
There are a couple of other places where this is an issue as well,
like the comment for ListenToWorkers().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Andrew Dunstan
Date:
On 03/13/2012 02:10 PM, Andrew Dunstan wrote:
>
>
> On 03/13/2012 01:53 PM, Robert Haas wrote:
>>>
>>> I tried this actually (patch attached) but then I wanted to test it
>>> and couldn't find anything that used pgpipe() on Windows.
>>>
>>> pg_basebackup/pg_basebackup.c is using it but it's in an #ifndef WIN32
>>> and the same is true for postmaster/syslogger.c. Am I missing
>>> something or has this Windows implementation become stale by now? I'll
>>> append the patch but haven't adapted the pg_dump patch yet to use it.
>>> Should we still go forward the way you proposed?
>> Dunno.  Can we get an opinion on that from one of the Windows guys?
>> Andrew, Magnus?
>>
>>
>
> I haven't had time to review this patch or even follow all the 
> discussion as I was hoping. I'll try to review the whole thing shortly.


pgpipe used to be used in pgstat.c, but that's no longer true in any 
live branch, so it's probably long dead. I'd be inclined to rip it out 
if possible rather than expand its use.

I've just started looking at the patch, and I'm curious to know why it
didn't follow the pattern of parallel pg_restore, which created a new
worker for each table, rather than passing messages to looping worker
threads as this patch appears to do. That might have avoided a lot of the
need for the message passing infrastructure, if it could have been done.
But maybe I just need to review the patch and the discussions some more.

cheers

andrew







Re: patch for parallel pg_dump

From
Alvaro Herrera
Date:
Excerpts from Andrew Dunstan's message of Wed Mar 14 17:39:59 -0300 2012:

> pgpipe used to be used in pgstat.c, but that's no longer true in any
> live branch, so it's probably long dead. I'd be inclined to rip it out
> if possible rather than expand its use.

our pgpipe() function is interesting -- all the callers that use it
first verify that they aren't on WIN32, and in that case a #define
makes it plain pipe().  Yet the function itself is only defined on
WIN32.  It seems a reasonable idea to kill both pgpipe() and piperead().

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Wed, Mar 14, 2012 at 4:39 PM, Andrew Dunstan <adunstan@postgresql.org> wrote:
> I've just started looking at the patch, and I'm curious to know why it
> didn't follow the pattern of parallel pg_restore which created a new worker
> for each table rather than passing messages to looping worker threads as
> this appears to do. That might have avoided a lot of the need for this
> message passing infrastructure, if it could have been done. But maybe I just
> need to review the patch and the discussions some more.

Hmm, I hadn't actually considered that idea.  Not sure whether it's
better or worse than the current implementation...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Wed, Mar 14, 2012 at 2:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think we should get rid of die_horribly(), and instead arrange
> to always clean up AH via an on_exit_nicely hook.

Good. The only exit handler I've seen so far is
pgdump_cleanup_at_exit. If there's no other one, is it okay to remove
all of this stacking functionality (see on_exit_nicely_index /
MAX_ON_EXIT_NICELY) from dumputils.c and just define two global
variables, one for the function and one for the arg that this function
would operate on (or a struct of both)?

We'd then have the current function and AHX (or only &AH->connection
from it) in the non-parallel case and as soon as we enter the parallel
dump, we can exchange it for another function operating on
ParallelState*. This avoids having to deal with thread-local storage
on Windows, because ParallelState* is just large enough to hold all
the required data and a specific thread can easily find its own slot
with its threadId.


>>> Sure, but since all the function does is write to it or access it,
>>> what good does that do me?
>>
>> It encapsulates the variable so that it can only be used for one
>> specific use case.
>
> Seems pointless to me.

Not so much to me if the alternative is to make ParallelState* a
global variable, but anyway, with the concept proposed above,
ParallelState* would be the arg that the parallel exit handler would
operate on, so it would indeed be global but hidden behind a different
name and a void* pointer.

(I will address all the other points you brought up in my next patch)


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Wed, Mar 14, 2012 at 4:39 PM, Andrew Dunstan <adunstan@postgresql.org> wrote:
> I've just started looking at the patch, and I'm curious to know why it
> didn't follow the pattern of parallel pg_restore which created a new worker
> for each table rather than passing messages to looping worker threads as
> this appears to do. That might have avoided a lot of the need for this
> message passing infrastructure, if it could have been done. But maybe I just
> need to review the patch and the discussions some more.

The main reason for this design has now been overcome by the
flexibility of the synchronized snapshot feature, which makes it
possible to get the snapshot of a transaction even if that transaction
has been running for quite some time already. In previously proposed
implementations of this feature, workers had to connect at the same
time and then could not close their transactions without losing the
snapshot.

The other drawback of the fork-per-TocEntry approach is the somewhat
limited bandwidth of information from the worker back to the master;
it's basically just the return code. That's fine if there is no error,
but if there is, then the master can't tell any further details (e.g.
"could not get lock on table foo", or "could not write to file bar: no
space left on device").

This restriction does not apply only to error messages. For example,
what I'd also like to have in pg_dump would be checksums on a
per-TocEntry basis. The individual workers would calculate the
checksums when writing the file and then send them back to the master
for integration into the TOC. I don't see how such a feature could be
implemented in a straightforward way without a message passing
infrastructure.
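
On the worker side that could look something like this (the message
format, the crc value and the helper name are made up for illustration;
error handling goes through exit_horribly() as elsewhere in pg_dump):

static void
report_te_done(int pipefd, int dumpId, unsigned int crc)
{
    char    msg[64];

    /* per-TocEntry status plus arbitrary extra detail for the master */
    snprintf(msg, sizeof(msg), "OK DUMP %d CHECKSUM %08x", dumpId, crc);
    if (write(pipefd, msg, strlen(msg) + 1) < 0)
        exit_horribly(NULL, "could not write to master process: %s\n",
                      strerror(errno));
}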


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Thu, Mar 15, 2012 at 12:56 AM, Joachim Wieland <joe@mcknight.de> wrote:
> On Wed, Mar 14, 2012 at 2:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think we should get rid of die_horribly(), and instead arrange
>> to always clean up AH via an on_exit_nicely hook.
>
> Good. The only exit handler I've seen so far is
> pgdump_cleanup_at_exit. If there's no other one, is it okay to remove
> all of this stacking functionality (see on_exit_nicely_index /
> MAX_ON_EXIT_NICELY) from dumputils.c and just define two global
> variables, one for the function and one for the arg that this function
> would operate on (or a struct of both)?

No.  That code is included by other things - like pg_dumpall - that
don't know there's such a thing as an Archive.  But I don't see that
as a big problem; just on_exit_nicely whatever you want.  We could
also add on_exit_nicely_reset(), if needed, to clear the existing
handlers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Fri, Mar 16, 2012 at 12:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Good. The only exit handler I've seen so far is
>> pgdump_cleanup_at_exit. If there's no other one, is it okay to remove
>> all of this stacking functionality (see on_exit_nicely_index /
>> MAX_ON_EXIT_NICELY) from dumputils.c and just define two global
>> variables, one for the function and one for the arg that this function
>> would operate on (or a struct of both)?
>
> No.  That code is included by other things - like pg_dumpall - that
> don't know there's such a thing as an Archive.  But I don't see that
> as a big problem; just on_exit_nicely whatever you want.  We could
> also add on_exit_nicely_reset(), if needed, to clear the existing
> handlers.

Yes, on_exit_nicely_reset() would be what I'd need to remove, in the
child process after the fork, all the callbacks inherited from the
parent.

I still can't find any other hooks except for pgdump_cleanup_at_exit
from pg_dump.c. I guess what you're saying is that we provide
dumputils.c to other programs, and even though none of them currently
sets any exit callback, you want to keep the functionality so that
they can set multiple exit hooks in the future should the need for
them arise.


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Wed, Mar 14, 2012 at 2:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think we should somehow unify both functions, the code is not very
>> consistent in this respect, it also calls exit_horribly() when it has
>> AH available. See for example pg_backup_tar.c
>
> I think we should get rid of die_horribly(), and instead have arrange
> to always clean up AH via an on_exit_nicely hook.

Attached is a patch that gets rid of die_horribly().

For the parallel case it maintains an array with as many elements as
we have worker processes. When the workers start, they enter their PID
(or thread ID) and their ArchiveHandle (AH). The exit handler function
in a process can then find its own ArchiveHandle by comparing its own
PID with all the elements in the array.
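
In sketch form (the slot layout and the helper name are illustrative):

typedef struct ParallelStateEntry
{
    pid_t           pid;        /* or a thread id on Windows */
    ArchiveHandle  *AH;
} ParallelStateEntry;

/* used by the exit handler to find the ArchiveHandle of this process */
static ArchiveHandle *
find_own_handle(ParallelStateEntry *pse, int numWorkers)
{
    int     i;
    pid_t   me = getpid();

    for (i = 0; i < numWorkers; i++)
        if (pse[i].pid == me)
            return pse[i].AH;
    return NULL;                /* master, or not a registered worker */
}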

Attachment

Re: patch for parallel pg_dump

From
Alvaro Herrera
Date:
Excerpts from Joachim Wieland's message of Mon Mar 19 00:31:47 -0300 2012:
> On Wed, Mar 14, 2012 at 2:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> I think we should somehow unify both functions, the code is not very
> >> consistent in this respect, it also calls exit_horribly() when it has
> >> AH available. See for example pg_backup_tar.c
> >
> > I think we should get rid of die_horribly(), and instead arrange
> > to always clean up AH via an on_exit_nicely hook.
>
> Attached is a patch that gets rid of die_horribly().
>
> For the parallel case it maintains an array with as many elements as
> we have worker processes. When the workers start, they enter their PID
> (or thread ID) and their ArchiveHandle (AH). The exit handler function
> in a process can then find its own ArchiveHandle by comparing its own
> PID with all the elements in the array.

Sounds good to me in general ... my only gripe is this: I wonder if it
would be better to have a central routine that knows about both
archive_close_connection and archive_close_connection_parallel -- and
the argument to the callback is a struct that contains a pointer to
the struct with the connection to be closed, a ParallelState (either of
which can be null), and a flag stating which of the ParallelState/AH is
in use.  That way, you avoid having to reset the callbacks when you
switch from AH to parallel; instead you just clear out the AH
connection, set the ParallelState, and flip the switch.  The general
archive_close_connection checks the flag to know which routine to call.

I mean, what you have probably works fine now, but it doesn't seem very
extensible.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Mon, Mar 19, 2012 at 9:14 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Sounds good to me in general ... my only gripe is this: I wonder if it
> would be better to have a central routine that knows about both
> archive_close_connection and archive_close_connection_parallel -- and
> the argument to the callback is a struct that contains a pointer to
> the struct with the connection to be closed [...]

I had a similar idea before but then concluded that for that you need
to have this struct globally available so that everybody (pg_dump.c /
pg_restore.c / pg_backup_archiver.c) can access it to set the
appropriate state.

I gave it a second thought and now just defined a function that these
consumers can call, that way the variable can stay at file scope at
least.

Also we don't need this switch, we can set the ParallelState in the
struct before any child forks off and reset it to NULL after the last
child has terminated.
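
In code, that arrangement would look roughly like this (names follow
the discussion; ShutdownWorkersHard() stands in for whatever the
parallel shutdown routine ends up being called):

typedef struct ShutdownInformation
{
    ParallelState *pstate;      /* non-NULL only while workers exist */
    Archive       *AHX;         /* the master's own connection, if any */
} ShutdownInformation;

static ShutdownInformation shutdown_info;

static void
archive_close_connection(int code, void *arg)
{
    ShutdownInformation *si = (ShutdownInformation *) arg;

    if (si->pstate)
        ShutdownWorkersHard(si->pstate);    /* parallel case */
    else if (si->AHX)
        DisconnectDatabase(si->AHX);        /* plain case */
}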

New patch attached, thanks for your comments.

Attachment

Re: patch for parallel pg_dump

From
"Erik Rijkers"
Date:
On Tue, March 20, 2012 04:04, Joachim Wieland wrote:
> On Mon, Mar 19, 2012 at 9:14 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> Sounds good to me in general ... my only gripe is this: I wonder if it
>> would be better to have a central routine that knows about both
>> archive_close_connection and archive_close_connection_parallel -- and
>> the argument to the callback is a struct that contains a pointer to
>> the struct with the connection to be closed [...]
>
> I had a similar idea before but then concluded that for that you need
> to have this struct globally available so that everybody (pg_dump.c /
> pg_restore.c / pg_backup_archiver.c) can access it to set the
> appropriate state.
>
> I gave it a second thought and now just defined a function that these
> consumers can call, that way the variable can stay at file scope at
> least.
>
> Also we don't need this switch, we can set the ParallelState in the
> struct before any child forks off and reset it to NULL after the last
> child has terminated.
>
> New patch attached, thanks for your comments.
>

[pg_dump_die_horribly.2.diff ]


In my hands, the patch complains:


In file included from gram.y:13255:0:
scan.c: In function ‘yy_try_NUL_trans’:
scan.c:16243:23: warning: unused variable ‘yyg’ [-Wunused-variable]
pg_backup_archiver.c:3320:1: error: static declaration of ‘archive_close_connection’ follows
non-static declaration
pg_backup.h:170:13: note: previous declaration of ‘archive_close_connection’ was here
make[3]: *** [pg_backup_archiver.o] Error 1
make[2]: *** [all-pg_dump-recurse] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [all-bin-recurse] Error 2
make: *** [all-src-recurse] Error 2
-- make returned 2 - abort



Erik Rijkers




Re: patch for parallel pg_dump

From
"Erik Rijkers"
Date:
>
> [pg_dump_die_horribly.2.diff ]
>
>
> In my hands, the patch complains:
>
>
> In file included from gram.y:13255:0:
> scan.c: In function ‘yy_try_NUL_trans’:
> scan.c:16243:23: warning: unused variable ‘yyg’ [-Wunused-variable]
> pg_backup_archiver.c:3320:1: error: static declaration of ‘archive_close_connection’ follows
> non-static declaration
> pg_backup.h:170:13: note: previous declaration of ‘archive_close_connection’ was here
> make[3]: *** [pg_backup_archiver.o] Error 1
> make[2]: *** [all-pg_dump-recurse] Error 2
> make[2]: *** Waiting for unfinished jobs....
> make[1]: *** [all-bin-recurse] Error 2
> make: *** [all-src-recurse] Error 2
> -- make returned 2 - abort
>

I should add:  Centos 5.8, gcc 4.6.3


Erik Rijkers





Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Tue, Mar 20, 2012 at 12:03 AM, Erik Rijkers <er@xs4all.nl> wrote:
> In my hands, the patch complains:

Thanks, updated patch attached.

Attachment

Re: patch for parallel pg_dump

From
Alvaro Herrera
Date:
Excerpts from Joachim Wieland's message of Tue Mar 20 08:26:52 -0300 2012:
> On Tue, Mar 20, 2012 at 12:03 AM, Erik Rijkers <er@xs4all.nl> wrote:
> > In my hands, the patch complains:
>
> Thanks, updated patch attached.

Applied, with some minor tweaks, thanks.

I didn't try the WIN32 compile.  I hope I didn't break it (assuming it
was working in your patch.)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: patch for parallel pg_dump

From
Alvaro Herrera
Date:

Are you going to provide a rebased version?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Fri, Mar 23, 2012 at 11:11 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Are you going to provide a rebased version?

Yes, working on that.


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Fri, Mar 23, 2012 at 11:11 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Are you going to provide a rebased version?

Rebased version attached, this patch also includes Robert's earlier suggestions.

Attachment

Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Sun, Mar 25, 2012 at 10:50 PM, Joachim Wieland <joe@mcknight.de> wrote:
> On Fri, Mar 23, 2012 at 11:11 AM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> Are you going to provide a rebased version?
>
> Rebased version attached, this patch also includes Robert's earlier suggestions.

I keep hoping someone who knows Windows is going to take a look at
this, but so far no luck.  It could also really use some attention
from someone who has an actual really big database handy, to see how
successful it is in reducing the dump time.  Without those things, I
can't see this getting committed.  But in the meantime, a few fairly
minor comments based on reading the code.

I'm wondering if we really need this much complexity around shutting
down workers.  I'm not sure I understand why we need both a "hard" and
a "soft" method of shutting them down.  At least on non-Windows
systems, it seems like it would be entirely sufficient to just send a
SIGTERM when you want them to die.  They don't even need to catch it;
they can just die.  You could also set things up so that if the
connection to the parent process is closed, the worker exits.  Then,
during normal shutdown, you don't need to kill them at all.  The
master can simply exit, and the child processes will follow suit.  The
checkAborting stuff all goes away.
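
For instance (a sketch; the helper name is invented):

/* blocking read of the next command from the master */
static int
read_command(int pipefd, char *buf, size_t buflen)
{
    ssize_t     n = read(pipefd, buf, buflen - 1);

    if (n <= 0)
        exit(1);        /* EOF or error: the master is gone, just die */
    buf[n] = '\0';
    return (int) n;
}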

The existing coding of on_exit_nicely is intentional, and copied from
similar logic for on_shmem_exit in the backend.  Is there really a
compelling reason to reverse the firing order of exit hooks?

On my system:

parallel.c: In function ‘WaitForTerminatingWorkers’:
parallel.c:275: warning: ‘slot’ may be used uninitialized in this function
make: *** [parallel.o] Error 1

Which actually looks like a semi-legitimate gripe.

+       if (numWorkers > MAXIMUM_WAIT_OBJECTS)
+       {
+               fprintf(stderr, _("%s: invalid number of parallel jobs\n"), progname);
+               exit(1);
+       }

I think this error message could be more clear.  How about "maximum
number of parallel jobs is %d"?

+void _SetupWorker(Archive *AHX, RestoreOptions *ropt) {}

Thankfully, this bit in pg_dumpall.c appears to be superfluous.  I
hope this is just a holdover from an earlier version that we can lose.

-                                         const char *modulename, const char *fmt,...)
+                                         const char *modulename, const char *fmt,...)

Useless hunk.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of Wed Mar 28 14:46:30 -0300 2012:

> I keep hoping someone who knows Windows is going to take a look at
> this, but so far no luck.  It could also really use some attention
> from someone who has an actual really big database handy, to see how
> successful it is in reducing the dump time.  Without those things, I
> can't see this getting committed.  But in the meantime, a few fairly
> minor comments based on reading the code.

My main comment about the current patch is that it looks like it's
touching pg_restore parallel code by moving some stuff into parallel.c.
If that's really the case and it's voluminous, maybe this patch would
shrink a bit if we could do the code moving in a first patch.  That
would be mostly mechanical.  Then the interesting stuff would apply on
top of that.  That would make review easier.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Wed, Mar 28, 2012 at 2:20 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Robert Haas's message of Wed Mar 28 14:46:30 -0300 2012:
>> I keep hoping someone who knows Windows is going to take a look at
>> this, but so far no luck.  It could also really use some attention
>> from someone who has an actual really big database handy, to see how
>> successful it is in reducing the dump time.  Without those things, I
>> can't see this getting committed.  But in the meantime, a few fairly
>> minor comments based on reading the code.
>
> My main comment about the current patch is that it looks like it's
> touching pg_restore parallel code by moving some stuff into parallel.c.
> If that's really the case and it's voluminous, maybe this patch would
> shrink a bit if we could do the code moving in a first patch.  That
> would be mostly mechanical.  Then the interesting stuff would apply on
> top of that.  That would make review easier.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Andrew Dunstan
Date:

On 03/28/2012 02:27 PM, Robert Haas wrote:
> On Wed, Mar 28, 2012 at 2:20 PM, Alvaro Herrera
> <alvherre@commandprompt.com>  wrote:
>> Excerpts from Robert Haas's message of Wed Mar 28 14:46:30 -0300 2012:
>>> I keep hoping someone who knows Windows is going to take a look at
>>> this, but so far no luck.  It could also really use some attention
>>> from someone who has an actual really big database handy, to see how
>>> successful it is in reducing the dump time.  Without those things, I
>>> can't see this getting committed.  But in the meantime, a few fairly
>>> minor comments based on reading the code.
>> My main comment about the current patch is that it looks like it's
>> touching pg_restore parallel code by moving some stuff into parallel.c.
>> If that's really the case and it's voluminous, maybe this patch would
>> shrink a bit if we could do the code moving in a first patch.  That
>> would be mostly mechanical.  Then the interesting stuff would apply on
>> top of that.  That would make review easier.
> +1.

+1 also.

FYI I am just starting some test runs on Windows. I also have a
reasonably sized db (98 GB) on my SL6 server, which should be perfect for
testing this (I just partitioned its two main tables), and will try to
get some timing runs.

cheers

andrew



Re: patch for parallel pg_dump

From
Andrew Dunstan
Date:

On 03/28/2012 03:17 PM, Andrew Dunstan wrote:
>
>
> On 03/28/2012 02:27 PM, Robert Haas wrote:
>> On Wed, Mar 28, 2012 at 2:20 PM, Alvaro Herrera
>> <alvherre@commandprompt.com>  wrote:
>>> Excerpts from Robert Haas's message of Wed Mar 28 14:46:30 -0300 2012:
>>>> I keep hoping someone who knows Windows is going to take a look at
>>>> this, but so far no luck.  It could also really use some attention
>>>> from someone who has an actual really big database handy, to see how
>>>> successful it is in reducing the dump time.  Without those things, I
>>>> can't see this getting committed.  But in the meantime, a few fairly
>>>> minor comments based on reading the code.
>>> My main comment about the current patch is that it looks like it's
>>> touching pg_restore parallel code by moving some stuff into parallel.c.
>>> If that's really the case and it's voluminous, maybe this patch would
>>> shrink a bit if we could do the code moving in a first patch.  That
>>> would be mostly mechanical.  Then the interesting stuff would apply on
>>> top of that.  That would make review easier.
>> +1.
>
> +1 also.
>
> FYI I am just starting some test runs on Windows. I also have a
> reasonably sized db (98 gb) on my SL6 server which should be perfect
> for testing this (I just partitioned its two main tables), and will
> try to get some timing runs.


First hurdle: It doesn't build under Windows/mingw-w64:
   x86_64-w64-mingw32-gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -I../../../src/interfaces/libpq -I../../../src/include -I./src/include/port/win32 -DEXEC_BACKEND -I/c/prog/mingwdep/include "-I../../../src/include/port/win32" -c -o parallel.o parallel.c
   parallel.c:40:12: error: static declaration of 'pgpipe' follows non-static declaration
   ../../../src/include/port.h:268:12: note: previous declaration of 'pgpipe' was here
   parallel.c:41:12: error: static declaration of 'piperead' follows non-static declaration
   ../../../src/include/port.h:269:12: note: previous declaration of 'piperead' was here
   parallel.c: In function 'ParallelBackupStart':
   parallel.c:455:9: warning: passing argument 3 of '_beginthreadex' from incompatible pointer type
   c:\mingw64\bin\../lib/gcc/x86_64-w64-mingw32/4.5.3/../../../../x86_64-w64-mingw32/include/process.h:31:29: note: expected 'unsigned int (*)(void *)' but argument is of type 'unsigned int (*)(struct WorkerInfo *)'
   make[3]: *** [parallel.o] Error 1
   make[3]: Leaving directory `/home/andrew/bf/root/HEAD/pgsql/src/bin/pg_dump'
   make[2]: *** [all-pg_dump-recurse] Error 2
   make[2]: Leaving directory `/home/andrew/bf/root/HEAD/pgsql/src/bin'
   make[1]: *** [all-bin-recurse] Error 2
   make[1]: Leaving directory `/home/andrew/bf/root/HEAD/pgsql/src'
   make: *** [all-src-recurse] Error 2

I'll have a look at that in a little while.

cheers

andrew


>
> cheers
>
> andrew
>
>


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Wed, Mar 28, 2012 at 5:19 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> First hurdle: It doesn't build under Windows/mingw-w64:
>
>   parallel.c:40:12: error: static declaration of 'pgpipe' follows
>   non-static declaration

Strange, I'm not seeing this but I'm building with VC2005. What
happens is that you're pulling in the pgpipe.h header. I have moved
these functions as static functions into pg_dump since you voted for
removing them from the other location (because as it turned out,
nobody else is currently using them).

Joachim


Re: patch for parallel pg_dump

From
Andrew Dunstan
Date:

On 03/28/2012 08:28 PM, Joachim Wieland wrote:
> On Wed, Mar 28, 2012 at 5:19 PM, Andrew Dunstan<andrew@dunslane.net>  wrote:
>> First hurdle: It doesn't build under Windows/mingw-w64:
>>
>>    parallel.c:40:12: error: static declaration of 'pgpipe' follows
>>    non-static declaration
> Strange, I'm not seeing this but I'm building with VC2005. What
> happens is that you're pulling in the pgpipe.h header. I have moved
> these functions as static functions into pg_dump since you voted for
> removing them from the other location (because as it turned out,
> nobody else is currently using them).

But your patch hasn't got rid of them, and so it's declared twice. There 
is no pgpipe.h, BTW, it's declared in port.h. If VC2005 doesn't complain 
about the double declaration then that's a bug in the compiler, IMNSHO. 
Doesn't it even issue a warning? I no longer use VC2005 for anything, 
BTW, I use VC2008 or later.

Anyway, ISTM the best thing is just for us to get rid of pgpipe without 
further ado. I'll try to get a patch together for that.

cheers

andrew





> Joachim
>


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Thu, Mar 29, 2012 at 2:46 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
> But your patch hasn't got rid of them, and so it's declared twice. There is
> no pgpipe.h, BTW, it's declared in port.h. If VC2005 doesn't complain about
> the double declaration then that's a bug in the compiler, IMNSHO. Doesn't it
> even issue a warning? I no longer use VC2005 for anything, BTW, I use VC2008
> or later.

I agree, the compiler should have found it, but no, I don't even get a
warning. I just verified it and when I add a #error right after the
prototypes in port.h then it hits the #error on every file, so it
definitely sees both prototypes and doesn't complain... cl.exe is
running with /W3.


> Anyway, ISTM the best thing is just for us to get rid of pgpipe without
> further ado. I'll try to get a patch together for that.

Thanks.


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Wed, Mar 28, 2012 at 1:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm wondering if we really need this much complexity around shutting
> down workers.  I'm not sure I understand why we need both a "hard" and
> a "soft" method of shutting them down.  At least on non-Windows
> systems, it seems like it would be entirely sufficient to just send a
> SIGTERM when you want them to die.  They don't even need to catch it;
> they can just die.

At least on my Linux test system, even if all pg_dump processes are
gone, the server happily continues sending data. When I strace an
individual backend process, I see a lot of Broken pipe writes, but
that doesn't stop it from just writing out the whole table to a closed
file descriptor. This is a 9.0-latest server.

--- SIGPIPE (Broken pipe) @ 0 (0) ---
read(13, "\220\370\0\0\240\240r\266\3\0\4\0\264\1\320\1\0 \4
\0\0\0\0\270\237\220\0h\237\230\0"..., 8192) = 8192
read(13, "\220\370\0\0\350\300r\266\3\0\4\0\264\1\320\1\0 \4
\0\0\0\0\260\237\230\0h\237\220\0"..., 8192) = 8192
sendto(7, "d\0\0\0Acpp\t15.00000\t1245240000\taut"..., 8192, 0, NULL,
0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
read(13, "\220\370\0\0000\341r\266\3\0\5\0\260\1\340\1\0 \4
\0\0\0\0\270\237\220\0p\237\220\0"..., 8192) = 8192
sendto(7, "d\0\0\0Dcpp\t15.00000\t1245672000\taut"..., 8192, 0, NULL,
0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
read(13,  <unfinished ...>

I guess that https://commitfest.postgresql.org/action/patch_view?id=663
would take care of that in the future, but without it (i.e. anything <=
9.2) it's quite annoying if you want to Ctrl-C a pg_dump and then have
to manually hunt down and kill all the backend processes.

I tested the above by immediately returning from
DisconnectDatabase() in pg_backup_db.c on Linux. The important thing
is that it calls PQcancel() on its connection before terminating.
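
In other words, something along these lines in the shutdown path (a
sketch; PQgetCancel/PQcancel/PQfreeCancel are the standard libpq cancel
calls):

static void
cancel_and_disconnect(PGconn *conn)
{
    char        errbuf[256];
    PGcancel   *cancel = PQgetCancel(conn);

    if (cancel)
    {
        /* ask the server to stop streaming before we hang up */
        if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
            fprintf(stderr, "could not send cancel request: %s\n", errbuf);
        PQfreeCancel(cancel);
    }
    PQfinish(conn);
}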

On Windows, several pages indicate that you can only cleanly terminate
a thread from within the thread, e.g. the last paragraph on

http://msdn.microsoft.com/en-us/library/windows/desktop/ms686724%28v=vs.85%29.aspx

The patch is basically doing what this page is recommending.

I'll try your proposal about terminating in the child when the
connection is closed, that sounds reasonable and I don't see an
immediate problem with that.


> The existing coding of on_exit_nicely is intentional, and copied from
> similar logic for on_shmem_exit in the backend. Is there really a
> compelling reason to reverse the firing order of exit hooks?

No, reversing the order was not intended. I rewrote it as a for loop
because the current implementation modifies a global variable, and so
on Windows only one thread would call the exit hook.
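
The loop variant looks roughly like this (a sketch; I'm assuming the
list is an array of {function, arg} pairs as in dumputils.c):

void
exit_nicely(int code)
{
    int     i;

    /*
     * Run the hooks newest-first without popping them off the global
     * stack, so that on Windows every thread can fire all of them.
     */
    for (i = on_exit_nicely_index - 1; i >= 0; i--)
        (*on_exit_nicely_list[i].function) (code, on_exit_nicely_list[i].arg);

    exit(code);
}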

I'll add all your other suggestions to the next version of my patch. Thanks!


Joachim


Re: patch for parallel pg_dump

From
Andrew Dunstan
Date:

On 03/28/2012 09:12 PM, Joachim Wieland wrote:
> On Thu, Mar 29, 2012 at 2:46 AM, Andrew Dunstan<andrew@dunslane.net>  wrote:
>> But your patch hasn't got rid of them, and so it's declared twice. There is
>> no pgpipe.h, BTW, it's declared in port.h. If VC2005 doesn't complain about
>> the double declaration then that's a bug in the compiler, IMNSHO. Doesn't it
>> even issue a warning? I no longer use VC2005 for anything, BTW, I use VC2008
>> or later.
> I agree, the compiler should have found it, but no, I don't even get a
> warning. I just verified it and when I add a #error right after the
> prototypes in port.h then it hits the #error on every file, so it
> definitely sees both prototypes and doesn't complain... cl.exe is
> running with /W3.
>
>> Anyway, ISTM the best thing is just for us to get rid of pgpipe without
>> further ado. I'll try to get a patch together for that.
> Thanks.


OK, I have committed this. Following that, you need to add a couple of 
macros for pipewrite to parallel.c. There's also a missing cast in the 
call to _beginthreadex().

I'll continue testing tomorrow.

cheers

andrew



Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Wed, Mar 28, 2012 at 9:54 PM, Joachim Wieland <joe@mcknight.de> wrote:
> On Wed, Mar 28, 2012 at 1:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm wondering if we really need this much complexity around shutting
>> down workers.  I'm not sure I understand why we need both a "hard" and
>> a "soft" method of shutting them down.  At least on non-Windows
>> systems, it seems like it would be entirely sufficient to just send a
>> SIGTERM when you want them to die.  They don't even need to catch it;
>> they can just die.
>
> At least on my Linux test system, even if all pg_dump processes are
> gone, the server happily continues sending data. When I strace an
> individual backend process, I see a lot of Broken pipe writes, but
> that doesn't stop it from just writing out the whole table to a closed
> file descriptor. This is a 9.0-latest server.

Wow, yuck.  At least now I understand why you're doing it like that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Andrew Dunstan
Date:

On 03/28/2012 11:43 PM, Andrew Dunstan wrote:
>
>
> On 03/28/2012 09:12 PM, Joachim Wieland wrote:
>> On Thu, Mar 29, 2012 at 2:46 AM, Andrew Dunstan<andrew@dunslane.net>
>> wrote:
>>> But your patch hasn't got rid of them, and so it's declared twice.
>>> There is
>>> no pgpipe.h, BTW, it's declared in port.h. If VC2005 doesn't
>>> complain about
>>> the double declaration then that's a bug in the compiler, IMNSHO.
>>> Doesn't it
>>> even issue a warning? I no longer use VC2005 for anything, BTW, I
>>> use VC2008
>>> or later.
>> I agree, the compiler should have found it, but no, I don't even get a
>> warning. I just verified it and when I add a #error right after the
>> prototypes in port.h then it hits the #error on every file, so it
>> definitely sees both prototypes and doesn't complain... cl.exe is
>> running with /W3.
>>
>>> Anyway, ISTM the best thing is just for us to get rid of pgpipe without
>>> further ado. I'll try to get a patch together for that.
>> Thanks.
>
>
> OK, I have committed this. Following that, you need to add a couple of
> macros for pipewrite to parallel.c. There's also a missing cast in the
> call to _beginthreadex().
>
> I'll continue testing tomorrow.
>
>


Here is an updated patch that builds now that pgpipe has been removed.

cheers

andrew

Attachment

Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Wed, Mar 28, 2012 at 2:20 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> My main comment about the current patch is that it looks like it's
> touching pg_restore parallel code by moving some stuff into parallel.c.
> If that's really the case and its voluminous, maybe this patch would
> shrink a bit if we could do the code moving in a first patch.  That
> would be mostly mechanical.  Then the interesting stuff would apply on
> top of that.  That would make review easier.

Unfortunately this is not really the case. What is being moved out of
pg_backup_archiver.c and into parallel.c is either the shutdown logic
that was applied only a few days ago, or code that is necessary to change
the parallel restore logic from one-thread-per-dump-object to the
message passing framework, where a worker starts in the beginning and
then receives a new dump object from the master every time it's idle.

Instead now I split up the patch into two parts. The first part does
not depend on the parallel functionality and can be applied already
now. The second part then adds the parallelism on top.

Here's the complete list of changes in the first patch:

- split up restore_toc_entries_parallel into
restore_toc_entries_prefork / restore_toc_entries_parallel and
restore_toc_entries_postfork
- remove static char from prependDirectory
- remove static PQExpBuffer from fmtCopyColumnList
- get the relPages numbers, add a function that sorts by table size
(not called right now)
- remove static char* from selectSourceSchema
- function getThreadLocalPQExpBuffer returning a PQExpBuffer that
lives in thread-local memory
- make fmtId use the thread-local PQExpBuffer
- add function fmtQualifiedId which is schema + fmtId()
- add function pointer on_exit_msg_func, that will be called to handle
the last words of a process (currently it only prints the message)
- change exit_nicely to not modify the global variable on_exit_nicely_index
- move ropt->number_of_jobs from RestoreOptions to AHX->numWorkers so
that it can be used for backup as well
- make getTocEntryByDumpId non-static, even though it's not necessary now
- make CloneArchive / DeCloneArchive non-static, even though it's not
necessary now
- if a transaction is active in DisconnectDatabase, cancel this
transaction with PQcancel

Attachment

Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Sun, Apr 1, 2012 at 12:35 PM, Joachim Wieland <joe@mcknight.de> wrote:
> On Wed, Mar 28, 2012 at 2:20 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> My main comment about the current patch is that it looks like it's
>> touching pg_restore parallel code by moving some stuff into parallel.c.
>> If that's really the case and its voluminous, maybe this patch would
>> shrink a bit if we could do the code moving in a first patch.  That
>> would be mostly mechanical.  Then the interesting stuff would apply on
>> top of that.  That would make review easier.
>
> Unfortunately this is not really the case. What is being moved out of
> pg_backup_archiver.c and into parallel.c is either the shutdown logic
> that was applied only a few days ago, or code that is necessary to change
> the parallel restore logic from one-thread-per-dump-object to the
> message passing framework, where a worker starts in the beginning and
> then receives a new dump object from the master every time it's idle.

Hmm.  It looks to me like the part-two patch still contains a bunch of
code rearrangement.  For example, the current code for
pg_backup_archiver.c patch contains this:

typedef struct ParallelState
{
        int                     numWorkers;
        ParallelStateEntry *pse;
} ParallelState;

In the revised patch, that's removed, and parallel.c instead contains this:

typedef struct _parallel_state
{
        int                     numWorkers;
        ParallelSlot *parallelSlot;
} ParallelState;

Perhaps I am missing something, but it looks to me like that's the
same code, except that for some reason the identifiers have been
renamed.  I see little point in renaming the struct from ParallelState
to _parallel_state (in fact, I like the new name less) or changing
ParallelStateEntry to ParallelSlot (which is no worse, but it's not
better either).

On a similar note, what's the point of changing struct Archive to have
int numWorkers instead of int number_of_jobs, and furthermore
shuffling the declaration around to a different part of the struct?
If that's really an improvement, it should be a separate patch, but my
guess is that it is just rearranging deck chairs.  It gives rise to
subsidiary diff hunks like this:
        /*
         * If we're going to do parallel restore, there are some restrictions.
         */
!       parallel_mode = (ropt->number_of_jobs > 1 && ropt->useDB);
        if (parallel_mode)
        {
                /* We haven't got round to making this work for all archive formats */
--- 271,277 ----
        /*
         * If we're going to do parallel restore, there are some restrictions.
         */
!       parallel_mode = (AH->public.numWorkers > 1 && ropt->useDB);
        if (parallel_mode)
        {
                /* We haven't got round to making this work for all archive formats */

On another note, I am not sure that I like the messaging protocol
you've designed.  It seems to me that this has little chance of being
reliable:

+ void (*on_exit_msg_func)(const char *modulename, const char *fmt,
va_list ap) = vwrite_msg;

I believe the idea here is that you're going to capture the dying gasp
of the worker thread and send it to the master instead of printing it
out.  But that doesn't seem very reliable.  There's code all over the
place (and, in particular, in pg_dump.c) that assumes that we may
write messages at any time, and in particular that we may emit
multiple messages just before croaking.  Even if you were to hook
vwrite_msg, I'm not sure that would do it, because there are certainly
situations in which libpq can print out errors directly, for example,
or a system library might cough something up.  I'm thinking that the
boss process should really probably be capturing stdout and stderr and
reading messages from there, and interpreting any messages that it
sees as non-fatal (but still reporting them back to the user) unless
the worker exits unexpectedly (which the master can detect by noticing
that the connection has been closed).

Also, I like the idea of making it possible to use assertions in
front-end code.  But I think that if we're going to do that, we should
make it work in general, not just for things that include
pg_backup_archiver.h.  It seems to me that the way to do this would be
to move the definitions of Assert(), AssertMacro(), AssertArg(),
AssertState(), Trap(), TrapMacro(), and ExceptionalCondition() out of
postgres.h and into a new header file, say, pg_assert.h.  postgres.h
can include that automatically, and people writing front-end code can
include it if they're so inclined.  The only difficulty is where and
how to define ExceptionalCondition().  The obvious place seems to be
dumputils.c, which seems to be our unofficial place to stick stuff
that may be needed by a variety of frontend code.  Really, I think we
ought to think about creating a library that is more explicitly for
that purpose (libpgfrontend?) or else figuring out how to incorporate
it into libpq, but that's a project for another day.
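
A minimal pg_assert.h along those lines might look like this (the file
layout is a sketch; ExceptionalCondition's signature matches the
backend's):

#ifndef PG_ASSERT_H
#define PG_ASSERT_H

extern void ExceptionalCondition(const char *conditionName,
                                 const char *errorType,
                                 const char *fileName, int lineNumber);

#ifdef USE_ASSERT_CHECKING
#define Assert(condition) \
        do { \
            if (!(condition)) \
                ExceptionalCondition(#condition, "FailedAssertion", \
                                     __FILE__, __LINE__); \
        } while (0)
#else
#define Assert(condition)   ((void) 0)
#endif

#endif   /* PG_ASSERT_H */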

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Tue, Apr 3, 2012 at 9:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Apr 1, 2012 at 12:35 PM, Joachim Wieland <joe@mcknight.de> wrote:
>> Unfortunately this is not really the case. What is being moved out of
>> pg_backup_archiver.c and into parallel.c is either the shutdown logic
>> that was applied only a few days ago, or code that is necessary to change
>> the parallel restore logic from one-thread-per-dump-object to the
>> message passing framework, where a worker starts in the beginning and
>> then receives a new dump object from the master every time it's idle.
>
> Hmm.  It looks to me like the part-two patch still contains a bunch of
> code rearrangement.  For example, the current code for
> pg_backup_archiver.c patch contains this:
>
> typedef struct ParallelState
> {
>        int                     numWorkers;
>        ParallelStateEntry *pse;
> } ParallelState;
>
> In the revised patch, that's removed, and parallel.c instead contains this:
>
> typedef struct _parallel_state
> {
>      int                     numWorkers;
>      ParallelSlot *parallelSlot;
> } ParallelState;

Yes, this is what I referred to as the part of the shutdown logic that
we only applied a few days ago. I basically backported what I had in
parallel.c into pg_backup_archiver.c which is where all the parallel
logic lives at the moment. Moving it out of pg_backup_archiver.c and
into a new parallel.c file means that I'd either have to move the
declaration to a header or create accessor functions and declare these
in a header. I actually tried it and both solutions created more lines
than they would save later on, especially with the lines that will
remove this temporary arrangement again...

The current parallel restore engine already has a "ParallelSlot"
structure but uses it in a slightly different way. That's why I
created the one in the shutdown logic as "ParallelStateEntry" for now.
This will be gone with the final patch and at the end there will only
be a "ParallelSlot" left that will serve both purposes. That's why you
see this renaming (and the removal of the current ParallelSlot
structure).

"struct _parallel_state" won't be used anywhere, except for forward
declarations in headers. I just used it because that seemed to be the
naming scheme, other structures are called similarly, e.g.:

$ grep "struct _" pg_backup_archiver.c
typedef struct _restore_args
typedef struct _parallel_slot
typedef struct _outputContext

I'm fine with any name, just let me know what you prefer.


> On a similar note, what's the point of changing struct Archive to have
> int numWorkers instead of int number_of_jobs, and furthermore
> shuffling the declaration around to a different part of the struct?

number_of_jobs was in the struct RestoreOptions before, a structure
that is not used when doing a dump. I moved it to the Archive as it is
used by both dump and restore, and since all other code talks about
"workers" I changed the name to "numWorkers".


> On another note, I am not sure that I like the messaging protocol
> you've designed.  It seems to me that this has little chance of being
> reliable:
>
> + void (*on_exit_msg_func)(const char *modulename, const char *fmt,
> va_list ap) = vwrite_msg;
>
> I believe the idea here is that you're going to capture the dying gasp
> of the worker thread and send it to the master instead of printing it
> out.  But that doesn't seem very reliable.  There's code all over the
> place (and, in particular, in pg_dump.c) that assumes that we may
> write messages at any time, and in particular that we may emit
> multiple messages just before croaking.

I guess you're talking about the code in pg_dump that reads in the
database schema and the details of all the different objects in the
schema. This code is run before forking off workers and is always
executed in the master.

pg_dump only forks when all the catalog data has been read, so only
actual TABLE DATA and BLOBs are dumped from the workers. I claim that
in at least 90% of cases the functions involved here use exit_horribly()
output a clear message about why they're dying. If they don't but just
die, the master will say "worker died unexpectedly". As you said a few
mails before, any function exiting at this stage should rather call
exit_horribly() to properly clean up after itself.

The advantage of using the message passing system for the last error
message is that you get exactly one message and it's very likely that
it accurately describes what happened to a worker to make it stop.
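
To sketch the mechanism (the pipe-write helper here is a made-up name,
the real one may differ): right after forking, a worker swaps out the
hook so that its final message goes over the command pipe instead of
straight to stderr:

static void
vwrite_msg_to_master(const char *modulename, const char *fmt, va_list ap)
{
    char        buf[512];

    (void) modulename;              /* ignored in this sketch */
    vsnprintf(buf, sizeof(buf), fmt, ap);
    sendMessageToMaster(buf);       /* hypothetical pipe-write helper */
}

static void
SetupWorker(void)
{
    /* redirect the dying gasp over the pipe to the master */
    on_exit_msg_func = vwrite_msg_to_master;
    /* ... rest of worker setup ... */
}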


> Also, I like the idea of making it possible to use assertions in
> front-end code.  But I think that if we're going to do that, we should
> make it work in general, not just for things that include
> pg_backup_archiver.h.

I completely agree. Assertions helped a lot dealing with concurrent
code. How do you want to tackle this for now? Want me to create a
separate header pg_assert.h as part of my patch? Or is it okay to
factor it out later and include it from the general header then?
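
For reference, here's roughly what such a pg_assert.h might carry --
essentially just the macros that postgres.h has today, simplified and
leaving assert_enabled aside:

#ifndef PG_ASSERT_H
#define PG_ASSERT_H

extern void ExceptionalCondition(const char *conditionName,
                                 const char *errorType,
                                 const char *fileName, int lineNumber);

#ifndef USE_ASSERT_CHECKING
#define Assert(condition)
#else
#define Assert(condition) \
    do { \
        if (!(condition)) \
            ExceptionalCondition(#condition, "FailedAssertion", \
                                 __FILE__, __LINE__); \
    } while (0)
#endif   /* USE_ASSERT_CHECKING */

#endif   /* PG_ASSERT_H */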


Re: patch for parallel pg_dump

From
Alvaro Herrera
Date:
Excerpts from Joachim Wieland's message of Tue Apr 03 11:40:31 -0300 2012:
>
> On Tue, Apr 3, 2012 at 9:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> > Hmm.  It looks to me like the part-two patch still contains a bunch of
> > code rearrangement.  For example, the current code for
> > pg_backup_archiver.c patch contains this:
> >
> > typedef struct ParallelState
> > {
> >        int                     numWorkers;
> >        ParallelStateEntry *pse;
> > } ParallelState;
> >
> > In the revised patch, that's removed, and parallel.c instead contains this:
> >
> > typedef struct _parallel_state
> > {
> >      int                     numWorkers;
> >      ParallelSlot *parallelSlot;
> > } ParallelState;

> "struct _parallel_state" won't be used anywhere, except for forward
> declarations in headers. I just used it because that seemed to be the
> naming scheme, other structures are called similarly, e.g.:
>
> $ grep "struct _" pg_backup_archiver.c
> typedef struct _restore_args
> typedef struct _parallel_slot
> typedef struct _outputContext
>
> I'm fine with any name, just let me know what you prefer.

I think I can explain this one.  In the refactoring patch that Joachim
submitted and I committed, the struct was called _parallel_state, using
the same naming convention with the useless _ prefix and all lowercase
that already plagued the pg_dump code.  This name is pointlessly
different from the typedef -- maybe the original pg_dump author thought
the names would collide and chose them differently.  But that is not the
case; it looks ugly to me, and furthermore it is inconsistent with what
we do in the rest of the PG code, which is to use struct names identical
to the typedef names.  So I changed it -- without realizing that the
subsequent patch would move the declarations elsewhere, losing my
renaming.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Tue, Apr 3, 2012 at 10:40 AM, Joachim Wieland <joe@mcknight.de> wrote:
>> On a similar note, what's the point of changing struct Archive to have
>> int numWorkers instead of int number_of_jobs, and furthermore
>> shuffling the declaration around to a different part of the struct?
>
> number_of_jobs was in the struct RestoreOptions before, a structure
> that is not used when doing a dump. I moved it to the Archive as it is
> used by both dump and restore and since all other code talks about
> "workers" I changed the name to "numWorkers".

Gah.  Somehow I feel that splitting up this patch into two pieces
hasn't made anything any better.

>> On another note, I am not sure that I like the messaging protocol
>> you've designed.  It seems to me that this has little chance of being
>> reliable:
>>
>> + void (*on_exit_msg_func)(const char *modulename, const char *fmt,
>> va_list ap) = vwrite_msg;
>>
>> I believe the idea here is that you're going to capture the dying gasp
>> of the worker thread and send it to the master instead of printing it
>> out.  But that doesn't seem very reliable.  There's code all over the
>> place (and, in particular, in pg_dump.c) that assumes that we may
>> write messages at any time, and in particular that we may emit
>> multiple messages just before croaking.
>
> I guess you're talking about the code in pg_dump that reads in the
> database schema and the details of all the different objects in the
> schema. This code is run before forking off workers and is always
> executed in the master.

OK, but it seems like a pretty fragile assumption that none of the
workers will ever manage to emit any other error messages.  We don't
rely on this kind of assumption in the backend, which is a lot
better-structured and less spaghetti-like than the pg_dump code.

>> Also, I like the idea of making it possible to use assertions in
>> front-end code.  But I think that if we're going to do that, we should
>> make it work in general, not just for things that include
>> pg_backup_archiver.h.
>
> I completely agree. Assertions helped a lot dealing with concurrent
> code. How do you want to tackle this for now? Want me to create a
> separate header pg_assert.h as part of my patch? Or is it okay to
> factor it out later and include it from the general header then?

I'll just go do it, barring objections.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Apr 3, 2012 at 10:40 AM, Joachim Wieland <joe@mcknight.de> wrote:
>> I completely agree. Assertions helped a lot dealing with concurrent
>> code. How do you want to tackle this for now? Want me to create a
>> separate header pg_assert.h as part of my patch? Or is it okay to
>> factor it out later and include it from the general header then?

> I'll just go do it, barring objections.

If the necessary support code isn't going to be available *everywhere*,
it should not be in postgres.h.  So I did not care for your proposal to
put it in dumputils.

Possibly we could move assert.c into src/port/ and make it part of
libpgport?
        regards, tom lane


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Tue, Apr 3, 2012 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Apr 3, 2012 at 10:40 AM, Joachim Wieland <joe@mcknight.de> wrote:
>>> I completely agree. Assertions helped a lot dealing with concurrent
>>> code. How do you want to tackle this for now? Want me to create a
>>> separate header pg_assert.h as part of my patch? Or is it okay to
>>> factor it out later and include it from the general header then?
>
>> I'll just go do it, barring objections.
>
> If the necessary support code isn't going to be available *everywhere*,
> it should not be in postgres.h.  So I did not care for your proposal to
> put it in dumputils.

Err... I didn't suggest putting it in postgres.h.  I suggested taking
it OUT of postgres.h and putting it in a separate header file.

> Possibly we could move assert.c into src/port/ and make it part of
> libpgport?

The trouble is that it calls write_stderr(), which has a non-trivial
implementation on Windows that I don't believe will be suitable for
front-end code.  If we can find a reasonable way to work around that
issue then I think that would work.  We might also want to rename
ExceptionalCondition() to pg_exceptional_condition or something like
that if we're going to include it in libpgport.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of Tue Apr 03 12:38:20 -0300 2012:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Apr 3, 2012 at 10:40 AM, Joachim Wieland <joe@mcknight.de> wrote:
> >> I completely agree. Assertions helped a lot dealing with concurrent
> >> code. How do you want to tackle this for now? Want me to create a
> >> separate header pg_assert.h as part of my patch? Or is it okay to
> >> factor it out later and include it from the general header then?
>
> > I'll just go do it, barring objections.
>
> If the necessary support code isn't going to be available *everywhere*,
> it should not be in postgres.h.  So I did not care for your proposal to
> put it in dumputils.
>
> Possibly we could move assert.c into src/port/ and make it part of
> libpgport?

That only leaves assert_enabled to be handled.  In the backend it lives
in guc.c; what to do about frontend code?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: patch for parallel pg_dump

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Apr 3, 2012 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Possibly we could move assert.c into src/port/ and make it part of
>> libpgport?

> The trouble is that it calls write_stderr(), which has a non-trivial
> implementation on Windows that I don't believe will be suitable for
> front-end code.  If we can find a reasonable way to work around that
> issue then I think that would work.

Well, if we don't have a solution to that problem then it's premature
to propose making Assert available to frontend code.  So my opinion
is that that idea is too half-baked to be pushing into 9.2 at this
time.  Let's put it on the to-do list instead.
        regards, tom lane


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Tue, Apr 3, 2012 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Apr 3, 2012 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Possibly we could move assert.c into src/port/ and make it part of
>>> libpgport?
>
>> The trouble is that it calls write_stderr(), which has a non-trivial
>> implementation on Windows that I don't believe will be suitable for
>> front-end code.  If we can find a reasonable way to work around that
>> issue then I think that would work.
>
> Well, if we don't have a solution to that problem then it's premature
> to propose making Assert available to frontend code.  So my opinion
> is that that idea is too half-baked to be pushing into 9.2 at this
> time.  Let's put it on the to-do list instead.

It's more baked than Joachim's existing solution, and I don't favor
punting his whole patch because we don't want to give five minutes of
thought to this problem.  The patch may need to be punted for other
reasons, of course.

Maybe we could just stick #ifdef BACKEND in the libpgport code.  If
we're in the backend, we write_stderr().  Otherwise we just
fprintf(stderr, ...).
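
In code form, roughly this (note the tree's existing convention is a
FRONTEND define set for client builds, rather than a BACKEND one):

/* inside the libpgport version of ExceptionalCondition() -- sketch */
#ifdef FRONTEND
    fprintf(stderr, "TRAP: %s(\"%s\", File: \"%s\", Line: %d)\n",
            errorType, conditionName, fileName, lineNumber);
#else
    write_stderr("TRAP: %s(\"%s\", File: \"%s\", Line: %d)\n",
                 errorType, conditionName, fileName, lineNumber);
#endif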

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> That only leaves assert_enabled to be handled.  In the backend it lives
> in guc.c; what to do about frontend code?

There's no mechanism for turning such a switch on or off in most
frontend code anyway.  I'd think it could just be assumed to be on
in the frontend implementation --- ie, frontend Asserts would always
be active in --enable-cassert builds.  There's not really any need
to do better until/unless we start having Asserts that impact
performance on the frontend side.
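
In the frontend build that could be as simple as:

#ifdef FRONTEND
/* no GUC machinery here; assertions simply follow the build option */
bool        assert_enabled = true;
#endif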
        regards, tom lane


Re: patch for parallel pg_dump

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Apr 3, 2012 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, if we don't have a solution to that problem then it's premature
>> to propose making Assert available to frontend code. �So my opinion
>> is that that idea is too half-baked to be pushing into 9.2 at this
>> time. �Let's put it on the to-do list instead.

> It's more baked than Joachim's existing solution, and I don't favor
> punting his whole patch because we don't want to give five minutes of
> thought to this problem.  The patch may need to be punted for other
> reasons, of course.

Ripping out the Asserts surely can't take long.

> Maybe we could just stick #ifdef BACKEND in the libpgport code.  If
> we're in the backend, we write_stderr().  Otherwise we just
> fprintf(stderr, ...).

No, the reason for write_stderr() is that fprintf(stderr) is unreliable
on Windows.  If memory serves, it can actually crash in some situations.

More generally, it's not very clear what to do with error reports in an
OS environment where stderr isn't a universally supported concept.
So (IMO anyway) you can't just ignore the problem.  And it's not one
that's going to be solved in five minutes, either.
        regards, tom lane


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Tue, Apr 3, 2012 at 12:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Apr 3, 2012 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Well, if we don't have a solution to that problem then it's premature
>>> to propose making Assert available to frontend code.  So my opinion
>>> is that that idea is too half-baked to be pushing into 9.2 at this
>>> time.  Let's put it on the to-do list instead.
>
>> It's more baked than Joachim's existing solution, and I don't favor
>> punting his whole patch because we don't want to give five minutes of
>> thought to this problem.  The patch may need to be punted for other
>> reasons, of course.
>
> Ripping out the Asserts surely can't take long.

Yeah, but asserts exist for a reason: to make it possible to find bugs
more easily.  Let's not be penny-wise and pound-foolish.

>> Maybe we could just stick #ifdef BACKEND in the libpgport code.  If
>> we're in the backend, we write_stderr().  Otherwise we just
>> fprintf(stderr, ...).
>
> No, the reason for write_stderr() is that fprintf(stderr) is unreliable
> on Windows.  If memory serves, it can actually crash in some situations.

Dude, we're already doing fprintf(stderr) all over pg_dump.  If it's
unreliable even in front-end code, we're screwed anyway.  That is a
non-objection.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Apr 3, 2012 at 12:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> No, the reason for write_stderr() is that fprintf(stderr) is unreliable
>> on Windows. �If memory serves, it can actually crash in some situations.

> Dude, we're already doing fprintf(stderr) all over pg_dump.  If it's
> unreliable even in front-end code, we're screwed anyway.  That is a
> non-objection.

No, it isn't.  The fact that it works in pg_dump doesn't extrapolate
to other places.  (In particular, it will absolutely not work in libpq,
at least not in all the environments where libpq is supposed to work.)

I think what we've got at the moment is something that's adequate for
pg_dump, and that's all that it is.  Concluding that it can be used in
all frontend code is way premature, and therefore I'm -1 on the idea
of exposing it in non-pg_dump header files.
        regards, tom lane


Re: patch for parallel pg_dump

From
Robert Haas
Date:
On Tue, Apr 3, 2012 at 12:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Apr 3, 2012 at 12:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> No, the reason for write_stderr() is that fprintf(stderr) is unreliable
>>> on Windows.  If memory serves, it can actually crash in some situations.
>
>> Dude, we're already doing fprintf(stderr) all over pg_dump.  If it's
>> unreliable even in front-end code, we're screwed anyway.  That is a
>> non-objection.
>
> No, it isn't.  The fact that it works in pg_dump doesn't extrapolate
> to other places.  (In particular, it will absolutely not work in libpq,
> at least not in all the environments where libpq is supposed to work.)

Well, I didn't propose putting the assert machinery into libpq, but
FWIW fprintf(stderr, ...) is already used there -- see
defaultNoticeProcessor() for example.  It's also used in libpgport,
which is where we're proposing to put this code (see pgsymlink, for
example).  Furthermore, the code would only run in the event that
assertions are enabled (which normally only developers do), it
would only break if run as a service (which would be unusual when
debugging), and only in the case where the process was in the middle of
crashing anyway (in which case the worst case is that it crashes just
before printing the error message rather than just after).

However, if despite all that you're still worried about it, it appears
special handling is only needed to cover the case where code might be
running as a service, and apparently none of our utilities worry about
that right now except for pg_ctl, which handles it by defining a
frontend-safe version of write_stderr().  So perhaps we ought to merge
the frontend version in pg_ctl with the backend version and put the
result in libpgport (possibly renamed) and then ExceptionalCondition()
can go into pgport and simply call write_stderr() and it will get the
appropriate version depending on whether it's running in frontend or
backend code.  That's a tad more work than I was expecting to do here,
but I'm still not convinced that it's more than can be done in an
afternoon.
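
A sketch of what that merged libpgport function might look like (the
Windows branch is condensed; write_eventlog() stands in for the
ReportEvent() boilerplate that the backend and pg_ctl each carry today,
and pgwin32_is_service() is the backend's check, a frontend equivalent
of which would have to move here as well):

void
write_stderr(const char *fmt,...)
{
    va_list     ap;

    va_start(ap, fmt);
#ifndef WIN32
    vfprintf(stderr, fmt, ap);
#else
    /*
     * On Win32, print to stderr only if running on a console; when
     * running as a service, stderr goes nowhere, so use the eventlog.
     */
    if (pgwin32_is_service())
    {
        char        errbuf[2048];

        vsnprintf(errbuf, sizeof(errbuf), fmt, ap);
        write_eventlog(EVENTLOG_ERROR_TYPE, errbuf);
    }
    else
        vfprintf(stderr, fmt, ap);
#endif
    va_end(ap);
}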

> I think what we've got at the moment is something that's adequate for
> pg_dump, and that's all that it is.  Concluding that it can be used in
> all frontend code is way premature, and therefore I'm -1 on the idea
> of exposing it in non-pg_dump header files.

It seems desirable to share as much of the implementation with the
backend as possible, and moving the macros from postgres.h into their
own header file (which postgres.h would include) would allow that,
leaving it as the application's problem to provide
ExceptionalCondition(), which pg_dump can surely do.  If we don't do
at least that much, then we're going to end up either (1) removing the
assertions, which doesn't seem prudent, or (2) doing a cut-and-paste
from postgres.h into a pg_dump header file, which seems to have no
virtues whatsoever, or (3) having an implementation in pg_dump that is
completely separate from and different from what we have in the
backend, which seems even worse than the preceding option.  The
minimum needed to avoid creating a mess here is no more than moving
25 lines of code from postgres.h to a separate header file that
pg_dump can include.  I don't see why that's a big deal.  I also don't
see why we can't fix the problem more completely, perhaps along the
lines suggested above.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch for parallel pg_dump

From
Joachim Wieland
Date:
On Tue, Apr 3, 2012 at 11:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> OK, but it seems like a pretty fragile assumption that none of the
> workers will ever manage to emit any other error messages.  We don't
> rely on this kind of assumption in the backend, which is a lot
> better-structured and less spaghetti-like than the pg_dump code.

Yeah, but even if they don't use exit_horribly(), the user would still
see the output; stdout/stderr aren't closed, and every worker can still
write to them.

As a test, I printed out some messages upon seeing a specific dump id
in a worker:
        if (strcmp(command, "DUMP 3540") == 0)
        {
            write_msg(NULL, "Info 1\n");
            printf("Info 2\n");
            exit_horribly(modulename, "that's why\n");
        }


$ ./pg_dump -j 7 ...
pg_dump: Info 1
Info 2
pg_dump: [parallel archiver] that's why

        if (strcmp(command, "DUMP 3540") == 0)
        {
            write_msg(NULL, "Info 1\n");
            printf("Info 2\n");
            fprintf(stderr, "exiting on my own\n");
            exit(1);
        }


$ ./pg_dump -j 7 ...
pg_dump: Info 1
Info 2
exiting on my own
pg_dump: [parallel archiver] A worker process died unexpectedly