Thread: pg_dump and pgpool

pg_dump and pgpool

From
Scott Marlowe
Date:
I've noticed today that if one tries to pg_dump a database cluster
running under pgpool, one gets the error message:

pg_dump: query to get table columns failed: ERROR:  kind mismatch
between backends
HINT:  check data consistency between master and secondary

Looking at the SQL that pg_dump sends, it seems to be relying on object
OIDs for the dump.  Would it be reasonable at some point to redesign
pg_dump so that whatever query gets the OID is joined with the query that
uses it as WHERE criteria, so that the OID itself is never returned to the
client?
Would there be any interest in such changes, were they made, making
their way into the backend?
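
To make that concrete, here is roughly the shape of what I mean.  This is a
simplification of what pg_dump actually sends, with made-up object names and
numbers, but today it is essentially a two-step dance where the OID comes back
to the client:

-- step 1: fetch the table's OID
SELECT c.oid
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relname = 'test';

-- step 2: the client pastes that OID into the next query
SELECT a.attname, format_type(a.atttypid, a.atttypmod) AS atttype, a.attnum
FROM pg_attribute a
WHERE a.attrelid = 17144   -- the OID fetched in step 1
  AND a.attnum > 0 AND NOT a.attisdropped
ORDER BY a.attnum;

What I'm imagining is folding the two together so the OID never leaves the
server:

SELECT a.attname, format_type(a.atttypid, a.atttypmod) AS atttype, a.attnum
FROM pg_attribute a
JOIN pg_class c ON c.oid = a.attrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relname = 'test'
  AND a.attnum > 0 AND NOT a.attisdropped
ORDER BY a.attnum;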

Does my question even make sense?  It's been a long couple of weeks...

Re: pg_dump and pgpool

From
Tom Lane
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> I've noticed today that if one tries to pg_dump a database cluster
> running under pgpool, one gets the error message:

> pg_dump: query to get table columns failed: ERROR:  kind mismatch
> between backends
> HINT:  check data consistency between master and secondary

I would like to know exactly what pgpool has done to break pg_dump.

> Looking at the SQL that pg_dump sends, it seems to be relying on object
> OIDs for the dump.

There is nothing wrong with that technique ...

            regards, tom lane

Re: pg_dump and pgpool

From
Scott Marlowe
Date:
On Wed, 2004-12-29 at 16:11, Tom Lane wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > I've noticed today that if one tries to pg_dump a database cluster
> > running under pgpool, one gets the error message:
>
> > pg_dump: query to get table columns failed: ERROR:  kind mismatch
> > between backends
> > HINT:  check data consistency between master and secondary
>
> I would like to know exactly what pgpool has done to break pg_dump.

What's happening is that there are two databases behind pgpool, and each
has managed to assign a different (set of) OID(s) to the table(s).  So,
when pg_dump asks for an OID, it gets two different ones.
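
For example, asking each backend directly for the same table shows the
pattern (the numbers here are made up):

SELECT oid FROM pg_class WHERE relname = 'test';
-- the master returns, say, 17144; the secondary returns, say, 18436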

> > Looking at the SQL that pg_dump sends, it seems to be relying on object
> > OIDs for the dump.
>
> There is nothing wrong with that technique ...

What I'd want is for it to use a join IN the database so that at no time
are OIDs floating across the ether or seen by pg_dump.

Re: pg_dump and pgpool

From
Tom Lane
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> On Wed, 2004-12-29 at 16:11, Tom Lane wrote:
>> I would like to know exactly what pgpool has done to break pg_dump.

> What's happening is that there are two databases behind pgpool, and each
> has managed to assign a different (set of) OID(s) to the table(s).  So,
> when pg_dump asks for an OID, it gets two different ones.

Mph.  I'd have zero confidence in a dump taken under such circumstances,
anyway.  If pgpool can't duplicate OIDs (which I agree it probably
can't) then it's unlikely to be able to make the databases match closely
enough to satisfy pg_dump in other ways.  I'd worry about
synchronization issues to start with...

I don't think we should make pg_dump slower and possibly less reliable
in order to support a fundamentally dangerous administration procedure.
Run pg_dump directly into the database, not through pgpool.
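
(That is, point it at a backend's own host and port rather than at pgpool,
something along the lines of

pg_dump -h one-of-the-backend-hosts -p 5432 yourdb >yourdb.dump

with the host and database names adjusted to suit.)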

            regards, tom lane

Re: pg_dump and pgpool

From
Scott Marlowe
Date:
On Wed, 2004-12-29 at 16:33, Tom Lane wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > On Wed, 2004-12-29 at 16:11, Tom Lane wrote:
> >> I would like to know exactly what pgpool has done to break pg_dump.
>
> > What's happening is that there are two databases behind pgpool, and each
> > has managed to assign a different (set of) OID(s) to the table(s).  So,
> > when pg_dump asks for an OID, it gets two different ones.
>
> Mph.  I'd have zero confidence in a dump taken under such circumstances,
> anyway.  If pgpool can't duplicate OIDs (which I agree it probably
> can't) then it's unlikely to be able to make the databases match closely
> enough to satisfy pg_dump in other ways.

Such as?

> I'd worry about
> synchronization issues to start with...

I am not worried about that.  As long as I'm not doing things like
inserting random() into the database, the data in the two backend stores
is coherent.

> I don't think we should make pg_dump slower and possibly less reliable
> in order to support a fundamentally dangerous administration procedure.
> Run pg_dump directly into the database, not through pgpool.

What makes you think this would be slower?  If anything, it would be
faster or as fast, since we're throwing fewer queries and at the same
time, hiding the implementation details that OIDs are.

Or is it technically impossible to dump data from PostgreSQL reliably
without relying on OIDs to get the right table at the right time?

Re: pg_dump and pgpool

From
Tom Lane
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> On Wed, 2004-12-29 at 16:33, Tom Lane wrote:
>> I'd worry about
>> synchronization issues to start with...

> I am not worried about that.  As long as I'm not doing things like
> inserting random() into the database, the data in the two backend stores
> is coherent.

For sufficiently small values of "coherent", sure, but I am not prepared
to buy into the notion that pg_dump cannot examine the database contents
more closely than the stupidest user application will ;-).

Also, let's play devil's advocate and assume that the master and slave
*have* managed to get out of sync somehow.  Do you want your backup to
be a true picture of the master's state, or an unpredictable
amalgamation of the master and slave states?  Heaven help you if you
need to use the backup to recover from the out-of-sync condition.

>> I don't think we should make pg_dump slower and possibly less reliable
>> in order to support a fundamentally dangerous administration procedure.
>> Run pg_dump directly into the database, not through pgpool.

> What makes you think this would be slower?  If anything, it would be
> faster or as fast, since we're throwing fewer queries and at the same
> time, hiding the implementation details that OIDs are.

No, we'd be throwing more, and more complex, queries.  Instead of a
simple lookup there would be some kind of join, or at least a lookup
that uses a multicolumn key.

There are also correctness issues to think about.  OID *is* the primary
key on most of the system catalogs pg_dump is looking at, and not all of
them have other unique keys.  Doing anything useful with pg_depend or
pg_description without explicitly looking at OIDs would be darn painful,
too.
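
To illustrate with pg_depend (this query is only an example): everything
interesting it hands back is an OID, and which catalog each one points into
depends on yet another OID column, so you cannot even join sensibly without
looking at them:

SELECT d.classid::regclass    AS dependent_catalog,
       d.objid,
       d.refclassid::regclass AS referenced_catalog,
       d.refobjid,
       d.deptype
FROM pg_depend d
LIMIT 5;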

            regards, tom lane

Re: pg_dump and pgpool

From
Scott Marlowe
Date:
On Wed, 2004-12-29 at 16:56, Tom Lane wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > On Wed, 2004-12-29 at 16:33, Tom Lane wrote:
> >> I'd worry about
> >> synchronization issues to start with...
>
> > I am not worried about that.  As long as I'm not doing things like
> > inserting random() into the database, the data in the two backend stores
> > is coherent.
>
> For sufficiently small values of "coherent", sure, but I am not prepared
> to buy into the notion that pg_dump cannot examine the database contents
> more closely than the stupidest user application will ;-).

Sounds a bit like verbal handwaving here.

> Also, let's play devil's advocate and assume that the master and slave
> *have* managed to get out of sync somehow.  Do you want your backup to
> be a true picture of the master's state, or an unpredictable
> amalgamation of the master and slave states?  Heaven help you if you
> need to use the backup to recover from the out-of-sync condition.

Actually, given the operational mode pgpool is in, it would error out
here, since it is basically issuing all queries, select or otherwise, to
both backends, and comparing what comes back.  If it's not the same, the
connection is dropped and the transaction rolled back.

Hence the current problem where the dump fails.  I.e. we're getting two
different OIDs and pgpool is barfing on that.

>
> >> I don't think we should make pg_dump slower and possibly less reliable
> >> in order to support a fundamentally dangerous administration procedure.
> >> Run pg_dump directly into the database, not through pgpool.
>
> > What makes you think this would be slower?  If anything, it would be
> > faster or as fast, since we're throwing fewer queries and at the same
> > time, hiding the implementation details that OIDs are.
>
> No, we'd be throwing more, and more complex, queries.  Instead of a
> simple lookup there would be some kind of join, or at least a lookup
> that uses a multicolumn key.

I'm willing to bet the performance difference is less than noise.

> There are also correctness issues to think about.  OID *is* the primary
> key on most of the system catalogs pg_dump is looking at, and not all of
> them have other unique keys.  Doing anything useful with pg_depend or
> pg_description without explicitly looking at OIDs would be darn painful,
> too.

Don't we always preach to users to NEVER use OIDs as PKs in their apps?
:-)  Seriously though, I get your point.  What I want to know is if it's
possible to do this without passing OIDs back and forth.  Keep in mind,
I don't mind there being OIDs, I'd just like to have all the references
to them be hidden from the backup agent by joins et al.

Re: pg_dump and pgpool

From
Tom Lane
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> On Wed, 2004-12-29 at 16:56, Tom Lane wrote:
>> No, we'd be throwing more, and more complex, queries.  Instead of a
>> simple lookup there would be some kind of join, or at least a lookup
>> that uses a multicolumn key.

> I'm willing to bet the performance difference is less than noise.

[ shrug... ]  I don't have a good handle on that, and neither do you.
What I am quite sure about though is that pg_dump would become internally
a great deal messier and harder to maintain if it couldn't use OIDs.
Look at the DumpableObject manipulations and ask yourself what you're
going to do instead if you have to use a primary key that is of a
different kind (different numbers of columns and datatypes) for each
system catalog.  Ugh.

I don't think it's worth that price to support a fundamentally bogus
approach to backup.  IMHO you don't want extra layers of software in
between pg_dump and the database --- each one just introduces another
risk of getting a wrong backup.  You've yet to explain what the
*benefit* of putting pgpool in there is for this problem.

            regards, tom lane

Re: pg_dump and pgpool

From
Greg Stark
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:

> What's happening is that there are two databases behind pgpool, and each
> has managed to assign a different (set of) OID(s) to the table(s).  So,
> when pg_dump asks for an OID, it gets two different ones.

If pgpool is so good at maintaining consistency between databases how did they
end up with different OIDs?

It seems you really do have inconsistent databases and are asking for pg_dump
to be robust against that. If the databases really are inconsistent, isn't the
correct behaviour to cause precisely this error? Wouldn't hiding the
inconsistency only be doing you a disservice?

I think you should be trying to figure out why the databases are inconsistent
and working to figure out what you have to change to avoid whatever actions
caused that.

--
greg

Re: pg_dump and pgpool

From
Scott Marlowe
Date:
On Wed, 2004-12-29 at 23:12, Greg Stark wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
>
> > What's happening is that there are two databases behind pgpool, and each
> > has managed to assign a different (set of) OID(s) to the table(s).  So,
> > when pg_dump asks for an OID, it gets two different ones.
>
> If pgpool is so good at maintaining consistency between databases how did they
> end up with different OIDs?

That's rather disingenuous, considering that OIDs are more of an
internal artifact of the databases, while the USER data stored therein
is what I, or anyone else, would consider the word consistency applies
to.  The fact that both databases have different OIDs for the same
objects has nothing to do with the userland data being consistent or
not.

> It seems you really do have inconsistent databases and are asking for pg_dump
> to be robust against that.

No. I have a perfectly consistent database.  It happens to reside on a
cluster of two machines which have different internal IDs assigned to
the same objects, and when I throw bog-standard SQL at them, I get
the same answer from both.  That is consistent.  The fact that
PostgreSQL has different OIDs underneath is an implementation quirk that
I, as a user, shouldn't really have to worry about or even notice.
IMHO.

> Wouldn't hiding the
> inconsistency only be doing you a disservice?

If they were inconsistent, then certainly it would.  But again, they're
NOT inconsistent.  You've built your argument on a false premise.

> I think you should be trying to figure out why the databases are inconsistent
> and working to figure out what you have to change to avoid whatever actions
> caused that.

I shouldn't have to care what the OIDs used internally are.

Users are consistently warned to never use OIDs as PKs, yet PostgreSQL
the database does just that.  My data is coherent.  I'll explain more in
my reply to Tom Lane...

Re: pg_dump and pgpool

From
Scott Marlowe
Date:
On Wed, 2004-12-29 at 17:30, Tom Lane wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > On Wed, 2004-12-29 at 16:56, Tom Lane wrote:
> >> No, we'd be throwing more, and more complex, queries.  Instead of a
> >> simple lookup there would be some kind of join, or at least a lookup
> >> that uses a multicolumn key.
>
> > I'm willing to bet the performance difference is less than noise.
>
> [ shrug... ]  I don't have a good handle on that, and neither do you.
> What I am quite sure about though is that pg_dump would become internally
> a great deal messier and harder to maintain if it couldn't use OIDs.
> Look at the DumpableObject manipulations and ask yourself what you're
> going to do instead if you have to use a primary key that is of a
> different kind (different numbers of columns and datatypes) for each
> system catalog.  Ugh.

Wait, do you mean it's impossible to throw a single SQL query with a
proper join clause that USES OIDs but doesn't return them?  Or that it's
impossible to throw a single query without joining on OIDs?  I don't
mind joining on OIDs, I just don't want them crossing the connection is
all.  And yes, it might be ugly, but I can't imagine it being
unmaintainable for some reason.

> I don't think it's worth that price to support a fundamentally bogus
> approach to backup.

But it's not bogus.  It allows me to compare two databases running under
a pgpool synchronous cluster and KNOW if there are inconsistencies in
data between them, so it is quite useful to me.

> IMHO you don't want extra layers of software in
> between pg_dump and the database --- each one just introduces another
> risk of getting a wrong backup.  You've yet to explain what the
> *benefit* of putting pgpool in there is for this problem.

Actually, it ensures that I get the right backup, because pgpool will
cause the backup to fail if there are any differences between the two
backend servers, thus telling me that I have an inconsistency.

That's the primary reason I want this.  The secondary reason, which I
can work around, is that I'm running the individual databases on
machines that only answer to the pgpool machine's IP, so
remote backups aren't possible, and only the pgpool machine would be
capable of doing the backups, but we have (like so many other companies)
a centralized backup server.  I can always allow that machine to connect
to the database(s) to do backup, but my fear is that by allowing
anything other than pgpool to hit those backend databases they could be
placed out of sync with each other.  Admittedly, a backup process
shouldn't be updating the database, so this, as I said, isn't really a
big deal.  More of a mild kink really.  As long as all access is
happening through pgpool, they should stay coherent to each other.

Re: pg_dump and pgpool

From
Tatsuo Ishii
Date:
> I've noticed today that if one tries to pg_dump a database cluster
> running under pgpool, one gets the error message:
>
> pg_dump: query to get table columns failed: ERROR:  kind mismatch
> between backends
> HINT:  check data consistency between master and secondary
>
> Looking at the SQL that pg_dump sends, it seems to be relying on object
> OIDs for the dump.  Would it be reasonable at some point to redesign
> pg_dump so that whatever query gets the OID is joined with the query that
> uses it as WHERE criteria, so that the OID itself is never returned to the
> client?
> Would there be any interest in such changes, were they made, making
> their way into the backend?
>
> Does my question even make sense?  It's been a long couple of weeks...

OIDs may not be replicated exactly the same when using pgpool.  This is
explained on the pgpool web page:

"Please note that certain queries are physically dependent to a server
or non-deterministic.  These include random functions, OID, XID, and
timestamps may not be replicated in exactly same value among two
servers."

This can happen due to timing differences when multiple sessions run
through pgpool.  Suppose session A issues "INSERT INTO foo VALUES(1, 100)"
while session B issues "INSERT INTO foo VALUES(2, 101)" simultaneously, and
the OID counter value is 100000 on both the master and the secondary.  The
following situation could happen:

master executes INSERT INTO foo VALUES(1, 100) and assigns OID 100001
master executes INSERT INTO foo VALUES(2, 101) and assigns OID 100002
secondary executes INSERT INTO foo VALUES(2, 101) and assigns OID 100001
secondary executes INSERT INTO foo VALUES(1, 100) and assigns OID 100002

As a result, on the master row(1, 100)'s OID is 100001 and row(2,
101)'s OID is 100002, while on the secondary row(1, 100)'s OID is
100002 and row(2, 101)'s OID is 100001.

(Note, however, that data consistency within the master or the secondary
has not been broken, so you can still get a consistent dump by connecting
directly to the master or the secondary.)

One way to prevent the problem above is to take a lock on the OID
counter.  Of course there is no such lock in PostgreSQL, so we would need
a substitute, for example locking pg_database (I know this is a horrible
idea).
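
Roughly, something like this around every insert (superuser only, and it
serializes all sessions, which is why it is horrible):

BEGIN;
-- stand-in for the nonexistent "OID counter lock"
LOCK TABLE pg_database IN ACCESS EXCLUSIVE MODE;
INSERT INTO foo VALUES(1, 100);
COMMIT;

Because every session has to queue on the same lock, both backends would
then assign OIDs to the inserts in the same order.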

BTW, I think there is no replication tool in the world that can
replicate OIDs.  So applications such as pg_dump, psql or whatever cannot
issue system-catalog-related queries to replicated servers in a
load-balancing manner to gain better performance.
--
Tatsuo Ishii

Re: pg_dump and pgpool

From
Tom Lane
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:
>> I don't think it's worth that price to support a fundamentally bogus
>> approach to backup.

> But it's not bogus.  It allows me to compare two databases running under
> a pgpool synchronous cluster and KNOW if there are inconsistencies in
> data between them, so it is quite useful to me.

As a data comparison tool it is certainly bogus.  What about different
row ordering between the two databases, for instance?

AFAICS this could only work if you were doing physical rather than
logical replication (eg, shipping WAL logs) in which case the OIDs would
be just as much in sync as everything else.

Basically my point is that you are proposing to do a lot of work in
order to solve the first problem you are running up against, but that
will only get you to the next problem.  I'm not prepared to accept a
significant increase in complexity and loss of maintainability in
pg_dump in order to move one step closer to the dead end that you will
certainly hit.

            regards, tom lane

Re: pg_dump and pgpool

From
Tatsuo Ishii
Date:
> On Wed, 2004-12-29 at 17:30, Tom Lane wrote:
> > Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > > On Wed, 2004-12-29 at 16:56, Tom Lane wrote:
> > >> No, we'd be throwing more, and more complex, queries.  Instead of a
> > >> simple lookup there would be some kind of join, or at least a lookup
> > >> that uses a multicolumn key.
> >
> > > I'm willing to bet the performance difference is less than noise.
> >
> > [ shrug... ]  I don't have a good handle on that, and neither do you.
> > What I am quite sure about though is that pg_dump would become internally
> > a great deal messier and harder to maintain if it couldn't use OIDs.
> > Look at the DumpableObject manipulations and ask yourself what you're
> > going to do instead if you have to use a primary key that is of a
> > different kind (different numbers of columns and datatypes) for each
> > system catalog.  Ugh.
>
> Wait, do you mean it's impossible to throw a single SQL query with a
> proper join clause that USES OIDs but doesn't return them?  Or that it's
> impossible to throw a single query without joining on OIDs?  I don't
> mind joining on OIDs, I just don't want them crossing the connection is
> all.  And yes, it might be ugly, but I can't imagine it being
> unmaintainable for some reason.
>
> > I don't think it's worth that price to support a fundamentally bogus
> > approach to backup.
>
> But it's not bogus.  It allows me to compare two databases running under
> a pgpool synchronous cluster and KNOW if there are inconsistencies in
> data between them, so it is quite useful to me.
>
> > IMHO you don't want extra layers of software in
> > between pg_dump and the database --- each one just introduces another
> > risk of getting a wrong backup.  You've yet to explain what the
> > *benefit* of putting pgpool in there is for this problem.
>
> Actually, it ensures that I get the right backup, because pgpool will
> cause the backup to fail if there are any differences between the two
> backend servers, thus telling me that I have an inconsistency.
>
> That's the primary reason I want this.  The secondary reason, which I
> can work around, is that I'm running the individual databases on
> machines that only answer to the pgpool machine's IP, so
> remote backups aren't possible, and only the pgpool machine would be
> capable of doing the backups, but we have (like so many other companies)
> a centralized backup server.  I can always allow that machine to connect
> to the database(s) to do backup, but my fear is that by allowing
> anything other than pgpool to hit those backend databases they could be
> placed out of sync with each other.  Admittedly, a backup process
> shouldn't be updating the database, so this, as I said, isn't really a
> big deal.  More of a mild kink really.  As long as all access is
> happening through pgpool, they should stay coherent to each other.

Pgpool could be modified so that it has a "no SELECT replication" mode,
where pgpool runs SELECTs only on the master server.  I could do this if
you think it's useful.

However, the problem is that pg_dump is not only running SELECTs but also
modifying the database (counting up the OID counter), i.e. it creates
temporary tables.  Is this a problem for you?
--
Tatsuo Ishii

Re: pg_dump and pgpool

From
Greg Stark
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:

> On Wed, 2004-12-29 at 23:12, Greg Stark wrote:
> > Scott Marlowe <smarlowe@g2switchworks.com> writes:
> >
> > > What's happening is that there are two databases behind pgpool, and each
> > > has managed to assign a different (set of) OID(s) to the table(s).  So,
> > > when pg_dump asks for an OID, it gets two different ones.
> >
> > If pgpool is so good at maintaining consistency between databases how did they
> > end up with different OIDs?
>
> That's rather disingenuous, considering that OIDs are more of an
> internal artifact of the databases, while the USER data stored therein
> is what I, or anyone else, would consider the word "consistency" to
> apply to.  The fact that both databases have different OIDs for the same
> objects has nothing to do with the userland data being consistent or
> not.

It's not like the database uses random() to generate OIDs. To reach different
OIDs you would have had to issue a different sequence of DDL statements.

It's a bit of a pain since there's no sure way to resync the databases using
DDL. But surely if you dumped one of the copies and restored a clean copy of
the database on both machines they would end up with consistent OIDs?

Scott Marlowe <smarlowe@g2switchworks.com> writes:

> No. I have a perfectly consistent database.  It happens to reside on a
> cluster of two machines which have different internal IDs assigned to
> the same objects, and when I throw bog-standard SQL at them, I get
> the same answer from both.  That is consistent.

But you're not throwing BOG-standard SQL at them, you're running pg_dump
against them which is using non-BOG-standard SQL. If you want pg_dump to work
against them I think you need to keep them consistent at a lower level.

> Users are consistently warned to never use OIDs as PKs, yet PostgreSQL
> the database does just that.  My data is coherent.  I'll explain more in
> my reply to Tom Lane...

If Postgres used sequences then your sequences would be out of sync. The point
is that at the level pg_dump is working the databases really are inconsistent.

Perhaps one day pg_dump could be reimplemented entirely in terms of
information_schema where the inconsistencies happen to be hidden. But I doubt
it can be today.  And I suspect you could arrive at an inconsistent
information_schema if you use different DDL anyway.

--
greg

Re: pg_dump and pgpool

From
Scott Marlowe
Date:
On Thu, 2004-12-30 at 09:46, Tatsuo Ishii wrote:
> > On Wed, 2004-12-29 at 17:30, Tom Lane wrote:
> > > Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > > > On Wed, 2004-12-29 at 16:56, Tom Lane wrote:
> > > >> No, we'd be throwing more, and more complex, queries.  Instead of a
> > > >> simple lookup there would be some kind of join, or at least a lookup
> > > >> that uses a multicolumn key.
> > >
> > > > I'm willing to bet the performance difference is less than noise.
> > >
> > > [ shrug... ]  I don't have a good handle on that, and neither do you.
> > > What I am quite sure about though is that pg_dump would become internally
> > > a great deal messier and harder to maintain if it couldn't use OIDs.
> > > Look at the DumpableObject manipulations and ask yourself what you're
> > > going to do instead if you have to use a primary key that is of a
> > > different kind (different numbers of columns and datatypes) for each
> > > system catalog.  Ugh.
> >
> > Wait, do you mean it's impossible to throw a single SQL query with a
> > proper join clause that USES OIDs but doesn't return them?  Or that it's
> > impossible to throw a single query without joining on OIDs?  I don't
> > mind joining on OIDs, I just don't want them crossing the connection is
> > all.  And yes, it might be ugly, but I can't imagine it being
> > unmaintainable for some reason.
> >
> > > I don't think it's worth that price to support a fundamentally bogus
> > > approach to backup.
> >
> > But it's not bogus.  It allows me to compare two databases running under
> > a pgpool synchronous cluster and KNOW if there are inconsistencies in
> > data between them, so it is quite useful to me.
> >
> > > IMHO you don't want extra layers of software in
> > > between pg_dump and the database --- each one just introduces another
> > > risk of getting a wrong backup.  You've yet to explain what the
> > > *benefit* of putting pgpool in there is for this problem.
> >
> > Actually, it ensures that I get the right backup, because pgpool will
> > cause the backup to fail if there are any differences between the two
> > backend servers, thus telling me that I have an inconsistency.
> >
> > That's the primary reason I want this.  The secondary reason, which I
> > can work around, is that I'm running the individual databases on
> > machines that only answer to the pgpool machine's IP, so
> > remote backups aren't possible, and only the pgpool machine would be
> > capable of doing the backups, but we have (like so many other companies)
> > a centralized backup server.  I can always allow that machine to connect
> > to the database(s) to do backup, but my fear is that by allowing
> > anything other than pgpool to hit those backend databases they could be
> > placed out of sync with each other.  Admittedly, a backup process
> > shouldn't be updating the database, so this, as I said, isn't really a
> > big deal.  More of a mild kink really.  As long as all access is
> > happening through pgpool, they should stay coherent to each other.
>
> Pgpool could be modified so that it has a "no SELECT replication" mode,
> where pgpool runs SELECTs only on the master server.  I could do this if
> you think it's useful.
>
> However, the problem is that pg_dump is not only running SELECTs but also
> modifying the database (counting up the OID counter), i.e. it creates
> temporary tables.  Is this a problem for you?

Does it?  I didn't know it used temp tables.  It's not that big of a
deal, and I'm certain I can work around it.  I just really like the idea
of a cluster of pg servers running synchronously behind a redirector and
looking, for all the world, like one database.  But I think it would
take log shipping for it to work the way I'm envisioning.  I'd much
rather see work go into making pgpool run atop >2 servers than this
exercise in (_very_) likely futility.

Re: pg_dump and pgpool

From
Scott Marlowe
Date:
On Thu, 2004-12-30 at 09:20, Tom Lane wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> >> I don't think it's worth that price to support a fundamentally bogus
> >> approach to backup.
>
> > But it's not bogus.  It allows me to compare two databases running under
> > a pgpool synchronous cluster and KNOW if there are inconsistencies in
> > data between them, so it is quite useful to me.
>
> As a data comparison tool it is certainly bogus.  What about different
> row ordering between the two databases, for instance?

Apparently pgpool knows that different order is ok.  Having three psql's
open, one to the front end pgpool, one to each of the backends, I can
insert data in different orders on each backend, select it on each, and
get a different order, but from the front end it works:

on the MASTER database:
test=# \d
        List of relations
 Schema | Name | Type  |  Owner
--------+------+-------+----------
 public | test | table | postgres
(1 row)

test=# insert into test values (2);
INSERT 11839388 1
test=# insert into test values (1);
INSERT 11839389 1
test=# select * from test;
 id
----
  2
  1
(2 rows)

on the SLAVE database:
test=# \d
        List of relations
 Schema | Name | Type  |  Owner
--------+------+-------+----------
 public | test | table | postgres
(1 row)

test=# insert into test values (1);
INSERT 13612414 1
test=# insert into test values (2);
INSERT 13612415 1
test=# select * from test;
 id
----
  1
  2
(2 rows)

On the front end:
test=# select * from test;
 id
----
  2
  1
(2 rows)

Now I add a wrong row to the slave database:

test=# insert into test values (3);
INSERT 13612416 1

and I get this error from the front end:
test=# select * from test;
ERROR:  kind mismatch between backends
HINT:  check data consistency between master and secondary
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
After deleting the row, things return to normal:
test=# delete from test where id=3;
DELETE 1
From the front end I get:
test=# select * from test;
 id
----
  2
  1
(2 rows)

> AFAICS this could only work if you were doing physical rather than
> logical replication (eg, shipping WAL logs) in which case the OIDs would
> be just as much in sync as everything else.

So, for me, the OIDs are the ONLY problem I'm getting here.  Note that
the application we're running on the front end only connects to the
database with a single thread, and serializes in the intermediate layer
(not my choice, but it seems to work pretty well so far...) so sequences
also aren't an issue, as all the queries will go in one at a time.

> Basically my point is that you are proposing to do a lot of work in
> order to solve the first problem you are running up against, but that
> will only get you to the next problem.  I'm not prepared to accept a
> significant increase in complexity and loss of maintainability in
> pg_dump in order to move one step closer to the dead end that you will
> certainly hit.

I'm certainly willing to do the vast majority of the work.  As Greg I
think mentioned, maybe a fresh start using the information_schema would
make sense as a sort of non-pg specific backup tool or something.

Re: pg_dump and pgpool

From
Tatsuo Ishii
Date:
> On Thu, 2004-12-30 at 09:20, Tom Lane wrote:
> > Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > >> I don't think it's worth that price to support a fundamentally bogus
> > >> approach to backup.
> >
> > > But it's not bogus.  It allows me to compare two databases running under
> > > a pgpool synchronous cluster and KNOW if there are inconsistencies in
> > > data between them, so it is quite useful to me.
> >
> > As a data comparison tool it is certainly bogus.  What about different
> > row ordering between the two databases, for instance?
>
> Apparently pgpool knows that different order is ok.

I think pgpool actually behaves differently from what you expect.
pgpool just ignores the content of data. Let me show you an example.

on the master:

test=# select * from t1;
 i
---
 1
 2

on the secondary:
test=# select * from t1;
 i
---
 1
 3
(2 rows)

result from pgpool:

test=# select * from t1;
 i
---
 1
 2

However it checks the packet length. Here is another example.

on the master:

test=# select * from t2;
  t
-----
 abc
(1 row)

on the secondary:

test=# select * from t2;
  t
------
 abcd
(1 row)

result from pgpool:

test=# select * from t2;
  t
-----
 abc
(1 row)

LOG: pid 1093: SimpleForwardToFrontend: length does not match between backends master(13) secondary(14) kind:(D)

>  Having three psql's
> open, one to the front end pgpool, one to each of the backends, I can
> insert data in different orders on each backend, select it on each, and
> get a different order, but from the front end it works:
>
> on the MASTER database:
> test=# \d
>         List of relations
>  Schema | Name | Type  |  Owner
> --------+------+-------+----------
>  public | test | table | postgres
> (1 row)
>
> test=# insert into test values (2);
> INSERT 11839388 1
> test=# insert into test values (1);
> INSERT 11839389 1
> test=# select * from test;
>  id
> ----
>   2
>   1
> (2 rows)
>
> on the SLAVE database:
> test=# \d
>         List of relations
>  Schema | Name | Type  |  Owner
> --------+------+-------+----------
>  public | test | table | postgres
> (1 row)
>
> test=# insert into test values (1);
> INSERT 13612414 1
> test=# insert into test values (2);
> INSERT 13612415 1
> test=# select * from test;
>  id
> ----
>   1
>   2
> (2 rows)
>
> On the front end:
> test=# select * from test;
>  id
> ----
>   2
>   1
> (2 rows)
>
> Now I add a wrong row to the slave database:
>
> test=# insert into test values (3);
> INSERT 13612416 1
>
> and I get this error from the front end:
> test=# select * from test;
> ERROR:  kind mismatch between backends
> HINT:  check data consistency between master and secondary
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Succeeded.
> After deleting the row, things return to normal:
> test=# delete from test where id=3;
> DELETE 1
> From the front end I get:
> test=# select * from test;
>  id
> ----
>   2
>   1
> (2 rows)
>
> > AFAICS this could only work if you were doing physical rather than
> > logical replication (eg, shipping WAL logs) in which case the OIDs would
> > be just as much in sync as everything else.
>
> So, for me, the OIDs are the ONLY problem I'm getting here.  Note that
> the application we're running on the front end only connects to the
> database with a single thread, and serializes in the intermediate layer
> (not my choice, but it seems to work pretty well so far...) so sequences
> also aren't an issue, as all the queries will go in one at a time.

I think in this case the row ordering problem will not hurt you.

> > Basically my point is that you are proposing to do a lot of work in
> > order to solve the first problem you are running up against, but that
> > will only get you to the next problem.  I'm not prepared to accept a
> > significant increase in complexity and loss of maintainability in
> > pg_dump in order to move one step closer to the dead end that you will
> > certainly hit.
>
> I'm certainly willing to do the vast majority of the work.  As Greg I
> think mentioned, maybe a fresh start using the information_schema would
> make sense as a sort of non-pg specific backup tool or something.

Re: pg_dump and pgpool

From
Tom Lane
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> I'm certainly willing to do the vast majority of the work.  As Greg I
> think mentioned, maybe a fresh start using the information_schema would
> make sense as a sort of non-pg specific backup tool or something.

This is a dead end.

First, the information_schema only shows you the intersection between
what Postgres can do and what SQL can do; that leaves out way too much
to make a tool that anyone will want to use in practice.  For instance,
you can't determine which columns are SERIAL columns.  Worse, there are
cases where you can't tell what's going on from the information_schema
views because the views are designed around assumptions that don't hold,
such as constraint names being unique schema-wide rather than just
table-wide.
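
For instance (illustrative query), a serial column comes out of
information_schema.columns looking just like a hand-made integer column
with a default expression:

SELECT column_name, data_type, column_default
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'test';
-- the serial column shows data_type 'integer' and a column_default of
-- nextval(...), exactly as if you had declared it by hand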

Second, there are protection problems.  There are numerous cases in
which the information_schema views will show info only to the owner of
an object, and not to anyone else, not even superusers.  This may be a
bug, or it may not be ... I'm not convinced whether the SQL spec
requires the views to work that way.  But it will certainly cripple the
usefulness of a dump tool if even running it as superuser doesn't ensure
you get everything.

Third, as a means for avoiding any dependency on OIDs, this does not
work.  Check out the "specific_name" columns of some of the views.
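
For instance (illustrative query):

SELECT routine_name, specific_name
FROM information_schema.routines
WHERE routine_schema = 'public';
-- specific_name is the routine name with its pg_proc OID tacked on the
-- end, so the OID dependency comes right back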

            regards, tom lane