Thread: Refactor pg_dump as a library?
Would anybody else be interested in a pg_dump library? I've found a thread from 2013 related to the idea, but the discussion came to nothing. Thread started here: http://www.postgresql.org/message-id/71e01949.2e16b.13df4707405.Coremail.shuai900217@126.com

My motivation: I'm the developer of a PostgreSQL GUI client, and I am looking for ways to integrate pg_dump into my application. The main use case would be to get the dump of individual tables (for example, when you want to create a table similar to an existing one).

Bundling pg_dump with my application and calling it doesn't allow the fine-grained control and integration I would like to have. Also, pg_dump always opens a new connection; I would prefer to use an existing database connection instead.

In case anybody else is interested in this, I can offer to sponsor some developer time towards this effort.

Best regards,
Jakob
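[Editor's note: for concreteness, here is a purely hypothetical sketch of the kind of interface being asked for. None of these names exist in PostgreSQL or in any shipped library; they only illustrate the two requirements above, fine-grained control (dump a single table) and reuse of an existing libpq connection.]

/*
 * Hypothetical "libpgdump" interface sketch -- every identifier below is
 * made up for illustration; no such header ships with PostgreSQL.
 */
#include <stdbool.h>
#include <libpq-fe.h>

typedef struct PGDumpOptions
{
    bool        schema_only;        /* omit table data */
    bool        include_privileges; /* emit GRANT/REVOKE */
    const char *schema;             /* restrict to one schema, or NULL */
    const char *table;              /* restrict to one table, or NULL */
} PGDumpOptions;

/* Caller-supplied sink that receives the generated SQL text in chunks. */
typedef int (*PGDumpSink) (const char *chunk, size_t len, void *private_data);

/* Dump the selected objects over an existing connection instead of opening a new one. */
extern int pg_dump_objects(PGconn *existing_conn,
                           const PGDumpOptions *opts,
                           PGDumpSink sink,
                           void *private_data);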
On 14 April 2016 at 18:16, Jakob Egger <jakob@eggerapps.at> wrote:
> Would anybody else be interested in a pg_dump library?
Yes. It comes up a lot, but so far nobody's had the time and energy to actually do the required refactoring.
> I'm the developer of a PostgreSQL GUI client, and I am looking for ways to integrate pg_dump into my application. The main use case would be to get the dump of individual tables (for example, when you want to create a table similar to an existing one)
I'd find a pg_get_tabledef(...) built-in function more interesting for this particular purpose than pg_dump as a library would be. We already have pg_get_viewdef(...), pg_get_functiondef(...) etc.
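[Editor's note: a minimal libpq snippet illustrating the suggestion. The pg_get_viewdef() call exists today and works over any connection; pg_get_tabledef() is only a proposal and does not exist. The snippet assumes a reachable database named "postgres" and trims error handling.]

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn   *conn = PQconnectdb("dbname=postgres");
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Server-side deparsing of a view definition: this works today. */
    res = PQexec(conn,
        "SELECT pg_get_viewdef('pg_catalog.pg_tables'::regclass, true)");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("view definition:\n%s\n", PQgetvalue(res, 0, 0));
    PQclear(res);

    /*
     * The proposed function does NOT exist yet; a GUI client would call
     * something like this if it did:
     *
     *   SELECT pg_get_tabledef('public.my_table'::regclass);
     */
    PQfinish(conn);
    return 0;
}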
On 04/14/2016 12:22 PM, Craig Ringer wrote:
> I'd find a pg_get_tabledef(...) built-in function more interesting for
> this particular purpose than pg_dump as a library would be. We already
> have pg_get_viewdef(...), pg_get_functiondef(...) etc.

I am personally not a fan of the pg_get_Xdef() functions due to their heavy reliance on the syscache which feels rather unsafe in combination with concurrent DDL. I would not be surprised if we have some low-probability bugs there which cause inconsistent backups and just have not hit enough people yet to be reported. And this problem will only get worse as we reduce the lock level of more DDL.

Andreas
On 4/14/16 6:16 AM, Andreas Karlsson wrote:
> On 04/14/2016 12:22 PM, Craig Ringer wrote:
>> I'd find a pg_get_tabledef(...) built-in function more interesting for
>> this particular purpose than pg_dump as a library would be. We already
>> have pg_get_viewdef(...), pg_get_functiondef(...) etc.
>
> I am personally not a fan of the pg_get_Xdef() functions due to their
> heavy reliance on the syscache which feels rather unsafe in combination
> with concurrent DDL. I would not be surprised if we have some
> low-probability bugs there which cause inconsistent backups and just
> have not hit enough people yet to be reported. And this problem will
> only get worse as we reduce the lock level of more DDL.

The other issue specific to pg_dump is that it often needs to do something different from what pg_get_*def would do in order to support version upgrades.

I agree that it would be nice to have better DDL generation capabilities in the database, but I think the right way to do that is to create the pg_dump library and then wrap that as a (version-specific) extension. That way you could load the pg_dump-9.5 extension in a 9.3 database if you wanted.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2016-04-14 13:16:20 +0200, Andreas Karlsson wrote:
> I am personally not a fan of the pg_get_Xdef() functions due to their heavy
> reliance on the syscache which feels rather unsafe in combination with
> concurrent DDL.

I'm not sure I find that convincing: the state portrayed by the syscache is the state COPY/SELECT et al. will be using. I think the angle to attack this is rather to allow blocking 'globally visible' DDL efficiently and correctly, rather than the hack pg_dump is using right now.

Greetings,

Andres Freund
On 4/14/16 7:16 AM, Andreas Karlsson wrote:
> On 04/14/2016 12:22 PM, Craig Ringer wrote:
>> I'd find a pg_get_tabledef(...) built-in function more interesting for
>> this particular purpose than pg_dump as a library would be. We already
>> have pg_get_viewdef(...), pg_get_functiondef(...) etc.
>
> I am personally not a fan of the pg_get_Xdef() functions due to their
> heavy reliance on the syscache which feels rather unsafe in combination
> with concurrent DDL. I would not be surprised if we have some
> low-probability bugs there which cause inconsistent backups and just
> have not hit enough people yet to be reported. And this problem will
> only get worse as we reduce the lock level of more DDL.

As far as I know pg_dump share locks everything before it starts so there shouldn't be issues with concurrent DDL. Try creating a new inherited table with FKs, etc. during a pg_dump and you'll see lots of fun lock waits.

--
-David
david@pgmasters.net
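[Editor's note: a rough libpq sketch of the locking behavior referred to here. pg_dump runs the whole dump in one REPEATABLE READ, READ ONLY transaction and takes an ACCESS SHARE lock on every table it will dump up front; the table name below is made up, and the catalog queries and COPY steps are omitted.]

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* One snapshot for the whole dump, and no writes allowed. */
    PQclear(PQexec(conn, "BEGIN"));
    PQclear(PQexec(conn,
        "SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY"));

    /*
     * pg_dump locks every table it is going to dump, roughly like this.
     * ACCESS SHARE conflicts with ACCESS EXCLUSIVE, so ALTER TABLE /
     * DROP TABLE on a dumped table waits behind the dump (and vice
     * versa) -- the "fun lock waits" mentioned above.  Nothing comparable
     * is done for functions, types or casts.
     */
    PQclear(PQexec(conn, "LOCK TABLE public.some_table IN ACCESS SHARE MODE"));

    /* ... catalog queries and COPY ... TO STDOUT would follow here ... */

    PQclear(PQexec(conn, "COMMIT"));
    PQfinish(conn);
    return 0;
}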
David Steele <david@pgmasters.net> writes:
> On 4/14/16 7:16 AM, Andreas Karlsson wrote:
>> I am personally not a fan of the pg_get_Xdef() functions due to their
>> heavy reliance on the syscache which feels rather unsafe in combination
>> with concurrent DDL.

> As far as I know pg_dump share locks everything before it starts so
> there shouldn't be issues with concurrent DDL. Try creating a new
> inherited table with FKs, etc. during a pg_dump and you'll see lots of
> fun lock waits.

I think pg_dump is reasonably proof against DDL on tables. It is not at all proof against DDL on other sorts of objects, such as functions, because of the fact that the syscache will follow catalog updates that occur after pg_dump's transaction snapshot.

            regards, tom lane
On 4/14/16 1:33 PM, Tom Lane wrote:
> David Steele <david@pgmasters.net> writes:
>> On 4/14/16 7:16 AM, Andreas Karlsson wrote:
>>> I am personally not a fan of the pg_get_Xdef() functions due to their
>>> heavy reliance on the syscache which feels rather unsafe in combination
>>> with concurrent DDL.
>
>> As far as I know pg_dump share locks everything before it starts so
>> there shouldn't be issues with concurrent DDL. Try creating a new
>> inherited table with FKs, etc. during a pg_dump and you'll see lots of
>> fun lock waits.
>
> I think pg_dump is reasonably proof against DDL on tables. It is not
> at all proof against DDL on other sorts of objects, such as functions,
> because of the fact that the syscache will follow catalog updates that
> occur after pg_dump's transaction snapshot.

Hmm, OK. I'll need to go look at that.

I thought that the backend running the pg_dump would fill its syscache when it took all the locks and then not update it during the actual dump. If that's not the case then it's a bit scary, yes.

It seems to make a good case for physical backups vs. logical.

--
-David
david@pgmasters.net
On Thu, Apr 14, 2016 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-14 13:16:20 +0200, Andreas Karlsson wrote:
>> I am personally not a fan of the pg_get_Xdef() functions due to their heavy
>> reliance on the syscache which feels rather unsafe in combination with
>> concurrent DDL.
>
> I'm not sure I find that convincing: the state portrayed by the syscache
> is the state COPY/SELECT et al will be using. I think the angle to
> attack this is rather to allow blocking 'globally visible' DDL
> efficiently and correctly, rather than the hack pg_dump is using right now.

Maybe. I think the idea of changing the pg_get_Xdef() stuff to use the transaction snapshot rather than the latest snapshot might be worth considering, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Apr 14, 2016 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
>> I'm not sure I find that convincing: the state portrayed by the syscache
>> is the state COPY/SELECT et al will be using. I think the angle to
>> attack this is rather to allow blocking 'globally visible' DDL
>> efficiently and correctly, rather than the hack pg_dump is using right now.

> Maybe. I think the idea of changing the pg_get_Xdef() stuff to use
> the transaction snapshot rather than the latest snapshot might be
> worth considering, too.

The problem here is the connection to syscache; changing the behavior of that, in a general context, is very scary. What we might be able to do that would satisfy pg_dump's needs is to invent a mode in which you can run a read-only transaction that uses the transaction snapshot to populate the syscache (and then flushes the syscache at the end). It would have to be a pretty "hard" notion of read-only, not the squishy one we have now, but I think it would work safely. Anything that might otherwise break because of stale syscache entries should be prevented from having bad side-effects by the read-only restriction.

            regards, tom lane
On 04/14/2016 07:28 PM, David Steele wrote:
> As far as I know pg_dump share locks everything before it starts so
> there shouldn't be issues with concurrent DDL. Try creating a new
> inherited table with FKs, etc. during a pg_dump and you'll see lots of
> fun lock waits.

I am pretty sure that it does not lock functions, types or casts. So if you rename a function during pg_dump, there is a risk of getting a backup which will fail on restore.

Andreas
On Fri, Apr 15, 2016 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Apr 14, 2016 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
>>> I'm not sure I find that convincing: the state portrayed by the syscache
>>> is the state COPY/SELECT et al will be using. I think the angle to
>>> attack this is rather to allow blocking 'globally visible' DDL
>>> efficiently and correctly, rather than the hack pg_dump is using right now.
>
>> Maybe. I think the idea of changing the pg_get_Xdef() stuff to use
>> the transaction snapshot rather than the latest snapshot might be
>> worth considering, too.
>
> The problem here is the connection to syscache; changing the behavior
> of that, in a general context, is very scary. What we might be able to
> do that would satisfy pg_dump's needs is to invent a mode in which you
> can run a read-only transaction that uses the transaction snapshot to
> populate the syscache (and then flushes the syscache at the end). It would
> have to be a pretty "hard" notion of read-only, not the squishy one we
> have now, but I think it would work safely. Anything that might otherwise
> break because of stale syscache entries should be prevented from having
> bad side-effects by the read-only restriction.

I think that we could have an alternate set of functions which have the same interface as the syscache functions but use the transaction snapshot and don't actually cache anything, and it would be fine for what the pg_dump support functions need. It wouldn't be nearly as fast as the actual syscache, but server-side CPU when deparsing DDL has never really been a problem to my knowledge. I mean, if we had something like this, it could also be sped up by adding a cache that is flushed when the snapshot changes, but I'm not sure we really need to go that far.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
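[Editor's note: a minimal backend-side sketch of what one such alternate lookup function might look like, written against the backend APIs of that era (heap_open, systable_beginscan, GetTransactionSnapshot). The function name and placement are made up; this is not committed code, only an illustration of "syscache interface, transaction snapshot, no caching" for a single cache (PROCOID).]

#include "postgres.h"

#include "access/genam.h"
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/skey.h"
#include "access/stratnum.h"
#include "access/sysattr.h"
#include "catalog/indexing.h"
#include "catalog/pg_proc.h"
#include "utils/fmgroids.h"
#include "utils/snapmgr.h"

/* Look up a pg_proc row by OID under the transaction snapshot; no caching. */
HeapTuple
SearchProcTupleTxnSnapshot(Oid funcid)      /* hypothetical name */
{
    Relation    rel;
    SysScanDesc scan;
    ScanKeyData skey;
    HeapTuple   tup;
    HeapTuple   result = NULL;

    ScanKeyInit(&skey,
                ObjectIdAttributeNumber,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(funcid));

    rel = heap_open(ProcedureRelationId, AccessShareLock);

    /*
     * In a REPEATABLE READ transaction, GetTransactionSnapshot() returns
     * the snapshot established at the first statement, so the row we see
     * here cannot reflect DDL committed after pg_dump's snapshot was taken.
     */
    scan = systable_beginscan(rel, ProcedureOidIndexId, true,
                              GetTransactionSnapshot(), 1, &skey);

    tup = systable_getnext(scan);
    if (HeapTupleIsValid(tup))
        result = heap_copytuple(tup);   /* caller must heap_freetuple() */

    systable_endscan(scan);
    heap_close(rel, AccessShareLock);

    return result;
}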
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Apr 15, 2016 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The problem here is the connection to syscache; changing the behavior
>> of that, in a general context, is very scary. What we might be able to
>> do that would satisfy pg_dump's needs is to invent a mode in which you
>> can run a read-only transaction that uses the transaction snapshot to
>> populate the syscache (and then flushes the syscache at the end).

> I think that we could have an alternate set of functions which have
> the same interface as the syscache functions but use the transaction
> snapshot and don't actually cache anything, and it would be fine for
> what the pg_dump support functions need.

The problem with that approach is that then you are talking about building duplicate copies of entire layers of the system. For example, namespace.c would have to be duplicated into one copy that uses syscache and one that uses this not-quite-cache. If it were *only* syscache.c that had to be duplicated, probably this would work, but ruleutils.c depends on an awful lot of code above that level. Indeed, if it did not, the idea of reimplementing it on the client side wouldn't be so unattractive.

            regards, tom lane
On Thu, Apr 14, 2016 at 01:40:21PM -0400, David Steele wrote:
> On 4/14/16 1:33 PM, Tom Lane wrote:
> > David Steele <david@pgmasters.net> writes:
> >> On 4/14/16 7:16 AM, Andreas Karlsson wrote:
> >>> I am personally not a fan of the pg_get_Xdef() functions due to their
> >>> heavy reliance on the syscache which feels rather unsafe in combination
> >>> with concurrent DDL.
> >
> >> As far as I know pg_dump share locks everything before it starts so
> >> there shouldn't be issues with concurrent DDL. Try creating a new
> >> inherited table with FKs, etc. during a pg_dump and you'll see lots of
> >> fun lock waits.
> >
> > I think pg_dump is reasonably proof against DDL on tables. It is not
> > at all proof against DDL on other sorts of objects, such as functions,
> > because of the fact that the syscache will follow catalog updates that
> > occur after pg_dump's transaction snapshot.
>
> Hmm, OK. I'll need to go look at that.
>
> I thought that the backend running the pg_dump would fill its syscache
> when it took all the locks and then not update it during the actual
> dump. If that's not the case then it's a bit scary, yes.
>
> It seems to make a good case for physical backups vs. logical.

I think another issue is that the pg_dump backend gets cache invalidations from other backends that cause it to reload the cache with new contents, so even if you pre-loaded the cache at snapshot time, you would still need to ignore cache invalidations from other backends.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
On Fri, Apr 29, 2016 at 5:02 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> > I think pg_dump is reasonably proof against DDL on tables. It is not
>> > at all proof against DDL on other sorts of objects, such as functions,
>> > because of the fact that the syscache will follow catalog updates that
>> > occur after pg_dump's transaction snapshot.
>>
>> Hmm, OK. I'll need to go look at that.
>>
>> I thought that the backend running the pg_dump would fill its syscache
>> when it took all the locks and then not update it during the actual
>> dump. If that's not the case then it's a bit scary, yes.
>>
>> It seems to make a good case for physical backups vs. logical.
>
> I think another issue is that the pg_dump backend gets cache
> invalidations from other backends that cause it to reload the cache with
> new contents, so even if you pre-loaded the cache at snapshot time, you
> would still need to ignore cache invalidations from other backends.

If you temporarily nailed the backend's snapshot for syscache lookups to some specific MVCC snapshot, as Tom was suggesting, then it wouldn't matter if it processed invalidations, because a reload would pull the same entries as before back into the cache. So I don't think you'd *have* to ignore invalidations, but you might want to do it as an optimization.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 18, 2016 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think that we could have an alternate set of functions which have
>> the same interface as the syscache functions but use the transaction
>> snapshot and don't actually cache anything, and it would be fine for
>> what the pg_dump support functions need.
>
> The problem with that approach is that then you are talking about building
> duplicate copies of entire layers of the system. For example, namespace.c
> would have to be duplicated into one copy that uses syscache and one that
> uses this not-quite-cache. If it were *only* syscache.c that had to be
> duplicated, probably this would work, but ruleutils.c depends on an awful
> lot of code above that level. Indeed, if it did not, the idea of
> reimplementing it on the client side wouldn't be so unattractive.

Urgh. Does ruleutils.c really depend on everything in namespace.c? When I last looked at this I had the idea that at least a good chunk of ruleutils.c had only limited outside dependencies that might not be too tough to manage. However, if that's not true, then I agree that's a problem for this approach.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Apr 18, 2016 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The problem with that approach is that then you are talking about building
>> duplicate copies of entire layers of the system. For example, namespace.c
>> would have to be duplicated into one copy that uses syscache and one that
>> uses this not-quite-cache. If it were *only* syscache.c that had to be
>> duplicated, probably this would work, but ruleutils.c depends on an awful
>> lot of code above that level. Indeed, if it did not, the idea of
>> reimplementing it on the client side wouldn't be so unattractive.

> Urgh. Does ruleutils.c really depend on everything in namespace.c?

Indirectly, probably most of it. For example, it uses format_type_be(), which depends on TypeIsVisible(), and it uses func_get_detail(), which depends on FuncnameGetCandidates(). And it's those intermediate functions that are really bloating the depends-on footprint. As things stand, ruleutils depends on significant fractions of backend/catalog/ and backend/parser/, all of which would have to be rewritten if you'd like to make it use some alternate catalog-access infrastructure.

But really the killer point here is that it uses SPI in some places. I've always wondered whether that was a good design choice, but right now that implicates just about the whole backend.

            regards, tom lane
On Mon, May 2, 2016 at 10:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Apr 18, 2016 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The problem with that approach is that then you are talking about building
>>> duplicate copies of entire layers of the system. For example, namespace.c
>>> would have to be duplicated into one copy that uses syscache and one that
>>> uses this not-quite-cache. If it were *only* syscache.c that had to be
>>> duplicated, probably this would work, but ruleutils.c depends on an awful
>>> lot of code above that level. Indeed, if it did not, the idea of
>>> reimplementing it on the client side wouldn't be so unattractive.
>
>> Urgh. Does ruleutils.c really depend on everything in namespace.c?
>
> Indirectly, probably most of it. For example, it uses format_type_be(),
> which depends on TypeIsVisible(), and it uses func_get_detail(),
> which depends on FuncnameGetCandidates(). And it's those intermediate
> functions that are really bloating the depends-on footprint. As things
> stand, ruleutils depends on significant fractions of backend/catalog/
> and backend/parser/, all of which would have to be rewritten if you'd
> like to make it use some alternate catalog-access infrastructure.
>
> But really the killer point here is that it uses SPI in some places.
> I've always wondered whether that was a good design choice, but right
> now that implicates just about the whole backend.

Ouch.

Well, I think the first thing to do here might be to reconsider whether the footprint could be cut down. Removing the dependency on SPI seems like a good idea even if we do nothing else. Nailing the catalogs to a snapshot isn't crazy - the logical decoding stuff does it already - but having such a wide dependency footprint does not seem especially good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, May 2, 2016 at 10:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Mon, Apr 18, 2016 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> The problem with that approach is that then you are talking about building
>>>> duplicate copies of entire layers of the system.

>>> Urgh. Does ruleutils.c really depend on everything in namespace.c?

>> Indirectly, probably most of it. For example, it uses format_type_be(),
>> which depends on TypeIsVisible(), and it uses func_get_detail(),
>> which depends on FuncnameGetCandidates(). And it's those intermediate
>> functions that are really bloating the depends-on footprint.

>> But really the killer point here is that it uses SPI in some places.
>> I've always wondered whether that was a good design choice, but right
>> now that implicates just about the whole backend.

> Ouch.

> Well, I think the first thing to do here might be to reconsider
> whether the footprint could be cut down. Removing the dependency on
> SPI seems like a good idea even if we do nothing else. Nailing the
> catalogs to a snapshot isn't crazy - the logical decoding stuff does
> it already - but having such a wide dependency footprint does not seem
> especially good.

Meh. I'd be the first to say that probably a lot of that is because it was convenient; but I don't see any way to avoid it without duplicating vast quantities of code, which does not sound like a great idea from a maintainability standpoint. And have we mentioned the fact that some of this code looks directly at the catalogs because there's no suitable syscache? Really I think the idea of fixing this with an alternate syscache is a nonstarter.

            regards, tom lane
Hello, Jakob.

You wrote:

JE> Would anybody else be interested in a pg_dump library? I've found
JE> a thread from 2013 related to the idea, but the discussion came to nothing.
JE> Thread started here:
JE> http://www.postgresql.org/message-id/71e01949.2e16b.13df4707405.Coremail.shuai900217@126.com

JE> My Motivation:
JE> I'm the developer of a PostgreSQL GUI client, and I am looking
JE> for ways to integrate pg_dump into my application. The main use
JE> case would be to get the dump of individual tables (for example,
JE> when you want to create a table similar to an existing one)

JE> Bundling pg_dump with my application and calling it doesn't allow
JE> the fine grained control and integration I would like to have.
JE> Also, pg_dump always opens a new connection; I would prefer to use
JE> an existing database connection instead.

JE> In case anybody else is interested in this, I can offer to
JE> sponsor some developer time towards this effort.

JE> Best regards,
JE> Jakob

I proposed this several times, but nobody cared. Then we did it ourselves. Our PostgresDAC component set (http://microolap.com/products/connectivity/postgresdac/) has TPSQLDump and TPSQLRestore classes.

Details: http://microolap.com/products/connectivity/postgresdac/help/tpsqldump_tpsqldump.htm

We've also implemented the PaGoDump and PaGoRestore utilities, compatible with native pg_dump/pg_restore: http://microolap.com/products/database/pagodump/

--
With best wishes,
Pavel
mailto:pavel@gf.microolap.com