Thread: Refactor pg_dump as a library?
Would anybody else be interested in a pg_dump library? I've found a thread from 2013 related to the idea, but the discussion came to nothing. Thread started here: http://www.postgresql.org/message-id/71e01949.2e16b.13df4707405.Coremail.shuai900217@126.com

My motivation: I'm the developer of a PostgreSQL GUI client, and I am looking for ways to integrate pg_dump into my application. The main use case would be to get the dump of individual tables (for example, when you want to create a table similar to an existing one).

Bundling pg_dump with my application and calling it doesn't allow the fine-grained control and integration I would like to have. Also, pg_dump always opens a new connection; I would prefer to use an existing database connection instead.

In case anybody else is interested in this, I can offer to sponsor some developer time towards this effort.

Best regards,
Jakob
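[Editor's note: for concreteness, here is a purely hypothetical sketch of the kind of interface being asked for. None of these names exist in PostgreSQL or in any shipped library; they only illustrate the two requirements above, fine-grained control (dump a single table) and reuse of an existing libpq connection.]

/*
 * Hypothetical "libpgdump" interface sketch -- every identifier below is
 * made up for illustration; no such header ships with PostgreSQL.
 */
#include <stdbool.h>
#include <libpq-fe.h>

typedef struct PGDumpOptions
{
    bool        schema_only;        /* omit table data */
    bool        include_privileges; /* emit GRANT/REVOKE */
    const char *schema;             /* restrict to one schema, or NULL */
    const char *table;              /* restrict to one table, or NULL */
} PGDumpOptions;

/* Caller-supplied sink that receives the generated SQL text in chunks. */
typedef int (*PGDumpSink) (const char *chunk, size_t len, void *private_data);

/* Dump the selected objects over an existing connection instead of opening a new one. */
extern int pg_dump_objects(PGconn *existing_conn,
                           const PGDumpOptions *opts,
                           PGDumpSink sink,
                           void *private_data);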
On 14 April 2016 at 18:16, Jakob Egger <jakob@eggerapps.at> wrote:
> Would anybody else be interested in a pg_dump library?
Yes. It comes up a lot, but so far nobody's had the time and energy to actually do the required refactoring.
> I'm the developer of a PostgreSQL GUI client, and I am looking for ways to integrate pg_dump into my application. The main use case would be to get the dump of individual tables (for example, when you want to create a table similar to an existing one)
I'd find a pg_get_tabledef(...) built-in function more interesting for this particular purpose than pg_dump as a library would be. We already have pg_get_viewdef(...), pg_get_functiondef(...) etc.
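[Editor's note: a minimal libpq snippet illustrating the suggestion. The pg_get_viewdef() call exists today and works over any connection; pg_get_tabledef() is only a proposal and does not exist. The snippet assumes a reachable database named "postgres" and trims error handling.]

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn   *conn = PQconnectdb("dbname=postgres");
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Server-side deparsing of a view definition: this works today. */
    res = PQexec(conn,
        "SELECT pg_get_viewdef('pg_catalog.pg_tables'::regclass, true)");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("view definition:\n%s\n", PQgetvalue(res, 0, 0));
    PQclear(res);

    /*
     * The proposed function does NOT exist yet; a GUI client would call
     * something like this if it did:
     *
     *   SELECT pg_get_tabledef('public.my_table'::regclass);
     */
    PQfinish(conn);
    return 0;
}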
On 04/14/2016 12:22 PM, Craig Ringer wrote:
> I'd find a pg_get_tabledef(...) built-in function more interesting for
> this particular purpose than pg_dump as a library would be. We already
> have pg_get_viewdef(...), pg_get_functiondef(...) etc.

I am personally not a fan of the pg_get_Xdef() functions due to their heavy reliance on the syscache which feels rather unsafe in combination with concurrent DDL. I would not be surprised if we have some low-probability bugs there which cause inconsistent backups and just have not hit enough people yet to be reported. And this problem will only get worse as we reduce the lock level of more DDL.

Andreas
On 4/14/16 6:16 AM, Andreas Karlsson wrote:
> On 04/14/2016 12:22 PM, Craig Ringer wrote:
>> I'd find a pg_get_tabledef(...) built-in function more interesting for
>> this particular purpose than pg_dump as a library would be. We already
>> have pg_get_viewdef(...), pg_get_functiondef(...) etc.
>
> I am personally not a fan of the pg_get_Xdef() functions due to their
> heavy reliance on the syscache which feels rather unsafe in combination
> with concurrent DDL. I would not be surprised if we have some
> low-probability bugs there which cause inconsistent backups and just
> have not hit enough people yet to be reported. And this problem will
> only get worse as we reduce the lock level of more DDL.

The other issue specific to pg_dump is that it often needs to do something different from what pg_get_*def would do in order to support version upgrades.

I agree that it would be nice to have better DDL generation capabilities in the database, but I think the right way to do that is to create the pg_dump library and then wrap that as a (version-specific) extension. That way you could load the pg_dump-9.5 extension in a 9.3 database if you wanted.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2016-04-14 13:16:20 +0200, Andreas Karlsson wrote:
> I am personally not a fan of the pg_get_Xdef() functions due to their heavy
> reliance on the syscache which feels rather unsafe in combination with
> concurrent DDL.

I'm not sure I find that convincing: the state portrayed by the syscache is the state COPY/SELECT et al. will be using. I think the angle to attack this is rather to allow blocking 'globally visible' DDL efficiently and correctly, rather than the hack pg_dump is using right now.

Greetings,

Andres Freund
On 4/14/16 7:16 AM, Andreas Karlsson wrote:
> On 04/14/2016 12:22 PM, Craig Ringer wrote:
>> I'd find a pg_get_tabledef(...) built-in function more interesting for
>> this particular purpose than pg_dump as a library would be. We already
>> have pg_get_viewdef(...), pg_get_functiondef(...) etc.
>
> I am personally not a fan of the pg_get_Xdef() functions due to their
> heavy reliance on the syscache which feels rather unsafe in combination
> with concurrent DDL. I would not be surprised if we have some
> low-probability bugs there which cause inconsistent backups and just
> have not hit enough people yet to be reported. And this problem will
> only get worse as we reduce the lock level of more DDL.

As far as I know pg_dump share locks everything before it starts so there shouldn't be issues with concurrent DDL. Try creating a new inherited table with FKs, etc. during a pg_dump and you'll see lots of fun lock waits.

--
-David
david@pgmasters.net
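[Editor's note: a rough libpq sketch of the locking behavior referred to here. pg_dump runs the whole dump in one REPEATABLE READ, READ ONLY transaction and takes an ACCESS SHARE lock on every table it will dump up front; the table name below is made up, and the catalog queries and COPY steps are omitted.]

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* One snapshot for the whole dump, and no writes allowed. */
    PQclear(PQexec(conn, "BEGIN"));
    PQclear(PQexec(conn,
        "SET TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY"));

    /*
     * pg_dump locks every table it is going to dump, roughly like this.
     * ACCESS SHARE conflicts with ACCESS EXCLUSIVE, so ALTER TABLE /
     * DROP TABLE on a dumped table waits behind the dump (and vice
     * versa) -- the "fun lock waits" mentioned above.  Nothing comparable
     * is done for functions, types or casts.
     */
    PQclear(PQexec(conn, "LOCK TABLE public.some_table IN ACCESS SHARE MODE"));

    /* ... catalog queries and COPY ... TO STDOUT would follow here ... */

    PQclear(PQexec(conn, "COMMIT"));
    PQfinish(conn);
    return 0;
}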
David Steele <david@pgmasters.net> writes:
> On 4/14/16 7:16 AM, Andreas Karlsson wrote:
>> I am personally not a fan of the pg_get_Xdef() functions due to their
>> heavy reliance on the syscache which feels rather unsafe in combination
>> with concurrent DDL.

> As far as I know pg_dump share locks everything before it starts so
> there shouldn't be issues with concurrent DDL. Try creating a new
> inherited table with FKs, etc. during a pg_dump and you'll see lots of
> fun lock waits.

I think pg_dump is reasonably proof against DDL on tables. It is not at all proof against DDL on other sorts of objects, such as functions, because of the fact that the syscache will follow catalog updates that occur after pg_dump's transaction snapshot.

            regards, tom lane
On 4/14/16 1:33 PM, Tom Lane wrote:
> David Steele <david@pgmasters.net> writes:
>> On 4/14/16 7:16 AM, Andreas Karlsson wrote:
>>> I am personally not a fan of the pg_get_Xdef() functions due to their
>>> heavy reliance on the syscache which feels rather unsafe in combination
>>> with concurrent DDL.
>
>> As far as I know pg_dump share locks everything before it starts so
>> there shouldn't be issues with concurrent DDL. Try creating a new
>> inherited table with FKs, etc. during a pg_dump and you'll see lots of
>> fun lock waits.
>
> I think pg_dump is reasonably proof against DDL on tables. It is not
> at all proof against DDL on other sorts of objects, such as functions,
> because of the fact that the syscache will follow catalog updates that
> occur after pg_dump's transaction snapshot.

Hmm, OK. I'll need to go look at that.

I thought that the backend running the pg_dump would fill its syscache when it took all the locks and then not update it during the actual dump. If that's not the case then it's a bit scary, yes.

It seems to make a good case for physical backups vs. logical.

--
-David
david@pgmasters.net
On Thu, Apr 14, 2016 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-14 13:16:20 +0200, Andreas Karlsson wrote:
>> I am personally not a fan of the pg_get_Xdef() functions due to their heavy
>> reliance on the syscache which feels rather unsafe in combination with
>> concurrent DDL.
>
> I'm not sure I find that convincing: the state portrayed by the syscache
> is the state COPY/SELECT et al will be using. I think the angle to
> attack this is rather to allow blocking 'globally visible' DDL
> efficiently and correctly, rather than the hack pg_dump is using right now.

Maybe. I think the idea of changing the pg_get_Xdef() stuff to use the transaction snapshot rather than the latest snapshot might be worth considering, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Apr 14, 2016 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
>> I'm not sure I find that convincing: the state portrayed by the syscache
>> is the state COPY/SELECT et al will be using. I think the angle to
>> attack this is rather to allow blocking 'globally visible' DDL
>> efficiently and correctly, rather than the hack pg_dump is using right now.

> Maybe. I think the idea of changing the pg_get_Xdef() stuff to use
> the transaction snapshot rather than the latest snapshot might be
> worth considering, too.

The problem here is the connection to syscache; changing the behavior of that, in a general context, is very scary. What we might be able to do that would satisfy pg_dump's needs is to invent a mode in which you can run a read-only transaction that uses the transaction snapshot to populate the syscache (and then flushes the syscache at the end). It would have to be a pretty "hard" notion of read-only, not the squishy one we have now, but I think it would work safely. Anything that might otherwise break because of stale syscache entries should be prevented from having bad side-effects by the read-only restriction.

            regards, tom lane
On 04/14/2016 07:28 PM, David Steele wrote:
> As far as I know pg_dump share locks everything before it starts so
> there shouldn't be issues with concurrent DDL. Try creating a new
> inherited table with FKs, etc. during a pg_dump and you'll see lots of
> fun lock waits.

I am pretty sure that it does not lock functions, types or casts. So if you rename a function during pg_dump, there is a risk of getting a backup which will fail on restore.

Andreas
On Fri, Apr 15, 2016 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Apr 14, 2016 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
>>> I'm not sure I find that convincing: the state portrayed by the syscache
>>> is the state COPY/SELECT et al will be using. I think the angle to
>>> attack this is rather to allow blocking 'globally visible' DDL
>>> efficiently and correctly, rather than the hack pg_dump is using right now.
>
>> Maybe. I think the idea of changing the pg_get_Xdef() stuff to use
>> the transaction snapshot rather than the latest snapshot might be
>> worth considering, too.
>
> The problem here is the connection to syscache; changing the behavior
> of that, in a general context, is very scary. What we might be able to
> do that would satisfy pg_dump's needs is to invent a mode in which you
> can run a read-only transaction that uses the transaction snapshot to
> populate the syscache (and then flushes the syscache at the end). It would
> have to be a pretty "hard" notion of read-only, not the squishy one we
> have now, but I think it would work safely. Anything that might otherwise
> break because of stale syscache entries should be prevented from having
> bad side-effects by the read-only restriction.

I think that we could have an alternate set of functions which have the same interface as the syscache functions but use the transaction snapshot and don't actually cache anything, and it would be fine for what the pg_dump support functions need. It wouldn't be nearly as fast as the actual syscache, but server-side CPU when deparsing DDL has never really been a problem to my knowledge. I mean, if we had something like this, it could also be sped up by adding a cache that is flushed when the snapshot changes, but I'm not sure we really need to go that far.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
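[Editor's note: a minimal backend-side sketch of what one such alternate lookup function might look like, written against the backend APIs of that era (heap_open, systable_beginscan, GetTransactionSnapshot). The function name and placement are made up; this is not committed code, only an illustration of "syscache interface, transaction snapshot, no caching" for a single cache (PROCOID).]

#include "postgres.h"

#include "access/genam.h"
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/skey.h"
#include "access/stratnum.h"
#include "access/sysattr.h"
#include "catalog/indexing.h"
#include "catalog/pg_proc.h"
#include "utils/fmgroids.h"
#include "utils/snapmgr.h"

/* Look up a pg_proc row by OID under the transaction snapshot; no caching. */
HeapTuple
SearchProcTupleTxnSnapshot(Oid funcid)      /* hypothetical name */
{
    Relation    rel;
    SysScanDesc scan;
    ScanKeyData skey;
    HeapTuple   tup;
    HeapTuple   result = NULL;

    ScanKeyInit(&skey,
                ObjectIdAttributeNumber,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(funcid));

    rel = heap_open(ProcedureRelationId, AccessShareLock);

    /*
     * In a REPEATABLE READ transaction, GetTransactionSnapshot() returns
     * the snapshot established at the first statement, so the row we see
     * here cannot reflect DDL committed after pg_dump's snapshot was taken.
     */
    scan = systable_beginscan(rel, ProcedureOidIndexId, true,
                              GetTransactionSnapshot(), 1, &skey);

    tup = systable_getnext(scan);
    if (HeapTupleIsValid(tup))
        result = heap_copytuple(tup);   /* caller must heap_freetuple() */

    systable_endscan(scan);
    heap_close(rel, AccessShareLock);

    return result;
}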
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Apr 15, 2016 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The problem here is the connection to syscache; changing the behavior
>> of that, in a general context, is very scary. What we might be able to
>> do that would satisfy pg_dump's needs is to invent a mode in which you
>> can run a read-only transaction that uses the transaction snapshot to
>> populate the syscache (and then flushes the syscache at the end).

> I think that we could have an alternate set of functions which have
> the same interface as the syscache functions but use the transaction
> snapshot and don't actually cache anything, and it would be fine for
> what the pg_dump support functions need.

The problem with that approach is that then you are talking about building duplicate copies of entire layers of the system. For example, namespace.c would have to be duplicated into one copy that uses syscache and one that uses this not-quite-cache. If it were *only* syscache.c that had to be duplicated, probably this would work, but ruleutils.c depends on an awful lot of code above that level. Indeed, if it did not, the idea of reimplementing it on the client side wouldn't be so unattractive.

            regards, tom lane
On Thu, Apr 14, 2016 at 01:40:21PM -0400, David Steele wrote:
> On 4/14/16 1:33 PM, Tom Lane wrote:
> > David Steele <david@pgmasters.net> writes:
> >> On 4/14/16 7:16 AM, Andreas Karlsson wrote:
> >>> I am personally not a fan of the pg_get_Xdef() functions due to their
> >>> heavy reliance on the syscache which feels rather unsafe in combination
> >>> with concurrent DDL.
> >
> >> As far as I know pg_dump share locks everything before it starts so
> >> there shouldn't be issues with concurrent DDL. Try creating a new
> >> inherited table with FKs, etc. during a pg_dump and you'll see lots of
> >> fun lock waits.
> >
> > I think pg_dump is reasonably proof against DDL on tables. It is not
> > at all proof against DDL on other sorts of objects, such as functions,
> > because of the fact that the syscache will follow catalog updates that
> > occur after pg_dump's transaction snapshot.
>
> Hmm, OK. I'll need to go look at that.
>
> I thought that the backend running the pg_dump would fill its syscache
> when it took all the locks and then not update it during the actual
> dump. If that's not the case then it's a bit scary, yes.
>
> It seems to make a good case for physical backups vs. logical.

I think another issue is that the pg_dump backend gets cache invalidations from other backends that cause it to reload the cache with new contents, so even if you pre-loaded the cache at snapshot time, you would still need to ignore cache invalidations from other backends.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
On Fri, Apr 29, 2016 at 5:02 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> > I think pg_dump is reasonably proof against DDL on tables. It is not
>> > at all proof against DDL on other sorts of objects, such as functions,
>> > because of the fact that the syscache will follow catalog updates that
>> > occur after pg_dump's transaction snapshot.
>>
>> Hmm, OK. I'll need to go look at that.
>>
>> I thought that the backend running the pg_dump would fill its syscache
>> when it took all the locks and then not update it during the actual
>> dump. If that's not the case then it's a bit scary, yes.
>>
>> It seems to make a good case for physical backups vs. logical.
>
> I think another issue is that the pg_dump backend gets cache
> invalidations from other backends that cause it to reload the cache with
> new contents, so even if you pre-loaded the cache at snapshot time, you
> would still need to ignore cache invalidations from other backends.

If you temporarily nailed the backend's snapshot for syscache lookups to some specific MVCC snapshot, as Tom was suggesting, then it wouldn't matter if it processed invalidations, because a reload would pull the same entries as before back into the cache. So I don't think you'd *have* to ignore invalidations, but you might want to do it as an optimization.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 18, 2016 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think that we could have an alternate set of functions which have
>> the same interface as the syscache functions but use the transaction
>> snapshot and don't actually cache anything, and it would be fine for
>> what the pg_dump support functions need.
>
> The problem with that approach is that then you are talking about building
> duplicate copies of entire layers of the system. For example, namespace.c
> would have to be duplicated into one copy that uses syscache and one that
> uses this not-quite-cache. If it were *only* syscache.c that had to be
> duplicated, probably this would work, but ruleutils.c depends on an awful
> lot of code above that level. Indeed, if it did not, the idea of
> reimplementing it on the client side wouldn't be so unattractive.

Urgh. Does ruleutils.c really depend on everything in namespace.c? When I last looked at this I had the idea that at least a good chunk of ruleutils.c had only limited outside dependencies that might not be too tough to manage. However, if that's not true, then I agree that's a problem for this approach.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Apr 18, 2016 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The problem with that approach is that then you are talking about building
>> duplicate copies of entire layers of the system. For example, namespace.c
>> would have to be duplicated into one copy that uses syscache and one that
>> uses this not-quite-cache. If it were *only* syscache.c that had to be
>> duplicated, probably this would work, but ruleutils.c depends on an awful
>> lot of code above that level. Indeed, if it did not, the idea of
>> reimplementing it on the client side wouldn't be so unattractive.

> Urgh. Does ruleutils.c really depend on everything in namespace.c?

Indirectly, probably most of it. For example, it uses format_type_be(), which depends on TypeIsVisible(), and it uses func_get_detail(), which depends on FuncnameGetCandidates(). And it's those intermediate functions that are really bloating the depends-on footprint. As things stand, ruleutils depends on significant fractions of backend/catalog/ and backend/parser/, all of which would have to be rewritten if you'd like to make it use some alternate catalog-access infrastructure.

But really the killer point here is that it uses SPI in some places. I've always wondered whether that was a good design choice, but right now that implicates just about the whole backend.

            regards, tom lane
On Mon, May 2, 2016 at 10:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Apr 18, 2016 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The problem with that approach is that then you are talking about building
>>> duplicate copies of entire layers of the system. For example, namespace.c
>>> would have to be duplicated into one copy that uses syscache and one that
>>> uses this not-quite-cache. If it were *only* syscache.c that had to be
>>> duplicated, probably this would work, but ruleutils.c depends on an awful
>>> lot of code above that level. Indeed, if it did not, the idea of
>>> reimplementing it on the client side wouldn't be so unattractive.
>
>> Urgh. Does ruleutils.c really depend on everything in namespace.c?
>
> Indirectly, probably most of it. For example, it uses format_type_be(),
> which depends on TypeIsVisible(), and it uses func_get_detail(),
> which depends on FuncnameGetCandidates(). And it's those intermediate
> functions that are really bloating the depends-on footprint. As things
> stand, ruleutils depends on significant fractions of backend/catalog/
> and backend/parser/, all of which would have to be rewritten if you'd
> like to make it use some alternate catalog-access infrastructure.
>
> But really the killer point here is that it uses SPI in some places.
> I've always wondered whether that was a good design choice, but right
> now that implicates just about the whole backend.

Ouch.

Well, I think the first thing to do here might be to reconsider whether the footprint could be cut down. Removing the dependency on SPI seems like a good idea even if we do nothing else. Nailing the catalogs to a snapshot isn't crazy - the logical decoding stuff does it already - but having such a wide dependency footprint does not seem especially good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, May 2, 2016 at 10:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Mon, Apr 18, 2016 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> The problem with that approach is that then you are talking about building
>>>> duplicate copies of entire layers of the system.

>>> Urgh. Does ruleutils.c really depend on everything in namespace.c?

>> Indirectly, probably most of it. For example, it uses format_type_be(),
>> which depends on TypeIsVisible(), and it uses func_get_detail(),
>> which depends on FuncnameGetCandidates(). And it's those intermediate
>> functions that are really bloating the depends-on footprint.

>> But really the killer point here is that it uses SPI in some places.
>> I've always wondered whether that was a good design choice, but right
>> now that implicates just about the whole backend.

> Ouch.

> Well, I think the first thing to do here might be to reconsider
> whether the footprint could be cut down. Removing the dependency on
> SPI seems like a good idea even if we do nothing else. Nailing the
> catalogs to a snapshot isn't crazy - the logical decoding stuff does
> it already - but having such a wide dependency footprint does not seem
> especially good.

Meh. I'd be the first to say that probably a lot of that is because it was convenient; but I don't see any way to avoid it without duplicating vast quantities of code, which does not sound like a great idea from a maintainability standpoint. And have we mentioned the fact that some of this code looks directly at the catalogs because there's no suitable syscache? Really I think the idea of fixing this with an alternate syscache is a nonstarter.

            regards, tom lane
Hello, Jakob.

You wrote:

JE> Would anybody else be interested in a pg_dump library? I've found
JE> a thread from 2013 related to the idea, but the discussion came to nothing.
JE> Thread started here:
JE> http://www.postgresql.org/message-id/71e01949.2e16b.13df4707405.Coremail.shuai900217@126.com

JE> My Motivation:
JE> I'm the developer of a PostgreSQL GUI client, and I am looking
JE> for ways to integrate pg_dump into my application. The main use
JE> case would be to get the dump of individual tables (for example,
JE> when you want to create a table similar to an existing one)

JE> Bundling pg_dump with my application and calling it doesn't allow
JE> the fine grained control and integration I would like to have.
JE> Also, pg_dump always opens a new connection; I would prefer to use
JE> an existing database connection instead.

JE> In case anybody else is interested in this, I can offer to
JE> sponsor some developer time towards this effort.

JE> Best regards,
JE> Jakob

I proposed this several times, but nobody cared. Then we did it ourselves. Our PostgresDAC component set (http://microolap.com/products/connectivity/postgresdac/) has TPSQLDump and TPSQLRestore classes.

Details: http://microolap.com/products/connectivity/postgresdac/help/tpsqldump_tpsqldump.htm

We've also implemented the PaGoDump and PaGoRestore utilities, compatible with native pg_dump/pg_restore: http://microolap.com/products/database/pagodump/

--
With best wishes,
Pavel
mailto:pavel@gf.microolap.com