Thread: [GSOC] questions about idea "rewrite pg_dump as library"
Hi all,

I'd like to introduce myself to the dev community. I am Shuai Fan, a student from Dalian University of Technology (DLUT for short), China, and I am interested in working with the PostgreSQL project in GSOC 2013. I'm interested in the idea "Rewrite (add) pg_dump and pg_restore utilities as libraries (.so, .dll & .dylib)".

These days I had a talk with Mr. Pavel Golub by email, the author of this post, and asked some questions about this idea. He advised me to post the question to this mailing list.

My question is: there are lots of functions in "pg_dump.c". If I rewrite this file as a library, I should split "pg_dump.c" into two or more files (Mr. Pavel Golub's advice). However, some functions do have return values, e.g.

    static DumpableObject *createBoundaryObjects(void);

I thought these functions must return their results through function arguments, by passing a pointer, when used as a library. But the question is: if I did so, the function prototypes would change, and Mr. Pavel Golub said "it's better to keep all functions with the same declaration", so that "we will only have one code base for both console application and library". I think he is right.

But then, how can I deal with this situation? From my point of view, I can't malloc a variable (e.g. a DumpableObject) in one library (e.g. pg_dumplib.so) and then return its pointer to another library (or executable program). Maybe it's not safe(?), or it has something to do with a "memory leak"(?). I'm not sure.

Do you have any suggestions?

Best wishes,
Shuai
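On the specific worry about returning malloc'd memory across a library boundary: this is generally safe as long as the memory is allocated and freed by the same runtime, and the usual convention is for a library to export a matching destructor alongside each constructor. A minimal sketch of that pattern, using hypothetical pg_dumplib names that are not taken from any actual patch:

    /* pg_dumplib.h -- hypothetical public header for a pg_dump library.
     * Pairing create/free functions keeps malloc() and free() inside the
     * same module, avoiding mismatched-allocator problems on platforms
     * (notably Windows) where each DLL may use its own heap. */

    typedef struct DumpableObject DumpableObject;   /* opaque to callers */

    /* Allocates and returns a new boundary-object list, or NULL on error. */
    extern DumpableObject *pg_dumplib_create_boundary_objects(void);

    /* Frees any object previously returned by this library. */
    extern void pg_dumplib_free_object(DumpableObject *obj);

A caller then never calls free() on library memory directly:

    DumpableObject *objs = pg_dumplib_create_boundary_objects();
    if (objs != NULL)
    {
        /* ... use objs ... */
        pg_dumplib_free_object(objs);
    }

This kind of wrapper also leaves internal prototypes such as createBoundaryObjects() unchanged, so the console application and the library can share one code base.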
On 4/10/13 10:54 AM, 帅 wrote:
> I'm interested in the idea "Rewrite (add) pg_dump and pg_restore
> utilities as libraries (.so, .dll & .dylib)".

The pg_dump code is a giant mess, and refactoring it as a library is perhaps not a project for a new hacker.

Independent of that, I think the first consideration in such a project would be: who else would be using that library? What are the use cases? And then come up with an interface around that, and then see about refactoring pg_dump.

I think the main use cases mentioned in connection with this idea are usually in the direction of finer-grained control over what gets dumped and how. But making pg_dump into a library would not necessarily address that.
Peter Eisentraut wrote:
> I think the main use cases mentioned in connection with this idea are
> usually in the direction of finer-grained control over what gets dumped
> and how. But making pg_dump into a library would not necessarily
> address that.

There's also the matter of embedding pg_dump into other programs. For example, calling the pg_dump executable from inside pgAdmin is a rather messy solution to the problem.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 04/10/2013 11:02 PM, Alvaro Herrera wrote:
> Peter Eisentraut wrote:
>
>> I think the main use cases mentioned in connection with this idea are
>> usually in the direction of finer-grained control over what gets dumped
>> and how. But making pg_dump into a library would not necessarily
>> address that.
>
> There's also the matter of embedding pg_dump into other programs. For
> example, calling the pg_dump executable from inside pgAdmin is a rather
> messy solution to the problem.

A natural solution to this seems to be moving most of pg_dump's functionality into the backend as functions, so that we have pg_dump_xxx() for everything we want to dump, plus a topological sort function for getting the objects in the right order.

The main thing left for the pg_dump library would be support for the various ways of formatting the dump results into text, tar and "dump" files.

--------------------
Hannu
Hannu Krosing wrote:
> On 04/10/2013 11:02 PM, Alvaro Herrera wrote:
>> There's also the matter of embedding pg_dump into other programs. For
>> example, calling the pg_dump executable from inside pgAdmin is a rather
>> messy solution to the problem.
>
> A natural solution to this seems to be moving most of pg_dump's
> functionality into the backend as functions, so that we have
> pg_dump_xxx() for everything we want to dump, plus a topological
> sort function for getting the objects in the right order.

This idea doesn't work because of back-patch considerations (i.e. we would not be able to create the functions in back branches, and so this new style of pg_dump would only work with future server versions). So pg_dump itself would have to retain the capability to dump stuff from old servers. This seems unlikely to fly at all, because we'd then effectively be maintaining pg_dump in two places, both the backend and the pg_dump source code.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Hannu Krosing wrote: >> Natural solution to this seems to move most of pg_dump functionality >> into backend as functions, so we have pg_dump_xxx() for everything >> we want to dump plus a topological sort function for getting the >> objects in right order. > This idea doesn't work because of back-patch considerations (i.e. we > would not be able to create the functions in back branches, and so this > new style of pg_dump would only work with future server versions). So > pg_dump itself would have to retain capability to dump stuff from old > servers. This seems unlikely to fly at all, because we'd be then > effectively maintaining pg_dump in two places, both backend and the > pg_dump source code. There are other issues too, in particular that most of the backend's code tends to work on SnapshotNow time whereas pg_dump would really prefer it was all done according to the transaction snapshot. We have got bugs of that ilk already in pg_dump, but we shouldn't introduce a bunch more. Doing this right would therefore mean that we'd have to write a lot of duplicative code in the backend, ie, it's not clear that we gain any synergy by pushing the functionality over. It might simplify cross-backend-version issues (at least for backend versions released after we'd rewritten all that code) but otherwise I'm afraid it'd just be pushing the problems somewhere else. In any case, "push it to the backend" offers no detectable help with the core design issue here, which is figuring out what functionality needs to be exposed with what API. regards, tom lane
Hello, 帅.

You wrote:

帅> My question is:
帅> There are lots of functions in "pg_dump.c". If I rewrite this
帅> file as a library, I should split "pg_dump.c" into two or more
帅> files (Mr. Pavel Golub's advice). However, some functions do have
帅> return values, e.g.
帅>
帅> static DumpableObject *createBoundaryObjects(void);
帅>
帅> I thought these functions must return their results through
帅> function arguments, by passing a pointer, when used as a library.
帅> But the question is: if I did so, the function prototypes would be
帅> changed. And Mr. Pavel Golub said "it's better to keep all
帅> functions with the same declaration", and so "we will only have one
帅> code base for both console application and library". I think he is right.
帅> But then, how can I deal with this situation? From my point of
帅> view, I can't malloc a variable (e.g. a DumpableObject) in one
帅> library (e.g. pg_dumplib.so) and then return its pointer to
帅> another library (or executable program). Maybe it's not safe(?), or
帅> it has something to do with a "memory leak"(?). I'm not sure.
帅>
帅> Do you have any suggestions?

From my point of view the new library should export only two functions:

1. The execution function:

   ExecStatusType PGdumpdbParams(const char * const *keywords,
                                 const char * const *values);

The return type may be something else, but ExecStatusType seems to be OK for this purpose: PGRES_TUPLES_OK for success, PGRES_FATAL_ERROR for failure. The parameters will remain the same as the usual command-line options for pg_dump. Thus we will have less work for existing applications, e.g. pgAdmin.

2. A logging or notice-processing function:

   typedef void (*PGdumplogProcessor) (char *relname, int done, char *message);

   PGdumplogProcessor PQsetNoticeProcessor(PGdumplogProcessor proc, void *arg);

The purpose of this function is to process the output of the dump. The first argument is a callback function which accepts information about the current relname (or operation, or stage); done indicates how much work is done (for progress bars etc.); message contains some extra information.

That's all! Only two functions. Indeed, we don't need all those low-level dump functions like createBoundaryObjects etc. There will be only one entry point to the whole logic. And if one wants only one table, function or schema, a combination of the correct parameters should be passed to PGdumpdbParams.

-- 
With best wishes, Pavel
mailto:pavel@gf.microolap.com
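For concreteness, here is a minimal sketch of how a caller such as pgAdmin might drive the two proposed entry points. ExecStatusType and the PGRES_* constants are real libpq types; everything else comes only from the proposal above, and the notice-setter is shown under an assumed pg_dump-specific name (PGdumpSetLogProcessor), since PQsetNoticeProcessor is already taken by libpq:

    /* Hypothetical caller of the proposed two-function library API. */
    #include <stdio.h>
    #include <libpq-fe.h>           /* ExecStatusType, PGRES_* */

    typedef void (*PGdumplogProcessor) (char *relname, int done, char *message);

    /* Declarations as proposed; the library itself does not exist yet. */
    extern ExecStatusType PGdumpdbParams(const char * const *keywords,
                                         const char * const *values);
    extern PGdumplogProcessor PGdumpSetLogProcessor(PGdumplogProcessor proc,
                                                    void *arg);

    static void
    report_progress(char *relname, int done, char *message)
    {
        fprintf(stderr, "dumping %s: %d%% done (%s)\n", relname, done, message);
    }

    int
    main(void)
    {
        /* keyword/value pairs mirror pg_dump's command-line options */
        const char * const keywords[] = {"dbname", "format", "file", "table", NULL};
        const char * const values[]   = {"mydb", "custom", "mydb.dump",
                                         "public.orders", NULL};

        PGdumpSetLogProcessor(report_progress, NULL);

        if (PGdumpdbParams(keywords, values) != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "dump failed\n");
            return 1;
        }
        return 0;
    }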
Pavel Golub <pavel@microolap.com> writes:
> From my point of view the new library should export only two
> functions:
> 1. The execution function:
>    ExecStatusType PGdumpdbParams(const char * const *keywords,
>                                  const char * const *values);

No, this is exactly *wrong*. You might as well not bother to refactor, if the only API the library presents is exactly equivalent to what you could get with system("pg_dump ...").

I don't know what the right answer is, but this isn't it. Most people who are interested in this topic are interested because they want to get output that is different from anything pg_dump would produce on its own, for instance applying a more complex object-selection rule than anything pg_dump offers. Right now, the only way they can do that is to lobby to add new switch options to pg_dump. With a change like this, it'd still be the case that they can't get what they want except by adding new switch options to pg_dump. I don't see any advantage gained.

			regards, tom lane
Hello, Tom.

You wrote:

TL> Pavel Golub <pavel@microolap.com> writes:
>> From my point of view the new library should export only two
>> functions:
>> 1. The execution function:
>>    ExecStatusType PGdumpdbParams(const char * const *keywords,
>>                                  const char * const *values);
TL> No, this is exactly *wrong*. You might as well not bother to refactor,
TL> if the only API the library presents is exactly equivalent to what you
TL> could get with system("pg_dump ...").

Well, yes. You're absolutely right. But should this be a starting point?

TL> I don't know what the right answer is, but this isn't it. Most people
TL> who are interested in this topic are interested because they want to get
TL> output that is different from anything pg_dump would produce on its own,
TL> for instance applying a more complex object-selection rule than anything
TL> pg_dump offers. Right now, the only way they can do that is to lobby to
TL> add new switch options to pg_dump. With a change like this, it'd still
TL> be the case that they can't get what they want except by adding new
TL> switch options to pg_dump. I don't see any advantage gained.

-- 
With best wishes, Pavel
mailto:pavel@gf.microolap.com
On 04/11/2013 09:51 AM, Tom Lane wrote:
> Pavel Golub <pavel@microolap.com> writes:
>> From my point of view the new library should export only two
>> functions:
>> 1. The execution function:
>>    ExecStatusType PGdumpdbParams(const char * const *keywords,
>>                                  const char * const *values);
> No, this is exactly *wrong*. You might as well not bother to refactor,
> if the only API the library presents is exactly equivalent to what you
> could get with system("pg_dump ...").
>
> I don't know what the right answer is, but this isn't it. Most people
> who are interested in this topic are interested because they want to get
> output that is different from anything pg_dump would produce on its own,
> for instance applying a more complex object-selection rule than anything
> pg_dump offers. Right now, the only way they can do that is to lobby to
> add new switch options to pg_dump. With a change like this, it'd still
> be the case that they can't get what they want except by adding new
> switch options to pg_dump. I don't see any advantage gained.

Well, either they want that, or they want that output more accessibly and without all the baggage that pg_dump necessarily brings to the table. pg_dump does a lot of stuff that's basically designed for bulk operations, and often what people want is a way to get, say, the creation DDL for some object, without any locks other than the usual locks any transaction takes.

Last year I started writing a package to provide such functions, which I called RetailDDL, and it was well received at the conference where I talked about it, but I have not had time since then to work on it, as JSON development among other things has had a rather higher priority. But I think it's very well worth doing. I think in principle having database functions for the creation DDL of a database's own objects is a good idea. And yes, that would mean keeping knowledge of how to produce such output in two places: pg_dump is going to need to keep historical knowledge, for one thing. But I can live with that.

It could be interesting to have a library that would output database metadata in some machine-readable and manipulatable format such as JSON or XML. One thing that's annoying about the text output pg_dump produces is that it's not at all structured, so if you want, for example, to restore a table but to a table of a different name, or to a different schema, then you're reduced to mangling the SQL by hand editing or regular-expression matching. Something with the potential to ease that pain would be worth having.

cheers

andrew
* Andrew Dunstan (andrew@dunslane.net) wrote:
> On 04/11/2013 09:51 AM, Tom Lane wrote:
>> No, this is exactly *wrong*. You might as well not bother to refactor,
>> if the only API the library presents is exactly equivalent to what you
>> could get with system("pg_dump ...").

Agreed.

> Well, either they want that, or they want that output more
> accessibly and without all the baggage that pg_dump necessarily
> brings to the table. pg_dump does a lot of stuff that's basically
> designed for bulk operations, and often what people want is a way to
> get, say, the creation DDL for some object, without any locks other
> than the usual locks any transaction takes.

Yes, being able to get that from a simple database function would be very nice. I wonder if some of what's been done with the "event" triggers would inform us about what that API should look like.

> And yes, that would mean keeping knowledge of how to produce such
> output in two places: pg_dump is going to need to keep historical
> knowledge, for one thing. But I can live with that.

Agreed. If it doesn't live in two places, for some period, we'll never actually have it outside of pg_dump.

> It could be interesting to have a library that would output database
> metadata in some machine-readable and manipulatable format such as
> JSON or XML.

The DB function to produce the DDL command might have options to produce it in multiple formats..? Seems like it'd be nice to have, though perhaps not initially.

Thanks,

Stephen
<div dir="ltr">On Fri, Apr 12, 2013 at 1:00 AM, Stephen Frost <span dir="ltr"><<a href="mailto:sfrost@snowman.net" target="_blank">sfrost@snowman.net</a>></span>wrote:<br /><div class="gmail_extra"><div class="gmail_quote"><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">>Well, either they want that or they want that output more<br /><div class="im"> > accessibly,and without all the baggage that pg_dump necessarily<br /> > brings to the table. pg_dump does a lot of stuffthat's basically<br /> > designed for bulk operations, and often what people want is a way to<br /> > get, say,the creation DDL for some object, without any locks than<br /> > the usual locks any transaction takes.<br /><br /></div>Yes-being able to get that from a simple database function would be<br /> very nice. I wonder if some of what'sbeen done with the "event"<br /> triggers would inform us about what that API should look like.<br /></blockquote></div>Irecall discussions about reverse engineering of a parsed query tree in<br /></div><div class="gmail_extra">theevent trigger threads but nothing has been committed I think. Also, you<br /></div><div class="gmail_extra">need to consider that implementing such reverse engineering mechanism in<br />core might not be a goodthing for new features and maintenance, as it<br />would mean that it is necessary to change those APIs consistentlywith what<br /> is added on the parsing side.<br /></div><div class="gmail_extra">It could make more sense tohave such a set of functions created as a<br /></div><div class="gmail_extra">separate project.<br /><br /></div><div class="gmail_extra">My 2c.<br /></div><div class="gmail_extra">-- <br />Michael<br /></div></div>
>>> Well, either they want that, or they want that output more
>>> accessibly and without all the baggage that pg_dump necessarily
>>> brings to the table. pg_dump does a lot of stuff that's basically
>>> designed for bulk operations, and often what people want is a way to
>>> get, say, the creation DDL for some object, without any locks other
>>> than the usual locks any transaction takes.
>>
>> Yes, being able to get that from a simple database function would be
>> very nice. I wonder if some of what's been done with the "event"
>> triggers would inform us about what that API should look like.
>
> I recall discussions about reverse engineering of a parsed query tree in
> the event trigger threads, but nothing has been committed I think. Also, you
> need to consider that implementing such a reverse-engineering mechanism in
> core might not be a good thing for new features and maintenance, as it
> would mean that it is necessary to change those APIs consistently with what
> is added on the parsing side.
> It could make more sense to have such a set of functions created as a
> separate project.

This may or may not be related, but pgpool-II already does "reverse engineering" from a parsed query. It parses a query, generates a raw parse tree, rewrites it for certain purposes and generates a text query. If you are interested, you could take a look at the pgpool-II source code.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> This idea doesn't work because of back-patch considerations (i.e. we
>> would not be able to create the functions in back branches, and so this
>> new style of pg_dump would only work with future server versions). So

That is a policy question, not a technical one. We could either add the new functions in the backend binary itself, or provide them as an extension that pg_dump would know to install when needed, if we decided it's OK. My understanding is that we will need to change that policy anyway the day we have a page disk format change and pg_upgrade needs to flag the old cluster pages with the old page version number before being able to run, or something.

> There are other issues too, in particular that most of the backend's
> code tends to work on SnapshotNow time, whereas pg_dump would really
> prefer it was all done according to the transaction snapshot.

Would that be solved by having MVCC catalogs, or would the backend code you're talking about not be included in there? (Which would be surprising to me, as much as trumping the benefits of MVCC catalogs, but well.)

> In any case, "push it to the backend" offers no detectable help with the
> core design issue here, which is figuring out what functionality needs
> to be exposed with what API.

Andrew did begin to work on those parts with the "Retail DDL" project. We know of several "getddl" implementations, and you can also have a look at the pg_dump -Fs (split format) patch that didn't make it for 9.3, where some API work has been done. The need exists, and some thinking about the API did happen. Some more certainly needs to be done, granted.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
Michael Paquier <michael.paquier@gmail.com> writes:
> I recall discussions about reverse engineering of a parsed query tree in
> the event trigger threads, but nothing has been committed I think.

Yes. The name used in there was "Normalized Command String".

> Also, you need to consider that implementing such a reverse-engineering
> mechanism in core might not be a good thing for new features and
> maintenance, as it would mean that it is necessary to change those APIs
> consistently with what is added on the parsing side.

The goal is to retain the same API, which is quite simple:

    function get_command_string(Node *parsetree) returns text

At the SQL level, the Node * is of datatype "internal" and you can't forge it; you need to be given it in some way. In the Event Trigger case we thought of a TG_PARSETREE magic variable, or maybe another function get_current_parsetree() that only works when called from an event trigger.

The other part of the API of course is how to represent the data, and as we're talking about a Normalized Command String, there's no choice but to issue a valid SQL command string that the server would know how to execute and which would have the same side effects. So of course a 9.3 and a 9.4 server equipped with that hypothetical function would behave differently when the syntax changed. And that's exactly why I think the best choice here is to have that code embedded and maintained in core.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
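In backend terms, such a function would presumably be an ordinary fmgr-callable C function taking an "internal" argument. A rough skeleton, assuming a hypothetical deparse_parsetree() helper for the actual deparsing work (the PG_* macros and cstring_to_text() are standard backend APIs):

    /* Hypothetical skeleton of the proposed get_command_string() function. */
    #include "postgres.h"
    #include "fmgr.h"
    #include "nodes/nodes.h"
    #include "utils/builtins.h"

    extern char *deparse_parsetree(Node *parsetree);    /* hypothetical */

    PG_FUNCTION_INFO_V1(get_command_string);

    Datum
    get_command_string(PG_FUNCTION_ARGS)
    {
        /* At the SQL level the argument is of type "internal", so callers
         * cannot forge it; an event trigger would hand it to us. */
        Node   *parsetree = (Node *) PG_GETARG_POINTER(0);
        char   *command = deparse_parsetree(parsetree);

        PG_RETURN_TEXT_P(cstring_to_text(command));
    }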
On 04/11/2013 11:48 PM, Andrew Dunstan wrote:
> It could be interesting to have a library that would output database
> metadata in some machine-readable and manipulatable format such as
> JSON or XML. One thing that's annoying about the text output pg_dump
> produces is that it's not at all structured, so if you want, for
> example, to restore a table but to a table of a different name, or to
> a different schema, then you're reduced to mangling the SQL by hand
> editing or regular-expression matching. Something with the potential
> to ease that pain would be worth having.

Yes. This is really interesting.

The current code in pg_dump supports four ArchiveFormats, i.e. archCustom, archTar, archNull and archDirectory. These formats are implementations of the "pg_backup" interface. Maybe I could try to add two implementations, for XML and JSON.

It is worth mentioning that a month ago I wrote a program to parse an XML-format file into a CSV one using the library "libxerces-c". Although that program was just a hello-world, maybe I could benefit from it, because both use the XML format; all I would need to do is try another XML library.

I had a look at JSON on the wiki. The format is a little like XML: both of them are nested. And there are some libraries that could be used, e.g. libjson (or json.c, or other JSON libraries written in C) and libxml2 (or something else).

BTW, could this be an idea for GSOC? If so, I can have a try: add XML and JSON output formats for pg_dump.

Thank you all for your attention.

Best regards,
Shuai
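For reference, pg_dump's existing output formats are wired up as callbacks on the ArchiveHandle (see pg_backup_custom.c and pg_backup_directory.c). A very rough sketch of the shape a JSON format might take; the callback names follow pg_backup_archiver.h but the signatures here are simplified, and the JSON layout in the comment is invented purely for illustration:

    /* Hypothetical pg_backup_json.c skeleton, modeled on pg_backup_custom.c.
     * Each TOC entry could be emitted as a JSON object along the lines of
     *   {"type": "TABLE", "schema": "public", "name": "orders",
     *    "create": "CREATE TABLE ...", "drop": "DROP TABLE ..."}
     * so a restore tool could rename objects without regex-editing SQL text. */
    #include "pg_backup_archiver.h"

    static void
    _ArchiveEntry(ArchiveHandle *AH, TocEntry *te)
    {
        /* emit one JSON object describing te: type, schema, name, DDL */
    }

    static size_t
    _WriteData(ArchiveHandle *AH, const void *data, size_t dLen)
    {
        /* stream table data, e.g. as a JSON array of rows */
        return dLen;
    }

    static void
    _CloseArchive(ArchiveHandle *AH)
    {
        /* terminate the surrounding JSON document */
    }

    void
    InitArchiveFmt_Json(ArchiveHandle *AH)
    {
        AH->ArchiveEntryPtr = _ArchiveEntry;
        AH->WriteDataPtr    = _WriteData;
        AH->ClosePtr        = _CloseArchive;
        /* the remaining archiver callbacks would be filled in similarly */
    }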
On 04/11/2013 12:17 AM, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Hannu Krosing wrote:
>>> A natural solution to this seems to be moving most of pg_dump's
>>> functionality into the backend as functions, so that we have
>>> pg_dump_xxx() for everything we want to dump, plus a topological
>>> sort function for getting the objects in the right order.
>> This idea doesn't work because of back-patch considerations (i.e. we
>> would not be able to create the functions in back branches, and so this
>> new style of pg_dump would only work with future server versions). So
>> pg_dump itself would have to retain the capability to dump stuff from
>> old servers. This seems unlikely to fly at all, because we'd then
>> effectively be maintaining pg_dump in two places, both the backend and
>> the pg_dump source code.
> There are other issues too, in particular that most of the backend's
> code tends to work on SnapshotNow time, whereas pg_dump would really
> prefer it was all done according to the transaction snapshot.

I was just thinking of moving the queries pg_dump currently uses into UDFs which do _not_ use the catalog cache, but will use the same SQL to query the catalogs as pg_dump currently does, using whatever snapshot mode is currently set.

pg_dump will still need to carry the same queries for older versions of PostgreSQL, but for new versions pg_dump can become catalog-agnostic. And I think that we can retire pg_dump support for older PostgreSQL versions the same way we drop support for older versions of PostgreSQL itself.

Hannu

> We have got bugs of that ilk already in pg_dump, but we shouldn't
> introduce a bunch more. Doing this right would therefore mean that we'd
> have to write a lot of duplicative code in the backend; i.e., it's not
> clear that we gain any synergy by pushing the functionality over. It
> might simplify cross-backend-version issues (at least for backend
> versions released after we'd rewritten all that code) but otherwise I'm
> afraid it'd just be pushing the problems somewhere else.
>
> In any case, "push it to the backend" offers no detectable help with the
> core design issue here, which is figuring out what functionality needs
> to be exposed with what API.

The main things I see would be:

* get_list_of_objects(object_type, pattern or namelist)
* get_sql_def_for_object(object_type, object_name)
* sort_by_dependency(list of [obj_type, obj_name])

From these you could easily construct most uses, especially if sort_by_dependency() were smart enough to break circular dependencies, like turning two tables with mutual FKs into table definitions without FKs plus separate constraints. Or we could always emit constraints separately, so that the ones depending on non-exported objects would be easy to leave out.

Maybe the dependency API analysis itself is something worth a GSOC effort?

Hannu
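To make the proposal concrete, a client could then assemble a custom dump with ordinary queries. A sketch using libpq, where the three backend functions (and their exact signatures) are hypothetical, taken only from the list above:

    /* Hypothetical client-side use of the proposed backend functions. */
    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn   *conn = PQconnectdb("dbname=mydb");
        PGresult *res;
        int       i;

        if (PQstatus(conn) != CONNECTION_OK)
            return 1;

        /* list matching tables, order them by dependency, fetch their DDL */
        res = PQexec(conn,
            "SELECT get_sql_def_for_object(s.obj_type, s.obj_name) "
            "FROM sort_by_dependency("
            "       get_list_of_objects('table', 'myschema.%')) AS s");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            for (i = 0; i < PQntuples(res); i++)
                printf("%s;\n", PQgetvalue(res, i, 0));

        PQclear(res);
        PQfinish(conn);
        return 0;
    }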
On Fri, Apr 12, 2013 at 1:07 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> I was just thinking of moving the queries pg_dump currently uses into
> UDFs which do _not_ use the catalog cache, but will use the same SQL
> to query the catalogs as pg_dump currently does, using whatever
> snapshot mode is currently set.
>
> pg_dump will still need to carry the same queries for older versions
> of PostgreSQL, but for new versions pg_dump can become
> catalog-agnostic. And I think that we can retire pg_dump support for
> older PostgreSQL versions the same way we drop support for older
> versions of PostgreSQL itself.
>
> The main things I see would be:
>
> * get_list_of_objects(object_type, pattern or namelist)
> * get_sql_def_for_object(object_type, object_name)
> * sort_by_dependency(list of [obj_type, obj_name])
>
> From these you could easily construct most uses, especially if
> sort_by_dependency() were smart enough to break circular dependencies,
> like turning two tables with mutual FKs into table definitions without
> FKs plus separate constraints.
+1
This is an excellent idea. This would allow doing all kinds of crazy things outside of the scope of pg_dump.
Two years ago I was working on a system to version-control the schema inside the database.
I don't know if it's a good idea or not, but one thing which bugged me a lot was the lack of pg_get_[object type]def(oid) functions for all the different object types.
It also turned out to be quite complicated to do the pg_depend topological sort yourself. I managed eventually, but it ran too slowly because I had to pass the entire content of pg_depend to a plperl function I wrote.
With this in place I would be motivated enough to resume my old project, which is still online at https://github.com/gluefinance/pov if anyone is interested.
Is it really necessary to write all the missing pg_get_[object type]def(oid) functions in C? I think it would be quite easy to put them together using pure SQL; you wouldn't even need PL/pgSQL.
An old view I once wrote manages to produce working create and drop statements for most object types using SQL only.
It would also be nice to have functions that return the proper command to DROP an object. I need that in this project in order to do schema modifications where objects have to be dropped and recreated in a particular order so as not to break dependencies. Perhaps there are other use cases out there.