Thread: pg_dump: Remove "blob" terminology
For historical reasons, pg_dump refers to large objects as "BLOBs". This term is not used anywhere else in PostgreSQL, and it also means something different in the SQL standard and other SQL systems. This patch renames internal functinos, code comments, documentation, etc. to use the "large object" or "LO" terminology instead. There is no functionality change, so the archive format still uses the name "BLOB" for the archive entry. Additional long command-line options are added with the new naming.
Attachment
> On 30 Nov 2022, at 08:04, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > For historical reasons, pg_dump refers to large objects as "BLOBs". This term is not used anywhere else in PostgreSQL,and it also means something different in the SQL standard and other SQL systems. > > This patch renames internal functinos, code comments, documentation, etc. to use the "large object" or "LO" terminologyinstead. There is no functionality change, so the archive format still uses the name "BLOB" for the archive entry. Additional long command-line options are added with the new naming. +1 on doing this. No pointy bits stood out when reading, just a few small comments: The commit message contains a typo: functinos * called for both BLOB and TABLE data; it is the responsibility of - * the format to manage each kind of data using StartBlob/StartData. + * the format to manage each kind of data using StartLO/StartData. Should BLOB be changed to BLOBS here (and in similar comments) to make it clearer that it refers to the archive entry and the concept of a binary large object in general? Theres an additional mention in src/test/modules/test_pg_dump/t/001_base.pl: # Tests which are considered 'full' dumps by pg_dump, but there. # are flags used to exclude specific items (ACLs, blobs, etc). -- Daniel Gustafsson https://vmware.com/
On 30.11.22 09:07, Daniel Gustafsson wrote: > The commit message contains a typo: functinos fixed > * called for both BLOB and TABLE data; it is the responsibility of > - * the format to manage each kind of data using StartBlob/StartData. > + * the format to manage each kind of data using StartLO/StartData. > > Should BLOB be changed to BLOBS here (and in similar comments) to make it > clearer that it refers to the archive entry and the concept of a binary large > object in general? I changed this one and went through it again to tidy it up a bit more. (There are both "BLOB" and "BLOBS" archive entries, so both forms still exist in the code now.) > Theres an additional mention in src/test/modules/test_pg_dump/t/001_base.pl: > > # Tests which are considered 'full' dumps by pg_dump, but there. > # are flags used to exclude specific items (ACLs, blobs, etc). fixed I also put back the old long options forms in the documentation and help but marked them deprecated.
Attachment
> On 2 Dec 2022, at 08:09, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > fixed +1 on this version of the patch, LGTM. > I also put back the old long options forms in the documentation and help but marked them deprecated. + <term><option>--blobs</option> (deprecated)</term> While not in scope for this patch, I wonder if we should add a similar (deprecated) marker on other commandline options which are documented to be deprecated. -i on bin/postgres comes to mind as one candidate. -- Daniel Gustafsson https://vmware.com/
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > On 30.11.22 09:07, Daniel Gustafsson wrote: >> Should BLOB be changed to BLOBS here (and in similar comments) to make it >> clearer that it refers to the archive entry and the concept of a binary large >> object in general? > I changed this one and went through it again to tidy it up a bit more. > (There are both "BLOB" and "BLOBS" archive entries, so both forms still > exist in the code now.) I've not read this patch and don't have time right this moment, but I wanted to make a note about something to have in the back of your head: we really need to do something about situations with $BIGNUM large objects. Currently those tend to run pg_dump or pg_restore out of memory because of TOC bloat, and we've seen multiple field reports of that actually happening. The scheme I've vaguely thought about, but not got round to writing, is to merge all blobs with the same owner and ACL into one TOC entry. One would hope that would get it down to a reasonable number of TOC entries in practical applications. (Perhaps there'd need to be a switch to make this optional.) I'm not asking you to make that happen as part of this patch, but please don't refactor things in a way that will make it harder. regards, tom lane
> On 2 Dec 2022, at 15:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm not asking you to make that happen as part of this patch, but > please don't refactor things in a way that will make it harder. I have that on my TODO as well since 7da8823d83a2b66bdd917aa6cb2c5c2619d86011.camel@credativ.de, and having read this patch I don't think it moves the needle in any way which complicates that. -- Daniel Gustafsson https://vmware.com/
On 2022-12-02 Fr 09:18, Tom Lane wrote: > we really need to do something about situations with $BIGNUM > large objects. Currently those tend to run pg_dump or pg_restore > out of memory because of TOC bloat, and we've seen multiple field > reports of that actually happening. > > The scheme I've vaguely thought about, but not got round to writing, > is to merge all blobs with the same owner and ACL into one TOC entry. > One would hope that would get it down to a reasonable number of > TOC entries in practical applications. (Perhaps there'd need to be > a switch to make this optional.) +1 for fixing this. Your scheme seems reasonable. This has been a pain point for a long time. I'm not sure what we'd gain by making the fix optional. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2022-12-02 Fr 09:18, Tom Lane wrote: >> The scheme I've vaguely thought about, but not got round to writing, >> is to merge all blobs with the same owner and ACL into one TOC entry. >> One would hope that would get it down to a reasonable number of >> TOC entries in practical applications. (Perhaps there'd need to be >> a switch to make this optional.) > +1 for fixing this. Your scheme seems reasonable. This has been a pain > point for a long time. I'm not sure what we'd gain by making the fix > optional. Well, what this would lose is the ability to selectively restore individual large objects using "pg_restore -L". I'm not sure who out there might be depending on that, but if we assume that's the empty set I fear we'll find out it's not. So a workaround switch seemed possibly worth the trouble. I don't have a position yet on which way ought to be the default. regards, tom lane
On 2022-12-02 Fr 12:40, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 2022-12-02 Fr 09:18, Tom Lane wrote: >>> The scheme I've vaguely thought about, but not got round to writing, >>> is to merge all blobs with the same owner and ACL into one TOC entry. >>> One would hope that would get it down to a reasonable number of >>> TOC entries in practical applications. (Perhaps there'd need to be >>> a switch to make this optional.) >> +1 for fixing this. Your scheme seems reasonable. This has been a pain >> point for a long time. I'm not sure what we'd gain by making the fix >> optional. > Well, what this would lose is the ability to selectively restore > individual large objects using "pg_restore -L". I'm not sure who > out there might be depending on that, but if we assume that's the > empty set I fear we'll find out it's not. So a workaround switch > seemed possibly worth the trouble. I don't have a position yet > on which way ought to be the default. > > OK, fair point. I suspect making the batched mode the default would gain more friends than enemies. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 02.12.22 09:34, Daniel Gustafsson wrote: >> On 2 Dec 2022, at 08:09, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > >> fixed > > +1 on this version of the patch, LGTM. committed >> I also put back the old long options forms in the documentation and help but marked them deprecated. > > + <term><option>--blobs</option> (deprecated)</term> > While not in scope for this patch, I wonder if we should add a similar > (deprecated) marker on other commandline options which are documented to be > deprecated. -i on bin/postgres comes to mind as one candidate. makes sense
On Sat, Dec 3, 2022 at 11:07 AM Andrew Dunstan <andrew@dunslane.net> wrote: > > Well, what this would lose is the ability to selectively restore > > individual large objects using "pg_restore -L". I'm not sure who > > out there might be depending on that, but if we assume that's the > > empty set I fear we'll find out it's not. So a workaround switch > > seemed possibly worth the trouble. I don't have a position yet > > on which way ought to be the default. > > OK, fair point. I suspect making the batched mode the default would gain > more friends than enemies. A lot of people probably don't know that selective restore even exists but it is an AWESOME feature and I'd hate to see us break it, or even just degrade it. I wonder if we can't find a better solution than bunching TOC entries together. Perhaps we don't need every TOC entry in memory simultaneously, for instance, especially things like LOBs that don't have dependencies. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > I wonder if we can't find a better solution than bunching TOC entries > together. Perhaps we don't need every TOC entry in memory > simultaneously, for instance, especially things like LOBs that don't > have dependencies. Interesting idea. We'd have to either read the TOC multiple times, or shove the LOB TOC entries into a temporary file, either of which has downsides. regards, tom lane
On Mon, Dec 5, 2022 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Interesting idea. We'd have to either read the TOC multiple times, > or shove the LOB TOC entries into a temporary file, either of which > has downsides. I wonder if we'd need to read the TOC entries repeatedly, or just incrementally. I haven't studied the pg_dump format much, but I wonder if we could arrange things so that the "boring" entries without dependencies, or maybe the LOB entries specifically, are grouped together in some way where we know the byte-length of that section and can just skip over it. Then when we need them we go back and read 'em one by one. (Of course this doesn't work if reading for standard input or if multiple phases of processing need them or whatever. Not trying to say I've solved the problem. But in general I think we need to give more thought to the possibility that "just read all the data into memory" is not an adequate solution to $PROBLEM.) -- Robert Haas EDB: http://www.enterprisedb.com