Thread: pg_dump: Remove "blob" terminology

pg_dump: Remove "blob" terminology

From

Peter Eisentraut

Date:

30 November 2022, 07:04:01

For historical reasons, pg_dump refers to large objects as "BLOBs". 
This term is not used anywhere else in PostgreSQL, and it also means 
something different in the SQL standard and other SQL systems.

This patch renames internal functinos, code comments, documentation, 
etc. to use the "large object" or "LO" terminology instead.  There is no 
functionality change, so the archive format still uses the name "BLOB" 
for the archive entry.  Additional long command-line options are added 
with the new naming.

Attachment

0001-pg_dump-Remove-blob-terminology.patch

Re: pg_dump: Remove "blob" terminology

From

Daniel Gustafsson

Date:

30 November 2022, 08:07:53

> On 30 Nov 2022, at 08:04, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> For historical reasons, pg_dump refers to large objects as "BLOBs". This term is not used anywhere else in
PostgreSQL,and it also means something different in the SQL standard and other SQL systems. 
>
> This patch renames internal functinos, code comments, documentation, etc. to use the "large object" or "LO"
terminologyinstead.  There is no functionality change, so the archive format still uses the name "BLOB" for the archive
entry. Additional long command-line options are added with the new naming. 

+1 on doing this.  No pointy bits stood out when reading, just a few small
comments:

The commit message contains a typo: functinos

  * called for both BLOB and TABLE data; it is the responsibility of
- * the format to manage each kind of data using StartBlob/StartData.
+ * the format to manage each kind of data using StartLO/StartData.

Should BLOB be changed to BLOBS here (and in similar comments) to make it
clearer that it refers to the archive entry and the concept of a binary large
object in general?

Theres an additional mention in src/test/modules/test_pg_dump/t/001_base.pl:

    # Tests which are considered 'full' dumps by pg_dump, but there.
    # are flags used to exclude specific items (ACLs, blobs, etc).

--
Daniel Gustafsson        https://vmware.com/

Re: pg_dump: Remove "blob" terminology

From

Peter Eisentraut

Date:

02 December 2022, 07:09:46

On 30.11.22 09:07, Daniel Gustafsson wrote:
> The commit message contains a typo: functinos

fixed

>    * called for both BLOB and TABLE data; it is the responsibility of
> - * the format to manage each kind of data using StartBlob/StartData.
> + * the format to manage each kind of data using StartLO/StartData.
> 
> Should BLOB be changed to BLOBS here (and in similar comments) to make it
> clearer that it refers to the archive entry and the concept of a binary large
> object in general?

I changed this one and went through it again to tidy it up a bit more. 
(There are both "BLOB" and "BLOBS" archive entries, so both forms still 
exist in the code now.)

> Theres an additional mention in src/test/modules/test_pg_dump/t/001_base.pl:
> 
>      # Tests which are considered 'full' dumps by pg_dump, but there.
>      # are flags used to exclude specific items (ACLs, blobs, etc).

fixed

I also put back the old long options forms in the documentation and help 
but marked them deprecated.

Attachment

v2-0001-pg_dump-Remove-blob-terminology.patch

Re: pg_dump: Remove "blob" terminology

From

Daniel Gustafsson

Date:

02 December 2022, 08:34:11

> On 2 Dec 2022, at 08:09, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

> fixed

+1 on this version of the patch, LGTM.

> I also put back the old long options forms in the documentation and help but marked them deprecated.

+      <term><option>--blobs</option> (deprecated)</term>
While not in scope for this patch, I wonder if we should add a similar
(deprecated) marker on other commandline options which are documented to be
deprecated.  -i on bin/postgres comes to mind as one candidate.

--
Daniel Gustafsson        https://vmware.com/

Re: pg_dump: Remove "blob" terminology

From

Tom Lane

Date:

02 December 2022, 14:18:57

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> On 30.11.22 09:07, Daniel Gustafsson wrote:
>> Should BLOB be changed to BLOBS here (and in similar comments) to make it
>> clearer that it refers to the archive entry and the concept of a binary large
>> object in general?

> I changed this one and went through it again to tidy it up a bit more.
> (There are both "BLOB" and "BLOBS" archive entries, so both forms still
> exist in the code now.)

I've not read this patch and don't have time right this moment, but
I wanted to make a note about something to have in the back of your
head: we really need to do something about situations with $BIGNUM
large objects.  Currently those tend to run pg_dump or pg_restore
out of memory because of TOC bloat, and we've seen multiple field
reports of that actually happening.

The scheme I've vaguely thought about, but not got round to writing,
is to merge all blobs with the same owner and ACL into one TOC entry.
One would hope that would get it down to a reasonable number of
TOC entries in practical applications.  (Perhaps there'd need to be
a switch to make this optional.)

I'm not asking you to make that happen as part of this patch, but
please don't refactor things in a way that will make it harder.

            regards, tom lane

Re: pg_dump: Remove "blob" terminology

From

Daniel Gustafsson

Date:

02 December 2022, 14:53:33

> On 2 Dec 2022, at 15:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I'm not asking you to make that happen as part of this patch, but
> please don't refactor things in a way that will make it harder.

I have that on my TODO as well since 7da8823d83a2b66bdd917aa6cb2c5c2619d86011.camel@credativ.de,
and having read this patch I don't think it moves the needle in any
way which complicates that.

--
Daniel Gustafsson        https://vmware.com/

Re: pg_dump: Remove "blob" terminology

From

Andrew Dunstan

Date:

02 December 2022, 17:26:29

On 2022-12-02 Fr 09:18, Tom Lane wrote:
> we really need to do something about situations with $BIGNUM
> large objects.  Currently those tend to run pg_dump or pg_restore
> out of memory because of TOC bloat, and we've seen multiple field
> reports of that actually happening.
>
> The scheme I've vaguely thought about, but not got round to writing,
> is to merge all blobs with the same owner and ACL into one TOC entry.
> One would hope that would get it down to a reasonable number of
> TOC entries in practical applications.  (Perhaps there'd need to be
> a switch to make this optional.)


+1 for fixing this. Your scheme seems reasonable. This has been a pain
point for a long time. I'm not sure what we'd gain by making the fix
optional.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pg_dump: Remove "blob" terminology

From

Tom Lane

Date:

02 December 2022, 17:40:35

Andrew Dunstan <andrew@dunslane.net> writes:
> On 2022-12-02 Fr 09:18, Tom Lane wrote:
>> The scheme I've vaguely thought about, but not got round to writing,
>> is to merge all blobs with the same owner and ACL into one TOC entry.
>> One would hope that would get it down to a reasonable number of
>> TOC entries in practical applications.  (Perhaps there'd need to be
>> a switch to make this optional.)

> +1 for fixing this. Your scheme seems reasonable. This has been a pain
> point for a long time. I'm not sure what we'd gain by making the fix
> optional.

Well, what this would lose is the ability to selectively restore
individual large objects using "pg_restore -L".  I'm not sure who
out there might be depending on that, but if we assume that's the
empty set I fear we'll find out it's not.  So a workaround switch
seemed possibly worth the trouble.  I don't have a position yet
on which way ought to be the default.

            regards, tom lane

Re: pg_dump: Remove "blob" terminology

From

Andrew Dunstan

Date:

03 December 2022, 16:07:09

On 2022-12-02 Fr 12:40, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 2022-12-02 Fr 09:18, Tom Lane wrote:
>>> The scheme I've vaguely thought about, but not got round to writing,
>>> is to merge all blobs with the same owner and ACL into one TOC entry.
>>> One would hope that would get it down to a reasonable number of
>>> TOC entries in practical applications.  (Perhaps there'd need to be
>>> a switch to make this optional.)
>> +1 for fixing this. Your scheme seems reasonable. This has been a pain
>> point for a long time. I'm not sure what we'd gain by making the fix
>> optional.
> Well, what this would lose is the ability to selectively restore
> individual large objects using "pg_restore -L".  I'm not sure who
> out there might be depending on that, but if we assume that's the
> empty set I fear we'll find out it's not.  So a workaround switch
> seemed possibly worth the trouble.  I don't have a position yet
> on which way ought to be the default.
>
>             


OK, fair point. I suspect making the batched mode the default would gain
more friends than enemies.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pg_dump: Remove "blob" terminology

From

Peter Eisentraut

Date:

05 December 2022, 08:04:58

On 02.12.22 09:34, Daniel Gustafsson wrote:
>> On 2 Dec 2022, at 08:09, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> 
>> fixed
> 
> +1 on this version of the patch, LGTM.

committed

>> I also put back the old long options forms in the documentation and help but marked them deprecated.
> 
> +      <term><option>--blobs</option> (deprecated)</term>
> While not in scope for this patch, I wonder if we should add a similar
> (deprecated) marker on other commandline options which are documented to be
> deprecated.  -i on bin/postgres comes to mind as one candidate.

makes sense

Re: pg_dump: Remove "blob" terminology

From

Robert Haas

Date:

05 December 2022, 15:08:57

On Sat, Dec 3, 2022 at 11:07 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> > Well, what this would lose is the ability to selectively restore
> > individual large objects using "pg_restore -L".  I'm not sure who
> > out there might be depending on that, but if we assume that's the
> > empty set I fear we'll find out it's not.  So a workaround switch
> > seemed possibly worth the trouble.  I don't have a position yet
> > on which way ought to be the default.
>
> OK, fair point. I suspect making the batched mode the default would gain
> more friends than enemies.

A lot of people probably don't know that selective restore even exists
but it is an AWESOME feature and I'd hate to see us break it, or even
just degrade it.

I wonder if we can't find a better solution than bunching TOC entries
together. Perhaps we don't need every TOC entry in memory
simultaneously, for instance, especially things like LOBs that don't
have dependencies.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_dump: Remove "blob" terminology

From

Tom Lane

Date:

05 December 2022, 15:56:43

Robert Haas <robertmhaas@gmail.com> writes:
> I wonder if we can't find a better solution than bunching TOC entries
> together. Perhaps we don't need every TOC entry in memory
> simultaneously, for instance, especially things like LOBs that don't
> have dependencies.

Interesting idea.  We'd have to either read the TOC multiple times,
or shove the LOB TOC entries into a temporary file, either of which
has downsides.

            regards, tom lane

Re: pg_dump: Remove "blob" terminology

From

Robert Haas

Date:

05 December 2022, 16:16:43

On Mon, Dec 5, 2022 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Interesting idea.  We'd have to either read the TOC multiple times,
> or shove the LOB TOC entries into a temporary file, either of which
> has downsides.

I wonder if we'd need to read the TOC entries repeatedly, or just incrementally.

I haven't studied the pg_dump format much, but I wonder if we could
arrange things so that the "boring" entries without dependencies, or
maybe the LOB entries specifically, are grouped together in some way
where we know the byte-length of that section and can just skip over
it. Then when we need them we go back and read 'em one by one.

(Of course this doesn't work if reading for standard input or if
multiple phases of processing need them or whatever. Not trying to say
I've solved the problem. But in general I think we need to give more
thought to the possibility that "just read all the data into memory"
is not an adequate solution to $PROBLEM.)

-- 
Robert Haas
EDB: http://www.enterprisedb.com