Thread: Inconvenience of pg_read_binary_file()

Inconvenience of pg_read_binary_file()

From
Kyotaro Horiguchi
Date:
If I want to read a file that I'm not sure of the existence but I want
to read the whole file if exists, I would call
pg_read_binary_file('path', 0, -1, true) but unfortunately this
doesn't work.

Does it make sense to change the function so as to accept the
parameter specification above? Or the arguments could be ('path',
null, null, true) but (0,-1) is simpler considering the
characteristics of the function.

(We could also rearrange the the parameter order as "filename,
missing_ok, offset, length" but that is simply confusing..)

If it is, pg_read_file() is worth receive the same modification and
I'll post the version containing doc part.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: Inconvenience of pg_read_binary_file()

From
Michael Paquier
Date:
On Tue, Jun 07, 2022 at 04:05:20PM +0900, Kyotaro Horiguchi wrote:
> If I want to read a file that I'm not sure of the existence but I want
> to read the whole file if exists, I would call
> pg_read_binary_file('path', 0, -1, true) but unfortunately this
> doesn't work.

Yeah, the "normal" cases that I have seen in the past just used an
extra call to pg_stat_file() to retrieve the size of the file before
reading it, but arguably it does not help if the file gets extended
between the stat() call and the read() call (we don't need to care
about this case with pg_rewind that has been the reason why the
missing_ok argument was introduced first, for one, as file extensions
don't matter as we'd replay from the LSN point where the rewind
began, adding the new blocks at replay).

There is also an argument for supporting negative values rather than
just -1.  For example -2 could mean to read all the file except the
last byte.  Or you could have an extra flavor, as of
pg_read_file(text, bool) to read the whole by default.  Supporting
just -1 as special value for the amount of data to read would be
confusing IMO.  So I would tend to choose for a set of arguments based
on (text, bool).
--
Michael

Attachment

Re: Inconvenience of pg_read_binary_file()

From
Kyotaro Horiguchi
Date:
At Tue, 7 Jun 2022 16:33:53 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Tue, Jun 07, 2022 at 04:05:20PM +0900, Kyotaro Horiguchi wrote:
> > If I want to read a file that I'm not sure of the existence but I want
> > to read the whole file if exists, I would call
> > pg_read_binary_file('path', 0, -1, true) but unfortunately this
> > doesn't work.
> 
> Yeah, the "normal" cases that I have seen in the past just used an
> extra call to pg_stat_file() to retrieve the size of the file before
> reading it, but arguably it does not help if the file gets extended
> between the stat() call and the read() call (we don't need to care
> about this case with pg_rewind that has been the reason why the
> missing_ok argument was introduced first, for one, as file extensions
> don't matter as we'd replay from the LSN point where the rewind
> began, adding the new blocks at replay).

Sure.

> There is also an argument for supporting negative values rather than
> just -1.  For example -2 could mean to read all the file except the
> last byte.  Or you could have an extra flavor, as of
> pg_read_file(text, bool) to read the whole by default.  Supporting
> just -1 as special value for the amount of data to read would be
> confusing IMO.  So I would tend to choose for a set of arguments based
> on (text, bool).

I'm not sure about the negative length smaller than -1, since I don't
find an apprpriate offset that represents (last byte + 1).

pg_read_file(text, bool) makes sense to me, but it doesn't seem like
to be able to share C function with other variations.
pg_read_binary_file() need to accept some out-of-range value for
offset or length to signal that offset and length are not specified.

In the attached pg_read(_binary)_file_all() is modifiedf so that they
have the different body from pg_read(_binary)_file().

(function comments needs to be edited and docs are needed)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: Inconvenience of pg_read_binary_file()

From
Kyotaro Horiguchi
Date:
At Tue, 07 Jun 2022 17:29:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> pg_read_file(text, bool) makes sense to me, but it doesn't seem like
> to be able to share C function with other variations.
> pg_read_binary_file() need to accept some out-of-range value for
> offset or length to signal that offset and length are not specified.

In this version all the polypmorphic variations share the same body
function.  I tempted to add tail-reading feature but it would be
another feature.

> (function comments needs to be edited and docs are needed)

- Simplified the implementation (by complexifying argument handling..).
- REVOKEd EXECUTE from the new functions.
- Edited the signature of the two functions.

> - pg_read_file ( filename text [, offset bigint, length bigint [, missing_ok boolean ]] ) → text
> + pg_read_file ( filename text [, offset bigint, length bigint ] [, missing_ok boolean ] ) → text

And registered this to the next CF.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: Inconvenience of pg_read_binary_file()

From
Tom Lane
Date:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> - Simplified the implementation (by complexifying argument handling..).
> - REVOKEd EXECUTE from the new functions.
> - Edited the signature of the two functions.

>> - pg_read_file ( filename text [, offset bigint, length bigint [, missing_ok boolean ]] ) $B"*(B text
>> + pg_read_file ( filename text [, offset bigint, length bigint ] [, missing_ok boolean ] ) $B"*(B text

I'm okay with allowing this variant of the functions.  Since there's
no implicit cast between bigint and bool, plus the fact that you
can't give just offset without length, there shouldn't be much risk
of confusion as to which variant to invoke.

I don't really like the implementation style though.  That mess of
PG_NARGS tests is illegible code already and this makes it worse.
I think it'd be way cleaner to have all the PG_GETARG calls in the
bottom SQL-callable functions (which are already one-per-signature)
and then pass them on to a common function that has an ordinary C
call signature, along the lines of

static Datum
pg_read_file_common(text *filename_t,
                    int64 seek_offset, int64 bytes_to_read,
                    bool read_to_eof, bool missing_ok)
{
    if (read_to_eof)
        bytes_to_read = -1;    // or just Assert that it's -1 ?
    else if (bytes_to_read < 0)
        ereport(...);
    ...
}

Datum
pg_read_file_off_len(PG_FUNCTION_ARGS)
{
    text       *filename_t = PG_GETARG_TEXT_PP(0);
    int64       seek_offset = PG_GETARG_INT64(1);
    int64       bytes_to_read = PG_GETARG_INT64(2);

    return pg_read_file_common(filename_t, seek_offset, bytes_to_read,
                               false, false);
}

            regards, tom lane



Re: Inconvenience of pg_read_binary_file()

From
Kyotaro Horiguchi
Date:
Thanks for taking a look!

At Thu, 28 Jul 2022 16:22:17 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > - Simplified the implementation (by complexifying argument handling..).
> > - REVOKEd EXECUTE from the new functions.
> > - Edited the signature of the two functions.
> 
> >> - pg_read_file ( filename text [, offset bigint, length bigint [, missing_ok boolean ]] ) → text
> >> + pg_read_file ( filename text [, offset bigint, length bigint ] [, missing_ok boolean ] ) → text
> 
> I'm okay with allowing this variant of the functions.  Since there's
> no implicit cast between bigint and bool, plus the fact that you
> can't give just offset without length, there shouldn't be much risk
> of confusion as to which variant to invoke.

Grad to hear that.

> I don't really like the implementation style though.  That mess of
> PG_NARGS tests is illegible code already and this makes it worse.

Ah..., I have to admit that I faintly felt that feeling while on it...

> I think it'd be way cleaner to have all the PG_GETARG calls in the
> bottom SQL-callable functions (which are already one-per-signature)
> and then pass them on to a common function that has an ordinary C
> call signature, along the lines of
> 
> static Datum
> pg_read_file_common(text *filename_t,
>                     int64 seek_offset, int64 bytes_to_read,
>                     bool read_to_eof, bool missing_ok)
> {
>     if (read_to_eof)
>         bytes_to_read = -1;    // or just Assert that it's -1 ?

I prefer assertion since that parameter cannot be passed by users.

>     else if (bytes_to_read < 0)
>         ereport(...);
>     ...
> }

This function cannot return NULL directly. Without the ability to
return NULL, it is pointless for the function to return Datum.  In the
attached the function returns text*.

> Datum
> pg_read_file_off_len(PG_FUNCTION_ARGS)
> {
>     text       *filename_t = PG_GETARG_TEXT_PP(0);
>     int64       seek_offset = PG_GETARG_INT64(1);
>     int64       bytes_to_read = PG_GETARG_INT64(2);
> 
>     return pg_read_file_common(filename_t, seek_offset, bytes_to_read,
>                                false, false);

As the result this function need to return NULL or TEXT_P according to
the returned value from pg_read_file_common.

+    if (!ret)
+        PG_RETURN_NULL();
+
+    PG_RETURN_TEXT_P(ret);
> }

# I'm tempted to call read_text_file() directly from each SQL functions..

Please find the attached.  I added some regression tests for both
pg_read_file() and pg_read_binary_file().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: Inconvenience of pg_read_binary_file()

From
Tom Lane
Date:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> Please find the attached.  I added some regression tests for both
> pg_read_file() and pg_read_binary_file().

Yeah, I definitely find this way cleaner even if it's a bit more verbose.

I think that the PG_RETURN_NULL code paths are not reachable in the
wrappers that don't have missing_ok.  I concur with your decision
to write them all the same, though.

Pushed after some fooling with the docs and test cases.  (Notably,
I do not think we can assume that pg_hba.conf exists in $PGDATA; some
installations keep it elsewhere.  I used postgresql.auto.conf instead.)

            regards, tom lane



Re: Inconvenience of pg_read_binary_file()

From
Michael Paquier
Date:
On Fri, Jul 29, 2022 at 03:44:25PM -0400, Tom Lane wrote:
> Pushed after some fooling with the docs and test cases.  (Notably,
> I do not think we can assume that pg_hba.conf exists in $PGDATA; some
> installations keep it elsewhere.  I used postgresql.auto.conf instead.)

Are you sure that this last part is a good idea?  We don't force the
creation of postgresql.auto.conf when starting a server, so this
impacts the portability of the tests with installcheck if one decides
to remove it from the data folder, and it sounds plausible to me that
some distributions do exactly that..

I guess that you could rely on config_file or hba_file instead.
--
Michael

Attachment

Re: Inconvenience of pg_read_binary_file()

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Fri, Jul 29, 2022 at 03:44:25PM -0400, Tom Lane wrote:
>> Pushed after some fooling with the docs and test cases.  (Notably,
>> I do not think we can assume that pg_hba.conf exists in $PGDATA; some
>> installations keep it elsewhere.  I used postgresql.auto.conf instead.)

> Are you sure that this last part is a good idea?  We don't force the
> creation of postgresql.auto.conf when starting a server, so this
> impacts the portability of the tests with installcheck if one decides
> to remove it from the data folder, and it sounds plausible to me that
> some distributions do exactly that..

Hm.  I considered reading PG_VERSION instead, or postmaster.pid.
PG_VERSION would be a very boring test case, but it should certainly
be present in $PGDATA.

            regards, tom lane



Re: Inconvenience of pg_read_binary_file()

From
Michael Paquier
Date:
On Fri, Jul 29, 2022 at 11:35:36PM -0400, Tom Lane wrote:
> Hm.  I considered reading PG_VERSION instead, or postmaster.pid.
> PG_VERSION would be a very boring test case, but it should certainly
> be present in $PGDATA.

PG_VERSION would be simpler.  Looking at postmaster.pid would require
a lookup at external_pid_file, and as it is not set by default you
would need to add some extra logic in the tests where
external_pid_file = NULL <=> PGDATA/postmaster.pid.
--
Michael

Attachment

Re: Inconvenience of pg_read_binary_file()

From
Alvaro Herrera
Date:
On 2022-Jul-30, Michael Paquier wrote:

> PG_VERSION would be simpler.  Looking at postmaster.pid would require
> a lookup at external_pid_file, and as it is not set by default you
> would need to add some extra logic in the tests where
> external_pid_file = NULL <=> PGDATA/postmaster.pid.

Hmm, no? as I recall external_pid_file is an *additional* PID file; it
doesn't supplant postmaster.pid.

postmaster.opts is also an option.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"E pur si muove" (Galileo Galilei)



Re: Inconvenience of pg_read_binary_file()

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2022-Jul-30, Michael Paquier wrote:
>> PG_VERSION would be simpler.  Looking at postmaster.pid would require
>> a lookup at external_pid_file, and as it is not set by default you
>> would need to add some extra logic in the tests where
>> external_pid_file = NULL <=> PGDATA/postmaster.pid.

> Hmm, no? as I recall external_pid_file is an *additional* PID file; it
> doesn't supplant postmaster.pid.

Right.  postmaster.pid absolutely should be there if the postmaster
is up (and if it ain't, you're going to have lots of other difficulty
in running the regression tests...).  It doesn't feel quite as static
as PG_VERSION, though.

            regards, tom lane



Re: Inconvenience of pg_read_binary_file()

From
Kyotaro Horiguchi
Date:
At Sat, 30 Jul 2022 10:24:39 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2022-Jul-30, Michael Paquier wrote:
> >> PG_VERSION would be simpler.  Looking at postmaster.pid would require
> >> a lookup at external_pid_file, and as it is not set by default you
> >> would need to add some extra logic in the tests where
> >> external_pid_file = NULL <=> PGDATA/postmaster.pid.
> 
> > Hmm, no? as I recall external_pid_file is an *additional* PID file; it
> > doesn't supplant postmaster.pid.
> 
> Right.  postmaster.pid absolutely should be there if the postmaster
> is up (and if it ain't, you're going to have lots of other difficulty
> in running the regression tests...).  It doesn't feel quite as static
> as PG_VERSION, though.

Thanks for committing it.  Also the revised test (being suggested by
Michael) looks good.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center