Thread: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

pg_dump throwing "column number -1 is out of range 0..36" on HEAD

From
Michael Paquier
Date:
Hi all,

Trying to do pg_dump[all] on a 9.5 or older server results in spurious
failures:
pg_dump: column number -1 is out of range 0..36

After looking around, the problem comes from
check_tuple_field_number(), more specifically from getTables() where
someone has forgotten to add NULL values for amname when querying
older server versions.

Attached is a patch to fix that.  I am not seeing other failures with
an instance that includes all the contents of installcheck, so it
seems that the rest is fine.

This needs to be applied to HEAD, so I am adding an open item.

Any objections to the attached?
--
Michael

Attachment

Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

From
Dmitry Dolgov
Date:
> On Wed, May 22, 2019 at 10:34 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> Trying to do pg_dump[all] on a 9.5 or older server results in spurious
> failures:
> pg_dump: column number -1 is out of range 0..36
>
> After looking around, the problem comes from
> check_tuple_field_number(), more specifically from getTables() where
> someone has forgotten to add NULL values for amname when querying
> older server versions.

Yeah, sorry, looks like it was my fault.

> Attached is a patch to fix that.  I am not seeing other failures with
> an instance that includes all the contents of installcheck, so it
> seems that the rest is fine.
>
> This needs to be applied to HEAD, so I am adding an open item.
>
> Any objections to the attached?

I've checked it too (on 9.4), don't see any issues after applying this patch,
so +1.



Michael Paquier <michael@paquier.xyz> writes:
> Trying to do pg_dump[all] on a 9.5 or older server results in spurious
> failures:
> pg_dump: column number -1 is out of range 0..36

> After looking around, the problem comes from
> check_tuple_field_number(), more specifically from getTables() where
> someone has forgotten to add NULL values for amname when querying
> older server versions.

> Attached is a patch to fix that.  I am not seeing other failures with
> an instance that includes all the contents of installcheck, so it
> seems that the rest is fine.

Looks like the right fix.  I'm sad that the buildfarm did not catch
this ... why wouldn't the cross-version-upgrade tests have seen it?

            regards, tom lane



Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

From
Andres Freund
Date:
Hi,

On 2019-05-22 09:46:19 -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > Trying to do pg_dump[all] on a 9.5 or older server results in spurious
> > failures:
> > pg_dump: column number -1 is out of range 0..36
> 
> > After looking around, the problem comes from
> > check_tuple_field_number(), more specifically from getTables() where
> > someone has forgotten to add NULL values for amname when querying
> > older server versions.

Thanks for catching!


> > Attached is a patch to fix that.

Wouldn't the better fix be to change

        if (PQgetisnull(res, i, i_amname))
            tblinfo[i].amname = NULL;

into

        if (i_amname == -1 || PQgetisnull(res, i, i_amname))
            tblinfo[i].amname = NULL;

it's much more scalable than adding useless columns everywhere, and we
already use that approach with i_checkoption (and at a number of other
places).


> > Attached is a patch to fix that.  I am not seeing other failures with
> > an instance that includes all the contents of installcheck, so it
> > seems that the rest is fine.
> 
> Looks like the right fix.  I'm sad that the buildfarm did not catch
> this ... why wouldn't the cross-version-upgrade tests have seen it?

I suspect we just didn't notice that it saw that:

    if (field_num < 0 || field_num >= res->numAttributes)
    {
        pqInternalNotice(&res->noticeHooks,
                         "column number %d is out of range 0..%d",
                         field_num, res->numAttributes - 1);
        return false;
    }

as it's just a notice, not a failure.

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> Wouldn't the better fix be to change
>         if (PQgetisnull(res, i, i_amname))
>             tblinfo[i].amname = NULL;
> into
>         if (i_amname == -1 || PQgetisnull(res, i, i_amname))
>             tblinfo[i].amname = NULL;
> it's much more scalable than adding useless columns everywhere, and we
> already use that approach with i_checkoption (and at a number of other
> places).

FWIW, I think that's a pretty awful idea, and the fact that some
people have had it before doesn't make it less awful.  It's giving
up the ability to detect errors-of-omission, which might easily
be harmful rather than harmless errors.

It does seem like we're overdue to rethink how pg_dump handles
cross-version query differences ... but inconsistently lobotomizing
its internal error detection is not a good start on that.

>> Looks like the right fix.  I'm sad that the buildfarm did not catch
>> this ... why wouldn't the cross-version-upgrade tests have seen it?

> I suspect we just didn't notice that it saw that:
> as it's just a notice, not a failure.

Hm.  But shouldn't we have gotten garbage output from the pg_dump run?

            regards, tom lane



Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

From
Andres Freund
Date:
Hi,

On 2019-05-22 14:17:41 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Wouldn't the better fix be to change
> >         if (PQgetisnull(res, i, i_amname))
> >             tblinfo[i].amname = NULL;
> > into
> >         if (i_amname == -1 || PQgetisnull(res, i, i_amname))
> >             tblinfo[i].amname = NULL;
> > it's much more scalable than adding useless columns everywhere, and we
> > already use that approach with i_checkoption (and at a number of other
> > places).
> 
> FWIW, I think that's a pretty awful idea, and the fact that some
> people have had it before doesn't make it less awful.  It's giving
> up the ability to detect errors-of-omission, which might easily
> be harmful rather than harmless errors.

Well, if we explicitly have to check for -1, it's not really an error of
omission for everything. Yes, we could forget returning the amname in a
newer version of the query, but given that we usually just forward copy
the query that's not that likely. And instead having to change a lot of
per-branch queries also adds potential for error, and also makes it more
painful to fix cross-branch bugs etc.


> >> Looks like the right fix.  I'm sad that the buildfarm did not catch
> >> this ... why wouldn't the cross-version-upgrade tests have seen it?
> 
> > I suspect we just didn't notice that it saw that:
> > as it's just a notice, not a failure.
> 
> Hm.  But shouldn't we have gotten garbage output from the pg_dump run?

What garbage? We'd take the column as NULL, and assume it doesn't have
an assigned AM. Which is the right behaviour when dumping from < 12?

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2019-05-22 14:17:41 -0400, Tom Lane wrote:
>> FWIW, I think that's a pretty awful idea, and the fact that some
>> people have had it before doesn't make it less awful.  It's giving
>> up the ability to detect errors-of-omission, which might easily
>> be harmful rather than harmless errors.

> Well, if we explicitly have to check for -1, it's not really an error of
> omission for everything. Yes, we could forget returning the amname in a
> newer version of the query, but given that we usually just forward copy
> the query that's not that likely.

No, the concerns I have are about (1) failure to insert the extra return
column into all branches where it's needed; (2) some unexpected run-time
problem causing the PGresult to not have the expected column.

I think we've had some discussions about restructuring those giant
if-nests so that they build up the query in pieces, making it possible
to write things along the lines of

        appendPQExpBuffer(query,
                          "SELECT c.tableoid, c.oid, c.relname, "
                          // version-independent stuff here
                          ...);
        ...
        if (fout->remoteVersion >= 120000)
            appendPQExpBuffer(query, "am.amname, ");
        else
            appendPQExpBuffer(query, "NULL as amname, ");
        ...

I'm not sure if it'd work quite that cleanly when we need changes in the
FROM part, but certainly for newly-added result fields this would be
hugely better than repeating the whole query.  And yes, I'd still insist
that explicitly providing the alternative NULL value is not optional.


>> Hm.  But shouldn't we have gotten garbage output from the pg_dump run?

> What garbage? We'd take the column as NULL, and assume it doesn't have
> an assigned AM. Which is the right behaviour when dumping from < 12?

Oh, I see:

int
PQgetisnull(const PGresult *res, int tup_num, int field_num)
{
    if (!check_tuple_field_number(res, tup_num, field_num))
        return 1;                /* pretend it is null */

which just happens to be the right thing --- in this case --- for
the back branches.

            regards, tom lane



Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

From
Andrew Dunstan
Date:
On 5/22/19 9:46 AM, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> Trying to do pg_dump[all] on a 9.5 or older server results in spurious
>> failures:
>> pg_dump: column number -1 is out of range 0..36
>> After looking around, the problem comes from
>> check_tuple_field_number(), more specifically from getTables() where
>> someone has forgotten to add NULL values for amname when querying
>> older server versions.
>> Attached is a patch to fix that.  I am not seeing other failures with
>> an instance that includes all the contents of installcheck, so it
>> seems that the rest is fine.
> Looks like the right fix.  I'm sad that the buildfarm did not catch
> this ... why wouldn't the cross-version-upgrade tests have seen it?


That's a good question.


It's in the output - see for example

<https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2019-05-22%2017%3A47%3A30&stg=xversion-upgrade-REL9_4_STABLE-HEAD>
and scroll down a bit.


But since this doesn't cause pg_dumpall to fail, it doesn't on its own
cause the buildfarm to fail either, and this is apparently sufficiently
benign to allow the tests to succeed.



cheers


andrew


--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

From
Michael Paquier
Date:
On Wed, May 22, 2019 at 02:39:54PM -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> Well, if we explicitly have to check for -1, it's not really an error of
>> omission for everything. Yes, we could forget returning the amname in a
>> newer version of the query, but given that we usually just forward copy
>> the query that's not that likely.
>
> No, the concerns I have are about (1) failure to insert the extra return
> column into all branches where it's needed; (2) some unexpected run-time
> problem causing the PGresult to not have the expected column.

Using a -1 check is not something I find much helpful, because this
masks the real problem that some queries may not have the output they
expect.

> I'm not sure if it'd work quite that cleanly when we need changes in the
> FROM part, but certainly for newly-added result fields this would be
> hugely better than repeating the whole query.  And yes, I'd still insist
> that explicitly providing the alternative NULL value is not optional.

This makes the addition of JOIN clauses and WHERE quals harder to
follow and read, and it makes back-patching harder (with testing to
older versions it is already complicated enough) so I don't think that
this is a good idea.  One extra idea I have would be to add a
compile-time flag which we could use to enforce a hard failure with an
assertion or such in those code paths, because we never expect it in
the in-core clients.  And that would cause any failure to be
immediately visible, at the condition of using the flag of course.

> int
> PQgetisnull(const PGresult *res, int tup_num, int field_num)
> {
>     if (!check_tuple_field_number(res, tup_num, field_num))
>         return 1;                /* pretend it is null */
>
> which just happens to be the right thing --- in this case --- for
> the back branches.

Yes.  I don't think that this is completely wrong.  So, are there any
objections if I just apply the patch at the top of this thread and fix
the issue?
--
Michael

Attachment

Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

From
Andres Freund
Date:
Hi,

On 2019-05-23 09:11:33 +0900, Michael Paquier wrote:
> On Wed, May 22, 2019 at 02:39:54PM -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> >> Well, if we explicitly have to check for -1, it's not really an error of
> >> omission for everything. Yes, we could forget returning the amname in a
> >> newer version of the query, but given that we usually just forward copy
> >> the query that's not that likely.
> > 
> > No, the concerns I have are about (1) failure to insert the extra return
> > column into all branches where it's needed; (2) some unexpected run-time
> > problem causing the PGresult to not have the expected column.
> 
> Using a -1 check is not something I find much helpful, because this
> masks the real problem that some queries may not have the output they
> expect.

I don't buy this, at all. The likelihood of introducing failures by
having to modify a lot of queries nobody runs is much higher than what
we gain by the additional "checks". If this were something on the type
system level, where the compiler would detect the error, even without
running the query: Yea, ok. But it's not.


> Yes.  I don't think that this is completely wrong.  So, are there any
> objections if I just apply the patch at the top of this thread and fix
> the issue?

Well, I think the approach of duplicating code all over is a bad idea,
and the fix is many times too big. But it's better than not fixing.

Greetings,

Andres Freund



Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

From
Michael Paquier
Date:
On Thu, May 23, 2019 at 03:31:30PM -0700, Andres Freund wrote:
> Well, I think the approach of duplicating code all over is a bad idea,
> and the fix is many times too big. But it's better than not fixing.

Well, I can see why the current solution is not perfect, but we have
been doing that for some time now, and redesigning that part has a
much larger impact than a single column.  I have committed the initial
fix now.  We can always break the wheel later on in 13~.
--
Michael

Attachment

Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

From
Andres Freund
Date:
Hi,

On 2019-05-24 08:32:29 +0900, Michael Paquier wrote:
> On Thu, May 23, 2019 at 03:31:30PM -0700, Andres Freund wrote:
> > Well, I think the approach of duplicating code all over is a bad idea,
> > and the fix is many times too big. But it's better than not fixing.
> 
> Well, I can see why the current solution is not perfect, but we have
> been doing that for some time now, and redesigning that part has a
> much larger impact than a single column.  I have committed the initial
> fix now.

That argument would hold some sway if we there weren't a number of cases
doing it differently in the tree already:

        if (i_checkoption == -1 || PQgetisnull(res, i, i_checkoption))
            tblinfo[i].checkoption = NULL;
        else
            tblinfo[i].checkoption = pg_strdup(PQgetvalue(res, i, i_checkoption));

    if (PQfnumber(res, "protrftypes") != -1)
        protrftypes = PQgetvalue(res, 0, PQfnumber(res, "protrftypes"));
    else
        protrftypes = NULL;

    if (PQfnumber(res, "proparallel") != -1)
        proparallel = PQgetvalue(res, 0, PQfnumber(res, "proparallel"));
    else
        proparallel = NULL;

    if (i_proparallel != -1)
        proparallel = PQgetvalue(res, 0, PQfnumber(res, "proparallel"));
    else
        proparallel = NULL;

And no, I don't buy the consistency argument. Continuing to do redundant
work just because we always have, isn't better than having one useful
and one redundant approach. And yes, a full blown redesign would be
better, but that doesn't get harder by having the -1 checks.

- Andres