Thread: BUG #16285: bt_metap fails with value is out of range for type integer

BUG #16285: bt_metap fails with value is out of range for type integer

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16285
Logged by:          Victor Yegorov
Email address:      vyegorov@gmail.com
PostgreSQL version: 12.2
Operating system:   Ubuntu 18.04.3 LTS
Description:

I have an index, that is giving issues pageinspect-ing it:

SELECT * FROM bt_metap('index')\gx
ERROR:  value "2180413846" is out of range for type integer

At the same time:

SELECT * FROM pgstatindex('index')\gx
-[ RECORD 1 ]------+----------
version            | 3
tree_level         | 2
index_size         | 131571712
root_block_no      | 290
internal_pages     | 56
leaf_pages         | 16003
empty_pages        | 0
deleted_pages      | 1
avg_leaf_density   | 50.06
leaf_fragmentation | 66.08

Looking at the sources of both extensions, I can see, that pgstatindex() is
using psprintf(INT64_FORMAT) for page counters and psprintf("%u") for root
page, while bt_metap() is using only psprintf("%d");

I assume psprintf("%u") should be used at least for metad->btm_root and
metad->btm_fastroot in the bt_metap(PG_FUNCTION_ARGS) function.


Re: BUG #16285: bt_metap fails with value is out of range for type integer

From
Peter Geoghegan
Date:
On Mon, Mar 2, 2020 at 2:40 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
> Looking at the sources of both extensions, I can see, that pgstatindex() is
> using psprintf(INT64_FORMAT) for page counters and psprintf("%u") for root
> page, while bt_metap() is using only psprintf("%d");
>
> I assume psprintf("%u") should be used at least for metad->btm_root and
> metad->btm_fastroot in the bt_metap(PG_FUNCTION_ARGS) function.

While this has been wrong forever, the immediate problem here is
probably not metad->btm_root. In practice, the root block number is
usually fairly low. Very large indexes tend to have a root page that
only has a small number of tuples, because the root page was created
relatively early in the lifetime of the index. Past a certain point,
the root page receives new tuples so infrequently that it almost never
happens.

I think it's more likely that the problem here is the relatively new
column returned by bt_metap(), oldest_xact. That has only been around
since Postgres v11.

I'm not sure how we should handle this in the backbranches, since only
a change in the CREATE FUNCTION declaration of bt_metap() can truly
fix the problem. I suppose that we could work around the problem with
some kind of kludge, but come up with a real fix for v13.

-- 
Peter Geoghegan



Re: BUG #16285: bt_metap fails with value is out of range for type integer

From
Victor Yegorov
Date:
вт, 3 мар. 2020 г. в 01:03, Peter Geoghegan <pg@bowt.ie>:
I think it's more likely that the problem here is the relatively new
column returned by bt_metap(), oldest_xact. That has only been around
since Postgres v11.

This made me look at the stats, as this is a test copy of the DB upgraded to 12.2 recently. Per pg_stat_user_tables,
table had never been vacuumed since upgrade, only analyzed.
I've run vacuum manually and this made the issue go away.

It is beyond my skills to try to find the real place for this issue now. I think we might be hitting some uninitialized value somewhere, perhaps?..

--
Victor Yegorov

Re: BUG #16285: bt_metap fails with value is out of range for typeinteger

From
Alvaro Herrera
Date:
On 2020-Mar-02, PG Bug reporting form wrote:

> I have an index, that is giving issues pageinspect-ing it:
> 
> SELECT * FROM bt_metap('index')\gx
> ERROR:  value "2180413846" is out of range for type integer

I understand it's gone now, but you could have used \errverbose to
display the source function name that caused the error, followed by
setting backtrace_functions to that function and repeated the query.
That may have resulted in a more precise location of the problem.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16285: bt_metap fails with value is out of range for type integer

From
Peter Geoghegan
Date:
On Mon, Mar 2, 2020 at 3:19 PM Victor Yegorov <vyegorov@gmail.com> wrote:
> This made me look at the stats, as this is a test copy of the DB upgraded to 12.2 recently. Per pg_stat_user_tables,
> table had never been vacuumed since upgrade, only analyzed.
> I've run vacuum manually and this made the issue go away.
>
> It is beyond my skills to try to find the real place for this issue now. I think we might be hitting some
uninitializedvalue somewhere, perhaps?..
 

The issue is that bt_metap() was declared using the wrong data types
-- simple as that. An int4 cannot represent 2180413846 -- only a
uint32 (or a TransactionId) can. Technically this is not a recent
issue, since we got btm_root wrong right from the start. However, it
seems more likely that the relatively recently added oldest_xact field
will cause problems in practice. The fact that a manual VACUUM made
that go away for you (at least temporarily) is not surprising.

The declaration itself is wrong, so it is the declaration itself that
we must fix. I can't really see a way of fixing it without introducing
a new version of contrib/pageinspect. :-(

I propose the attached fix for the master branch only. This is a draft
patch. The patch changes the data types to more appropriate, wider
integer types. It follows the convention of using int8/bigint where
the natural data type to use at the C code level is actually uint32 --
pg_stat_statements does this with queryId (actually, we switched over
to 64-bit hashes some time later, but it worked that way before commit
cff440d3686).

The patch takes another idea from pg_stat_statements: Using a
tupledesc's natts field to determine the extension version in use, to
maintain compatibility when there are breaking changes to a function's
definition. I simply error out when bt_metap() is called using the
definition from an older version of the extension, unlike
pg_stat_statements. It isn't worth going to the trouble of preserving
a set of behaviors that are more or less broken anyway. Better to
error out in an obvious way. Especially given that contrib/pageinspect
is fundamentally a superuser-only extension, with a relatively small
user base.

Theoretically, I should also change bt_page_items() (and several
functions) to make its blkno In parameter of type int8/bigint (instead
of int4/integer). I haven't bothered with that, though.

--
Peter Geoghegan

Attachment

Re: BUG #16285: bt_metap fails with value is out of range for type integer

From
Victor Yegorov
Date:
вт, 3 мар. 2020 г. в 03:39, Alvaro Herrera <alvherre@2ndquadrant.com>:
I understand it's gone now, but you could have used \errverbose to
display the source function name that caused the error, followed by
setting backtrace_functions to that function and repeated the query.
That may have resulted in a more precise location of the problem.

It is not entirely gone, I still have some indexes around that produce this.
This backtracing functionality is available only in the HEAD as far as I can see, so it took me a while
to get to it:

[3760] ERROR:  value "4282444360" is out of range for type integer
[3760] BACKTRACE:
        postgres: postgres postgres [local] SELECT(pg_strtoint32+0xdf) [0x55ab094e4d7f]
        postgres: postgres postgres [local] SELECT(int4in+0xd) [0x55ab094acc5d]
        postgres: postgres postgres [local] SELECT(InputFunctionCall+0x7b) [0x55ab095675ab]
        postgres: postgres postgres [local] SELECT(BuildTupleFromCStrings+0xa7) [0x55ab093066a7]
        /var/lib/postgresql/opt/pg13devel/lib/pageinspect.so(bt_metap+0x199) [0x7f7ec35b7619]
        postgres: postgres postgres [local] SELECT(ExecMakeTableFunctionResult+0x40b) [0x55ab09302f0b]
        postgres: postgres postgres [local] SELECT(+0x27bd11) [0x55ab09311d11]
        postgres: postgres postgres [local] SELECT(ExecScan+0x7b) [0x55ab0930381b]
        postgres: postgres postgres [local] SELECT(standard_ExecutorRun+0x102) [0x55ab092fa3f2]
        postgres: postgres postgres [local] SELECT(+0x3b6d3d) [0x55ab0944cd3d]
        postgres: postgres postgres [local] SELECT(PortalRun+0x2ba) [0x55ab0944e17a]
        postgres: postgres postgres [local] SELECT(+0x3b3d07) [0x55ab09449d07]
        postgres: postgres postgres [local] SELECT(PostgresMain+0x1df6) [0x55ab0944c046]
        postgres: postgres postgres [local] SELECT(+0x33de85) [0x55ab093d3e85]
        postgres: postgres postgres [local] SELECT(PostmasterMain+0xf40) [0x55ab093d5000]
        postgres: postgres postgres [local] SELECT(main+0x4a4) [0x55ab09154964]
        /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f7ecd26eb97]
        postgres: postgres postgres [local] SELECT(_start+0x2a) [0x55ab09154a2a]
[3760] STATEMENT:  select * from bt_metap('table_pkey');


--
Victor Yegorov

Re: BUG #16285: bt_metap fails with value is out of range for type integer

From
Peter Geoghegan
Date:
On Fri, Mar 6, 2020 at 1:38 AM Victor Yegorov <vyegorov@gmail.com> wrote:
> It is not entirely gone, I still have some indexes around that produce this.
> This backtracing functionality is available only in the HEAD as far as I can see, so it took me a while
> to get to it:
>
> [3760] ERROR:  value "4282444360" is out of range for type integer

This has to be the oldest_xact field. If it was any of the other
fields, the "%d" format would not result in an error (it would just
result in incorrectly displaying a negative integer). oldest_xact is
the only field that uses "%u" (unfortunately, the declaration makes
the field an int4/integer, so you may see this error).


--
Peter Geoghegan



Re: BUG #16285: bt_metap fails with value is out of range for type integer

From
Peter Geoghegan
Date:
On Fri, Mar 6, 2020 at 2:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This has to be the oldest_xact field. If it was any of the other
> fields, the "%d" format would not result in an error (it would just
> result in incorrectly displaying a negative integer). oldest_xact is
> the only field that uses "%u" (unfortunately, the declaration makes
> the field an int4/integer, so you may see this error).

Pushed a fix for this just now.

Thanks for the report!

--
Peter Geoghegan



Re: BUG #16285: bt_metap fails with value is out of range for type integer

From
Victor Yegorov
Date:
вс, 8 мар. 2020 г. в 02:45, Peter Geoghegan <pg@bowt.ie>:
Pushed a fix for this just now.

Thanks for the fix! 


--
Victor Yegorov

Re: BUG #16285: bt_metap fails with value is out of range for typeinteger

From
Andres Freund
Date:
Hi,

On 2020-03-07 16:45:27 -0800, Peter Geoghegan wrote:
> On Fri, Mar 6, 2020 at 2:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > This has to be the oldest_xact field. If it was any of the other
> > fields, the "%d" format would not result in an error (it would just
> > result in incorrectly displaying a negative integer). oldest_xact is
> > the only field that uses "%u" (unfortunately, the declaration makes
> > the field an int4/integer, so you may see this error).
> 
> Pushed a fix for this just now.
> 
> Thanks for the report!

ISTM that we need some fix for the back-branches too. Being unable to
look at some indexes till 12 has aged out doesn't strike me as good.

How about simply printing the wrapped value? That's far from perfect, of
course, but clearly better than the current situation in the back
branches.

Greetings,

Andres Freund



Re: BUG #16285: bt_metap fails with value is out of range for type integer

From
Peter Geoghegan
Date:
On Mon, Mar 9, 2020 at 3:09 PM Andres Freund <andres@anarazel.de> wrote:
> ISTM that we need some fix for the back-branches too. Being unable to
> look at some indexes till 12 has aged out doesn't strike me as good.

Actually, the oldest_xact field was added in Postgres 11.

> How about simply printing the wrapped value? That's far from perfect, of
> course, but clearly better than the current situation in the back
> branches.

Would you be happy if we always raised a NOTICE that had information
about the affected fields? I don't think that we should try to be
clever and only do it when we know that it will fail. We should admit
that it's broken with a HINT that gets associated with the NOTICE, in
order to discourage relying on the number within automated tools.

If we were to do this, it would probably only be necessary to
backpatch to Postgres 11 and 12. Those are the only stable releases
with the oldest_xact field. In practice, it is highly likely to be the
thing that causes problems. We will report the root block number at a
negative block number when it happens to exceed 2^31-1, but that
condition is almost impossible to hit in practice, even when the index
size is close to the system-wide limit of relation size. That's why
nobody has complained about it in all these years.

-- 
Peter Geoghegan



Re: BUG #16285: bt_metap fails with value is out of range for typeinteger

From
Andres Freund
Date:
Hi,

On 2020-03-09 15:31:38 -0700, Peter Geoghegan wrote:
> On Mon, Mar 9, 2020 at 3:09 PM Andres Freund <andres@anarazel.de> wrote:
> > ISTM that we need some fix for the back-branches too. Being unable to
> > look at some indexes till 12 has aged out doesn't strike me as good.
> 
> Actually, the oldest_xact field was added in Postgres 11.

What do you mean? Since 12 is the newest release affected, we'd
potentially (and with increasing likelihood due to clusters living
longer) have the problem till 12 is not supported anymore. What am I
missing?


> > How about simply printing the wrapped value? That's far from perfect, of
> > course, but clearly better than the current situation in the back
> > branches.
> 
> Would you be happy if we always raised a NOTICE that had information
> about the affected fields? I don't think that we should try to be
> clever and only do it when we know that it will fail. We should admit
> that it's broken with a HINT that gets associated with the NOTICE, in
> order to discourage relying on the number within automated tools.

I'd just do the s/%u/%d/.


> If we were to do this, it would probably only be necessary to
> backpatch to Postgres 11 and 12. Those are the only stable releases
> with the oldest_xact field. In practice, it is highly likely to be the
> thing that causes problems. We will report the root block number at a
> negative block number when it happens to exceed 2^31-1, but that
> condition is almost impossible to hit in practice, even when the index
> size is close to the system-wide limit of relation size. That's why
> nobody has complained about it in all these years.

pg_class.relpages is also reported as a signed integer :(. Since
btm_root/fastroot use %d, it'll just have similar wrapping behaviour.

Greetings,

Andres Freund



Re: BUG #16285: bt_metap fails with value is out of range for type integer

From
Peter Geoghegan
Date:
On Mon, Mar 9, 2020 at 3:36 PM Andres Freund <andres@anarazel.de> wrote:
> What do you mean? Since 12 is the newest release affected, we'd
> potentially (and with increasing likelihood due to clusters living
> longer) have the problem till 12 is not supported anymore. What am I
> missing?

But 12 isn't the latest release affected. It just so happens that
Victor was using 12, but oldest_xact was actually added by commit
857f9c36 -- that's Postgres 11.

To be very precise: I imagine that Victor was using bt_metap() in
production on a Postgres 12 installation because he wanted to make
sure that his installation had the new stuff (he did a talk about it
at EU, so clearly it's of interest to him). The problem is
nevertheless not new to Postgres 12.

> I'd just do the s/%u/%d/.

That's a pretty gross hack. So be it.

> pg_class.relpages is also reported as a signed integer :(. Since
> btm_root/fastroot use %d, it'll just have similar wrapping behaviour.

I guess that means that pageinspect was correct after all!

-- 
Peter Geoghegan



Re: BUG #16285: bt_metap fails with value is out of range for typeinteger

From
Andres Freund
Date:
Hi,

On 2020-03-09 17:16:47 -0700, Peter Geoghegan wrote:
> On Mon, Mar 9, 2020 at 3:36 PM Andres Freund <andres@anarazel.de> wrote:
> > What do you mean? Since 12 is the newest release affected, we'd
> > potentially (and with increasing likelihood due to clusters living
> > longer) have the problem till 12 is not supported anymore. What am I
> > missing?
>
> But 12 isn't the latest release affected. It just so happens that
> Victor was using 12, but oldest_xact was actually added by commit
> 857f9c36 -- that's Postgres 11.

Huh? I think we might be miscommunicating here. My point isn't about the
*earliest* release affected, it's about the *latest* version without a
fix. IOW, until when is there a supported release without a fix.

And once 12 is not supported anymore, 11 is also unsupported. So we'd
have a live bug (which would mainly hit while investigating issues)
until 12 is unsupported?


> To be very precise: I imagine that Victor was using bt_metap() in
> production on a Postgres 12 installation because he wanted to make
> sure that his installation had the new stuff (he did a talk about it
> at EU, so clearly it's of interest to him). The problem is
> nevertheless not new to Postgres 12.
>
> > I'd just do the s/%u/%d/.
>
> That's a pretty gross hack. So be it.

Yea, it is.


> > pg_class.relpages is also reported as a signed integer :(. Since
> > btm_root/fastroot use %d, it'll just have similar wrapping behaviour.
>
> I guess that means that pageinspect was correct after all!

Well, for some value of correct. I was arguing quite a while ago that we
should just make pg_class.relpages a 64bit integer, or introduce an sql
level type for block numbers.

Greetings,

Andres Freund



Re: BUG #16285: bt_metap fails with value is out of range for type integer

From
Peter Geoghegan
Date:
On Mon, Mar 9, 2020 at 5:22 PM Andres Freund <andres@anarazel.de> wrote:
> Huh? I think we might be miscommunicating here. My point isn't about the
> *earliest* release affected, it's about the *latest* version without a
> fix. IOW, until when is there a supported release without a fix.

Got it.

> And once 12 is not supported anymore, 11 is also unsupported. So we'd
> have a live bug (which would mainly hit while investigating issues)
> until 12 is unsupported?

> > To be very precise: I imagine that Victor was using bt_metap() in
> > production on a Postgres 12 installation because he wanted to make
> > sure that his installation had the new stuff (he did a talk about it
> > at EU, so clearly it's of interest to him). The problem is
> > nevertheless not new to Postgres 12.
> >
> > > I'd just do the s/%u/%d/.
> >
> > That's a pretty gross hack. So be it.
>
> Yea, it is.

Right. But we only need the gross kludge on 11 and 12 -- there is no
"%u" to change on earlier Postgres versions. That will allow all
Postgres/pageinspect versions to at least manage to consistently
display something within the bt_metap() fields.

-- 
Peter Geoghegan



Re: BUG #16285: bt_metap fails with value is out of range for type integer

From
Peter Geoghegan
Date:
On Mon, Mar 9, 2020 at 5:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Right. But we only need the gross kludge on 11 and 12 -- there is no
> "%u" to change on earlier Postgres versions. That will allow all
> Postgres/pageinspect versions to at least manage to consistently
> display something within the bt_metap() fields.

I pushed commits that make this change to both REL_11_STABLE and
REL_12_STABLE. It's a total hack, but better than doing nothing.

Somebody just asked about checking the B-Tree version on -general, and
I pointed them in the direction of bt_metap(). I think that we'd hear
more complaints like Victor's if we left 11 and 12 alone.

-- 
Peter Geoghegan