Thread: pg_class.relpages/allvisible probably shouldn't be a int4

pg_class.relpages/allvisible probably shouldn't be a int4

From
Andres Freund
Date:
Hi,

postgres=# SELECT relpages FROM pg_class WHERE relname = 'large'; relpages
--------------1804468224
(1 row)

postgres=# \dt+ large                  List of relationsSchema | Name  | Type  | Owner  | Size  | Description
--------+-------+-------+--------+-------+-------------public | large | table | andres | 19 TB |
(1 row)

That's nothing for 9.4 anymore, but shouldn't we make pg_class.relpages
a int8 (sounds slightly better than float to me) or somesuch?

I think most calculations actually work out because they're performed
after casting to BlockNumber, but ...

Greetings,

Andres Freund

PS: _mdfd_getseg is considered harmful(). Not fun once you have couple
of hundred gigabytes.

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> That's nothing for 9.4 anymore, but shouldn't we make pg_class.relpages
> a int8 (sounds slightly better than float to me) or somesuch?

No; those are really BlockNumbers, and have always been.  float4 would
lose information and float8 or int8 would waste space.  If we had an
unsigned int type it'd be better.  I suppose we could declare them as OID,
but that would probably confuse people no end.
        regards, tom lane



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Andres Freund
Date:
On May 9, 2014 10:37:49 PM CEST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Andres Freund <andres@2ndquadrant.com> writes:
>> That's nothing for 9.4 anymore, but shouldn't we make
>pg_class.relpages
>> a int8 (sounds slightly better than float to me) or somesuch?
>
>No; those are really BlockNumbers, and have always been.  float4 would
>lose information and float8 or int8 would waste space.  If we had an
>unsigned int type it'd be better.  I suppose we could declare them as
>OID,
>but that would probably confuse people no end.

Well negative numbers aren't great either.  Although admittedly it's not yet affecting many...

I think the waste of storing 2*4 additional bytes isn't going to hurt much.
And adding a proper unsigned type doesn't sound like a small amount of work. Not to speak of overloading troubles....
I realize they are block numbers and casted in most places - that's why the overflow doesn't seem to cause too many
troubles.

Andres

-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Peter Geoghegan
Date:
On Fri, May 9, 2014 at 1:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> And adding a proper unsigned type doesn't sound like a small amount of work.

Perhaps not, but it's overdue. We ought to have one.

-- 
Peter Geoghegan



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Andres Freund
Date:
On 2014-05-10 23:21:34 -0700, Peter Geoghegan wrote:
> On Fri, May 9, 2014 at 1:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > And adding a proper unsigned type doesn't sound like a small amount of work.
> 
> Perhaps not, but it's overdue. We ought to have one.

Maybe. But there's so many things to decide around it that I don't think
it's a good prerequisite for not showing essentially corrupted values in
a supported scenario.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-05-10 23:21:34 -0700, Peter Geoghegan wrote:
>> On Fri, May 9, 2014 at 1:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> And adding a proper unsigned type doesn't sound like a small amount of work.

>> Perhaps not, but it's overdue. We ought to have one.

> Maybe. But there's so many things to decide around it that I don't think
> it's a good prerequisite for not showing essentially corrupted values in
> a supported scenario.

It's a lot harder than it sounds at first; see past discussions about how
we could shoehorn one into the numeric type hierarchy.  And think about
how C expressions that mix signed and unsigned inputs tend to give
surprising results :-(

The bigger picture though is that it's hard to get excited about this
particular scenario, because if you are up to the point where your table
size overflows int32, you are less than a factor of 2 away from the hard
limit of BlockNumber, and will therefore soon have much bigger problems to
worry about than whether pg_class.relpages prints confusingly.  My advice
to anyone who reported this from the field would certainly be "time to
think about partitioning that table".  So if I were to take Andres'
complaint seriously at all, I'd be thinking in terms of "do we need to
widen BlockNumber to int64?", not "how do we make this print as
unsigned?".  But I doubt such a proposal would fly, because of the
negative impact on index sizes.

A possible cosmetic fix is to just clamp the value vacuum/analyze
will store into relpages at INT_MAX, while scaling reltuples so that
the all-important reltuples/relpages ratio stays at reality.  But
that might be harder than it sounds too, because of the moving
average tracking behavior for reltuples.  And it'd be a code path
that gets no meaningful testing :-(
        regards, tom lane



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Andres Freund
Date:
On 2014-05-11 12:24:30 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-05-10 23:21:34 -0700, Peter Geoghegan wrote:
> >> On Fri, May 9, 2014 at 1:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >>> And adding a proper unsigned type doesn't sound like a small amount of work.
> 
> >> Perhaps not, but it's overdue. We ought to have one.
> 
> > Maybe. But there's so many things to decide around it that I don't think
> > it's a good prerequisite for not showing essentially corrupted values in
> > a supported scenario.
> 
> It's a lot harder than it sounds at first; see past discussions about how
> we could shoehorn one into the numeric type hierarchy.  And think about
> how C expressions that mix signed and unsigned inputs tend to give
> surprising results :-(

Yea. I really don't like to take on such a major project to solve a
minor problem.
What I am thinking about right now is to expose a 'pg_blocknumber'
type. That only does very basic operations and implicitly casts to
int64. That's probably a much more handleable problem and it also might
give us some more experience with unsigned types. Comments?

> The bigger picture though is that it's hard to get excited about this
> particular scenario, because if you are up to the point where your table
> size overflows int32, you are less than a factor of 2 away from the hard
> limit of BlockNumber, and will therefore soon have much bigger problems to
> worry about than whether pg_class.relpages prints confusingly.

Well. It takes several years to get to 16TB, so properly supporting >16TB &&
< 32TB is a pretty good step. It'll allow a couple of years continued operation.

> My advice
> to anyone who reported this from the field would certainly be "time to
> think about partitioning that table".

That definitely got to be the mid to longterm plan. The problem is that
our partitioning sucks. Majorly. My hope is that by allowing that large
tables to work more properly we have time to improve partitioning to the
level that it doesn't basically requires to remove all nontrivial
constraints.

> So if I were to take Andres'
> complaint seriously at all, I'd be thinking in terms of "do we need to
> widen BlockNumber to int64?", not "how do we make this print as
> unsigned?".  But I doubt such a proposal would fly, because of the
> negative impact on index sizes.

Yea, I am not wild for that either. I guess migrating to a postgres with
a larger blocksize is the next step.

> A possible cosmetic fix is to just clamp the value vacuum/analyze
> will store into relpages at INT_MAX, while scaling reltuples so that
> the all-important reltuples/relpages ratio stays at reality.  But
> that might be harder than it sounds too, because of the moving
> average tracking behavior for reltuples.  And it'd be a code path
> that gets no meaningful testing :-(

I am absolutely not a fan of that either :(.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Heikki Linnakangas
Date:
On 05/12/2014 12:30 AM, Andres Freund wrote:
>> >So if I were to take Andres'
>> >complaint seriously at all, I'd be thinking in terms of "do we need to
>> >widen BlockNumber to int64?", not "how do we make this print as
>> >unsigned?".  But I doubt such a proposal would fly, because of the
>> >negative impact on index sizes.
> Yea, I am not wild for that either. I guess migrating to a postgres with
> a larger blocksize is the next step.

A larger block size won't buy you very much time either.

We could steal some bits from the OffsetNumber portion of an 
ItemPointer. If we assume the max. block size of 32kb, and that each 
Item takes at least 16 bytes, you only need 11 bits for the offset 
number. That leaves 5 bits unused, and if we use them to expand the 
block number to 37 bits in total, that's enough for 1 PB with the 
default 8k block size.

But I concur that in practice, if you're dealing with 16TB tables, it's 
time to partition.

- Heikki



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Andres Freund
Date:
On 2014-05-12 10:07:29 +0300, Heikki Linnakangas wrote:
> On 05/12/2014 12:30 AM, Andres Freund wrote:
> >>>So if I were to take Andres'
> >>>complaint seriously at all, I'd be thinking in terms of "do we need to
> >>>widen BlockNumber to int64?", not "how do we make this print as
> >>>unsigned?".  But I doubt such a proposal would fly, because of the
> >>>negative impact on index sizes.
> >Yea, I am not wild for that either. I guess migrating to a postgres with
> >a larger blocksize is the next step.
> 
> A larger block size won't buy you very much time either.

Well. If you mean 'a year or five with that... :)

> We could steal some bits from the OffsetNumber portion of an ItemPointer. If
> we assume the max. block size of 32kb, and that each Item takes at least 16
> bytes, you only need 11 bits for the offset number. That leaves 5 bits
> unused, and if we use them to expand the block number to 37 bits in total,
> that's enough for 1 PB with the default 8k block size.

Hm. That's not a generally bad idea. I think we'll have to do that in a
couple of years. Regardless of better partitioning.

> But I concur that in practice, if you're dealing with 16TB tables, it's time
> to partition.

Well, we need to improve our partitioning for that to be viable for all
relations. Not having usable foreign and unique keys makes it a pita in
some cases.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-05-12 10:07:29 +0300, Heikki Linnakangas wrote:
>> But I concur that in practice, if you're dealing with 16TB tables, it's time
>> to partition.

> Well, we need to improve our partitioning for that to be viable for all
> relations. Not having usable foreign and unique keys makes it a pita in
> some cases.

Well, yeah, but that's on the to-do list in any case.
        regards, tom lane



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Andres Freund
Date:
On 2014-05-11 23:30:37 +0200, Andres Freund wrote:
> On 2014-05-11 12:24:30 -0400, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > On 2014-05-10 23:21:34 -0700, Peter Geoghegan wrote:
> > >> On Fri, May 9, 2014 at 1:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > >>> And adding a proper unsigned type doesn't sound like a small amount of work.
> > 
> > >> Perhaps not, but it's overdue. We ought to have one.
> > 
> > > Maybe. But there's so many things to decide around it that I don't think
> > > it's a good prerequisite for not showing essentially corrupted values in
> > > a supported scenario.
> > 
> > It's a lot harder than it sounds at first; see past discussions about how
> > we could shoehorn one into the numeric type hierarchy.  And think about
> > how C expressions that mix signed and unsigned inputs tend to give
> > surprising results :-(
> 
> Yea. I really don't like to take on such a major project to solve a
> minor problem.
> What I am thinking about right now is to expose a 'pg_blocknumber'
> type. That only does very basic operations and implicitly casts to
> int64. That's probably a much more handleable problem and it also might
> give us some more experience with unsigned types. Comments?

As a note towards that: e.g. pageinspect deals with blocknumbers and
uses int4 for that. That makes accessing the higher blocks really
awkward...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Simon Riggs
Date:
On 12 May 2014 08:15, Andres Freund <andres@2ndquadrant.com> wrote:

>> But I concur that in practice, if you're dealing with 16TB tables, it's time
>> to partition.
>
> Well, we need to improve our partitioning for that to be viable for all
> relations. Not having usable foreign and unique keys makes it a pita in
> some cases.

As discussed, declarative partitioning is on the roadmap for this next
release, so I would say lets just document that tablesizes above 16TB
don't report correctly and move on.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Andres Freund
Date:
On 2014-05-30 12:00:47 +0100, Simon Riggs wrote:
> On 12 May 2014 08:15, Andres Freund <andres@2ndquadrant.com> wrote:
> 
> >> But I concur that in practice, if you're dealing with 16TB tables, it's time
> >> to partition.
> >
> > Well, we need to improve our partitioning for that to be viable for all
> > relations. Not having usable foreign and unique keys makes it a pita in
> > some cases.
> 
> As discussed, declarative partitioning is on the roadmap for this next
> release, so I would say lets just document that tablesizes above 16TB
> don't report correctly and move on.

I doubt we'll fix all the snags - like foreign keys, unique keys, etc -
that partitioning has in this release... Introducing a blocknumber type
seems easy and mechanical enough.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: pg_class.relpages/allvisible probably shouldn't be a int4

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2014-05-11 23:30:37 +0200, Andres Freund wrote:

> > Yea. I really don't like to take on such a major project to solve a
> > minor problem.
> > What I am thinking about right now is to expose a 'pg_blocknumber'
> > type. That only does very basic operations and implicitly casts to
> > int64. That's probably a much more handleable problem and it also might
> > give us some more experience with unsigned types. Comments?
> 
> As a note towards that: e.g. pageinspect deals with blocknumbers and
> uses int4 for that. That makes accessing the higher blocks really
> awkward...

Sounds like pg_blocknumber could be a good idea, but I wouldn't add the
bigint implicit casts to int64.  We might enlarge BlockNumber to
unsigned 64 bits in the future -- we will regret the implicit casts to
signed int64 then.  Anyway it's not like implicit casts will get us any
extra functionality.  We can just add + and - operators, and that should
cover 99% of what people would initially want ...

Now, adding casts FROM int32 and int64 to pg_blocknumber doesn't sound
bad.  (Maybe have them error out if the value is negative?)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services