Thread: Reducing tuple overhead

Reducing tuple overhead

From
Andres Freund
Date:
Split into a new thread, the other one is already growing fast
enough. This discussion started at
http://archives.postgresql.org/message-id/55391469.5010506%40iki.fi

On April 23, 2015 6:48:57 PM GMT+03:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>Stop right there. You need to reserve enough space on the page to store
>an xmax for *every* tuple on the page. Because if you don't, what are
>you going to do when every tuple on the page is deleted by a different
>transaction.
>
>Even if you store the xmax somewhere else than the page header, you need
>to reserve the same amount of space for them, so it doesn't help at
>all.

Depends on how you do it and what you optimize for (disk space, runtime,
code complexity...).  You can e.g. apply a somewhat similar trick to
xmin/xmax as done to cmin/cmax; only that the data structure needs to be
persistent.

In fact, we already have a combocid-like structure for xids that's
persistent: multixacts. We could just have one xid saved that's either
an xmin or an xmax (indicated by flag bits), or a multixact.  When a
tuple is updated/deleted while its xmin is still required, we could
replace the former xmin with a multixact; otherwise just change the
flag to say it's now an xmax without an xmin.  To check visibility when
the xid is a multixact, we'd just have to look up the relevant members
for the actual xmin and xmax.
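
For illustration, a rough C sketch of what such a combined slot could
look like (hypothetical structure and helper names, not actual
PostgreSQL code):

#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical per-tuple slot: a single xid field that is either the
 * xmin, the xmax, or a reference to a persistent multixact-like
 * structure holding both.  The flags say how to interpret it. */
#define SLOT_IS_XMIN   0x01    /* xid is the inserting transaction */
#define SLOT_IS_XMAX   0x02    /* xid is the deleting transaction */
#define SLOT_IS_MULTI  0x04    /* xid names a multixact holding xmin+xmax */

typedef struct XidSlot
{
    TransactionId xid;
    uint8_t       flags;
} XidSlot;

/* Hypothetical lookup into the persistent multixact-like structure. */
extern void GetMultiMembers(TransactionId multi,
                            TransactionId *xmin, TransactionId *xmax);

static void
slot_get_xmin_xmax(const XidSlot *slot,
                   TransactionId *xmin, TransactionId *xmax)
{
    if (slot->flags & SLOT_IS_MULTI)
        GetMultiMembers(slot->xid, xmin, xmax); /* both live elsewhere */
    else if (slot->flags & SLOT_IS_XMIN)
    {
        *xmin = slot->xid;
        *xmax = 0;             /* no deleter yet (InvalidTransactionId) */
    }
    else                       /* SLOT_IS_XMAX */
    {
        *xmin = 0;             /* xmin no longer interesting to anyone */
        *xmax = slot->xid;
    }
}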

To avoid excessive overhead when a tuple is repeatedly updated within one
session, we could store some of the data in the combocid entry that we
need anyway in that case.

Whether that's feasible complexity-wise is debatable, but it's certainly
possible.


I do wonder what, in realistic cases, is actually the bigger contributor
to the overhead. The tuple header or the padding we liberally add in
many cases...

Andres



Re: Reducing tuple overhead

From
Jim Nasby
Date:
On 4/23/15 11:24 AM, Andres Freund wrote:
> I do wonder what, in realistic cases, is actually the bigger contributor
> to the overhead. The tuple header or the padding we liberally add in
> many cases...

Assuming you're talking about padding between fields...

Several years ago Enova paid Command Prompt to start work on logical 
column ordering, and part of the motivation for that was to allow 
re-ordering physical tuples into the most efficient on-disk format 
possible. I think I did some tests re-arranging some tables into the 
theoretically most efficient order and measuring heap size. I think 
there was some modest size improvement, maybe 10-15%? This was several 
years ago so it's all foggy. Maybe Josh can find some of this in CMD's 
ticketing system?
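
To see what that ordering effect looks like in isolation (a standalone
C illustration; struct field alignment behaves like tuple column
alignment here, and the sizes are what a typical 64-bit platform
produces):

#include <stdio.h>
#include <stdint.h>

/* Poorly ordered: each 8-byte field forces 7 bytes of padding after
 * the preceding 1-byte field. */
struct padded {
    char    a;          /* 1 byte + 7 padding */
    int64_t b;          /* 8 bytes */
    char    c;          /* 1 byte + 7 padding */
    int64_t d;          /* 8 bytes */
};                      /* typically 32 bytes */

/* Same fields, widest first: the interior padding disappears. */
struct reordered {
    int64_t b;
    int64_t d;
    char    a;
    char    c;
};                      /* typically 24 bytes, trailing pad included */

int main(void)
{
    printf("badly ordered: %zu bytes\n", sizeof(struct padded));
    printf("reordered:     %zu bytes\n", sizeof(struct reordered));
    return 0;
}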

Aside from that, something else that might be interesting is Tom's 
recent work on decoupling on-page representation of types from what's 
passed around internally. That might offer some gains here, even if it's 
just in reducing the need for alignment.

I also wonder if a similar technique would be useful at the tuple level. 
One possibility would be attempting to compress the tuple before putting 
it on the page.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Reducing tuple overhead

From
Petr Jelinek
Date:
On 23/04/15 18:24, Andres Freund wrote:
> Whether that's feasible complexity wise is debatable, but it's certainly
> possible.
>
>
> I do wonder what, in realistic cases, is actually the bigger contributor
> to the overhead. The tuple header or the padding we liberally add in
> many cases...
>

The logical ordering patch + auto optimizations of column layout on 
table creation/rewrite might help partially there.

But what seems clear is that we need more in-depth analysis of what
really contributes most to the tuple size in various use-cases, and
then we can debate what we can do about it.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Reducing tuple overhead

From
"Joshua D. Drake"
Date:
On 04/23/2015 09:42 AM, Jim Nasby wrote:
>
> On 4/23/15 11:24 AM, Andres Freund wrote:
>> I do wonder what, in realistic cases, is actually the bigger contributor
>> to the overhead. The tuple header or the padding we liberally add in
>> many cases...
>
> Assuming you're talking about padding between fields...
>
> Several years ago Enova paid Command Prompt to start work on logical
> column ordering, and part of the motivation for that was to allow
> re-ordering physical tuples into the most efficient on-disk format
> possible. I think I did some tests re-arranging some tables into the
> theoretically most efficient order and measuring heap size. I think
> there was some modest size improvement, maybe 10-15%? This was several
> years ago so it's all foggy. Maybe Josh can find some of this in CMD's
> ticketing system?

Yeah I dug around. I don't see anything about size improvement but here 
are our notes:

Alvaro said:

I ended up not producing notes as regularly as I had initially hoped. To 
try and make up for it, here's an update covering everything I've done 
since I started working on this issue.

This patch turned out to be completely different from what we had 
initially thought. We had thought it was going to be a matter of finding 
the places that used "attnum" and replacing it with either attnum, 
attlognum or attphysnum, depending on what order was necessary at any 
given spot. This wasn't going to be easy to do, because there are several 
hundred of those. So it was expected to be amazingly time-consuming and 
rather boring work.

This has nothing to do with reality: anywhere from parser down to 
optimizer and executor, the way things work is that a list of attributes 
is built, processed, and referenced. Some places assume that the list is 
in a certain order that's always the same order for those three cases. 
So the way to develop this feature is to change those places so that 
instead of receiving the list in one of these orders, they instead 
receive it in a different order.

So what I had to do early on was find a way to retrieve the sort order 
from catalogs, preserve it when TupleDescriptors are built, and ensure 
the attribute list is extracted from TupleDesc in the correct order. But 
it turned out that this is not enough, because down in the parser guts, 
a target list is constructed; and later, a TupleDescriptor is built from 
the target list. So it's necessary to preserve the sorting info from the 
original tuple descriptor into the target list (which means adding order 
info to Var and TargetEntry nodes), so that the new TupleDesc can also 
have it.

Today I'm finding that even more than that is necessary. It turns out 
that the RangeTableEntries (i.e. the entries in the FROM clause of a 
query) have an item dubbed "eref" which is a list of column names; due 
to my changes in the earlier parser stages, this list is sorted in 
logical column order; but the code to resolve things such as columns 
used in JOIN/ON clauses walks the list (which is in logical order) and 
then uses the number of times it had to walk the elements in the list to 
construct a Var (column reference) in "attnum" order -- so it finds a 
different column, and it all fails.

So what I'm doing now is modifying the RangeTableEntry node to keep a 
mapping list of logical to identity numbers. Then I'll have to search 
for places using rte->eref->colnames and make sure that they 
correctly use attlognum as the index into it.
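
For illustration, a minimal sketch of such a mapping (hypothetical
names; the real patch keeps this information in the TupleDesc and
RangeTableEntry):

/* Hypothetical mapping attached to a relation's descriptor.  attnum is
 * the identity (storage) number; attlognum is the user-visible display
 * position; both are 1-based, as in the catalogs. */
typedef struct AttnumMap
{
    int  natts;
    int *lognum_to_attnum;  /* indexed by logical position - 1 */
    int *attnum_to_lognum;  /* inverse, indexed by attnum - 1 */
} AttnumMap;

/* Resolve a column referenced by its logical position (e.g. an entry
 * in rte->eref->colnames walked in logical order) back to the attnum
 * that must go into the Var being built. */
static int
logical_to_attnum(const AttnumMap *map, int lognum)
{
    return map->lognum_to_attnum[lognum - 1];
}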

And then later:

First of all I should note that I discussed the approach mentioned above 
on pgsql-hackers and got a very interesting comment from Tom Lane that 
adding sorting info to Var and TargetEntry nodes was not a very good 
idea because it'd break stored rules whenever a table column changed. So 
I went back and studied that code and noticed that it was really the 
change in RangeTableEntry that's doing the good magic; those other 
changes are fortunately not necessary. (Though they were a necessary 
vehicle for me to understand how the other stuff works.)

I've been continuing to study the backend code looking for uses of 
attribute lists that assume a single ordering. As I get more into it, 
more complex cases appear. The number of cases is fortunately bounded, 
though. Most of the uses of straight attribute lists are in places that 
do not require modification, or require little work or thought to update 
correctly.

However, some other places are not like that. I have "fixed" SQL 
functions twice now; I believed the second fix (which I thought "mostly 
correct") would be the final one, but I found out just now that it's 
not, and the proper fix is going to require something a bit more 
low-level (namely, a projection step that reorders columns correctly 
after the fact). Fortunately, I believe that this extra projection step 
is going to fix a lot of other cases too, which I originally had no idea 
how to attack. Moreover, understanding that bit means I also figured out 
what Tom Lane meant in the second half of his response to my original 
pgsql-hackers comment. So I think we're good on that front.




-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Reducing tuple overhead

From
Alvaro Herrera
Date:
Thanks for posting this.

Joshua D. Drake wrote:

> First of all I should note that I discussed the approach mentioned above on
> pgsql-hackers and got a very interesting comment from Tom Lane that adding
> sorting info to Var and TargetEntry nodes was not a very good idea because
> it'd break stored rules whenever a table column changed. So I went back and
> studied that code and noticed that it was really the change in
> RangeTableEntry that's doing the good magic; those other changes are
> fortunately not necessary. (Though they were a necessary vehicle for me to
> understand how the other stuff works.)

So in the logical columns thread I was saying that this change of eref
didn't work either; it seems that most of the time (if not always) the
list should continue to be in attnum order, and the
logical-ordering-info data should be obtained from the tuple descriptor
and whatever needs to be sorted should be sorted afterwards, i.e. in
setrefs.c, using data previously stored in RelOptInfo.  I tried doing
that and ran into some other mess elsewhere.

> I've been continuing to study the backend code looking for uses of attribute
> lists that assume a single ordering. As I get more into it, more complex
> cases appear. The number of cases is fortunately bounded, though.

<he hopes, and eventually gives up>

> However, some other places are not like that. I have "fixed" SQL functions
> twice now; I believed the second fix (which I thought "mostly correct") would
> be the final one, but I found out just now that it's not, and the proper fix
> is going to require something a bit more low-level (namely, a projection step
> that reorders columns correctly after the fact). Fortunately, I believe that
> this extra projection step is going to fix a lot of other cases too, which I
> originally had no idea how to attack. Moreover, understanding that bit means
> I also figured out what Tom Lane meant in the second half of his response to
> my original pgsql-hackers comment. So I think we're good on that front.

I had forgotten my intention to rework things using projection.

In any case, I have posted a lot of patches now which others could
study, if there's interest.  I would sure like to see this split, and
there are multiple benefits such as reduction of padding space.  I'm
sure there's a nonempty set of guys brighter than me that could figure
it out in not that much time.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Reducing tuple overhead

From
Jim Nasby
Date:
On 4/23/15 11:45 AM, Petr Jelinek wrote:
> On 23/04/15 18:24, Andres Freund wrote:
>> Whether that's feasible complexity wise is debatable, but it's certainly
>> possible.
>>
>>
>> I do wonder what, in realistic cases, is actually the bigger contributor
>> to the overhead. The tuple header or the padding we liberally add in
>> many cases...
>>
>
> The logical ordering patch + auto optimizations of column layout on
> table creation/rewrite might help partially there.
>
> But what seems to be clear is that we need more in depth analysis of
> what really contributes most to the tuple size in various use-cases and
> then we can debate what we can do about it.

Also, what Robert posted was that while we started at something like 
15%-30% larger, we ended the test at 80% larger. That makes me think 
that the bigger win is not in reducing tuple size but tackling bloat.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Reducing tuple overhead

From
Amit Kapila
Date:
On Fri, Apr 24, 2015 at 1:03 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>
> On 4/23/15 11:45 AM, Petr Jelinek wrote:
>>
>> On 23/04/15 18:24, Andres Freund wrote:
>>>
>>> Whether that's feasible complexity wise is debatable, but it's certainly
>>> possible.
>>>
>>>
>>> I do wonder what, in realistic cases, is actually the bigger contributor
>>> to the overhead. The tuple header or the padding we liberally add in
>>> many cases...
>>>
>>
>> The logical ordering patch + auto optimizations of column layout on
>> table creation/rewrite might help partially there.
>>
>> But what seems to be clear is that we need more in depth analysis of
>> what really contributes most to the tuple size in various use-cases and
>> then we can debate what we can do about it.
>
>
> Also, what Robert posted was that while we started at something like 15%-30% larger, we ended the test at 80% larger. That makes me think that the bigger win is not in reducing tuple size but tackling bloat.
>

I agree with you, and I think one of the major reasons for bloat is that
the index segment doesn't have visibility information, which is why
clearing the index needs to be tied to the heap.  Now if we can move
transaction information to page level, then we can even think of having
it in the index segment as well, and then the index can delete/prune its
tuples on its own, which can reduce the bloat in the index significantly;
there is a benefit to vacuum as well.  Now this has some downsides too,
like delete needing to traverse the index segment as well to delete-mark
the tuples, but I think the upsides of reducing bloat can certainly
outweigh the downsides.

In short, moving transaction information to page level can not only
reduce the tuple size but can also help in reducing bloat.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Reducing tuple overhead

From
Jim Nasby
Date:
On 4/23/15 10:40 PM, Amit Kapila wrote:
> I agree with you and what I think one of the major reasons of bloat is that
> Index segment doesn't have visibility information due to which clearing of
> Index needs to be tied along with heap.  Now if we can move transaction
> information at page level, then we can even think of having it in Index
> segment as well and then Index can delete/prune it's tuples on it's own
> which can reduce the bloat in index significantly and there is a benefit
> to Vacuum as well.

I don't see how putting visibility at the page level helps indexes at 
all. We could already put XMIN in indexes if we wanted, but it won't 
help, because...

> Now this has some downsides as well like Delete
> needs to traverse Index segment as well to Delete mark the tuples, but
> I think the upsides of reducing bloat can certainly outweigh the downsides.

... which isn't possible. You cannot go from a heap tuple to an index 
tuple. This has been discussed in the past. If we could do that, then 
vacuum would become REALLY cheap compared to today.

BTW, before actually tackling anything we should try and get more data 
from Robert/Jan about where the extra 80% came from. We don't know if 
it's indexes or what.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Reducing tuple overhead

From
Amit Kapila
Date:
On Sat, Apr 25, 2015 at 1:58 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>
> On 4/23/15 10:40 PM, Amit Kapila wrote:
>>
>> I agree with you and what I think one of the major reasons of bloat is that
>> Index segment doesn't have visibility information due to which clearing of
>> Index needs to be tied along with heap.  Now if we can move transaction
>> information at page level, then we can even think of having it in Index
>> segment as well and then Index can delete/prune it's tuples on it's own
>> which can reduce the bloat in index significantly and there is a benefit
>> to Vacuum as well.
>
>
> I don't see how putting visibility at the page level helps indexes at all. We could already put XMIN in indexes if we wanted, but it won't help, because...
>

We can do that by putting transaction info at tuple level in the index
as well, but that would be a huge increase in the size of the index
unless we devise a way to have a variable index tuple header rather
than a fixed one.

>> Now this has some downsides as well like Delete
>> needs to traverse Index segment as well to Delete mark the tuples, but
>> I think the upsides of reducing bloat can certainly outweigh the downsides.
>
>
> ... which isn't possible. You can not go from a heap tuple to an index tuple.

We will have access to the index value during delete, so why do you
think that we need linkage between heap and index tuple to perform the
delete operation?  I think we need to think more about how to design
Delete .. by CTID, but that should be doable.

> This has been discussed in the past.

I have tried searching the archives, but I am not finding what the
exact problem is.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Reducing tuple overhead

From
Jim Nasby
Date:
On 4/25/15 12:12 AM, Amit Kapila wrote:
>  > ... which isn't possible. You can not go from a heap tuple to an
> index tuple.
>
> We will have the access to index value during delete, so why do you
> think that we need linkage between heap and index tuple to perform
> Delete operation?  I think we need to think more to design Delete
> .. by CTID, but that should be doable.

The problem with just having the value is that if *anything* changes 
between how you evaluated the value when you created the index tuple and 
when you evaluate it a second time you'll corrupt your index. This is 
actually an incredibly easy problem to have; witness how we allowed 
indexing timestamptz::date until very recently. That was clearly broken, 
but because we never attempted to re-run the index expression to do 
vacuuming at least we never corrupted the index itself.

>  > This has been discussed in the past.
>
> I have tried to search in archive, but not getting what is the exact
> problem.

Unfortunately I can't find prior discussion now either... :/
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Reducing tuple overhead

From
Robert Haas
Date:
On Mon, Apr 27, 2015 at 5:01 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> The problem with just having the value is that if *anything* changes between
> how you evaluated the value when you created the index tuple and when you
> evaluate it a second time you'll corrupt your index. This is actually an
> incredibly easy problem to have; witness how we allowed indexing
> timestamptz::date until very recently. That was clearly broken, but because
> we never attempted to re-run the index expression to do vacuuming at least
> we never corrupted the index itself.

True.  But I guess what I don't understand is: how big a deal is this,
really?  The "uncorrupted" index can still return wrong answers to
queries.  The fact that you won't end up with index entries pointing
to completely unrelated tuples is nice, but if index scans are missing
tuples that they should see, aren't you still pretty hosed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Reducing tuple overhead

From
Jim Nasby
Date:
On 4/29/15 12:18 PM, Robert Haas wrote:
> On Mon, Apr 27, 2015 at 5:01 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> The problem with just having the value is that if *anything* changes between
>> how you evaluated the value when you created the index tuple and when you
>> evaluate it a second time you'll corrupt your index. This is actually an
>> incredibly easy problem to have; witness how we allowed indexing
>> timestamptz::date until very recently. That was clearly broken, but because
>> we never attempted to re-run the index expression to do vacuuming at least
>> we never corrupted the index itself.
>
> True.  But I guess what I don't understand is: how big a deal is this,
> really?  The "uncorrupted" index can still return wrong answers to
> queries.  The fact that you won't end up with index entries pointing
> to completely unrelated tuples is nice, but if index scans are missing
> tuples that they should see, aren't you still pretty hosed?

Maybe, maybe not. You could argue it's better to miss some rows than 
have completely unrelated ones.

My recollection is there are other scenarios where this causes problems, 
but that's from several years ago and I wasn't able to find anything in 
a quick search of the archives. I've wondered the same in the past and 
Tom had reasons it was bad, but perhaps they're overstated.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Reducing tuple overhead

From
Amit Kapila
Date:
On Tue, Apr 28, 2015 at 2:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>
> On 4/25/15 12:12 AM, Amit Kapila wrote:
>>
>>  > ... which isn't possible. You can not go from a heap tuple to an
>> index tuple.
>>
>> We will have the access to index value during delete, so why do you
>> think that we need linkage between heap and index tuple to perform
>> Delete operation?  I think we need to think more to design Delete
>> .. by CTID, but that should be doable.
>
>
> The problem with just having the value is that if *anything* changes between how you evaluated the value when you created the index tuple and when you evaluate it a second time you'll corrupt your index. 
>

I think I am missing something here: when is this second
evaluation needed?  Basically, what I understand about index
insertion is that it evaluates the value to be inserted into the index
before calling the nbtree module, and then nbtree just inserts the
value/tuple passed to it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Reducing tuple overhead

From
Robert Haas
Date:
On Thu, Apr 30, 2015 at 12:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think I am missing something here, but when this second
> evaluation is needed.  Basically what I understand from index
> insertion is that it evaluates the value to be inserted in index
> before calling nbtree module and then nbtree just inserts the
> value/tuple passed to it.

Sure, but what happens if it doesn't evaluate to the same value?

Consider a tuple where a = 1 and a function f(a).  You insert the
tuple, evaluate f(a), and get 17.  So you insert an index tuple into
the btree with a value of 17, pointing at the tuple.  Now you delete
the tuple, evaluate f(a) again, and this time you get 42.  You search
the btree for an index tuple with a value of 42, and you don't find
one.  But the index tuple is still there.
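
That failure mode is easy to model in miniature (standalone C mirroring
the f(a) example above; the hidden state stands in for whatever a
mislabeled-IMMUTABLE function actually depends on):

#include <stdio.h>

/* Pretend this is an SQL function declared IMMUTABLE.  It actually
 * depends on hidden state, so two evaluations disagree. */
static int hidden_state = 16;

static int f(int a) { return a + hidden_state; }

int main(void)
{
    int a = 1;
    int key_at_insert = f(a);   /* 17: this key goes into the btree */
    hidden_state = 41;          /* the "immutable" input shifts */
    int key_at_delete = f(a);   /* 42: the search misses the entry */
    printf("inserted under %d, searched under %d -> dangling index tuple\n",
           key_at_insert, key_at_delete);
    return 0;
}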

With the current approach, that doesn't happen, because we effectively
search the entire index for tuples pointing at the heap tuple we're
trying to get rid of.  The only problem with that is that it's
crushingly expensive when the index is large and the number of tuples
we're cleaning out is comparatively small.  But that's a big problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Reducing tuple overhead

From
Simon Riggs
Date:
On 25 April 2015 at 01:12, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Apr 25, 2015 at 1:58 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>
> On 4/23/15 10:40 PM, Amit Kapila wrote:
>>
>> I agree with you and what I think one of the major reasons of bloat is that
>> Index segment doesn't have visibility information due to which clearing of
>> Index needs to be tied along with heap.  Now if we can move transaction
>> information at page level, then we can even think of having it in Index
>> segment as well and then Index can delete/prune it's tuples on it's own
>> which can reduce the bloat in index significantly and there is a benefit
>> to Vacuum as well.
>
>
> I don't see how putting visibility at the page level helps indexes at all. We could already put XMIN in indexes if we wanted, but it won't help, because...
>

We can do that by putting transaction info at tuple level in index as
well but that will be huge increase in size of index unless we devise
a way to have variable index tuple header rather than a fixed.  

>> Now this has some downsides as well like Delete
>> needs to traverse Index segment as well to Delete mark the tuples, but
>> I think the upsides of reducing bloat can certainly outweigh the downsides.
>
>
> ... which isn't possible. You can not go from a heap tuple to an index tuple.

We will have the access to index value during delete, so why do you
think that we need linkage between heap and index tuple to perform
Delete operation?  I think we need to think more to design Delete
.. by CTID, but that should be doable. 

I see some assumptions here that need to be challenged.

We can keep xmin and/or xmax on index entries. The above discussion assumes that the information needs to be updated synchronously. We already store visibility information on index entries using the lazily updated killtuple mechanism, so I don't see much problem in setting the xmin in a similar lazy manner. That way, when we use the index, if xmax is set we use it; if it is not, we check the heap. (And then you get to freeze indexes as well ;-( )
Anyway, I have no objection to making the index AM pass visibility information to indexes that wish to know it, as long as it is provided lazily.

The second assumption is that if we had visibility information in the index, it would make a difference to bloat. Since, as I mention, we already do have visibility information, I don't see that adding xmax or xmin would make any difference at all to bloat. So -1 to adding it **for that reason**.


A much better idea is to work out how to avoid index bloat at its cause. If we are running an UPDATE and we cannot get a cleanup lock, we give up and do a non-HOT update, causing the index to bloat. It seems better to wait for a short period to see if we can get the cleanup lock. The short period is currently 0, so let's start there and vary the duration of the wait upwards, proportionally as the index gets more bloated.
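
For illustration, a hedged C sketch of that retry policy (the function
name, the bloat_fraction parameter, and the scaling are all
hypothetical; ConditionalLockBufferForCleanup() and pg_usleep() are the
real primitives):

#include "postgres.h"
#include "storage/bufmgr.h"

/* Sketch: instead of immediately falling back to a non-HOT update when
 * the cleanup lock is unavailable, retry for a while, waiting longer
 * as the index gets more bloated. */
static bool
TryCleanupLockWithBackoff(Buffer buf, double bloat_fraction)
{
    /* today's behavior is equivalent to max_wait_us == 0 */
    long max_wait_us = (long) (bloat_fraction * 10000.0);
    long waited = 0;

    for (;;)
    {
        if (ConditionalLockBufferForCleanup(buf))
            return true;     /* got it: we can prune and stay HOT */
        if (waited >= max_wait_us)
            return false;    /* give up: caller does a non-HOT update */
        pg_usleep(100L);     /* brief sleep, then retry */
        waited += 100;
    }
}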

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Reducing tuple overhead

From
Robert Haas
Date:
On Thu, Apr 30, 2015 at 8:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> A much better idea is to work out how to avoid index bloat at cause. If we
> are running an UPDATE and we cannot get a cleanup lock, we give up and do a
> non-HOT update, causing the index to bloat. It seems better to wait for a
> short period to see if we can get the cleanup lock. The short period is
> currently 0, so lets start there and vary the duration of wait upwards
> proportionally as the index gets more bloated.

What I'd be worried about there is that it would be very hard to tune
the wait time, and that the operating system scheduling granularity
(10ms?) would be way too long.

But I'm in vigorous agreement with you on one point: the solution to
index bloat (and probably heap bloat, too) is not to clean it up
faster but to create less of it in the first place.  Making more
updates HOT is one way to do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Reducing tuple overhead

From
Amit Kapila
Date:
On Thu, Apr 30, 2015 at 5:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Apr 30, 2015 at 12:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think I am missing something here, but when this second
> > evaluation is needed.  Basically what I understand from index
> > insertion is that it evaluates the value to be inserted in index
> > before calling nbtree module and then nbtree just inserts the
> > value/tuple passed to it.
>
> Sure, but what happens if it doesn't evaluate to the same value?
>
> Consider a tuple where a = 1 and a function f(a).  You insert the
> tuple, evaluate f(a), and get 17.  So you insert an index tuple into
> the btree with a value of 17, pointing at the tuple.  Now you delete
> the tuple, evaluate f(a) again, and this time you get 42.

As the index expression contains table columns and all the functions
or operators used in the expression must be IMMUTABLE, won't that
guarantee that we avoid such a situation?

If not, then I think for such indexes we might need to search for the
tuple offset in the entire index and mark it as deleted, or maybe for
such indexes follow the current mechanism.  I think it will still give
us a lot of benefit in the more common cases.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Reducing tuple overhead

From
Robert Haas
Date:
On Thu, Apr 30, 2015 at 9:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> As the index expression contain table columns and all the functions
> or operators used in expression must be IMMUTABLE, won't that
> guarantee to avoid such a situation?

The concern is that they might be labeled as immutable but not
actually behave that way.

The other, related problem is that the ordering operator might start
to return different results than it did at index creation time.  For
example, consider a btree index built on a text column.  Now consider
'yum update'.  glibc gets updated, collation ordering of various
strings change, and now you've got tuples that are in the "wrong
place" in the index, because when the index was built, we thought A <
B, but now we think B < A.  You would think the glibc maintainers
might avoid such changes in minor releases, or that the Red Hat guys
would avoid packaging and shipping those changes in minor releases,
but you'd be wrong.  We've seen cases where the master and the standby
were both running RHEL X.Y (same X.Y on both servers) but they didn't
agree on the collation definitions, so queries that worked on the
master failed on the slave.

> If not, then I think for such indexes we might need to search
> for tupleoffset in the entire index and mark it as Delete or may be
> for such indexes follow the current mechanism.

I think that *is* the current mechanism.

> I think it will still
> give us lot of benefit in more common cases.

It's hard to figure out exactly what can work here.  Aside from
correctness issues, the biggest problem with refinding the index
tuples one-by-one and killing them is that it may suck for bulk
operations.  When you delete one tuple from the table, refinding the
index tuples and killing them immediately is probably the smartest
thing you can do.  If you postpone it, somebody may be forced to split
the page because it's full, whereas if you do it right away, the next
guy who wants to put a tuple on that page may be able to make it fit.
That's a big win.

But if you want to delete 10 million tuples, doing an index scan for
each one may induce a tremendous amount of random I/O on the index
pages, possibly visiting and dirtying the same pages more than once.
Scanning the whole index for tuples to kill avoids that.
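
One way to frame that tradeoff as a heuristic (a pure sketch; the
names and the cost constant are hypothetical):

#include <stdint.h>

typedef enum
{
    CLEANUP_REFIND_EACH,       /* descend the index once per dead tuple */
    CLEANUP_FULL_INDEX_SCAN    /* one sequential pass over the index */
} CleanupStrategy;

static CleanupStrategy
choose_index_cleanup(uint64_t tuples_to_remove, uint64_t index_pages)
{
    /*
     * Re-finding each index entry costs a few random page reads per
     * tuple; a full scan costs one sequential pass over every page.
     * Prefer per-tuple refinds only while their total cost stays well
     * under the cost of scanning the whole index.
     */
    if (tuples_to_remove * 3 < index_pages)
        return CLEANUP_REFIND_EACH;
    return CLEANUP_FULL_INDEX_SCAN;
}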

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Reducing tuple overhead

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Mon, Apr 27, 2015 at 5:01 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> > The problem with just having the value is that if *anything* changes between
> > how you evaluated the value when you created the index tuple and when you
> > evaluate it a second time you'll corrupt your index. This is actually an
> > incredibly easy problem to have; witness how we allowed indexing
> > timestamptz::date until very recently. That was clearly broken, but because
> > we never attempted to re-run the index expression to do vacuuming at least
> > we never corrupted the index itself.
> 
> True.  But I guess what I don't understand is: how big a deal is this,
> really?  The "uncorrupted" index can still return wrong answers to
> queries.  The fact that you won't end up with index entries pointing
> to completely unrelated tuples is nice, but if index scans are missing
> tuples that they should see, aren't you still pretty hosed?

As I recall, there's a desire not to run expressions when vacuuming,
because running them means getting a snapshot, in case any functions
look at database state; and you can't get a snapshot while vacuuming
because that makes the vacuum visible to concurrent processes; in
particular, snapshots keep all processes' xmin from moving forward.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Reducing tuple overhead

From
Amit Kapila
Date:
On Thu, Apr 30, 2015 at 7:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Apr 30, 2015 at 9:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > As the index expression contain table columns and all the functions
> > or operators used in expression must be IMMUTABLE, won't that
> > guarantee to avoid such a situation?
>
> The concern is that they might be labeled as immutable but not
> actually behave that way.
>
> The other, related problem is that the ordering operator might start
> to return different results than it did at index creation time.  For
> example, consider a btree index built on a text column.  Now consider
> 'yum update'.  glibc gets updated, collation ordering of various
> strings change, and now you've got tuples that are in the "wrong
> place" in the index, because when the index was built, we thought A <
> B, but now we think B < A.  You would think the glibc maintainers
> might avoid such changes in minor releases, or that the Red Hat guys
> would avoid packaging and shipping those changes in minor releases,
> but you'd be wrong.  We've seen cases where the master and the standby
> were both running RHEL X.Y (same X.Y on both servers) but they didn't
> agree on the collation definitions, so queries that worked on the
> master failed on the slave.
>

This is quite similar to an IMMUTABLE function, but for an operator.  I think
guaranteeing the immutable nature of a function is the job of the application/user. 
Oracle uses something similar (DETERMINISTIC functions) for function-based
indexes and asks the user to ensure the DETERMINISTIC property of the function.
I think having synchronous delete (delete from the index at the same time as
from the heap) is one of the main differentiating factors for not having bloat in
indexes in some of the other databases.  If we don't want to change the
current rules for functions or operators used in indexes, then we can have a new
type of index which can have visibility information, where users are advised to
use such an index only when they can ensure that the functions or operators for
that index are IMMUTABLE.  There is an argument that users might or
might not be able to ensure the same, but I think if it can be ensured by users
of other successful databases, then the same should be possible for PostgreSQL
users as well; after all, this can bring a lot of value to the table (avoiding index
bloat) for users.


> > I think it will still
> > give us lot of benefit in more common cases.
>
> It's hard to figure out exactly what can work here.  Aside from
> correctness issues, the biggest problem with refinding the index
> tuples one-by-one and killing them is that it may suck for bulk
> operations.  When you delete one tuple from the table, refinding the
> index tuples and killing them immediately is probably the smartest
> thing you can do.  If you postpone it, somebody may be forced to split
> the page because it's full, whereas if you do it right away, the next
> guy who wants to put a tuple on that page may be able to make it fit.
> That's a big win.
>

Exactly, I have that big win in my mind.

> But if you want to delete 10 million tuples, doing an index scan for
> each one may induce a tremendous amount of random I/O on the index
> pages, possibly visiting and dirtying the same pages more than once.
> Scanning the whole index for tuples to kill avoids that.
>

Even doing it only from the heap might not be cheap.
I think for doing a BULK delete (as you described), users of other
databases have some smart ways, like:
a. If possible, drop the indexes.
b. Instead of deleting the records you don't want, SELECT out the
records you do want, drop the old table, and rename the new table.
c. Archive instead of delete: instead of deleting, just set a flag to
mark a row as archived, invalid, deleted, etc. This avoids the
immediate overhead of index maintenance. Ignore the rows temporarily
and let some other job delete them later. The large downside to this
is that it affects any query on the table.
...
possibly some other ways.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Reducing tuple overhead

From
Amit Kapila
Date:
On Thu, Apr 30, 2015 at 5:35 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 25 April 2015 at 01:12, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Sat, Apr 25, 2015 at 1:58 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> >
>> > On 4/23/15 10:40 PM, Amit Kapila wrote:
>> >>
>> >> I agree with you and what I think one of the major reasons of bloat is that
>> >> Index segment doesn't have visibility information due to which clearing of
>> >> Index needs to be tied along with heap.  Now if we can move transaction
>> >> information at page level, then we can even think of having it in Index
>> >> segment as well and then Index can delete/prune it's tuples on it's own
>> >> which can reduce the bloat in index significantly and there is a benefit
>> >> to Vacuum as well.
>> >
>> >
>> > I don't see how putting visibility at the page level helps indexes at all. We could already put XMIN in indexes if we wanted, but it won't help, because...
>> >
>>
>> We can do that by putting transaction info at tuple level in index as
>> well but that will be huge increase in size of index unless we devise
>> a way to have variable index tuple header rather than a fixed.  
>>
>> >> Now this has some downsides as well like Delete
>> >> needs to traverse Index segment as well to Delete mark the tuples, but
>> >> I think the upsides of reducing bloat can certainly outweigh the downsides.
>> >
>> >
>> > ... which isn't possible. You can not go from a heap tuple to an index tuple.
>>
>> We will have the access to index value during delete, so why do you
>> think that we need linkage between heap and index tuple to perform
>> Delete operation?  I think we need to think more to design Delete
>> .. by CTID, but that should be doable.
>
>
> I see some assumptions here that need to be challenged.
>
> We can keep xmin and/or xmax on index entries. The above discussion assumes that the information needs to be updated synchronously. We already store visibility information on index entries using the lazily updated killtuple mechanism, so I don't see much problem in setting the xmin in a similar lazy manner. That way when we use the index if xmax is set we use it, if it is not we check the heap. (And then you get to freeze indexes as well ;-( )
> Anyway, I have no objection to making index AM pass visibility information to indexes that wish to know the information, as long as it is provided lazily.
>

Providing such information lazily can help to an extent, but I think
it won't help much with bloat reduction. For example, when an insert
tries to put a row on an index page and finds that there is no space,
it can't kill or overwrite any tuple that is actually dead unless the
visibility information has been updated lazily by that time, which is,
I think, one of the main reasons for index bloat.

> The second assumption is that if we had visibility information in the index that it would make a difference to bloat. Since as I mention, we already do have visibility information, I don't see that adding xmax or xmin would make any difference at all to bloat. So -1 to adding it **for that reason**.
>

The visibility information is only updated when such an index item
is accessed (lazy update), and by that time the new space for index
insertion would already have been used, whereas if the information is
provided synchronously the dead space could be reclaimed much earlier
(for insertions on a page which has dead tuples), which will reduce
the chances of bloat.

>
> A much better idea is to work out how to avoid index bloat at cause. If we are running an UPDATE and we cannot get a cleanup lock, we give up and do a non-HOT update, causing the index to bloat. It seems better to wait for a short period to see if we can get the cleanup lock. The short period is currently 0, so lets start there and vary the duration of wait upwards proportionally as the index gets more bloated.
>

I think this is a separate and also good way of avoiding the bloat,
but independently of this, having something like what we discussed
above would further reduce the chances of bloat for a non-HOT update
in the scenario described by you.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Reducing tuple overhead

From
Jim Nasby
Date:
On 4/30/15 7:37 AM, Robert Haas wrote:
> On Thu, Apr 30, 2015 at 8:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> A much better idea is to work out how to avoid index bloat at cause. If we
>> are running an UPDATE and we cannot get a cleanup lock, we give up and do a
>> non-HOT update, causing the index to bloat. It seems better to wait for a
>> short period to see if we can get the cleanup lock. The short period is
>> currently 0, so lets start there and vary the duration of wait upwards
>> proportionally as the index gets more bloated.

That only happens if there already wasn't enough space on the page so we 
need to defrag, yes? If there is enough space we can HOT update without 
the cleanup lock.

What would be useful to know is how often we abort a HOT update because 
of lack of free space; that would indicate to a DBA that a lower fill 
factor may be in order. What would be useful to -hackers would be stats 
on how often an update would have been HOT if only the page had been pruned.

> What I'd be worried about there is that it would be very hard to tune
> the wait time, and that the operating system scheduling granularity
> (10ms?) would be way too long.

[1] indicates between 0.75 and 6ms by default on Linux. I think FreeBSD 
still uses a 1000Hz scheduler (1ms), but it's not as clear.

What might be more promising is ways to avoid holding a pin for a long 
time (like the outer side of a nested loop), or being more aggressive 
about attempting the lock (IE: lower the threshold to trigger cleaning).

There's also an (in hindsight) questionable bit of logic in 
heap_page_prune_opt(): once we get the cleanup lock, we check the page's 
free space a second time. If we managed to actually get the lock, we 
should probably just clean the page anyway.
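
Paraphrasing that flow (a simplified sketch of the shape of the logic,
not the actual source of heap_page_prune_opt()):

#include "postgres.h"
#include "access/heapam.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

static void
prune_opt_sketch(Relation relation, Buffer buffer,
                 TransactionId OldestXmin, Size minfree)
{
    Page page = BufferGetPage(buffer);

    /* First free-space check, done before taking any lock. */
    if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
    {
        TransactionId ignore;

        if (!ConditionalLockBufferForCleanup(buffer))
            return;              /* lock unavailable: skip pruning */

        /*
         * The questionable second check: the cleanup lock is already
         * in hand, yet we may still decide not to prune.
         */
        if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
            heap_page_prune(relation, buffer, OldestXmin, true, &ignore);

        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    }
}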

> But I'm in vigorous agreement with you on one point: the solution to
> index bloat (and probably heap bloat, too) is not to clean it up
> faster but to create less of it in the first place.  Making more
> updates HOT is one way to do that.

+1.


1: 
http://stackoverflow.com/questions/16401294/how-to-know-linux-scheduler-time-slice
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Reducing tuple overhead

From
Amit Kapila
Date:
On Thu, Apr 23, 2015 at 9:54 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Split into a new thread, the other one is already growing fast
> enough. This discussion started at
> http://archives.postgresql.org/message-id/55391469.5010506%40iki.fi
>
> On April 23, 2015 6:48:57 PM GMT+03:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >Stop right there. You need to reserve enough space on the page to store
> >an xmax for *every* tuple on the page. Because if you don't, what are
> >you going to do when every tuple on the page is deleted by a different
> >transaction.
> >
> >Even if you store the xmax somewhere else than the page header, you need
> >to reserve the same amount of space for them, so it doesn't help at
> >all.
>
> Depends on how you do it and what you optimize for (disk space, runtime,
> code complexity...).  You can e.g. apply a somewhat similar trick to
> xmin/xmax as done to cmin/cmax; only that the data structure needs to be
> persistent.

Today, while reading how other databases (that store similar information
at page level) tackle this problem, I came across a link [1] which indicates
that this is controlled by some clauses (options) at table level.  The idea
seems to be to have some preallocated space in the page for this
information (the minimum and maximum sizes of which can be specified by
the user, with some default values, of course), and if a concurrent write
operation requires more space than that, the operation needs to wait
until space becomes available.

I am not sure if this is the best way, as it depends on how the
preallocated space for transaction information at page level is re-used,
but I still think it is worth considering.
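
For illustration, a sketch of what that could look like (entirely
hypothetical structures, loosely modeled on INITRANS/MAXTRANS-style
options):

#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical per-page transaction-slot area whose bounds come from
 * table-level options. */
typedef struct PageXactSlots
{
    uint16_t      nslots;      /* slots currently allocated on the page */
    uint16_t      max_slots;   /* ceiling from the table-level option */
    TransactionId slots[8];    /* fixed here for simplicity; variable in
                                * a real on-page layout */
} PageXactSlots;

/* Grab a slot for a writing transaction; -1 means the caller must wait
 * until a concurrent transaction finishes and frees one. */
static int
acquire_xact_slot(PageXactSlots *x, TransactionId xid)
{
    for (int i = 0; i < x->nslots; i++)
        if (x->slots[i] == 0)          /* a completed xact freed this */
        {
            x->slots[i] = xid;
            return i;
        }
    if (x->nslots < x->max_slots)      /* grow within the configured bound */
    {
        x->slots[x->nslots] = xid;
        return x->nslots++;
    }
    return -1;                         /* out of slot space: wait */
}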


Re: Reducing tuple overhead

From
Peter Geoghegan
Date:
On Thu, Apr 30, 2015 at 6:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The other, related problem is that the ordering operator might start
> to return different results than it did at index creation time.  For
> example, consider a btree index built on a text column.  Now consider
> 'yum update'.  glibc gets updated, collation ordering of various
> strings change, and now you've got tuples that are in the "wrong
> place" in the index, because when the index was built, we thought A <
> B, but now we think B < A.  You would think the glibc maintainers
> might avoid such changes in minor releases, or that the Red Hat guys
> would avoid packaging and shipping those changes in minor releases,
> but you'd be wrong.

I would not think that. Unicode Technical Standard #10 states:

"""
Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as
more information becomes available about languages; there may be new
government or industry standards for the language that require
changes; and finally, new characters added to the Unicode Standard
will interleave with the previously-defined ones. This means that
collations must be carefully versioned.
"""

Also, in the paper "Modern B-Tree Techniques", by Goetz Graefe, page
238, it states:

"""
In many operating systems, appropriate functions are provided to
compute a normalized key from a localized string value, date value, or
time value. This functionality is used, for example, to list files in
a directory as appropriate for the local language. Adding
normalization for numeric data types is relatively straightforward, as
is concatenation of multiple normalized values. Database code must not
rely on such operating system code, however. The problem with relying
on operating systems support for database indexes is the update
frequency. An operating system might update its normalization code due
to an error or extension in the code or in the definition of a local
sort order; it is unacceptable, however, if such an update silently
renders existing large database indexes incorrect.
"""

Unfortunately, it is simply not the case that we can rely on OS
collations being immutable. We have *no* contract with any C standard
library concerning collation stability whatsoever. I'm surprised that
we don't hear more about this kind of corruption.
-- 
Peter Geoghegan



Re: Reducing tuple overhead

From
Simon Riggs
Date:
On 23 April 2015 at 17:24, Andres Freund <andres@anarazel.de> wrote:
Split into a new thread, the other one is already growing fast
enough. This discussion started at
http://archives.postgresql.org/message-id/55391469.5010506%40iki.fi

On April 23, 2015 6:48:57 PM GMT+03:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>Stop right there. You need to reserve enough space on the page to store
>an xmax for *every* tuple on the page. Because if you don't, what are
>you going to do when every tuple on the page is deleted by a different
>transaction.
>
>Even if you store the xmax somewhere else than the page header, you need
>to reserve the same amount of space for them, so it doesn't help at
>all.

Depends on how you do it and what you optimize for (disk space, runtime,
code complexity...).  You can e.g. apply a somewhat similar trick to
xmin/xmax as done to cmin/cmax; only that the data structure needs to be
persistent.

In fact, we already have combocid like structure for xids that's
persistent - multixacts. We could just have one xid saved that's either
xmin or xmax (indicated by bits) or a multixact.  When a tuple is
updated/deleted whose xmin is still required we could replace the former
xmin with a multixact, otherwise just change the flag that it's now a
xmax without a xmin.  To check visibility and if the xid is a multixact
we'd just have to look for the relevant member for the actual xmin and
xmax.

To avoid excessive overhead when a tuple is repeatedly updated within one
session, we could store some of the data in the combocid entry that we
need anyway in that case.

Whether that's feasible complexity wise is debatable, but it's certainly
possible.


I do wonder what, in realistic cases, is actually the bigger contributor
to the overhead. The tuple header or the padding we liberally add in
many cases...

It's hard to see how to save space there without reference to a specific use case. I see different solutions depending upon whether we assume a low number of transactions or a high number of transactions.

A case we could optimize for is insert-mostly tables. But in that case if you get rid of the xmax then you still have to freeze the tuples later.

I would have thought a better optimization would be to use the xmax for the xid epoch by default, so that such rows would never need freezing. Then at least we are using the xmax for something useful in a larger insert-mostly database rather than just leaving it at zero.
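
A rough sketch of that layout idea (hypothetical; today PostgreSQL
tracks the xid epoch cluster-wide rather than per row):

#include <stdint.h>

typedef uint32_t TransactionId;

/* Sketch: on an insert-mostly table, reuse the otherwise-zero xmax
 * field to record the xid epoch of xmin.  A 64-bit "full" xid built
 * from (epoch, xmin) never wraps around, so the row never needs to be
 * frozen for wraparound protection. */
static uint64_t
full_xid(TransactionId xmin, TransactionId epoch_stored_in_xmax)
{
    return ((uint64_t) epoch_stored_in_xmax << 32) | xmin;
}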

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Reducing tuple overhead

From
Amit Kapila
Date:
On Sun, Jun 7, 2015 at 3:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 23 April 2015 at 17:24, Andres Freund <andres@anarazel.de> wrote:
>>
>
>
> It's hard to see how to save space there without reference to a specific use case. I see different solutions depending upon whether we assume a low number of transactions or a high number of transactions.
>

I have tried to check theoretically how much difference such a
change could give us.

Assuming BLCKSZ = 8192 bytes, a page header of 24 bytes, a line
pointer of 4 bytes per tuple, and an average tuple of 150 bytes,
roughly 53 tuples can be accommodated on one page.  Each of these
tuples contains 12 bytes of transaction information (xmin, xmax,
cid/combocid).  Now, considering that in an average workload 4
transactions operate on a page at the same time (for a workload like
pgbench tpc-b it shouldn't be more, otherwise it would have been
visible in perf reports), 4 additional tuples [1] could be
accommodated on a page, which is approximately a 7% savings in space
(which in turn means that much less I/O).  This gain will vary with
the tuple size, the number of transactions that operate on a page,
the page size, etc.

Some additional benefits that I could see from such a change:

1. I think we won't need to traverse the whole page while freezing,
so there should be some savings in the freeze operation as well.

2. With this, we might also be able to reduce WAL, if we can avoid
writing some transaction-related info in the cases where it is
currently stored (update/delete).

3. Another gain could come if we want to add transaction information
to the index segment as well, because if such information can be
stored at page level, then there won't be much impact in adding it
there, which will help us avoid multiple passes of vacuum (heap and
index could be vacuumed separately, which will definitely help with
I/O and bloat reduction).

[1]
Calculation for 4 additional tuples:
(savings from removing transaction info from tuples
 - new page-level space consumed by transactions) / new tuple size
= (53 * 12 - 12 * 4) / (150 - 12) = 4.26
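
As a standalone check of that arithmetic (plain C; the 150-byte
average tuple, 12 bytes of per-tuple transaction info, and 4
concurrent transactions per page are the assumptions stated above):

#include <stdio.h>

int main(void)
{
    int blcksz = 8192, page_header = 24, line_pointer = 4;
    int tuple = 150, xact_info = 12, xacts_per_page = 4;

    /* tuples that fit today */
    int tuples_now = (blcksz - page_header) / (tuple + line_pointer);

    /* bytes freed per page minus the page-level slots, over new size */
    double extra = (double) (tuples_now * xact_info
                             - xacts_per_page * xact_info)
                   / (tuple - xact_info);

    printf("tuples/page today: %d\n", tuples_now);   /* 53 */
    printf("extra tuples/page: %.2f\n", extra);      /* 4.26 */
    printf("space savings: ~%.0f%%\n",
           100.0 * extra / (tuples_now + extra));    /* ~7 percent */
    return 0;
}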

With Regards,
Amit Kapila.