Thread: Freezing without write I/O

Freezing without write I/O

From: Heikki Linnakangas
Since we're bashing around ideas around freezing, let me write down the 
idea I've been pondering and discussing with various people for years. I 
don't think I invented this myself, apologies to whoever did for not 
giving credit.

The reason we have to freeze is that otherwise our 32-bit XIDs wrap 
around and become ambiguous. The obvious solution is to extend XIDs to 
64 bits, but that would waste a lot of space. The trick is to add a field 
to the page header indicating the 'epoch' of the XID, while keeping the 
XIDs in tuple header 32-bit wide (*).

The other reason we freeze is to truncate the clog. But with 64-bit 
XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid. 
Instead, we could implicitly treat anything older than relfrozenxid as 
frozen.

That's the basic idea. Vacuum freeze only needs to remove dead tuples, 
but doesn't need to dirty pages that contain no dead tuples.

Since we're not storing 64-bit wide XIDs on every tuple, we'd still need 
to replace the XIDs with FrozenXid whenever the difference between the 
smallest and largest XID on a page exceeds 2^31. But that would only 
happen when you're updating the page, in which case the page is dirtied 
anyway, so it wouldn't cause any extra I/O.
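
For illustration, a minimal sketch of that check (the function name and 
exact comparison are illustrative assumptions, not actual PostgreSQL code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: must the oldest XIDs on this page be replaced with
 * FrozenXid before stamping a newer XID on it?  With 32-bit XIDs, the
 * spread between the smallest and largest XID on a page has to stay below
 * 2^31 for the circular comparisons to remain unambiguous.
 */
static bool
page_needs_freeze(uint32_t oldest_xid_on_page, uint32_t new_xid)
{
    /* unsigned subtraction gives the forward distance modulo 2^32 */
    return (new_xid - oldest_xid_on_page) >= (UINT32_C(1) << 31);
}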

This would also be the first step in allowing the clog to grow larger 
than 2 billion transactions, eliminating the need for anti-wraparound 
freezing altogether. You'd still want to truncate the clog eventually, 
but it would be nice to not be pressed against the wall with "run vacuum 
freeze now, or the system will shut down".

(*) "Adding an epoch" is inaccurate, but I like to use that as my mental 
model. If you just add a 32-bit epoch field, then you cannot have xids 
from different epochs on the page, which would be a problem. In reality, 
you would store one 64-bit XID value in the page header, and use that as 
the "reference point" for all the 32-bit XIDs on the tuples. See 
existing convert_txid() function for how that works. Another method is 
to store the 32-bit xid values in tuple headers as offsets from the 
per-page 64-bit value, but then you'd always need to have the 64-bit 
value at hand when interpreting the XIDs, even if they're all recent.
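
To illustrate the first variant, a small sketch assuming a hypothetical 
64-bit reference XID stored in the page header (the helper below is not a 
real PostgreSQL API):

#include <stdint.h>

/*
 * Recover a tuple's full 64-bit XID from the page's 64-bit reference
 * point, in the spirit of convert_txid().  Assumes every XID on the page
 * is within 2^31 of the reference value, so the signed 32-bit distance
 * is unambiguous.
 */
static uint64_t
interpret_xid(uint64_t page_reference_xid, uint32_t tuple_xid)
{
    int32_t diff = (int32_t) (tuple_xid - (uint32_t) page_reference_xid);

    return page_reference_xid + (int64_t) diff;
}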

- Heikki



Re: Freezing without write I/O

From: Josh Berkus
Heikki,

This sounds a lot like my idea for 9.3, which didn't go anywhere.
You've worked out the issues I couldn't, I think.

> Another method is
> to store the 32-bit xid values in tuple headers as offsets from the
> per-page 64-bit value, but then you'd always need to have the 64-bit
> value at hand when interpreting the XIDs, even if they're all recent.

Yeah, -1 on the latter, not least because it would require a 100%
rewrite of the tables in order to upgrade.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Freezing without write I/O

From: Robert Haas
On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
> and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
> that would waste a lot of space. The trick is to add a field to the page header
> indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
> 32-bit wide (*).

Check.

> The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
> wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
> could implicitly treat anything older than relfrozenxid as frozen.

Check.

> That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
> doesn't need to dirty pages that contain no dead tuples.

Check.

> Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
> replace the XIDs with FrozenXid whenever the difference between the smallest
> and largest XID on a page exceeds 2^31. But that would only happen when
> you're updating the page, in which case the page is dirtied anyway, so it
> wouldn't cause any extra I/O.

It would cause some extra WAL activity, but it wouldn't dirty the page
an extra time.

> This would also be the first step in allowing the clog to grow larger than 2
> billion transactions, eliminating the need for anti-wraparound freezing
> altogether. You'd still want to truncate the clog eventually, but it would
> be nice to not be pressed against the wall with "run vacuum freeze now, or
> the system will shut down".

Interesting.  That seems like a major advantage.

> (*) "Adding an epoch" is inaccurate, but I like to use that as my mental
> model. If you just add a 32-bit epoch field, then you cannot have xids from
> different epochs on the page, which would be a problem. In reality, you
> would store one 64-bit XID value in the page header, and use that as the
> "reference point" for all the 32-bit XIDs on the tuples. See existing
> convert_txid() function for how that works. Another method is to store the
> 32-bit xid values in tuple headers as offsets from the per-page 64-bit
> value, but then you'd always need to have the 64-bit value at hand when
> interpreting the XIDs, even if they're all recent.

As I see it, the main downsides of this approach are:

(1) It breaks binary compatibility (unless you do something to
provide for it, like putting the epoch in the special space).

(2) It consumes 8 bytes per page.  I think it would be possible to get
this down to say 5 bytes per page pretty easily; we'd simply decide
that the low-order 3 bytes of the reference XID must always be 0.
Possibly you could even do with 4 bytes, or 4 bytes plus some number
of extra bits.

(3) You still need to periodically scan the entire relation, or else
have a freeze map as Simon and Josh suggested.

The upsides of this approach as compared with what Andres and I are
proposing are:

(1) It provides a stepping stone towards allowing indefinite expansion
of CLOG, which is quite appealing as an alternative to a hard
shut-down.

(2) It doesn't place any particular requirements on PD_ALL_VISIBLE.  I
don't personally find this much of a benefit as I want to keep
PD_ALL_VISIBLE, but I know Jeff and perhaps others disagree.

Random thought: Could you compute the reference XID based on the page
LSN?  That would eliminate the storage overhead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From: Merlin Moncure
On Thu, May 30, 2013 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
>> and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
>> that would waste a lot of space. The trick is to add a field to the page header
>> indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
>> 32-bit wide (*).
>
> (3) You still need to periodically scan the entire relation, or else
> have a freeze map as Simon and Josh suggested.

Why is this scan required?

Also, what happens if you delete a tuple on a page when another tuple
on the same page, with age > 2^32, belongs to a still-open transaction?

merlin



Re: Freezing without write I/O

From: Heikki Linnakangas
On 30.05.2013 21:46, Merlin Moncure wrote:
> On Thu, May 30, 2013 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
>>> and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
>>> that would waste a lot of space. The trick is to add a field to the page header
>>> indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
>>> 32-bit wide (*).
>>
>> (3) You still need to periodically scan the entire relation, or else
>> have a freeze map as Simon and Josh suggested.
>
> Why is this scan required?

To find all the dead tuples and remove them, and advance relfrozenxid. 
That in turn is required so that you can truncate the clog. This scheme 
relies on assuming that everything older than relfrozenxid committed, so 
if there are any aborted XIDs present in the table, you can't advance 
relfrozenxid past them.

Come to think of it, if there are no aborted XIDs in a range of XIDs, 
only commits, then you could just advance relfrozenxid past that range 
and truncate away the clog, without scanning the table. But that's quite 
a special case - generally there would be at least a few aborted XIDs - 
so it's probably not worth adding any special code to take advantage of 
that.

> Also, what happens if you delete a tuple on a page when another tuple
> on the same page, with age > 2^32, belongs to a still-open transaction?

Can't let that happen. Same as today.

- Heikki



Re: Freezing without write I/O

From: Andres Freund
On 2013-05-30 14:39:46 -0400, Robert Haas wrote:
> > Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
> > replace the XIDs with FrozenXid whenever the difference between the smallest
> > and largest XID on a page exceeds 2^31. But that would only happen when
> > you're updating the page, in which case the page is dirtied anyway, so it
> > wouldn't cause any extra I/O.
> 
> It would cause some extra WAL activity, but it wouldn't dirty the page
> an extra time.

You probably could do it similarly to how we currently do
XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The
insert/update/delete record will already contain a FPI if necessary, so
that should be safe.

> > This would also be the first step in allowing the clog to grow larger than 2
> > billion transactions, eliminating the need for anti-wraparound freezing
> > altogether. You'd still want to truncate the clog eventually, but it would
> > be nice to not be pressed against the wall with "run vacuum freeze now, or
> > the system will shut down".

> Interesting.  That seems like a major advantage.

Hm. Why? If freezing gets notably cheaper I don't really see much need
for keeping that much clog around? If we still run into anti-wraparound
areas, there has to be some major operational problem.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From: Bruce Momjian
On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote:
> This would also be the first step in allowing the clog to grow
> larger than 2 billion transactions, eliminating the need for
> anti-wraparound freezing altogether. You'd still want to truncate
> the clog eventually, but it would be nice to not be pressed against
> the wall with "run vacuum freeze now, or the system will shut down".

Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog
truncation.  Does this help that?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Freezing without write I/O

From: Robert Haas
On Thu, May 30, 2013 at 3:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-05-30 14:39:46 -0400, Robert Haas wrote:
>> > Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
>> > replace the XIDs with FrozenXid whenever the difference between the smallest
>> > and largest XID on a page exceeds 2^31. But that would only happen when
>> > you're updating the page, in which case the page is dirtied anyway, so it
>> > wouldn't cause any extra I/O.
>>
>> It would cause some extra WAL activity, but it wouldn't dirty the page
>> an extra time.
>
> You probably could do it similarly to how we currently do
> XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The
> insert/update/delete record will already contain a FPI if necessary, so
> that should be safe.

Ah, good point.

>> > This would also be the first step in allowing the clog to grow larger than 2
>> > billion transactions, eliminating the need for anti-wraparound freezing
>> > altogether. You'd still want to truncate the clog eventually, but it would
>> > be nice to not be pressed against the wall with "run vacuum freeze now, or
>> > the system will shut down".
>
>> Interesting.  That seems like a major advantage.
>
> Hm. Why? If freezing gets notably cheaper I don't really see much need
> for keeping that much clog around? If we still run into anti-wraparound
> areas, there has to be some major operational problem.

That is true, but we have a decent number of customers who do in fact
have such problems.  I think that's only going to get more common.  As
hardware gets faster and PostgreSQL improves, people are going to
process more and more transactions in shorter and shorter periods of
time.  Heikki's benchmark results for the XLOG scaling patch show
rates of >80,000 tps.  Even at a more modest 10,000 tps, with default
settings, you'll do anti-wraparound vacuums of the entire cluster
about every 8 hours.  That's not fun.

Being able to do such scans only of the not-all-visible pages would be
a huge step forward, of course.  But not having to do them on any
particular deadline would be a whole lot better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From: Robert Haas
On Thu, May 30, 2013 at 2:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Random thought: Could you compute the reference XID based on the page
> LSN?  That would eliminate the storage overhead.

After mulling this over a bit, I think this is definitely possible.
We begin a new "half-epoch" every 2 billion transactions.  We remember
the LSN at which the current half-epoch began and the LSN at which the
previous half-epoch began.  When a new half-epoch begins, the first
backend that wants to stamp a tuple with an XID from the new
half-epoch must first emit a "new half-epoch" WAL record, which
becomes the starting LSN for the new half-epoch.

We define a new page-level bit, something like PD_RECENTLY_FROZEN.
When this bit is set, it means there are no unfrozen tuples on the
page with XIDs that predate the current half-epoch.  Whenever we know
this to be true, we set the bit.  If the page LSN crosses more than
one half-epoch boundary at a time, we freeze the page and set the bit.
If the page LSN crosses exactly one half-epoch boundary, then (1) if
the bit is set, we clear it and (2) if the bit is not set, we freeze
the page and set the bit.  The advantage of this is that we avoid an
epidemic of freezing right after a half-epoch change.  Immediately
after a half-epoch change, many pages will mix tuples from the current
and previous half-epoch - but relatively few pages will have tuples
from the current half-epoch and a half-epoch more than one in the
past.
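
A sketch of that decision logic, assuming the system remembers the
starting LSNs of the current and previous half-epochs (all names here are
hypothetical, not actual code):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* starting LSNs of the current and previous half-epochs (assumed state) */
static XLogRecPtr current_halfepoch_lsn;
static XLogRecPtr previous_halfepoch_lsn;

static void
check_freeze_on_modify(XLogRecPtr page_lsn, bool *pd_recently_frozen,
                       void (*freeze_page)(void))
{
    if (page_lsn >= current_halfepoch_lsn)
        return;                          /* no boundary crossed */

    if (page_lsn >= previous_halfepoch_lsn)
    {
        /* exactly one boundary crossed */
        if (*pd_recently_frozen)
            *pd_recently_frozen = false; /* (1) bit set: just clear it */
        else
        {
            freeze_page();               /* (2) bit clear: freeze, set bit */
            *pd_recently_frozen = true;
        }
    }
    else
    {
        /* more than one boundary crossed: freeze unconditionally */
        freeze_page();
        *pd_recently_frozen = true;
    }
}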

As things stand today, we really only need to remember the last two
half-epoch boundaries; they could be stored, for example, in the
control file.  But if we someday generalize CLOG to allow indefinite
retention as you suggest, we could instead remember all half-epoch
boundaries that have ever occurred; just maintain a file someplace
with 8 bytes of data for every 2 billion XIDs consumed over the
lifetime of the cluster.  In fact, we might want to do it that way
anyhow, just to keep our options open, and perhaps for forensics.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From: Heikki Linnakangas
On 31.05.2013 00:06, Bruce Momjian wrote:
> On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote:
>> This would also be the first step in allowing the clog to grow
>> larger than 2 billion transactions, eliminating the need for
>> anti-wraparound freezing altogether. You'd still want to truncate
>> the clog eventually, but it would be nice to not be pressed against
>> the wall with "run vacuum freeze now, or the system will shut down".
>
> Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog
> truncation.  Does this help that?

No. If you want to keep autovacuum_freeze_max_age set at less than 2 
billion, you don't need support for more than 2 billion transactions. 
But for those who would like to set autovacuum_freeze_max_age higher 
than 2B, it would be nice to allow it.

Actually, even with autovacuum_freeze_max_age = 200M, it would be nice 
to not have the hard stop at 2 billion, in case autovacuum falls behind 
really badly. With autovacuum_freeze_max_age = 200M, there's a lot of 
safety margin, but with 1000M or so, not so much.

- Heikki



Re: Freezing without write I/O

From: Bruce Momjian
On Thu, May 30, 2013 at 10:04:23PM -0400, Robert Haas wrote:
> > Hm. Why? If freezing gets notably cheaper I don't really see much need
> > for keeping that much clog around? If we still run into anti-wraparound
> > areas, there has to be some major operational problem.
> 
> That is true, but we have a decent number of customers who do in fact
> have such problems.  I think that's only going to get more common.  As
> hardware gets faster and PostgreSQL improves, people are going to
> process more and more transactions in shorter and shorter periods of
> time.  Heikki's benchmark results for the XLOG scaling patch show
> rates of >80,000 tps.  Even at a more modest 10,000 tps, with default
> settings, you'll do anti-wraparound vacuums of the entire cluster
> about every 8 hours.  That's not fun.

Are you assuming these are all write transactions, hence consuming xids?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Freezing without write I/O

From: Robert Haas
On Fri, May 31, 2013 at 1:26 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Thu, May 30, 2013 at 10:04:23PM -0400, Robert Haas wrote:
>> > Hm. Why? If freezing gets notably cheaper I don't really see much need
>> > for keeping that much clog around? If we still run into anti-wraparound
>> > areas, there has to be some major operational problem.
>>
>> That is true, but we have a decent number of customers who do in fact
>> have such problems.  I think that's only going to get more common.  As
>> hardware gets faster and PostgreSQL improves, people are going to
>> process more and more transactions in shorter and shorter periods of
>> time.  Heikki's benchmark results for the XLOG scaling patch show
>> rates of >80,000 tps.  Even at a more modest 10,000 tps, with default
>> settings, you'll do anti-wraparound vacuums of the entire cluster
>> about every 8 hours.  That's not fun.
>
> Are you assuming these are all write transactions, hence consuming xids?

Well, there might be read-only transactions as well, but the point is
about how many write transactions there can be.  10,000 tps or more is
not out of the question even today, and progressively higher numbers
are only going to become more and more common.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From: Simon Riggs
On 30 May 2013 14:33, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Since we're bashing around ideas around freezing, let me write down the idea
> I've been pondering and discussing with various people for years. I don't
> think I invented this myself, apologies to whoever did for not giving
> credit.
>
> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
> and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
> that would waste a lot of space. The trick is to add a field to the page header
> indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
> 32-bit wide (*).
>
> The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
> wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
> could implicitly treat anything older than relfrozenxid as frozen.
>
> That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
> doesn't need to dirty pages that contain no dead tuples.

I have to say this is pretty spooky. I'd not read hackers all week, so
I had no idea so many other people were thinking about freezing as
well. This idea is damn near identical to what I've suggested. My
suggestion came about because I was looking to remove fields from the
tuple header, which didn't come to much. The good news is that this is
complete chance, so it must mean we're on the right track.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From: Simon Riggs
On 30 May 2013 19:39, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
>> and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
>> that would waste a lot of space. The trick is to add a field to the page header
>> indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
>> 32-bit wide (*).
>
> Check.
>
>> The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
>> wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
>> could implicitly treat anything older than relfrozenxid as frozen.
>
> Check.
>
>> That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
>> doesn't need to dirty pages that contain no dead tuples.
>
> Check.

Yes, this is the critical point. Large insert-only tables don't need
to be completely re-written twice.


>> Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
>> replace the XIDs with FrozenXid whenever the difference between the smallest
>> and largest XID on a page exceeds 2^31. But that would only happen when
>> you're updating the page, in which case the page is dirtied anyway, so it
>> wouldn't cause any extra I/O.
>
> It would cause some extra WAL activity, but it wouldn't dirty the page
> an extra time.
>
>> This would also be the first step in allowing the clog to grow larger than 2
>> billion transactions, eliminating the need for anti-wraparound freezing
>> altogether. You'd still want to truncate the clog eventually, but it would
>> be nice to not be pressed against the wall with "run vacuum freeze now, or
>> the system will shut down".
>
> Interesting.  That seems like a major advantage.
>
>> (*) "Adding an epoch" is inaccurate, but I like to use that as my mental
>> model. If you just add a 32-bit epoch field, then you cannot have xids from
>> different epochs on the page, which would be a problem. In reality, you
>> would store one 64-bit XID value in the page header, and use that as the
>> "reference point" for all the 32-bit XIDs on the tuples. See existing
>> convert_txid() function for how that works. Another method is to store the
>> 32-bit xid values in tuple headers as offsets from the per-page 64-bit
>> value, but then you'd always need to have the 64-bit value at hand when
>> interpreting the XIDs, even if they're all recent.
>
> As I see it, the main downsides of this approach are:
>
> (1) It breaks binary compatibility (unless you do something to
> provide for it, like putting the epoch in the special space).
>
> (2) It consumes 8 bytes per page.  I think it would be possible to get
> this down to say 5 bytes per page pretty easily; we'd simply decide
> that the low-order 3 bytes of the reference XID must always be 0.
> Possibly you could even do with 4 bytes, or 4 bytes plus some number
> of extra bits.

Yes, the idea of having a "base Xid" on every page is complicated and
breaks compatibility. The same idea can work well if we do it via tuple
headers.


> (3) You still need to periodically scan the entire relation, or else
> have a freeze map as Simon and Josh suggested.

I don't think that is needed with this approach.

(The freeze map was Andres' idea, not mine. I just accepted it as what
I thought was the only way forward. Now I see other ways.)

> The upsides of this approach as compared with what Andres and I are
> proposing are:
>
> (1) It provides a stepping stone towards allowing indefinite expansion
> of CLOG, which is quite appealing as an alternative to a hard
> shut-down.

I would be against expansion of the CLOG beyond its current size. If
we have removed all aborted rows and marked hints, then we don't need
the CLOG values and can trim that down.

I don't mind the hints, it's the freezing we don't need.


>> convert_txid() function for how that works. Another method is to store the
>> 32-bit xid values in tuple headers as offsets from the per-page 64-bit
>> value, but then you'd always need to have the 64-bit value at hand when
>> interpreting the XIDs, even if they're all recent.

You've touched here on the idea of putting the epoch in the tuple
header, which is where this and what I posted come together. We don't
need anything at page level; we just need something on each tuple.

Please can you look at my recent post on how to put this in the tuple header?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From: Heikki Linnakangas
On 31.05.2013 06:02, Robert Haas wrote:
> On Thu, May 30, 2013 at 2:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Random thought: Could you compute the reference XID based on the page
>> LSN?  That would eliminate the storage overhead.
>
> After mulling this over a bit, I think this is definitely possible.
> We begin a new "half-epoch" every 2 billion transactions.  We remember
> the LSN at which the current half-epoch began and the LSN at which the
> previous half-epoch began.  When a new half-epoch begins, the first
> backend that wants to stamp a tuple with an XID from the new
> half-epoch must first emit a "new half-epoch" WAL record, which
> becomes the starting LSN for the new half-epoch.

Clever! Pages in unlogged tables need some extra treatment, as they 
don't normally have a valid LSN, but that shouldn't be too hard.

> We define a new page-level bit, something like PD_RECENTLY_FROZEN.
> When this bit is set, it means there are no unfrozen tuples on the
> page with XIDs that predate the current half-epoch.  Whenever we know
> this to be true, we set the bit.  If the page LSN crosses more than
> one half-epoch boundary at a time, we freeze the page and set the bit.
>   If the page LSN crosses exactly one half-epoch boundary, then (1) if
> the bit is set, we clear it and (2) if the bit is not set, we freeze
> the page and set the bit.

Yep, I think that would work. Want to write the patch, or should I? ;-)

- Heikki



Re: Freezing without write I/O

From: Simon Riggs
On 1 June 2013 19:48, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 31.05.2013 06:02, Robert Haas wrote:
>>
>> On Thu, May 30, 2013 at 2:39 PM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>>>
>>> Random thought: Could you compute the reference XID based on the page
>>> LSN?  That would eliminate the storage overhead.
>>
>>
>> After mulling this over a bit, I think this is definitely possible.
>> We begin a new "half-epoch" every 2 billion transactions.  We remember
>> the LSN at which the current half-epoch began and the LSN at which the
>> previous half-epoch began.  When a new half-epoch begins, the first
>> backend that wants to stamp a tuple with an XID from the new
>> half-epoch must first emit a "new half-epoch" WAL record, which
>> becomes the starting LSN for the new half-epoch.
>
>
> Clever! Pages in unlogged tables need some extra treatment, as they don't
> normally have a valid LSN, but that shouldn't be too hard.

I like the idea of using the LSN to indicate the epoch. It saves any
other work we might consider, such as setting page or tuple level
epochs.


>> We define a new page-level bit, something like PD_RECENTLY_FROZEN.
>> When this bit is set, it means there are no unfrozen tuples on the
>> page with XIDs that predate the current half-epoch.  Whenever we know
>> this to be true, we set the bit.  If the page LSN crosses more than
>> one half-epoch boundary at a time, we freeze the page and set the bit.
>>   If the page LSN crosses exactly one half-epoch boundary, then (1) if
>> the bit is set, we clear it and (2) if the bit is not set, we freeze
>> the page and set the bit.
>
>
> Yep, I think that would work. Want to write the patch, or should I? ;-)

If we set a bit, surely we need to write the page. Isn't that what we
were trying to avoid?

Why set a bit at all? If we know the LSN of the page and we know the
epoch boundaries, then we can work out when the page was last written
to and infer that the page is "virtually frozen".

As soon as we make a change to a "virtually frozen" page, we can
actually freeze it and then make the change.

But we still have the problem of knowing which pages have been frozen
and which haven't.

Can we clear up those points first? Or at least my understanding of them.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From: Robert Haas
On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> We define a new page-level bit, something like PD_RECENTLY_FROZEN.
>> When this bit is set, it means there are no unfrozen tuples on the
>> page with XIDs that predate the current half-epoch.  Whenever we know
>> this to be true, we set the bit.  If the page LSN crosses more than
>> one half-epoch boundary at a time, we freeze the page and set the bit.
>>   If the page LSN crosses exactly one half-epoch boundary, then (1) if
>> the bit is set, we clear it and (2) if the bit is not set, we freeze
>> the page and set the bit.
>
> Yep, I think that would work. Want to write the patch, or should I? ;-)

Have at it.  I think the tricky part is going to be figuring out the
synchronization around half-epoch boundaries.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From: Robert Haas
On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If we set a bit, surely we need to write the page. Isn't that what we
> were trying to avoid?

No, the bit only gets set in situations when we were going to dirty
the page for some other reason anyway.  Specifically, if a page
modification discovers that we've switched epochs (but just once) and
the bit isn't already set, we can set it in lieu of scanning the
entire page for tuples that need freezing.

Under this proposal, pages that don't contain any dead tuples needn't
be dirtied for freezing, ever.  Smells like awesome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From: Simon Riggs
On 1 June 2013 21:26, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> If we set a bit, surely we need to write the page. Isn't that what we
>> were trying to avoid?
>
> No, the bit only gets set in situations when we were going to dirty
> the page for some other reason anyway.  Specifically, if a page
> modification discovers that we've switched epochs (but just once) and
> the bit isn't already set, we can set it in lieu of scanning the
> entire page for tuples that need freezing.
>
> Under this proposal, pages that don't contain any dead tuples needn't
> be dirtied for freezing, ever.  Smells like awesome.

Agreed, well done both.

What I especially like about it is how little logic it will require,
and no page format changes.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From: Greg Stark
On Fri, May 31, 2013 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Even at a more modest 10,000 tps, with default
> settings, you'll do anti-wraparound vacuums of the entire cluster
> about every 8 hours.  That's not fun.

I've forgotten now. What happens if you have a long-lived transaction
still alive from > 2B xid ago?


-- 
greg



Re: Freezing without write I/O

From: Heikki Linnakangas
On 06.06.2013 15:16, Greg Stark wrote:
> On Fri, May 31, 2013 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Even at a more modest 10,000 tps, with default
>> settings, you'll do anti-wraparound vacuums of the entire cluster
>> about every 8 hours.  That's not fun.
>
> I've forgotten now. What happens if you have a long-lived transaction
> still alive from > 2B xid ago?

That will keep OldestXmin from advancing. Which will keep vacuum from 
advancing relfrozenxid/datfrozenxid. Which will first trigger the 
warnings about wrap-around, then stop new XIDs from being generated, 
and finally cause a forced shutdown.

The forced shutdown will actually happen some time before going beyond 2 
billion XIDs. So it is not possible to have a long-lived transaction, 
older than 2 B XIDs, still live in the system. But let's imagine that 
you somehow bypass the safety mechanism:

After wraparound, old tuples will look like being in the future, and 
will become invisible to new transactions. That happens even if there 
are no old transactions around. I'm not sure what exactly will happen if 
there is still a transaction alive with an XID and/or snapshots older 
than 2^31 XIDs. New tuples that are not supposed to be visible to the 
old snapshot would suddenly become visible, I guess.

- Heikki



Re: Freezing without write I/O

From: Greg Stark
On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> That will keep OldestXmin from advancing. Which will keep vacuum from
> advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings
> about wrap-around, then stop new XIDs from being generated, and finally
> cause a forced shutdown.
>
> The forced shutdown will actually happen some time before going beyond 2
> billion XIDs. So it is not possible to have a long-lived transaction, older
> than 2 B XIDs, still live in the system. But let's imagine that you somehow
> bypass the safety mechanism:

Ah, so if you do the epoch in the page header thing or Robert's LSN
trick that I didn't follow, then you'll need a new safety check against
this. Since relfrozenxid/datfrozenxid will no longer be necessary.

-- 
greg



Re: Freezing without write I/O

From: Robert Haas
On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> That will keep OldestXmin from advancing. Which will keep vacuum from
>> advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings
>> about wrap-around, then stop new XIDs from being generated, and finally
>> cause a forced shutdown.
>>
>> The forced shutdown will actually happen some time before going beyond 2
>> billion XIDs. So it is not possible to have a long-lived transaction, older
>> than 2 B XIDs, still live in the system. But let's imagine that you somehow
>> bypass the safety mechanism:
>
> Ah, so if you do the epoch in the page header thing or Robert's LSN
> trick that I didn't follow, then you'll need a new safety check against
> this. Since relfrozenxid/datfrozenxid will no longer be necessary.

Nothing proposed here gets rid of either of those, that I can see.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From: Heikki Linnakangas
On 07.06.2013 20:54, Robert Haas wrote:
> On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark <stark@mit.edu> wrote:
>> On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> That will keep OldestXmin from advancing. Which will keep vacuum from
>>> advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings
>>> about wrap-around, then stop new XIDs from being generated, and finally
>>> cause a forced shutdown.
>>>
>>> The forced shutdown will actually happen some time before going beyond 2
>>> billion XIDs. So it is not possible to have a long-lived transaction, older
>>> than 2 B XIDs, still live in the system. But let's imagine that you somehow
>>> bypass the safety mechanism:
>>
>> Ah, so if you do the epoch in the page header thing or Robert's LSN
>> trick that I didn't follow, then you'll need a new safety check against
>> this. Since relfrozenxid/datfrozenxid will no longer be necessary.
>
> Nothing proposed here gets rid of either of those, that I can see.

Right. The meaning of relfrozenxid/datfrozenxid changes somewhat; it no 
longer means that all XIDs older than frozenxid are replaced with 
FrozenXid. Instead, it will mean that all XIDs older than frozenxid are 
committed, ie. all dead tuples older than that have been vacuumed away.

- Heikki



Re: Freezing without write I/O

From: Simon Riggs
On 7 June 2013 19:08, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 07.06.2013 20:54, Robert Haas wrote:
>>
>> On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark <stark@mit.edu> wrote:
>>>
>>> On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
>>> <hlinnakangas@vmware.com> wrote:
>>>>
>>>> That will keep OldestXmin from advancing. Which will keep vacuum from
>>>> advancing relfrozenxid/datfrozenxid. Which will first trigger the
>>>> warnings about wrap-around, then stop new XIDs from being generated,
>>>> and finally cause a forced shutdown.
>>>>
>>>> The forced shutdown will actually happen some time before going beyond
>>>> 2 billion XIDs. So it is not possible to have a long-lived transaction,
>>>> older than 2 B XIDs, still live in the system. But let's imagine that
>>>> you somehow bypass the safety mechanism:
>>>
>>>
>>> Ah, so if you do the epoch in the page header thing or Robert's LSN
>>> trick that I didn't follow, then you'll need a new safety check against
>>> this. Since relfrozenxid/datfrozenxid will no longer be necessary.
>>
>>
>> Nothing proposed here gets rid of either of those, that I can see.
>
>
> Right. The meaning of relfrozenxid/datfrozenxid changes somewhat; it no
> longer means that all XIDs older than frozenxid are replaced with FrozenXid.
> Instead, it will mean that all XIDs older than frozenxid are committed, ie.
> all dead tuples older than that have been vacuumed away.

Now that I consider Greg's line of thought, the idea we focused on
here was about avoiding freezing. But Greg makes me think that we may
also wish to look at allowing queries to run longer than one epoch as
well, if the epoch wrap time is likely to come down substantially.

To do that I think we'll need to hold epoch for relfrozenxid as well,
amongst other things.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From: Heikki Linnakangas
On 07.06.2013 21:33, Simon Riggs wrote:
> Now that I consider Greg's line of thought, the idea we focused on
> here was about avoiding freezing. But Greg makes me think that we may
> also wish to look at allowing queries to run longer than one epoch as
> well, if the epoch wrap time is likely to come down substantially.
>
> To do that I think we'll need to hold epoch for relfrozenxid as well,
> amongst other things.

The biggest problem I see with that is that if a snapshot can be older 
than 2 billion XIDs, it must be possible to store XIDs on the same page 
that are more than 2 billion XIDs apart. All the discussed schemes where 
we store the epoch at the page level, either explicitly or derived from 
the LSN, rely on the fact that it's not currently necessary to do that. 
Currently, when one XID on a page is older than 2 billion XIDs, that old 
XID can always be replaced with FrozenXid, because there cannot be a 
snapshot old enough to not see it.

I agree that it would be nice if you could find a way around that. You 
had a suggestion on making room on the tuple header for the epoch. I'm 
not sure I like that particular proposal, but we would need something 
like that. If we settle for snapshots that can be at most, say, 512 
billion transactions old, instead of 2 billion, then we would only need 
one byte to store an epoch "offset" in the tuple header. Ie. deduce the 
epoch of tuples on the page from the LSN on the page header, but allow 
individual tuples to specify an offset from that deduced epoch.
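
As a sketch of that arithmetic (all names hypothetical): one offset byte 
gives 256 epochs of roughly 2 billion XIDs each, i.e. the ~512 billion 
transaction horizon mentioned above:

#include <stdint.h>

/*
 * Combine the epoch deduced from the page LSN with a one-byte per-tuple
 * offset.  The offset points backwards, letting tuples up to 255 epochs
 * older than the page's deduced epoch coexist on the same page.
 */
static uint64_t
tuple_full_xid(uint32_t page_epoch,         /* deduced from the page LSN */
               uint8_t  tuple_epoch_offset, /* per-tuple byte, 0 = current */
               uint32_t tuple_xid)
{
    uint64_t epoch = (uint64_t) page_epoch - tuple_epoch_offset;

    return (epoch << 32) | tuple_xid;
}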

In practice, I think we're still quite far from people running into that 
2 billion XID limit on snapshot age. But maybe in a few years, after 
we've solved all the more pressing vacuum and wraparound issues that 
people currently run into before reaching that stage...

- Heikki



Re: Freezing without write I/O

From: Simon Riggs
On 7 June 2013 19:56, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 07.06.2013 21:33, Simon Riggs wrote:
>>
>> Now that I consider Greg's line of thought, the idea we focused on
>> here was about avoiding freezing. But Greg makes me think that we may
>> also wish to look at allowing queries to run longer than one epoch as
>> well, if the epoch wrap time is likely to come down substantially.
>>
>> To do that I think we'll need to hold epoch for relfrozenxid as well,
>> amongst other things.

> The biggest problem I see with that is that if a snapshot can be older than
> 2 billion XIDs, it must be possible to store XIDs on the same page that are
> more than 2 billion XIDs apart. All the discussed schemes where we store the
> epoch at the page level, either explicitly or derived from the LSN, rely on
> the fact that it's not currently necessary to do that. Currently, when one
> XID on a page is older than 2 billion XIDs, that old XID can always be
> replaced with FrozenXid, because there cannot be a snapshot old enough to
> not see it.

It does seem that there are two problems: avoiding freezing AND long
running queries.

The long running query problem hasn't ever been looked at, it seems,
until here and now.

> I agree that it would be nice if you could find a way around that. You had a
> suggestion on making room on the tuple header for the epoch. I'm not sure I
> like that particular proposal, but we would need something like that. If we
> settle for snapshots that can be at most, say, 512 billion transactions old,
> instead of 2 billion, then we would only need one byte to store an epoch
> "offset" in the tuple header. Ie. deduce the epoch of tuples on the page
> from the LSN on the page header, but allow individual tuples to specify an
> offset from that deduced epoch.

I like the modification you propose. And I like it even better because
it uses just 1 byte, which is even more easily squeezed into the
existing tuple header, whether we go with my proposed squeezing route
or not.

> In practice, I think we're still quite far from people running into that 2
> billion XID limit on snapshot age. But maybe in a few years, after we've
> solved all the more pressing vacuum and wraparound issues that people
> currently run into before reaching that stage...

Your WALInsert lock patch will fix that. ;-)

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From: Robert Haas
On Fri, Jun 7, 2013 at 3:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The long running query problem hasn't ever been looked at, it seems,
> until here and now.

For what it's worth (and that may not be much), I think most people
will die a horrible death due to bloat after holding a transaction
open for a tiny fraction of 2B XIDs.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From: Andres Freund
On 2013-06-07 20:10:55 +0100, Simon Riggs wrote:
> On 7 June 2013 19:56, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> > On 07.06.2013 21:33, Simon Riggs wrote:
> >>
> >> Now that I consider Greg's line of thought, the idea we focused on
> >> here was about avoiding freezing. But Greg makes me think that we may
> >> also wish to look at allowing queries to run longer than one epoch as
> >> well, if the epoch wrap time is likely to come down substantially.
> >>
> >> To do that I think we'll need to hold epoch for relfrozenxid as well,
> >> amongst other things.
> 
> > The biggest problem I see with that is that if a snapshot can be older than
> > 2 billion XIDs, it must be possible to store XIDs on the same page that are
> > more than 2 billion XIDs apart. All the discussed schemes where we store the
> > epoch at the page level, either explicitly or derived from the LSN, rely on
> > the fact that it's not currently necessary to do that. Currently, when one
> > XID on a page is older than 2 billion XIDs, that old XID can always be
> > replaced with FrozenXid, because there cannot be a snapshot old enough to
> > not see it.
> 
> It does seem that there are two problems: avoiding freezing AND long
> running queries.
> 
> The long running query problem hasn't ever been looked at, it seems,
> until here and now.

I'd say that's because it's prohibitive to run such long transactions
anyway, since they cause too much unremovable bloat. 2 billion transactions
really is quite a bit; I don't think it's a relevant restriction. Yet.

Let's discuss it if we have solved the other problems ;)

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From: Simon Riggs
On 7 June 2013 20:16, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-07 20:10:55 +0100, Simon Riggs wrote:
>> On 7 June 2013 19:56, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> > On 07.06.2013 21:33, Simon Riggs wrote:
>> >>
>> >> Now that I consider Greg's line of thought, the idea we focused on
>> >> here was about avoiding freezing. But Greg makes me think that we may
>> >> also wish to look at allowing queries to run longer than one epoch as
>> >> well, if the epoch wrap time is likely to come down substantially.
>> >>
>> >> To do that I think we'll need to hold epoch for relfrozenxid as well,
>> >> amongst other things.
>>
>> > The biggest problem I see with that is that if a snapshot can be older than
>> > 2 billion XIDs, it must be possible to store XIDs on the same page that are
>> > more than 2 billion XIDs apart. All the discussed schemes where we store the
>> > epoch at the page level, either explicitly or derived from the LSN, rely on
>> > the fact that it's not currently necessary to do that. Currently, when one
>> > XID on a page is older than 2 billion XIDs, that old XID can always be
>> > replaced with FrozenXid, because there cannot be a snapshot old enough to
>> > not see it.
>>
>> It does seem that there are two problems: avoiding freezing AND long
>> running queries.
>>
>> The long running query problem hasn't ever been looked at, it seems,
>> until here and now.
>
> I'd say that's because it's prohibitive to run such long transactions
> anyway, since they cause too much unremovable bloat. 2 billion transactions
> really is quite a bit; I don't think it's a relevant restriction. Yet.
>
> Let's discuss it if we have solved the other problems ;)

Let me say that I think that problem is solvable also.

At the moment we allow all visible tuple versions to be linked
together, so that the last visible and latest update are linked by a
chain. If we break that assumption and say that we will never follow
an update chain from a snapshot in the distant past, then we can
remove intermediate dead rows. We currently regard those as recently
dead, but that just requires some extra thought. We still keep all
*visible* tuple versions, we just don't bother to keep all the
intermediate ones as well.

Perhaps another day, but one day.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From: Heikki Linnakangas
On 07.06.2013 22:15, Robert Haas wrote:
> On Fri, Jun 7, 2013 at 3:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The long running query problem hasn't ever been looked at, it seems,
>> until here and now.
>
> For what it's worth (and that may not be much), I think most people
> will die a horrible death due to bloat after holding a transaction
> open for a tiny fraction of 2B XIDs.  :-(

Yeah, we should fix that too ;-).

While we're at it: I've been thinking that we should try harder to 
vacuum dead tuples that are no longer visible to any snapshot, even if 
there's an even older snapshot. The stereotypical scenario is a table with 
a single row that's updated very very frequently. Like a counter. 
Autovacuum can normally keep it in check, but if you have a long-running 
transaction, it will bloat horrendously. But if you only have one 
long-running transaction with one really old snapshot, and everything 
else is recent, you'd really only need to keep one old tuple around for 
the old snapshot to see, and a recent version or two for the rest. At 
worst, the database needs to bloat to double the size, but not more than 
that.

To know which tuples are dead at such a fine-grained level, vacuum would 
need to know in more detail what snapshots the backends have. I'm really 
excited about Ants Aasma's proposal to use a CSN for snapshots, or more 
precisely the variant using commit record's LSN for that. If a snapshot 
is just a single integer, it becomes easier for backends to share their 
snapshots, in a limited amount of shared memory.

- Heikki



Re: Freezing without write I/O

From: Hannu Krosing
On 06/07/2013 08:56 PM, Heikki Linnakangas wrote:
> On 07.06.2013 21:33, Simon Riggs wrote:
>> Now that I consider Greg's line of thought, the idea we focused on
>> here was about avoiding freezing. But Greg makes me think that we may
>> also wish to look at allowing queries to run longer than one epoch as
>> well, if the epoch wrap time is likely to come down substantially.
>>
>> To do that I think we'll need to hold epoch for relfrozenxid as well,
>> amongst other things.
>
> The biggest problem I see with that is that if a snapshot can be older
> than 2 billion XIDs, it must be possible to store XIDs on the same
> page that are more than 2 billion XIDs apart. 
It could be possible to recognise the situation and save the new XIDs on
*another* page?


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: Freezing without write I/O

From: Hannu Krosing
On 06/07/2013 09:16 PM, Andres Freund wrote:
> On 2013-06-07 20:10:55 +0100, Simon Riggs wrote:
>> On 7 June 2013 19:56, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> On 07.06.2013 21:33, Simon Riggs wrote:
>>>> Now that I consider Greg's line of thought, the idea we focused on
>>>> here was about avoiding freezing. But Greg makes me think that we may
>>>> also wish to look at allowing queries to run longer than one epoch as
>>>> well, if the epoch wrap time is likely to come down substantially.
>>>>
>>>> To do that I think we'll need to hold epoch for relfrozenxid as well,
>>>> amongst other things.
>>> The biggest problem I see with that is that if a snapshot can be older than
>>> 2 billion XIDs, it must be possible to store XIDs on the same page that are
>>> more than 2 billion XIDs apart. All the discussed schemes where we store the
>>> epoch at the page level, either explicitly or derived from the LSN, rely on
>>> the fact that it's not currently necessary to do that. Currently, when one
>>> XID on a page is older than 2 billion XIDs, that old XID can always be
>>> replaced with FrozenXid, because there cannot be a snapshot old enough to
>>> not see it.
>> It does seem that there are two problems: avoiding freezing AND long
>> running queries.
>>
>> The long running query problem hasn't ever been looked at, it seems,
>> until here and now.
> I'd say that's because it's prohibitive to run such long transactions
> anyway, since they cause too much unremovable bloat. 2 billion transactions
> really is quite a bit; I don't think it's a relevant restriction. Yet.
2 billion transactions means at least 2G rows inserted, updated or deleted,
which may or may not be "too much".

In my simple key/value update tests I was able to do 40k single-transaction
updates/sec on a moderate-sized server. At that rate, 2G transactions takes
about 15 hours, which is not really very much.

One may, for example, have an unfinished 2PC transaction lingering over the
weekend, and while of course one should have monitoring in place to avoid
this, it would still be nice if it did not cause a database shutdown.



-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: Freezing without write I/O

From: Heikki Linnakangas
On 01.06.2013 23:21, Robert Haas wrote:
> On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> We define a new page-level bit, something like PD_RECENTLY_FROZEN.
>>> When this bit is set, it means there are no unfrozen tuples on the
>>> page with XIDs that predate the current half-epoch.  Whenever we know
>>> this to be true, we set the bit.  If the page LSN crosses more than
>>> one half-epoch boundary at a time, we freeze the page and set the bit.
>>>    If the page LSN crosses exactly one half-epoch boundary, then (1) if
>>> the bit is set, we clear it and (2) if the bit is not set, we freeze
>>> the page and set the bit.
>>
>> Yep, I think that would work. Want to write the patch, or should I? ;-)
>
> Have at it.

Here's a first draft. A lot of stuff missing and broken, but "make
check" passes :-).

In the patch, instead of working with "half-epochs", there are "XID-LSN
ranges", which can be of arbitrary size. An XID-LSN range consists of
three values:

minlsn: The point in WAL where this range begins.
minxid - maxxid: The range of XIDs allowed in this range.

Every point in WAL belongs to exactly one range. The minxid-maxxid of
the ranges can overlap. For example:

1. XIDs 25000942 - 27000003 LSN 0/3BB9938
2. XIDs 23000742 - 26000003 LSN 0/2AB9288
3. XIDs 22000721 - 25000003 LSN 0/1AB8BE0
4. XIDs 22000002 - 24000003 LSN 0/10B1550

The invariant with the ranges is that a page with a certain LSN is only
allowed to contain XIDs that belong to the range specified by that LSN.
For example, if a page has LSN 0/3500000, it belongs to the 2nd range,
and can only contain XIDs between 23000742 - 26000003. If a backend
updates the page, so that its LSN is updated to, say, 0/3D12345, all
XIDs on the page older than 25000942 must be frozen first, to avoid
violating the rule.
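
To illustrate the invariant, here's a hedged sketch of the range
bookkeeping; the struct layout and helper are assumptions for exposition,
not the patch's actual code:

#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

typedef struct XidLSNRange
{
    XLogRecPtr    minlsn;   /* point in WAL where this range begins */
    TransactionId minxid;   /* oldest XID allowed under this range */
    TransactionId maxxid;   /* newest XID allowed under this range */
} XidLSNRange;

/* ranges[] ordered newest first, as in the example list above */
static const XidLSNRange *
range_for_lsn(const XidLSNRange *ranges, int nranges, XLogRecPtr page_lsn)
{
    for (int i = 0; i < nranges; i++)
    {
        if (page_lsn >= ranges[i].minlsn)
            return &ranges[i];
    }
    return NULL;            /* older than all tracked ranges */
}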

The system keeps track of a small number of these XID-LSN ranges. Where
we currently truncate clog, we can also truncate the ranges with maxxid
< the clog truncation point. Vacuum removes any dead tuples and updates
relfrozenxid as usual, to make sure that there are no dead tuples or
aborted XIDs older than the minxid of the oldest tracked XID-LSN range.
It no longer needs to freeze old committed XIDs, however - that's the
gain from this patch (except to uphold the invariant, if it has to
remove some dead tuples on the page and update its LSN).

A new range is created whenever we reach the maxxid on the current one.
The new range's minxid is set to the current global oldest xmin value,
and maxxid is just the old maxxid plus a fixed number (1 million in the
patch, but we probably want a higher value in reality). This ensures
that if you modify a page and update its LSN, all the existing XIDs on
the page that cannot be frozen yet are greater than the minxid of the
latest range. In other words, you can always freeze old XIDs on a page,
so that any remaining non-frozen XIDs are within the minxid-maxxid of
the latest range.
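
Continuing the sketch above, creating the next range might look roughly
like this (the step constant and helper are illustrative; as said, the
patch uses 1 million as the step):

#define XID_RANGE_STEP 1000000           /* probably larger in reality */

static XidLSNRange
make_next_range(const XidLSNRange *latest,
                TransactionId global_oldest_xmin,
                XLogRecPtr current_insert_lsn)
{
    XidLSNRange next;

    next.minlsn = current_insert_lsn;    /* where the new range begins */
    next.minxid = global_oldest_xmin;    /* current global oldest xmin */
    next.maxxid = latest->maxxid + XID_RANGE_STEP;
    return next;
}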

The HeapTupleSatisfies functions are modified to look at the page's LSN
first. If it's old enough, it doesn't look at the XIDs on the page level
at all, and just considers everything on the page visible to everyone
(I'm calling this state a "mature page").
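
A sketch of that short-circuit, reusing the types from the range sketch
above (the helper name is hypothetical):

#include <stdbool.h>

/*
 * If the page's LSN predates the oldest tracked XID-LSN range, everything
 * on it is known committed and visible to everyone: a "mature" page needs
 * no per-tuple XID or clog checks at all.
 */
static bool
page_is_mature(XLogRecPtr page_lsn, XLogRecPtr oldest_tracked_minlsn)
{
    return page_lsn < oldest_tracked_minlsn;
}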

> I think the tricky part is going to be figuring out the
> synchronization around half-epoch boundaries.

Yep. I skipped all those difficult parts in this first version. There
are two race conditions that need to be fixed:

1. When a page is updated, we check if it needs to be frozen: it doesn't,
if its LSN is greater than the latest range's LSN, IOW if we've already
modified the page, and thus frozen all older tuples, within the current range.
However, it's possible that a new range is created immediately after
we've checked that. When we then proceed to do the actual update on the
page and WAL-log that, the new LSN falls within the next range, and we
should've frozen the page. I'm planning to fix that by adding a "parity
bit" on the page header. Each XID-LSN range is assigned a parity bit, 0
or 1. When we check if a page needs to be frozen on update, we make note
of the latest range's parity bit, and write it in the page header.
Later, when we look at the page's LSN to determine which XID-LSN range
it belongs to, we compare the parity. If the parity doesn't match, we
know that the race condition happened, so we treat the page as belonging to
the previous range, not the one it would normally belong to, per the LSN.
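
In pseudo-C, the read side of the parity check might look like this
(XidLSNRangeForLSN(), PreviousXidLSNRange() and PD_XIDLSN_PARITY are
invented names for illustration):

    XidLSNRange *range = XidLSNRangeForLSN(PageGetLSN(page));
    bool         pageparity;

    pageparity = (((PageHeader) page)->pd_flags & PD_XIDLSN_PARITY) != 0;
    if (pageparity != range->parity)
        range = PreviousXidLSNRange(range); /* the race condition happened */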

2. When we look at a page, and determine that it's not old enough to be
"matured", we then check the clog as usual. A page is considered mature,
if the XID-LSN range (and corresponding clog pages) has already been
truncated away. It's possible that between those steps, the XID-LSN
range and clog is truncated away, so that the backend tries to access a
clog page that doesn't exist anymore. To fix that, the XID-LSN range and
clog truncation needs to be done in two steps. First, mark the
truncation point in shared memory. Then somehow wait until all backends
see the new value, and go ahead with actually truncating the clog only
after that.
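
Roughly (a sketch, with invented names for the shared state and the wait
primitive):

    /* Step 1: publish the new truncation point. */
    SpinLockAcquire(&shared->mutex);
    shared->truncationPoint = truncXid;
    SpinLockRelease(&shared->mutex);

    /* Step 2: wait until all backends have seen the new value... */
    WaitForAllBackendsToSeeTruncationPoint();

    /* ...and only then actually truncate the clog. */
    TruncateCLOG(truncXid);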


Aside from those two race conditions, there are plenty of scalability
issues remaining. Currently, the shared XID-LSN range array is checked
every time a page is accessed, so it could quickly become a bottleneck.
Need to cache that information in each backend. Oh, and I didn't
implement the PD_RECENTLY_FROZEN bit in the page header yet, so you will
get a freezing frenzy right after a new XID-LSN range is created.

I'll keep hacking away on those things, but please let me know if you
see some fatal flaw with this plan.

- Heikki

Attachment

Re: Freezing without write I/O

From
Simon Riggs
Date:
On 10 June 2013 19:58, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 01.06.2013 23:21, Robert Haas wrote:
>>
>> On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com>  wrote:
>>>>
>>>> We define a new page-level bit, something like PD_RECENTLY_FROZEN.
>>>> When this bit is set, it means there are no unfrozen tuples on the
>>>> page with XIDs that predate the current half-epoch.  Whenever we know
>>>> this to be true, we set the bit.  If the page LSN crosses more than
>>>> one half-epoch boundary at a time, we freeze the page and set the bit.
>>>>    If the page LSN crosses exactly one half-epoch boundary, then (1) if
>>>> the bit is set, we clear it and (2) if the bit is not set, we freeze
>>>> the page and set the bit.
>>>
>>>
>>> Yep, I think that would work. Want to write the patch, or should I? ;-)
>>
>>
>> Have at it.
>
>
> Here's a first draft. A lot of stuff missing and broken, but "make check"
> passes :-).

Well done, looks like good progress.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Robert Haas
Date:
On Mon, Jun 10, 2013 at 4:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Well done, looks like good progress.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From
Heikki Linnakangas
Date:
On 10.06.2013 21:58, Heikki Linnakangas wrote:
> On 01.06.2013 23:21, Robert Haas wrote:
>> On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>>> We define a new page-level bit, something like PD_RECENTLY_FROZEN.
>>>> When this bit is set, it means there are no unfrozen tuples on the
>>>> page with XIDs that predate the current half-epoch. Whenever we know
>>>> this to be true, we set the bit. If the page LSN crosses more than
>>>> one half-epoch boundary at a time, we freeze the page and set the bit.
>>>> If the page LSN crosses exactly one half-epoch boundary, then (1) if
>>>> the bit is set, we clear it and (2) if the bit is not set, we freeze
>>>> the page and set the bit.
>>>
>>> Yep, I think that would work. Want to write the patch, or should I? ;-)
>>
>> Have at it.
>
> Here's a first draft. A lot of stuff missing and broken, but "make
> check" passes :-).
>
> In the patch, instead of working with "half-epochs", there are "XID-LSN
> ranges", which can be of arbitrary size. An XID-LSN range consists of
> three values:
>
> minlsn: The point in WAL where this range begins.
> minxid - maxxid: The range of XIDs allowed in this range.
>
> Every point in WAL belongs to exactly one range. The minxid-maxxid of
> the ranges can overlap. For example:
>
> 1. XIDs 25000942 - 27000003 LSN 0/3BB9938
> 2. XIDs 23000742 - 26000003 LSN 0/2AB9288
> 3. XIDs 22000721 - 25000003 LSN 0/1AB8BE0
> 4. XIDs 22000002 - 24000003 LSN 0/10B1550
>
> The invariant with the ranges is that a page with a certain LSN is only
> allowed to contain XIDs that belong to the range specified by that LSN.
> For example, if a page has LSN 0/3500000, it belongs to the 2nd range,
> and can only contain XIDs between 23000742 - 26000003. If a backend
> updates the page, so that its LSN is updated to, say, 0/3D12345, all
> XIDs on the page older than 25000942 must be frozen first, to avoid
> violating the rule.
>
> The system keeps track of a small number of these XID-LSN ranges. Where
> we currently truncate clog, we can also truncate the ranges with maxxid
> < the clog truncation point. Vacuum removes any dead tuples and updates
> relfrozenxid as usual, to make sure that there are no dead tuples or
> aborted XIDs older than the minxid of the oldest tracked XID-LSN range.
> It no longer needs to freeze old committed XIDs, however - that's the
> gain from this patch (except to uphold the invariant, if it has to
> remove some dead tuples on the page and update its LSN).
>
> A new range is created whenever we reach the maxxid on the current one.
> The new range's minxid is set to the current global oldest xmin value,
> and maxxid is just the old maxxid plus a fixed number (1 million in the
> patch, but we probably want a higher value in reality). This ensures
> that if you modify a page and update its LSN, all the existing XIDs on
> the page that cannot be frozen yet are greater than the minxid of the
> latest range. In other words, you can always freeze old XIDs on a page,
> so that any remaining non-frozen XIDs are within the minxid-maxxid of
> the latest range.
>
> The HeapTupleSatisfies functions are modified to look at the page's LSN
> first. If it's old enough, they don't look at the XIDs on the page at
> all, and just consider everything on the page to be visible to everyone
> (I'm calling this state a "mature page").
>
>> I think the tricky part is going to be figuring out the
>> synchronization around half-epoch boundaries.
>
> Yep. I skipped all those difficult parts in this first version. There
> are two race conditions that need to be fixed:
>
> 1. When a page is updated, we check if it needs to be frozen, by
> checking if its LSN is greater than the latest range's LSN - IOW, if
> we've already modified the page, and thus frozen all older tuples,
> within the current range.
> However, it's possible that a new range is created immediately after
> we've checked that. When we then proceed to do the actual update on the
> page and WAL-log that, the new LSN falls within the next range, and we
> should've frozen the page. I'm planning to fix that by adding a "parity
> bit" on the page header. Each XID-LSN range is assigned a parity bit, 0
> or 1. When we check if a page needs to be frozen on update, we make note
> of the latest range's parity bit, and write it in the page header.
> Later, when we look at the page's LSN to determine which XID-LSN range
> it belongs to, we compare the parity. If the parity doesn't match, we
> know that the race condition happened, so we treat the page as belonging to
> the previous range, not the one it would normally belong to, per the LSN.
>
> 2. When we look at a page, and determine that it's not old enough to be
> "matured", we then check the clog as usual. A page is considered mature,
> if the XID-LSN range (and corresponding clog pages) has already been
> truncated away. It's possible that between those steps, the XID-LSN
> range and clog is truncated away, so that the backend tries to access a
> clog page that doesn't exist anymore. To fix that, the XID-LSN range and
> clog truncation needs to be done in two steps. First, mark the
> truncation point in shared memory. Then somehow wait until all backends
> see the new value, and go ahead with actually truncating the clog only
> after that.
>
> Aside from those two race conditions, there are plenty of scalability
> issues remaining. Currently, the shared XID-LSN range array is checked
> every time a page is accessed, so it could quickly become a bottleneck.
> Need to cache that information in each backend. Oh, and I didn't
> implement the PD_RECENTLY_FROZEN bit in the page header yet, so you will
> get a freezing frenzy right after a new XID-LSN range is created.
>
> I'll keep hacking away on those things, but please let me know if you
> see some fatal flaw with this plan.

Here's an updated patch. The race conditions I mentioned above have been
fixed.

This is still definitely work-in-progress, but overall I think this is
quite promising. The patch basically works, although there are a bunch
of TODO items like handling unlogged tables.

This patch is also available in my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, branch
"freeze-by-xid-lsn-ranges".

- Heikki

Attachment

Re: Freezing without write I/O

From
Heikki Linnakangas
Date:
On 27.08.2013 18:56, Heikki Linnakangas wrote:
> Here's an updated patch.

Ah, forgot one thing:

Here's a little extension I've been using to test this. It contains two
functions; one to simply consume N xids, making it faster to hit the
creation of new XID ranges and wraparound. The other,
print_xidlsnranges(), prints out the contents of the current XID-LSN
range array.

Also, just ran into two silly bugs in the patch version I posted, while
checking that that xidtest extension hasn't bitrotted. A fixed version
has been pushed to the git repository, so make sure you use that version
if you want to actually run it.

- Heikki

Attachment

Re: Freezing without write I/O

From
Andres Freund
Date:
Hi Heikki,

On 2013-08-27 18:56:15 +0300, Heikki Linnakangas wrote:
> Here's an updated patch. The race conditions I mentioned above have been
> fixed.

Thanks for posting the new version!

I have a quick question: The reason I'd asked about the status of the
patch was that I was thinking about the state of the "forensic freezing"
patch. After a quick look at your proposal, we still need to freeze in
some situations (old & new data on the same page basically), so I'd say
it still makes sense to apply the forensic freezing patch, right?

Differing Opinions?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Jeff Davis
Date:
On Fri, 2013-08-30 at 20:34 +0200, Andres Freund wrote:
> I have a quick question: The reason I'd asked about the status of the
> patch was that I was thinking about the state of the "forensic freezing"
> patch. After a quick look at your proposal, we still need to freeze in
> some situations (old & new data on the same page basically), so I'd say
> it still makes sense to apply the forensic freezing patch, right?
> 
> Differing Opinions?

The Freeze Forensically patch is nice because (if I understand it
correctly) it allows us to freeze at the same time as we mark
PD_ALL_VISIBLE, which avoids the potential extra page write. But that's
not such a big advantage if we don't ordinarily have to write again for
freezing, anyway.

However, there are still some cases where we'd be able to preserve the
forensic information. If nothing else, that might help debug this patch,
if necessary. There might also be cases where we can freeze more eagerly
to avoid the case where very old (but unfrozen) and very new tuples mix
on the same page. Perhaps Robert has some thoughts here, as well.

Regards,
    Jeff Davis





Re: Freezing without write I/O

From
Robert Haas
Date:
On Mon, Sep 2, 2013 at 3:16 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2013-08-30 at 20:34 +0200, Andres Freund wrote:
>> I have a quick question: The reason I'd asked about the status of the
>> patch was that I was thinking about the state of the "forensic freezing"
>> patch. After a quick look at your proposal, we still need to freeze in
>> some situations (old & new data on the same page basically), so I'd say
>> it still makes sense to apply the forensic freezing patch, right?
>>
>> Differing Opinions?
>
> The Freeze Forensically patch is nice because (if I understand it
> correctly) it allows us to freeze at the same time as we mark
> PD_ALL_VISIBLE, which avoids the potential extra page write.

The patch itself doesn't actually make that change, but it removes one
major objection to such a change.

> But that's
> not such a big advantage if we don't ordinarily have to write again for
> freezing, anyway.

That was my thought as well.

> However, there are still some cases where we'd be able to preserve the
> forensic information. If nothing else, that might help debug this patch,
> if necessary. There might also be cases where we can freeze more eagerly
> to avoid the case where very old (but unfrozen) and very new tuples mix
> on the same page. Perhaps Robert has some thoughts here, as well.

I basically agree.  I think if we adopt Heikki's patch forensic
freezing becomes much less important, but we might find there's still
a reason to do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From
Heikki Linnakangas
Date:
On 27.08.2013 19:37, Heikki Linnakangas wrote:
> On 27.08.2013 18:56, Heikki Linnakangas wrote:
>> Here's an updated patch.
>
> Ah, forgot one thing:
>
> Here's a little extension I've been using to test this. It contains two
> functions; one to simply consume N xids, making it faster to hit the
> creation of new XID ranges and wraparound. The other,
> print_xidlsnranges(), prints out the contents of the current XID-LSN
> range array.
>
> Also, just ran into two silly bugs in the patch version I posted, while
> checking that that xidtest extension hasn't bitrotted. A fixed version
> has been pushed to the git repository, so make sure you use that version
> if you want to actually run it.

Here's a rebased version of the patch, including the above-mentioned
fixes. Nothing else new.

- Heikki


Attachment

Re: Freezing without write I/O

From
Peter Eisentraut
Date:
On 9/16/13 9:59 AM, Heikki Linnakangas wrote:
> Here's a rebased version of the patch, including the above-mentioned
> fixes. Nothing else new.

It still fails to apply.  You might need to rebase it again.



Re: Freezing without write I/O

From
Andres Freund
Date:
On 2013-09-17 09:41:47 -0400, Peter Eisentraut wrote:
> On 9/16/13 9:59 AM, Heikki Linnakangas wrote:
> > Here's a rebased version of the patch, including the above-mentioned
> > fixes. Nothing else new.
> 
> It still fails to apply.  You might need to rebase it again.

FWIW, I don't think it's too important that this specific patch applies
all the time. From the look I had and the discussions onlist, what it
needs so far is mostly architecture review and such.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Alvaro Herrera
Date:
Heikki Linnakangas escribió:

> Here's a rebased version of the patch, including the above-mentioned
> fixes. Nothing else new.

Nice work.  I apologize for the conflicts I created yesterday.

I would suggest renaming varsup_internal.h to varsup_xlog.h.

You added a FIXME comment to HeapTupleHeaderIsOnlyLocked.  I think the
answer to the question is yes, because checking for locked might involve
examining the Xmax of tuples, which cannot be done if the page is
mature; perhaps the check for maturity needs to happen only after the
infomask has been checked.

The new stuff in GetNewTransactionId() is said to be "very expensive"
(it might involve grabbing the ProcArrayLock and scanning the whole
procarray, for example.)  There's a comment about having this be done
only in other processes, and I think that's a good idea, otherwise we
risk adding a lot of latency, randomly, to client-connected processes
which might be latency sensitive.  I think autovacuum is not a good
choice however (it might not even be running).  How about using the
bgwriter or walwriter for this?  As far as I can tell, there's no need
for this process to actually be able to run transactions or scan
catalogs; the ability to lock and scan ProcArray appears sufficient.
Using a RTC (instead of the Xid counter % 128) to determine when to run
this doesn't look like a problem to me.  Something that sleeps for too
long might be, so we would need to ensure that it happens timely.  Not
sure what's an appropriate time for this, however.

Another option is to have backends check the Xid counter, and every 128
ticks set a flag in shmem so that the background process knows to
execute the switch.
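
Something along these lines, perhaps (a sketch with invented names):

    /* In GetNewTransactionId(), keep only a cheap check: */
    if ((xid % 128) == 0)
        varsupShared->switchRequested = true;

    /* In the background process' main loop: */
    if (varsupShared->switchRequested)
    {
        varsupShared->switchRequested = false;
        PerformXidLSNRangeSwitch();     /* the expensive part */
    }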


heap_freeze_tuple() receives one Xid and one MultiXactId; they can be
passed as Invalid to mean forced freezing.  Do we really need to offer
the possibility of freezing xids but not multis, and multis but not
xids?  From a conceptual PoV, it seems to make more sense to me to pass
a boolean "force" meaning freeze everything, and ignore the numerical
values.
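
That is, a signature along these lines (sketch):

    bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
                           MultiXactId cutoff_multi, bool force);

with force=true meaning "freeze everything" and the cutoffs ignored.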


The greatest risk in this patch is the possibility that some
heap_freeze_page() call is forgotten.  I think you covered all cases in
heapam.c.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Andres Freund
Date:
On 2013-09-16 16:59:28 +0300, Heikki Linnakangas wrote:
> Here's a rebased version of the patch, including the above-mentioned fixes.
> Nothing else new.

* We need some higherlevel description of the algorithm somewhere in the
  source. I don't think I've understood the concept from the patch alone
  without having read the thread previously.
* why do we need to do the PageUpdateNeedsFreezing() dance in
  heap_page_prune? No xids should change during it.
* Why can we do a GetOldestXmin(allDbs = false) in
  BeginXidLSNRangeSwitch()?
* Is there any concrete reasoning behind the current values for
  XID_LSN_RANGE_INTERVAL and NUM_XID_LSN_RANGES or just gut feeling?
* the lsn ranges file can possibly become bigger than 512bytes (the size
  we assume to be written atomically) and you write it inplace. If we
  fail halfway through writing, we seem to be able to recover by using
  the pageMatureLSN from the last checkpoint, but it seems better to
  do the fsync(),rename(),fsync() dance either way.
* Should we preemptively freeze tuples on a page in lazy_scan_heap if we
  already have dirtied the page? That would make future modifications
  cheaper.
* lazy_scan_heap now blocks acquiring a cleanup lock on every buffer
  that contains dead tuples. Shouldn't we use some kind of cutoff xid
  there? That might block progress too heavily. Also the comment above
  it still refers to the old logic.
* There's no way to force a full table vacuum anymore, that seems
  problematic to me.
* I wonder if CheckPointVarsup() doesn't need to update
  minRecoveryPoint. StartupVarsup() should be ok, because we should only
  read one from the future during a basebackup?
* xidlsnranges_recently[_dirtied] are not obvious on a first glance. Why
  can't we just reset dirty before the WriteXidLSNRangesFile() call?
  There's only one process doing the writeout. Just because the
  checkpointing process could be killed?
* I think we should either not require consuming a multixactid or use a
  function that doesn't need MultiXactIdSetOldestMember(). If the
  transaction doing so lives for long it will unnecessarily prevent
  truncation of mxacts.
* switchFinishXmin and nextSwitchXid should probably be either volatile
  or have a compiler barrier between accessing shared memory and
  checking them. The compiler very well could optimize them away and
  access shmem all the time which could lead to weird results.
* I wonder whether the fact that we're doing the range switches after
  acquiring an xid could be problematic if we're preventing xid
  allocation due to the checks earlier in that function?
* I think heap_lock_tuple() needs to unset all-visible, otherwise we
  won't vacuum that page again which can lead to problems since we
  don't do full-table vacuums again?

So, I think that's enough for a first look. Will think about general
issues a bit more.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Jeff Janes
Date:
On Mon, Sep 16, 2013 at 6:59 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Here's a rebased version of the patch, including the above-mentioned fixes. Nothing else new.

I've applied this to 0892ecbc015930d, the last commit to which it applies cleanly.

When I test this by repeatedly incrementing a counter in a randomly chosen row, then querying the whole table and comparing the results to what my driver knows they should be, I get discrepancies.

No crash/recovery needs to be done to get the behavior.

The number of rows is correct, so one version of every row is visible, but it is the wrong version.

The discrepancy arises shortly after the first time this type of message appears:

6930 UPDATE 2013-09-18 12:36:34.519 PDT:LOG:  started new XID range, XIDs 1000033-, MultiXIDs 1-, tentative LSN 0/FA517F8
6930 UPDATE 2013-09-18 12:36:34.519 PDT:STATEMENT:  update foo set count=count+1 where index=$1
6928 UPDATE 2013-09-18 12:36:34.521 PDT:LOG:  closed old XID range at 1000193 (LSN 0/FA58A08)
6928 UPDATE 2013-09-18 12:36:34.521 PDT:STATEMENT:  update foo set count=count+1 where index=$1

I'll work on getting the driver to shutdown the database the first time it finds a problem so that autovac doesn't destroy evidence.

Cheers,

Jeff

Freezing without write I/O

From
Jeff Janes
Date:
On Wed, Sep 18, 2013 at 12:55 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Sep 16, 2013 at 6:59 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Here's a rebased version of the patch, including the above-mentioned fixes. Nothing else new.

I've applied this to 0892ecbc015930d, the last commit to which it applies cleanly.

When I test this by repeatedly incrementing a counter in a randomly chosen row, then querying the whole table and comparing the results to what my driver knows they should be, I get discrepancies.

No crash/recovery needs to be done to get the behavior.

The number of rows is correct, so one version of every row is visible, but it is sometimes the wrong version.

The discrepancy arises shortly after the first time this type of message appears:

6930 UPDATE 2013-09-18 12:36:34.519 PDT:LOG:  started new XID range, XIDs 1000033-, MultiXIDs 1-, tentative LSN 0/FA517F8
6930 UPDATE 2013-09-18 12:36:34.519 PDT:STATEMENT:  update foo set count=count+1 where index=$1
6928 UPDATE 2013-09-18 12:36:34.521 PDT:LOG:  closed old XID range at 1000193 (LSN 0/FA58A08)
6928 UPDATE 2013-09-18 12:36:34.521 PDT:STATEMENT:  update foo set count=count+1 where index=$1

I'll work on getting the driver to shutdown the database the first time it finds a problem so that autovac doesn't destroy evidence.

I have uploaded the script to reproduce, and a tarball of the data directory (when started, it will go through recovery; table "foo" is in the jjanes database and role).


The row with index=8499 should have count of 8, but really has count of 4, and is only findable by seq scan; there is no such row by index scan.

select ctid,* from foo where index=8499;
select ctid,* from foo where index+0=8499;


select * from heap_page_items(get_raw_page('foo',37)) where lp=248 \x\g\x
Expanded display is on.
-[ RECORD 1 ]---------
lp          | 248
lp_off      | 8160
lp_flags    | 1
lp_len      | 32
t_xmin      | 2
t_xmax      | 0
t_field3    | 0
t_ctid      | (37,248)
t_infomask2 | 32770
t_infomask  | 10496
t_hoff      | 24
t_bits      |
t_oid       |

So the xmax is 0 when it really should not be.

What I really want to do is find the not-visible ctids which would have 8499 for index, but I don't know how to do that.

Cheers,

Jeff

Re: Freezing without write I/O

From
Heikki Linnakangas
Date:
On 18.09.2013 16:22, Andres Freund wrote:
> On 2013-09-16 16:59:28 +0300, Heikki Linnakangas wrote:
>> Here's a rebased version of the patch, including the above-mentioned fixes.
>> Nothing else new.
>
> * We need some higherlevel description of the algorithm somewhere in the
>    source. I don't think I've understood the concept from the patch alone
>    without having read the thread previously.
> * why do we need to do the PageUpdateNeedsFreezing() dance in
>    heap_page_prune? No xids should change during it.

Ah, but its LSN does change. Moving the LSN forward might move the page
from one xid-lsn range to another, requiring any XIDs on the page that
fall outside the xid range of the new xid-lsn range to be frozen.

> * Why can we do a GetOldestXmin(allDbs = false) in
>    BeginXidLSNRangeSwitch()?

Looks like a bug. I think I got the arguments backwards, was supposed to 
be allDbs = true and ignoreVacuum = false. I'm not sure if vacuums could 
be ignored, but 'false' is the safe option.

> * Is there any concrete reasoning behind the current values for
>    XID_LSN_RANGE_INTERVAL and NUM_XID_LSN_RANGES or just gut feeling?

The values in the patch are:

#define NUM_XID_LSN_RANGES             100
#define XID_LSN_RANGE_INTERVAL 1000000

The interval is probably way too small for production, but a small value 
is handy for testing, as you can get through ranges faster. Something 
like 100 million would probably be a more suitable value for production. 
With a smaller interval, you need to freeze more often, while with a 
large interval, you can't truncate the clog as early.

NUM_XID_LSN_RANGES is also quite arbitrary. I don't have a good feeling 
on what the appropriate sizing for that would be. Something like 5 or 10 
would probably be enough. Although at the moment, the patch doesn't 
handle the situation that you run out of slots very well. You could 
merge some old ranges to make room, but ATM, it just stops creating new 
ones.

> * the lsn ranges file can possibly become bigger than 512bytes (the size
>    we assume to be written atomically) and you write it inplace. If we
>    fail halfway through writing, we seem to be able to recover by using
>    the pageMatureLSN from the last checkpoint, but it seems better to
>    do the fsync(),rename(),fsync() dance either way.

The lsn-range array is also written to the xlog in toto whenever it 
changes, so on recovery, the ranges will be read from the WAL, and the 
ranges file will be recreated at the end-of-recovery checkpoint.

> * Should we preemptively freeze tuples on a page in lazy_scan_heap if we
>    already have dirtied the page? That would make future modifications
>    cheaper.

Hmm, dunno. Any future modification will also dirty the page, so you're 
not actually saving any I/O by freezing it earlier. You're just choosing 
to do the CPU work and WAL insertion earlier than you have to. And if 
the page is not modified later, it is a waste of cycles. That said, 
maybe it would indeed be better to do it in vacuum when you have a 
chance to reduce latency in the critical path, even if it might be a waste.

> * lazy_scan_heap now blocks acquiring a cleanup lock on every buffer
>    that contains dead tuples. Shouldn't we use some kind of cutoff xid
>    there? That might block progress too heavily. Also the comment above
>    it still refers to the old logic.

Hmm, you mean like skip the page even if there are some dead tuples on 
it, as long as the dead tuples are not older than X xids? I guess we 
could do that. Or the other thing mentioned in the comments, ie. 
remember the page and come back to it later.

For now though, I think it's good enough as it is.

> * There's no way to force a full table vacuum anymore, that seems
>    problematic to me.

Yeah, it probably would be good to have that. An "ignore visibility map" 
option.

> * I wonder if CheckPointVarsup() doesn't need to update
>    minRecoveryPoint.

Hmm,  I guess it should.

> StartupVarsup() should be ok, because we should only
>    read one from the future during a basebackup?

You might read one from the future during recovery, whether it's crash 
or archive recovery, but as soon as you've replayed up to the min 
recovery point, or the end of WAL on crash recovery, the XID ranges 
array in memory should be consistent with the rest of the system, 
because changes to the array are WAL logged.

> * xidlsnranges_recently[_dirtied] are not obvious on a first glance. Why
>    can't we just reset dirty before the WriteXidLSNRangesFile() call?
>    There's only one process doing the writeout. Just because the
>    checkpointing process could be killed?

Right, if the write fails, you need to retry on the next checkpoint.

> * I think we should either not require consuming a multixactid or use a
>    function that doesn't need MultiXactIdSetOldestMember(). If the
>    transaction doing so lives for long it will unnecessarily prevent
>    truncation of mxacts.

Agreed, that was just a quick kludge to get it working.

> * switchFinishXmin and nextSwitchXid should probably be either volatile
>    or have a compiler barrier between accessing shared memory and
>    checking them. The compiler very well could optimize them away and
>    access shmem all the time which could lead to weird results.

Hmm, that would be a weird "optimization". Is that really legal for the 
compiler to do? Better safe than sorry, I guess.

> * I wonder whether the fact that we're doing the range switches after
>    acquiring an xid could be problematic if we're preventing xid
>    allocation due to the checks earlier in that function?

You mean that you might get into a situation where you cannot assign a 
new XID because you've reached xidStopLimit, and you cannot advance 
xidStopLimit because you can't switch to a new xid-lsn range, because no 
new XIDs are being assigned? Yeah, that would be nasty. The range 
management stuff should really be moved out of GetNewTransactionId(),
maybe do them in walwriter or bgwriter as Alvaro suggested.

> * I think heap_lock_tuple() needs to unset all-visible, otherwise we
>    won't vacuum that page again which can lead to problems since we
>    don't do full-table vacuums again?

It's OK if the page is never vacuumed again, the whole point of the 
patch is that old XIDs can be just left lingering in the table. The next 
time the page is updated, perhaps to lock a tuple again, the page will 
be frozen and the old xmax will be cleared.

> So, I think that's enough for a first look. Will think about general
> issues a bit more.

Thanks!

- Heikki



Re: Freezing without write I/O

From
Andres Freund
Date:
Hi,

On 2013-09-19 14:42:19 +0300, Heikki Linnakangas wrote:
> On 18.09.2013 16:22, Andres Freund wrote:
> >* Why can we do a GetOldestXmin(allDbs = false) in
> >   BeginXidLSNRangeSwitch()?
> 
> Looks like a bug. I think I got the arguments backwards, was supposed to be
> allDbs = true and ignoreVacuum = false. I'm not sure if vacuums could be
> ignored, but 'false' is the safe option.

Not sure either...

> >* the lsn ranges file can possibly become bigger than 512bytes (the size
> >   we assume to be written atomically) and you write it inplace. If we
> >   fail halfway through writing, we seem to be able to recover by using
> >   the pageMatureLSN from the last checkpoint, but it seems better to
> >   do the fsync(),rename(),fsync() dance either way.
> 
> The lsn-range array is also written to the xlog in toto whenever it changes,
> so on recovery, the ranges will be read from the WAL, and the ranges file
> will be recreated at the end-of-recovery checkpoint.

But we will read the file before we read the WAL record covering it,
won't we?

> >* Should we preemptively freeze tuples on a page in lazy_scan_heap if we
> >   already have dirtied the page? That would make future modifications
> >   cheaper.

> Hmm, dunno. Any future modification will also dirty the page, so you're not
> actually saving any I/O by freezing it earlier. You're just choosing to do
> the CPU work and WAL insertion earlier than you have to. And if the page is
> not modified later, it is a waste of cycles. That said, maybe it would
> indeed be better to do it in vacuum when you have a chance to reduce latency
> in the critical path, even if it might be a waste.

Well, if we already dirtied the page (as in vacuum itself, to remove
dead tuples) we already have to do WAL. True it's a separate record, but
that should probably be fixed.

> >* lazy_scan_heap now blocks acquiring a cleanup lock on every buffer
> >   that contains dead tuples. Shouldn't we use some kind of cutoff xid
> >   there? That might block progress too heavily. Also the comment above
> >   it still refers to the old logic.
> 
> Hmm, you mean like skip the page even if there are some dead tuples on it,
> as long as the dead tuples are not older than X xids? I guess we could do
> that. Or the other thing mentioned in the comments, ie. remember the page
> and come back to it later.
>
> For now though, I think it's good enough as it is.

I have seen vacuum being slowed down considerably because we couldn't
acquire a cleanup lock. It's not that infrequent to have pages that are
pinned most of the time. And that was when we only acquired the cleanup
lock when necessary for freezing.

Not processing the page will not mark it as all-visible which basically
is remembering it...

> >* switchFinishXmin and nextSwitchXid should probably be either volatile
> >   or have a compiler barrier between accessing shared memory and
> >   checking them. The compiler very well could optimize them away and
> >   access shmem all the time which could lead to weird results.
> 
> Hmm, that would be a weird "optimization". Is that really legal for the
> compiler to do? Better safe than sorry, I guess.

I think it is. The compiler doesn't know anything about shared memory or
threads, so it doesn't see that those parameters can change. It will be
forced to load them as soon as a non-inlined function is called which
actually might make it safe in this case - unless you compile in LTO
mode where it recognizes that TransactionIdFollowsOrEquals() won't
modify things...

> >* I wonder whether the fact that we're doing the range switches after
> >   acquiring an xid could be problematic if we're preventing xid
> >   allocation due to the checks earlier in that function?
> 
> You mean that you might get into a situation where you cannot assign a new
> XID because you've reached xidStopLimit, and you cannot advance xidStopLimit
> because you can't switch to a new xid-lsn range, because no new XIDs are
> being assigned? Yeah, that would be nasty. The range management stuff should
> really be moved out of GetNewTransactionId(), maybe do them in walwriter or
> bgwriter as Alvaro suggested.

Yes, I think we should do that. I am not sure yet where to put it
best though. Doing it in the wal writer doesn't seem to be a good idea; doing
more there will increase latency for normal backends.

> >* I think heap_lock_tuple() needs to unset all-visible, otherwise we
> >   won't vacuum that page again which can lead to problems since we
> >   don't do full-table vacuums again?
> 
> It's OK if the page is never vacuumed again, the whole point of the patch is
> that old XIDs can be just left lingering in the table. The next time the
> page is updated, perhaps to lock a tuple again, the page will be frozen and
> the old xmax will be cleared.

Yes, you're right, it should be possible to make it work that way. But
currently, we look at xmax and infomask of a tuple in heap_lock_tuple()
*before* the PageUpdateNeedsFreezing() call. Currently we would thus
create a new multixact with long dead xids as members and such.

Or am I missing something?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Andres Freund
Date:
On 2013-09-19 14:40:35 +0200, Andres Freund wrote:
> > >* I think heap_lock_tuple() needs to unset all-visible, otherwise we
> > >   won't vacuum that page again which can lead to problems since we
> > >   don't do full-table vacuums again?
> > 
> > It's OK if the page is never vacuumed again, the whole point of the patch is
> > that old XIDs can be just left lingering in the table. The next time the
> > page is updated, perhaps to lock a tuple again, the page will be frozen and
> > the old xmax will be cleared.
> 
> Yes, you're right, it should be possible to make it work that way. But
> currently, we look at xmax and infomask of a tuple in heap_lock_tuple()
> *before* the PageUpdateNeedsFreezing() call. Currently we would thus
> create a new multixact with long dead xids as members and such.

That reminds me of something:

There are lots of checks sprayed around that unconditionally look at
xmin, xmax or infomask.

Things I noticed in a quick look after thinking about the previous
statement:
* AlterEnum() looks at xmin
* The PLs look at xmin
* So does afair VACUUM FREEZE, COPY and such.
* Several HeapTupleSatisfiesVacuum() callers look at xmin and stuff
  without checking for maturity or freezing the page first:
  * lazy_scan_heap() itself if the page isn't already marked all_visible
  * heap_page_is_all_visible()
* rewrite_heap_tuple() does an unconditional heap_freeze_tuple() without
  considering maturity/passing in Invalid*
* heap_page_prune_opt() shouldn't do anything on a mature (or even all
  visible) page.
* heap_page_prune() marks a buffer dirty, writes a new xid without
  changing the lsn.
* heap_prune_chain() looks at tuples without considering page maturity,
  but the current implementation actually looks safe.
* CheckForSerializableConflictOut() looks at xmins.
* pgrowlocks() looks at xmax unconditionally.
* heap_update() computes, just like heap_lock_tuple(), a new
  xmax/infomask before freezing.
* Same for heap_lock_updated_tuple(). Not sure if that's an actual
  concern, but it might if PageMatureLSN advances or such.
* heap_getsysattr() should probably return different things for mature
  pages.

There's probably more.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Ants Aasma
Date:
On Thu, Sep 19, 2013 at 2:42 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> * switchFinishXmin and nextSwitchXid should probably be either volatile
>>    or have a compiler barrier between accessing shared memory and
>>    checking them. The compiler very well could optimize them away and
>>    access shmem all the time which could lead to weird results.
>
> Hmm, that would be a weird "optimization". Is that really legal for the
> compiler to do? Better safe than sorry, I guess.

C doesn't really have a multi-threaded memory model before C11, so the
compiler makers have just made one up that suits them best. For unlocked
memory reads the compiler is free to assume that no one else is accessing the
variable, so yes that "optimization" would be legal according to their rules.
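
For example, given a plain global, the compiler is entitled to turn

    while (nextSwitchXid != xid)
        ;

into a single load of a cached value (a made-up illustration, not code
from the patch):

    TransactionId tmp = nextSwitchXid;  /* loaded once, kept in a register */
    while (tmp != xid)
        ;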

I'm tackling similar issues in my patch. What I'm thinking is that we should
change our existing supposedly atomic accesses to be more explicit and make
the API compatible with C11 atomics[1]. For now I think the changes should be
something like this:

* Have separate typedefs for atomic variants of datatypes (I don't think we
  have a whole lot of them). With C11 atomics support, this would change to

    typedef _Atomic TransactionId AtomicTransactionId;

* Require that pg_atomic_init(type, var, val) be used to init atomic
  variables. Initially it would just pass through to assignment, as all
  supported platforms can do 32bit atomic ops. C11 compatible compilers will
  delegate to atomic_init().

* Create atomic read and write macros that look something like:

        #define pg_atomic_load(type, var) (*((volatile type*) &(var)))

  and

        #define pg_atomic_store(type, var, val) do { \
            *((volatile type*) &(var)) = (val); \
        } while(0)

  C11 would pass through to the compilers implementation with relaxed memory
  ordering.

* Rename pg_read_barrier()/pg_write_barrier()/pg_memory_barrier() to
  pg_acquire_fence()/pg_release_fence()/pg_acq_rel_fence(). Herb Sutter makes
  a convincing argument why loosening up the barrier semantics is the sane
  choice here. [2] C11 support can then pass through to atomic_thread_fence().

This way we have a relatively future proof baseline for lockfree programming,
with options to expand with other primitives. We could also only do the
volatile access macros part, at least it would make the intention more clear
than the current approach of sprinkling around volatile pointers.
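
Call sites would then look something like this (a sketch, reusing the
nextSwitchXid example from upthread):

    AtomicTransactionId nextSwitchXid;

    pg_atomic_init(TransactionId, nextSwitchXid, InvalidTransactionId);
    ...
    pg_atomic_store(TransactionId, nextSwitchXid, xid);
    ...
    if (TransactionIdFollowsOrEquals(xid,
            pg_atomic_load(TransactionId, nextSwitchXid)))
        ...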

Regards,
Ants Aasma

[1] http://en.cppreference.com/w/c/atomic
[2] (long video about atomics)
http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Freezing without write I/O

From
Robert Haas
Date:
On Thu, Sep 19, 2013 at 6:27 PM, Ants Aasma <ants@cybertec.at> wrote:
> I'm tackling similar issues in my patch. What I'm thinking is that we should
> change our existing supposedly atomic accesses to be more explicit and make
> the API compatible with C11 atomics[1]. For now I think the changes should be
> something like this:
>
> * Have separate typedefs for atomic variants of datatypes (I don't think we
>   have a whole lot of them). With C11 atomics support, this would change to
>
>     typedef _Atomic TransactionId AtomicTransactionId;

What's the point of this?

> * Require that pg_atomic_init(type, var, val) be used to init atomic
>   variables. Initially it would just pass through to assignment, as all
>   supported platforms can do 32bit atomic ops. C11 compatible compilers will
>   delegate to atomic_init().

I don't think this is a bad idea for decoration, but again I'm not
sure how much fundamental value it adds.  If it makes it easier for
people to write code that works in C11 and fails on C89, we lose.

> * Create atomic read and write macros that look something like:
>
>         #define pg_atomic_load(type, var) (*((volatile type*) &(var)))
>
>   and
>
>         #define pg_atomic_store(type, var, val) do { \
>             *((volatile type*) &(var)) = (val); \
>         } while(0)
>
>   C11 would pass through to the compilers implementation with relaxed memory
>   ordering.

Same comment.

> * Rename pg_read_barrier()/pg_write_barrier()/pg_memory_barrier() to
>   pg_acquire_fence()/pg_release_fence()/pg_acq_rel_fence(). Herb Sutter makes
>   a convincing argument why loosening up the barrier semantics is the sane
>   choice here. [2] C11 support can then pass through to atomic_thread_fence().

I am entirely unconvinced that we need this.  Some people use acquire
+ release fences, some people use read + write fences, and there are
all combinations (read-acquire, read-release, read-acquire-release,
write-acquire, write-release, write-acquire-release, both-acquire,
both-release, both-acquire-release).  I think we're going to have
enough trouble deciding between the primitives we already have, let
alone with a whole mess of new ones.  Mistakes will only manifest
themselves on certain platforms and in many cases the races are so
tight that actual failures are very unlikely to be observed in
regression testing.

Personally, I think the biggest change that would help here is to
mandate that spinlock operations serve as compiler fences.  That would
eliminate the need for scads of volatile references all over the
place.
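
For example, today the convention forces code like this (illustrative
names, not real ones):

    volatile XidLSNRangeShared *shared = xidLSNRangeShared;

    SpinLockAcquire(&shared->mutex);
    shared->nextSwitchXid = xid;
    SpinLockRelease(&shared->mutex);

If the spinlock primitives were guaranteed compiler fences, the volatile
qualifier (and the shadow pointer) could go away.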

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From
Andres Freund
Date:
Hi,

I agree with most of what you said - I think that's a little bit too much
change for too little benefit.

On 2013-09-20 08:32:29 -0400, Robert Haas wrote:
> Personally, I think the biggest change that would help here is to
> mandate that spinlock operations serve as compiler fences.  That would
> eliminate the need for scads of volatile references all over the
> place.

They effectively already do, don't they? It's an external, non-inlineable
function call (s_lock, not the actual TAS). And even if it were to get
inlined via LTO optimization, the inline assembler we're using is doing
the __asm__ __volatile__ ("...", "memory") dance. That's a full compiler
barrier.
The non-asm implementations call to OS/compiler primitives that are also
fences.

In the case I brought up here there is no spinlock or something similar.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Robert Haas
Date:
On Fri, Sep 20, 2013 at 8:40 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-09-20 08:32:29 -0400, Robert Haas wrote:
>> Personally, I think the biggest change that would help here is to
>> mandate that spinlock operations serve as compiler fences.  That would
>> eliminate the need for scads of volatile references all over the
>> place.
>
> They effectively already do, don't they? It's an external, non-inlineable
> function call (s_lock, not the actual TAS). And even if it were to get
> inlined via LTO optimization, the inline assembler we're using is doing
> the __asm__ __volatile__ ("...", "memory") dance. That's a full compiler
> barrier.
> The non-asm implementations call to OS/compiler primitives that are also
> fences.
>
> In the case I brought up here there is no spinlock or something similar.

Well, that doesn't match my previous discussions with Tom Lane, or this comment:
 *      Another caution for users of these macros is that it is the caller's
 *      responsibility to ensure that the compiler doesn't re-order accesses
 *      to shared memory to precede the actual lock acquisition, or follow the
 *      lock release.  Typically we handle this by using volatile-qualified
 *      pointers to refer to both the spinlock itself and the shared data
 *      structure being accessed within the spinlocked critical section.
 *      That fixes it because compilers are not allowed to re-order accesses
 *      to volatile objects relative to other such accesses.
 

That's not to disagree with your point.  If TAS is a compiler barrier,
then we oughta be OK.  Unless something can migrate into the spot
between a failed TAS and the call to s_lock?  But I'm pretty sure that
we've repeatedly had to change code to keep things from falling over
in this area, see e.g. commits
fa72121594534dda6cc010f0bf6b3e8d22987526,
07eeb9d109c057854b20707562ce517efa591761,
d3fc362ec2ce1cf095950dddf24061915f836c22, and
584f818bef68450d23d1b75afbaf19febe38fd37 (the last apparently a live
bug).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From
Andres Freund
Date:
On 2013-09-20 08:54:26 -0400, Robert Haas wrote:
> On Fri, Sep 20, 2013 at 8:40 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-09-20 08:32:29 -0400, Robert Haas wrote:
> >> Personally, I think the biggest change that would help here is to
> >> mandate that spinlock operations serve as compiler fences.  That would
> >> eliminate the need for scads of volatile references all over the
> >> place.
> >
> > They effectively already do, don't they? It's an external, non-inlineable
> > function call (s_lock, not the actual TAS). And even if it were to get
> > inlined via LTO optimization, the inline assembler we're using is doing
> > the __asm__ __volatile__ ("...", "memory") dance. That's a full compiler
> > barrier.
> > The non-asm implementations call to OS/compiler primitives that are also
> > fences.
> >
> > In the case I brought up here there is no spinlock or something similar.
> 
> Well, that doesn't match my previous discussions with Tom Lane, or this comment:
> 
>  *      Another caution for users of these macros is that it is the caller's
>  *      responsibility to ensure that the compiler doesn't re-order accesses
>  *      to shared memory to precede the actual lock acquisition, or follow the
>  *      lock release.  Typically we handle this by using volatile-qualified
>  *      pointers to refer to both the spinlock itself and the shared data
>  *      structure being accessed within the spinlocked critical section.
>  *      That fixes it because compilers are not allowed to re-order accesses
>  *      to volatile objects relative to other such accesses.

To me that doesn't really make much sense to be honest. Note that the
next paragraph tells us that

 *    On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
 *    S_UNLOCK() macros must further include hardware-level memory fence
 *    instructions to prevent similar re-ordering at the hardware level.
 *    TAS() and TAS_SPIN() must guarantee that loads and stores issued after
 *    the macro are not executed until the lock has been obtained.  Conversely,
 *    S_UNLOCK() must guarantee that loads and stores issued before the macro
 *    have been executed before the lock is released.
 

so, TAS has to work as a memory barrier if the architecture doesn't have
strong cache coherency guarantees itself. But memory barriers have to be
compiler barriers?

> That's not to disagree with your point.  If TAS is a compiler barrier,
> then we oughta be OK.  Unless something can migrate into the spot
> between a failed TAS and the call to s_lock?

We're talking about compiler barriers, right? In that case failure or
success doesn't play a role, or am I missing something?

> But I'm pretty sure that
> we've repeatedly had to change code to keep things from falling over
> in this area, see e.g. commits

> 584f818bef68450d23d1b75afbaf19febe38fd37 (the last apparently a live
> bug).
> d3fc362ec2ce1cf095950dddf24061915f836c22, and

I've quickly checked those out, and things looked mighty different back
then. And incidentally several of the inline assembly implementations
back then didn't specify that they clobber memory (which is what makes
it a compiler barrier).

> fa72121594534dda6cc010f0bf6b3e8d22987526,

Not sure here. Several of the inline assembly bits where changed to
clobber memory, but not all.

> 07eeb9d109c057854b20707562ce517efa591761,

Hm. Those mostly look to be overly cautious to me.

I think we should go through the various implementations and make sure
they are actual compiler barriers and then change the documented policy.


Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Alvaro Herrera
Date:
Andres Freund escribió:
> Hi,
> 
> On 2013-09-19 14:42:19 +0300, Heikki Linnakangas wrote:
> > On 18.09.2013 16:22, Andres Freund wrote:
> > >* Why can we do a GetOldestXmin(allDbs = false) in
> > >   BeginXidLSNRangeSwitch()?
> > 
> > Looks like a bug. I think I got the arguments backwards, was supposed to be
> > allDbs = true and ignoreVacuum = false. I'm not sure if vacuums could be
> > ignored, but 'false' is the safe option.
> 
> Not sure either...

The "ignoreVacuum" flag specifies to ignore backends running non-full
VACUUM, that is, processes that are known never to generate new Xids,
never obtain Xmin, and never to insert tuples anywhere.  With all these
restrictions, I think it's safe to use ignoreVacuum=true here.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Andres Freund
Date:
On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
> I think we should go through the various implementations and make sure
> they are actual compiler barriers and then change the documented policy.

From a quick look
* S_UNLOCK for PPC isn't a compiler barrier
* S_UNLOCK for MIPS isn't a compiler barrier
* I don't know enough about unixware (do we still support that as a
platform even) to judge
* True64 Alpha I have no clue about
* PA-RISCs tas() might not be a compiler barrier for !GCC
* PA-RISCs S_UNLOCK might not be a compiler barrier
* HP-UX !GCC might not
* IRIX 5 seems to be a compiler barrier
* SINIX - I don't care
* AIX PPC - compiler barrier
* Sun - TAS is implemented in external assembly, normal function call, compiler barrier
* Win(32|64) - compiler barrier
* Generic S_UNLOCK *NOT* necessarily a compiler barrier.

Ok, so I might have been a bit too optimistic...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
didier
Date:
Hi


On Fri, Sep 20, 2013 at 5:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
> I think we should go through the various implementations and make sure
> they are actual compiler barriers and then change the documented policy.

From a quick look
* S_UNLOCK for PPC isn't a compiler barrier
* S_UNLOCK for MIPS isn't a compiler barrier
* I don't know enough about unixware (do we still support that as a
platform even) to judge
* True64 Alpha I have no clue about
* PA-RISCs tas() might not be a compiler barrier for !GCC
* PA-RISCs S_UNLOCK might not be a compiler barrier
* HP-UX !GCC might not
* IRIX 5 seems to be a compiler barrier
* SINIX - I don't care
* AIX PPC - compiler barrier
* Sun - TAS is implemented in external assembly, normal function call,
  compiler barrier
* Win(32|64) - compiler barrier
* Generic S_UNLOCK *NOT* necessarily a compiler barrier.

Ok, so I might have been a bit too optimistic...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
didier
Date:
Hi,

IMO it's a bug if S_UNLOCK is not a compiler barrier.

Moreover for volatile remember:
https://www.securecoding.cert.org/confluence/display/seccode/DCL17-C.+Beware+of+miscompiled+volatile-qualified+variables

Who is double checking compiler output? :)

regards
Didier



On Fri, Sep 20, 2013 at 5:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
> I think we should go through the various implementations and make sure
> they are actual compiler barriers and then change the documented policy.

From a quick look
* S_UNLOCK for PPC isn't a compiler barrier
* S_UNLOCK for MIPS isn't a compiler barrier
* I don't know enough about unixware (do we still support that as a
platform even) to judge
* True64 Alpha I have no clue about
* PA-RISCs tas() might not be a compiler barrier for !GCC
* PA-RISCs S_UNLOCK might not be a compiler barrier
* HP-UX !GCC might not
* IRIX 5 seems to be a compiler barrier
* SINIX - I don't care
* AIX PPC - compiler barrier
* Sun - TAS is implemented in external assembly, normal function call,
  compiler barrier
* Win(32|64) - compiler barrier
* Generic S_UNLOCK *NOT* necessarily a compiler barrier.

Ok, so I might have been a bit too optimistic...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Robert Haas
Date:
On Fri, Sep 20, 2013 at 11:11 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
>> I think we should go through the various implementations and make sure
>> they are actual compiler barriers and then change the documented policy.
>
> From a quick look
> * S_UNLOCK for PPC isn't a compiler barrier
> * S_UNLOCK for MIPS isn't a compiler barrier
> * I don't know enough about unixware (do we still support that as a
> platform even) to judge
> * True64 Alpha I have no clue about
> * PA-RISCs tas() might not be a compiler barrier for !GCC
> * PA-RISCs S_UNLOCK might not be a compiler barrier
> * HP-UX !GCC might not
> * IRIX 5 seems to be a compiler barrier
> * SINIX - I don't care
> * AIX PPC - compiler barrier
> * Sun - TAS is implemented in external assembly, normal function call,
>   compiler barrier
> * Win(32|64) - compiler barrier
> * Generic S_UNLOCK *NOT* necessarily a compiler barrier.
>
> Ok, so I might have been a bit too optimistic...

Yeah, it seems worth fixing, but it's not going to be a 5-minute
project, I fear.

But why do you think that this is not a compiler barrier (PPC):
    __asm__ __volatile__ (" sync \n"); \
    *((volatile slock_t *) (lock)) = 0; \

Surely, the __asm__ __volatile__ must be a compiler barrier, but the
compiler could presumably allow the store to the lock itself to move
forward past other non-volatilized stores during optimization?  Is
that what you're concerned about?  If so, that's easily fixed by
sticking __asm__ __volatile__("":::"memory") in there.  But some of
the other cases are less clear.
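
For concreteness, a minimal sketch of that fix, assuming the gcc
inline-asm form quoted above (the real macro in s_lock.h may be shaped
differently):

    /*
     * Sketch only: "sync" is a full hardware fence, and the "memory"
     * clobber additionally makes the asm a compiler barrier, so other
     * loads/stores cannot be reordered across the lock release.
     */
    #define S_UNLOCK(lock) \
    do \
    { \
        __asm__ __volatile__ (" sync \n" ::: "memory"); \
        *((volatile slock_t *) (lock)) = 0; \
    } while (0)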

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From
Andres Freund
Date:
On 2013-09-23 11:50:05 -0400, Robert Haas wrote:
> On Fri, Sep 20, 2013 at 11:11 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
> >> I think we should go through the various implementations and make sure
> >> they are actual compiler barriers and then change the documented policy.
> >
> > From a quick look
> > * S_UNLOCK for PPC isn't a compiler barrier
> > * S_UNLOCK for MIPS isn't a compiler barrier
> > * I don't know enough about unixware (do we still support that as a
> > platform even) to judge
> > * True64 Alpha I have no clue about
> > * PA-RISCs tas() might not be a compiler barrier for !GCC
> > * PA-RISCs S_UNLOCK might not be a compiler barrier
> > * HP-UX !GCC might not
> > * IRIX 5 seems to be a compiler barrier
> > * SINIX - I don't care
> > * AIX PPC - compiler barrier
> > * Sun - TAS is implemented in external assembly, normal function call,
> >   compiler barrier
> > * Win(32|64) - compiler barrier
> > * Generic S_UNLOCK *NOT* necessarily a compiler barrier.
> >
> > Ok, so I might have been a bit too optimistic...
> 
> Yeah, it seems worth fixing, but it's not going to be a 5-minute
> project, I fear.

Yea :(. I think we should start by trimming the above list by removing
some platforms:
* SINIX - doesn't actually seem to be supported
* Tru64 - not even a zombie anymore
* IRIX - ...

The others probably can't be removed?

> But why do you think that this is not a compiler barrier (PPC):
> 
>     __asm__ __volatile__ (" sync \n"); \
>     *((volatile slock_t *) (lock)) = 0; \
> 
> Surely, the __asm__ __volatile__ must be a compiler barrier, but the
> compiler could presumably allow the store to the lock itself to move
> forward past other non-volatilized stores during optimization?  Is
> that what you're concerned about?  If so, that's easily fixed by
> sticking __asm__ __volatile__("":::"memory") in there.

Yes, the memory clobber is missing for PPC and MIPS.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Andres Freund
Date:
Just some notes, before I forget them again.

On 2013-09-23 11:50:05 -0400, Robert Haas wrote:
> On Fri, Sep 20, 2013 at 11:11 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
> >> I think we should go through the various implementations and make sure
> >> they are actual compiler barriers and then change the documented policy.
> >
> > From a quick look
> > * S_UNLOCK for PPC isn't a compiler barrier
> > * S_UNLOCK for MIPS isn't a compiler barrier

Needs a memory clobber; looks easy enough given that only gcc seems to
be supported.

> > * I don't know enough about unixware (do we still support that as a
> > platform even) to judge

Looks like another thing to remove?

> > * PA-RISCs tas() might not be a compiler barrier for !GCC
> > * PA-RISCs S_UNLOCK might not be a compiler barrier

http://h21007.www2.hp.com/portal/download/files/unprot/ddk/mem_ordering_pa_ia.pdf

_MF(); seems to work on both PA-RISC and itanic.

> > * HP-UX !GCC might not

The document linked from the source code explains how to implement it:
http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf

#define COMP_FENCE _Asm_fence(_UP_MEM_FENCE|_DOWN_MEM_FENCE)
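
A hedged sketch of where that fence would sit in the unlock path (the
do/while shape is an assumption, and on Itanium a hardware fence such
as the _Asm_mf() intrinsic would still be wanted alongside the compiler
fence):

    #define S_UNLOCK(lock) \
    do \
    { \
        COMP_FENCE;  /* compiler fence from above */ \
        _Asm_mf();   /* hardware memory fence (aCC intrinsic) */ \
        *((volatile slock_t *) (lock)) = 0; \
    } while (0)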

> > * Generic S_UNLOCK *NOT* necessarily a compiler barrier.

Not sure how to handle that one.


Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Heikki Linnakangas
Date:
On 19.09.2013 16:24, Andres Freund wrote:
> On 2013-09-19 14:40:35 +0200, Andres Freund wrote:
>>>> * I think heap_lock_tuple() needs to unset all-visible, otherwise we
>>>>    won't vacuum that page again which can lead to problems since we
>>>>    don't do full-table vacuums again?
>>>
>>> It's OK if the page is never vacuumed again, the whole point of the patch is
>>> that old XIDs can be just left lingering in the table. The next time the
>>> page is updated, perhaps to lock a tuple again, the page will be frozen and
>>> the old xmax will be cleared.
>>
>> Yes, you're right, it should be possible to make it work that way. But
>> currently, we look at xmax and infomask of a tuple in heap_lock_tuple()
>> *before* the PageUpdateNeedsFreezing() call. Currently we would thus
>> create a new multixact with long dead xids as members and such.

Ah, I see.

> That reminds me of something:
>
> There are lots of checks sprayed around that unconditionally look at
> xmin, xmax or infomask.
>
> Things I noticed in a quick look after thinking about the previous
> statement:
> * AlterEnum() looks at xmin
> * The PLs look at xmin
> * So does afair VACUUM FREEZE, COPY and such.
> * Several HeapTupleSatisfiesVacuum() callers look at xmin and stuff
>    without checking for maturity or freezing the page first.
>    * lazy_scan_heap() itself if the page isn't already marked all_visible
>    * heap_page_is_all_visible()
> * rewrite_heap_tuple() does an unconditional heap_freeze_tuple() without
>    considering maturity/passing in Invalid*
> * heap_page_prune_opt() shouldn't do anything on a mature (or even all
>    visible) page.
> * heap_page_prune() marks a buffer dirty, writes a new xid without
>    changing the lsn.
> * heap_prune_chain() looks at tuples without considering page maturity,
>    but the current implementation actually looks safe.
> * CheckForSerializableConflictOut() looks at xmins.
> * pgrowlocks() looks at xmax unconditionally.
> * heap_update() computes, just like heap_lock_tuple(), a new
>    xmax/infomask before freezing.
> * Same for heap_lock_updated_tuple(). Not sure if that's an actual
>    concern, but it might if PageMatureLSN advances or such.
> * heap_getsysattr() should probably return different things for mature
>    pages.
>
> There's probably more.

Hmm, some of those are trivial, but others, like rewrite_heap_tuple(), are
currently only passed the HeapTuple, with no reference to the page where
the tuple came from. Instead of changing code to always pass that along
with a tuple, I think we should add a boolean to HeapTupleData, to
indicate if the tuple came from a mature page. If the flag is set, the
tuple should be considered visible to everyone, without looking at the
xmin/xmax. This might affect extensions, although usually external C
code that has to deal with HeapTuples will use functions like
heap_form_tuple to do so, and won't need to deal with low-level stuff or
visibility themselves.
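
Roughly, the shape of that change would be as below; the field name
t_frozen is made up here, the patch may well call it something else:

    typedef struct HeapTupleData
    {
        uint32          t_len;        /* length of *t_data */
        ItemPointerData t_self;       /* SelfItemPointer */
        Oid             t_tableOid;   /* table the tuple came from */
        bool            t_frozen;     /* sketch: tuple came from a mature
                                       * page, so it is visible to everyone
                                       * regardless of xmin/xmax */
        HeapTupleHeader t_data;       /* -> tuple header and data */
    } HeapTupleData;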

Attached is a new version, which adds that field to HeapTupleData. Most
of the issues you listed above have been fixed, plus a bunch of other
bugs I found myself. The bug that Jeff ran into with his count.pl script
has also been fixed.

- Heikki

Attachment

Re: Freezing without write I/O

From
Heikki Linnakangas
Date:
On 18.09.2013 22:55, Jeff Janes wrote:
> On Mon, Sep 16, 2013 at 6:59 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Here's a rebased version of the patch, including the above-mentioned
>> fixes. Nothing else new.
>
> I've applied this to 0892ecbc015930d, the last commit to which it applies
> cleanly.
>
> When I test this by repeatedly incrementing a counter in a randomly chosen
> row, then querying the whole table and comparing the results to what my
> driver knows they should be, I get discrepancies.

Ok, I found the bug. The problem was that when a HOT chain begins with a 
dead tuple, when the page was frozen, the dead tuple was not removed, 
but the xmin of the live tuple in the chain was replaced with FrozenXid. 
That breaks the HOT-chain following code, which checks that the xmin of 
the next tuple in the chain matches the xmax of the previous tuple.
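
For readers unfamiliar with that code, the check in question looks
roughly like this (simplified from heap_hot_search_buffer):

    /*
     * When stepping along a HOT chain, the next tuple's xmin must match
     * the previous tuple's xmax, otherwise the chain is considered
     * broken.  Replacing the live tuple's xmin with FrozenXid makes
     * this test fail.
     */
    if (TransactionIdIsValid(prev_xmax) &&
        !TransactionIdEquals(prev_xmax,
                             HeapTupleHeaderGetXmin(heapTuple->t_data)))
        break;          /* not part of the same HOT chain */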

I fixed that by simply not freezing a page which contains any dead 
tuples. That's OK because the page will be visited by vacuum before it 
becomes old enough to be "mature". However, it means that the invariant 
that a page can only contain XIDs within one XID-LSN range, determined 
by the LSN, is no longer true. AFAICS everything still works, but I 
would feel more comfortable if we could uphold that invariant, for 
debugging reasons if nothing else. Will have to give that some more 
thought..

Thanks for the testing! I just posted an updated version of the patch 
elsewhere in this thread.

- Heikki



Re: Freezing without write I/O

From
Peter Eisentraut
Date:
On 9/25/13 5:31 AM, Heikki Linnakangas wrote:
> Attached is a new version, which adds that field to HeapTupleData. Most
>> of the issues you listed above have been fixed, plus a bunch of other
> bugs I found myself. The bug that Jeff ran into with his count.pl script
> has also been fixed.

This patch doesn't apply at all:

Reversed (or previously applied) patch detected!  Assuming -R.
1 out of 39 hunks FAILED -- saving rejects to file
src/backend/access/heap/heapam.c.rej
5 out of 9 hunks FAILED -- saving rejects to file
src/backend/access/heap/rewriteheap.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
10 out of 12 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlog.c.rej
13 out of 18 hunks FAILED -- saving rejects to file
src/backend/commands/cluster.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
15 out of 17 hunks FAILED -- saving rejects to file
src/backend/commands/vacuum.c.rej
5 out of 24 hunks FAILED -- saving rejects to file
src/backend/commands/vacuumlazy.c.rej
3 out of 9 hunks FAILED -- saving rejects to file
src/backend/postmaster/autovacuum.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
1 out of 4 hunks FAILED -- saving rejects to file
src/backend/utils/misc/guc.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
1 out of 2 hunks FAILED -- saving rejects to file
src/include/commands/cluster.h.rej
1 out of 1 hunk FAILED -- saving rejects to file
src/include/commands/vacuum.h.rej
Reversed (or previously applied) patch detected!  Assuming -R.



Re: Freezing without write I/O

From
Heikki Linnakangas
Date:
On 25.09.2013 15:48, Peter Eisentraut wrote:
> On 9/25/13 5:31 AM, Heikki Linnakangas wrote:
>> Attached is a new version, which adds that field to HeapTupleData. Most
>> of the issues you listed above have been fixed, plus a bunch of other
>> bugs I found myself. The bug that Jeff ran into with his count.pl script
>> has also been fixed.
>
> This patch doesn't apply at all:

Huh, that's strange. Merged with latest changes from master, here's a
new version that applies.

- Heikki

Attachment

Re: Freezing without write I/O

From
Ants Aasma
Date:
Just found this in my drafts folder. Sorry for the late response.

On Fri, Sep 20, 2013 at 3:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I am entirely unconvinced that we need this.  Some people use acquire
> + release fences, some people use read + write fences, and there are
> all combinations (read-acquire, read-release, read-acquire-release,
> write-acquire, write-release, write-acquire-release, both-acquire,
> both-release, both-acquire-release).  I think we're going to have
> enough trouble deciding between the primitives we already have, let
> alone with a whole mess of new ones.  Mistakes will only manifest
> themselves on certain platforms and in many cases the races are so
> tight that actual failures are very unlikely to be reserved in
> regression testing.

I have to retract my proposal to try to emulate C11 atomics in C89. I
guess this goes to show why one shouldn't try to conjure up atomic
APIs at 2AM. I forgot the fact that while acquire-release semantics
are enough to ensure sequentially consistent behavior, the actual
barriers need to be paired with specific atomic accesses to achieve
that. It's not possible to use freestanding acquire/release barriers
to implement this, nor would it be possible to include barriers in
the atomic accesses themselves without causing significant
pessimization.

Sequential consistency (along with the causal transitivity and total
store ordering that come with it) should be regarded as a goal. I'm
not able to reason about concurrent programs without those guarantees,
and I suspect no one else is either. Sequential consistency is
guaranteed if atomic variables (including locks) are accessed with
appropriate acquire and release semantics. We just need to use a
hodge-podge of read/write/full barriers and volatile memory accesses
to actually implement the semantics until some distant future date
where we can start relying on compilers getting it right.

I still think we should have a macro for the volatile memory accesses.
As a rule, each one of those needs a memory barrier, and if we
consolidate them, or optimize them out, the considerations why this is
safe should be explained in a comment. Race-prone memory accesses
should stick out like a sore thumb.

> Personally, I think the biggest change that would help here is to
> mandate that spinlock operations serve as compiler fences.  That would
> eliminate the need for scads of volatile references all over the
> place.

+1. The commits that you showed fixing issues in this area show quite
well why this is a good idea.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Freezing without write I/O

From
Andres Freund
Date:
On 2013-10-01 04:47:42 +0300, Ants Aasma wrote:
> I still think we should have a macro for the volatile memory accesses.
> As a rule, each one of those needs a memory barrier, and if we
> consolidate them, or optimize them out, the considerations why this is
> safe should be explained in a comment. Race prone memory accesses
> should stick out like a sore thumb.

Agreed. The "wait free LW_SHARED" thing[1] I posted recently had a simple

#define pg_atomic_read(atomic) (*(volatile uint32 *)&(atomic))

That should be sufficient and easily greppable, right?
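
As a usage sketch (struct and field names invented), the gain is that
unsynchronized reads become visually distinct and greppable:

    /* before: nothing marks this load as racing with other backends */
    nwaiters = shared->nwaiters;

    /* after: the racy access is explicit */
    nwaiters = pg_atomic_read(shared->nwaiters);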

I generally think we need to provide some primitives for doing atomic
stuff. There's lots of stuff that's not easy to accelerate further without them.

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Ants Aasma
Date:
On Tue, Oct 1, 2013 at 2:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Agreed. The "wait free LW_SHARED" thing[1] I posted recently had a simple
>
> #define pg_atomic_read(atomic) (*(volatile uint32 *)&(atomic))
>
> That should be sufficient and easily greppable, right?

Looks good enough for me. I would consider using a naming scheme that
accounts for possible future uint64 atomics.
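
For instance, something along these lines (a sketch of the spelling
only, not a worked-out proposal):

    #define pg_atomic_read_u32(atomic) (*(volatile uint32 *) &(atomic))
    #define pg_atomic_read_u64(atomic) (*(volatile uint64 *) &(atomic))

    /*
     * Note: a plain volatile load is a single atomic access for uint64
     * only on platforms with native 64-bit loads; 32-bit platforms
     * would need something more for the u64 variant.
     */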

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Freezing without write I/O

From
Andres Freund
Date:
On 2013-09-25 12:31:20 +0300, Heikki Linnakangas wrote:
> Hmm, some of those are trivial, but others, like rewrite_heap_tuple(), are
> currently only passed the HeapTuple, with no reference to the page where the
> tuple came from. Instead of changing code to always pass that along with a
> tuple, I think we should add a boolean to HeapTupleData, to indicate if the
> tuple came from a mature page. If the flag is set, the tuple should be
> considered visible to everyone, without looking at the xmin/xmax. This might
> affect extensions, although usually external C code that has to deal with
> HeapTuples will use functions like heap_form_tuple to do so, and won't need
> to deal with low-level stuff or visibility themselves.
> 
> Attached is a new version, which adds that field to HeapTupleData. Most of
> the issues you listed above have been fixed, plus a bunch of other bugs I
> found myself. The bug that Jeff ran into with his count.pl script has also
> been fixed.

This seems a bit hacky to me. Change a widely used struct because a few
functions don't get passed enough information? Visibility checks are
properly done with a buffer passed along; that some places have cut
corners around that doesn't mean we have to continue to do so. The
pullups around rs_pagemature are imo indicative of this (there's even a
bug because a page can be all visible before it's mature, right? That
could then cause an assert somewhere down the line when we check page
and tuple are coherent).

Ok, making another scan through this:
* Why can we do PageSetLSN(page, RangeSwitchLSN) in heap_* when the action
  doesn't need WAL? And why is that correct? I can see doing that in vacuum
  itself, but doing it when we write *new* data to the page?
* heap_freeze_page(): the PageSetLSN(page, RangeSwitchLSN) done when there's
  nothing to log is itself not WAL-logged. Which means that if we crash it
  won't necessarily be set, so the VM and the heap LSN might get out of
  sync. That probably doesn't have any bad effects, but imo deserves a
  comment.
* heap_freeze_page(): PageUpdateNeedsFreezing() can now be true both before
  and after. That's a bit confusing :/
* GetSafeClogTruncationPoint truncates the xid-lsn ranges, but there's also
  an uncommented TruncateXidLSNRanges. At least how they work together
  should be described better.
* There are quite some FIXMEs around.
* Let's move the xid-lsn ranges code from GetNewTransactionId() into its
  own function.
* PageMatureLSN and RangeSwitchLSN should be documented somewhere. They are,
  implicitly, but they are used widely enough that that doesn't seem
  sufficient.
* pg_controldata should output pageMatureLSN.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Jeff Davis
Date:
On Wed, 2013-09-25 at 12:31 +0300, Heikki Linnakangas wrote:
> On 19.09.2013 16:24, Andres Freund wrote:
> ...
> > There's probably more.

I think _bt_check_unique is also a problem.

> Hmm, some of those are trivial, but others, like rewrite_heap_tuple(), are 
> currently only passed the HeapTuple, with no reference to the page where 
> the tuple came from. Instead of changing code to always pass that along 
> with a tuple, I think we should add a boolean to HeapTupleData, to 
> indicate if the tuple came from a mature page. If the flag is set, the 
> tuple should be considered visible to everyone, without looking at the 
> xmin/xmax. This might affect extensions, although usually external C 
> code that has to deal with HeapTuples will use functions like 
> heap_form_tuple to do so, and won't need to deal with low-level stuff or 
> visibility themselves.

How bad would the code be to just pass along the buffer when it's
needed? After looking around, it doesn't seem necessarily worse than
adding the struct field (and I agree with Andres that it seems more
proper to pass the buffer along).

I also have a separate question about this patch:

It seems like this gets us closer to 64-bit transaction IDs. Can we get
some help from the compiler to distinguish the two cases in a less
error-prone way? Is it worth thinking about making 64-bit the norm, and
32-bit an optimization in certain places (e.g. the heap and maybe the
xip array of a snapshot)?
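
One way to get that kind of compiler help (all names below are
hypothetical) is to make the 64-bit variant a distinct struct type, so
the two widths can't be mixed without an explicit conversion:

    typedef uint32 TransactionId;   /* the existing 32-bit xid */

    typedef struct
    {
        uint64      value;          /* epoch in the high 32 bits */
    } TransactionId64;              /* distinct type: cannot be assigned
                                     * to or from TransactionId by
                                     * accident */

    /* narrowing must be explicit, so the lossy step is easy to audit */
    static inline TransactionId
    XidFromTransactionId64(TransactionId64 fxid)
    {
        return (TransactionId) (fxid.value & 0xFFFFFFFF);
    }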

Regards,
Jeff Davis





Re: Freezing without write I/O

From
Peter Geoghegan
Date:
Shouldn't this patch be in the January commitfest?

-- 
Peter Geoghegan



Re: Freezing without write I/O

From
Andres Freund
Date:
On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
> Shouldn't this patch be in the January commitfest?

I think we previously concluded that there wasn't much chance to get
this into 9.4 and there's significant work to be done on the patch
before new reviews are required, so not submitting it imo makes sense.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Simon Riggs
Date:
On 26 January 2014 12:58, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
>> Shouldn't this patch be in the January commitfest?
>
> I think we previously concluded that there wasn't much chance to get
> this into 9.4 and there's significant work to be done on the patch
> before new reviews are required, so not submitting it imo makes sense.

I think we should make this a priority feature for 9.5

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Freezing without write I/O

From
Robert Haas
Date:
On Mon, Jan 27, 2014 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 26 January 2014 12:58, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
>>> Shouldn't this patch be in the January commitfest?
>>
>> I think we previously concluded that there wasn't much chance to get
>> this into 9.4 and there's significant work to be done on the patch
>> before new reviews are required, so not submitting it imo makes sense.
>
> I think we should make this a priority feature for 9.5

+1.  I can't think of many things we might do that would be more important.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freezing without write I/O

From
Jeff Janes
Date:
On Monday, January 27, 2014, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jan 27, 2014 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 26 January 2014 12:58, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
>>> Shouldn't this patch be in the January commitfest?
>>
>> I think we previously concluded that there wasn't much chance to get
>> this into 9.4 and there's significant work to be done on the patch
>> before new reviews are required, so not submitting it imo makes sense.
>
> I think we should make this a priority feature for 9.5

+1.  I can't think of many things we might do that would be more important.


Can anyone guess how likely this approach is to make it into 9.5?  I've been pondering some incremental improvements over what we have now, but if this revolutionary approach has a high chance of landing then any work on incremental improvements would be pointless.

Thanks,

Jeff

Re: Freezing without write I/O

From
David Fetter
Date:
On Wed, May 14, 2014 at 05:46:49PM -0700, Jeff Janes wrote:
> On Monday, January 27, 2014, Robert Haas <robertmhaas@gmail.com> wrote:
> 
> > On Mon, Jan 27, 2014 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com>
> > wrote:
> > > On 26 January 2014 12:58, Andres Freund <andres@2ndquadrant.com>
> > > wrote:
> > >> On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
> > >>> Shouldn't this patch be in the January commitfest?
> > >>
> > >> I think we previously concluded that there wasn't much chance to get
> > >> this into 9.4 and there's significant work to be done on the patch
> > >> before new reviews are required, so not submitting it imo makes sense.
> > >
> > > I think we should make this a priority feature for 9.5
> >
> > +1.  I can't think of many things we might do that would be more important.
> >
> Can anyone guess how likely this approach is to make it into 9.5?
> I've been pondering some incremental improvements over what we have
> now, but if this revolutionary approach has a high chance of landing
> then any work on incremental improvements would be pointless.

How much work do you anticipate for the description and PoC of
whatever it is you've been pondering?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Freezing without write I/O

From
Robert Haas
Date:
On Wed, May 14, 2014 at 8:46 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> +1.  I can't think of many things we might do that would be more
>> important.
>
> Can anyone guess how likely this approach is to make it into 9.5?  I've been
> pondering some incremental improvements over what we have now, but if this
> revolutionary approach has a high chance of landing then any work on
> incremental improvements would be pointless.

Well, Heikki was saying on another thread that he had kind of gotten
cold feet about this, so I gather he's not planning to pursue it.  Not
sure if I understood that correctly.  If so, I guess it depends on
whether someone else can pick it up, but we might first want to
establish why he got cold feet and how worrying those problems seem to
other people.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company