Thread: crash-safe visibility map, take four

crash-safe visibility map, take four

From
Robert Haas
Date:
On Wed, Dec 1, 2010 at 11:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> As far as I can tell, there are basically two viable solutions on the
> table here.
>
> 1. Every time we observe a page as all-visible, (a) set the
> PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the
> bit in the visibility map page, bumping the LSN as usual, and (c) emit
> a WAL record indicating the relation and block number.  On redo of
> this record, set both the page-level bit and the visibility map bit.
> The heap page may hit the disk before the WAL record, but that's OK;
> it just might result in a little extra work until some subsequent
> operation gets the visibility map bit set.  The visibility map page
> may hit the disk before the heap page, but that's OK too, because
> the WAL record will already be on disk due to the LSN interlock.  If a
> crash occurs before the heap page is flushed, redo will fix the heap
> page.  (The heap page will get flushed as part of the next checkpoint,
> if not sooner, so by the time the redo pointer advances past the WAL
> record, there's no longer a risk.)
>
> 2. Every time we observe a page as all-visible, (a) set the
> PD_ALL_VISIBLE bit on the page, without bumping the LSN, (b) set the
> bit in the visibility map page, bumping the LSN if a WAL record is
> issued (which only happens sometimes, read on), and (c) emit a WAL
> record indicating the "chunk" of 128 visibility map bits which
> contains the bit we just set - but only if we're now dealing with a
> new group of 128 visibility map bits or if a checkpoint has intervened
> since the last such record we emitted.  On redo of this record, clear
> the visibility map bits in each chunk.  The heap page may hit the disk
> before the WAL record, but that's OK for the same reasons as in plan
> #1.  The visibility map page may hit the disk before the heap page,
> but that's OK too, because the WAL record will already be on disk due
> to the LSN interlock.  If a crash occurs before the heap page makes
> it to disk, then redo will clear the visibility map bits, leaving them
> to be reset by a subsequent VACUUM.

I took a crack at implementing the first approach described above,
which seems to be by far the simplest idea we've come up with to date.
 Patch attached.  It doesn't seem to be that complicated, which could
mean either that it's not that complicated or that I'm missing
something.  Feel free to point and snicker in the latter case.
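
To make the ordering in approach #1 concrete, here is a rough backend-style
sketch (illustrative only, not the actual patch; the two helpers marked
hypothetical are invented names):

    /*
     * Illustrative sketch of approach #1 only -- not the patch itself.
     * log_heap_allvisible() and visibilitymap_set_bit() are hypothetical
     * names invented for this sketch.
     */
    static void
    mark_all_visible(Relation rel, Buffer heapbuf, Buffer vmbuf,
                     BlockNumber blkno)
    {
        Page        heappage = BufferGetPage(heapbuf);
        XLogRecPtr  recptr;

        /* (a) set PD_ALL_VISIBLE on the heap page, but do NOT bump its LSN,
         * so the heap page may reach disk before the WAL record */
        PageSetAllVisible(heappage);
        MarkBufferDirty(heapbuf);

        /* (c) emit a WAL record naming the relation and block number
         * (done before (b) because (b) needs the record's LSN) */
        recptr = log_heap_allvisible(rel, blkno);       /* hypothetical */

        /* (b) set the bit in the visibility map page and bump that page's
         * LSN, so the VM page cannot hit disk before the WAL record */
        visibilitymap_set_bit(vmbuf, blkno, recptr);    /* hypothetical */
    }

On redo, the record would set both the page-level bit and the map bit, as
described in (1) above.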

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take four

From
Gokulakannan Somasundaram
Date:
> I took a crack at implementing the first approach described above,
> which seems to be by far the simplest idea we've come up with to date.
>  Patch attached.  It doesn't seem to be that complicated, which could
> mean either that it's not that complicated or that I'm missing
> something.  Feel free to point and snicker in the latter case.

Hi,
    I suppose the problem is not with setting the bit, but with resetting the bit. Has that been completed already?

Thanks. 

Re: crash-safe visibility map, take four

From
Robert Haas
Date:
On Tue, Mar 22, 2011 at 11:59 PM, Gokulakannan Somasundaram
<gokul007@gmail.com> wrote:
>> I took a crack at implementing the first approach described above,
>> which seems to be by far the simplest idea we've come up with to date.
>>  Patch attached.  It doesn't seem to be that complicated, which could
>> mean either that it's not that complicated or that I'm missing
>> something.  Feel free to point and snicker in the latter case.
>>
> Hi,
>     I suppose the problem is not with setting the bit, but resetting the
> bit. Has that been completed already?
> Thanks.

All operations that clear the bit are already WAL-logged.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take four

From
Jesper Krogh
Date:
On 2011-03-22 21:43, Robert Haas wrote:
> I took a crack at implementing the first approach described above,
> which seems to be by far the simplest idea we've come up with to date.
>   Patch attached.  It doesn't seem to be that complicated, which could
> mean either that it's not that complicated or that I'm missing
> something.  Feel free to point and snicker in the latter case.

Looks simple, but there is no benefit on the usage side in the patch,
so it isn't really "testable" yet? I would love to spend some time testing
when it's doable (even with rough corners).

I'm still a bit puzzled about how it would end up working with a
page-level visibility map bit for index scans. There is a clear drop-off
in usability when the change rate of the table goes up, which may or may
not be relevant, but I cannot really judge, since I haven't even got a
ballpark figure for how much table churn would disable, say, 50% of the
usage.

= Really naive suggestion approaching =
Another layout might be to simply drag out t_xmin, t_xmax per row (8
bytes) into a table by itself. This table will be way bigger than the
one-bit-per-page map, but could be WAL-logged like any other change in
the system?

It would, by definition, make the visibility testing work (way faster
than today), no matter how fast the underlying table changes.

State of today (PG 8.4) is that a query like this:

testdb=# select count(id) from testdb.seq where fts @@ to_tsquery('tag');
 count
-------
 69753
(1 row)

Time: 5863.600 ms
testdb=# select count(id) from testdb.seq where fts @@ to_tsquery('tag');
 count
-------
 69753
(1 row)

Time: 659.832 ms
testdb=# select count(id) from testdb.seq where fts @@ to_tsquery('tag');
 count
-------
 69753
(1 row)

Time: 1005.765 ms

Somewhere around 15ns / tuple (not bad at all).
(the first was probably "half warm")

The "average" rows per tuple is somewhere between 4 and 8 for this 
table, assuming
8 and that the 69K are randomly distributed among the 16M other tuples 
(fair assumption
in this case). The 600-1000ms for the fresh cache run are the timing to 
drag:
69753*8192 (page size) = 571MB into memory for visibillity testing 
alone, on warm cache
all pages being in main memory. Packing 16M tuples with 8 bytes / tuple 
in a map would be
around 128MB.

Given 8 bytes/row and random distribution of data, that would require us
to read all 128 MB, so a speedup of roughly 4x on this example, but it
would also let us count the entire table in roughly the same time.
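
For reference, the arithmetic above can be checked with a trivial
standalone C program (figures taken straight from this mail; "MB" here
means 10^6 bytes):

    #include <stdio.h>

    int main(void)
    {
        double matching_rows = 69753;    /* rows returned by the query */
        double page_size     = 8192;     /* bytes per heap page */
        double total_tuples  = 16e6;     /* tuples in the table */
        double bytes_per_row = 8;        /* t_xmin + t_xmax */

        /* worst case: every matching row sits on its own heap page */
        printf("heap pages touched for visibility: ~%.0f MB\n",
               matching_rows * page_size / 1e6);            /* ~571 MB */
        printf("separate xmin/xmax map:            ~%.0f MB\n",
               total_tuples * bytes_per_row / 1e6);         /* ~128 MB */
        printf("ratio: ~%.1fx\n",                           /* ~4.5x */
               (matching_rows * page_size) / (total_tuples * bytes_per_row));
        return 0;
    }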

With regard to disk vs. memory hotness: those 128 MB compare to a table
size of 32 GB (with a TOAST table next to it of 64 GB), but that
shouldn't be touched by the above query.

The ns/tuple number (today) on a "thin" table in my system is 
approaching 1ns / tuple.

If the page-level bit would get set again "quite fast" on a fairly busy
system anyway, then the above is just noise, but I currently have no feel
for it, and there is some math in there I have trouble putting reliable
ballpark numbers on.

By all approaches, there is room for significant improvement in
visibility testing for a huge range of installations.

Can I drag out numbers on the "frozenness" of tuples from my current
systems to fill in the discussion? (How?)

Jesper
-- 
Jesper


Re: crash-safe visibility map, take four

From
Gokulakannan Somasundaram
Date:

> All operations that clear the bit are already WAL-logged.

Is it the case with visibility map also?

Thanks. 

Re: crash-safe visibility map, take four

From
Robert Haas
Date:
On Wed, Mar 23, 2011 at 2:29 AM, Gokulakannan Somasundaram
<gokul007@gmail.com> wrote:
>> All operations that clear the bit are already WAL-logged.
>>
> Is it the case with visibility map also?
> Thanks.

Yes.  Look at the comment that the patch removes.  That describes the
problem being fixed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take four

From
Robert Haas
Date:
On Wed, Mar 23, 2011 at 2:16 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> On 2011-03-22 21:43, Robert Haas wrote:
>>
>> I took a crack at implementing the first approach described above,
>> which seems to be by far the simplest idea we've come up with to date.
>>  Patch attached.  It doesn't seem to be that complicated, which could
>> mean either that it's not that complicated or that I'm missing
>> something.  Feel free to point and snicker in the latter case.
>
> Looks simple, but there is no benefit on the usage side in the patch,
> so it isn't really "testable" yet? I would love to spend some time testing
> when it's doable (even with rough corners).

What it probably needs right now is some crash testing - insert a
database panic at various points in the code and then check whether
the state after recovery is still OK.  Also some code review from
people who understand recovery better than me.  *waves to Heikki*

There's a lot more work that will have to be done before this starts
to produce user-visible performance benefits, and then a lot more work
after that before we've exhausted all the possibilities.  I can't cope
with all that right now.  This is basic infrastructure that will
eventually enable a variety of cool stuff, but isn't particularly sexy
by itself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take four

From
Merlin Moncure
Date:
On Wed, Mar 23, 2011 at 1:16 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> On 2011-03-22 21:43, Robert Haas wrote:
>>
>> I took a crack at implementing the first approach described above,
>> which seems to be by far the simplest idea we've come up with to date.
>>  Patch attached.  It doesn't seem to be that complicated, which could
>> mean either that it's not that complicated or that I'm missing
>> something.  Feel free to point and snicker in the latter case.
>
> Looks simple, but there is no benefit on the usage side in the patch,
> so it isn't really "testable" yet? I would love to spend some time testing
> when it's doable (even with rough corners).
>
> I'm still a bit puzzled about how it would end up working with a page-level
> visibility map bit for index scans. There is a clear drop-off in usability
> when the change rate of the table goes up, which may or may not be
> relevant, but I cannot really judge, since I haven't even got a ballpark
> figure for how much table churn would disable, say, 50% of the usage.

How much benefit you are going to get is going to be really
workload-dependent.  In a lot of cases the distribution of writes is
going to be really non-uniform, so that a small percentage of records
gets the majority of the writes across the database generally.  A
reliable PD_ALL_VISIBLE opens the door to optimizing around this
pattern, which I'd estimate the vast majority of databases follow to
varying degrees.

It's really hard to overemphasize how important, in performance terms,
the features that mitigate the relative downsides of our MVCC
implementation are.  The HOT feature in 8.3 was an absolute breakthrough
in terms of Postgres performance, and I expect this will open similar
doors.

merlin


Re: crash-safe visibility map, take four

From
Gokulakannan Somasundaram
Date:
Yeah, I looked at it. I don't think it addresses the problem raised here:
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php

Or maybe I am missing something.

Thanks.

On Wed, Mar 23, 2011 at 7:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 23, 2011 at 2:29 AM, Gokulakannan Somasundaram
> <gokul007@gmail.com> wrote:
>>> All operations that clear the bit are already WAL-logged.
>>>
>> Is it the case with visibility map also?
>> Thanks.
>
> Yes.  Look at the comment that the patch removes.  That describes the
> problem being fixed.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Re: crash-safe visibility map, take four

From
Robert Haas
Date:
On Thu, Mar 24, 2011 at 12:56 AM, Gokulakannan Somasundaram
<gokul007@gmail.com> wrote:
> Yeah, I looked at it. I don't think it addresses the problem raised here:
> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
> Or maybe I am missing something.

Yeah, I think you're right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take four

From
Simon Riggs
Date:
On Wed, Mar 23, 2011 at 6:16 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> On 2011-03-22 21:43, Robert Haas wrote:
>>
>> I took a crack at implementing the first approach described above,
>> which seems to be by far the simplest idea we've come up with to date.
>>  Patch attached.  It doesn't seem to be that complicated, which could
>> mean either that it's not that complicated or that I'm missing
>> something.  Feel free to point and snicker in the latter case.
>
> Looks simple, but there is no benefit...

Your tests and discussion remind me that I haven't yet seen any tests
that show that index-only scans would be useful for performance.

Everyone just seems to be assuming that they make a huge difference,
and that the difference is practically realisable in a common
workload.

Perhaps that's already been done and I just didn't notice?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: crash-safe visibility map, take four

From
高增琦
Date:
Hi,

Should we do full-page writes for the visibility map all the time?
Right now, when clearing a visibility map bit, there is no full-page write
for the VM, since we don't save the buffer info in insert/update/delete's
WAL record.

The full-page write is used to protect pages from disk failure. Without it,
1) set vm: the vm bits that should be set to 1 may still be 0
2) clear vm: the vm bits that should be set to 0 may still be 1
Are these true? Or is the page totally unpredictable?

Another question:
To address the problem in
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
should we just clear the VM bit before writing the log for the
insert/update/delete? This may reduce performance; is there another
solution?

Thanks.

--
GaoZengqi
pgf00a@gmail.com
zengqigao@gmail.com


On Fri, Mar 25, 2011 at 6:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Mar 23, 2011 at 6:16 AM, Jesper Krogh <jesper@krogh.cc> wrote:
>> On 2011-03-22 21:43, Robert Haas wrote:
>>>
>>> I took a crack at implementing the first approach described above,
>>> which seems to be by far the simplest idea we've come up with to date.
>>>  Patch attached.  It doesn't seem to be that complicated, which could
>>> mean either that it's not that complicated or that I'm missing
>>> something.  Feel free to point and snicker in the latter case.
>>
>> Looks simple, but there is no benefit...
>
> Your tests and discussion remind me that I haven't yet seen any tests
> that show that index-only scans would be useful for performance.
>
> Everyone just seems to be assuming that they make a huge difference,
> and that the difference is practically realisable in a common
> workload.
>
> Perhaps that's already been done and I just didn't notice?
>
> --
>  Simon Riggs                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services


Re: crash-safe visibility map, take four

From
Heikki Linnakangas
Date:
On 30.03.2011 06:24, 高增琦 wrote:
> Should we do full-page writes for the visibility map all the time?
> Right now, when clearing a visibility map bit, there is no full-page write
> for the VM, since we don't save the buffer info in insert/update/delete's
> WAL record.
>
> The full-page write is used to protect pages from disk failure. Without it,
> 1) set vm: the vm bits that should be set to 1 may still be 0
> 2) clear vm: the vm bits that should be set to 0 may still be 1
> Are these true? Or is the page totally unpredictable?

Not quite. The WAL replay will set or clear vm bits, regardless of full 
page writes. Full page writes protect from torn pages, ie. the problem 
where some operations on a page have made it to disk while others have 
not. That's not a problem for VM pages, as each bit on the page can be 
set or cleared individually. But for something like a heap page where 
you have an offset in the beginning of the page that points to the tuple 
elsewhere on the page, you have to ensure that they stay in sync, even 
if you don't otherwise care if the update makes it to disk or not.
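
To make the "each bit can be set or cleared individually" point concrete,
here is a small standalone sketch of the one-bit-per-heap-page addressing
(the 24-byte page header size is an assumed constant; the real macros are
HEAPBLK_TO_MAPBLOCK/MAPBYTE/MAPBIT in visibilitymap.c):

    #include <stdio.h>

    #define BLCKSZ              8192
    #define PAGE_HEADER_SIZE    24      /* assumed SizeOfPageHeaderData */
    #define MAPSIZE             (BLCKSZ - PAGE_HEADER_SIZE)
    #define HEAPBLOCKS_PER_BYTE 8       /* one bit per heap block */
    #define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)

    int main(void)
    {
        unsigned heapBlk  = 123456;     /* arbitrary heap block */
        unsigned mapBlock = heapBlk / HEAPBLOCKS_PER_PAGE;
        unsigned mapByte  = (heapBlk % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE;
        unsigned mapBit   = heapBlk % HEAPBLOCKS_PER_BYTE;

        printf("heap block %u -> vm block %u, byte %u, bit %u\n",
               heapBlk, mapBlock, mapByte, mapBit);
        printf("one vm page covers %d heap blocks (~%lld MB of heap)\n",
               HEAPBLOCKS_PER_PAGE,
               (long long) HEAPBLOCKS_PER_PAGE * BLCKSZ / (1024 * 1024));
        return 0;
    }

Since each heap page owns exactly one independent bit, a torn VM page can
only leave individual hint bits stale, never an internally inconsistent
structure, which is the point made above.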

> Another question:
> To address the problem in
> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
> , should we just clear the vm before the log of insert/update/delete?
> This may reduce the performance, is there another solution?

Yeah, that's a straightforward way to fix it. I don't think the 
performance hit will be too bad. But we need to be careful not to hold 
locks while doing I/O, which might require some rearrangement of the 
code. We might want to do a similar dance that we do in vacuum, and call 
visibilitymap_pin first, then lock and update the heap page, and then 
set the VM bit while holding the lock on the heap page.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: crash-safe visibility map, take four

From
Robert Haas
Date:
On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Yeah, that's a straightforward way to fix it. I don't think the performance
> hit will be too bad. But we need to be careful not to hold locks while doing
> I/O, which might require some rearrangement of the code. We might want to do
> a similar dance that we do in vacuum, and call visibilitymap_pin first, then
> lock and update the heap page, and then set the VM bit while holding the
> lock on the heap page.

Looking at heap_insert(), for example, we get the target buffer by
calling RelationGetBufferForTuple(), which kicks out an already
x-locked buffer.  Until we have an x-lock on the buffer, we don't
actually know that there's enough free space for the new tuple.  But
if we release the x-lock to pin the visibility map page, then, one,
we're adding an additional lock acquisition and release cycle, and,
two, by the time we reacquire the x-lock the amount of free space may
no longer be sufficient.  Maybe we could check PD_ALL_VISIBLE before
taking the buffer lock - if it appears to be set, then we pin the
visibility map page before taking the buffer lock.  Otherwise, we take
the buffer lock at once.  Either way, once we have the lock, we
recheck the bit.  Only if it's set and we haven't got a pin do we need
to do the drop-lock-pin-reacquire-lock dance.  Is that at all
sensible?

I also wonder if we should be thinking about holding on to the
visibility map pin longer, rather than potentially reacquiring it for
every tuple inserted.  For bulk loading it doesn't matter; we'll
usually be extending the relation anyway, so the PD_ALL_VISIBLE bits
won't be set and we needn't ever hit the map.  But for a table where
there's a large amount of internal free space, it might matter.  Then
again maybe we don't clear all-visible bits often enough for it to be
an issue.
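
As a sketch of what holding on to the visibility map pin might look like
from a bulk-loading caller (hypothetical; my_insert_one() is an invented
stand-in that pins or re-pins the VM buffer only when the target heap
block changes):

    Buffer  vmbuffer = InvalidBuffer;   /* carried across calls */
    int     i;

    for (i = 0; i < ntuples; i++)
        my_insert_one(rel, tuples[i], &vmbuffer);   /* hypothetical */

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);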

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: crash-safe visibility map, take four

From
高增琦
Date:
On Wed, Mar 30, 2011 at 8:52 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 30.03.2011 06:24, 高增琦 wrote:
>> Should we do full-page writes for the visibility map all the time?
>> Right now, when clearing a visibility map bit, there is no full-page
>> write for the VM, since we don't save the buffer info in
>> insert/update/delete's WAL record.
>>
>> The full-page write is used to protect pages from disk failure. Without it,
>> 1) set vm: the vm bits that should be set to 1 may still be 0
>> 2) clear vm: the vm bits that should be set to 0 may still be 1
>> Are these true? Or is the page totally unpredictable?
>
> Not quite. The WAL replay will set or clear vm bits, regardless of full
> page writes. Full page writes protect from torn pages, ie. the problem
> where some operations on a page have made it to disk while others have
> not. That's not a problem for VM pages, as each bit on the page can be
> set or cleared individually. But for something like a heap page where
> you have an offset in the beginning of the page that points to the tuple
> elsewhere on the page, you have to ensure that they stay in sync, even
> if you don't otherwise care if the update makes it to disk or not.


Consider an example:
1. deletes on two pages emit two log records: (1, page1, vm_clear_1), (2, page2, vm_clear_2)
2. "vm_clear_1" and "vm_clear_2" are on the same vm page
3. checkpoint, and the vm page gets torn; vm_clear_2 is lost
4. a delete on another page emits one log record (3, page1, vm_clear_3); vm_clear_3 is still on that vm page
5. power down
6. startup: redo will replay all changes after the checkpoint, but vm_clear_2 will never be cleared
Am I right?

 
>> Another question:
>> To address the problem in
>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
>> should we just clear the VM bit before writing the log for the
>> insert/update/delete? This may reduce performance; is there another
>> solution?
>
> Yeah, that's a straightforward way to fix it. I don't think the performance
> hit will be too bad. But we need to be careful not to hold locks while doing
> I/O, which might require some rearrangement of the code. We might want to do
> a similar dance that we do in vacuum, and call visibilitymap_pin first, then
> lock and update the heap page, and then set the VM bit while holding the
> lock on the heap page.


Do you mean we should lock the heap page first, then get the block number,
then release the heap page, then pin the VM page, and then lock both the
heap page and the VM page?
As Robert Haas said, when we lock the heap page again, there may not be
enough free space on it.
Is there a way to just stop checkpoints for a while?

Thanks.
GaoZengqi

Re: crash-safe visibility map, take four

From
高增琦
Date:
<div class="gmail_quote">2011/3/30 Robert Haas <span dir="ltr"><<a
href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>></span><br/><blockquote class="gmail_quote"
style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div id=":19i">Maybe we could check
PD_ALL_VISIBLEbefore<br /> taking the buffer lock - if it appears to be set, then we pin the<br /> visibility map page
beforetaking the buffer lock.  Otherwise, we take<br /> the buffer lock at once.  Either way, once we have the lock,
we<br/> recheck the bit.  Only if it's set and we haven't got a pin do we need<br /> to do the
drop-lock-pin-reacquire-lockdance.  Is that at all<br /> sensible?</div></blockquote></div><br />But only lock can make
surethe page has enough free space.<br />If we try the drop-lock-...-lock dance, we may fall into a dead loop.<br /><br
/>--<br />GaoZengqi<br /><a href="mailto:pgf00a@gmail.com">pgf00a@gmail.com</a><br /><a
href="mailto:zengqigao@gmail.com">zengqigao@gmail.com</a><br/> 

Re: crash-safe visibility map, take four

From
Heikki Linnakangas
Date:
On 31.03.2011 11:33, 高增琦 wrote:
> Consider an example:
> 1. deletes on two pages emit two log records: (1, page1, vm_clear_1),
> (2, page2, vm_clear_2)
> 2. "vm_clear_1" and "vm_clear_2" are on the same vm page
> 3. checkpoint, and the vm page gets torn; vm_clear_2 is lost
> 4. a delete on another page emits one log record (3, page1, vm_clear_3);
> vm_clear_3 is still on that vm page
> 5. power down
> 6. startup: redo will replay all changes after the checkpoint, but
> vm_clear_2 will never be cleared
> Am I right?

No. A page can only be torn at a hard crash, i.e. at step 5. A checkpoint
flushes all changes to disk; once the checkpoint finishes, all the
changes before it are safe on disk.

If you crashed between step 2 and 3, the VM page might be torn so that 
only one of the vm_clears has made it to disk but the other has not. But 
the WAL records for both are on disk anyway, so that will be corrected 
at replay.

>>> Another question:
>>> To address the problem in
>>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
>>> should we just clear the VM bit before writing the log for the
>>> insert/update/delete? This may reduce performance; is there another
>>> solution?
>>>
>>
>> Yeah, that's a straightforward way to fix it. I don't think the performance
>> hit will be too bad. But we need to be careful not to hold locks while doing
>> I/O, which might require some rearrangement of the code. We might want to do
>> a similar dance that we do in vacuum, and call visibilitymap_pin first, then
>> lock and update the heap page, and then set the VM bit while holding the
>> lock on the heap page.
>>
> Do you mean we should lock the heap page first, then get the block
> number, then release the heap page, then pin the VM page, and then lock
> both the heap page and the VM page?
> As Robert Haas said, when we lock the heap page again, there may not be
> enough free space on it.

I think the sequence would have to be:

1. Pin the heap page.
2. Check if the all-visible flag is set on the heap page (without lock). 
If it is, pin the vm page
3. Lock heap page, check that it has enough free space
4. Check again if the all-visible flag is set. If it is but we didn't 
pin the vm page yet, release lock and loop back to step 2
5. Update heap page
6. Update vm page
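
Rendered as backend-style pseudo-C (a sketch of the sequence above, not
working code; update_heap_page() and update_vm_bit() are hypothetical
stand-ins, and the free-space/extension path and error handling are
elided):

    Buffer  buffer   = ReadBuffer(rel, blkno);            /* 1. pin the heap page */
    Buffer  vmbuffer = InvalidBuffer;

    for (;;)
    {
        /* 2. if the page looks all-visible, pin the VM page before locking */
        if (PageIsAllVisible(BufferGetPage(buffer)) && !BufferIsValid(vmbuffer))
            visibilitymap_pin(rel, blkno, &vmbuffer);

        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);        /* 3. lock the heap page */
        /*    ... the caller also checks PageGetHeapFreeSpace() here ... */

        /* 4. flag set but no VM pin yet: drop the lock and go back to step 2 */
        if (PageIsAllVisible(BufferGetPage(buffer)) && !BufferIsValid(vmbuffer))
        {
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
            continue;
        }
        break;
    }

    update_heap_page(buffer);               /* 5. hypothetical: the insert/update/delete */
    update_vm_bit(rel, blkno, vmbuffer);    /* 6. hypothetical: update the VM page */

Note that the loop body can run at most twice: once the VM page has been
pinned, the retry condition in step 4 can no longer fire, so the "dead
loop" worried about upthread can't happen.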

> Is there a way to just stop checkpoints for a while?

Not at the moment. It wouldn't be hard to add, though. I was about to
add a mechanism for that last autumn to fix a similar issue with b-tree
parent pointer updates 
(http://archives.postgresql.org/message-id/4CCFEE61.2090702@enterprisedb.com), 
but in the end it was solved differently.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com