Thread: FSM, now without WAL-logging

FSM, now without WAL-logging

From
Heikki Linnakangas
Date:
Attached is a revamped version of the FSM rewrite. WAL-logging is now
gone. Any inconsistencies between levels of the FSM are fixed during
vacuum, and by searchers when they run into a dead end because of a
discrepancy. Corruption within FSM pages is likewise fixed by vacuum and
searchers.
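To illustrate the self-correcting search described above, here is a minimal toy sketch, not the patch's actual code: a two-level "tree" where the upper level may be stale, and a searcher that repairs any discrepancy it trips over instead of relying on WAL to keep the levels consistent. All names and the data layout are invented for illustration.

```c
#include <assert.h>

/*
 * Toy two-level FSM: parent[i] caches the free space of leaf[i].
 * The parent may be stale; a searcher that hits a dead end
 * (parent promises space the leaf no longer has) repairs the
 * parent entry in place and keeps going.
 */
#define LEAVES 4
static unsigned char leaf[LEAVES];
static unsigned char parent[LEAVES];

static int
fsm_search(unsigned char need)
{
    int i;

    for (i = 0; i < LEAVES; i++)
    {
        if (parent[i] < need)
            continue;            /* upper level says: no space here */
        if (leaf[i] >= need)
            return i;            /* levels agree: success */
        parent[i] = leaf[i];     /* dead end: fix the discrepancy */
    }
    return -1;                   /* no page with enough space */
}
```

The point is that a stale upper-level entry costs one wasted probe, after which it is corrected, so no WAL record is needed to keep the levels in sync.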

The FSM in a warm standby gets updated by replay of heap CLEAN WAL
records. That means that the FSM will reflect the situation after the
last vacuum, which is better than what we have now, but not quite
up-to-date. I'm worried that this might not be enough, because on a
large table, a lot of pages could've been filled since the last vacuum,
and the first guy who tries to insert into the table will have to grind
through all those pages, finding that they're all full now. It would be
simple to update the FSM at every heap insert and update record, but
that then might be an unacceptable amount of overhead at recovery. Also,
the index FSM is not yet updated during recovery.

The FSM is now extended lazily, so there's no explicit
FreeSpaceMapExtendRel function anymore. To avoid calling smgrnblocks all
the time, I added a field to RelationData to cache the size of the FSM fork.
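A hedged sketch of that caching idea (the field and function names here are illustrative, not the actual ones in the patch):

```c
#include <assert.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Illustrative stand-in for the new RelationData field. */
typedef struct RelationData
{
    BlockNumber fsm_nblocks_cache;  /* cached size of the FSM fork */
} RelationData;

static int smgrnblocks_calls = 0;

/* Pretend smgrnblocks(): counts calls so the caching is observable. */
static BlockNumber
smgrnblocks_stub(void)
{
    smgrnblocks_calls++;
    return 4;
}

/* Ask smgr once, then serve the size from the relcache entry. */
static BlockNumber
fsm_nblocks(RelationData *rel)
{
    if (rel->fsm_nblocks_cache == InvalidBlockNumber)
        rel->fsm_nblocks_cache = smgrnblocks_stub();
    return rel->fsm_nblocks_cache;
}
```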

The fsm_search_avail() function now emulates the old FSM behavior more
closely. It should now always return the next page to the right of the
next-pointer (wrapping around if necessary).
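Roughly, the wrap-around behaviour can be pictured like this: a toy one-level version (the real function walks the tree, and these names are simplified):

```c
#include <assert.h>

#define NPAGES 8
static unsigned char avail[NPAGES]; /* free space recorded per heap page */
static int next_ptr = 0;            /* where the previous search ended */

/*
 * Return the first page at or after next_ptr with enough space,
 * wrapping around once, and advance next_ptr past it so consecutive
 * searches spread inserts across pages instead of piling onto one.
 */
static int
fsm_search_avail(unsigned char need)
{
    int n;

    for (n = 0; n < NPAGES; n++)
    {
        int page = (next_ptr + n) % NPAGES;

        if (avail[page] >= need)
        {
            next_ptr = (page + 1) % NPAGES;
            return page;
        }
    }
    return -1;
}
```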

I believe I've addressed all of Tom's other comments on the code as
well. Also, the "next-pointer" is now reset in vacuum, to encourage the
use of low-numbered pages, per Bruce's comment.

There's one known bug left. If we crash after truncating a relation, and
the truncation of the FSM doesn't hit the disk but the truncation of the
main fork does, we can end up with an FSM that shows that there's free
space on pages that don't exist anymore. The straightforward way to fix
that is to put back the WAL-logging of FSM truncations, but given that I
just ripped out all the rest of the WAL-logging, does anyone have other
ideas?

TODO:
- Performance testing, again.
- Add test case to regression suite that exercises index FSM
- Fix the crash+truncate bug
- Update index FSM during recovery
- Update FSM more frequently during recovery
- Documentation

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: FSM, now without WAL-logging

From
Simon Riggs
Date:
On Mon, 2008-09-22 at 20:43 +0300, Heikki Linnakangas wrote:

> Attached is a revamped version of the FSM rewrite. WAL-logging is now 
> gone. Any inconsistencies between levels of the FSM are fixed during 
> vacuum, and by searchers when they run into a dead end because of a 
> discrepancy. Corruption within FSM pages is likewise fixed by vacuum and 
> searchers.
> 
> The FSM in a warm standby gets updated by replay of heap CLEAN WAL 
> records. That means that the FSM will reflect the situation after the 
> last vacuum, which is better than what we have now, but not quite 
> up-to-date. I'm worried that this might not be enough...

I hadn't realised you would remove it completely. Did you identify WAL
as the bottleneck?

Is there some mid-way point between every time and almost never
(VACUUM!)? 

-- 
   Simon Riggs           www.2ndQuadrant.com
   PostgreSQL Training, Services and Support



Re: FSM, now without WAL-logging

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Mon, 2008-09-22 at 20:43 +0300, Heikki Linnakangas wrote:
> 
>> Attached is a revamped version of the FSM rewrite. WAL-logging is now
>> gone. Any inconsistencies between levels of the FSM is fixed during 
>> vacuum, and by searchers when they run into a dead end because of a 
>> discrepancy. Corruption within FSM pages is likewise fixed by vacuum and 
>> searchers.
>>
>> The FSM in a warm standby gets updated by replay of heap CLEAN WAL 
>> records. That means that the FSM will reflect the situation after the 
>> last vacuum, which is better than what we have now, but not quite 
>> up-to-date. I'm worried that this might not be enough...
> 
> I hadn't realised you would remove it completely. Did you identify WAL
> as the bottleneck?

No. But it seemed like a sensible thing to do.

> Is there some mid-way point between every time and almost never
> (VACUUM!)? 

That's possible. However, if it's not fully WAL-logged, it needs to be 
self-correcting, and probably needs to be periodically vacuumed as well. 
So you'll get the code complexity of both approaches. I don't think 
VACUUM in particular needs to be separately WAL-logged, because we can 
easily piggy-back on the HEAP_CLEAN records. The question is whether we 
should do the same for every heap insert and update record, or is that 
too much overhead?

It looks like truncations do need to be WAL-logged, though.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: FSM, now without WAL-logging

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):

> It would be 
> simple to update the FSM at every heap insert and update record, but 
> that then might be an unacceptable amount of overhead at recovery. Also, 
> the index FSM is not yet updated during recovery.

I expect locking problems, especially on small tables where the FSM has
only one level, rather than slower recovery. Maybe the background writer
could update the FSM info, but that breaks modularity and could bring
problems during checkpoints.

<snip>

> 
> There's one known bug left. If we crash after truncating a relation, and 
> the truncation of the FSM doesn't hit the disk but the truncation of the 
> main fork does, we can end up with an FSM that shows that there's free 
> space on pages that don't exist anymore. The straightforward way to fix 
> that is to put back the WAL-logging of FSM truncations, but given that I 
> just ripped off all the rest of the WAL-logging, does anyone have other 
> ideas?

What about truncating the FSM during WAL replay of the main fork truncation record?
Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: FSM, now without WAL-logging

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> Heikki Linnakangas napsal(a):
>> There's one known bug left. If we crash after truncating a relation, and 
>> the truncation of the FSM doesn't hit the disk but the truncation of the 
>> main fork does, we can end up with an FSM that shows that there's free 
>> space on pages that don't exist anymore. The straightforward way to fix 
>> that is to put back the WAL-logging of FSM truncations, but given that I 
>> just ripped off all the rest of the WAL-logging, does anyone have other 
>> ideas?
> 
> What about truncating the FSM during WAL replay of the main fork truncation record?

That would work, except that there's no WAL record for truncating the
main fork.

I considered adding that, WAL-logging the truncation of the main fork,
in vacuum.c. But it would have to be done for all indexams that use the
FSM as well.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: FSM, now without WAL-logging

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> Heikki Linnakangas napsal(a):
> 
>> It would be 
>> simple to update the FSM at every heap insert and update record, but 
>> that then might be an unacceptable amount of overhead at recovery. Also, 
>> the index FSM is not yet updated during recovery.
> 
> I expect locking problems, especially on small tables where the FSM has
> only one level, rather than slower recovery. Maybe the background writer
> could update the FSM info, but that breaks modularity and could bring
> problems during checkpoints.

Recovery is single-threaded (for lack of a better word) anyway, so
there can't be other backends competing for the locks. Background writer
is one option.

Another option is to update the FSM if there's less than X% of free space on
the page left after the insert/update. That would be similar to the rule
we use during normal operation, which is to update the FSM whenever the
target page fills up and we have to get a new page from the FSM.
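As a sketch, the threshold test could be as simple as this; BLCKSZ aside, the names and the 20% figure stand in for the unspecified "X":

```c
#include <assert.h>
#include <stdbool.h>

#define BLCKSZ 8192
#define FSM_UPDATE_THRESHOLD 20     /* percent; the "X" above */

/*
 * During replay of a heap insert/update record, update the FSM only
 * when the page is getting full, i.e. less than X% of the block is
 * still free, mirroring the normal-operation rule of updating the
 * FSM when the target page fills up.
 */
static bool
fsm_update_needed(int freespace)
{
    return freespace * 100 < BLCKSZ * FSM_UPDATE_THRESHOLD;
}
```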

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: FSM, now without WAL-logging

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Zdenek Kotala wrote:
>> What about truncating the FSM during WAL replay of the main fork truncation record?

> That would work, except that there's no WAL record for truncating the
> main fork.

Whaddya mean there's no WAL record for that?  smgrtruncate produces one.
        regards, tom lane


Re: FSM, now without WAL-logging

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Zdenek Kotala wrote:
>>> What about truncating the FSM during WAL replay of the main fork truncation record?
> 
>> That would work, except that there's no WAL record for truncating the
>> main fork.
> 
> Whaddya mean there's no WAL record for that?  smgrtruncate produces one.

Ooh, I didn't realize that!

Hmm. The smgrtruncate WAL record is generated after the file is 
truncated, so there's still a small window there, where we can be left 
with a truncated main fork, but no smgrtruncate record for it, and thus 
the page of the FSM representing the truncated blocks doesn't get zeroed 
at replay.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: FSM, now without WAL-logging

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Hmm. The smgrtruncate WAL record is generated after the file is 
> truncated, so there's still a small window there, where we can be left 
> with a truncated main fork, but no smgrtruncate record for it, and thus 
> the page of the FSM representing the truncated blocks doesn't get zeroed 
> at replay.

Hmm.  I seem to recall that that ordering was intentional, but I don't
recall why just now.  The code doesn't say but maybe there's something
in the archives.  It does seem a little odd since it's an apparent
violation of the wal-before-data rule.

If you wanted to be certain that the WAL record existed you'd have to
not only generate it first but fsync it, which would be a bit of a
performance hit.  OTOH I'm not sure we care about smgrtruncate being
really quick...
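In other words, the window closes if the record is durably on disk before the file shrinks. A minimal sketch of that ordering, with all names invented and the WAL and file reduced to counters:

```c
#include <assert.h>

static int wal_written = 0;    /* highest LSN written to WAL buffers */
static int wal_flushed = 0;    /* highest LSN fsync'd to disk */
static int file_nblocks = 100; /* simulated main-fork length */

static int
xlog_insert_truncate(void)
{
    return ++wal_written;
}

static void
xlog_flush(int lsn)
{
    if (lsn > wal_flushed)
        wal_flushed = lsn;
}

/*
 * WAL-before-data ordering: write *and* flush the truncate record,
 * paying the fsync, before the data change hits the filesystem.
 * After a crash at any point, replay either sees the record (and
 * redoes the truncation) or the file was never shortened.
 */
static void
smgrtruncate_safe(int nblocks)
{
    int lsn = xlog_insert_truncate();

    xlog_flush(lsn);
    file_nblocks = nblocks;
}
```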
        regards, tom lane


Re: FSM, now without WAL-logging

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Attached is a revamped version of the FSM rewrite. WAL-logging is now 
> gone. Any inconsistencies between levels of the FSM are fixed during 
> vacuum, and by searchers when they run into a dead end because of a 
> discrepancy. Corruption within FSM pages is likewise fixed by vacuum and 
> searchers.

Hi Heikki,

I'm testing this version now and I got a core dump in the initdb phase:

08047558 pgstat_count_heap_insert+0x20(840b120, 842d738, 80, 8047580)
08047608 heap_insert+0x124(840b120, 842d738, 0, 1, 1, 842d710)
08047638 simple_heap_insert+0x23(840b120, 842d738, 8367f60, 80cc25a, b, 
8302cac)
08047668 InsertOneTuple+0x93(4da, 80479ec, 8047b88, 80cb0b3, 0, 0)
08047b88 boot_yyparse+0x7f1(0, feff0e68, 0, 1, 83ab990, fefcba94)
08047bb8 AuxiliaryProcessMain+0x41d(3, 83aad04, fee88540, 816b63b, 
fef00e40, 8047bf4)
08047bfc main+0x3a9(4, 83aad00, 8047c30)
08047c10 _start+0x80(4, 8047d04, 8047d2c, 8047d33, 8047d37, 0)

Let me know if you need more info. (I'm not using a fresh CVS snapshot 
but a two-week-old one)
    Zdenek


Re: FSM, now without WAL-logging

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> I'm testing this version now and I got core dump in initdb phase:
>
> 08047558 pgstat_count_heap_insert+0x20(840b120, 842d738, 80, 8047580)
> 08047608 heap_insert+0x124(840b120, 842d738, 0, 1, 1, 842d710)
> 08047638 simple_heap_insert+0x23(840b120, 842d738, 8367f60, 80cc25a, b,
> 8302cac)
> 08047668 InsertOneTuple+0x93(4da, 80479ec, 8047b88, 80cb0b3, 0, 0)
> 08047b88 boot_yyparse+0x7f1(0, feff0e68, 0, 1, 83ab990, fefcba94)
> 08047bb8 AuxiliaryProcessMain+0x41d(3, 83aad04, fee88540, 816b63b,
> fef00e40, 8047bf4)
> 08047bfc main+0x3a9(4, 83aad00, 8047c30)
> 08047c10 _start+0x80(4, 8047d04, 8047d2c, 8047d33, 8047d37, 0)
>
> Let me know if you need more info. (I'm not using fresh CVS snapshot but
> two weeks old)

pgstat_count_heap_insert is an odd place to crash. I'd suggest updating
your CVS snapshot, and "make clean".

Attached is a new version, now with WAL-logging of the FSM truncation. I
decided to go with the separate WAL record for that, rather than
piggybacking on the smgrtruncate's WAL record. It seems much better from
a modularity point of view this way. I've also worked on the comments,
renamed many of the internal functions in a more coherent scheme, and I
also started using the "struct FSMAddress" you suggested a while ago.

But I don't think I've changed anything that could explain that crash.
Let me know if it still doesn't work.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: FSM, now without WAL-logging

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:
>> I'm testing this version now and I got core dump in initdb phase:
>>
>> 08047558 pgstat_count_heap_insert+0x20(840b120, 842d738, 80, 8047580)

<snip>

>>
>> Let me know if you need more info. (I'm not using fresh CVS snapshot 
>> but two weeks old)
> 
> pgstat_count_heap_insert is an odd place to crash. I'd suggest updating 
> your CVS snapshot, and "make clean".

Updating is not a good idea :( - not for a performance test, because it 
would require retesting the original version and so on ...

> 
> Attached is a new version, now with WAL-logging of the FSM truncation. I 
> decided to go with the separate WAL record for that, rather than 
> piggybacking on the smgrtruncate's WAL record. It seems much better from 
> a modularity point of view this way. I've also worked on the comments, 
> renamed many of the internal functions, in a more coherent scheme, and I 
> also started using the "struct FSMAddress" you suggested a while ago.
> 
> But I don't think I've changed anything that could explain that crash. 
> Let me know if it still doesn't work.

OK. I'll look into it.
    Zdenek




Re: FSM, now without WAL-logging

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:

> 
> Attached is a new version, now with WAL-logging of the FSM truncation. I 
> decided to go with the separate WAL record for that, rather than 
> piggybacking on the smgrtruncate's WAL record. It seems much better from 
> a modularity point of view this way. I've also worked on the comments, 
> renamed many of the internal functions, in a more coherent scheme, and I 
> also started using the "struct FSMAddress" you suggested a while ago.
> 
> But I don't think I've changed anything that could explain that crash. 
> Let me know if it still doesn't work.

This version works on my "old" repo.

I performed a performance test (iGEN) on a Sun x4600 with 60 concurrent 
users; here are the results:

Original:
---------
MQThL (Maximum Qualified Throughput LIGHT): 1209.60 tpm
MQThM (Maximum Qualified Throughput MEDIUM): 2576.72 tpm
MQThH (Maximum Qualified Throughput HEAVY): 2191.20 tpm


TRANSACTION MIX

Total number of transactions = 181232
TYPE            TX. COUNT       MIX
----            ---------       ---
Light:          30240           16.69%
Medium:         64418           35.54%
DSS:            19865           10.96%
Heavy:          54780           30.23%
Connection:     11929           6.58%


RESPONSE TIMES          AVG.            MAX.            90TH

Light                   0.304           6.405           0.400
Medium                  0.317           6.533           0.400
DSS                     0.266           6.343           0.020
Heavy                   0.361           6.737           3.000
Connections             0.264           5.983           0.400
Number of users = 60
Sum of Avg. RT * TPS for all Tx. Types = 32.770142


FSM with WAL
------------
MQThL (Maximum Qualified Throughput LIGHT): 1199.36 tpm
MQThM (Maximum Qualified Throughput MEDIUM): 2569.12 tpm
MQThH (Maximum Qualified Throughput HEAVY): 2171.64 tpm


TRANSACTION MIX

Total number of transactions = 180625
TYPE            TX. COUNT       MIX
----            ---------       ---
Light:          29984           16.60%
Medium:         64228           35.56%
DSS:            20181           11.17%
Heavy:          54291           30.06%
Connection:     11941           6.61%


RESPONSE TIMES          AVG.            MAX.            90TH

Light                   0.309           6.560           0.400
Medium                  0.323           6.529           0.400
DSS                     0.268           6.327           0.020
Heavy                   0.360           6.675           3.000
Connections             0.274           6.359           0.400
Number of users = 60
Sum of Avg. RT * TPS for all Tx. Types = 32.845712


FSM no WAL last version
-----------------------
MQThL (Maximum Qualified Throughput LIGHT): 1207.92 tpm
MQThM (Maximum Qualified Throughput MEDIUM): 2611.84 tpm
MQThH (Maximum Qualified Throughput HEAVY): 2177.68 tpm


TRANSACTION MIX

Total number of transactions = 182222
TYPE            TX. COUNT       MIX
----            ---------       ---
Light:          30198           16.57%
Medium:         65296           35.83%
DSS:            20118           11.04%
Heavy:          54442           29.88%
Connection:     12168           6.68%


RESPONSE TIMES          AVG.            MAX.            90TH

Light                   0.301           6.106           0.400
Medium                  0.315           6.130           0.400
DSS                     0.261           5.977           0.020
Heavy                   0.361           6.220           3.000
Connections             0.260           6.044           0.400
Number of users = 60
Sum of Avg. RT * TPS for all Tx. Types = 32.696832


-----------------------------------------

I don't see any big difference. Throughput is similar. Only the response 
times seem to be a bit better with your last FSM version.

I'm personally happy with the performance.
        Zdenek



Re: FSM, now without WAL-logging

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> I performed performance test (iGEN) on SUN x4600 with 60 concurrent 
> users and see result:
> ...
> 
> I don't see any big difference. Throughput is similar. Only response 
> time seems to be better with your last FSM version.
> 
> I personally happy with performance.

Thanks! I've been running DBT-2 tests myself, and I'm not seeing any 
difference there either:

testno    TPM    NO90    time    comment
36    1405    1.724    6h    fsm-nowal
35    1421    0.761    6h    CVS HEAD
34    1439    1.066    2h    fsm-nowal
33    1442    0.868    2h    CVS HEAD

The NO90 is the 90th percentile of New Order transaction response time. 
There's big variation there in the 6h tests, because of a huge dip in 
performance near the end of the test, both with and without the patch. I 
don't have an explanation for the dips, but it throws off the response 
times for the whole tests. Given that in the 2h tests, which are exactly 
the same as the 6h tests, just shorter, there's no degradation in the 
response times, I'm not worried about that.

I don't have access to the site where I used to publish the test results 
earlier, but let me know if you want to see the full test results, and 
I'll try to zip them up and FTP somewhere (~500 MB uncompressed).

I've also tried various pgbench tests, on a RAM disk and otherwise, as 
well as the "table population" test I ran earlier, and don't see any 
difference in performance.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: FSM, now without WAL-logging

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:

<snip>

> I've also tried various pgbench tests, on a RAM disk and otherwise, as 
> well as the "table population" test I ran earlier, and don't see any 
> difference in performance.

I think the performance is OK.
    Zdenek