Thread: FSM, now without WAL-logging
Attached is a revamped version of the FSM rewrite. WAL-logging is now
gone. Any inconsistencies between levels of the FSM are fixed during
vacuum, and by searchers when they run into a dead end because of a
discrepancy. Corruption within FSM pages is likewise fixed by vacuum
and searchers.

The FSM in a warm standby gets updated by replay of heap CLEAN WAL
records. That means that the FSM will reflect the situation after the
last vacuum, which is better than what we have now, but not quite
up-to-date. I'm worried that this might not be enough, because on a
large table, a lot of pages could've been filled since the last vacuum,
and the first guy who tries to insert into the table will have to grind
through all those pages, finding that they're all full now. It would be
simple to update the FSM at every heap insert and update record, but
that might then be an unacceptable amount of overhead at recovery.
Also, the index FSM is not yet updated during recovery.

The FSM is now extended lazily, so there's no explicit
FreeSpaceMapExtendRel function anymore. To avoid calling smgrnblocks
all the time, I added a field to RelationData to cache the size of the
FSM fork.

The fsm_search_avail() function now emulates the old FSM behavior more
closely. It should now always return the next page to the right of the
next-pointer (wrapping around if necessary). I believe I've addressed
all of Tom's other comments on the code as well. Also, the
"next-pointer" is now reset in vacuum, to encourage the use of
low-numbered pages, per Bruce's comment.

There's one known bug left. If we crash after truncating a relation,
and the truncation of the FSM doesn't hit the disk but the truncation
of the main fork does, we can end up with an FSM that shows free space
on pages that don't exist anymore. The straightforward way to fix that
is to put back the WAL-logging of FSM truncations, but given that I
just ripped out all the rest of the WAL-logging, does anyone have other
ideas?

TODO:
- Performance testing, again.
- Add a test case to the regression suite that exercises the index FSM
- Fix the crash+truncate bug
- Update the index FSM during recovery
- Update the FSM more frequently during recovery
- Documentation

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
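To illustrate the wrap-around search that fsm_search_avail() is
described as emulating above, here is a minimal sketch in C. Everything
in it (the fsm_page struct, SLOTS_PER_PAGE, next_slot, and the function
name) is a hypothetical stand-in invented for illustration, not the
patch's actual code.

    #include <stdint.h>

    #define SLOTS_PER_PAGE 4096     /* illustrative capacity, not the real figure */

    typedef struct
    {
        int      next_slot;                 /* where the previous search stopped */
        uint8_t  avail[SLOTS_PER_PAGE];     /* free-space category per heap page */
    } fsm_page;

    /*
     * Return the first slot at or to the right of next_slot with at least
     * "needed" free space, wrapping around to slot 0 if necessary.
     * Returns -1 if no slot on the page qualifies.
     */
    static int
    fsm_search_from_next(fsm_page *pg, uint8_t needed)
    {
        int start = pg->next_slot;
        int i;

        for (i = 0; i < SLOTS_PER_PAGE; i++)
        {
            int slot = (start + i) % SLOTS_PER_PAGE;

            if (pg->avail[slot] >= needed)
            {
                pg->next_slot = slot + 1;   /* resume to the right next time */
                return slot;
            }
        }
        return -1;
    }

In this scheme, resetting the next-pointer in vacuum, per Bruce's
comment, would amount to setting next_slot back to 0, steering new
insertions toward low-numbered pages.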
On Mon, 2008-09-22 at 20:43 +0300, Heikki Linnakangas wrote:
> Attached is a revamped version of the FSM rewrite. WAL-logging is now
> gone. Any inconsistencies between levels of the FSM are fixed during
> vacuum, and by searchers when they run into a dead end because of a
> discrepancy. Corruption within FSM pages is likewise fixed by vacuum
> and searchers.
>
> The FSM in a warm standby gets updated by replay of heap CLEAN WAL
> records. That means that the FSM will reflect the situation after the
> last vacuum, which is better than what we have now, but not quite
> up-to-date. I'm worried that this might not be enough...

I hadn't realised you would remove it completely. Did you identify WAL
as the bottleneck?

Is there some mid-way point between every time and almost never
(VACUUM!)?

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Mon, 2008-09-22 at 20:43 +0300, Heikki Linnakangas wrote:
>
>> Attached is a revamped version of the FSM rewrite. WAL-logging is now
>> gone. Any inconsistencies between levels of the FSM are fixed during
>> vacuum, and by searchers when they run into a dead end because of a
>> discrepancy. Corruption within FSM pages is likewise fixed by vacuum
>> and searchers.
>>
>> The FSM in a warm standby gets updated by replay of heap CLEAN WAL
>> records. That means that the FSM will reflect the situation after the
>> last vacuum, which is better than what we have now, but not quite
>> up-to-date. I'm worried that this might not be enough...
>
> I hadn't realised you would remove it completely. Did you identify WAL
> as the bottleneck?

No. But it seemed like a sensible thing to do.

> Is there some mid-way point between every time and almost never
> (VACUUM!)?

That's possible. However, if it's not fully WAL-logged, it needs to be
self-correcting, and probably needs to be periodically vacuumed as
well. So you'll get the code complexity of both approaches.

I don't think VACUUM in particular needs to be separately WAL-logged,
because we can easily piggy-back on the HEAP_CLEAN records. The
question is whether we should do the same for every heap insert and
update record, or is that too much overhead?

It looks like truncations do need to be WAL-logged, though.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas wrote:
> It would be simple to update the FSM at every heap insert and update
> record, but that might then be an unacceptable amount of overhead at
> recovery. Also, the index FSM is not yet updated during recovery.

I expect locking problems, especially on small tables where the FSM has
only one level, rather than slower recovery. Maybe the background
writer could update the FSM info, but that breaks modularity and could
cause problems during checkpoints.

<snip>

> There's one known bug left. If we crash after truncating a relation,
> and the truncation of the FSM doesn't hit the disk but the truncation
> of the main fork does, we can end up with an FSM that shows free space
> on pages that don't exist anymore. The straightforward way to fix that
> is to put back the WAL-logging of FSM truncations, but given that I
> just ripped out all the rest of the WAL-logging, does anyone have
> other ideas?

What about truncating the FSM during WAL replay of the main fork
truncation record?

Zdenek

--
Zdenek Kotala
Sun Microsystems
Prague, Czech Republic
http://sun.com/postgresql
Zdenek Kotala wrote:
> Heikki Linnakangas wrote:
>> There's one known bug left. If we crash after truncating a relation,
>> and the truncation of the FSM doesn't hit the disk but the truncation
>> of the main fork does, we can end up with an FSM that shows free
>> space on pages that don't exist anymore. The straightforward way to
>> fix that is to put back the WAL-logging of FSM truncations, but given
>> that I just ripped out all the rest of the WAL-logging, does anyone
>> have other ideas?
>
> What about truncating the FSM during WAL replay of the main fork
> truncation record?

That would work, except that there's no WAL record for truncating the
main fork. I considered adding one, WAL-logging the truncation of the
main fork in vacuum.c, but it would have to be done for all indexams
that use the FSM as well.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Zdenek Kotala wrote:
> Heikki Linnakangas wrote:
>> It would be simple to update the FSM at every heap insert and update
>> record, but that might then be an unacceptable amount of overhead at
>> recovery. Also, the index FSM is not yet updated during recovery.
>
> I expect locking problems, especially on small tables where the FSM
> has only one level, rather than slower recovery. Maybe the background
> writer could update the FSM info, but that breaks modularity and could
> cause problems during checkpoints.

Recovery is single-threaded (for lack of a better word) anyway, so
there can't be other backends competing for the locks.

The background writer is one option. Another is to update the FSM if
there's less than X% of free space left on the page after the
insert/update. That would be similar to the rule we use during normal
operation, which is to update the FSM whenever the target page fills up
and we have to get a new page from the FSM.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
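A hedged sketch in C of how that replay-time rule might look. The
threshold value and both function names here are assumptions made up
for illustration; the patch may well draw the line differently.

    #include <stddef.h>

    /* Hypothetical FSM setter, standing in for whatever the patch exposes. */
    extern void record_free_space(unsigned blkno, size_t free_bytes);

    #define FSM_REPLAY_THRESHOLD 0.20   /* the assumed "X%": update below 20% free */

    /*
     * Called while replaying a heap insert/update record: update the FSM
     * only when the page has become nearly full, mirroring the rule used
     * during normal operation.
     */
    static void
    maybe_update_fsm_at_replay(unsigned blkno, size_t free_bytes, size_t page_size)
    {
        if ((double) free_bytes < FSM_REPLAY_THRESHOLD * (double) page_size)
            record_free_space(blkno, free_bytes);
    }

Pages with plenty of free space left would then cost nothing extra at
recovery, while nearly-full pages, the ones most likely to mislead the
first inserter after failover, would get refreshed.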
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Zdenek Kotala wrote:
>> What about truncating the FSM during WAL replay of the main fork
>> truncation record?

> That would work, except that there's no WAL record for truncating the
> main fork.

Whaddya mean there's no WAL record for that? smgrtruncate produces one.

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Zdenek Kotala wrote:
>>> What about truncating the FSM during WAL replay of the main fork
>>> truncation record?
>
>> That would work, except that there's no WAL record for truncating the
>> main fork.
>
> Whaddya mean there's no WAL record for that? smgrtruncate produces one.

Ooh, I didn't realize that!

Hmm. The smgrtruncate WAL record is generated after the file is
truncated, so there's still a small window there, where we can be left
with a truncated main fork but no smgrtruncate record for it, and thus
the page of the FSM representing the truncated blocks doesn't get
zeroed at replay.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Hmm. The smgrtruncate WAL record is generated after the file is
> truncated, so there's still a small window there, where we can be left
> with a truncated main fork but no smgrtruncate record for it, and thus
> the page of the FSM representing the truncated blocks doesn't get
> zeroed at replay.

Hmm. I seem to recall that that ordering was intentional, but I don't
recall why just now. The code doesn't say, but maybe there's something
in the archives. It does seem a little odd, since it's an apparent
violation of the WAL-before-data rule.

If you wanted to be certain that the WAL record existed, you'd have to
not only generate it first but fsync it, which would be a bit of a
performance hit. OTOH, I'm not sure we care about smgrtruncate being
really quick...

			regards, tom lane
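To spell out the window being described, and the fix Tom alludes to,
here is a sketch of the safe ordering in C. All of the type and
function names below are hypothetical stand-ins invented for
illustration, not the actual smgr or xlog interfaces.

    typedef unsigned long long wal_ptr;     /* stand-in for a WAL position (LSN) */

    /* Hypothetical stand-ins for illustration */
    extern wal_ptr log_truncation(void *rel, unsigned nblocks);  /* emit WAL record */
    extern void    flush_wal(wal_ptr upto);                      /* fsync WAL to disk */
    extern void    shorten_file(void *rel, unsigned nblocks);    /* truncate main fork */

    /*
     * Today's ordering is effectively shorten_file() first, then
     * log_truncation(); a crash between the two leaves a shortened file
     * with no record to replay. The WAL-before-data ordering closes that
     * window, at the cost of a WAL flush:
     */
    static void
    truncate_rel_safely(void *rel, unsigned nblocks)
    {
        wal_ptr lsn = log_truncation(rel, nblocks);  /* record first... */

        flush_wal(lsn);                 /* ...force it to disk... */
        shorten_file(rel, nblocks);     /* ...only then shrink the file */
    }

The flush is exactly the performance hit Tom mentions: without it, the
record could still be lost in a crash even though it was generated
before the truncation.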
Heikki Linnakangas wrote:
> Attached is a revamped version of the FSM rewrite. WAL-logging is now
> gone. Any inconsistencies between levels of the FSM are fixed during
> vacuum, and by searchers when they run into a dead end because of a
> discrepancy. Corruption within FSM pages is likewise fixed by vacuum
> and searchers.

Hi Heikki,

I'm testing this version now, and I got a core dump in the initdb phase:

08047558 pgstat_count_heap_insert+0x20(840b120, 842d738, 80, 8047580)
08047608 heap_insert+0x124(840b120, 842d738, 0, 1, 1, 842d710)
08047638 simple_heap_insert+0x23(840b120, 842d738, 8367f60, 80cc25a, b, 8302cac)
08047668 InsertOneTuple+0x93(4da, 80479ec, 8047b88, 80cb0b3, 0, 0)
08047b88 boot_yyparse+0x7f1(0, feff0e68, 0, 1, 83ab990, fefcba94)
08047bb8 AuxiliaryProcessMain+0x41d(3, 83aad04, fee88540, 816b63b, fef00e40, 8047bf4)
08047bfc main+0x3a9(4, 83aad00, 8047c30)
08047c10 _start+0x80(4, 8047d04, 8047d2c, 8047d33, 8047d37, 0)

Let me know if you need more info. (I'm not using a fresh CVS snapshot
but a two-week-old one.)

Zdenek
Zdenek Kotala wrote:
> I'm testing this version now, and I got a core dump in the initdb phase:
>
> 08047558 pgstat_count_heap_insert+0x20(840b120, 842d738, 80, 8047580)
> 08047608 heap_insert+0x124(840b120, 842d738, 0, 1, 1, 842d710)
> 08047638 simple_heap_insert+0x23(840b120, 842d738, 8367f60, 80cc25a, b, 8302cac)
> 08047668 InsertOneTuple+0x93(4da, 80479ec, 8047b88, 80cb0b3, 0, 0)
> 08047b88 boot_yyparse+0x7f1(0, feff0e68, 0, 1, 83ab990, fefcba94)
> 08047bb8 AuxiliaryProcessMain+0x41d(3, 83aad04, fee88540, 816b63b, fef00e40, 8047bf4)
> 08047bfc main+0x3a9(4, 83aad00, 8047c30)
> 08047c10 _start+0x80(4, 8047d04, 8047d2c, 8047d33, 8047d37, 0)
>
> Let me know if you need more info. (I'm not using a fresh CVS snapshot
> but a two-week-old one.)

pgstat_count_heap_insert is an odd place to crash. I'd suggest updating
your CVS snapshot, and "make clean".

Attached is a new version, now with WAL-logging of the FSM truncation.
I decided to go with a separate WAL record for that, rather than
piggybacking on smgrtruncate's WAL record. It seems much better from a
modularity point of view this way. I've also worked on the comments,
renamed many of the internal functions in a more coherent scheme, and
started using the "struct FSMAddress" you suggested a while ago.

But I don't think I've changed anything that could explain that crash.
Let me know if it still doesn't work.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
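To make the shape of that design concrete, here is a minimal sketch of
what a dedicated FSM-truncation WAL record and its redo routine could
look like. The record layout, the field names, and the
fsm_zero_beyond() helper are all assumptions invented for illustration;
they are not the patch's actual definitions.

    /*
     * Hypothetical helper: zero the FSM slots covering heap blocks >=
     * nheapblocks, so recovery can't hand out space on pages that no
     * longer exist.
     */
    extern void fsm_zero_beyond(unsigned rel_id, unsigned nheapblocks);

    typedef struct xl_fsm_truncate
    {
        unsigned    rel_id;         /* which relation's FSM fork */
        unsigned    nheapblocks;    /* new length of the main fork, in blocks */
    } xl_fsm_truncate;

    /* Emitted when the relation is truncated; replayed like this in recovery. */
    static void
    fsm_redo_truncate(const xl_fsm_truncate *xlrec)
    {
        fsm_zero_beyond(xlrec->rel_id, xlrec->nheapblocks);
    }

Keeping this as its own record type also means the redo side doesn't
need to care whether a given truncation came from vacuum or from an
index AM, which matches the modularity argument above.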
Heikki Linnakangas wrote:
> Zdenek Kotala wrote:
>> I'm testing this version now, and I got a core dump in the initdb phase:
>>
>> 08047558 pgstat_count_heap_insert+0x20(840b120, 842d738, 80, 8047580)

<snip>

>> Let me know if you need more info. (I'm not using a fresh CVS
>> snapshot but a two-week-old one.)
>
> pgstat_count_heap_insert is an odd place to crash. I'd suggest
> updating your CVS snapshot, and "make clean".

Updating is not a good idea :( - not for a performance test, because it
would require retesting the original version, and so on...

> Attached is a new version, now with WAL-logging of the FSM truncation.
> I decided to go with a separate WAL record for that, rather than
> piggybacking on smgrtruncate's WAL record. It seems much better from a
> modularity point of view this way. I've also worked on the comments,
> renamed many of the internal functions in a more coherent scheme, and
> started using the "struct FSMAddress" you suggested a while ago.
>
> But I don't think I've changed anything that could explain that crash.
> Let me know if it still doesn't work.

OK. I'll look at it.

Zdenek
Heikki Linnakangas wrote:
> Zdenek Kotala wrote:
>
> Attached is a new version, now with WAL-logging of the FSM truncation.
> I decided to go with a separate WAL record for that, rather than
> piggybacking on smgrtruncate's WAL record. It seems much better from a
> modularity point of view this way. I've also worked on the comments,
> renamed many of the internal functions in a more coherent scheme, and
> started using the "struct FSMAddress" you suggested a while ago.
>
> But I don't think I've changed anything that could explain that crash.
> Let me know if it still doesn't work.

This version works on my "old" repo.

I performed a performance test (iGEN) on a Sun x4600 with 60 concurrent
users. Results:

Original:
---------
MQThL (Maximum Qualified Throughput LIGHT):  1209.60 tpm
MQThM (Maximum Qualified Throughput MEDIUM): 2576.72 tpm
MQThH (Maximum Qualified Throughput HEAVY):  2191.20 tpm

TRANSACTION MIX

Total number of transactions = 181232
TYPE            TX. COUNT         MIX
----            ---------         ---
Light:              30240      16.69%
Medium:             64418      35.54%
DSS:                19865      10.96%
Heavy:              54780      30.23%
Connection:         11929       6.58%

RESPONSE TIMES        AVG.     MAX.     90TH
Light                 0.304    6.405    0.400
Medium                0.317    6.533    0.400
DSS                   0.266    6.343    0.020
Heavy                 0.361    6.737    3.000
Connections           0.264    5.983    0.400

Number of users = 60
Sum of Avg. RT * TPS for all Tx. Types = 32.770142

FSM with WAL
------------
MQThL (Maximum Qualified Throughput LIGHT):  1199.36 tpm
MQThM (Maximum Qualified Throughput MEDIUM): 2569.12 tpm
MQThH (Maximum Qualified Throughput HEAVY):  2171.64 tpm

TRANSACTION MIX

Total number of transactions = 180625
TYPE            TX. COUNT         MIX
----            ---------         ---
Light:              29984      16.60%
Medium:             64228      35.56%
DSS:                20181      11.17%
Heavy:              54291      30.06%
Connection:         11941       6.61%

RESPONSE TIMES        AVG.     MAX.     90TH
Light                 0.309    6.560    0.400
Medium                0.323    6.529    0.400
DSS                   0.268    6.327    0.020
Heavy                 0.360    6.675    3.000
Connections           0.274    6.359    0.400

Number of users = 60
Sum of Avg. RT * TPS for all Tx. Types = 32.845712

FSM no WAL, last version
------------------------
MQThL (Maximum Qualified Throughput LIGHT):  1207.92 tpm
MQThM (Maximum Qualified Throughput MEDIUM): 2611.84 tpm
MQThH (Maximum Qualified Throughput HEAVY):  2177.68 tpm

TRANSACTION MIX

Total number of transactions = 182222
TYPE            TX. COUNT         MIX
----            ---------         ---
Light:              30198      16.57%
Medium:             65296      35.83%
DSS:                20118      11.04%
Heavy:              54442      29.88%
Connection:         12168       6.68%

RESPONSE TIMES        AVG.     MAX.     90TH
Light                 0.301    6.106    0.400
Medium                0.315    6.130    0.400
DSS                   0.261    5.977    0.020
Heavy                 0.361    6.220    3.000
Connections           0.260    6.044    0.400

Number of users = 60
Sum of Avg. RT * TPS for all Tx. Types = 32.696832

-----------------------------------------

I don't see any big difference. Throughput is similar. Only the
response times seem slightly better with your last FSM version.

I'm personally happy with the performance.

Zdenek
Zdenek Kotala wrote:
> I performed a performance test (iGEN) on a Sun x4600 with 60
> concurrent users. Results:
>
> ...
>
> I don't see any big difference. Throughput is similar. Only the
> response times seem slightly better with your last FSM version.
>
> I'm personally happy with the performance.

Thanks! I've been running DBT-2 tests myself, and I'm not seeing any
difference there either:

testno   TPM    NO90    time   comment
36       1405   1.724   6h     fsm-nowal
35       1421   0.761   6h     CVS HEAD
34       1439   1.066   2h     fsm-nowal
33       1442   0.868   2h     CVS HEAD

NO90 is the 90th percentile of the New Order transaction response time.
There's big variation there in the 6h tests, because of a huge dip in
performance near the end of the test, both with and without the patch.
I don't have an explanation for the dips, but it throws off the
response times for the whole test. Given that in the 2h tests, which
are exactly the same as the 6h tests, just shorter, there's no
degradation in the response times, I'm not worried about that.

I don't have access to the site where I used to publish the test
results earlier, but let me know if you want to see the full test
results, and I'll try to zip them up and FTP them somewhere (~500 MB
uncompressed).

I've also tried various pgbench tests, on a RAM disk and otherwise, as
well as the "table population" test I ran earlier, and don't see any
difference in performance.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas wrote:
> Zdenek Kotala wrote:

<snip>

> I've also tried various pgbench tests, on a RAM disk and otherwise, as
> well as the "table population" test I ran earlier, and don't see any
> difference in performance.

I think the performance is OK.

Zdenek