Thread: FSM patch - performance test
Hi Heikki,

I finally performed the iGen test. I used two V490 servers with 4
dual-core SPARC CPUs and 32GB RAM. I have only one disk and I did not
perform any disk I/O optimization. I tested 105 parallel connections and
the think time was 200ms. See the results:

Original:
---------

Actual run/snap-shot time: 3004 sec

MQThL (Maximum Qualified Throughput LIGHT):  1458.76 tpm
MQThM (Maximum Qualified Throughput MEDIUM): 3122.44 tpm
MQThH (Maximum Qualified Throughput HEAVY):  2626.70 tpm

TRANSACTION MIX

Total number of transactions = 438133
TYPE          TX. COUNT     MIX
----          ---------     ---
Light:            72938   16.65%
Medium:          156122   35.63%
DSS:              48516   11.07%
Heavy:           131335   29.98%
Connection:       29222    6.67%

RESPONSE TIMES       AVG.    MAX.    90TH
Light               0.541   3.692   0.800
Medium              0.542   3.702   0.800
DSS                 0.539   3.510   0.040
Heavy               0.539   3.742   4.000
Connections         0.545   3.663   0.800

Number of users = 105
Sum of Avg. RT * TPS for all Tx. Types = 64.851454

New FSM implementation:
-----------------------

Actual run/snap-shot time: 3004 sec

MQThL (Maximum Qualified Throughput LIGHT):  1351.20 tpm
MQThM (Maximum Qualified Throughput MEDIUM): 2888.74 tpm
MQThH (Maximum Qualified Throughput HEAVY):  2428.90 tpm

TRANSACTION MIX

Total number of transactions = 405502
TYPE          TX. COUNT     MIX
----          ---------     ---
Light:            67560   16.66%
Medium:          144437   35.62%
DSS:              45028   11.10%
Heavy:           121445   29.95%
Connection:       27032    6.67%

RESPONSE TIMES       AVG.    MAX.    90TH
Light               0.596   3.735   0.800
Medium              0.601   3.748   0.800
DSS                 0.601   3.695   0.040
Heavy               0.597   3.725   4.000
Connections         0.599   3.445   0.800

Number of users = 105
Sum of Avg. RT * TPS for all Tx. Types = 66.419466

----------------------------

My conclusion is that the new implementation is about 8% slower in an
OLTP workload.

	Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql
Zdenek Kotala wrote:
> My conclusion is that the new implementation is about 8% slower in an
> OLTP workload.

Thanks. That's very disappointing :-(

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Zdenek Kotala wrote:
> My conclusion is that the new implementation is about 8% slower in an
> OLTP workload.

Can you do some analysis of why that is?

Looks like I need to blow the dust off my DBT-2 test rig and try to
reproduce that as well.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas wrote:
> Zdenek Kotala wrote:
>> My conclusion is that the new implementation is about 8% slower in an
>> OLTP workload.
>
> Can you do some analysis of why that is?

I'll try something, but I cannot guarantee results.

	Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Zdenek Kotala wrote:
>> My conclusion is that the new implementation is about 8% slower in an
>> OLTP workload.

> Thanks. That's very disappointing :-(

One thing that jumped out at me is that you call FreeSpaceMapExtendRel
every time a rel is extended by even one block.  I admit I've not
studied the data structure in any detail yet, but surely most such calls
end up being a no-op?  Seems like some attention to making a fast path
for that case would be helpful.

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Zdenek Kotala wrote:
>>> My conclusion is that the new implementation is about 8% slower in
>>> an OLTP workload.
>
>> Thanks. That's very disappointing :-(
>
> One thing that jumped out at me is that you call FreeSpaceMapExtendRel
> every time a rel is extended by even one block.  I admit I've not
> studied the data structure in any detail yet, but surely most such
> calls end up being a no-op?  Seems like some attention to making a
> fast path for that case would be helpful.

Yes, most of those calls end up being no-ops, which is exactly why I
would be surprised if they made any difference. It does call
smgrnblocks(), though, which isn't completely free...

Zdenek, can you say off the top of your head whether the test was I/O
bound or CPU bound? What was the CPU utilization % during the test?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> One thing that jumped out at me is that you call FreeSpaceMapExtendRel
>> every time a rel is extended by even one block.

> Yes, most of those calls end up being no-ops, which is exactly why I
> would be surprised if they made any difference. It does call
> smgrnblocks(), though, which isn't completely free...

No, it's a kernel call (at least one), which makes it pretty expensive.

I wonder whether it's necessary to do FreeSpaceMapExtendRel at this
point at all?  Why not lazily extend the map when you are told to store
a nonzero space category for a page that's off the end of the map?
Whether or not this saved many cycles overall, it'd push most of the map
extension work to VACUUM instead of having it happen in the foreground.

A further refinement would be to extend the map only for a space
category "significantly" greater than zero --- maybe a quarter page or
so.  For an insert-only table that would probably result in the map
never growing at all, which might be nice.  However, it would go back to
the concept of the FSM being lossy; I forget whether you were hoping to
get away from that.

			regards, tom lane
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> One thing that jumped out at me is that you call FreeSpaceMapExtendRel
>> every time a rel is extended by even one block.  I admit I've not
>> studied the data structure in any detail yet, but surely most such
>> calls end up being a no-op?  Seems like some attention to making a
>> fast path for that case would be helpful.
>
> Yes, most of those calls end up being no-ops, which is exactly why I
> would be surprised if they made any difference. It does call
> smgrnblocks(), though, which isn't completely free...

That is not the problem; it is really strange. I'm using DTrace to count
the number of calls, and the number of calls is really small (I monitor
only one backend). I have also removed WAL logging and it does not help
either.

> Zdenek, can you say off the top of your head whether the test was I/O
> bound or CPU bound? What was the CPU utilization % during the test?

The CPU is not the problem; it is mostly idle:
-bash-3.00# iostat 5
   tty         sd1           ssd0          ssd1          nfs1         cpu
 tin tout  kps tps serv  kps tps serv  kps tps serv  kps tps serv  us sy wt id
   0    1    0   0    1    9   1   92    0   0    0    0   0    0   0  0  0 100
   0   47    0   0    0  894 111    7    0   0    0    0   0    0   2  1  0  97
   0   16    0   0    0  949 118    6    0   0    0    0   0    0   2  2  0  97
   0   16    0   0    0  965 120    6    0   0    0    0   0    0   2  1  0  97
   0   16    0   0    0  981 122    7    0   0    0    0   0    0   2  2  0  96
   0   16    0   0    0  944 118    6    0   0    0    0   0    0   2  1  0  97
   0   16    0   0    0 1202 149    7    0   0    0    0   0    0   3  2  0  95
   0   16    0   0    0 1261 157    9    0   0    0    0   0    0   3  2  0  95
   0   16    0   0    0 1357 168   14    0   0    0    0   0    0   3  2  0  95
   0   16    0   0    0 1631 201   33    0   0    0    0   0    0   2  2  0  96
   0   16    0   0    0 1973 246   48    0   0    0    0   0    0   2  2  0  96
   0   16    0   0    0 2008 251   50    0   0    0    0   0    0   2  2  0  97
   0   16    0   0    0 1956 241   45    0   0    0    0   0    0   2  2  0  97
   0   16    0   0    0 2003 250   49    0   0    0    0   0    0   2  2  0  97

-bash-3.00# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap     free   re  mf pi po fr de sr s1 sd sd --   in   sy   cs us sy id
 0 0 0 28091000 31640552  3   4  0  0  0  0  0  0  1  0  0  359   72  206  0  0 100
 0 0 0 27363144 27614576  3  28  0 16 16  0  0  0 60  0  0 1216 1134 1072  1  1  99
 0 0 0 27363144 27614568  8   0  0 16 16  0  0  0 52  0  0 1099 1029  964  0  1  98
 0 0 0 27363144 27614560  9   0  0  8  8  0  0  0 53  0  0 1143  896 1009  1  1  98
 0 0 0 27363144 27614544  1 241  0 16 16  0  0  0 46  0  0 1042 1105  895  0  1  98
 0 0 0 27363144 27614544  0   0  0 16 16  0  0  0 50  0  0 1078  860  924  0  0  99
 0 0 0 27363144 27614552 10   0  0 16 16  0  0  0 56  0  0 1177  914 1033  1  1  98
 0 0 0 27363144 27614536  0   0  0  8  8  0  0  0 25  0  0  726  554  603  0  0  99
 0 0 0 27363144 27614528  1   0  0 16 16  0  0  0 65  0  0 1206 1159 1081  1  1  98
 0 0 0 27363144 27614512 13   0  0 16 16  0  0  0 63  0  0 1256 1088 1094  1  1  99
 0 0 0 27363144 27614512  0   0  0  8  8  0  0  0 37  0  0  920  797  779  0  1  99
 0 0 0 27363144 27614504  6   0  0 16 16  0  0  0 58  0  0 1218 1074 1078  1  0  99
 0 0 0 27363144 27614488 85  91  0 16 16  0  0  0 45  0  0  973 1344  833  1  1  99
 0 0 0 27363144 27614488  2   0  0 16 16  0  0  0 57  0  0 1164 1023 1036  1  1  99
 0 0 0 27363144 27614472  4   0  0  8  8  0  0  0 47  0  0 1133  937  957  0  1  99

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql
Zdenek Kotala wrote:
> Heikki Linnakangas wrote:
>> Zdenek Kotala wrote:
>>> My conclusion is that the new implementation is about 8% slower in
>>> an OLTP workload.
>>
>> Can you do some analysis of why that is?

I tested it several times, and the last test was a surprise for me. I
ran the original server (with the old FSM) on a database which had been
created by the new server (with the new FSM), and the performance was
similar (maybe the new implementation is even a little bit better):

MQThL (Maximum Qualified Throughput LIGHT):  1348.90 tpm
MQThM (Maximum Qualified Throughput MEDIUM): 2874.76 tpm
MQThH (Maximum Qualified Throughput HEAVY):  2422.20 tpm

The question is why? There could be two reasons for that. One is related
to the OS/FS or HW: the filesystem could be fragmented, or the HDD could
be slower in some part...

The second idea is that the new FSM creates heavily fragmented data, and
an index scan needs to jump from one page to another too often.

Thoughts?

	Zdenek

PS: I'm leaving now and I will be online on Monday.

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql
Zdenek Kotala wrote:
> I tested it several times, and the last test was a surprise for me. I
> ran the original server (with the old FSM) on a database which had
> been created by the new server (with the new FSM), and the performance
> was similar (maybe the new implementation is even a little bit
> better):
>
> MQThL (Maximum Qualified Throughput LIGHT):  1348.90 tpm
> MQThM (Maximum Qualified Throughput MEDIUM): 2874.76 tpm
> MQThH (Maximum Qualified Throughput HEAVY):  2422.20 tpm
>
> The question is why? There could be two reasons for that. One is
> related to the OS/FS or HW: the filesystem could be fragmented, or the
> HDD could be slower in some part...

Ugh. Could it be autovacuum kicking in at different times? Do you get
any metrics other than the TPM out of it?

> The second idea is that the new FSM creates heavily fragmented data,
> and an index scan needs to jump from one page to another too often.

Hmm. That's remotely plausible, I suppose. The old FSM only kept track
of pages with more than the average request size of free space, but the
new FSM tracks even the smallest free spots. Are there tables in that
workload that are inserted into with widely varying row widths?

FWIW, I just got my first 2h DBT-2 results, and I'm seeing no difference
at all in the overall performance or behavior during the test.
Autovacuum doesn't kick in in those short tests, though, so I have
scheduled a pair of 4h tests, and might run even longer tests over the
weekend.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Zdenek Kotala wrote:
>> The second idea is that the new FSM creates heavily fragmented data,
>> and an index scan needs to jump from one page to another too often.

> Hmm. That's remotely plausible, I suppose. The old FSM only kept track
> of pages with more than the average request size of free space, but
> the new FSM tracks even the smallest free spots. Are there tables in
> that workload that are inserted into with widely varying row widths?

I'm not sure I buy that either.  But after thinking a bit about how
search_avail() works, it occurs to me that it's not doing what the old
code did, and that might contribute to contention.  The old FSM did a
cyclic search through the pages it knew about, so as long as there were
plenty of pages with "enough" free space, different backends would
always get pointed to different pages.  But consider what the algorithm
is now.  (For simplicity, consider only the behavior on a leaf FSM
page.)

* Starting from the "next" slot, bubble up to parent nodes until
  finding a parent showing enough space.

* Descend to the *leftmost* leaf child of that parent that has enough
  space.

* Point "next" to the slot after that, and return that page.

What this means is that if we start with "next" pointing at a page
without enough space (quite likely, considering that we now index all
pages, not only those with free space), then it is highly possible that
the search will end on a page *before* where "next" was.  The most
trivial case is that we have an even-numbered page with a lot of free
space and its odd-numbered successor has none --- in this case, far from
spreading out the backends, all comers will be handed back that same
page!  (Until someone reports that it's full.)  In general it seems that
this behavior will tend to concentrate the returned pages in a small
area rather than allowing them to range over the whole FSM page as was
intended.
So the bottom line is that the "next" addition doesn't actually work and
needs to be rethought.  It might be possible to salvage it by paying
attention to "next" during the descent phase and preferentially trying
to descend to the right of "next"; but I'm not quite sure how to make
that work efficiently, and even less sure how to wrap around cleanly
when the starting value of "next" is near the last slot on the page.

			regards, tom lane
Heikki Linnakangas wrote:
> Zdenek Kotala wrote:
>> The question is why? There could be two reasons for that. One is
>> related to the OS/FS or HW: the filesystem could be fragmented, or
>> the HDD could be slower in some part...
>
> Ugh. Could it be autovacuum kicking in at different times? Do you get
> any metrics other than the TPM out of it?

I don't think it is an autovacuum problem. I ran the test several times
and the result was the same. But today I created a fresh database and I
got similar throughput for the original and the new FSM implementation.
It seems to me that I hit a HW/OS singularity. I'll verify it tomorrow.

I noticed only a slight slowdown during index creation (4:11 min vs.
3:47 min), but I tested it only once.

	Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql
Tom Lane wrote:
> What this means is that if we start with "next" pointing at a page
> without enough space (quite likely, considering that we now index all
> pages, not only those with free space), then it is highly possible
> that the search will end on a page *before* where "next" was.  The
> most trivial case is that we have an even-numbered page with a lot of
> free space and its odd-numbered successor has none --- in this case,
> far from spreading out the backends, all comers will be handed back
> that same page!  (Until someone reports that it's full.)  In general
> it seems that this behavior will tend to concentrate the returned
> pages in a small area rather than allowing them to range over the
> whole FSM page as was intended.

Good point.

> So the bottom line is that the "next" addition doesn't actually work
> and needs to be rethought.  It might be possible to salvage it by
> paying attention to "next" during the descent phase and preferentially
> trying to descend to the right of "next"; but I'm not quite sure how
> to make that work efficiently, and even less sure how to wrap around
> cleanly when the starting value of "next" is near the last slot on the
> page.

Yeah, I think it can be salvaged like that; see the patch I just posted
in a separate thread.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com