Thread: Interpreting vmstat
Hello,
(note: best viewed in a fixed-width font)

I'm still trying to find where my performance bottleneck is... I have 4G RAM, PG 7.3.4,
shared_buffers = 75000
effective_cache_size = 75000

Run a query I've been having trouble with and watch the output of vmstat (Linux):

$ vmstat 1
   procs                      memory      swap          io     system         cpu
 r  b  w   swpd   free   buff   cache   si   so   bi    bo   in    cs us sy id
 0  0  0    148   8732 193652 2786668    0    0    0     0  292   151  0  2 98
 2  0  2    148   7040 193652 2786668    0    0    0   208  459   697 45 10 45
 0  0  0    148   9028 193652 2786684    0    0   16   644  318   613 25  4 71
 1  0  0    148   5092 193676 2780196    0    0   12   184  441   491 37  5 58
 0  1  0    148   5212 193684 2772512    0    0  112  9740  682  1063 45 12 43
 1  0  0    148   5444 193684 2771584    0    0  120  4216  464  1303 44  3 52
 1  0  0    148  12232 193660 2771620    0    0  244   628  340   681 43 20 38
 1  0  0    148  12168 193664 2771832    0    0  196   552  332   956 42  2 56
 1  0  0    148  12080 193664 2772248    0    0  272   204  371   201 40  1 59
 1  1  0    148  12024 193664 2772624    0    0  368     0  259   127 42  3 55

That's the first 10 lines or so... the query takes 60 seconds to run.

I'm confused about the bi & bo columns of the io section:
IO
  bi: Blocks sent to a block device (blocks/s).
  bo: Blocks received from a block device (blocks/s).

yet it seems to be the opposite of that... bi only increases when doing a
largish query, while bo also goes up. I typically see periodic bo numbers in
the low 100s, which I'd guess are log writes.

I would think that my entire DB should end up cached, since a raw pg_dump
file is about 1G in size, yet my performance doesn't indicate that this is
the case... running the same query a few minutes later, I'm not seeing a
significant performance improvement.

Here's a sample from iostat while the query is running:

$ iostat -x -d 1
Device:  rrqm/s wrqm/s  r/s    w/s  rsec/s  wsec/s avgrq-sz    avgqu-sz  await  svctm  %util
sda        0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949552.96   0.00   0.00 100.00
sda1       0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949662.96   0.00   0.00 100.00
sda2       0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949642.96   0.00   0.00 100.00
sdb        0.00 428.00 0.00 116.00    0.00 4368.00    37.66     2844.40 296.55  86.21 100.00
sdb1       0.00 428.00 0.00 116.00    0.00 4368.00    37.66     6874.40 296.55  86.21 100.00

Device:  rrqm/s wrqm/s  r/s    w/s  rsec/s  wsec/s avgrq-sz    avgqu-sz  await  svctm  %util
sda        0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949552.96   0.00   0.00 100.00
sda1       0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949662.96   0.00   0.00 100.00
sda2       0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949642.96   0.00   0.00 100.00
sdb        4.00 182.00 6.00  77.00   80.00 2072.00    25.93     2814.50  54.22 120.48 100.00
sdb1       4.00 182.00 6.00  77.00   80.00 2072.00    25.93     6844.50  54.22 120.48 100.00

Device:  rrqm/s wrqm/s  r/s    w/s  rsec/s  wsec/s avgrq-sz    avgqu-sz  await  svctm  %util
sda        0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949552.96   0.00   0.00 100.00
sda1       0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949662.96   0.00   0.00 100.00
sda2       0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949642.96   0.00   0.00 100.00
sdb        0.00  43.00 0.00  11.00    0.00  432.00    39.27     2810.40  36.36 909.09 100.00
sdb1       0.00  43.00 0.00  11.00    0.00  432.00    39.27     6840.40  36.36 909.09 100.00

Device:  rrqm/s wrqm/s  r/s    w/s  rsec/s  wsec/s avgrq-sz    avgqu-sz  await  svctm  %util
sda        0.00  15.84 0.00  17.82    0.00  269.31    15.11 42524309.47  44.44 561.11 100.00
sda1       0.00  15.84 0.00  17.82    0.00  269.31    15.11 42524419.47  44.44 561.11 100.00
sda2       0.00   0.00 0.00   0.00    0.00    0.00     0.00 42524398.67   0.00   0.00 100.00
sdb        0.99 222.77 0.99 114.85   15.84 2700.99    23.45     2814.16  35.90  86.32 100.00
sdb1       0.99 222.77 0.99 114.85   15.84 2700.99    23.45     6844.16  35.90  86.32 100.00

Device:  rrqm/s wrqm/s  r/s    w/s  rsec/s  wsec/s avgrq-sz    avgqu-sz  await  svctm  %util
sda        0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949551.76   0.00   0.00 101.00
sda1       0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949662.86   0.00   0.00 101.00
sda2       0.00   0.00 0.00   0.00    0.00    0.00     0.00 42949642.66   0.00   0.00 101.00
sdb        1.00  91.00 1.00  28.00   16.00  960.00    33.66     2838.40  10.34 348.28 101.00
sdb1       1.00  91.00 1.00  28.00   16.00  960.00    33.66     6908.70  10.34 348.28 101.00

The DB files and logs are on sdb1.

Can someone point me in the direction of some documentation on how to
interpret these numbers?

Also, I've tried to figure out what's getting cached by PostgreSQL by looking
at pg_statio_all_tables. What kind of ratio should I be seeing for
heap_blks_read / heap_blks_hit?

Thanks.
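For reference, one way to pull that ratio per table is a query along these
lines against pg_statio_all_tables. This is only a sketch, and "mydb" stands
in for the actual database name:

$ psql -d mydb <<'EOF'
-- per-table heap cache statistics; hit_fraction = hits / (hits + reads)
SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_fraction
  FROM pg_statio_all_tables
 ORDER BY heap_blks_read + heap_blks_hit DESC
 LIMIT 20;
EOF

A hit_fraction close to 1.0 means most heap block requests are being served
from shared buffers; much lower numbers on the large tables suggest the
working set is not staying cached.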
Well, since I haven't seen any other responses, I'll offer a bit of advice
and let others correct me. :)

Your shared buffers may be too big (?). It is much larger than the guide on
varlena.com recommends. All I can suggest is trying some experiments with
halving/doubling the numbers to see which way performance goes.

Also, if you are counting on the cache to improve performance, then the DB
has to be loaded into cache the first time. So, are subsequent re-queries
faster?

Thom Dyson
Director of Information Services
Sybex, Inc.

pgsql-performance-owner@postgresql.org wrote on 05/18/2004 11:12:14 AM:

> Hello,
> (note: best viewed in a fixed-width font)
>
> I'm still trying to find where my performance bottleneck is...
> I have 4G RAM, PG 7.3.4
> shared_buffers = 75000
> effective_cache_size = 75000
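For anyone wanting to try the halving/doubling experiment Thom suggests, it
boils down to an edit-and-restart loop; a rough sketch, assuming the data
directory is /var/lib/pgsql/data and the problem query is saved in
slow_query.sql (both are placeholders, adjust for your install):

$ # set shared_buffers = 37500 (half of 75000) in postgresql.conf
$ vi /var/lib/pgsql/data/postgresql.conf
$ pg_ctl -D /var/lib/pgsql/data restart
$ # mydb and slow_query.sql stand in for your database and the query under test
$ time psql -d mydb -f slow_query.sql > /dev/null

shared_buffers only changes on a full restart (a reload is not enough), and
the first run after each restart has to repopulate shared buffers, so time
the second or third run as well before comparing settings.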
On Tue, 2004-05-18 at 14:12, Doug Y wrote:
> Run a query I've been having trouble with and watch the output of vmstat
> (Linux):
>
> $ vmstat 1
>    procs                      memory      swap          io     system         cpu
>  r  b  w   swpd   free   buff   cache   si   so   bi    bo   in    cs us sy id
>  0  0  0    148   8732 193652 2786668    0    0    0     0  292   151  0  2 98
>  2  0  2    148   7040 193652 2786668    0    0    0   208  459   697 45 10 45
>  0  0  0    148   9028 193652 2786684    0    0   16   644  318   613 25  4 71
>  1  0  0    148   5092 193676 2780196    0    0   12   184  441   491 37  5 58
>  0  1  0    148   5212 193684 2772512    0    0  112  9740  682  1063 45 12 43
>  1  0  0    148   5444 193684 2771584    0    0  120  4216  464  1303 44  3 52
>  1  0  0    148  12232 193660 2771620    0    0  244   628  340   681 43 20 38
>  1  0  0    148  12168 193664 2771832    0    0  196   552  332   956 42  2 56
>  1  0  0    148  12080 193664 2772248    0    0  272   204  371   201 40  1 59
>  1  1  0    148  12024 193664 2772624    0    0  368     0  259   127 42  3 55
>
> That's the first 10 lines or so... the query takes 60 seconds to run.
>
> I'm confused about the bi & bo columns of the io section:
> IO
>   bi: Blocks sent to a block device (blocks/s).
>   bo: Blocks received from a block device (blocks/s).
>
> yet it seems to be the opposite of that... bi only increases when doing a
> largish query, while bo also goes up. I typically see periodic bo numbers
> in the low 100s, which I'd guess are log writes.
>
> I would think that my entire DB should end up cached, since a raw pg_dump
> file is about 1G in size, yet my performance doesn't indicate that this is
> the case... running the same query a few minutes later, I'm not seeing a
> significant performance improvement.
>

Been meaning to try and address this thread, since it touches on one of the
areas that I think is sorely lacking in the PostgreSQL admin knowledge base:
how to use various unix commands to deduce performance information. This
would seem even more important given that PostgreSQL admins are expected to
use said tools to find out information that the commercial databases provide
to their users... but alas, this is -performance and not -advocacy, so let me
get on with it, eh?

As you noted, bi and bo are actually reversed, and I believe if you search
the kernel hackers' mailing lists you'll find references to this... here's
some more empirical evidence though; the following vmstat was taken from a
high-write-traffic, monitoring-type database application...
   procs                      memory      swap          io     system         cpu
 r  b  w   swpd   free   buff   cache   si   so   bi    bo   in    cs us sy id
 1  0  0  27412 593336 112036 1865936    0    1    0     1    1     0  1  0  0
 5  1  1  27412 593336 112036 1865952    0    0    0   477  600  1346 53  7 40
 4  0  0  27412 593336 112036 1865960    0    0    0  1296  731  2087 47  5 48
 3  3  2  27412 594408 112052 1865972    4    0    4  2973  904  2957 32 20 48
 3  1  1  26596 594544 112068 1865976   64    0   64  1433  770  2766 41 22 37
 1  1  1  26596 594544 112072 1866004    0    0    5   959  702  1687 50 10 41
 3  1  1  26596 594512 112072 1866024    0    0    0  1155  731  2209 52 12 37
 2  0  0  26596 594512 112072 1866040    0    0    0   635  511  1293 48  5 46
 0  1  1  26596 594472 112076 1866076    0    0    7   739  551  1248 49  8 43
 1  0  0  26596 594472 112076 1866088    0    0    0  1048  598  1295 49  8 43
 2  0  0  26596 594208 112084 1866696    0    0  203  1253  686  1506 42 16 41
 1  0  0  26596 593920 112084 1866716    0    0    0  1184  599  1329 39 12 49
 0  1  1  26596 593060 112084 1866740    0    0    3  1036  613  3442 48  8 44
 0  1  2  26596 592920 112084 1866752    0    0    0  3825  836  1323  9 14 76
 0  0  0  26596 593544 112084 1866788    0    0    0  1064  625  1197  9 15 76
 0  1  1  26596 596300 112088 1866808    0    0    0   747  625  1558  7 13 79
 0  0  1  26596 599600 112100 1866892    0    0    0   468  489  1331  6  4 91
 0  0  0  26596 599600 112100 1866896    0    0    0   237  418   997  5  4 91
 0  1  1  26596 599600 112104 1866896    0    0    0  1063  582  1371  7  7 86
 0  0  0  26596 599612 112104 1866904    0    0    0   561  648  1556  6  4 89

Notice all the bo as it continually writes data to disk. Also notice how,
generally speaking, it has no bi, since it does not have to pull data up from
disk... and the couple of times it does grab information from swap space,
you'll also find a corresponding pull on the io.

Getting back to your issue: to determine whether there is a problem in this
case, run EXPLAIN ANALYZE on the query a few times in a row, note the
relative timings of those runs, then come back 5-10 minutes later, run
EXPLAIN ANALYZE again, and see if the results are drastically different.
Watch vmstat while you do this to see whether any bi is occurring. I should
probably mention that just because you see activity on bi doesn't mean you'll
notice any difference in performance versus running the query with no bi;
that depends on a number of factors.

Oh, and as the other poster alluded to, knock down your shared buffers by
about 50% and see where that gets you. I might also knock *up* your
effective cache size... try doubling that and see how things go.

Hope this helps... and others, jump in with corrections if needed.

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
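For anyone wanting to run the experiment Robert describes, a rough shell
recipe follows; mydb is a placeholder for the database name and
slow_query.sql a placeholder for a file containing "EXPLAIN ANALYZE <the
problem query>;":

$ vmstat 1 > vmstat.out &                                          # capture bi/bo while the queries run
$ for i in 1 2 3; do psql -d mydb -f slow_query.sql; done > first_runs.out
$ sleep 600                                                        # come back 5-10 minutes later
$ for i in 1 2 3; do psql -d mydb -f slow_query.sql; done > later_runs.out
$ kill %1                                                          # stop the background vmstat

Compare the "Total runtime" lines from EXPLAIN ANALYZE across the two
batches, and check the bi column in vmstat.out: if the later runs still show
large bi spikes, the data is not staying cached between runs.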