Thread: Re: pgstattuple extension for indexes
Hi Nagayasu san and folks, I have a question about pgstatindex. Satoshi Nagayasu <nagayasus@nttdata.co.jp> wrote: > Attached patch has been cleaned up, > and modified to be able to work with CVS HEAD. Index leaf pages are ordered just after REINDEX. [1] [2] [3] After full-split, they will be the following: [1] [3] [5] [2] [4] [6] because new pages are allocated at the end of the index file. I think this condition should be regarded as full-fragmented, but pgstatindex reports that the leaf_fragmentation is 50%. Presently, fragmentation factor is computed as the code below: if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno) stat->fragments++; But the method has the above problem. So I suggest to use whether the right link points to the next adjacent page or not. if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1) stat->fragments++; Do you think which method is better? Or do you have other ideas? Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Thu, Aug 17, 2006 at 12:55:28PM +0900, ITAGAKI Takahiro wrote: > I think this condition should be regarded as full-fragmented, > but pgstatindex reports that the leaf_fragmentation is 50%. > > Presently, fragmentation factor is computed as the code below: > > if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno) > stat->fragments++; > > But the method has the above problem. So I suggest to use whether > the right link points to the next adjacent page or not. > > if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1) > stat->fragments++; > > Do you think which method is better? Or do you have other ideas? If we do it your way, then every index will probably get a fragmentation of nearly 100%. That's not very useful. I don't agree that your example is fully fragmented, since on the first read the OS will read the next four (or more) blocks so all the others are zero-cost. A more useful definition of fragmentation would be: if you're scanning forward through an index, how often do you have to jump backwards again. The current calculation seems to get that fairly right... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
On Thu, Aug 17, 2006 at 02:23:48PM +0200, Martijn van Oosterhout wrote: > On Thu, Aug 17, 2006 at 12:55:28PM +0900, ITAGAKI Takahiro wrote: > > I think this condition should be regarded as full-fragmented, > > but pgstatindex reports that the leaf_fragmentation is 50%. > > > > Presently, fragmentation factor is computed as the code below: > > > > if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno) > > stat->fragments++; > > > > But the method has the above problem. So I suggest to use whether > > the right link points to the next adjacent page or not. > > > > if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1) > > stat->fragments++; > > > > Do you think which method is better? Or do you have other ideas? > > If we do it your way, then every index will probably get a > fragmentation of nearly 100%. That's not very useful. I don't agree > that your example is fully fragmented, since on the first read the OS > will read the next four (or more) blocks so all the others are > zero-cost. Ok, fine... expand the example out to an index that's not trivial in size. Even with read-ahead, once you get to a few megs (which is obviously not that big), you're seeking. > A more useful definition of fragmentation would be: if you're scanning > forward through an index, how often do you have to jump backwards > again. The current calculation seems to get that fairly right... Well, what really matters is how often you'll need to seek (either intra-track or inter-track). Any time you need to seek you're hit with a pretty serious penalty. And until we have an asyncronous prefetch process, a forward seek will most likely be just as expensive as a backwards seek, because by the time the data winds it's way from the drive back to PostgreSQL and the next read request winds it's way back down to the drive the data you wanted has probably flown past the head. Granted, I'm ignoring OS read-ahead here, but in a heavily fragmented index that you're actually reading off disk (ie: it's not trivially small), that read-ahead isn't likely to help you too much. Given all that, I'd argue that it's best to consider any page that isn't in-order as another fragment. On another note, now that scans happen at a per-page level, does that make some kind of online index clustering command a possibility? Another thought that comes to mind is putting enough brains in the indexes or the FSM to request free pages that are in a specific region of the index file. That would allow things to stay less fragmented. Of course a similar method could be used to try and maintain a table heap in cluster order, and I suspect that method would probably be a lot easier to impliment. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Thu, Aug 17, 2006 at 02:54:20PM -0500, Jim C. Nasby wrote: > On Thu, Aug 17, 2006 at 02:23:48PM +0200, Martijn van Oosterhout wrote: > > On Thu, Aug 17, 2006 at 12:55:28PM +0900, ITAGAKI Takahiro wrote: > > > But the method has the above problem. So I suggest to use whether > > > the right link points to the next adjacent page or not. > > > > > > if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1) > > > stat->fragments++; > > > > > > Do you think which method is better? Or do you have other ideas? > > Ok, fine... expand the example out to an index that's not trivial in > size. Even with read-ahead, once you get to a few megs (which is > obviously not that big), you're seeking. Well, mostly I'm just saying that only matching on the next block number is going to give unrealistically low numbers. We can't ignore OS level caching, the way Postgres works relies on it in many ways. I'd suggest something like: btpo_next between blkno and blkno + X, to try to take into account caching. But I'm not sure that will give numbers significantly different from what is already generated. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
ITAGAKI Takahiro wrote: > But the method has the above problem. So I suggest to use whether > the right link points to the next adjacent page or not. > > if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1) > stat->fragments++; Well, in that way, following two conditions, [1] [x] [2] [y] [3] and [3] [x] [2] [y] [1] will be calculated as same fragmentation ratio(100%), I can't agree with that, because both will generate different costs while index scan in the real world (I don't care about page splitting algorithm now). If we think 'fragmentation' more strictly, the fragmentation ratio should be calculated with 'distance' and 'direction' of the block ordering and positions, because [1] [x] [y] [z] [2] and [2] [x] [y] [1] [z] have different costs each. However, in such way, if I get '57.6%' as a fragmentation radio, what does it mean? What can I do next? Two cases (forward ordered blocks with some gaps, and backward ordered blocks with some gaps) are clearly different, but will result same radios. Understanding and estimating real cost of the index scan is difficult. So I want to think 'fragmentation radio' simply, "How many backward seeks will occur while your index scan?". I guess, in some cases, people will want to know more detailed information, but most people need a tool which is easy to use and easy to understand. And I believe present calculation is good enough. -- NAGAYASU Satoshi <nagayasus@nttdata.co.jp> Phone: +81-3-3523-8122
Satoshi Nagayasu <nagayasus@nttdata.co.jp> wrote: > Well, in that way, following two conditions, > [1] [x] [2] [y] [3] > and > [3] [x] [2] [y] [1] > will be calculated as same fragmentation ratio(100%), I can't agree > with that, because both will generate different costs while index scan > in the real world (I don't care about page splitting algorithm now). I think the calculations (100%) are appropriate, because we should do REINDEX in both case. Supposing to the sizes of [x], [y] are mega or giga bytes, the order is not important; we have to do large seeks in both case. In addition, the latter case rarely happens in real world, isn't it? > However, in such way, if I get '57.6%' as a fragmentation radio, > what does it mean? What can I do next? I think the information of fragmentations are probably not the most important; the information users want to know are "When to do REINDEX?" and "How to set the fillfactor?". I hope you to write how to interpret the framgentation (and other) info in README. In my understanding, I'll write "You'd better do REINDEX when you see the fragmentation is greater than 50%" under the present calculation method. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
ITAGAKI Takahiro wrote: > Satoshi Nagayasu <nagayasus@nttdata.co.jp> wrote: > >> Well, in that way, following two conditions, >> [1] [x] [2] [y] [3] >> and >> [3] [x] [2] [y] [1] >> will be calculated as same fragmentation ratio(100%), I can't agree >> with that, because both will generate different costs while index scan >> in the real world (I don't care about page splitting algorithm now). > > I think the calculations (100%) are appropriate, because we should do > REINDEX in both case. Supposing to the sizes of [x], [y] are mega or giga > bytes, the order is not important; we have to do large seeks in both case. I don't think so. A few blocks forward skip while scan can be reasonable and acceptable (of course, it's case by case). BTW, What does 'large seeks' mean? Seeking a few blocks, hundred of blocks and millions of blocks are not same, I think. Are they same for you? >> However, in such way, if I get '57.6%' as a fragmentation radio, >> what does it mean? What can I do next? > > I think the information of fragmentations are probably not > the most important; the information users want to know are > "When to do REINDEX?" and "How to set the fillfactor?". Agreed. So, I'm just counting backward seeks simply for the fragmentation ratio. It means 'the mismatch radio between logical order and physical order of the blocks'. > I hope you to write how to interpret the framgentation (and other) info > in README. In my understanding, I'll write "You'd better do REINDEX when > you see the fragmentation is greater than 50%" under the present > calculation method. I can't understand why you want to make such decision, because you're thinking the fragmentation information is not the most important for the users, aren't you? -- NAGAYASU Satoshi <nagayasus@nttdata.co.jp> Phone: +81-3-3523-8122
Satoshi Nagayasu <nagayasus@nttdata.co.jp> wrote: > > I hope you to write how to interpret the framgentation (and other) info > > in README. In my understanding, I'll write "You'd better do REINDEX when > > you see the fragmentation is greater than 50%" under the present > > calculation method. > > I can't understand why you want to make such decision, because you're > thinking the fragmentation information is not the most important for > the users, aren't you? Suppose a simple update case, for example, the accounts table in pgbench. The default fillfactor of btree indexes is 90%, so the leaf pages are fully split after we update 10-20% of tuples. But pgstatindex reports the fragmentation is 50% in such condition, but I think we should do REINDEX then. My decision came from this. The setting fillfactor=50% is better than the case with high fillfactor but all pages have split once, even if sizes of the indexes are same. I worry that users will misunderstand the 50% of fragmentation -- if the report says 100%, they'll consider to do REINDEX. But 50%, the necessity is unclear. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
ITAGAKI Takahiro wrote: > Suppose a simple update case, for example, the accounts table in pgbench. > The default fillfactor of btree indexes is 90%, so the leaf pages are > fully split after we update 10-20% of tuples. But pgstatindex reports > the fragmentation is 50% in such condition, but I think we should do > REINDEX then. My decision came from this. > > The setting fillfactor=50% is better than the case with high fillfactor > but all pages have split once, even if sizes of the indexes are same. > I worry that users will misunderstand the 50% of fragmentation -- if the > report says 100%, they'll consider to do REINDEX. But 50%, the necessity > is unclear. I think you should use 'average of page density' and 'number of leaf pages' in such case. It is more useful to know filling condition of the leaves. I've observed both while running pgbench, and the result is coming with the WEB+DB PRESS magazine in next Wednesday. :) Thanks. -- NAGAYASU Satoshi <nagayasus@nttdata.co.jp> Phone: +81-3-3523-8122
On Aug 17, 2006, at 4:10 PM, Martijn van Oosterhout wrote: > On Thu, Aug 17, 2006 at 02:54:20PM -0500, Jim C. Nasby wrote: >> On Thu, Aug 17, 2006 at 02:23:48PM +0200, Martijn van Oosterhout >> wrote: >>> On Thu, Aug 17, 2006 at 12:55:28PM +0900, ITAGAKI Takahiro wrote: >>>> But the method has the above problem. So I suggest to use whether >>>> the right link points to the next adjacent page or not. >>>> >>>> if (opaque->btpo_next != P_NONE && opaque->btpo_next != >>>> blkno + 1) >>>> stat->fragments++; >>>> >>>> Do you think which method is better? Or do you have other ideas? >> >> Ok, fine... expand the example out to an index that's not trivial in >> size. Even with read-ahead, once you get to a few megs (which is >> obviously not that big), you're seeking. > > Well, mostly I'm just saying that only matching on the next block > number is going to give unrealistically low numbers. We can't > ignore OS > level caching, the way Postgres works relies on it in many ways. While I agree that *users* must take caching into account, I don't think we should be fudging fragmentation numbers. For starters, we have absolutely no idea how much caching is actually happening. We should just report the raw numbers and let users draw their own conclusions. Doing otherwise makes the stat far less useful. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Fri, Aug 18, 2006 at 09:15:59AM +0900, Satoshi Nagayasu wrote: > ITAGAKI Takahiro wrote: > > But the method has the above problem. So I suggest to use whether > > the right link points to the next adjacent page or not. > > > > if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1) > > stat->fragments++; > > Well, in that way, following two conditions, > [1] [x] [2] [y] [3] > and > [3] [x] [2] [y] [1] > will be calculated as same fragmentation ratio(100%), I can't agree > with that, because both will generate different costs while index scan > in the real world (I don't care about page splitting algorithm now). What about just reporting the correlation of pages in the index, as well as fragmentation? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461