Thread: Re: pgstattuple extension for indexes

Re: pgstattuple extension for indexes

From
ITAGAKI Takahiro
Date:
Hi Nagayasu san and folks,
I have a question about pgstatindex.

Satoshi Nagayasu <nagayasus@nttdata.co.jp> wrote:

> Attached patch has been cleaned up,
> and modified to be able to work with CVS HEAD.

Index leaf pages are ordered just after REINDEX. [1] [2] [3]
After full-split, they will be the following: [1] [3] [5] [2] [4] [6]
because new pages are allocated at the end of the index file.

I think this condition should be regarded as full-fragmented,
but pgstatindex reports that the leaf_fragmentation is 50%.


Presently, fragmentation factor is computed as the code below:
   if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)       stat->fragments++;

But the method has the above problem. So I suggest to use whether
the right link points to the next adjacent page or not.
   if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1)       stat->fragments++;

Do you think which method is better? Or do you have other ideas?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: pgstattuple extension for indexes

From
Martijn van Oosterhout
Date:
On Thu, Aug 17, 2006 at 12:55:28PM +0900, ITAGAKI Takahiro wrote:
> I think this condition should be regarded as full-fragmented,
> but pgstatindex reports that the leaf_fragmentation is 50%.
>
> Presently, fragmentation factor is computed as the code below:
>
>     if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
>         stat->fragments++;
>
> But the method has the above problem. So I suggest to use whether
> the right link points to the next adjacent page or not.
>
>     if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1)
>         stat->fragments++;
>
> Do you think which method is better? Or do you have other ideas?

If we do it your way, then every index will probably get a
fragmentation of nearly 100%. That's not very useful. I don't agree
that your example is fully fragmented, since on the first read the OS
will read the next four (or more) blocks so all the others are
zero-cost.

A more useful definition of fragmentation would be: if you're scanning
forward through an index, how often do you have to jump backwards
again. The current calculation seems to get that fairly right...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: pgstattuple extension for indexes

From
"Jim C. Nasby"
Date:
On Thu, Aug 17, 2006 at 02:23:48PM +0200, Martijn van Oosterhout wrote:
> On Thu, Aug 17, 2006 at 12:55:28PM +0900, ITAGAKI Takahiro wrote:
> > I think this condition should be regarded as full-fragmented,
> > but pgstatindex reports that the leaf_fragmentation is 50%.
> > 
> > Presently, fragmentation factor is computed as the code below:
> > 
> >     if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
> >         stat->fragments++;
> > 
> > But the method has the above problem. So I suggest to use whether
> > the right link points to the next adjacent page or not.
> > 
> >     if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1)
> >         stat->fragments++;
> > 
> > Do you think which method is better? Or do you have other ideas?
> 
> If we do it your way, then every index will probably get a
> fragmentation of nearly 100%. That's not very useful. I don't agree
> that your example is fully fragmented, since on the first read the OS
> will read the next four (or more) blocks so all the others are
> zero-cost.

Ok, fine... expand the example out to an index that's not trivial in
size. Even with read-ahead, once you get to a few megs (which is
obviously not that big), you're seeking.

> A more useful definition of fragmentation would be: if you're scanning
> forward through an index, how often do you have to jump backwards
> again. The current calculation seems to get that fairly right...

Well, what really matters is how often you'll need to seek (either
intra-track or inter-track). Any time you need to seek you're hit with a
pretty serious penalty. And until we have an asyncronous prefetch
process, a forward seek will most likely be just as expensive as a
backwards seek, because by the time the data winds it's way from the
drive back to PostgreSQL and the next read request winds it's way back
down to the drive the data you wanted has probably flown past the head.
Granted, I'm ignoring OS read-ahead here, but in a heavily fragmented
index that you're actually reading off disk (ie: it's not trivially
small), that read-ahead isn't likely to help you too much.

Given all that, I'd argue that it's best to consider any page that isn't
in-order as another fragment.

On another note, now that scans happen at a per-page level, does that
make some kind of online index clustering command a possibility? Another
thought that comes to mind is putting enough brains in the indexes or
the FSM to request free pages that are in a specific region of the index
file. That would allow things to stay less fragmented. Of course a
similar method could be used to try and maintain a table heap in cluster
order, and I suspect that method would probably be a lot easier to
impliment.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: pgstattuple extension for indexes

From
Martijn van Oosterhout
Date:
On Thu, Aug 17, 2006 at 02:54:20PM -0500, Jim C. Nasby wrote:
> On Thu, Aug 17, 2006 at 02:23:48PM +0200, Martijn van Oosterhout wrote:
> > On Thu, Aug 17, 2006 at 12:55:28PM +0900, ITAGAKI Takahiro wrote:
> > > But the method has the above problem. So I suggest to use whether
> > > the right link points to the next adjacent page or not.
> > >
> > >     if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1)
> > >         stat->fragments++;
> > >
> > > Do you think which method is better? Or do you have other ideas?
>
> Ok, fine... expand the example out to an index that's not trivial in
> size. Even with read-ahead, once you get to a few megs (which is
> obviously not that big), you're seeking.

Well, mostly I'm just saying that only matching on the next block
number is going to give unrealistically low numbers. We can't ignore OS
level caching, the way Postgres works relies on it in many ways.

I'd suggest something like: btpo_next between blkno and blkno + X, to
try to take into account caching. But I'm not sure that will give
numbers significantly different from what is already generated.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: pgstattuple extension for indexes

From
Satoshi Nagayasu
Date:
ITAGAKI Takahiro wrote:
> But the method has the above problem. So I suggest to use whether
> the right link points to the next adjacent page or not.
>
>     if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1)
>         stat->fragments++;

Well, in that way, following two conditions,  [1] [x] [2] [y] [3]
and  [3] [x] [2] [y] [1]
will be calculated as same fragmentation ratio(100%), I can't agree
with that, because both will generate different costs while index scan
in the real world (I don't care about page splitting algorithm now).

If we think 'fragmentation' more strictly, the fragmentation ratio
should be calculated with 'distance' and 'direction' of the block
ordering and positions, because  [1] [x] [y] [z] [2]
and  [2] [x] [y] [1] [z]
have different costs each.

However, in such way, if I get '57.6%' as a fragmentation radio,
what does it mean? What can I do next? Two cases (forward ordered blocks
with some gaps, and backward ordered blocks with some gaps) are clearly
different, but will result same radios.

Understanding and estimating real cost of the index scan is difficult.
So I want to think 'fragmentation radio' simply,
"How many backward seeks will occur while your index scan?".

I guess, in some cases, people will want to know more detailed information,
but most people need a tool which is easy to use and easy to understand.
And I believe present calculation is good enough.

-- 
NAGAYASU Satoshi <nagayasus@nttdata.co.jp>
Phone: +81-3-3523-8122


Re: pgstattuple extension for indexes

From
ITAGAKI Takahiro
Date:
Satoshi Nagayasu <nagayasus@nttdata.co.jp> wrote:

> Well, in that way, following two conditions,
>    [1] [x] [2] [y] [3]
> and
>    [3] [x] [2] [y] [1]
> will be calculated as same fragmentation ratio(100%), I can't agree
> with that, because both will generate different costs while index scan
> in the real world (I don't care about page splitting algorithm now).

I think the calculations (100%) are appropriate, because we should do
REINDEX in both case. Supposing to the sizes of [x], [y] are mega or giga
bytes, the order is not important; we have to do large seeks in both case.
In addition, the latter case rarely happens in real world, isn't it? 


> However, in such way, if I get '57.6%' as a fragmentation radio,
> what does it mean? What can I do next?

I think the information of fragmentations are probably not 
the most important; the information users want to know are 
"When to do REINDEX?" and "How to set the fillfactor?".

I hope you to write how to interpret the framgentation (and other) info
in README. In my understanding, I'll write "You'd better do REINDEX when
you see the fragmentation is greater than 50%" under the present
calculation method.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: pgstattuple extension for indexes

From
Satoshi Nagayasu
Date:
ITAGAKI Takahiro wrote:
> Satoshi Nagayasu <nagayasus@nttdata.co.jp> wrote:
> 
>> Well, in that way, following two conditions,
>>    [1] [x] [2] [y] [3]
>> and
>>    [3] [x] [2] [y] [1]
>> will be calculated as same fragmentation ratio(100%), I can't agree
>> with that, because both will generate different costs while index scan
>> in the real world (I don't care about page splitting algorithm now).
> 
> I think the calculations (100%) are appropriate, because we should do
> REINDEX in both case. Supposing to the sizes of [x], [y] are mega or giga
> bytes, the order is not important; we have to do large seeks in both case.

I don't think so. A few blocks forward skip while scan can be reasonable
and acceptable (of course, it's case by case).

BTW, What does 'large seeks' mean? Seeking a few blocks, hundred of blocks
and millions of blocks are not same, I think. Are they same for you?

>> However, in such way, if I get '57.6%' as a fragmentation radio,
>> what does it mean? What can I do next?
> 
> I think the information of fragmentations are probably not 
> the most important; the information users want to know are 
> "When to do REINDEX?" and "How to set the fillfactor?".

Agreed.
So, I'm just counting backward seeks simply for the fragmentation ratio.
It means 'the mismatch radio between logical order and physical order
of the blocks'.

> I hope you to write how to interpret the framgentation (and other) info
> in README. In my understanding, I'll write "You'd better do REINDEX when
> you see the fragmentation is greater than 50%" under the present
> calculation method.

I can't understand why you want to make such decision, because you're
thinking the fragmentation information is not the most important for
the users, aren't you?
-- 
NAGAYASU Satoshi <nagayasus@nttdata.co.jp>
Phone: +81-3-3523-8122


Re: pgstattuple extension for indexes

From
ITAGAKI Takahiro
Date:
Satoshi Nagayasu <nagayasus@nttdata.co.jp> wrote:

> > I hope you to write how to interpret the framgentation (and other) info
> > in README. In my understanding, I'll write "You'd better do REINDEX when
> > you see the fragmentation is greater than 50%" under the present
> > calculation method.
> 
> I can't understand why you want to make such decision, because you're
> thinking the fragmentation information is not the most important for
> the users, aren't you?

Suppose a simple update case, for example, the accounts table in pgbench.
The default fillfactor of btree indexes is 90%, so the leaf pages are
fully split after we update 10-20% of tuples. But pgstatindex reports
the fragmentation is 50% in such condition, but I think we should do
REINDEX then. My decision came from this.

The setting fillfactor=50% is better than the case with high fillfactor
but all pages have split once, even if sizes of the indexes are same.
I worry that users will misunderstand the 50% of fragmentation -- if the
report says 100%, they'll consider to do REINDEX. But 50%, the necessity
is unclear.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: pgstattuple extension for indexes

From
Satoshi Nagayasu
Date:
ITAGAKI Takahiro wrote:
> Suppose a simple update case, for example, the accounts table in pgbench.
> The default fillfactor of btree indexes is 90%, so the leaf pages are
> fully split after we update 10-20% of tuples. But pgstatindex reports
> the fragmentation is 50% in such condition, but I think we should do
> REINDEX then. My decision came from this.
> 
> The setting fillfactor=50% is better than the case with high fillfactor
> but all pages have split once, even if sizes of the indexes are same.
> I worry that users will misunderstand the 50% of fragmentation -- if the
> report says 100%, they'll consider to do REINDEX. But 50%, the necessity
> is unclear.

I think you should use 'average of page density' and 'number of leaf pages'
in such case. It is more useful to know filling condition of the leaves.

I've observed both while running pgbench, and the result is coming with
the WEB+DB PRESS magazine in next Wednesday. :)

Thanks.
-- 
NAGAYASU Satoshi <nagayasus@nttdata.co.jp>
Phone: +81-3-3523-8122


Re: pgstattuple extension for indexes

From
Jim Nasby
Date:
On Aug 17, 2006, at 4:10 PM, Martijn van Oosterhout wrote:
> On Thu, Aug 17, 2006 at 02:54:20PM -0500, Jim C. Nasby wrote:
>> On Thu, Aug 17, 2006 at 02:23:48PM +0200, Martijn van Oosterhout  
>> wrote:
>>> On Thu, Aug 17, 2006 at 12:55:28PM +0900, ITAGAKI Takahiro wrote:
>>>> But the method has the above problem. So I suggest to use whether
>>>> the right link points to the next adjacent page or not.
>>>>
>>>>     if (opaque->btpo_next != P_NONE && opaque->btpo_next !=  
>>>> blkno + 1)
>>>>         stat->fragments++;
>>>>
>>>> Do you think which method is better? Or do you have other ideas?
>>
>> Ok, fine... expand the example out to an index that's not trivial in
>> size. Even with read-ahead, once you get to a few megs (which is
>> obviously not that big), you're seeking.
>
> Well, mostly I'm just saying that only matching on the next block
> number is going to give unrealistically low numbers. We can't  
> ignore OS
> level caching, the way Postgres works relies on it in many ways.

While I agree that *users* must take caching into account, I don't  
think we should be fudging fragmentation numbers. For starters, we  
have absolutely no idea how much caching is actually happening.

We should just report the raw numbers and let users draw their own  
conclusions. Doing otherwise makes the stat far less useful.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461




Re: pgstattuple extension for indexes

From
"Jim C. Nasby"
Date:
On Fri, Aug 18, 2006 at 09:15:59AM +0900, Satoshi Nagayasu wrote:
> ITAGAKI Takahiro wrote:
> > But the method has the above problem. So I suggest to use whether
> > the right link points to the next adjacent page or not.
> >
> >     if (opaque->btpo_next != P_NONE && opaque->btpo_next != blkno + 1)
> >         stat->fragments++;
> 
> Well, in that way, following two conditions,
>    [1] [x] [2] [y] [3]
> and
>    [3] [x] [2] [y] [1]
> will be calculated as same fragmentation ratio(100%), I can't agree
> with that, because both will generate different costs while index scan
> in the real world (I don't care about page splitting algorithm now).

What about just reporting the correlation of pages in the index, as well
as fragmentation?
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461