Re: anti-join chosen even when slower than old plan - Mailing list pgsql-performance

From Cédric Villemain
Subject Re: anti-join chosen even when slower than old plan
Date
Msg-id AANLkTik162=KQ_iARndjc6wV4ib=Qm=RyavNQxSoG92D@mail.gmail.com
Whole thread Raw
In response to Re: anti-join chosen even when slower than old plan  (Vitalii Tymchyshyn <tivv00@gmail.com>)
List pgsql-performance
2010/11/12 Vitalii Tymchyshyn <tivv00@gmail.com>:
> 12.11.10 12:56, Cédric Villemain написав(ла):
>>
>> I supposed it was an answer to my mail but not sure... please keep
>> CC'ed people, it is easier to follow threads (at least for me)
>>
>
> OK
>>
>> 2010/11/12 Vitalii Tymchyshyn<tivv00@gmail.com>:
>>
>>>
>>> I'd say there are two Qs here:
>>>
>>> 1) Modify costs based on information on how much of the table is in
>>> cache.
>>> It would be great  if this can be done, but I'd prefer to have it as
>>> admin
>>> knobs (because of plan stability). May be both admin and automatic ways
>>> can
>>> be followed with some parallel (disableable) process modify knobs on
>>> admin
>>> behalf. In this case different strategies to automatically modify knobs
>>> can
>>> be applied.
>>>
>>
>> OS cache is usualy stable enough to keep your plans stable too, I think.
>>
>
> Not if it is on edge. There are always edge cases where data fluctuates near
> some threshold.

So far I did some analysis on the topic with pgfincore. Tables and
index first have peak and holes if you graph the % of blocks in cache
at the server start, but after a while, it is more stable.

Maybe there are applications where linux faill to find a 'stable' page cache.

If people are able to graph the pgfincore results for all or part of
the objects of their database it will give us more robust analysis.
Especially when corner case with the planner exists (like here).

>>
>>
>>>
>>> 2) Modify costs for part of table retrieval. Then you need to define
>>> "part".
>>> Current ways are partitioning and partial indexes. Some similar to
>>> partial
>>> index thing may be created, that has only "where" clause and no data. But
>>> has statistics and knobs (and may be personal bufferspace if they are
>>> introduced). I don't like to gather data about "last X percents" or like,
>>> because it works only in clustering and it's hard for optimizer to decide
>>> if
>>> it will be enough to scan only this percents for given query.
>>>
>>
>> Modifying random_page_cost and sequential_page_cost thanks to
>> statistics about cached blocks can be improved if we know the
>> distribution.
>>
>> It does not mean : we know we have last 15% in cache, and we are goign
>> to request those 15%.
>>
>
> You mean *_cost for the whole table, don't you? That is case (1) for me.

Yes.

> Case (2) is when different cost values are selected based on what portion of
> table is requested in the query. E.g. when we have data for the whole day in
> one table, data for the last hour is cached and all the other data is not.
> Optimizer then may use different *_cost for query that requires all the data
> and for query that requires only last hour data. But, as I've said, that is
> much more complex task then (1).

I need to think some more time of that.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

pgsql-performance by date:

Previous
From: Vitalii Tymchyshyn
Date:
Subject: Re: anti-join chosen even when slower than old plan
Next
From: "Kyriacos Kyriacou"
Date:
Subject: MVCC performance issue