Thread: anti-join chosen even when slower than old plan

anti-join chosen even when slower than old plan

From
"Kevin Grittner"
Date:
The semi-join and anti-join have helped us quite a bit, but we have
seen a situation where anti-join is chosen even though it is slower
than the "old fashioned" plan.  I know there have been other reports
of this, but I just wanted to go on record with my details.

The query:

delete from "DbTranLogRecord"
  where not exists
        (select * from "DbTranRepository" r
           where r."countyNo" = "DbTranLogRecord"."countyNo"
             and r."tranImageSeqNo"
                 = "DbTranLogRecord"."tranImageSeqNo");

Old plan on 8.3.7:

 Seq Scan on "DbTranLogRecord"  (cost=0.00..1224227790.06
rows=333387520 width=6)
   Filter: (NOT (subplan))
   SubPlan
     ->  Index Scan using "DbTranRepositoryPK" on "DbTranRepository"
r  (cost=0.00..1.83 rows=1 width=974)
           Index Cond: ((("countyNo")::smallint = ($0)::smallint)
AND (("tranImageSeqNo")::numeric = ($1)::numeric))

Deletes about 9.2 million rows in 7 hours and 20 minutes.

New plan on 9.0.1:

 Delete  (cost=0.00..93918390.38 rows=1 width=12)
   ->  Merge Anti Join  (cost=0.00..93918390.38 rows=1 width=12)
         Merge Cond: ((("DbTranLogRecord"."countyNo")::smallint =
(r."countyNo")::smallint) AND
(("DbTranLogRecord"."tranImageSeqNo")::numeric =
(r."tranImageSeqNo")::numeric))
         ->  Index Scan using "DbTranLogRecordPK" on
"DbTranLogRecord"  (cost=0.00..73143615.91 rows=675405504 width=20)
         ->  Index Scan using "DbTranRepositoryPK" on
"DbTranRepository" r  (cost=0.00..16328700.43 rows=152541168
width=20)

Cancelled after 39 hours and 25 minutes.

I know how to work around it by using OFFSET 0 or tweaking the
costing for that one query; just sharing the information.
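
For the record, the OFFSET 0 rewrite is roughly this -- the OFFSET 0 acts
as an optimization fence, so the planner keeps the old-style subplan
instead of flattening the EXISTS into an anti-join:

delete from "DbTranLogRecord"
  where not exists
        (select * from "DbTranRepository" r
           where r."countyNo" = "DbTranLogRecord"."countyNo"
             and r."tranImageSeqNo"
                 = "DbTranLogRecord"."tranImageSeqNo"
         offset 0);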

Also, we know these tables might be good candidates for
partitioning, but that's an issue for another day.

         Table "public.DbTranLogRecord"
     Column     |       Type        | Modifiers
----------------+-------------------+-----------
 countyNo       | "CountyNoT"       | not null
 tranImageSeqNo | "TranImageSeqNoT" | not null
 logRecordSeqNo | "LogRecordSeqNoT" | not null
 operation      | "OperationT"      | not null
 tableName      | "TableNameT"      | not null
Indexes:
    "DbTranLogRecordPK" PRIMARY KEY, btree ("countyNo",
"tranImageSeqNo", "logRecordSeqNo")
    "DbTranLogRecord_TableNameSeqNo" btree ("countyNo", "tableName",
"tranImageSeqNo", operation)

            Table "public.DbTranRepository"
      Column      |          Type          | Modifiers
------------------+------------------------+-----------
 countyNo         | "CountyNoT"            | not null
 tranImageSeqNo   | "TranImageSeqNoT"      | not null
 timestampValue   | "TimestampT"           | not null
 transactionImage | "ImageT"               |
 status           | character(1)           | not null
 queryName        | "QueryNameT"           |
 runDuration      | numeric(15,0)          |
 userId           | "UserIdT"              |
 functionalArea   | "FunctionalAreaT"      |
 sourceRef        | character varying(255) |
 url              | "URLT"                 |
 tranImageSize    | numeric(15,0)          |
Indexes:
    "DbTranRepositoryPK" PRIMARY KEY, btree ("countyNo",
"tranImageSeqNo") CLUSTER
    "DbTranRepository_UserId" btree ("countyNo", "userId",
"tranImageSeqNo")
    "DbTranRepository_timestamp" btree ("countyNo", "timestampValue")

            relname             | relpages |  reltuples  |
pg_relation_size
--------------------------------+----------+-------------+------------------
 DbTranLogRecord                |  5524411 | 6.75406e+08 | 42 GB
 DbTranLogRecordPK              |  6581122 | 6.75406e+08 | 50 GB
 DbTranLogRecord_TableNameSeqNo |  6803441 | 6.75406e+08 | 52 GB
 DbTranRepository               | 22695447 | 1.52376e+08 | 173 GB
 DbTranRepositoryPK             |  1353643 | 1.52376e+08 | 10 GB
 DbTranRepository_UserId        |  1753793 | 1.52376e+08 | 13 GB
 DbTranRepository_timestamp     |  1353682 | 1.52376e+08 | 10 GB
(7 rows)

oprofile output while not much besides this delete is running:

samples  %        symbol name
2320174  33.7617  index_getnext
367268    5.3443  LWLockAcquire
299131    4.3528  hash_search_with_hash_value
249459    3.6300  HeapTupleSatisfiesMVCC
229558    3.3404  PinBuffer
222673    3.2402  _bt_checkkeys
204416    2.9745  LWLockRelease
194336    2.8279  heap_page_prune_opt
152353    2.2169  XidInMVCCSnapshot
121131    1.7626  AllocSetAlloc
91123     1.3260  SearchCatCache
88394     1.2863  nocache_index_getattr
85936     1.2505  pglz_compress
76531     1.1136  heap_hot_search_buffer
69532     1.0118  _mdfd_getseg
68743     1.0003  FunctionCall2
64720     0.9418  TransactionIdPrecedes
45298     0.6591  texteq
43183     0.6284  UnpinBuffer
40666     0.5917  base_yyparse

If you want more details or the opannotate level, let me know.

-Kevin

Re: anti-join chosen even when slower than old plan

From
"Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> samples  %        symbol name
> 2320174  33.7617  index_getnext

I couldn't resist seeing where the time went within this function.
Over 13.7% of the opannotate run time was on this bit of code:

  /*
   * The xmin should match the previous xmax value, else chain is
   * broken.  (Note: this test is not optional because it protects
   * us against the case where the prior chain member's xmax aborted
   * since we looked at it.)
   */
  if (TransactionIdIsValid(scan->xs_prev_xmax) &&
      !TransactionIdEquals(scan->xs_prev_xmax,
                        HeapTupleHeaderGetXmin(heapTuple->t_data)))
      break;

I can't see why it would be such a hotspot, but it is.

-Kevin

Re: anti-join chosen even when slower than old plan

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
>> samples  %        symbol name
>> 2320174  33.7617  index_getnext

> I couldn't resist seeing where the time went within this function.
> Over 13.7% of the opannotate run time was on this bit of code:

>   /*
>    * The xmin should match the previous xmax value, else chain is
>    * broken.  (Note: this test is not optional because it protects
>    * us against the case where the prior chain member's xmax aborted
>    * since we looked at it.)
>    */
>   if (TransactionIdIsValid(scan->xs_prev_xmax) &&
>       !TransactionIdEquals(scan->xs_prev_xmax,
>                         HeapTupleHeaderGetXmin(heapTuple->t_data)))
>       break;

> I can't see why it would be such a hotspot, but it is.

Main-memory access waits, maybe?  If at_chain_start is false, that xmin
fetch would be the first actual touch of a given heap tuple, and could
be expected to have to wait for a cache line to be pulled in from RAM.
However, you'd have to be spending a lot of time chasing through long
HOT chains before that would happen enough to make this a hotspot...

            regards, tom lane

Re: anti-join chosen even when slower than old plan

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> However, you'd have to be spending a lot of time chasing through
> long HOT chains before that would happen enough to make this a
> hotspot...

That makes it all the more mysterious, then.  These tables are
insert-only except for a weekly delete of one week of the oldest
data.  The parent table, with the date, is deleted first, then this
table deletes "where not exists" a corresponding parent.  I can't
see how we'd ever have a HOT chain in these tables.

Is there anything in particular you'd like me to check?

-Kevin

Re: anti-join chosen even when slower than old plan

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> The semi-join and anti-join have helped us quite a bit, but we have
> seen a situation where anti-join is chosen even though it is slower
> than the "old fashioned" plan.  I know there have been other reports
> of this, but I just wanted to go on record with my details.

In principle, the old-style plan ought to be equivalent to a nestloop
antijoin with a seqscan of DbTranLogRecord on the outside and an
indexscan of DbTranRepository on the inside.  Can you force it to choose
such a plan by setting enable_mergejoin off (and maybe enable_hashjoin
too)?  If so, it'd be interesting to see the estimated costs and actual
runtime on 9.0 for that plan.
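
Something along these lines would do it, I think -- wrapping the settings
in a transaction so they don't stick around, which is an assumption on my
part about how you'd want to run it:

begin;
set local enable_mergejoin = off;
set local enable_hashjoin = off;
explain
delete from "DbTranLogRecord"
  where not exists
        (select * from "DbTranRepository" r
           where r."countyNo" = "DbTranLogRecord"."countyNo"
             and r."tranImageSeqNo"
                 = "DbTranLogRecord"."tranImageSeqNo");
rollback;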

It would also be interesting to check estimated and actual costs for the
SELECT COUNT(*) versions of these queries, ie, no actual delete.  I'm
suspicious that the cost differential has nothing to do with antijoin
vs. subplan, and everything to do with whether the targeted tuples are
being deleted in physical order (thus improving locality of access for
the deletions).  If it's the latter, see previous discussions about
possibly sorting update/delete targets by CTID before applying the
actions.
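
For instance, just a sketch of the COUNT(*) form of the anti-join query:

explain analyze
select count(*)
  from "DbTranLogRecord" t
  where not exists
        (select 1 from "DbTranRepository" r
           where r."countyNo" = t."countyNo"
             and r."tranImageSeqNo" = t."tranImageSeqNo");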

            regards, tom lane

Re: anti-join chosen even when slower than old plan

From
Grzegorz Jaśkiewicz
Date:
you're joining on more than one key. That always hurts performance.

Re: anti-join chosen even when slower than old plan

From
"Kevin Grittner"
Date:
Grzegorz Jaśkiewicz wrote:

> you're joining on more than one key. That always hurts performance.

That's very clearly *not* the problem, as there is a plan which runs
in acceptable time but the optimizer is not choosing without being
coerced.

(1) Virtually every query we run joins on multi-column keys, yet we
have good performance except for this one query.

(2) We're talking about a performance regression due to a new release
picking a newly available plan which it wrongly estimates to be an
order of magnitude faster, when it's actually more than five times
slower.

-Kevin


Re: anti-join chosen even when slower than old plan

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> The semi-join and anti-join have helped us quite a bit, but we
>> have seen a situation where anti-join is chosen even though it is
>> slower than the "old fashioned" plan.  I know there have been
>> other reports of this, but I just wanted to go on record with my
>> details.
>
> In principle, the old-style plan ought to be equivalent to a
> nestloop antijoin with a seqscan of DbTranLogRecord on the outside
> and an indexscan of DbTranRepository on the inside.  Can you force
> it to choose such a plan by setting enable_mergejoin off (and
> maybe enable_hashjoin too)?

Well, I got what I think is the equivalent plan by adding OFFSET 0
to the subquery:

 Delete  (cost=0.00..1239005015.67 rows=337702752 width=6)
   ->  Seq Scan on "DbTranLogRecord"  (cost=0.00..1239005015.67
rows=337702752 width=6)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Limit  (cost=0.00..1.82 rows=1 width=974)
                 ->  Index Scan using "DbTranRepositoryPK" on
"DbTranRepository" r  (cost=0.00..1.82 rows=1 width=974)
                       Index Cond: ((("countyNo")::smallint =
($0)::smallint) AND (("tranImageSeqNo")::numeric = ($1)::numeric))

> If so, it'd be interesting to see the estimated costs and actual
> runtime on 9.0 for that plan.

Unfortunately, based on the oprofile information I decided to check
out the plan I would get by boosting cpu_index_tuple_cost by a
factor of 20.  The resulting plan was:

 Delete  (cost=132623778.83..139506491.18 rows=1 width=12)
   ->  Merge Anti Join  (cost=132623778.83..139506491.18 rows=1
width=12)
         Merge Cond: ((("DbTranLogRecord"."tranImageSeqNo")::numeric
= (r."tranImageSeqNo")::numeric) AND
(("DbTranLogRecord"."countyNo")::smallint =
(r."countyNo")::smallint))
         ->  Sort  (cost=107941675.79..109630189.55 rows=675405504
width=20)
               Sort Key: "DbTranLogRecord"."tranImageSeqNo",
"DbTranLogRecord"."countyNo"
               ->  Seq Scan on "DbTranLogRecord"
(cost=0.00..7306496.14 rows=675405504 width=20)
         ->  Materialize  (cost=24682103.04..25443983.12
rows=152376016 width=20)
               ->  Sort  (cost=24682103.04..25063043.08
rows=152376016 width=20)
                     Sort Key: r."tranImageSeqNo", r."countyNo"
                     ->  Seq Scan on "DbTranRepository" r
(cost=0.00..3793304.86 rows=152376016 width=20)

That looked like it had potential, so I started that off and went
home before I got your post.  It finished in 3 hours and 31 minutes
-- more than twice as fast as the nestloop plan used under 8.3.

But wait -- it turns out that this pain was self-inflicted.  Based
on heavy testing of the interactive queries which users run against
this database we tuned the database for "fully-cached" settings,
with both random_page_cost and seq_page_cost at 0.1.  In a
practical sense, the users are almost always running these queries
against very recent data which is, in fact, heavily cached -- so
it's no surprise that the queries they run perform best with plans
based on such costing.  The problem is that these weekly maintenance
runs need to pass the entire database, so caching effects are far
less pronounced.  If I set seq_page_cost = 1 and random_page_cost =
2 I get exactly the same (fast) plan as above.
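
In other words, the whole experiment amounted to overriding the page
costs for that session before planning the delete, roughly:

set seq_page_cost = 1;      -- back to the 1.0 default
set random_page_cost = 2;   -- below the 4.0 default, but no longer 0.1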

I guess the lesson here is not to use the same costing for
database-wide off-hours maintenance queries as for ad hoc queries
against a smaller set of recent data by users who expect quick
response time.  I'm fine with tweaking the costs in our maintenance
scripts, but it does tend to make me daydream about how the
optimizer might possibly auto-tweak such things....

I assume there's now no need to get timings for the old plan?

-Kevin

Re: anti-join chosen even when slower than old plan

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In principle, the old-style plan ought to be equivalent to a
>> nestloop antijoin with a seqscan of DbTranLogRecord on the outside
>> and an indexscan of DbTranRepository on the inside.  Can you force
>> it to choose such a plan by setting enable_mergejoin off (and
>> maybe enable_hashjoin too)?

> Well, I got what I think is the equivalent plan by adding OFFSET 0
> to the subquery:

No, that *is* the old-style plan (plus a useless Limit node, which will
surely make it marginally slower).  My point was that a nestloop
antijoin plan should have the same access pattern and hence very similar
performance, maybe even a little better due to not having the SubPlan
machinery in there.

> But wait -- it turns out that this pain was self-inflicted.  Based
> on heavy testing of the interactive queries which users run against
> this database we tuned the database for "fully-cached" settings,
> with both random_page_cost and _seq_page_cost at 0.1.

Ah.  So it was underestimating the cost of the full-table indexscans,
and my guess about nonsequential application of the delete actions
wasn't the right guess.  The merge antijoin does seem like it should be
the fastest way of doing such a large join, so I think the problem is
solved.

            regards, tom lane

Re: anti-join chosen even when slower than old plan

From
Robert Haas
Date:
On Wed, Nov 10, 2010 at 10:15 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> But wait -- it turns out that this pain was self-inflicted.  Based
> on heavy testing of the interactive queries which users run against
> this database we tuned the database for "fully-cached" settings,
> with both random_page_cost and _seq_page_cost at 0.1.  In a
> practical sense, the users are almost always running these queries
> against very recent data which is, in fact, heavily cached -- so
> it's no surprise that the queries they run perform best with plans
> based on such costing.  The problem is that these weekly maintenance
> runs need to pass the entire database, so caching effects are far
> less pronounced.  If I set seq_page_cost = 1 and random_page_cost =
> 2 I get exactly the same (fast) plan as above.
>
> I guess the lesson here is not to use the same costing for
> database-wide off-hours maintenance queries as for ad hoc queries
> against a smaller set of recent data by users who expect quick
> response time.  I'm fine with tweaking the costs in our maintenance
> scripts, but it does tend to make me daydream about how the
> optimizer might possibly auto-tweak such things....

Wow.  That's fascinating, and if you don't mind, I might mention this
potential problem in a future talk at some point.

I've given some thought in the past to trying to maintain some model
of which parts of the database are likely to be cached, and trying to
adjust costing estimates based on that data.  But it's a really hard
problem, because what is and is not in cache can change relatively
quickly, and you don't want to have too much plan instability.  Also,
for many workloads, you'd need to have pretty fine-grained statistics
to figure out anything useful, which would be expensive and difficult
to maintain.

But thinking over what you've written here, I'm reminded of something
Peter said years ago, also about the optimizer.  He was discussing the
ratio of the estimated cost to the actual cost and made an off-hand
remark that efforts had been made over the years to make that ratio
more consistent (i.e. improve the quality of the cost estimates) but
that they'd been abandoned because they didn't necessarily produce
better plans.  Applying that line of thinking to this problem, maybe
we should give up on trying to make the estimates truly model reality,
and focus more on assigning them values which work well in practice.
For example, in your case, it would be sufficient to estimate the
amount of data that a given query is going to grovel through and then
apply some heuristic to choose values for random_page_cost and
seq_page_cost based on the ratio of that value to, I don't know,
effective_cache_size.
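
As a very rough illustration of the comparison I have in mind, using the
two tables from this thread and catalog functions we already have (the
planner would obviously need something cheaper and more principled than
this):

select pg_size_pretty(sum(pg_total_relation_size(oid))::bigint)
         as data_to_grovel_through,
       current_setting('effective_cache_size') as effective_cache_size
  from pg_class
 where relname in ('DbTranLogRecord', 'DbTranRepository');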

Unfortunately, to know how much data we're going to grovel through, we
need to know the plan; and to decide on the right plan, we need to
know how much data we're going to grovel through.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: anti-join chosen even when slower than old plan

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> Wow.  That's fascinating, and if you don't mind, I might mention
> this potential problem in a future talk at some point.

I don't mind at all.

> For example, in your case, it would be sufficient to estimate the
> amount of data that a given query is going to grovel through and
> then applying some heuristic to choose values for random_page_cost
> and seq_page_cost based on the ratio of that value to, I don't
> know, effective_cache_size.

That's where my day-dreams on the topic have been starting.

> Unfortunately, to know how much data we're going to grovel
> through, we need to know the plan; and to decide on the right
> plan, we need to know how much data we're going to grovel through.

And that's where they've been ending.

The only half-sane answer I've thought of is to apply a different
cost to full-table or full-index scans based on the ratio with
effective cache size.  A higher cost for such scans is something
which I've already postulated might be worthwhile for SSI, because
of the increased risk of rw-conflicts which could ultimately
contribute to serialization failures -- to attempt to model, at
least in some crude way, the costs associated with transaction
retry.

-Kevin

Re: anti-join chosen even when slower than old plan

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> Unfortunately, to know how much data we're going to grovel
>> through, we need to know the plan; and to decide on the right
>> plan, we need to know how much data we're going to grovel through.

> And that's where they've been ending.

> The only half-sane answer I've thought of is to apply a different
> cost to full-table or full-index scans based on the ratio with
> effective cache size.

This might have some connection to some rather half-baked ideas I've
been having in connection with the generalized-inner-indexscan problem.
I don't have anything in the way of a coherent presentation to make yet,
but the thing I'm being forced to realize is that sane modeling of a
complex subplan that's on the inside of a nestloop join requires
treating *every* scan type as having different costs "the first time"
versus "during rescan".  If the total amount of data touched in the
query is less than effective_cache_size, it's not unreasonable to
suppose that I/O costs during rescan might be zero, even for a seqscan or
a non-parameterized indexscan.  In fact, only parameterized indexscans
would be able to touch pages they'd not touched the first time, and so
they ought to have higher not lower rescan costs in this environment.
But once the total data volume exceeds effective_cache_size, you have to
do something different since you shouldn't any longer assume the data is
all cached from the first scan.  (This isn't quite as hard as the case
you're talking about, since I think the relevant data volume is the sum
of the sizes of the tables used in the query; which is easy to
estimate at the start of planning, unlike the portion of the tables
that actually gets touched.)

An idea that isn't even half-baked yet is that once we had a cost model
like that, we might be able to produce plans that are well-tuned for a
heavily cached environment by applying the "rescan" cost model even to
the first scan for a particular query.  So that might lead to some sort
of "assume_cached" GUC parameter, and perhaps Kevin could tune his
reporting queries by turning that off instead of messing with individual
cost constants.  Or maybe we could be smarter if we could extract an
estimate for the amount of data touched in the query ... but like you,
I don't see a good way to get that number soon enough.

            regards, tom lane

Re: anti-join chosen even when slower than old plan

From
Robert Haas
Date:
On Wed, Nov 10, 2010 at 6:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Robert Haas <robertmhaas@gmail.com> wrote:
>>> Unfortunately, to know how much data we're going to grovel
>>> through, we need to know the plan; and to decide on the right
>>> plan, we need to know how much data we're going to grovel through.
>
>> And that's where they've been ending.
>
>> The only half-sane answer I've thought of is to apply a different
>> cost to full-table or full-index scans based on the ratio with
>> effective cache size.

Kevin, yes, good point.  Bravo!  Let's do that.  Details TBD, but
suppose effective_cache_size = 1GB.  What we know for sure is that a 4
GB table is not going to be fully cached but a 4 MB table may well be.
 In fact, I think we should assume that the 4 MB table IS cached,
because the point is that if it's used at all, it soon will be.  It's
almost certainly a bad idea to build a plan around the idea of
minimizing reads from that 4 MB table in favor of doing a substantial
amount of additional work somewhere else.  I suppose this could break
down if you had hundreds and hundreds of 4 MB tables all of which were
accessed regularly, but that's an unusual situation, and anyway it's
not clear that assuming them all uncached is going to be any better
than assuming them all cached.

> This might have some connection to some rather half-baked ideas I've
> been having in connection with the generalized-inner-indexscan problem.
> I don't have anything in the way of a coherent presentation to make yet,
> but the thing I'm being forced to realize is that sane modeling of a
> complex subplan that's on the inside of a nestloop join requires
> treating *every* scan type as having different costs "the first time"
> versus "during rescan".  If the total amount of data touched in the
> query is less than effective_cache_size, it's not unreasonable to
> suppose that I/O costs during rescan might be zero, even for a seqscan or
> a non-parameterized indexscan.  In fact, only parameterized indexscans
> would be able to touch pages they'd not touched the first time, and so
> they ought to have higher not lower rescan costs in this environment.
> But once the total data volume exceeds effective_cache_size, you have to
> do something different since you shouldn't any longer assume the data is
> all cached from the first scan.  (This isn't quite as hard as the case
> you're talking about, since I think the relevant data volume is the sum
> of the sizes of the tables used in the query; which is easy to
> estimate at the start of planning, unlike the portion of the tables
> that actually gets touched.)

Well, we don't want the costing model to have sharp edges.
effective_cache_size can't be taken as much more than an educated
guess, and what actually happens will depend a lot on what else is
going on on the system.  If only one query is running on a system at a
time and it is repeatedly seq-scanning a large table, the cost of
reading pages in will be very small until the table grows large enough
that you can't fit the whole thing in memory at once, and then will
abruptly go through the roof.  But realistically you're not going to
know exactly where that edge is going to be, because you can't predict
exactly how much concurrent activity there will be, for example, or
how much work_mem allocations will push out of the OS buffer cache.
So I'm thinking we should start the costs at something like 0.05/0.05
for tables that are much smaller than effective_cache_size and ramp up
to 4/1 for tables that are larger than effective_cache_size.  Maybe
just by linearly ramping up, although that has a certain feeling of
being without mathematical soundness.
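
Just to make the shape of that concrete, here's a back-of-the-envelope
linear ramp between the 0.05/0.05 and 4/1 endpoints; the 0.5 is only an
example ratio of table size to effective_cache_size, clamped to [0, 1]:

select 0.05 + r * (1.0 - 0.05) as seq_page_cost_guess,
       0.05 + r * (4.0 - 0.05) as random_page_cost_guess
  from (select least(1.0, greatest(0.0, 0.5)) as r) as ratio;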

> An idea that isn't even half-baked yet is that once we had a cost model
> like that, we might be able to produce plans that are well-tuned for a
> heavily cached environment by applying the "rescan" cost model even to
> the first scan for a particular query.  So that might lead to some sort
> of "assume_cached" GUC parameter, and perhaps Kevin could tune his
> reporting queries by turning that off instead of messing with individual
> cost constants.

I think the real goal here should be to try to avoid needing a GUC.  A
lot of people could benefit if the system could make some attempt to
recognize on its own which queries are likely to be cached.  We
already have parameters you can hand-tune for each query as necessary.
 Being able to set some parameters system-wide and then get sensible
behavior automatically would be much nicer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: anti-join chosen even when slower than old plan

From
Mladen Gogala
Date:
On 11/10/2010 5:43 PM, Kevin Grittner wrote:
The only half-sane answer I've thought of is to apply a different
cost to full-table or full-index scans based on the ratio with
effective cache size. 

The "effective_cache_size" is, in my humble opinion, a wrong method.  It would be much easier to have a parameter, let's call it "optimizer_index_caching", which would give the assumption of the percentage of an index that is cached. In other words, if "optimizer_index_caching" was set to 80, the optimizer would assume that 80% of any index is cached and would apply different cost estimate. It's not exact but it's simple and modifiable. It would also be a great tool in the hands of the DBA which has to manage OLTP database or DW database and would be able to create a definitive bias toward one type of the execution plan.
I have to confess that the idea about such parameter is not entirely mine: http://tinyurl.com/33gu4f6
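
To be clear about what I mean, the first line below is purely
hypothetical -- no such parameter exists in PostgreSQL today; the second
is the closest existing knob I know of, with a loosely similar effect:

-- set optimizer_index_caching = 80;  -- proposed: assume 80% of any index is cached
set random_page_cost = 1.5;           -- existing: tell the planner random reads are cheap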

-- 
Mladen Gogala 
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com 

Re: anti-join chosen even when slower than old plan

From
Віталій Тимчишин
Date:


2010/11/11 Robert Haas <robertmhaas@gmail.com>

But thinking over what you've written here, I'm reminded of something
Peter said years ago, also about the optimizer.  He was discussed the
ratio of the estimated cost to the actual cost and made an off-hand
remark that efforts had been made over the years to make that ratio
more consistent (i.e. improve the quality of the cost estimates) but
that they'd been abandoned because they didn't necessarily produce
better plans.  Applying that line of thinking to this problem, maybe
we should give up on trying to make the estimates truly model reality,
and focus more on assigning them values which work well in practice.
For example, in your case, it would be sufficient to estimate the
amount of data that a given query is going to grovel through and then
applying some heuristic to choose values for random_page_cost and
seq_page_cost based on the ratio of that value to, I don't know,
effective_cache_size.

As for me, the simplest solution would be to allow setting costs on a
per-relation basis.  E.g. I know that this relation is in memory most of
the time and that other one (archive) is on the disk.  This could work
like a charm along with buffer pools (portions of the shared cache):
tables (or indexes) that are required to be cached can be assigned to a
bufferpool that has enough size to hold all the data, and archive ones to
a small bufferpool.  This can guarantee that after a query on the archive
data, cached tables are still cached.
This solution, however, does not help with tables where only some portion
of the table is actively used.  The solution there could be to allow
setting costs via partial indexes -- e.g. "for any table access using
this index, use these cost values".  This, BTW, would make table access
via the given index more preferable.

--
Best regards,
 Vitalii Tymchyshyn

Re: anti-join chosen even when slower than old plan

From
Kenneth Marshall
Date:
On Wed, Nov 10, 2010 at 10:47:21PM -0500, Robert Haas wrote:
> On Wed, Nov 10, 2010 at 6:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> >> Robert Haas <robertmhaas@gmail.com> wrote:
> >>> Unfortunately, to know how much data we're going to grovel
> >>> through, we need to know the plan; and to decide on the right
> >>> plan, we need to know how much data we're going to grovel through.
> >
> >> And that's where they've been ending.
> >
> >> The only half-sane answer I've thought of is to apply a different
> >> cost to full-table or full-index scans based on the ratio with
> >> effective cache size.
>
> Kevin, yes, good point.  Bravo!  Let's do that.  Details TBD, but
> suppose effective_cache_size = 1GB.  What we know for sure is that a 4
> GB table is not going to be fully cached but a 4 MB table may well be.
>  In fact, I think we should assume that the 4 MB table IS cached,
> because the point is that if it's used at all, it soon will be.  It's
> almost certainly a bad idea to build a plan around the idea of
> minimizing reads from that 4 MB table in favor of doing a substantial
> amount of additional work somewhere else.  I suppose this could break
> down if you had hundreds and hundreds of 4 MB tables all of which were
> accessed regularly, but that's an unusual situation, and anyway it's
> not clear that assuming them all uncached is going to be any better
> than assuming them all cached.
>
> > This might have some connection to some rather half-baked ideas I've
> > been having in connection with the generalized-inner-indexscan problem.
> > I don't have anything in the way of a coherent presentation to make yet,
> > but the thing I'm being forced to realize is that sane modeling of a
> > complex subplan that's on the inside of a nestloop join requires
> > treating *every* scan type as having different costs "the first time"
> > versus "during rescan". ?If the total amount of data touched in the
> > query is less than effective_cache_size, it's not unreasonable to
> > suppose that I/O costs during rescan might be zero, even for a seqscan or
> > a non-parameterized indexscan.  In fact, only parameterized indexscans
> > would be able to touch pages they'd not touched the first time, and so
> > they ought to have higher not lower rescan costs in this environment.
> > But once the total data volume exceeds effective_cache_size, you have to
> > do something different since you shouldn't any longer assume the data is
> > all cached from the first scan.  (This isn't quite as hard as the case
> > you're talking about, since I think the relevant data volume is the sum
> > of the sizes of the tables used in the query; which is easy to
> > estimate at the start of planning, unlike the portion of the tables
> > that actually gets touched.)
>
> Well, we don't want the costing model to have sharp edges.
> effective_cache_size can't be taken as much more than an educated
> guess, and what actually happens will depend a lot on what else is
> going on on the system.  If only one query is running on a system at a
> time and it is repeatedly seq-scanning a large table, the cost of
> reading pages in will be very small until the table grows large enough
> that you can't fit the whole thing in memory at once, and then will
> abruptly go through the roof.  But realistically you're not going to
> know exactly where that edge is going to be, because you can't predict
> exactly how much concurrent activity there will be, for example, or
> how much work_mem allocations will push out of the OS buffer cache.
> So I'm thinking we should start the costs at something like 0.05/0.05
> for tables that are much smaller than effective_cache_size and ramp up
> to 4/1 for tables that are larger than effective_cache_size.  Maybe
> just by linearly ramping up, although that has a certain feeling of
> being without mathemetical soundness.
>
> > An idea that isn't even half-baked yet is that once we had a cost model
> > like that, we might be able to produce plans that are well-tuned for a
> > heavily cached environment by applying the "rescan" cost model even to
> > the first scan for a particular query.  So that might lead to some sort
> > of "assume_cached" GUC parameter, and perhaps Kevin could tune his
> > reporting queries by turning that off instead of messing with individual
> > cost constants.
>
> I think the real goal here should be to try to avoid needing a GUC.  A
> lot of people could benefit if the system could make some attempt to
> recognize on its own which queries are likely to be cached.  We
> already have parameters you can hand-tune for each query as necessary.
>  Being able to set some parameters system-wide and then get sensible
> behavior automatically would be much nicer.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

I agree with the goal of avoiding the need for a GUC. This needs to
be as automatic as possible. One idea I had had was computing a value
for the amount of cache data in the system by keeping a sum or a
weighted sum of the table usage in the system. Smaller tables and
indexes would contribute a smaller amount to the total, while larger
indexes and tables would contribute a larger amount. Then by comparing
this running total to the effective_cache_size, set the random and
sequential costs for a query. This would allow the case of many 4MB
tables to favor disk I/O more than memory I/O. The weighting could
be a function of simultaneous users of the table. I know this is a
bit of hand-waving but some sort of dynamic feedback needs to be
provided to the planning process as system use increases.

Regards,
Ken

Re: anti-join chosen even when slower than old plan

From
Mladen Gogala
Date:
Kenneth Marshall wrote:
> I agree with the goal of avoiding the need for a GUC. This needs to
> be as automatic as possible. One idea I had had was computing a value
> for the amount of cache data in the system by keeping a sum or a
> weighted sum of the table usage in the system. Smaller tables and
> indexes would contribute a smaller amount to the total, while larger
> indexes and tables would contribute a larger amount. Then by comparing
> this running total to the effective_cache_size, set the random and
> sequential costs for a query. This would allow the case of many 4MB
> tables to favor disk I/O more than memory I/O. The weighting could
> be a function of simultaneous users of the table. I know this is a
> bit of hand-waving but some sort of dynamic feedback needs to be
> provided to the planning process as system use increases.
>
> Regards,
> Ken
>
>
Kenneth, you seem to be only concerned with the accuracy of the planning
process, not with the plan stability. As a DBA who has to monitor real
world applications, I find things like an execution plan changing with
the use of the system to be my worst nightmare. The part where you say
that "this needs to be as automatic as possible" probably means that I
will not be able to do anything about it, if the optimizer, by any
chance, doesn't get it right. That looks to me like an entirely wrong
way to go.
When an application developer tunes the SQL, both he and I expect that
SQL to always perform that way, not to change its execution plan because
the system is more heavily utilized than it was an hour ago.  Nobody
seems to have taken my suggestion of a parameter which would simply
"invent" the percentage out of thin air seriously, because it's
obviously not accurate.
However, the planner accuracy is not the only concern. Running
applications on the system usually requires plan stability. Means of
external control of the execution plan, DBA knobs and buttons that can
be turned and pushed to produce the desired plan are also very much desired.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: anti-join chosen even when slower than old plan

From
Kenneth Marshall
Date:
On Thu, Nov 11, 2010 at 09:15:58AM -0500, Mladen Gogala wrote:
> Kenneth Marshall wrote:
>> I agree with the goal of avoiding the need for a GUC. This needs to
>> be as automatic as possible. One idea I had had was computing a value
>> for the amount of cache data in the system by keeping a sum or a
>> weighted sum of the table usage in the system. Smaller tables and
>> indexes would contribute a smaller amount to the total, while larger
>> indexes and tables would contribute a larger amount. Then by comparing
>> this running total to the effective_cache_size, set the random and
>> sequential costs for a query. This would allow the case of many 4MB
>> tables to favor disk I/O more than memory I/O. The weighting could
>> be a function of simultaneous users of the table. I know this is a
>> bit of hand-waving but some sort of dynamic feedback needs to be
>> provided to the planning process as system use increases.
>>
>> Regards,
>> Ken
>>
>>
> Kenneth, you seem to be only concerned with the accuracy of the planning
> process, not with the plan stability. As a DBA who has to monitor real
> world applications, I find things like an execution plan changing with the
> use of the system to be my worst nightmare. The part where you say that
> "this needs to be as automatic as possible" probably means that I will not
> be able to do anything about it, if the optimizer, by any chance, doesn't
> get it right. That looks to me like an entirely wrong way to go.
> When application developer tunes the SQL both him and me expect that SQL to
> always perform that way, not to change the execution plan because the
> system is utilized more than it was 1 hour ago. Nobody seems to have taken
> my suggestion about having a parameter
> which would simply "invent" the percentage out of thin air seriously,
> because it's obviously not accurate.
> However, the planner accuracy is not the only concern. Running applications
> on the system usually requires plan stability. Means of
> external control of the execution plan, DBA knobs and buttons that can be
> turned and pushed to produce the desired plan are also very much desired.
>
> --
> Mladen Gogala Sr. Oracle DBA
> 1500 Broadway
> New York, NY 10036
> (212) 329-5251
> www.vmsinfo.com
>
Hi Mladen,

I think in many ways, this is the same problem. Because we are not
correctly modeling the system, the plan choices are not accurate
either for some scenarios. This means that when plan costs are
compared, the evaluation is not accurate. This is what causes
terrible plans to land right next to good plans, and is what
hurts "plan stability". If the costs are correct, then in
fact the plan stability will be much better with the better
costing, not worse. Plans with close costs should actually have
close performance.

Regards,
Ken

Re: anti-join chosen even when slower than old plan

From
"Kevin Grittner"
Date:
Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:

> create a definitive bias toward one type of the execution plan.

We're talking about trying to support the exact opposite.  This all
started because a database which was tuned for good response time
for relatively small queries against a "hot" portion of some tables
chose a bad plan for a weekend maintenance run against the full
tables.  We're talking about the possibility of adapting the cost
factors based on table sizes as compared to available cache, to more
accurately model the impact of needing to do actual disk I/O for
such queries.

This also is very different from trying to adapt queries to what
happens to be currently in cache.  As already discussed on a recent
thread, the instability in plans and the failure to get to an
effective cache set make that a bad idea.  The idea discussed here
would maintain a stable plan for a given query, it would just help
choose a good plan based on the likely level of caching.

-Kevin

Re: anti-join chosen even when slower than old plan

From
Bob Lunney
Date:
--- On Thu, 11/11/10, Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:

> From: Mladen Gogala <mladen.gogala@vmsinfo.com>
> Subject: Re: [PERFORM] anti-join chosen even when slower than old plan
> To: "Kenneth Marshall" <ktm@rice.edu>
> Cc: "Robert Haas" <robertmhaas@gmail.com>, "Tom Lane" <tgl@sss.pgh.pa.us>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov>,"pgsql-performance@postgresql.org" <pgsql-performance@postgresql.org> 
> Date: Thursday, November 11, 2010, 9:15 AM
> Kenneth Marshall wrote:
> > I agree with the goal of avoiding the need for a GUC. This needs to
> > be as automatic as possible. One idea I had had was computing a value
> > for the amount of cache data in the system by keeping a sum or a
> > weighted sum of the table usage in the system. Smaller tables and
> > indexes would contribute a smaller amount to the total, while larger
> > indexes and tables would contribute a larger amount. Then by comparing
> > this running total to the effective_cache_size, set the random and
> > sequential costs for a query. This would allow the case of many 4MB
> > tables to favor disk I/O more than memory I/O. The weighting could
> > be a function of simultaneous users of the table. I know this is a
> > bit of hand-waving but some sort of dynamic feedback needs to be
> > provided to the planning process as system use increases.
> >
> > Regards,
> > Ken
> >
> Kenneth, you seem to be only concerned with the accuracy of
> the planning process, not with the plan stability. As a DBA
> who has to monitor real world applications, I find things
> like an execution plan changing with the use of the system
> to be my worst nightmare. The part where you say that "this
> needs to be as automatic as possible" probably means that I
> will not be able to do anything about it, if the optimizer,
> by any chance, doesn't get it right. That looks to me like
> an entirely wrong way to go.
> When application developer tunes the SQL both him and me
> expect that SQL to always perform that way, not to change
> the execution plan because the system is utilized more than
> it was 1 hour ago. Nobody seems to have taken my suggestion
> about having a parameter
> which would simply "invent" the percentage out of thin air
> seriously, because it's obviously not accurate.
> However, the planner accuracy is not the only concern.
> Running applications on the system usually requires plan
> stability. Means of
> external control of the execution plan, DBA knobs and
> buttons that can be turned and pushed to produce the desired
> plan are also very much desired.
>
> -- Mladen Gogala Sr. Oracle DBA
> 1500 Broadway
> New York, NY 10036
> (212) 329-5251
> www.vmsinfo.com
>

Mladen,

Been there, done that with Oracle for more years than I care to remember
or admit.  Having the necessary knobs was both daunting and a godsend,
depending on whether you could find the right one(s) to frob during
production use, and you turned them the right way and amount.  I
personally find having less knobbage with PostgreSQL to be a huge benefit
over Oracle.  In that spirit, I offer the following suggestion: (Ken's
original suggestion inspired me, so if I misunderstand it, Ken, please
correct me.)

What if the code that managed the shared buffer cache kept track of how
many buffers were in the cache for each table and index?  Then the
optimizer could know the ratio of cached to non-cached table or index
buffers (how many pages are in PG's buffer cache vs. the total number of
pages required for the entire table, assuming autovacuum is working well)
and plan accordingly.  It would even be possible to skew the estimate
based on the ratio of shared_buffers to effective_cache_size.  The
optimizer could then dynamically adjust the random and sequential costs
per query prior to planning, with (hopefully) plans optimized to the
current condition of the server and host caches just prior to execution.

There are lots of assumptions here, the primary ones being that the
shared buffer cache's state doesn't change significantly between the
start of planning and actual execution time, and that the host is
dedicated to running the database and nothing else that would trash the
host's file system cache.  I admit that I haven't looked at the code for
this yet, so I don't know if I'm on to something or off in the weeds.
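
For what it's worth, the pg_buffercache contrib module already exposes
roughly this information to a DBA, though not to the planner.  After
installing it, something like the following (simplified -- it ignores
which database and relation fork each buffer belongs to) shows how much
of each relation is currently in shared buffers:

select c.relname,
       count(*) as buffers_in_cache,
       round(100.0 * count(*) / c.relpages, 1) as pct_of_relation
  from pg_buffercache b
  join pg_class c on c.relfilenode = b.relfilenode
 where c.relpages > 0
 group by c.relname, c.relpages
 order by buffers_in_cache desc
 limit 20;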

Regards,

Bob Lunney

Re: anti-join chosen even when slower than old plan

From
Mladen Gogala
Date:
Kevin Grittner wrote:
> Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:
>
>
>> create a definitive bias toward one type of the execution plan.
>>
>
> We're talking about trying to support the exact opposite.
I understand this; that is precisely the reason for my intervention in a
discussion among experts, which I am not.
> This all
> started because a database which was tuned for good response time
> for relatively small queries against a "hot" portion of some tables
> chose a bad plan for a weekend maintenance run against the full
> tables.  We're talking about the possibility of adapting the cost
> factors based on table sizes as compared to available cache, to more
> accurately model the impact of needing to do actual disk I/O for
> such queries.
>
Kevin, in my experience, the hardest thing to do is to tune so-called
mixed-type databases. In practice, databases are usually separated: the
OLTP database on one group of servers, the reporting database and the
data warehouse on another group of servers. Postgres 9.0 has made a
great stride toward that possibility with the new replication
facilities. Again, having an optimizer which will choose the plan
completely accurately is, at least in my opinion, less important than
having the possibility of manual control, the aforementioned "knobs and
buttons", and producing the same plan for the same statement. Trying to
make the optimizer smart enough for all types of loads is akin to
looking for the Holy Grail. Inevitably, you will face some hard
questions, like the one about the airspeed velocity of an unladen
swallow, and the whole search is likely to end in a pretty funny way,
not producing the desired "optimizing genie in the CPU".
>
> This also is very different from trying to adapt queries to what
> happens to be currently in cache.  As already discussed on a recent
> thread, the instability in plans and the failure to get to an
> effective cache set make that a bad idea.  The idea discussed here
> would maintain a stable plan for a given query, it would just help
> choose a good plan based on the likely level of caching.
>
Kevin, I am talking from the perspective of a DBA who is involved with
production databases on a day-to-day basis. I am no expert, but I do
believe I speak from the perspective of the users that Postgres has to
win over in order to make further inroads into corporate server rooms.
Without the possibility of such control and plan stability, it is hard
for me to recommend more extensive use of PostgreSQL to my boss.
Whatever solution is chosen, it needs to have "knobs and buttons" and
produce plans that will not change when the CPU usage goes up.

--

Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions




Re: anti-join chosen even when slower than old plan

From
Tom Lane
Date:
Mladen Gogala <mladen.gogala@vmsinfo.com> writes:
> Again, having an optimizer which will choose the plan completely
> accurately is, at least in my opinion, less important than having a
> possibility of manual control, the aforementioned "knobs and buttons"
> and produce the same plan for the same statement.

More knobs and buttons is the Oracle way, and the end result of that
process is that you have something as hard to use as Oracle.  That's
generally not thought of as desirable in this community.

            regards, tom lane

Re: anti-join chosen even when slower than old plan

From
Craig James
Date:
On 11/11/10 9:13 AM, Mladen Gogala wrote:
> Kevin Grittner wrote:
>> Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:
>>
>>> create a definitive bias toward one type of the execution plan.
>>
>> We're talking about trying to support the exact opposite.
> I understand this, that is precisely the reason for my intervention into the discussion of experts, which I am not.
>> This all
>> started because a database which was tuned for good response time
>> for relatively small queries against a "hot" portion of some tables
>> chose a bad plan for a weekend maintenance run against the full
>> tables. We're talking about the possibility of adapting the cost
>> factors based on table sizes as compared to available cache, to more
>> accurately model the impact of needing to do actual disk I/O for
>> such queries.
> Kevin, in my experience, the hardest thing to do is to tune so called
> mixed type databases. In practice, databases are usually separated: OLTP
> database on one group of servers, reporting database and the data
> warehouse on another group of servers. Postgres 9.0 has made a great
> stride toward such possibility with the new replication facilities.
> Again, having an optimizer which will choose the plan completely
> accurately is, at least in my opinion, less important than having a
> possibility of manual control, the aforementioned "knobs and buttons" and
> produce the same plan for the same statement. Trying to make the
> optimizer smart enough for all types of loads is akin to looking for the
> Holy Grail. Inevitably, you will face some hard questions, like the one
> about the airspeed velocity of an unladen swallow, and the whole search
> is likely to end in pretty funny way, not producing the desired
> "optimizing genie in the CPU".

What about rule-based configuration?  You provide a default configuration
(what Postgres does now), and then allow one or more alternate
configurations that are triggered when certain rules match.  The rules
might be things like:

- Per user or group of users.  Set up a user or group for your
   maintenance task and you automatically get your own config.

- A set of tables.  If you do a query that uses tables X, Y, and
   Z, this configuration applies to you.

- A regular expression applied to the SQL.  If the regexp matches,
   the configuration applies.

- System resource usage.  If some other process is gobbling memory,
   switch to a configuration with lower memory requirements.

- A time of day.  Maybe on weekends, different rules apply.

... and so on.  I don't know what the right parameters might be, but
surely the original poster's problem would be solved by this solution.  It
gives performance experts the tool they need for complex installations,
without adding FUD to the lives of everyone else.
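
The per-user rule, at least, has a close analog already: settings attached
to a role apply automatically whenever that role connects.  A sketch, with
a made-up role name:

create role weekly_maintenance login;
alter role weekly_maintenance set seq_page_cost = 1;
alter role weekly_maintenance set random_page_cost = 2;
alter role weekly_maintenance set work_mem = '256MB';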

Craig

>>
>> This also is very different from trying to adapt queries to what
>> happens to be currently in cache. As already discussed on a recent
>> thread, the instability in plans and the failure to get to an
>> effective cache set make that a bad idea. The idea discussed here
>> would maintain a stable plan for a given query, it would just help
>> choose a good plan based on the likely level of caching.
> Kevin, I am talking from the perspective of a DBA who is involved with a
> production databases on day-to-day basis. I am no expert but I do believe
> to speak from a perspective of users that Postgres has to win in order to
> make further inroads into the corporate server rooms. Without the
> possibility of such control and the plan stability, it is hard for me to
> recommend more extensive use of PostgreSQL to my boss. Whatever solution
> is chosen, it needs to have "knobs and buttons" and produce the plans
> that will not change when the CPU usage goes up.
>


Re: anti-join chosen even when slower than old plan

From
Robert Haas
Date:
On Thu, Nov 11, 2010 at 10:00 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:
>
>> create a definitive bias toward one type of the execution plan.
>
> We're talking about trying to support the exact opposite.  This all
> started because a database which was tuned for good response time
> for relatively small queries against a "hot" portion of some tables
> chose a bad plan for a weekend maintenance run against the full
> tables.  We're talking about the possibility of adapting the cost
> factors based on table sizes as compared to available cache, to more
> accurately model the impact of needing to do actual disk I/O for
> such queries.
>
> This also is very different from trying to adapt queries to what
> happens to be currently in cache.  As already discussed on a recent
> thread, the instability in plans and the failure to get to an
> effective cache set make that a bad idea.  The idea discussed here
> would maintain a stable plan for a given query, it would just help
> choose a good plan based on the likely level of caching.

Let's back up a moment and talk about what the overall goal is, here.
Ideally, we would like PostgreSQL to have excellent performance at all
times, under all circumstances, with minimal tuning.  Therefore, we do
NOT want to add variables that will, by design, need constant manual
adjustment.  That is why I suggested that Tom's idea of an
assume_cached GUC is probably not what we really want to do.   On the
other hand, what I understand Mladen to be suggesting is something
completely different.  He's basically saying that, of course, he wants
it to work out of the box most of the time, but since there are
guaranteed to be cases where it doesn't, how about providing some
knobs that aren't intended to be routinely twaddled but which are
available in case of emergency?  Bravo, I say!

Consider the case of whether a table is cached.  Currently, we
estimate that it isn't, and you can sort of twaddle that assumption
globally by setting seq_page_cost and random_page_cost.  In 9.0, you
can twaddle it with a bit more granularity by adjusting seq_page_cost
and random_page_cost on a per-tablespace basis.  But that's really
intended to handle the case where you have one tablespace on an SSD
and another that isn't.  It doesn't really model caching at all; we're
just abusing it as if it does.  If 90% of a table is cached, you can't
simply multiply the cost of reading it by 0.1, because now many of
those reads will be random I/O rather than sequential I/O.  The right
thing to do is to estimate what percentage of the table will be
cached, then estimate how much random and sequential I/O it'll take to
get the rest, and then compute the cost.
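
For reference, the 9.0 per-tablespace syntax looks like this (the
tablespace name is made up):

alter tablespace ssd_space set (seq_page_cost = 0.1, random_page_cost = 0.1);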

To do that, we can adopt the approach proposed upthread of comparing
the size of the table to effective_cache_size.  We come up with some
function f, such that f(effective_cache_size, table_size) =
assumed_caching_percentage, and then from there we estimate random
I/Os and sequential I/Os, and from there we estimate costs.  This is a
good system, almost certainly better than what we have now.  However,
it's also guaranteed to not always work.  The DBA may know, for
example, that one particular table that is quite large is always fully
cached because it is very heavily accessed.  So why not let them pin the
assumed_caching_percentage for that table to 100%?  I don't see any
reason at all.  Most people will never need to touch that setting, but
it's there in case you really, really need it.
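
Purely to illustrate the kind of knob I have in mind (this reloption
does not exist today; the name and syntax are only a sketch):

    ALTER TABLE hot_lookup_table SET (assumed_caching_percentage = 100);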

We've traditionally been reluctant to do this sort of thing (as the
email Tom just sent reflects) but I think we should soften up a bit.
A product gets hard to use when it has knobs that MUST be tuned to
make it work at all, and certainly AFAICT Oracle falls into that
category.  My rollback segment is full?  My table is out of extents?
Well, allocate some more space then; I certainly wouldn't have imposed
an arbitrary cap on the table size if I'd known I was doing it.
However, that's not the same thing as having knobs that are
*available* when the shit really hits the fan.  By failing to provide
that type of knob, we're not facilitating ease of use; we're just
making it difficult for the small percentage of people who have
problems to fix them, which is another kind of non-ease-of-use.

In fact, we already have a few knobs of this type.  We have a
statistics target which can be overridden on a per-column basis, and
beginning in 9.0, you can override the planner's n_distinct estimates
in the same way.  Why?  Because we know that it's not achievable to
estimate n_distinct accurately in all cases without making ANALYZE
unreasonably slow.  I bet that very, VERY few people will ever use
that feature, so it costs nothing in terms of "oh, another setting I
have to configure".  But for people who are getting bitten by
inaccurate n_distinct estimates, it will be very nice to have that as
an escape hatch.  I see no harm, and much value, in providing similar
escape hatches elsewhere.
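
For reference, the existing per-column escape hatches look like this
(table and column names are just placeholders):

    ALTER TABLE some_table ALTER COLUMN some_col SET STATISTICS 1000;
    ALTER TABLE some_table ALTER COLUMN some_col SET (n_distinct = 500);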

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: anti-join chosen even when slower than old plan

From
Mladen Gogala
Date:
Tom Lane wrote:
> More knobs and buttons is the Oracle way,

True. Very true.

> and the end result of that
> process is that you have something as hard to use as Oracle.

Also, you end up with something which is extremely reliable and
adjustable to a variety of conditions.

> That's
> generally not thought of as desirable in this community.
>
>             regards, tom lane
>
Allow me to play the devil's advocate again. This community is still
much, much smaller than even the MySQL community, much less Oracle's
community. If growth of the community is the goal, copying a page or two
from Oracle's book looks like a good idea to me. The only thing I
dislike about Oracle is its price, not its complexity.

--

Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions




Re: anti-join chosen even when slower than old plan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Let's back up a moment and talk about what the overall goal is, here.
> Ideally, we would like PostgreSQL to have excellent performance at all
> times, under all circumstances, with minimal tuning.  Therefore, we do
> NOT want to add variables that will, by design, need constant manual
> adjustment.  That is why I suggested that Tom's idea of an
> assume_cached GUC is probably not what we really want to do.   On the
> other hand, what I understand Mladen to be suggesting is something
> completely different.  He's basically saying that, of course, he wants
> it to work out of the box most of the time, but since there are
> guaranteed to be cases where it doesn't, how about providing some
> knobs that aren't intended to be routinely twaddled but which are
> available in case of emergency?  Bravo, I say!

Um ... those are exactly the same thing.  You're just making different
assumptions about how often you will need to twiddle the setting.
Neither assumption is based on any visible evidence, unfortunately.
I was thinking of assume_cached as something that could be
set-and-forget most of the time, and you're entirely right to criticize
it on the grounds that maybe it wouldn't.  But to support a proposal
that doesn't even exist yet on the grounds that it *would* be
set-and-forget seems a tad inconsistent.  We can't make that judgment
without a whole lot more details than have been provided yet for any
idea in this thread.

I do think that something based around a settable-per-table caching
percentage might be a reasonable way to proceed.  But the devil is in
the details, and we don't have those yet.

            regards, tom lane

Re: anti-join chosen even when slower than old plan

From
Tom Lane
Date:
I wrote:
> I do think that something based around a settable-per-table caching
> percentage might be a reasonable way to proceed.

BTW ... on reflection it seems that this would *not* solve the use-case
Kevin described at the start of this thread.  What he's got AIUI is some
large tables whose recent entries are well-cached, and a lot of queries
that tend to hit that well-cached portion, plus a few queries that hit
the whole table and so see largely-not-cached behavior.  We can't
represent that very well with a caching knob at the table level.  Either
a high or a low setting will be wrong for one set of queries or the
other.

It might work all right if he were to partition the table and then have
a different caching value attached to the currently-latest partition,
but that doesn't sound exactly maintenance-free either.  Also, that only
works with the current statically-planned approach to partitioned
tables.  I think where we're trying to go with partitioning is that
the planner doesn't consider the individual partitions, but the executor
just hits the right one at runtime --- so cost modifiers attached to
individual partitions aren't going to work in that environment.

The most practical solution for his case still seems to be to twiddle
some GUC or other locally in the maintenance scripts that do the
full-table-scan queries.  Unfortunately we don't have an equivalent
of per-session SET (much less SET LOCAL) for per-relation attributes.
Not sure if we want to go there.

            regards, tom lane

Re: anti-join chosen even when slower than old plan

From
Robert Haas
Date:
On Thu, Nov 11, 2010 at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Let's back up a moment and talk about what the overall goal is, here.
>> Ideally, we would like PostgreSQL to have excellent performance at all
>> times, under all circumstances, with minimal tuning.  Therefore, we do
>> NOT want to add variables that will, by design, need constant manual
>> adjustment.  That is why I suggested that Tom's idea of an
>> assume_cached GUC is probably not what we really want to do.   On the
>> other hand, what I understand Mladen to be suggesting is something
>> completely different.  He's basically saying that, of course, he wants
>> it to work out of the box most of the time, but since there are
>> guaranteed to be cases where it doesn't, how about providing some
>> knobs that aren't intended to be routinely twaddled but which are
>> available in case of emergency?  Bravo, I say!
>
> Um ... those are exactly the same thing.  You're just making different
> assumptions about how often you will need to twiddle the setting.
> Neither assumption is based on any visible evidence, unfortunately.
>
> I was thinking of assume_cached as something that could be
> set-and-forget most of the time, and you're entirely right to criticize
> it on the grounds that maybe it wouldn't.  But to support a proposal
> that doesn't even exist yet on the grounds that it *would* be
> set-and-forget seems a tad inconsistent.  We can't make that judgment
> without a whole lot more details than have been provided yet for any
> idea in this thread.

Well, maybe I misunderstood what you were proposing.  I had the
impression that you were proposing something that would *by design*
require adjustment for each query, so evidently I missed the point.
It seems to me that random_page_cost and seq_page_cost are pretty
close to set-and-forget already.  We don't have many reports of people
needing to tune these values on a per-query basis; most people seem to
just guesstimate a cluster-wide value and call it good.  Refining the
algorithm should only make things better.

> I do think that something based around a settable-per-table caching
> percentage might be a reasonable way to proceed.  But the devil is in
> the details, and we don't have those yet.

I think one of the larger devils in the details is deciding how to
estimate the assumed caching percentage when the user hasn't specified
one.  Frankly, I suspect that if we simply added a reloption called
assumed_caching_percentage and made it default to zero, we would make
a bunch of DBAs happy; they'd knock down seq_page_cost and
random_page_cost enough to account for the general level of caching
and then bump assumed_caching_percentage up for hot tables/indexes (or
ones that they want to have become hot).  I think we can do better
than that, but the right formula isn't exactly obvious.  I feel safe
saying that if effective_cache_size=1GB and table_size=4MB, then we
ought to take the table as fully cached.  But it's far from clear what
caching percentage we should assume when table_size=400MB, and it
seems like the sort of thing that will lead to endless bikeshedding.
There's probably no perfect answer, but I feel we can likely come up
with something that is better than a constant (which would probably
still be better than what we have now).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: anti-join chosen even when slower than old plan

From
Robert Haas
Date:
On Thu, Nov 11, 2010 at 1:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> I do think that something based around a settable-per-table caching
>> percentage might be a reasonable way to proceed.
>
> BTW ... on reflection it seems that this would *not* solve the use-case
> Kevin described at the start of this thread.  What he's got AIUI is some
> large tables whose recent entries are well-cached, and a lot of queries
> that tend to hit that well-cached portion, plus a few queries that hit
> the whole table and so see largely-not-cached behavior.  We can't
> represent that very well with a caching knob at the table level.  Either
> a high or a low setting will be wrong for one set of queries or the
> other.

Yeah.  For Kevin's case, it seems like we want the caching percentage
to vary not so much based on which table we're hitting at the moment
but on how much of it we're actually reading.  However, the two
problems are related enough that I think it might be feasible to come
up with one solution that answers both needs, or perhaps two
somewhat-intertwined solutions.

> The most practical solution for his case still seems to be to twiddle
> some GUC or other locally in the maintenance scripts that do the
> full-table-scan queries.  Unfortunately we don't have an equivalent
> of per-session SET (much less SET LOCAL) for per-relation attributes.
> Not sure if we want to go there.

I doubt it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: anti-join chosen even when slower than old plan

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> BTW ... on reflection it seems that this would *not* solve the
> use-case Kevin described at the start of this thread.  What he's
> got AIUI is some large tables whose recent entries are well-
> cached, and a lot of queries that tend to hit that well-cached
> portion, plus a few queries that hit the whole table and so see
> largely-not-cached behavior.  We can't represent that very well
> with a caching knob at the table level. Either a high or a low
> setting will be wrong for one set of queries or the other.

Exactly right.

> The most practical solution for his case still seems to be to
> twiddle some GUC or other locally in the maintenance scripts that
> do the full-table-scan queries.

Yes, that works fine.  The thread spun off in this speculative
direction because I started thinking about whether there was any
reasonable way for PostgreSQL to automatically handle such things
without someone having to notice the problem and do the per-script
tuning.  I don't know whether any of the ideas thus spawned are
worth the effort -- it's not a situation I find myself in all that
often.  I guess it could be considered an "ease of use" feature.

> Unfortunately we don't have an equivalent of per-session SET (much
> less SET LOCAL) for per-relation attributes.  Not sure if we want
> to go there.

Besides the "fully-scanned object size relative to relation size
costing adjustment" idea, the only one which seemed to be likely to
be useful for this sort of issue was the "costing factors by user
ID" idea -- the interactive queries hitting the well-cached portion
of the tables are run through a read-only user ID, while the weekly
maintenance scripts (obviously) are not.  With the settings I
initially had assigned to the cluster the maintenance scripts would
never have seen this issue; it was tuning to resolve end-user
complaints of slowness in the interactive queries which set up the
conditions for failure, and if I'd had per-user settings, I probably
would have (and definitely *should* have) used them.

FWIW, I can certainly see the potential of some other ideas which
came up on the thread; what might have seemed like antipathy toward
them was more of an attempt to point out that they would not have
helped at all with the problem which started this thread.

-Kevin

Re: anti-join chosen even when slower than old plan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Nov 11, 2010 at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I do think that something based around a settable-per-table caching
>> percentage might be a reasonable way to proceed.  But the devil is in
>> the details, and we don't have those yet.

> I think one of the larger devils in the details is deciding how to
> estimate the assumed caching percentage when the user hasn't specified
> one.

I was imagining something very similar to the handling of seq_page_cost,
ie, there's a GUC controlling the system-wide default and then you can
override that per-table.  But the real question is whether per-table
is a useful granularity to control it at.  See my later message.

            regards, tom lane

Re: anti-join chosen even when slower than old plan

From
"Kevin Grittner"
Date:
I wrote:

> Besides the "fully-scanned object size relative to relation size
> costing adjustment" idea,

s/relation size/effective cache size/

-Kevin

Re: anti-join chosen even when slower than old plan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah.  For Kevin's case, it seems like we want the caching percentage
> to vary not so much based on which table we're hitting at the moment
> but on how much of it we're actually reading.

Well, we could certainly take the expected number of pages to read and
compare that to effective_cache_size.  The thing that's missing in that
equation is how much other stuff is competing for cache space.  I've
tried to avoid having the planner need to know the total size of the
database cluster, but it's kind of hard to avoid that if you want to
model this honestly.

Would it be at all workable to have an estimate that so many megs of a
table are in cache (independently of any other table), and then we could
scale the cost based on the expected number of pages to read versus that
number?  The trick here is that DBAs really aren't going to want to set
such a per-table number (at least, most of the time) so we need a
formula to get to a default estimate for that number based on some simple
system-wide parameters.  I'm not sure if that's any easier.

BTW, it seems that all these variants have an implicit assumption that
if you're reading a small part of the table it's probably part of the
working set; which is an assumption that could be 100% wrong.  I don't
see a way around it without trying to characterize the data access at
an unworkably fine level, though.

            regards, tom lane

Re: anti-join chosen even when slower than old plan

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Besides the "fully-scanned object size relative to relation size
> costing adjustment" idea, the only one which seemed to be likely to
> be useful for this sort of issue was the "costing factors by user
> ID" idea -- the interactive queries hitting the well-cached portion
> of the tables are run through a read-only user ID, while the weekly
> maintenance scripts (obviously) are not.  With the settings I
> initially had assigned to the cluster the maintenance scripts would
> never have seen this issue; it was tuning to resolve end-user
> complaints of slowness in the interactive queries which set up the
> conditions for failure, and if I'd had per-user settings, I probably
> would have (and definitely *should* have) used them.

Erm ... you can in fact do "ALTER USER SET random_page_cost" today.
As long as the settings are GUC parameters we have quite a lot of
flexibility about how to control them.  This gets back to my earlier
point that our current form of per-relation properties (reloptions) is
considerably less flexible than a GUC.  I think that if we create any
strong planner dependence on such properties, we're going to end up
needing to be able to set them in all the same ways you can set a GUC.
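
For instance, today you can already do something like this (the role
name is just an example):

    ALTER USER interactive_user SET random_page_cost = 2;

and a maintenance script can set it just for its own transaction:

    BEGIN;
    SET LOCAL random_page_cost = 4;
    -- ... run the full-table-scan query here ...
    COMMIT;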

            regards, tom lane

Re: anti-join chosen even when slower than old plan

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I've tried to avoid having the planner need to know the total size
> of the database cluster, but it's kind of hard to avoid that if
> you want to model this honestly.

Agreed.  Perhaps the cost could start escalating when the pages to
access hit (effective_cache_size * relation_size / database_size),
and escalate to the defaults (or some new GUCs) in a linear fashion
until you hit effective_cache_size?
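
For example, with made-up numbers (an effective_cache_size of 1,000,000
pages and a 500,000-page relation in a 10,000,000-page cluster, so
escalation starts at 50,000 pages, interpolating from a placeholder
"cached" cost of 1.0 up to the default of 4.0):

    SELECT p AS pages_to_access,
           CASE
             WHEN p <= 50000   THEN 1.0
             WHEN p >= 1000000 THEN 4.0
             ELSE 1.0 + 3.0 * (p - 50000) / (1000000 - 50000)
           END AS escalated_random_page_cost
    FROM (VALUES (10000), (200000), (2000000)) AS v(p);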

> BTW, it seems that all these variants have an implicit assumption
> that if you're reading a small part of the table it's probably
> part of the working set

I would say that the assumption should be that seq_page_cost and
random_page_cost model the costs for less extreme (and presumably
more common) queries, and that we're providing a way of handling the
exceptional, extreme queries.

-Kevin

Re: anti-join chosen even when slower than old plan

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Erm ... you can in fact do "ALTER USER SET random_page_cost"
> today.

Ouch.  I'm embarrassed to have missed that.  I'll do that instead of
adding those settings to the scripts, then.

Thanks for pointing that out.

-Kevin

Re: anti-join chosen even when slower than old plan

From
Andres Freund
Date:
On Thursday 11 November 2010 19:58:49 Tom Lane wrote:
> I wrote:
> > I do think that something based around a settable-per-table caching
> > percentage might be a reasonable way to proceed.
>
> BTW ... on reflection it seems that this would *not* solve the use-case
> Kevin described at the start of this thread.  What he's got AIUI is some
> large tables whose recent entries are well-cached, and a lot of queries
> that tend to hit that well-cached portion, plus a few queries that hit
> the whole table and so see largely-not-cached behavior.  We can't
> represent that very well with a caching knob at the table level.  Either
> a high or a low setting will be wrong for one set of queries or the
> other.
>
> It might work all right if he were to partition the table and then have
> a different caching value attached to the currently-latest partition,
> but that doesn't sound exactly maintenance-free either.  Also, that only
> works with the current statically-planned approach to partitioned
> tables.  I think where we're trying to go with partitioning is that
> the planner doesn't consider the individual partitions, but the executor
> just hits the right one at runtime --- so cost modifiers attached to
> individual partitions aren't going to work in that environment.
>
> The most practical solution for his case still seems to be to twiddle
> some GUC or other locally in the maintenance scripts that do the
> full-table-scan queries.  Unfortunately we don't have an equivalent
> of per-session SET (much less SET LOCAL) for per-relation attributes.
> Not sure if we want to go there.
As discussed in another thread some time ago, another possibility is to probe
how well the data is cached using mincore() or similar...
While it presents problems with cache ramp-up, it's quite cool for other use-cases
(like this one).

Andres

Re: anti-join chosen even when slower than old plan

From
Jon Nelson
Date:
On Thu, Nov 11, 2010 at 1:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Besides the "fully-scanned object size relative to relation size
>> costing adjustment" idea, the only one which seemed to be likely to
>> be useful for this sort of issue was the "costing factors by user
>> ID" idea -- the interactive queries hitting the well-cached portion
>> of the tables are run through a read-only user ID, while the weekly
>> maintenance scripts (obviously) are not.  With the settings I
>> initially had assigned to the cluster the maintenance scripts would
>> never have seen this issue; it was tuning to resolve end-user
>> complaints of slowness in the interactive queries which set up the
>> conditions for failure, and if I'd had per-user settings, I probably
>> would have (and definitely *should* have) used them.
>
> Erm ... you can in fact do "ALTER USER SET random_page_cost" today.
> As long as the settings are GUC parameters we have quite a lot of
> flexibility about how to control them.  This gets back to my earlier
> point that our current form of per-relation properties (reloptions) is
> considerably less flexible than a GUC.  I think that if we create any
> strong planner dependence on such properties, we're going to end up
> needing to be able to set them in all the same ways you can set a GUC.

In Kevin's particular case, would this mechanism not help? By that I
mean he could have two users: one user for the daily,
tables-ought-to-be-in-hot-cache use case. The other user could make use
of the ALTER USER SET ... mechanism to drive the weekly reporting
(tables are probably not hot) use case.


--
Jon

Re: anti-join chosen even when slower than old plan

From
Robert Haas
Date:
On Thu, Nov 11, 2010 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Yeah.  For Kevin's case, it seems like we want the caching percentage
>> to vary not so much based on which table we're hitting at the moment
>> but on how much of it we're actually reading.
>
> Well, we could certainly take the expected number of pages to read and
> compare that to effective_cache_size.  The thing that's missing in that
> equation is how much other stuff is competing for cache space.  I've
> tried to avoid having the planner need to know the total size of the
> database cluster, but it's kind of hard to avoid that if you want to
> model this honestly.

I'm not sure I agree with that.  I mean, you could easily have a
database that is much larger than effective_cache_size, but only that
much of it is hot.  Or, the hot portion could move around over time.
And for reasons of both technical complexity and plan stability, I
don't think we want to try to model that.  It seems perfectly
reasonable to say that reading 25% of effective_cache_size will be
more expensive *per-page* than reading 5% of effective_cache_size,
independently of what the total cluster size is.

> Would it be at all workable to have an estimate that so many megs of a
> table are in cache (independently of any other table), and then we could
> scale the cost based on the expected number of pages to read versus that
> number?  The trick here is that DBAs really aren't going to want to set
> such a per-table number (at least, most of the time) so we need a
> formula to get to a default estimate for that number based on some simple
> system-wide parameters.  I'm not sure if that's any easier.

That's an interesting idea.  For the sake of argument, suppose we
assume that a relation which is less than 5% of effective_cache_size
will be fully cached; and anything larger we'll assume that much of it
is cached.  Consider a 4GB machine with effective_cache_size set to
3GB.  Then we'll assume that any relation less than 153MB is
100% cached, a 1 GB table is 15% cached, and a 3 GB table is 5%
cached.  That doesn't seem quite right, though: the caching percentage
drops off very quickly after you exceed the threshold.

*thinks*

I'm wondering if we could do something with a formula like 3 *
amount_of_data_to_read / (3 * amount_of_data_to_read +
effective_cache_size) = percentage NOT cached.  That is, if we're
reading an amount of data equal to effective_cache_size, we assume 25%
caching, and plot a smooth curve through that point.  In the examples
above, we would assume that a 150MB read is 87% cached, a 1GB read is
50% cached, and a 3GB read is 25% cached.
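
Just to sanity-check those numbers (with effective_cache_size taken as
3072MB), something like:

    SELECT read_mb,
           round(100 * 3072.0 / (3 * read_mb + 3072)) AS assumed_pct_cached
    FROM (VALUES (150), (1024), (3072)) AS v(read_mb);

does indeed come out to roughly 87, 50, and 25.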

> BTW, it seems that all these variants have an implicit assumption that
> if you're reading a small part of the table it's probably part of the
> working set; which is an assumption that could be 100% wrong.  I don't
> see a way around it without trying to characterize the data access at
> an unworkably fine level, though.

Me neither, but I think it will frequently be true, and I'm not sure
it will hurt very much when it isn't.  I mean, if you execute the same
query repeatedly, that data will become hot soon enough.  If you
execute a lot of different queries that each touch a small portion of
a big, cold table, we might underestimate the costs of the index
probes, but so what?  There's probably no better strategy for
accessing that table anyway.  Perhaps you can construct an example
where this underestimate affects the join order in an undesirable
fashion, but I'm having a hard time getting worked up about that as a
potential problem case.  Our current system - where we essentially
assume that the caching percentage is uniform across the board - can
have the same problem in less artificial cases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: anti-join chosen even when slower than old plan

From
gnuoytr@rcn.com
Date:
---- Original message ----
>Date: Thu, 11 Nov 2010 15:29:40 -0500
>From: pgsql-performance-owner@postgresql.org (on behalf of Robert Haas <robertmhaas@gmail.com>)
>Subject: Re: [PERFORM] anti-join chosen even when slower than old plan
>To: Tom Lane <tgl@sss.pgh.pa.us>
>Cc: Kevin Grittner <Kevin.Grittner@wicourts.gov>,Mladen Gogala
<mladen.gogala@vmsinfo.com>,"pgsql-performance@postgresql.org"<pgsql-performance@postgresql.org> 
>
>On Thu, Nov 11, 2010 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Yeah.  For Kevin's case, it seems like we want the caching percentage
>>> to vary not so much based on which table we're hitting at the moment
>>> but on how much of it we're actually reading.
>>
>> Well, we could certainly take the expected number of pages to read and
>> compare that to effective_cache_size.  The thing that's missing in that
>> equation is how much other stuff is competing for cache space.  I've
>> tried to avoid having the planner need to know the total size of the
>> database cluster, but it's kind of hard to avoid that if you want to
>> model this honestly.
>
>I'm not sure I agree with that.  I mean, you could easily have a
>database that is much larger than effective_cache_size, but only that
>much of it is hot.  Or, the hot portion could move around over time.
>And for reasons of both technical complexity and plan stability, I
>don't think we want to try to model that.  It seems perfectly
>reasonable to say that reading 25% of effective_cache_size will be
>more expensive *per-page* than reading 5% of effective_cache_size,
>independently of what the total cluster size is.
>
>> Would it be at all workable to have an estimate that so many megs of a
>> table are in cache (independently of any other table), and then we could
>> scale the cost based on the expected number of pages to read versus that
>> number?  The trick here is that DBAs really aren't going to want to set
>> such a per-table number (at least, most of the time) so we need a
>> formula to get to a default estimate for that number based on some simple
>> system-wide parameters.  I'm not sure if that's any easier.
>
>That's an interesting idea.  For the sake of argument, suppose we
>assume that a relation which is less than 5% of effective_cache_size
>will be fully cached; and anything larger we'll assume that much of it
>is cached.  Consider a 4GB machine with effective_cache_size set to
>3GB.  Then we'll assume that any relation less than 153MB table is
>100% cached, a 1 GB table is 15% cached, and a 3 GB table is 5%
>cached.  That doesn't seem quite right, though: the caching percentage
>drops off very quickly after you exceed the threshold.
>
>*thinks*
>
>I wondering if we could do something with a formula like 3 *
>amount_of_data_to_read / (3 * amount_of_data_to_read +
>effective_cache_size) = percentage NOT cached.  That is, if we're
>reading an amount of data equal to effective_cache_size, we assume 25%
>caching, and plot a smooth curve through that point.  In the examples
>above, we would assume that a 150MB read is 87% cached, a 1GB read is
>50% cached, and a 3GB read is 25% cached.
>
>> BTW, it seems that all these variants have an implicit assumption that
>> if you're reading a small part of the table it's probably part of the
>> working set; which is an assumption that could be 100% wrong.  I don't
>> see a way around it without trying to characterize the data access at
>> an unworkably fine level, though.
>
>Me neither, but I think it will frequently be true, and I'm not sure
>it will hurt very much when it isn't.  I mean, if you execute the same
>query repeatedly, that data will become hot soon enough.  If you
>execute a lot of different queries that each touch a small portion of
>a big, cold table, we might underestimate the costs of the index
>probes, but so what?  There's probably no better strategy for
>accessing that table anyway.  Perhaps you can construct an example
>where this underestimate affects the join order in an undesirable
>fashion, but I'm having a hard time getting worked up about that as a
>potential problem case.  Our current system - where we essentially
>assume that the caching percentage is uniform across the board - can
>have the same problem in less artificial cases.
>
>--
>Robert Haas
>EnterpriseDB: http://www.enterprisedb.com
>The Enterprise PostgreSQL Company
>

On a thread some time ago, on a similar subject, I opined that I missed the ability to assign tables to tablespaces and
buffers to tablespaces, thus having the ability to isolate needed tables (perhaps a One True Lookup Table, for example;
or a Customer table) to memory without fear of eviction.

I was soundly beaten about the face and breast.  It really is an "Enterprise" way of handling the situation.

regards,
Robert

Re: anti-join chosen even when slower than old plan

From
Kenneth Marshall
Date:
On Thu, Nov 11, 2010 at 03:56:25PM -0500, gnuoytr@rcn.com wrote:
> On a thread some time ago, on a similar subject, I opined that I missed the ability to assign tables to tablespaces
> and buffers to tablespaces, thus having the ability to isolate needed tables (perhaps a One True Lookup Table, for
> example; or a Customer table) to memory without fear of eviction.
>
> I was soundly beaten about the face and breast.  It really is an "Enterprise" way of handling the situation.
>
> regards,
> Robert
>

ALTER TABLE can be used to change the tablespace of a table
and/or index.
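
For example (names are made up):

    ALTER TABLE customer SET TABLESPACE fast_storage;
    ALTER INDEX customer_pkey SET TABLESPACE fast_storage;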

Cheers,
Ken

Re: anti-join chosen even when slower than old plan

From
Cédric Villemain
Date:
2010/11/11 Tom Lane <tgl@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Yeah.  For Kevin's case, it seems like we want the caching percentage
>> to vary not so much based on which table we're hitting at the moment
>> but on how much of it we're actually reading.
>
> Well, we could certainly take the expected number of pages to read and
> compare that to effective_cache_size.  The thing that's missing in that
> equation is how much other stuff is competing for cache space.  I've
> tried to avoid having the planner need to know the total size of the
> database cluster, but it's kind of hard to avoid that if you want to
> model this honestly.
>
> Would it be at all workable to have an estimate that so many megs of a
> table are in cache

Yes, with Linux ... at least.

> (independently of any other table), and then we could
> scale the cost based on the expected number of pages to read versus that
> number?  The trick here is that DBAs really aren't going to want to set
> such a per-table number (at least, most of the time) so we need a
> formula to get to a default estimate for that number based on some simple
> system-wide parameters.  I'm not sure if that's any easier.

My current ideas for a future POC with pgfincore are along the lines of
what is being discussed in this thread.

I'd like to have some maintenance machinery, like auto-ANALYZE, which
reports table and index usage of the OS cache: it might be the % of data
in cache and the distribution of data in cache (perhaps only the last
15% of the table is in cache, or perhaps 15% of blocks with a more
regular, or random, distribution).
My current stats around the OS cache show that the OS page cache remains
stable: the number of blocks in memory per object does not change much
once the application has run long enough.

Those are good stats for automatically adjusting random_page_cost and
seq_page_cost per table or index. The DBA provides random_page_cost and
seq_page_cost values that are accurate for the hardware; perhaps we may
want a mem_page_cost (?). Or we just adjust random_page_cost and
seq_page_cost based on the average amount of data in cache.
Actually, I think that updating *_page_cost and keeping the current
design of effective_cache_size (in costsize.c) may be good enough.
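
From memory, the probing call looks something like this (see the
pgfincore README for the exact functions and output columns; the table
name is just an example):

    SELECT * FROM pgfincore('my_big_table');

which reports how many of the relation's OS pages are currently
resident in the OS page cache.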


>
> BTW, it seems that all these variants have an implicit assumption that
> if you're reading a small part of the table it's probably part of the
> working set; which is an assumption that could be 100% wrong.  I don't
> see a way around it without trying to characterize the data access at
> an unworkably fine level, though.

Exactly.

>
>                        regards, tom lane
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: anti-join chosen even when slower than old plan

From
Cédric Villemain
Date:
2010/11/11 Robert Haas <robertmhaas@gmail.com>:
> On Thu, Nov 11, 2010 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Yeah.  For Kevin's case, it seems like we want the caching percentage
>>> to vary not so much based on which table we're hitting at the moment
>>> but on how much of it we're actually reading.
>>
>> Well, we could certainly take the expected number of pages to read and
>> compare that to effective_cache_size.  The thing that's missing in that
>> equation is how much other stuff is competing for cache space.  I've
>> tried to avoid having the planner need to know the total size of the
>> database cluster, but it's kind of hard to avoid that if you want to
>> model this honestly.
>
> I'm not sure I agree with that.  I mean, you could easily have a
> database that is much larger than effective_cache_size, but only that
> much of it is hot.  Or, the hot portion could move around over time.
> And for reasons of both technical complexity and plan stability, I
> don't think we want to try to model that.  It seems perfectly
> reasonable to say that reading 25% of effective_cache_size will be
> more expensive *per-page* than reading 5% of effective_cache_size,
> independently of what the total cluster size is.
>
>> Would it be at all workable to have an estimate that so many megs of a
>> table are in cache (independently of any other table), and then we could
>> scale the cost based on the expected number of pages to read versus that
>> number?  The trick here is that DBAs really aren't going to want to set
>> such a per-table number (at least, most of the time) so we need a
>> formula to get to a default estimate for that number based on some simple
>> system-wide parameters.  I'm not sure if that's any easier.
>
> That's an interesting idea.  For the sake of argument, suppose we
> assume that a relation which is less than 5% of effective_cache_size
> will be fully cached; and anything larger we'll assume that much of it
> is cached.  Consider a 4GB machine with effective_cache_size set to
> 3GB.  Then we'll assume that any relation less than 153MB table is
> 100% cached, a 1 GB table is 15% cached, and a 3 GB table is 5%
> cached.  That doesn't seem quite right, though: the caching percentage
> drops off very quickly after you exceed the threshold.
>
> *thinks*
>
> I wondering if we could do something with a formula like 3 *
> amount_of_data_to_read / (3 * amount_of_data_to_read +
> effective_cache_size) = percentage NOT cached.  That is, if we're
> reading an amount of data equal to effective_cache_size, we assume 25%
> caching, and plot a smooth curve through that point.  In the examples
> above, we would assume that a 150MB read is 87% cached, a 1GB read is
> 50% cached, and a 3GB read is 25% cached.


But isn't it already the behavior of effective_cache_size usage?

See index_pages_fetched() in costsize.c.


>
>> BTW, it seems that all these variants have an implicit assumption that
>> if you're reading a small part of the table it's probably part of the
>> working set; which is an assumption that could be 100% wrong.  I don't
>> see a way around it without trying to characterize the data access at
>> an unworkably fine level, though.
>
> Me neither, but I think it will frequently be true, and I'm not sure
> it will hurt very much when it isn't.  I mean, if you execute the same
> query repeatedly, that data will become hot soon enough.  If you
> execute a lot of different queries that each touch a small portion of
> a big, cold table, we might underestimate the costs of the index
> probes, but so what?  There's probably no better strategy for
> accessing that table anyway.  Perhaps you can construct an example
> where this underestimate affects the join order in an undesirable
> fashion, but I'm having a hard time getting worked up about that as a
> potential problem case.  Our current system - where we essentially
> assume that the caching percentage is uniform across the board - can
> have the same problem in less artificial cases.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: anti-join chosen even when slower than old plan

From
Vitalii Tymchyshyn
Date:
I'd say there are two Qs here:

1) Modify costs based on information on how much of the table is in
cache. It would be great if this could be done, but I'd prefer to have
it as admin knobs (because of plan stability). Maybe both admin and
automatic ways can be followed, with some parallel (disableable) process
modifying the knobs on the admin's behalf. In that case different
strategies for automatically modifying the knobs can be applied.

2) Modify costs for retrieval of part of a table. Then you need to
define "part". The current ways are partitioning and partial indexes.
Something similar to a partial index could be created that has only a
"where" clause and no data, but does have statistics and knobs (and
maybe a personal buffer space, if those are introduced). I don't like
gathering data about "the last X percent" or the like, because it only
works when the table is clustered, and it's hard for the optimizer to
decide whether scanning only that percentage will be enough for a given
query.

Best regards, Vitalii Tymchyshyn

Re: anti-join chosen even when slower than old plan

From
Cédric Villemain
Date:
I suppose it was an answer to my mail, but I'm not sure... please keep
people CC'ed; it is easier to follow threads (at least for me).

2010/11/12 Vitalii Tymchyshyn <tivv00@gmail.com>:
> I'd say there are two Qs here:
>
> 1) Modify costs based on information on how much of the table is in cache.
> It would be great  if this can be done, but I'd prefer to have it as admin
> knobs (because of plan stability). May be both admin and automatic ways can
> be followed with some parallel (disableable) process modify knobs on admin
> behalf. In this case different strategies to automatically modify knobs can
> be applied.

The OS cache is usually stable enough to keep your plans stable too, I think.

>
> 2) Modify costs for part of table retrieval. Then you need to define "part".
> Current ways are partitioning and partial indexes. Some similar to partial
> index thing may be created, that has only "where" clause and no data. But
> has statistics and knobs (and may be personal bufferspace if they are
> introduced). I don't like to gather data about "last X percents" or like,
> because it works only in clustering and it's hard for optimizer to decide if
> it will be enough to scan only this percents for given query.

Modifying random_page_cost and seq_page_cost based on statistics about
cached blocks can be improved if we know the distribution.

It does not mean: we know we have the last 15% in cache and we are going
to request those 15%.

>
> Best regards, Vitalii Tymchyshyn
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: anti-join chosen even when slower than old plan

From
Vitalii Tymchyshyn
Date:
On 12.11.10 12:56, Cédric Villemain wrote:
> I supposed it was an answer to my mail but not sure... please keep
> CC'ed people, it is easier to follow threads (at least for me)
>
OK
> 2010/11/12 Vitalii Tymchyshyn<tivv00@gmail.com>:
>
>> I'd say there are two Qs here:
>>
>> 1) Modify costs based on information on how much of the table is in cache.
>> It would be great  if this can be done, but I'd prefer to have it as admin
>> knobs (because of plan stability). May be both admin and automatic ways can
>> be followed with some parallel (disableable) process modify knobs on admin
>> behalf. In this case different strategies to automatically modify knobs can
>> be applied.
>>
> OS cache is usualy stable enough to keep your plans stable too, I think.
>
Not if it is on the edge. There are always edge cases where data
fluctuates near some threshold.
>
>> 2) Modify costs for part of table retrieval. Then you need to define "part".
>> Current ways are partitioning and partial indexes. Some similar to partial
>> index thing may be created, that has only "where" clause and no data. But
>> has statistics and knobs (and may be personal bufferspace if they are
>> introduced). I don't like to gather data about "last X percents" or like,
>> because it works only in clustering and it's hard for optimizer to decide if
>> it will be enough to scan only this percents for given query.
>>
> Modifying random_page_cost and sequential_page_cost thanks to
> statistics about cached blocks can be improved if we know the
> distribution.
>
> It does not mean : we know we have last 15% in cache, and we are goign
> to request those 15%.
>

You mean *_cost for the whole table, don't you? That is case (1) for me.
Case (2) is when different cost values are selected based on what
portion of the table is requested in the query. E.g. when we have data
for the whole day in one table, the data for the last hour is cached and
all the other data is not. The optimizer may then use different *_cost
for a query that requires all the data and for a query that requires
only the last hour's data. But, as I've said, that is a much more
complex task than (1).

Best regards, Vitalii Tymchyshyn


Re: anti-join chosen even when slower than old plan

From
Cédric Villemain
Date:
2010/11/12 Vitalii Tymchyshyn <tivv00@gmail.com>:
> On 12.11.10 12:56, Cédric Villemain wrote:
>>
>> I supposed it was an answer to my mail but not sure... please keep
>> CC'ed people, it is easier to follow threads (at least for me)
>>
>
> OK
>>
>> 2010/11/12 Vitalii Tymchyshyn<tivv00@gmail.com>:
>>
>>>
>>> I'd say there are two Qs here:
>>>
>>> 1) Modify costs based on information on how much of the table is in
>>> cache.
>>> It would be great  if this can be done, but I'd prefer to have it as
>>> admin
>>> knobs (because of plan stability). May be both admin and automatic ways
>>> can
>>> be followed with some parallel (disableable) process modify knobs on
>>> admin
>>> behalf. In this case different strategies to automatically modify knobs
>>> can
>>> be applied.
>>>
>>
>> OS cache is usualy stable enough to keep your plans stable too, I think.
>>
>
> Not if it is on edge. There are always edge cases where data fluctuates near
> some threshold.

So far I have done some analysis on the topic with pgfincore. Tables and
indexes at first show peaks and holes if you graph the % of blocks in
cache from server start, but after a while it is more stable.

Maybe there are applications where Linux fails to find a 'stable' page cache.

If people are able to graph the pgfincore results for all or part of
the objects in their database, it will give us more robust analysis,
especially where corner cases with the planner exist (like here).

>>
>>
>>>
>>> 2) Modify costs for part of table retrieval. Then you need to define
>>> "part".
>>> Current ways are partitioning and partial indexes. Some similar to
>>> partial
>>> index thing may be created, that has only "where" clause and no data. But
>>> has statistics and knobs (and may be personal bufferspace if they are
>>> introduced). I don't like to gather data about "last X percents" or like,
>>> because it works only in clustering and it's hard for optimizer to decide
>>> if
>>> it will be enough to scan only this percents for given query.
>>>
>>
>> Modifying random_page_cost and sequential_page_cost thanks to
>> statistics about cached blocks can be improved if we know the
>> distribution.
>>
>> It does not mean : we know we have last 15% in cache, and we are goign
>> to request those 15%.
>>
>
> You mean *_cost for the whole table, don't you? That is case (1) for me.

Yes.

> Case (2) is when different cost values are selected based on what portion of
> table is requested in the query. E.g. when we have data for the whole day in
> one table, data for the last hour is cached and all the other data is not.
> Optimizer then may use different *_cost for query that requires all the data
> and for query that requires only last hour data. But, as I've said, that is
> much more complex task then (1).

I need to think about that some more.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: anti-join chosen even when slower than old plan

From
Robert Haas
Date:
On Fri, Nov 12, 2010 at 4:15 AM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
>> I wondering if we could do something with a formula like 3 *
>> amount_of_data_to_read / (3 * amount_of_data_to_read +
>> effective_cache_size) = percentage NOT cached.  That is, if we're
>> reading an amount of data equal to effective_cache_size, we assume 25%
>> caching, and plot a smooth curve through that point.  In the examples
>> above, we would assume that a 150MB read is 87% cached, a 1GB read is
>> 50% cached, and a 3GB read is 25% cached.
>
> But isn't it already the behavior of effective_cache_size usage ?

No.

The idea of trying to know what is actually in cache strikes me as an
almost certain non-starter.  It can change very quickly, even as a
result of the query you're actually running.  And getting the
information we'd need in order to do it that way would be very
expensive, when it can be done at all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: anti-join chosen even when slower than old plan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Nov 12, 2010 at 4:15 AM, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>>> I wondering if we could do something with a formula like 3 *
>>> amount_of_data_to_read / (3 * amount_of_data_to_read +
>>> effective_cache_size) = percentage NOT cached.  That is, if we're
>>> reading an amount of data equal to effective_cache_size, we assume 25%
>>> caching, and plot a smooth curve through that point.  In the examples
>>> above, we would assume that a 150MB read is 87% cached, a 1GB read is
>>> 50% cached, and a 3GB read is 25% cached.

>> But isn't it already the behavior of effective_cache_size usage ?

> No.

I think his point is that we already have a proven formula
(Mackert-Lohmann) and shouldn't be inventing a new one out of thin air.
The problem is to figure out what numbers to apply the M-L formula to.
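
(For reference, the M-L approximation as the comment above
index_pages_fetched() in costsize.c gives it is

    PF = min(2TNs/(2T+Ns), T)               when T <= b
         2TNs/(2T+Ns)                       when T > b and Ns <= 2Tb/(2T-b)
         b + (Ns - 2Tb/(2T-b))*(T-b)/T      when T > b and Ns > 2Tb/(2T-b)

where T = pages in the table, N = tuples in the table, s = selectivity,
and b = buffer pages available.)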

I've been thinking that we ought to try to use it in the context of the
query as a whole rather than for individual table scans; the current
usage already has some of that flavor but we haven't taken it to the
logical conclusion.

> The idea of trying to know what is actually in cache strikes me as an
> almost certain non-starter.

Agreed on that point.  Plan stability would go out the window.

            regards, tom lane

Re: anti-join chosen even when slower than old plan

From
Robert Haas
Date:
On Fri, Nov 12, 2010 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think his point is that we already have a proven formula
> (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air.
> The problem is to figure out what numbers to apply the M-L formula to.

I'm not sure that's really measuring the same thing, although I'm not
opposed to using it if it produces reasonable answers.

> I've been thinking that we ought to try to use it in the context of the
> query as a whole rather than for individual table scans; the current
> usage already has some of that flavor but we haven't taken it to the
> logical conclusion.

That's got a pretty severe chicken-and-egg problem though, doesn't it?
You're going to need to know how much data you're touching to
estimate the costs so you can pick the best plan, but you can't know
how much data will ultimately be touched until you've got the whole
plan.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: anti-join chosen even when slower than old plan

From
Cédric Villemain
Date:
2010/11/12 Tom Lane <tgl@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Nov 12, 2010 at 4:15 AM, Cédric Villemain
>> <cedric.villemain.debian@gmail.com> wrote:
>>>> I wondering if we could do something with a formula like 3 *
>>>> amount_of_data_to_read / (3 * amount_of_data_to_read +
>>>> effective_cache_size) = percentage NOT cached.  That is, if we're
>>>> reading an amount of data equal to effective_cache_size, we assume 25%
>>>> caching, and plot a smooth curve through that point.  In the examples
>>>> above, we would assume that a 150MB read is 87% cached, a 1GB read is
>>>> 50% cached, and a 3GB read is 25% cached.
>
>>> But isn't it already the behavior of effective_cache_size usage ?
>
>> No.
>
> I think his point is that we already have a proven formula
> (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air.
> The problem is to figure out what numbers to apply the M-L formula to.
>
> I've been thinking that we ought to try to use it in the context of the
> query as a whole rather than for individual table scans; the current
> usage already has some of that flavor but we haven't taken it to the
> logical conclusion.
>
>> The ideal of trying to know what is actually in cache strikes me as an
>> almost certain non-starter.
>
> Agreed on that point.  Plan stability would go out the window.

The point is not to know the current cache contents, but, like ANALYZE,
to run on a regular basis (probably triggered by something around the
number of pages read/hit) a cache_analyze which reports stats the way
ANALYZE does, and which may be adjusted per table the way auto-analyze is.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: anti-join chosen even when slower than old plan

From
"Marc Mamin"
Date:
Hello,

Just a short though:

Is it imaginable to compare the prognoses of the plans with the actual
results and somehow log the worst cases?

a) to help the DBA locate bad statistics and queries
b) as additional information source for the planner

This could possibly affect parameters of your formula on the fly.

best regards,

Marc Mamin

Re: anti-join chosen even when slower than old plan

From
bricklen
Date:
On Sat, Nov 13, 2010 at 1:32 AM, Marc Mamin <M.Mamin@intershop.de> wrote:
> Hello,
>
> Just a short though:
>
> Is it imaginable to compare the prognoses of the plans with the actual
> results
> and somehow log the worst cases ?
>
> a) to help the DBA locate bad statistics and queries
> b) as additional information source for the planner
>
> This could possibly affect parameters of your formula on the fly.
>
> best regards,
>
> Marc Mamin

The contrib module auto_explain might help out here if you wanted to
roll your own solution for plan comparison.
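
For example, roughly (thresholds to taste; LOAD requires superuser):

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '10s';
    SET auto_explain.log_analyze = on;

will log the actual plan, with real row counts and timings, for
anything in that session running longer than ten seconds.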

Re: anti-join chosen even when slower than old plan

From
Robert Haas
Date:
On Sat, Nov 13, 2010 at 4:32 AM, Marc Mamin <M.Mamin@intershop.de> wrote:
> Hello,
>
> Just a short though:
>
> Is it imaginable to compare the prognoses of the plans with the actual
> results
> and somehow log the worst cases ?
>
> a) to help the DBA locate bad statistics and queries
> b) as additional information source for the planner
>
> This could possibly affect parameters of your formula on the fly.

Yeah, I've thought about this, but it's not exactly clear what would
be most useful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: anti-join chosen even when slower than old plan

From
Bruce Momjian
Date:
Tom Lane wrote:
> Mladen Gogala <mladen.gogala@vmsinfo.com> writes:
> > Again, having an optimizer which will choose the plan completely
> > accurately is, at least in my opinion, less important than having a
> > possibility of manual control, the aforementioned "knobs and buttons"
> > and produce the same plan for the same statement.
>
> More knobs and buttons is the Oracle way, and the end result of that
> process is that you have something as hard to use as Oracle.  That's
> generally not thought of as desirable in this community.

Late reply, but Mladen, you might want to look at my blog entry
explaining why knobs are often not useful because they are only used by
a small percentage of users (and confuse the rest):

    http://momjian.us/main/blogs/pgblog/2009.html#January_10_2009

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: anti-join chosen even when slower than old plan

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Thu, Nov 11, 2010 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> Yeah.  For Kevin's case, it seems like we want the caching percentage
> >> to vary not so much based on which table we're hitting at the moment
> >> but on how much of it we're actually reading.
> >
> > Well, we could certainly take the expected number of pages to read and
> > compare that to effective_cache_size.  The thing that's missing in that
> > equation is how much other stuff is competing for cache space.  I've
> > tried to avoid having the planner need to know the total size of the
> > database cluster, but it's kind of hard to avoid that if you want to
> > model this honestly.
>
> I'm not sure I agree with that.  I mean, you could easily have a
> database that is much larger than effective_cache_size, but only that
> much of it is hot.  Or, the hot portion could move around over time.
> And for reasons of both technical complexity and plan stability, I
> don't think we want to try to model that.  It seems perfectly
> reasonable to say that reading 25% of effective_cache_size will be
> more expensive *per-page* than reading 5% of effective_cache_size,
> independently of what the total cluster size is.

Late reply, but one idea is to have the executor store hit counts for
later use by the optimizer.  Only the executor knows how many pages it
had to request from the kernel for a query.  Perhaps getrusage could
tell us how often we hit the disk.
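
For what it's worth, the statistics collector already exposes per-table
buffer hit counts, which is the kind of raw material such feedback would
need; a rough sketch (this only covers shared_buffers, so it says nothing
about the kernel cache, which is exactly the gap getrusage was meant to
fill):

    SELECT relname,
           heap_blks_read,                        -- blocks requested from the kernel
           heap_blks_hit,                         -- blocks found in shared_buffers
           round(heap_blks_hit::numeric
                 / nullif(heap_blks_hit + heap_blks_read, 0), 2) AS hit_ratio
      FROM pg_statio_user_tables
     ORDER BY heap_blks_read + heap_blks_hit DESC
     LIMIT 10;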

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: anti-join chosen even when slower than old plan

From
Bruce Momjian
Date:
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Nov 12, 2010 at 4:15 AM, Cédric Villemain
> > <cedric.villemain.debian@gmail.com> wrote:
> >>> I wondering if we could do something with a formula like 3 *
> >>> amount_of_data_to_read / (3 * amount_of_data_to_read +
> >>> effective_cache_size) = percentage NOT cached.  That is, if we're
> >>> reading an amount of data equal to effective_cache_size, we assume 25%
> >>> caching, and plot a smooth curve through that point.  In the examples
> >>> above, we would assume that a 150MB read is 87% cached, a 1GB read is
> >>> 50% cached, and a 3GB read is 25% cached.
>
> >> But isn't it already the behavior of effective_cache_size usage ?
>
> > No.
>
> I think his point is that we already have a proven formula
> (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air.
> The problem is to figure out what numbers to apply the M-L formula to.
>
> I've been thinking that we ought to try to use it in the context of the
> query as a whole rather than for individual table scans; the current
> usage already has some of that flavor but we haven't taken it to the
> logical conclusion.

Is there a TODO here?
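
For concreteness, the curve quoted above is easy to check; a quick sketch
assuming effective_cache_size = 3GB, the value that reproduces the quoted
87% / 50% / 25% figures (the cached fraction works out to
effective_cache_size / (3 * read_size + effective_cache_size)):

    SELECT read_mb,
           round(3072 / (3 * read_mb + 3072), 2) AS assumed_cached_fraction
      FROM (VALUES (150.0), (1024.0), (3072.0)) AS t(read_mb);
    -- 150 MB -> 0.87, 1 GB -> 0.50, 3 GB -> 0.25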

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: anti-join chosen even when slower than old plan

From
Cédric Villemain
Date:
2011/1/19 Bruce Momjian <bruce@momjian.us>:
> Tom Lane wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>> > On Fri, Nov 12, 2010 at 4:15 AM, Cédric Villemain
>> > <cedric.villemain.debian@gmail.com> wrote:
>> >>> I wondering if we could do something with a formula like 3 *
>> >>> amount_of_data_to_read / (3 * amount_of_data_to_read +
>> >>> effective_cache_size) = percentage NOT cached.  That is, if we're
>> >>> reading an amount of data equal to effective_cache_size, we assume 25%
>> >>> caching, and plot a smooth curve through that point.  In the examples
>> >>> above, we would assume that a 150MB read is 87% cached, a 1GB read is
>> >>> 50% cached, and a 3GB read is 25% cached.
>>
>> >> But isn't it already the behavior of effective_cache_size usage ?
>>
>> > No.
>>
>> I think his point is that we already have a proven formula
>> (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air.
>> The problem is to figure out what numbers to apply the M-L formula to.
>>
>> I've been thinking that we ought to try to use it in the context of the
>> query as a whole rather than for individual table scans; the current
>> usage already has some of that flavor but we haven't taken it to the
>> logical conclusion.
>
> Is there a TODO here?

it looks like, yes.
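
For readers who haven't seen it spelled out, the Mackert-Lohmann
approximation that index_pages_fetched() already uses estimates the
number of page fetches PF as (T = pages in the table, N = tuples,
s = selectivity, b = buffer pages available):

    PF = min(2TNs/(2T+Ns), T)              when T <= b
    PF = 2TNs/(2T+Ns)                      when T > b and Ns <= 2Tb/(2T-b)
    PF = b + (Ns - 2Tb/(2T-b))*(T-b)/T     when T > b and Ns > 2Tb/(2T-b)

The open question in this thread is what to plug in for b, and whether to
apply it per scan or across the query as a whole.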

>
> --
>  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>  EnterpriseDB                             http://enterprisedb.com
>
>  + It's impossible for everything to be true. +
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: anti-join chosen even when slower than old plan

From
Robert Haas
Date:
On Thu, Jan 20, 2011 at 4:17 AM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
>>> I think his point is that we already have a proven formula
>>> (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air.
>>> The problem is to figure out what numbers to apply the M-L formula to.
>>>
>>> I've been thinking that we ought to try to use it in the context of the
>>> query as a whole rather than for individual table scans; the current
>>> usage already has some of that flavor but we haven't taken it to the
>>> logical conclusion.
>>
>> Is there a TODO here?
>
> it looks like, yes.

"Modify the planner to better estimate caching effects"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: anti-join chosen even when slower than old plan

From
Cédric Villemain
Date:
2011/1/20 Robert Haas <robertmhaas@gmail.com>:
> On Thu, Jan 20, 2011 at 4:17 AM, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>>>> I think his point is that we already have a proven formula
>>>> (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air.
>>>> The problem is to figure out what numbers to apply the M-L formula to.
>>>>
>>>> I've been thinking that we ought to try to use it in the context of the
>>>> query as a whole rather than for individual table scans; the current
>>>> usage already has some of that flavor but we haven't taken it to the
>>>> logical conclusion.
>>>
>>> Is there a TODO here?
>>
>> it looks like, yes.
>
> "Modify the planner to better estimate caching effects"?

Or perhaps two items: "Estimate the caching effect in the context of the
whole query instead of per object" (the point above) and "Improve the
estimate of caching effects" (more or less a review of the M-L usage,
with finer control of the cache estimate)?

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: anti-join chosen even when slower than old plan

From
Cédric Villemain
Date:
2011/1/19 Bruce Momjian <bruce@momjian.us>:
> Robert Haas wrote:
>> On Thu, Nov 11, 2010 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Robert Haas <robertmhaas@gmail.com> writes:
>> >> Yeah.  For Kevin's case, it seems like we want the caching percentage
>> >> to vary not so much based on which table we're hitting at the moment
>> >> but on how much of it we're actually reading.
>> >
>> > Well, we could certainly take the expected number of pages to read and
>> > compare that to effective_cache_size.  The thing that's missing in that
>> > equation is how much other stuff is competing for cache space.  I've
>> > tried to avoid having the planner need to know the total size of the
>> > database cluster, but it's kind of hard to avoid that if you want to
>> > model this honestly.
>>
>> I'm not sure I agree with that.  I mean, you could easily have a
>> database that is much larger than effective_cache_size, but only that
>> much of it is hot.  Or, the hot portion could move around over time.
>> And for reasons of both technical complexity and plan stability, I
>> don't think we want to try to model that.  It seems perfectly
>> reasonable to say that reading 25% of effective_cache_size will be
>> more expensive *per-page* than reading 5% of effective_cache_size,
>> independently of what the total cluster size is.
>
> Late reply, but one idea is to have the executor store hit counts for
> later use by the optimizer.  Only the executor knows how many pages it
> had to request from the kernel for a query.  Perhaps getrusage could
> tell us how often we hit the disk.

AFAIK getrusage does not provide access to real I/O counters, only the
filesystem-level ones. :-(

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: anti-join chosen even when slower than old plan

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Thu, Jan 20, 2011 at 4:17 AM, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
> >>> I think his point is that we already have a proven formula
> >>> (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air.
> >>> The problem is to figure out what numbers to apply the M-L formula to.
> >>>
> >>> I've been thinking that we ought to try to use it in the context of the
> >>> query as a whole rather than for individual table scans; the current
> >>> usage already has some of that flavor but we haven't taken it to the
> >>> logical conclusion.
> >>
> >> Is there a TODO here?
> >
> > it looks like, yes.
>
> "Modify the planner to better estimate caching effects"?

Added to TODO.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +