Thread: anti-join chosen even when slower than old plan
The semi-join and anti-join have helped us quite a bit, but we have seen a situation where anti-join is chosen even though it is slower than the "old fashioned" plan.  I know there have been other reports of this, but I just wanted to go on record with my details.

The query:

delete from "DbTranLogRecord"
  where not exists
        (select * from "DbTranRepository" r
          where r."countyNo" = "DbTranLogRecord"."countyNo"
            and r."tranImageSeqNo" = "DbTranLogRecord"."tranImageSeqNo");

Old plan on 8.3.7:

 Seq Scan on "DbTranLogRecord"  (cost=0.00..1224227790.06 rows=333387520 width=6)
   Filter: (NOT (subplan))
   SubPlan
     ->  Index Scan using "DbTranRepositoryPK" on "DbTranRepository" r  (cost=0.00..1.83 rows=1 width=974)
           Index Cond: ((("countyNo")::smallint = ($0)::smallint) AND (("tranImageSeqNo")::numeric = ($1)::numeric))

Deletes about 9.2 million rows in 7 hours and 20 minutes.

New plan on 9.0.1:

 Delete  (cost=0.00..93918390.38 rows=1 width=12)
   ->  Merge Anti Join  (cost=0.00..93918390.38 rows=1 width=12)
         Merge Cond: ((("DbTranLogRecord"."countyNo")::smallint = (r."countyNo")::smallint) AND (("DbTranLogRecord"."tranImageSeqNo")::numeric = (r."tranImageSeqNo")::numeric))
         ->  Index Scan using "DbTranLogRecordPK" on "DbTranLogRecord"  (cost=0.00..73143615.91 rows=675405504 width=20)
         ->  Index Scan using "DbTranRepositoryPK" on "DbTranRepository" r  (cost=0.00..16328700.43 rows=152541168 width=20)

Cancelled after 39 hours and 25 minutes.

I know how to work around it by using OFFSET 0 or tweaking the costing for that one query; just sharing the information.  Also, we know these tables might be good candidates for partitioning, but that's an issue for another day.

          Table "public.DbTranLogRecord"
     Column     |       Type        | Modifiers
----------------+-------------------+-----------
 countyNo       | "CountyNoT"       | not null
 tranImageSeqNo | "TranImageSeqNoT" | not null
 logRecordSeqNo | "LogRecordSeqNoT" | not null
 operation      | "OperationT"      | not null
 tableName      | "TableNameT"      | not null
Indexes:
    "DbTranLogRecordPK" PRIMARY KEY, btree ("countyNo", "tranImageSeqNo", "logRecordSeqNo")
    "DbTranLogRecord_TableNameSeqNo" btree ("countyNo", "tableName", "tranImageSeqNo", operation)

            Table "public.DbTranRepository"
      Column      |          Type          | Modifiers
------------------+------------------------+-----------
 countyNo         | "CountyNoT"            | not null
 tranImageSeqNo   | "TranImageSeqNoT"      | not null
 timestampValue   | "TimestampT"           | not null
 transactionImage | "ImageT"               |
 status           | character(1)           | not null
 queryName        | "QueryNameT"           |
 runDuration      | numeric(15,0)          |
 userId           | "UserIdT"              |
 functionalArea   | "FunctionalAreaT"      |
 sourceRef        | character varying(255) |
 url              | "URLT"                 |
 tranImageSize    | numeric(15,0)          |
Indexes:
    "DbTranRepositoryPK" PRIMARY KEY, btree ("countyNo", "tranImageSeqNo") CLUSTER
    "DbTranRepository_UserId" btree ("countyNo", "userId", "tranImageSeqNo")
    "DbTranRepository_timestamp" btree ("countyNo", "timestampValue")

            relname             | relpages |  reltuples  | pg_relation_size
--------------------------------+----------+-------------+------------------
 DbTranLogRecord                |  5524411 | 6.75406e+08 | 42 GB
 DbTranLogRecordPK              |  6581122 | 6.75406e+08 | 50 GB
 DbTranLogRecord_TableNameSeqNo |  6803441 | 6.75406e+08 | 52 GB
 DbTranRepository               | 22695447 | 1.52376e+08 | 173 GB
 DbTranRepositoryPK             |  1353643 | 1.52376e+08 | 10 GB
 DbTranRepository_UserId        |  1753793 | 1.52376e+08 | 13 GB
 DbTranRepository_timestamp     |  1353682 | 1.52376e+08 | 10 GB
(7 rows)

oprofile output while not much besides this delete is running:

samples  %        symbol name
2320174  33.7617  index_getnext
 367268   5.3443  LWLockAcquire
 299131   4.3528  hash_search_with_hash_value
 249459   3.6300  HeapTupleSatisfiesMVCC
 229558   3.3404  PinBuffer
 222673   3.2402  _bt_checkkeys
 204416   2.9745  LWLockRelease
 194336   2.8279  heap_page_prune_opt
 152353   2.2169  XidInMVCCSnapshot
 121131   1.7626  AllocSetAlloc
  91123   1.3260  SearchCatCache
  88394   1.2863  nocache_index_getattr
  85936   1.2505  pglz_compress
  76531   1.1136  heap_hot_search_buffer
  69532   1.0118  _mdfd_getseg
  68743   1.0003  FunctionCall2
  64720   0.9418  TransactionIdPrecedes
  45298   0.6591  texteq
  43183   0.6284  UnpinBuffer
  40666   0.5917  base_yyparse

If you want more details or the opannotate level, let me know.

-Kevin
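For reference, the OFFSET 0 workaround mentioned above just adds a no-op OFFSET inside the subquery, which keeps the planner from converting the NOT EXISTS into an anti-join (a sketch):

-- Same delete, but the OFFSET 0 keeps the 8.3-style subplan shape.
DELETE FROM "DbTranLogRecord"
 WHERE NOT EXISTS (SELECT *
                     FROM "DbTranRepository" r
                    WHERE r."countyNo" = "DbTranLogRecord"."countyNo"
                      AND r."tranImageSeqNo" = "DbTranLogRecord"."tranImageSeqNo"
                    OFFSET 0);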
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > samples % symbol name > 2320174 33.7617 index_getnext I couldn't resist seeing where the time went within this function. Over 13.7% of the opannotate run time was on this bit of code: /* * The xmin should match the previous xmax value, else chain is * broken. (Note: this test is not optional because it protects * us against the case where the prior chain member's xmax aborted * since we looked at it.) */ if (TransactionIdIsValid(scan->xs_prev_xmax) && !TransactionIdEquals(scan->xs_prev_xmax, HeapTupleHeaderGetXmin(heapTuple->t_data))) break; I can't see why it would be such a hotspot, but it is. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: >> samples % symbol name >> 2320174 33.7617 index_getnext > I couldn't resist seeing where the time went within this function. > Over 13.7% of the opannotate run time was on this bit of code: > /* > * The xmin should match the previous xmax value, else chain is > * broken. (Note: this test is not optional because it protects > * us against the case where the prior chain member's xmax aborted > * since we looked at it.) > */ > if (TransactionIdIsValid(scan->xs_prev_xmax) && > !TransactionIdEquals(scan->xs_prev_xmax, > HeapTupleHeaderGetXmin(heapTuple->t_data))) > break; > I can't see why it would be such a hotspot, but it is. Main-memory access waits, maybe? If at_chain_start is false, that xmin fetch would be the first actual touch of a given heap tuple, and could be expected to have to wait for a cache line to be pulled in from RAM. However, you'd have to be spending a lot of time chasing through long HOT chains before that would happen enough to make this a hotspot... regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> However, you'd have to be spending a lot of time chasing through long HOT chains before that would happen enough to make this a hotspot...

That makes it all the more mysterious, then.  These tables are insert-only except for a weekly delete of one week of the oldest data.  The parent table, with the date, is purged first; then rows in this table with no corresponding parent are deleted with the "where not exists" query shown earlier.  I can't see how we'd ever have a HOT chain in these tables.

Is there anything in particular you'd like me to check?

-Kevin
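One quick sanity check would be the HOT-update counters in the statistics views; if they are zero for these tables, HOT chains really shouldn't be a factor (a sketch using the table names above):

SELECT relname, n_tup_upd, n_tup_hot_upd
  FROM pg_stat_user_tables
 WHERE relname IN ('DbTranLogRecord', 'DbTranRepository');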
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > The semi-join and anti-join have helped us quite a bit, but we have > seen a situation where anti-join is chosen even though it is slower > than the "old fashioned" plan. I know there have been other reports > of this, but I just wanted to go on record with my details. In principle, the old-style plan ought to be equivalent to a nestloop antijoin with a seqscan of DbTranLogRecord on the outside and an indexscan of DbTranRepository on the inside. Can you force it to choose such a plan by setting enable_mergejoin off (and maybe enable_hashjoin too)? If so, it'd be interesting to see the estimated costs and actual runtime on 9.0 for that plan. It would also be interesting to check estimated and actual costs for the SELECT COUNT(*) versions of these queries, ie, no actual delete. I'm suspicious that the cost differential has nothing to do with antijoin vs. subplan, and everything to do with whether the targeted tuples are being deleted in physical order (thus improving locality of access for the deletions). If it's the latter, see previous discussions about possibly sorting update/delete targets by CTID before applying the actions. regards, tom lane
you're joining on more than one key. That always hurts performance.
Grzegorz Jaśkiewicz wrote:

> you're joining on more than one key. That always hurts performance.

That's very clearly *not* the problem, as there is a plan which runs in acceptable time but which the optimizer will not choose without being coerced.  (1) Virtually every query we run joins on multi-column keys, yet we have good performance except for this one query.  (2) We're talking about a performance regression due to a new release picking a newly available plan which it wrongly estimates to be an order of magnitude faster, when it's actually more than five times slower.

-Kevin
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> The semi-join and anti-join have helped us quite a bit, but we have seen a situation where anti-join is chosen even though it is slower than the "old fashioned" plan.  I know there have been other reports of this, but I just wanted to go on record with my details.
>
> In principle, the old-style plan ought to be equivalent to a nestloop antijoin with a seqscan of DbTranLogRecord on the outside and an indexscan of DbTranRepository on the inside.  Can you force it to choose such a plan by setting enable_mergejoin off (and maybe enable_hashjoin too)?

Well, I got what I think is the equivalent plan by adding OFFSET 0 to the subquery:

 Delete  (cost=0.00..1239005015.67 rows=337702752 width=6)
   ->  Seq Scan on "DbTranLogRecord"  (cost=0.00..1239005015.67 rows=337702752 width=6)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Limit  (cost=0.00..1.82 rows=1 width=974)
                 ->  Index Scan using "DbTranRepositoryPK" on "DbTranRepository" r  (cost=0.00..1.82 rows=1 width=974)
                       Index Cond: ((("countyNo")::smallint = ($0)::smallint) AND (("tranImageSeqNo")::numeric = ($1)::numeric))

> If so, it'd be interesting to see the estimated costs and actual runtime on 9.0 for that plan.

Unfortunately, based on the oprofile information I decided to check out the plan I would get by boosting cpu_index_tuple_cost by a factor of 20.  The resulting plan was:

 Delete  (cost=132623778.83..139506491.18 rows=1 width=12)
   ->  Merge Anti Join  (cost=132623778.83..139506491.18 rows=1 width=12)
         Merge Cond: ((("DbTranLogRecord"."tranImageSeqNo")::numeric = (r."tranImageSeqNo")::numeric) AND (("DbTranLogRecord"."countyNo")::smallint = (r."countyNo")::smallint))
         ->  Sort  (cost=107941675.79..109630189.55 rows=675405504 width=20)
               Sort Key: "DbTranLogRecord"."tranImageSeqNo", "DbTranLogRecord"."countyNo"
               ->  Seq Scan on "DbTranLogRecord"  (cost=0.00..7306496.14 rows=675405504 width=20)
         ->  Materialize  (cost=24682103.04..25443983.12 rows=152376016 width=20)
               ->  Sort  (cost=24682103.04..25063043.08 rows=152376016 width=20)
                     Sort Key: r."tranImageSeqNo", r."countyNo"
                     ->  Seq Scan on "DbTranRepository" r  (cost=0.00..3793304.86 rows=152376016 width=20)

That looked like it had potential, so I started that off and went home before I got your post.  It finished in 3 hours and 31 minutes -- more than twice as fast as the nestloop plan used under 8.3.

But wait -- it turns out that this pain was self-inflicted.  Based on heavy testing of the interactive queries which users run against this database, we tuned the database for "fully-cached" settings, with both random_page_cost and seq_page_cost at 0.1.  In a practical sense, the users are almost always running these queries against very recent data which is, in fact, heavily cached -- so it's no surprise that the queries they run perform best with plans based on such costing.  The problem is that these weekly maintenance runs need to pass the entire database, so caching effects are far less pronounced.  If I set seq_page_cost = 1 and random_page_cost = 2, I get exactly the same (fast) plan as above.

I guess the lesson here is not to use the same costing for database-wide off-hours maintenance queries as for ad hoc queries against a smaller set of recent data by users who expect quick response time.  I'm fine with tweaking the costs in our maintenance scripts, but it does tend to make me daydream about how the optimizer might possibly auto-tweak such things....

I assume there's now no need to get timings for the old plan?
-Kevin
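For the archives, the per-script tweak boils down to a couple of SETs at the top of the weekly maintenance job (values as above; interactive sessions keep the "fully cached" costing from postgresql.conf):

-- Weekly maintenance session only; the rest of the cluster keeps the
-- costing tuned for the interactive workload.
SET seq_page_cost = 1;
SET random_page_cost = 2;

DELETE FROM "DbTranLogRecord"
 WHERE NOT EXISTS (SELECT *
                     FROM "DbTranRepository" r
                    WHERE r."countyNo" = "DbTranLogRecord"."countyNo"
                      AND r."tranImageSeqNo" = "DbTranLogRecord"."tranImageSeqNo");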
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> In principle, the old-style plan ought to be equivalent to a >> nestloop antijoin with a seqscan of DbTranLogRecord on the outside >> and an indexscan of DbTranRepository on the inside. Can you force >> it to choose such a plan by setting enable_mergejoin off (and >> maybe enable_hashjoin too)? > Well, I got what I think is the equivalent plan by adding OFFSET 0 > to the subquery: No, that *is* the old-style plan (plus a useless Limit node, which will surely make it marginally slower). My point was that a nestloop antijoin plan should have the same access pattern and hence very similar performance, maybe even a little better due to not having the SubPlan machinery in there. > But wait -- it turns out that this pain was self-inflicted. Based > on heavy testing of the interactive queries which users run against > this database we tuned the database for "fully-cached" settings, > with both random_page_cost and _seq_page_cost at 0.1. Ah. So it was underestimating the cost of the full-table indexscans, and my guess about nonsequential application of the delete actions wasn't the right guess. The merge antijoin does seem like it should be the fastest way of doing such a large join, so I think the problem is solved. regards, tom lane
On Wed, Nov 10, 2010 at 10:15 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> But wait -- it turns out that this pain was self-inflicted.  Based on heavy testing of the interactive queries which users run against this database, we tuned the database for "fully-cached" settings, with both random_page_cost and seq_page_cost at 0.1.  In a practical sense, the users are almost always running these queries against very recent data which is, in fact, heavily cached -- so it's no surprise that the queries they run perform best with plans based on such costing.  The problem is that these weekly maintenance runs need to pass the entire database, so caching effects are far less pronounced.  If I set seq_page_cost = 1 and random_page_cost = 2, I get exactly the same (fast) plan as above.
>
> I guess the lesson here is not to use the same costing for database-wide off-hours maintenance queries as for ad hoc queries against a smaller set of recent data by users who expect quick response time.  I'm fine with tweaking the costs in our maintenance scripts, but it does tend to make me daydream about how the optimizer might possibly auto-tweak such things....

Wow.  That's fascinating, and if you don't mind, I might mention this potential problem in a future talk at some point.

I've given some thought in the past to trying to maintain some model of which parts of the database are likely to be cached, and trying to adjust costing estimates based on that data.  But it's a really hard problem, because what is and is not in cache can change relatively quickly, and you don't want to have too much plan instability.  Also, for many workloads, you'd need to have pretty fine-grained statistics to figure out anything useful, which would be expensive and difficult to maintain.

But thinking over what you've written here, I'm reminded of something Peter said years ago, also about the optimizer.  He was discussing the ratio of the estimated cost to the actual cost and made an off-hand remark that efforts had been made over the years to make that ratio more consistent (i.e. improve the quality of the cost estimates) but that they'd been abandoned because they didn't necessarily produce better plans.  Applying that line of thinking to this problem, maybe we should give up on trying to make the estimates truly model reality, and focus more on assigning them values which work well in practice.  For example, in your case, it would be sufficient to estimate the amount of data that a given query is going to grovel through and then apply some heuristic to choose values for random_page_cost and seq_page_cost based on the ratio of that value to, I don't know, effective_cache_size.

Unfortunately, to know how much data we're going to grovel through, we need to know the plan; and to decide on the right plan, we need to know how much data we're going to grovel through.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:

> Wow.  That's fascinating, and if you don't mind, I might mention this potential problem in a future talk at some point.

I don't mind at all.

> For example, in your case, it would be sufficient to estimate the amount of data that a given query is going to grovel through and then apply some heuristic to choose values for random_page_cost and seq_page_cost based on the ratio of that value to, I don't know, effective_cache_size.

That's where my day-dreams on the topic have been starting.

> Unfortunately, to know how much data we're going to grovel through, we need to know the plan; and to decide on the right plan, we need to know how much data we're going to grovel through.

And that's where they've been ending.

The only half-sane answer I've thought of is to apply a different cost to full-table or full-index scans based on the ratio with effective cache size.  A higher cost for such scans is something which I've already postulated might be worthwhile for SSI, because of the increased risk of rw-conflicts which could ultimately contribute to serialization failures -- to attempt to model, at least in some crude way, the costs associated with transaction retry.

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Robert Haas <robertmhaas@gmail.com> wrote: >> Unfortunately, to know how much data we're going to grovel >> through, we need to know the plan; and to decide on the right >> plan, we need to know how much data we're going to grovel through. > And that's where they've been ending. > The only half-sane answer I've thought of is to apply a different > cost to full-table or full-index scans based on the ratio with > effective cache size. This might have some connection to some rather half-baked ideas I've been having in connection with the generalized-inner-indexscan problem. I don't have anything in the way of a coherent presentation to make yet, but the thing I'm being forced to realize is that sane modeling of a complex subplan that's on the inside of a nestloop join requires treating *every* scan type as having different costs "the first time" versus "during rescan". If the total amount of data touched in the query is less than effective_cache_size, it's not unreasonable to suppose that I/O costs during rescan might be zero, even for a seqscan or a non-parameterized indexscan. In fact, only parameterized indexscans would be able to touch pages they'd not touched the first time, and so they ought to have higher not lower rescan costs in this environment. But once the total data volume exceeds effective_cache_size, you have to do something different since you shouldn't any longer assume the data is all cached from the first scan. (This isn't quite as hard as the case you're talking about, since I think the relevant data volume is the sum of the sizes of the tables used in the query; which is easy to estimate at the start of planning, unlike the portion of the tables that actually gets touched.) An idea that isn't even half-baked yet is that once we had a cost model like that, we might be able to produce plans that are well-tuned for a heavily cached environment by applying the "rescan" cost model even to the first scan for a particular query. So that might lead to some sort of "assume_cached" GUC parameter, and perhaps Kevin could tune his reporting queries by turning that off instead of messing with individual cost constants. Or maybe we could be smarter if we could extract an estimate for the amount of data touched in the query ... but like you, I don't see a good way to get that number soon enough. regards, tom lane
On Wed, Nov 10, 2010 at 6:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Robert Haas <robertmhaas@gmail.com> wrote:
>>> Unfortunately, to know how much data we're going to grovel through, we need to know the plan; and to decide on the right plan, we need to know how much data we're going to grovel through.
>
>> And that's where they've been ending.
>
>> The only half-sane answer I've thought of is to apply a different cost to full-table or full-index scans based on the ratio with effective cache size.

Kevin, yes, good point.  Bravo!  Let's do that.  Details TBD, but suppose effective_cache_size = 1GB.  What we know for sure is that a 4 GB table is not going to be fully cached but a 4 MB table may well be.  In fact, I think we should assume that the 4 MB table IS cached, because the point is that if it's used at all, it soon will be.  It's almost certainly a bad idea to build a plan around the idea of minimizing reads from that 4 MB table in favor of doing a substantial amount of additional work somewhere else.  I suppose this could break down if you had hundreds and hundreds of 4 MB tables all of which were accessed regularly, but that's an unusual situation, and anyway it's not clear that assuming them all uncached is going to be any better than assuming them all cached.

> This might have some connection to some rather half-baked ideas I've been having in connection with the generalized-inner-indexscan problem.  I don't have anything in the way of a coherent presentation to make yet, but the thing I'm being forced to realize is that sane modeling of a complex subplan that's on the inside of a nestloop join requires treating *every* scan type as having different costs "the first time" versus "during rescan".  If the total amount of data touched in the query is less than effective_cache_size, it's not unreasonable to suppose that I/O costs during rescan might be zero, even for a seqscan or a non-parameterized indexscan.  In fact, only parameterized indexscans would be able to touch pages they'd not touched the first time, and so they ought to have higher not lower rescan costs in this environment.  But once the total data volume exceeds effective_cache_size, you have to do something different since you shouldn't any longer assume the data is all cached from the first scan.  (This isn't quite as hard as the case you're talking about, since I think the relevant data volume is the sum of the sizes of the tables used in the query; which is easy to estimate at the start of planning, unlike the portion of the tables that actually gets touched.)

Well, we don't want the costing model to have sharp edges.  effective_cache_size can't be taken as much more than an educated guess, and what actually happens will depend a lot on what else is going on on the system.  If only one query is running on a system at a time and it is repeatedly seq-scanning a large table, the cost of reading pages in will be very small until the table grows large enough that you can't fit the whole thing in memory at once, and then will abruptly go through the roof.  But realistically you're not going to know exactly where that edge is going to be, because you can't predict exactly how much concurrent activity there will be, for example, or how much work_mem allocations will push out of the OS buffer cache.

So I'm thinking we should start the costs at something like 0.05/0.05 for tables that are much smaller than effective_cache_size and ramp up to 4/1 for tables that are larger than effective_cache_size.  Maybe just by linearly ramping up, although that has a certain feeling of being without mathematical soundness.

> An idea that isn't even half-baked yet is that once we had a cost model like that, we might be able to produce plans that are well-tuned for a heavily cached environment by applying the "rescan" cost model even to the first scan for a particular query.  So that might lead to some sort of "assume_cached" GUC parameter, and perhaps Kevin could tune his reporting queries by turning that off instead of messing with individual cost constants.

I think the real goal here should be to try to avoid needing a GUC.  A lot of people could benefit if the system could make some attempt to recognize on its own which queries are likely to be cached.  We already have parameters you can hand-tune for each query as necessary.  Being able to set some parameters system-wide and then get sensible behavior automatically would be much nicer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
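Purely as an illustration of the shape of that ramp (the function below is hypothetical, not anything the planner does; parameters are positional because it is a plain SQL function):

-- Hypothetical sketch: interpolate seq_page_cost from 0.05 (relation much
-- smaller than effective_cache_size) up to 1.0 (relation much larger).
CREATE FUNCTION assumed_seq_page_cost(rel_bytes bigint, cache_bytes bigint)
RETURNS numeric LANGUAGE sql IMMUTABLE AS $$
    SELECT CASE
             WHEN $1 <= $2 THEN 0.05
             ELSE least(1.0, 0.05 + (1.0 - 0.05) * ($1 - $2)::numeric / $2)
           END
$$;

-- Example: assumed cost for DbTranRepository given 8GB of assumed cache.
SELECT assumed_seq_page_cost(pg_relation_size('"DbTranRepository"'),
                             8::bigint * 1024 * 1024 * 1024);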
On 11/10/2010 5:43 PM, Kevin Grittner wrote:
The "effective_cache_size" is, in my humble opinion, a wrong method. It would be much easier to have a parameter, let's call it "optimizer_index_caching", which would give the assumption of the percentage of an index that is cached. In other words, if "optimizer_index_caching" was set to 80, the optimizer would assume that 80% of any index is cached and would apply different cost estimate. It's not exact but it's simple and modifiable. It would also be a great tool in the hands of the DBA which has to manage OLTP database or DW database and would be able to create a definitive bias toward one type of the execution plan.
I have to confess that the idea about such parameter is not entirely mine: http://tinyurl.com/33gu4f6
The only half-sane answer I've thought of is to apply a different cost to full-table or full-index scans based on the ratio with effective cache size.
The "effective_cache_size" is, in my humble opinion, a wrong method. It would be much easier to have a parameter, let's call it "optimizer_index_caching", which would give the assumption of the percentage of an index that is cached. In other words, if "optimizer_index_caching" was set to 80, the optimizer would assume that 80% of any index is cached and would apply different cost estimate. It's not exact but it's simple and modifiable. It would also be a great tool in the hands of the DBA which has to manage OLTP database or DW database and would be able to create a definitive bias toward one type of the execution plan.
I have to confess that the idea about such parameter is not entirely mine: http://tinyurl.com/33gu4f6
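Purely to illustrate the intent (no such parameter exists in PostgreSQL; the blend below, and the 0.01 "cached page" cost, are guesses at what the knob would mean), with optimizer_index_caching = 80 the planner would charge something like:

-- Hypothetical: treat 80% of index page fetches as cached (near-zero
-- cost) and the remaining 20% at full random_page_cost.
SELECT 0.80 * 0.01 + 0.20 * 4.0 AS assumed_index_page_cost;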
--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com
2010/11/11 Robert Haas <robertmhaas@gmail.com>
> But thinking over what you've written here, I'm reminded of something
> Peter said years ago, also about the optimizer.  He was discussing the
> ratio of the estimated cost to the actual cost and made an off-hand
> remark that efforts had been made over the years to make that ratio
> more consistent (i.e. improve the quality of the cost estimates) but
> that they'd been abandoned because they didn't necessarily produce
> better plans.  Applying that line of thinking to this problem, maybe
> we should give up on trying to make the estimates truly model reality,
> and focus more on assigning them values which work well in practice.
> For example, in your case, it would be sufficient to estimate the
> amount of data that a given query is going to grovel through and then
> apply some heuristic to choose values for random_page_cost and
> seq_page_cost based on the ratio of that value to, I don't know,
> effective_cache_size.
As for me, the simplest solution would be to allow setting costs on a per-relation basis.  E.g. I know that this relation is in memory most of the time, while another one (an archive) lives on disk.  This could work like a charm along with buffer pools (portions of the shared cache): tables (or indexes) that are required to stay cached can be assigned to a buffer pool large enough to hold all of their data, and archive ones to a small buffer pool.  This can guarantee that after a query on the archive data, the cached tables are still cached.

This solution, however, does not help with tables where only some portion of the table is actively used.  The solution there could be to allow setting costs via partial indexes -- e.g. "for any table access using this index, use these cost values".  This, BTW, would also make table access via the given index more preferable.
Best regards,
Vitalii Tymchyshyn
On Wed, Nov 10, 2010 at 10:47:21PM -0500, Robert Haas wrote:
> On Wed, Nov 10, 2010 at 6:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>>> Robert Haas <robertmhaas@gmail.com> wrote:
>>>> Unfortunately, to know how much data we're going to grovel through, we need to know the plan; and to decide on the right plan, we need to know how much data we're going to grovel through.
>>
>>> And that's where they've been ending.
>>
>>> The only half-sane answer I've thought of is to apply a different cost to full-table or full-index scans based on the ratio with effective cache size.
>
> Kevin, yes, good point.  Bravo!  Let's do that.  Details TBD, but suppose effective_cache_size = 1GB.  What we know for sure is that a 4 GB table is not going to be fully cached but a 4 MB table may well be.  In fact, I think we should assume that the 4 MB table IS cached, because the point is that if it's used at all, it soon will be.  It's almost certainly a bad idea to build a plan around the idea of minimizing reads from that 4 MB table in favor of doing a substantial amount of additional work somewhere else.  I suppose this could break down if you had hundreds and hundreds of 4 MB tables all of which were accessed regularly, but that's an unusual situation, and anyway it's not clear that assuming them all uncached is going to be any better than assuming them all cached.
>
>> This might have some connection to some rather half-baked ideas I've been having in connection with the generalized-inner-indexscan problem.  I don't have anything in the way of a coherent presentation to make yet, but the thing I'm being forced to realize is that sane modeling of a complex subplan that's on the inside of a nestloop join requires treating *every* scan type as having different costs "the first time" versus "during rescan".  If the total amount of data touched in the query is less than effective_cache_size, it's not unreasonable to suppose that I/O costs during rescan might be zero, even for a seqscan or a non-parameterized indexscan.  In fact, only parameterized indexscans would be able to touch pages they'd not touched the first time, and so they ought to have higher not lower rescan costs in this environment.  But once the total data volume exceeds effective_cache_size, you have to do something different since you shouldn't any longer assume the data is all cached from the first scan.  (This isn't quite as hard as the case you're talking about, since I think the relevant data volume is the sum of the sizes of the tables used in the query; which is easy to estimate at the start of planning, unlike the portion of the tables that actually gets touched.)
>
> Well, we don't want the costing model to have sharp edges.  effective_cache_size can't be taken as much more than an educated guess, and what actually happens will depend a lot on what else is going on on the system.  If only one query is running on a system at a time and it is repeatedly seq-scanning a large table, the cost of reading pages in will be very small until the table grows large enough that you can't fit the whole thing in memory at once, and then will abruptly go through the roof.  But realistically you're not going to know exactly where that edge is going to be, because you can't predict exactly how much concurrent activity there will be, for example, or how much work_mem allocations will push out of the OS buffer cache.  So I'm thinking we should start the costs at something like 0.05/0.05 for tables that are much smaller than effective_cache_size and ramp up to 4/1 for tables that are larger than effective_cache_size.  Maybe just by linearly ramping up, although that has a certain feeling of being without mathematical soundness.
>
>> An idea that isn't even half-baked yet is that once we had a cost model like that, we might be able to produce plans that are well-tuned for a heavily cached environment by applying the "rescan" cost model even to the first scan for a particular query.  So that might lead to some sort of "assume_cached" GUC parameter, and perhaps Kevin could tune his reporting queries by turning that off instead of messing with individual cost constants.
>
> I think the real goal here should be to try to avoid needing a GUC.  A lot of people could benefit if the system could make some attempt to recognize on its own which queries are likely to be cached.  We already have parameters you can hand-tune for each query as necessary.  Being able to set some parameters system-wide and then get sensible behavior automatically would be much nicer.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

I agree with the goal of avoiding the need for a GUC.  This needs to be as automatic as possible.  One idea I had was computing a value for the amount of cached data in the system by keeping a sum or a weighted sum of the table usage in the system.  Smaller tables and indexes would contribute a smaller amount to the total, while larger indexes and tables would contribute a larger amount.  Then, by comparing this running total to effective_cache_size, set the random and sequential costs for a query.  This would allow the case of many 4MB tables to favor disk I/O more than memory I/O.  The weighting could be a function of simultaneous users of the table.  I know this is a bit of hand-waving, but some sort of dynamic feedback needs to be provided to the planning process as system use increases.

Regards,
Ken
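A very rough, purely illustrative sketch of that kind of size-weighted usage estimate, built from the existing statistics views (the weighting itself is hand-waved, as noted above):

-- Illustrative only: weight each table's size by its share of scans and
-- compare the resulting "working set" estimate to effective_cache_size.
WITH usage AS (
    SELECT relid,
           pg_total_relation_size(relid)    AS bytes,
           seq_scan + coalesce(idx_scan, 0) AS scans
      FROM pg_stat_user_tables
)
SELECT pg_size_pretty(sum(bytes::numeric * scans
                          / greatest((SELECT sum(scans) FROM usage), 1))::bigint)
       AS weighted_working_set
  FROM usage;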
Kenneth Marshall wrote:
> I agree with the goal of avoiding the need for a GUC.  This needs to be as automatic as possible.  One idea I had was computing a value for the amount of cached data in the system by keeping a sum or a weighted sum of the table usage in the system.  Smaller tables and indexes would contribute a smaller amount to the total, while larger indexes and tables would contribute a larger amount.  Then, by comparing this running total to effective_cache_size, set the random and sequential costs for a query.  This would allow the case of many 4MB tables to favor disk I/O more than memory I/O.  The weighting could be a function of simultaneous users of the table.  I know this is a bit of hand-waving, but some sort of dynamic feedback needs to be provided to the planning process as system use increases.
>
> Regards,
> Ken

Kenneth, you seem to be only concerned with the accuracy of the planning process, not with plan stability.  As a DBA who has to monitor real world applications, I find things like an execution plan changing with the use of the system to be my worst nightmare.  The part where you say that "this needs to be as automatic as possible" probably means that I will not be able to do anything about it if the optimizer, by any chance, doesn't get it right.  That looks to me like an entirely wrong way to go.

When an application developer tunes the SQL, both he and I expect that SQL to always perform that way, not to change the execution plan because the system is utilized more than it was an hour ago.  Nobody seems to have taken my suggestion about having a parameter which would simply "invent" the percentage out of thin air seriously, because it's obviously not accurate.  However, planner accuracy is not the only concern.  Running applications on the system usually requires plan stability.  Means of external control of the execution plan -- DBA knobs and buttons that can be turned and pushed to produce the desired plan -- are also very much desired.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com
On Thu, Nov 11, 2010 at 09:15:58AM -0500, Mladen Gogala wrote:
> Kenneth Marshall wrote:
>> I agree with the goal of avoiding the need for a GUC.  This needs to be as automatic as possible.  One idea I had was computing a value for the amount of cached data in the system by keeping a sum or a weighted sum of the table usage in the system.  Smaller tables and indexes would contribute a smaller amount to the total, while larger indexes and tables would contribute a larger amount.  Then, by comparing this running total to effective_cache_size, set the random and sequential costs for a query.  This would allow the case of many 4MB tables to favor disk I/O more than memory I/O.  The weighting could be a function of simultaneous users of the table.  I know this is a bit of hand-waving, but some sort of dynamic feedback needs to be provided to the planning process as system use increases.
>>
>> Regards,
>> Ken
>
> Kenneth, you seem to be only concerned with the accuracy of the planning process, not with plan stability.  As a DBA who has to monitor real world applications, I find things like an execution plan changing with the use of the system to be my worst nightmare.  The part where you say that "this needs to be as automatic as possible" probably means that I will not be able to do anything about it if the optimizer, by any chance, doesn't get it right.  That looks to me like an entirely wrong way to go.
>
> When an application developer tunes the SQL, both he and I expect that SQL to always perform that way, not to change the execution plan because the system is utilized more than it was an hour ago.  Nobody seems to have taken my suggestion about having a parameter which would simply "invent" the percentage out of thin air seriously, because it's obviously not accurate.  However, planner accuracy is not the only concern.  Running applications on the system usually requires plan stability.  Means of external control of the execution plan -- DBA knobs and buttons that can be turned and pushed to produce the desired plan -- are also very much desired.
>
> --
> Mladen Gogala
> Sr. Oracle DBA
> 1500 Broadway
> New York, NY 10036
> (212) 329-5251
> www.vmsinfo.com

Hi Mladen,

I think in many ways this is the same problem.  Because we are not correctly modeling the system, the plan choices are not accurate either for some scenarios.  This means that when plan costs are compared, the evaluation is not accurate.  This is what causes the terrible plans being right next to the good plans, and it is what impacts the "plan stability".  If the costs are correct, then in fact the plan stability will be much better with the better costing, not worse.  Plans with close costs should actually have close performance.

Regards,
Ken
Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:

> create a definitive bias toward one type of the execution plan.

We're talking about trying to support the exact opposite.  This all started because a database which was tuned for good response time for relatively small queries against a "hot" portion of some tables chose a bad plan for a weekend maintenance run against the full tables.  We're talking about the possibility of adapting the cost factors based on table sizes as compared to available cache, to more accurately model the impact of needing to do actual disk I/O for such queries.

This also is very different from trying to adapt queries to what happens to be currently in cache.  As already discussed on a recent thread, the instability in plans and the failure to get to an effective cache set make that a bad idea.  The idea discussed here would maintain a stable plan for a given query, it would just help choose a good plan based on the likely level of caching.

-Kevin
--- On Thu, 11/11/10, Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:

> From: Mladen Gogala <mladen.gogala@vmsinfo.com>
> Subject: Re: [PERFORM] anti-join chosen even when slower than old plan
> To: "Kenneth Marshall" <ktm@rice.edu>
> Cc: "Robert Haas" <robertmhaas@gmail.com>, "Tom Lane" <tgl@sss.pgh.pa.us>, "Kevin Grittner" <Kevin.Grittner@wicourts.gov>, "pgsql-performance@postgresql.org" <pgsql-performance@postgresql.org>
> Date: Thursday, November 11, 2010, 9:15 AM
>
> Kenneth Marshall wrote:
>> I agree with the goal of avoiding the need for a GUC.  This needs to be as automatic as possible.  One idea I had was computing a value for the amount of cached data in the system by keeping a sum or a weighted sum of the table usage in the system.  Smaller tables and indexes would contribute a smaller amount to the total, while larger indexes and tables would contribute a larger amount.  Then, by comparing this running total to effective_cache_size, set the random and sequential costs for a query.  This would allow the case of many 4MB tables to favor disk I/O more than memory I/O.  The weighting could be a function of simultaneous users of the table.  I know this is a bit of hand-waving, but some sort of dynamic feedback needs to be provided to the planning process as system use increases.
>>
>> Regards,
>> Ken
>
> Kenneth, you seem to be only concerned with the accuracy of the planning process, not with plan stability.  As a DBA who has to monitor real world applications, I find things like an execution plan changing with the use of the system to be my worst nightmare.  The part where you say that "this needs to be as automatic as possible" probably means that I will not be able to do anything about it if the optimizer, by any chance, doesn't get it right.  That looks to me like an entirely wrong way to go.
>
> When an application developer tunes the SQL, both he and I expect that SQL to always perform that way, not to change the execution plan because the system is utilized more than it was an hour ago.  Nobody seems to have taken my suggestion about having a parameter which would simply "invent" the percentage out of thin air seriously, because it's obviously not accurate.  However, planner accuracy is not the only concern.  Running applications on the system usually requires plan stability.  Means of external control of the execution plan -- DBA knobs and buttons that can be turned and pushed to produce the desired plan -- are also very much desired.
>
> --
> Mladen Gogala
> Sr. Oracle DBA
> 1500 Broadway
> New York, NY 10036
> (212) 329-5251
> www.vmsinfo.com

Mladen,

Been there, done that with Oracle for more years than I care to remember or admit.  Having the necessary knobs was both daunting and a godsend, depending on whether you could find the right one(s) to frob during production use, and whether you turned them the right way and amount.  I personally find having less knobbage with PostgreSQL to be a huge benefit over Oracle.  In that spirit, I offer the following suggestion.  (Ken's original suggestion inspired me, so if I misunderstand it, Ken, please correct me.)

What if the code that managed the shared buffer cache kept track of how many buffers were in the cache for each table and index?  Then the optimizer could know the ratio of cached to non-cached table or index buffers (how many pages are in PG's buffer cache vs. the total number of pages required for the entire table, assuming autovacuum is working well) and plan accordingly.  It would even be possible to skew the estimate based on the ratio of shared_buffers to effective_cache_size.  The optimizer could then dynamically adjust the random and sequential costs per query prior to planning, with (hopefully) plans optimized to the current condition of the server and host caches just prior to execution.

There are lots of assumptions here, the primary ones being that the shared buffer cache's state doesn't change significantly between the start of planning and actual execution time, and that the host is dedicated to running the database and nothing else that would trash the host's file system cache.  I admit that I haven't looked at the code for this yet, so I don't know if I'm on to something or off in the weeds.

Regards,

Bob Lunney
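As a data point, much of that per-relation cache information can already be inspected by hand with the pg_buffercache contrib module (9.0 syntax assumed; it only sees PostgreSQL's shared buffers, not the OS file system cache):

-- Requires the pg_buffercache contrib module to be installed.
-- Shows how many shared buffers each relation in the current database
-- occupies, next to its total size in 8kB pages.
SELECT c.relname,
       count(*)                       AS buffers_cached,
       pg_relation_size(c.oid) / 8192 AS total_pages
  FROM pg_buffercache b
  JOIN pg_class c
    ON b.relfilenode = pg_relation_filenode(c.oid)
   AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                              WHERE datname = current_database()))
 GROUP BY c.relname, c.oid
 ORDER BY buffers_cached DESC
 LIMIT 20;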
Kevin Grittner wrote:
> Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:
>> create a definitive bias toward one type of the execution plan.
>
> We're talking about trying to support the exact opposite.

I understand this; that is precisely the reason for my intervention into this discussion of experts, which I am not.

> This all started because a database which was tuned for good response time for relatively small queries against a "hot" portion of some tables chose a bad plan for a weekend maintenance run against the full tables.  We're talking about the possibility of adapting the cost factors based on table sizes as compared to available cache, to more accurately model the impact of needing to do actual disk I/O for such queries.

Kevin, in my experience, the hardest thing to do is to tune so-called mixed type databases.  In practice, databases are usually separated: the OLTP database on one group of servers, the reporting database and the data warehouse on another group of servers.  Postgres 9.0 has made a great stride toward such a possibility with the new replication facilities.  Again, having an optimizer which will choose the plan completely accurately is, at least in my opinion, less important than having the possibility of manual control, the aforementioned "knobs and buttons", and producing the same plan for the same statement.  Trying to make the optimizer smart enough for all types of loads is akin to looking for the Holy Grail.  Inevitably, you will face some hard questions, like the one about the airspeed velocity of an unladen swallow, and the whole search is likely to end in a pretty funny way, not producing the desired "optimizing genie in the CPU".

> This also is very different from trying to adapt queries to what happens to be currently in cache.  As already discussed on a recent thread, the instability in plans and the failure to get to an effective cache set make that a bad idea.  The idea discussed here would maintain a stable plan for a given query, it would just help choose a good plan based on the likely level of caching.

Kevin, I am talking from the perspective of a DBA who is involved with production databases on a day-to-day basis.  I am no expert, but I do believe I speak from the perspective of the users that Postgres has to win in order to make further inroads into corporate server rooms.  Without the possibility of such control and plan stability, it is hard for me to recommend more extensive use of PostgreSQL to my boss.  Whatever solution is chosen, it needs to have "knobs and buttons" and produce plans that will not change when the CPU usage goes up.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions
Mladen Gogala <mladen.gogala@vmsinfo.com> writes:
> Again, having an optimizer which will choose the plan completely accurately is, at least in my opinion, less important than having the possibility of manual control, the aforementioned "knobs and buttons", and producing the same plan for the same statement.

More knobs and buttons is the Oracle way, and the end result of that process is that you have something as hard to use as Oracle.  That's generally not thought of as desirable in this community.

			regards, tom lane
On 11/11/10 9:13 AM, Mladen Gogala wrote:
> Kevin Grittner wrote:
>> Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:
>>> create a definitive bias toward one type of the execution plan.
>>
>> We're talking about trying to support the exact opposite.
>
> I understand this; that is precisely the reason for my intervention into this discussion of experts, which I am not.
>
>> This all started because a database which was tuned for good response time for relatively small queries against a "hot" portion of some tables chose a bad plan for a weekend maintenance run against the full tables.  We're talking about the possibility of adapting the cost factors based on table sizes as compared to available cache, to more accurately model the impact of needing to do actual disk I/O for such queries.
>
> Kevin, in my experience, the hardest thing to do is to tune so-called mixed type databases.  In practice, databases are usually separated: the OLTP database on one group of servers, the reporting database and the data warehouse on another group of servers.  Postgres 9.0 has made a great stride toward such a possibility with the new replication facilities.  Again, having an optimizer which will choose the plan completely accurately is, at least in my opinion, less important than having the possibility of manual control, the aforementioned "knobs and buttons", and producing the same plan for the same statement.  Trying to make the optimizer smart enough for all types of loads is akin to looking for the Holy Grail.  Inevitably, you will face some hard questions, like the one about the airspeed velocity of an unladen swallow, and the whole search is likely to end in a pretty funny way, not producing the desired "optimizing genie in the CPU".

What about rule-based configuration?  You provide a default configuration (what Postgres does now), and then allow one or more alternate configurations that are triggered when certain rules match.  The rules might be things like:

- Per user or group of users.  Set up a user or group for your maintenance task and you automatically get your own config (a sketch of this case follows below).
- A set of tables.  If you do a query that uses tables X, Y, and Z, this configuration applies to you.
- A regular expression applied to the SQL.  If the regexp matches, the configuration applies.
- System resource usage.  If some other process is gobbling memory, switch to a configuration with lower memory requirements.
- A time of day.  Maybe on weekends, different rules apply.

... and so on.  I don't know what the right parameters might be, but surely the original poster's problem would be solved by this solution.  It gives performance experts the tool they need for complex installations, without adding FUD to the lives of everyone else.

Craig

>> This also is very different from trying to adapt queries to what happens to be currently in cache.  As already discussed on a recent thread, the instability in plans and the failure to get to an effective cache set make that a bad idea.  The idea discussed here would maintain a stable plan for a given query, it would just help choose a good plan based on the likely level of caching.
>
> Kevin, I am talking from the perspective of a DBA who is involved with production databases on a day-to-day basis.  I am no expert, but I do believe I speak from the perspective of the users that Postgres has to win in order to make further inroads into corporate server rooms.  Without the possibility of such control and plan stability, it is hard for me to recommend more extensive use of PostgreSQL to my boss.  Whatever solution is chosen, it needs to have "knobs and buttons" and produce plans that will not change when the CPU usage goes up.
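For the per-user/per-group case, existing per-role settings already get most of the way there; a minimal sketch (the role name is made up):

-- Give the weekly maintenance job its own costing by connecting as a
-- dedicated role; these settings are applied at the start of each session.
CREATE ROLE maint LOGIN;
ALTER ROLE maint SET seq_page_cost = 1;
ALTER ROLE maint SET random_page_cost = 2;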
On Thu, Nov 11, 2010 at 10:00 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:
>> create a definitive bias toward one type of the execution plan.
>
> We're talking about trying to support the exact opposite.  This all started because a database which was tuned for good response time for relatively small queries against a "hot" portion of some tables chose a bad plan for a weekend maintenance run against the full tables.  We're talking about the possibility of adapting the cost factors based on table sizes as compared to available cache, to more accurately model the impact of needing to do actual disk I/O for such queries.
>
> This also is very different from trying to adapt queries to what happens to be currently in cache.  As already discussed on a recent thread, the instability in plans and the failure to get to an effective cache set make that a bad idea.  The idea discussed here would maintain a stable plan for a given query, it would just help choose a good plan based on the likely level of caching.

Let's back up a moment and talk about what the overall goal is, here.  Ideally, we would like PostgreSQL to have excellent performance at all times, under all circumstances, with minimal tuning.  Therefore, we do NOT want to add variables that will, by design, need constant manual adjustment.  That is why I suggested that Tom's idea of an assume_cached GUC is probably not what we really want to do.  On the other hand, what I understand Mladen to be suggesting is something completely different.  He's basically saying that, of course, he wants it to work out of the box most of the time, but since there are guaranteed to be cases where it doesn't, how about providing some knobs that aren't intended to be routinely twaddled but which are available in case of emergency?  Bravo, I say!

Consider the case of whether a table is cached.  Currently, we estimate that it isn't, and you can sort of twaddle that assumption globally by setting seq_page_cost and random_page_cost.  In 9.0, you can twaddle it with a bit more granularity by adjusting seq_page_cost and random_page_cost on a per-tablespace basis.  But that's really intended to handle the case where you have one tablespace on an SSD and another that isn't.  It doesn't really model caching at all; we're just abusing it as if it does.  If 90% of a table is cached, you can't simply multiply the cost of reading it by 0.1, because now many of those reads will be random I/O rather than sequential I/O.

The right thing to do is to estimate what percentage of the table will be cached, then estimate how much random and sequential I/O it'll take to get the rest, and then compute the cost.  To do that, we can adopt the approach proposed upthread of comparing the size of the table to effective_cache_size.  We come up with some function f, such that f(effective_cache_size, table_size) = assumed_caching_percentage, and then from there we estimate random I/Os and sequential I/Os, and from there we estimate costs.  This is a good system, almost certainly better than what we have now.  However, it's also guaranteed to not always work.  The DBA may know, for example, that one particular table that is quite large is always fully cached because it is very heavily accessed.  So why not let them pin the assumed_caching_percentage for that table to 100%?  I don't see any reason at all.  Most people will never need to touch that setting, but it's there in case you really, really need it.

We've traditionally been reluctant to do this sort of thing (as the email Tom just sent reflects) but I think we should soften up a bit.  A product gets hard to use when it has knobs that MUST be tuned to make it work at all, and certainly AFAICT Oracle falls into that category.  My rollback segment is full?  My table is out of extents?  Well, allocate some more space then; I certainly wouldn't have imposed an arbitrary cap on the table size if I'd known I was doing it.  However, that's not the same thing as having knobs that are *available* when the shit really hits the fan.  By failing to provide that type of knob, we're not facilitating ease of use; we're just making it difficult for the small percentage of people who have problems to fix them, which is another kind of non-ease-of-use.

In fact, we already have a few knobs of this type.  We have a statistics target which can be overridden on a per-column basis, and beginning in 9.0, you can override the planner's n_distinct estimates in the same way.  Why?  Because we know that it's not achievable to estimate n_distinct accurately in all cases without making ANALYZE unreasonably slow.  I bet that very, VERY few people will ever use that feature, so it costs nothing in terms of "oh, another setting I have to configure".  But for people who are getting bitten by inaccurate n_distinct estimates, it will be very nice to have that as an escape hatch.  I see no harm, and much value, in providing similar escape hatches elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
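For reference, the per-tablespace knob mentioned above looks like this in 9.0 (the tablespace name here is made up):

-- 9.0 syntax: override the page-cost settings for relations stored in
-- one tablespace, e.g. one backed by SSD or known to stay cached.
ALTER TABLESPACE fast_ssd SET (seq_page_cost = 0.1, random_page_cost = 0.1);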
Tom Lane wrote:
> More knobs and buttons is the Oracle way,

True.  Very true.

> and the end result of that process is that you have something as hard to use as Oracle.

Also, you end up with something which is extremely reliable and adjustable to a variety of conditions.

> That's generally not thought of as desirable in this community.

Allow me to play the devil's advocate again.  This community is still much, much smaller than even the MySQL community, much less Oracle's community.  If growth of the community is the goal, copying a page or two from Oracle's book looks like a good idea to me.  The only thing I dislike about Oracle is its price, not its complexity.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions
Robert Haas <robertmhaas@gmail.com> writes: > Let's back up a moment and talk about what the overall goal is, here. > Ideally, we would like PostgreSQL to have excellent performance at all > times, under all circumstances, with minimal tuning. Therefore, we do > NOT want to add variables that will, by design, need constant manual > adjustment. That is why I suggested that Tom's idea of an > assume_cached GUC is probably not what we really want to do. On the > other hand, what I understand Mladen to be suggesting is something > completely different. He's basically saying that, of course, he wants > it to work out of the box most of the time, but since there are > guaranteed to be cases where it doesn't, how about providing some > knobs that aren't intended to be routinely twaddled but which are > available in case of emergency? Bravo, I say! Um ... those are exactly the same thing. You're just making different assumptions about how often you will need to twiddle the setting. Neither assumption is based on any visible evidence, unfortunately. I was thinking of assume_cached as something that could be set-and-forget most of the time, and you're entirely right to criticize it on the grounds that maybe it wouldn't. But to support a proposal that doesn't even exist yet on the grounds that it *would* be set-and-forget seems a tad inconsistent. We can't make that judgment without a whole lot more details than have been provided yet for any idea in this thread. I do think that something based around a settable-per-table caching percentage might be a reasonable way to proceed. But the devil is in the details, and we don't have those yet. regards, tom lane
I wrote: > I do think that something based around a settable-per-table caching > percentage might be a reasonable way to proceed. BTW ... on reflection it seems that this would *not* solve the use-case Kevin described at the start of this thread. What he's got AIUI is some large tables whose recent entries are well-cached, and a lot of queries that tend to hit that well-cached portion, plus a few queries that hit the whole table and so see largely-not-cached behavior. We can't represent that very well with a caching knob at the table level. Either a high or a low setting will be wrong for one set of queries or the other. It might work all right if he were to partition the table and then have a different caching value attached to the currently-latest partition, but that doesn't sound exactly maintenance-free either. Also, that only works with the current statically-planned approach to partitioned tables. I think where we're trying to go with partitioning is that the planner doesn't consider the individual partitions, but the executor just hits the right one at runtime --- so cost modifiers attached to individual partitions aren't going to work in that environment. The most practical solution for his case still seems to be to twiddle some GUC or other locally in the maintenance scripts that do the full-table-scan queries. Unfortunately we don't have an equivalent of per-session SET (much less SET LOCAL) for per-relation attributes. Not sure if we want to go there. regards, tom lane
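The local twiddling Tom describes needs nothing new; session-level settings already cover it. A minimal sketch, with purely illustrative cost values:

    BEGIN;
    -- Assume the full-table maintenance query will mostly miss cache,
    -- so make random access look expensive for this transaction only.
    SET LOCAL random_page_cost = 10;
    SET LOCAL seq_page_cost = 1;
    -- ... run the weekend full-table-scan query here ...
    COMMIT;

A plain SET (without LOCAL) does the same for the rest of the session, which is often what a maintenance script wants.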
On Thu, Nov 11, 2010 at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Let's back up a moment and talk about what the overall goal is, here. >> Ideally, we would like PostgreSQL to have excellent performance at all >> times, under all circumstances, with minimal tuning. Therefore, we do >> NOT want to add variables that will, by design, need constant manual >> adjustment. That is why I suggested that Tom's idea of an >> assume_cached GUC is probably not what we really want to do. On the >> other hand, what I understand Mladen to be suggesting is something >> completely different. He's basically saying that, of course, he wants >> it to work out of the box most of the time, but since there are >> guaranteed to be cases where it doesn't, how about providing some >> knobs that aren't intended to be routinely twaddled but which are >> available in case of emergency? Bravo, I say! > > Um ... those are exactly the same thing. You're just making different > assumptions about how often you will need to twiddle the setting. > Neither assumption is based on any visible evidence, unfortunately. > > I was thinking of assume_cached as something that could be > set-and-forget most of the time, and you're entirely right to criticize > it on the grounds that maybe it wouldn't. But to support a proposal > that doesn't even exist yet on the grounds that it *would* be > set-and-forget seems a tad inconsistent. We can't make that judgment > without a whole lot more details than have been provided yet for any > idea in this thread. Well, maybe I misunderstood what you were proposing. I had the impression that you were proposing something that would *by design* require adjustment for each query, so evidently I missed the point. It seems to me that random_page_cost and seq_page_cost are pretty close to set-and-forget already. We don't have many reports of people needing to tune these values on a per-query basis; most people seem to just guesstimate a cluster-wide value and call it good. Refining the algorithm should only make things better. > I do think that something based around a settable-per-table caching > percentage might be a reasonable way to proceed. But the devil is in > the details, and we don't have those yet. I think one of the larger devils in the details is deciding how to estimate the assumed caching percentage when the user hasn't specified one. Frankly, I suspect that if we simply added a reloption called assumed_caching_percentage and made it default to zero, we would make a bunch of DBAs happy; they'd knock down seq_page_cost and random_page_cost enough to account for the general level of caching and then bump assumed_caching_percentage up for hot tables/indexes (or ones that they want to have become hot). I think we can do better than that, but the right formula isn't exactly obvious. I feel safe saying that if effective_cache_size=1GB and table_size=4MB, then we ought to take the table as fully cached. But it's far from clear what caching percentage we should assume when table_size=400MB, and it seems like the sort of thing that will lead to endless bikeshedding. There's probably no perfect answer, but I feel we can likely come up with something that is better than a constant (which would probably still be better than what we have now). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
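If the reloption Robert sketches existed, using it would presumably look something like the following; assumed_caching_percentage is hypothetical and is not an option in any released PostgreSQL version:

    -- Hypothetical syntax only: this reloption does not exist.
    ALTER TABLE hot_lookup_table SET (assumed_caching_percentage = 100);
    ALTER TABLE cold_archive SET (assumed_caching_percentage = 0);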
On Thu, Nov 11, 2010 at 1:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> I do think that something based around a settable-per-table caching >> percentage might be a reasonable way to proceed. > > BTW ... on reflection it seems that this would *not* solve the use-case > Kevin described at the start of this thread. What he's got AIUI is some > large tables whose recent entries are well-cached, and a lot of queries > that tend to hit that well-cached portion, plus a few queries that hit > the whole table and so see largely-not-cached behavior. We can't > represent that very well with a caching knob at the table level. Either > a high or a low setting will be wrong for one set of queries or the > other. Yeah. For Kevin's case, it seems like we want the caching percentage to vary not so much based on which table we're hitting at the moment but on how much of it we're actually reading. However, the two problems are related enough that I think it might be feasible to come up with one solution that answers both needs, or perhaps two somewhat-intertwined solutions. > The most practical solution for his case still seems to be to twiddle > some GUC or other locally in the maintenance scripts that do the > full-table-scan queries. Unfortunately we don't have an equivalent > of per-session SET (much less SET LOCAL) for per-relation attributes. > Not sure if we want to go there. I doubt it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tom Lane <tgl@sss.pgh.pa.us> wrote: > BTW ... on reflection it seems that this would *not* solve the > use-case Kevin described at the start of this thread. What he's > got AIUI is some large tables whose recent entries are well- > cached, and a lot of queries that tend to hit that well-cached > portion, plus a few queries that hit the whole table and so see > largely-not-cached behavior. We can't represent that very well > with a caching knob at the table level. Either a high or a low > setting will be wrong for one set of queries or the other. Exactly right. > The most practical solution for his case still seems to be to > twiddle some GUC or other locally in the maintenance scripts that > do the full-table-scan queries. Yes, that works fine. The thread spun off in this speculative direction because I started thinking about whether there was any reasonable way for PostgreSQL to automatically handle such things without someone having to notice the problem and do the per-script tuning. I don't know whether any of the ideas thus spawned are worth the effort -- it's not a situation I find myself in all that often. I guess it could be considered an "ease of use" feature. > Unfortunately we don't have an equivalent of per-session SET (much > less SET LOCAL) for per-relation attributes. Not sure if we want > to go there. Besides the "fully-scanned object size relative to relation size costing adjustment" idea, the only one which seemed to be likely to be useful for this sort of issue was the "costing factors by user ID" idea -- the interactive queries hitting the well-cached portion of the tables are run through a read-only user ID, while the weekly maintenance scripts (obviously) are not. With the settings I initially had assigned to the cluster the maintenance scripts would never have seen this issue; it was tuning to resolve end-user complaints of slowness in the interactive queries which set up the conditions for failure, and if I'd had per-user settings, I probably would have (and definitely *should* have) used them. FWIW, I can certainly see the potential of some other ideas which came up on the thread; what might have seemed like antipathy toward them was more of an attempt to point out that they would not have helped at all with the problem which started this thread. -Kevin
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Nov 11, 2010 at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I do think that something based around a settable-per-table caching >> percentage might be a reasonable way to proceed. But the devil is in >> the details, and we don't have those yet. > I think one of the larger devils in the details is deciding how to > estimate the assumed caching percentage when the user hasn't specified > one. I was imagining something very similar to the handling of seq_page_cost, ie, there's a GUC controlling the system-wide default and then you can override that per-table. But the real question is whether per-table is a useful granularity to control it at. See my later message. regards, tom lane
I wrote: > Besides the "fully-scanned object size relative to relation size > costing adjustment" idea, s/relation size/effective cache size/ -Kevin
Robert Haas <robertmhaas@gmail.com> writes: > Yeah. For Kevin's case, it seems like we want the caching percentage > to vary not so much based on which table we're hitting at the moment > but on how much of it we're actually reading. Well, we could certainly take the expected number of pages to read and compare that to effective_cache_size. The thing that's missing in that equation is how much other stuff is competing for cache space. I've tried to avoid having the planner need to know the total size of the database cluster, but it's kind of hard to avoid that if you want to model this honestly. Would it be at all workable to have an estimate that so many megs of a table are in cache (independently of any other table), and then we could scale the cost based on the expected number of pages to read versus that number? The trick here is that DBAs really aren't going to want to set such a per-table number (at least, most of the time) so we need a formula to get to a default estimate for that number based on some simple system-wide parameters. I'm not sure if that's any easier. BTW, it seems that all these variants have an implicit assumption that if you're reading a small part of the table it's probably part of the working set; which is an assumption that could be 100% wrong. I don't see a way around it without trying to characterize the data access at an unworkably fine level, though. regards, tom lane
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Besides the "fully-scanned object size relative to relation size > costing adjustment" idea, the only one which seemed to be likely to > be useful for this sort of issue was the "costing factors by user > ID" idea -- the interactive queries hitting the well-cached portion > of the tables are run through a read-only user ID, while the weekly > maintenance scripts (obviously) are not. With the settings I > initially had assigned to the cluster the maintenance scripts would > never have seen this issue; it was tuning to resolve end-user > complaints of slowness in the interactive queries which set up the > conditions for failure, and if I'd had per-user settings, I probably > would have (and definitely *should* have) used them. Erm ... you can in fact do "ALTER USER SET random_page_cost" today. As long as the settings are GUC parameters we have quite a lot of flexibility about how to control them. This gets back to my earlier point that our current form of per-relation properties (reloptions) is considerably less flexible than a GUC. I think that if we create any strong planner dependence on such properties, we're going to end up needing to be able to set them in all the same ways you can set a GUC. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > I've tried to avoid having the planner need to know the total size > of the database cluster, but it's kind of hard to avoid that if > you want to model this honestly. Agreed. Perhaps the cost could start escalating when the pages to access hit (effective_cache_size * relation_size / database_size), and escalate to the defaults (or some new GUCs) in a linear fashion until you hit effective_cache_size? > BTW, it seems that all these variants have an implicit assumption > that if you're reading a small part of the table it's probably > part of the working set I would say that the assumption should be that seq_page_cost and random_page_cost model the costs for less extreme (and presumably more common) queries, and that we're providing a way of handling the exceptional, extreme queries. -Kevin
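One way to write down the linear escalation Kevin sketches here, for a scan expected to touch P pages of a relation, with E = effective_cache_size and T = E * relation_size / database_size, and with c_cached an assumed low per-page cost for the fully-cached case (not specified in the message) and c_default the usual seq_page_cost or random_page_cost:

    \mathrm{cost\_per\_page}(P) =
    \begin{cases}
      c_{\mathrm{cached}} & P \le T \\
      c_{\mathrm{cached}} + (c_{\mathrm{default}} - c_{\mathrm{cached}}) \cdot \frac{P - T}{E - T} & T < P < E \\
      c_{\mathrm{default}} & P \ge E
    \end{cases}

This is only one reading of the proposal; the low end of the ramp in particular is an assumption.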
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Erm ... you can in fact do "ALTER USER SET random_page_cost" > today. Ouch. I'm embarrassed to have missed that. I'll do that instead of adding those settings to the scripts, then. Thanks for pointing that out. -Kevin
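That per-role override is plain DDL; a sketch with invented role names and purely illustrative cost values:

    -- Interactive users mostly hit the well-cached recent portion of the tables.
    ALTER ROLE readonly_web SET random_page_cost = 1.5;

    -- The maintenance role scans whole tables, so let it assume real disk I/O.
    ALTER ROLE weekly_maint SET random_page_cost = 10;

The values are picked up at the start of each new session for the role, so existing connections keep their old settings.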
On Thursday 11 November 2010 19:58:49 Tom Lane wrote: > I wrote: > > I do think that something based around a settable-per-table caching > > percentage might be a reasonable way to proceed. > > BTW ... on reflection it seems that this would *not* solve the use-case > Kevin described at the start of this thread. What he's got AIUI is some > large tables whose recent entries are well-cached, and a lot of queries > that tend to hit that well-cached portion, plus a few queries that hit > the whole table and so see largely-not-cached behavior. We can't > represent that very well with a caching knob at the table level. Either > a high or a low setting will be wrong for one set of queries or the > other. > > It might work all right if he were to partition the table and then have > a different caching value attached to the currently-latest partition, > but that doesn't sound exactly maintenance-free either. Also, that only > works with the current statically-planned approach to partitioned > tables. I think where we're trying to go with partitioning is that > the planner doesn't consider the individual partitions, but the executor > just hits the right one at runtime --- so cost modifiers attached to > individual partitions aren't going to work in that environment. > > The most practical solution for his case still seems to be to twiddle > some GUC or other locally in the maintenance scripts that do the > full-table-scan queries. Unfortunately we don't have an equivalent > of per-session SET (much less SET LOCAL) for per-relation attributes. > Not sure if we want to go there.

As discussed in another thread some time ago, another possibility is to probe how well the data is cached using mincore() or similar... While it presents problems with cache ramp-up, it is quite cool for other use-cases (like this one). Andre
On Thu, Nov 11, 2010 at 1:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> Besides the "fully-scanned object size relative to relation size >> costing adjustment" idea, the only one which seemed to be likely to >> be useful for this sort of issue was the "costing factors by user >> ID" idea -- the interactive queries hitting the well-cached portion >> of the tables are run through a read-only user ID, while the weekly >> maintenance scripts (obviously) are not. With the settings I >> initially had assigned to the cluster the maintenance scripts would >> never have seen this issue; it was tuning to resolve end-user >> complaints of slowness in the interactive queries which set up the >> conditions for failure, and if I'd had per-user settings, I probably >> would have (and definitely *should* have) used them. > > Erm ... you can in fact do "ALTER USER SET random_page_cost" today. > As long as the settings are GUC parameters we have quite a lot of > flexibility about how to control them. This gets back to my earlier > point that our current form of per-relation properties (reloptions) is > considerably less flexible than a GUC. I think that if we create any > strong planner dependence on such properties, we're going to end up > needing to be able to set them in all the same ways you can set a GUC.

In Kevin's particular case, would this mechanism not help? By that I mean he could have two users: one user for the daily, the tables-ought-to-be-in-hot-cache use case. The other user could make use of the ALTER USER SET ... mechanism to drive the weekly reporting (tables are probably not hot) use case. -- Jon
On Thu, Nov 11, 2010 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Yeah. For Kevin's case, it seems like we want the caching percentage >> to vary not so much based on which table we're hitting at the moment >> but on how much of it we're actually reading. > > Well, we could certainly take the expected number of pages to read and > compare that to effective_cache_size. The thing that's missing in that > equation is how much other stuff is competing for cache space. I've > tried to avoid having the planner need to know the total size of the > database cluster, but it's kind of hard to avoid that if you want to > model this honestly.

I'm not sure I agree with that. I mean, you could easily have a database that is much larger than effective_cache_size, but only that much of it is hot. Or, the hot portion could move around over time. And for reasons of both technical complexity and plan stability, I don't think we want to try to model that. It seems perfectly reasonable to say that reading 25% of effective_cache_size will be more expensive *per-page* than reading 5% of effective_cache_size, independently of what the total cluster size is.

> Would it be at all workable to have an estimate that so many megs of a > table are in cache (independently of any other table), and then we could > scale the cost based on the expected number of pages to read versus that > number? The trick here is that DBAs really aren't going to want to set > such a per-table number (at least, most of the time) so we need a > formula to get to a default estimate for that number based on some simple > system-wide parameters. I'm not sure if that's any easier.

That's an interesting idea. For the sake of argument, suppose we assume that a relation which is less than 5% of effective_cache_size will be fully cached; and for anything larger we'll assume that that much of it (i.e., 5% of effective_cache_size) is cached. Consider a 4GB machine with effective_cache_size set to 3GB. Then we'll assume that any relation smaller than 153MB is 100% cached, a 1 GB table is 15% cached, and a 3 GB table is 5% cached. That doesn't seem quite right, though: the caching percentage drops off very quickly after you exceed the threshold.

*thinks*

I'm wondering if we could do something with a formula like 3 * amount_of_data_to_read / (3 * amount_of_data_to_read + effective_cache_size) = percentage NOT cached. That is, if we're reading an amount of data equal to effective_cache_size, we assume 25% caching, and plot a smooth curve through that point. In the examples above, we would assume that a 150MB read is 87% cached, a 1GB read is 50% cached, and a 3GB read is 25% cached.

> BTW, it seems that all these variants have an implicit assumption that > if you're reading a small part of the table it's probably part of the > working set; which is an assumption that could be 100% wrong. I don't > see a way around it without trying to characterize the data access at > an unworkably fine level, though.

Me neither, but I think it will frequently be true, and I'm not sure it will hurt very much when it isn't. I mean, if you execute the same query repeatedly, that data will become hot soon enough. If you execute a lot of different queries that each touch a small portion of a big, cold table, we might underestimate the costs of the index probes, but so what? There's probably no better strategy for accessing that table anyway.
Perhaps you can construct an example where this underestimate affects the join order in an undesirable fashion, but I'm having a hard time getting worked up about that as a potential problem case. Our current system - where we essentially assume that the caching percentage is uniform across the board - can have the same problem in less artificial cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
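Robert's curve, restated: with r the amount of data to be read and E = effective_cache_size, the assumed cached fraction is

    f(r) = 1 - \frac{3r}{3r + E} = \frac{E}{3r + E}

which reproduces the worked numbers above: for E = 3 GB, f(150 MB) is roughly 0.87, f(1 GB) = 0.50, and f(3 GB) = 0.25.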
---- Original message ---- >Date: Thu, 11 Nov 2010 15:29:40 -0500 >From: pgsql-performance-owner@postgresql.org (on behalf of Robert Haas <robertmhaas@gmail.com>) >Subject: Re: [PERFORM] anti-join chosen even when slower than old plan >To: Tom Lane <tgl@sss.pgh.pa.us> >Cc: Kevin Grittner <Kevin.Grittner@wicourts.gov>,Mladen Gogala <mladen.gogala@vmsinfo.com>,"pgsql-performance@postgresql.org"<pgsql-performance@postgresql.org> > >On Thu, Nov 11, 2010 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Yeah. For Kevin's case, it seems like we want the caching percentage >>> to vary not so much based on which table we're hitting at the moment >>> but on how much of it we're actually reading. >> >> Well, we could certainly take the expected number of pages to read and >> compare that to effective_cache_size. The thing that's missing in that >> equation is how much other stuff is competing for cache space. I've >> tried to avoid having the planner need to know the total size of the >> database cluster, but it's kind of hard to avoid that if you want to >> model this honestly. > >I'm not sure I agree with that. I mean, you could easily have a >database that is much larger than effective_cache_size, but only that >much of it is hot. Or, the hot portion could move around over time. >And for reasons of both technical complexity and plan stability, I >don't think we want to try to model that. It seems perfectly >reasonable to say that reading 25% of effective_cache_size will be >more expensive *per-page* than reading 5% of effective_cache_size, >independently of what the total cluster size is. > >> Would it be at all workable to have an estimate that so many megs of a >> table are in cache (independently of any other table), and then we could >> scale the cost based on the expected number of pages to read versus that >> number? The trick here is that DBAs really aren't going to want to set >> such a per-table number (at least, most of the time) so we need a >> formula to get to a default estimate for that number based on some simple >> system-wide parameters. I'm not sure if that's any easier. > >That's an interesting idea. For the sake of argument, suppose we >assume that a relation which is less than 5% of effective_cache_size >will be fully cached; and anything larger we'll assume that much of it >is cached. Consider a 4GB machine with effective_cache_size set to >3GB. Then we'll assume that any relation less than 153MB table is >100% cached, a 1 GB table is 15% cached, and a 3 GB table is 5% >cached. That doesn't seem quite right, though: the caching percentage >drops off very quickly after you exceed the threshold. > >*thinks* > >I wondering if we could do something with a formula like 3 * >amount_of_data_to_read / (3 * amount_of_data_to_read + >effective_cache_size) = percentage NOT cached. That is, if we're >reading an amount of data equal to effective_cache_size, we assume 25% >caching, and plot a smooth curve through that point. In the examples >above, we would assume that a 150MB read is 87% cached, a 1GB read is >50% cached, and a 3GB read is 25% cached. > >> BTW, it seems that all these variants have an implicit assumption that >> if you're reading a small part of the table it's probably part of the >> working set; which is an assumption that could be 100% wrong. I don't >> see a way around it without trying to characterize the data access at >> an unworkably fine level, though. 
> >Me neither, but I think it will frequently be true, and I'm not sure >it will hurt very much when it isn't. I mean, if you execute the same >query repeatedly, that data will become hot soon enough. If you >execute a lot of different queries that each touch a small portion of >a big, cold table, we might underestimate the costs of the index >probes, but so what? There's probably no better strategy for >accessing that table anyway. Perhaps you can construct an example >where this underestimate affects the join order in an undesirable >fashion, but I'm having a hard time getting worked up about that as a >potential problem case. Our current system - where we essentially >assume that the caching percentage is uniform across the board - can >have the same problem in less artificial cases. > >-- >Robert Haas >EnterpriseDB: http://www.enterprisedb.com >The Enterprise PostgreSQL Company > >-- >Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >To make changes to your subscription: >http://www.postgresql.org/mailpref/pgsql-performance

On a thread some time ago, on a similar subject, I opined that I missed the ability to assign tables to tablespaces and buffers to tablespaces, thus having the ability to isolate needed tables (perhaps a One True Lookup Table, for example, or a Customer table) to memory without fear of eviction. I was soundly beaten about the face and breast. It really is an "Enterprise" way of handling the situation. regards, Robert
On Thu, Nov 11, 2010 at 03:56:25PM -0500, gnuoytr@rcn.com wrote: > On a thread some time ago, on a similar subject, I opined that I missed the ability to assign tables to tablespaces andbuffers to tablespaces, thus having the ability to isolate needed tables (perhaps a One True Lookup Table, for example;or a Customer table) to memory without fear of eviction. > > I was sounding beaten about the face and breast. It really is an "Enterprise" way of handling the situation. > > regards, > Robert > ALTER TABLE can be used to change the tablespace of a table and/or index. Cheers, Ken
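For reference, the commands Ken alludes to look like this (table, index, and tablespace names are invented):

    -- Move a hot lookup table and its primary-key index to a dedicated tablespace.
    ALTER TABLE one_true_lookup SET TABLESPACE fast_space;
    ALTER INDEX one_true_lookup_pkey SET TABLESPACE fast_space;

Note that this only controls where the relation lives on disk; it does not pin anything in memory, which is the part of the original wish that has no knob.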
2010/11/11 Tom Lane <tgl@sss.pgh.pa.us>: > Robert Haas <robertmhaas@gmail.com> writes: >> Yeah. For Kevin's case, it seems like we want the caching percentage >> to vary not so much based on which table we're hitting at the moment >> but on how much of it we're actually reading. > > Well, we could certainly take the expected number of pages to read and > compare that to effective_cache_size. The thing that's missing in that > equation is how much other stuff is competing for cache space. I've > tried to avoid having the planner need to know the total size of the > database cluster, but it's kind of hard to avoid that if you want to > model this honestly. > > Would it be at all workable to have an estimate that so many megs of a > table are in cache

Yes, with Linux ... at least.

> (independently of any other table), and then we could > scale the cost based on the expected number of pages to read versus that > number? The trick here is that DBAs really aren't going to want to set > such a per-table number (at least, most of the time) so we need a > formula to get to a default estimate for that number based on some simple > system-wide parameters. I'm not sure if that's any easier.

My current ideas for a future POC with pgfincore are around what has been said in this thread. I'd like to have some maintenance machinery, like auto-ANALYZE, which reports table and index usage of the OS cache; it might be the % of data in cache and the distribution of data in cache (perhaps only the last 15% of the table is in cache, or perhaps 15% of the blocks with a more regular (random?) distribution). My current stats around the OS cache show that the OS page cache remains stable: the number of blocks in memory per object does not change a lot once the application has run long enough. Those are good stats for automatically adjusting random_page_cost and seq_page_cost per table or index. The DBA provides random_page_cost and seq_page_cost that are accurate for the hardware; perhaps we may want a mem_page_cost (?). Or we just adjust random_page_cost and seq_page_cost based on the average amount of data in cache. Actually I think that updating *_page_cost and keeping the current design of effective_cache_size (in costsize.c) may rock enough.

> > BTW, it seems that all these variants have an implicit assumption that > if you're reading a small part of the table it's probably part of the > working set; which is an assumption that could be 100% wrong. I don't > see a way around it without trying to characterize the data access at > an unworkably fine level, though.

Exactly.

> > regards, tom lane > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
2010/11/11 Robert Haas <robertmhaas@gmail.com>: > On Thu, Nov 11, 2010 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Yeah. For Kevin's case, it seems like we want the caching percentage >>> to vary not so much based on which table we're hitting at the moment >>> but on how much of it we're actually reading. >> >> Well, we could certainly take the expected number of pages to read and >> compare that to effective_cache_size. The thing that's missing in that >> equation is how much other stuff is competing for cache space. I've >> tried to avoid having the planner need to know the total size of the >> database cluster, but it's kind of hard to avoid that if you want to >> model this honestly. > > I'm not sure I agree with that. I mean, you could easily have a > database that is much larger than effective_cache_size, but only that > much of it is hot. Or, the hot portion could move around over time. > And for reasons of both technical complexity and plan stability, I > don't think we want to try to model that. It seems perfectly > reasonable to say that reading 25% of effective_cache_size will be > more expensive *per-page* than reading 5% of effective_cache_size, > independently of what the total cluster size is. > >> Would it be at all workable to have an estimate that so many megs of a >> table are in cache (independently of any other table), and then we could >> scale the cost based on the expected number of pages to read versus that >> number? The trick here is that DBAs really aren't going to want to set >> such a per-table number (at least, most of the time) so we need a >> formula to get to a default estimate for that number based on some simple >> system-wide parameters. I'm not sure if that's any easier. > > That's an interesting idea. For the sake of argument, suppose we > assume that a relation which is less than 5% of effective_cache_size > will be fully cached; and anything larger we'll assume that much of it > is cached. Consider a 4GB machine with effective_cache_size set to > 3GB. Then we'll assume that any relation less than 153MB table is > 100% cached, a 1 GB table is 15% cached, and a 3 GB table is 5% > cached. That doesn't seem quite right, though: the caching percentage > drops off very quickly after you exceed the threshold. > > *thinks* > > I wondering if we could do something with a formula like 3 * > amount_of_data_to_read / (3 * amount_of_data_to_read + > effective_cache_size) = percentage NOT cached. That is, if we're > reading an amount of data equal to effective_cache_size, we assume 25% > caching, and plot a smooth curve through that point. In the examples > above, we would assume that a 150MB read is 87% cached, a 1GB read is > 50% cached, and a 3GB read is 25% cached. But isn't it already the behavior of effective_cache_size usage ? See index_pages_fetched() in costsize.c > >> BTW, it seems that all these variants have an implicit assumption that >> if you're reading a small part of the table it's probably part of the >> working set; which is an assumption that could be 100% wrong. I don't >> see a way around it without trying to characterize the data access at >> an unworkably fine level, though. > > Me neither, but I think it will frequently be true, and I'm not sure > it will hurt very much when it isn't. I mean, if you execute the same > query repeatedly, that data will become hot soon enough. 
If you > execute a lot of different queries that each touch a small portion of > a big, cold table, we might underestimate the costs of the index > probes, but so what? There's probably no better strategy for > accessing that table anyway. Perhaps you can construct an example > where this underestimate affects the join order in an undesirable > fashion, but I'm having a hard time getting worked up about that as a > potential problem case. Our current system - where we essentially > assume that the caching percentage is uniform across the board - can > have the same problem in less artificial cases. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
I'd say there are two Qs here:

1) Modify costs based on information on how much of the table is in cache. It would be great if this can be done, but I'd prefer to have it as admin knobs (because of plan stability). Maybe both admin and automatic ways can be followed, with some parallel (disableable) process modifying the knobs on the admin's behalf. In this case different strategies for automatically modifying the knobs can be applied.

2) Modify costs for retrieving part of a table. Then you need to define "part". The current ways are partitioning and partial indexes. Something similar to a partial index may be created that has only a "where" clause and no data, but has statistics and knobs (and maybe a personal bufferspace, if those are introduced). I don't like gathering data about "the last X percents" or the like, because it works only with clustering, and it's hard for the optimizer to decide if scanning only those percents will be enough for a given query.

Best regards, Vitalii Tymchyshyn
I suppose it was an answer to my mail, but I'm not sure... please keep people CC'ed, it is easier to follow threads (at least for me).

2010/11/12 Vitalii Tymchyshyn <tivv00@gmail.com>: > I'd say there are two Qs here: > > 1) Modify costs based on information on how much of the table is in cache. > It would be great if this can be done, but I'd prefer to have it as admin > knobs (because of plan stability). May be both admin and automatic ways can > be followed with some parallel (disableable) process modify knobs on admin > behalf. In this case different strategies to automatically modify knobs can > be applied.

OS cache is usually stable enough to keep your plans stable too, I think.

> > 2) Modify costs for part of table retrieval. Then you need to define "part". > Current ways are partitioning and partial indexes. Some similar to partial > index thing may be created, that has only "where" clause and no data. But > has statistics and knobs (and may be personal bufferspace if they are > introduced). I don't like to gather data about "last X percents" or like, > because it works only in clustering and it's hard for optimizer to decide if > it will be enough to scan only this percents for given query.

Modifying random_page_cost and seq_page_cost based on statistics about cached blocks can be improved if we know the distribution. It does not mean: we know we have the last 15% in cache, and we are going to request those 15%.

> > Best regards, Vitalii Tymchyshyn > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
12.11.10 12:56, Cédric Villemain wrote: > I supposed it was an answer to my mail but not sure... please keep > CC'ed people, it is easier to follow threads (at least for me)

OK

> 2010/11/12 Vitalii Tymchyshyn<tivv00@gmail.com>: > >> I'd say there are two Qs here: >> >> 1) Modify costs based on information on how much of the table is in cache. >> It would be great if this can be done, but I'd prefer to have it as admin >> knobs (because of plan stability). May be both admin and automatic ways can >> be followed with some parallel (disableable) process modify knobs on admin >> behalf. In this case different strategies to automatically modify knobs can >> be applied. >> > OS cache is usualy stable enough to keep your plans stable too, I think.

Not if it is on the edge. There are always edge cases where data fluctuates near some threshold.

> >> 2) Modify costs for part of table retrieval. Then you need to define "part". >> Current ways are partitioning and partial indexes. Some similar to partial >> index thing may be created, that has only "where" clause and no data. But >> has statistics and knobs (and may be personal bufferspace if they are >> introduced). I don't like to gather data about "last X percents" or like, >> because it works only in clustering and it's hard for optimizer to decide if >> it will be enough to scan only this percents for given query. >> > Modifying random_page_cost and sequential_page_cost thanks to > statistics about cached blocks can be improved if we know the > distribution. > > It does not mean : we know we have last 15% in cache, and we are goign > to request those 15%.

You mean *_cost for the whole table, don't you? That is case (1) for me. Case (2) is when different cost values are selected based on what portion of the table is requested in the query. E.g. when we have data for the whole day in one table, data for the last hour is cached and all the other data is not. The optimizer may then use different *_cost for a query that requires all the data and for a query that requires only the last hour's data. But, as I've said, that is a much more complex task than (1).

Best regards, Vitalii Tymchyshyn
2010/11/12 Vitalii Tymchyshyn <tivv00@gmail.com>: > 12.11.10 12:56, Cédric Villemain wrote: >> >> I supposed it was an answer to my mail but not sure... please keep >> CC'ed people, it is easier to follow threads (at least for me) >> > > OK >> >> 2010/11/12 Vitalii Tymchyshyn<tivv00@gmail.com>: >> >>> >>> I'd say there are two Qs here: >>> >>> 1) Modify costs based on information on how much of the table is in >>> cache. >>> It would be great if this can be done, but I'd prefer to have it as >>> admin >>> knobs (because of plan stability). May be both admin and automatic ways >>> can >>> be followed with some parallel (disableable) process modify knobs on >>> admin >>> behalf. In this case different strategies to automatically modify knobs >>> can >>> be applied. >>> >> >> OS cache is usualy stable enough to keep your plans stable too, I think. >> > > Not if it is on edge. There are always edge cases where data fluctuates near > some threshold.

So far I did some analysis on the topic with pgfincore. Tables and indexes first have peaks and holes if you graph the % of blocks in cache at server start, but after a while it is more stable. Maybe there are applications where Linux fails to find a 'stable' page cache. If people are able to graph the pgfincore results for all or part of the objects of their database, it will give us more robust analysis, especially when corner cases with the planner exist (like here).

>> >> >>> >>> 2) Modify costs for part of table retrieval. Then you need to define >>> "part". >>> Current ways are partitioning and partial indexes. Some similar to >>> partial >>> index thing may be created, that has only "where" clause and no data. But >>> has statistics and knobs (and may be personal bufferspace if they are >>> introduced). I don't like to gather data about "last X percents" or like, >>> because it works only in clustering and it's hard for optimizer to decide >>> if >>> it will be enough to scan only this percents for given query. >>> >> >> Modifying random_page_cost and sequential_page_cost thanks to >> statistics about cached blocks can be improved if we know the >> distribution. >> >> It does not mean : we know we have last 15% in cache, and we are goign >> to request those 15%. >> > > You mean *_cost for the whole table, don't you? That is case (1) for me.

Yes.

> Case (2) is when different cost values are selected based on what portion of > table is requested in the query. E.g. when we have data for the whole day in > one table, data for the last hour is cached and all the other data is not. > Optimizer then may use different *_cost for query that requires all the data > and for query that requires only last hour data. But, as I've said, that is > much more complex task then (1).

I need to think about that some more.

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Fri, Nov 12, 2010 at 4:15 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: >> I wondering if we could do something with a formula like 3 * >> amount_of_data_to_read / (3 * amount_of_data_to_read + >> effective_cache_size) = percentage NOT cached. That is, if we're >> reading an amount of data equal to effective_cache_size, we assume 25% >> caching, and plot a smooth curve through that point. In the examples >> above, we would assume that a 150MB read is 87% cached, a 1GB read is >> 50% cached, and a 3GB read is 25% cached. > > But isn't it already the behavior of effective_cache_size usage ? No. The ideal of trying to know what is actually in cache strikes me as an almost certain non-starter. It can change very quickly, even as a result of the query you're actually running. And getting the information we'd need in order to do it that way would be very expensive, when it can be done at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Nov 12, 2010 at 4:15 AM, Cédric Villemain > <cedric.villemain.debian@gmail.com> wrote: >>> I wondering if we could do something with a formula like 3 * >>> amount_of_data_to_read / (3 * amount_of_data_to_read + >>> effective_cache_size) = percentage NOT cached. That is, if we're >>> reading an amount of data equal to effective_cache_size, we assume 25% >>> caching, and plot a smooth curve through that point. In the examples >>> above, we would assume that a 150MB read is 87% cached, a 1GB read is >>> 50% cached, and a 3GB read is 25% cached. >> But isn't it already the behavior of effective_cache_size usage ? > No.

I think his point is that we already have a proven formula (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air. The problem is to figure out what numbers to apply the M-L formula to. I've been thinking that we ought to try to use it in the context of the query as a whole rather than for individual table scans; the current usage already has some of that flavor but we haven't taken it to the logical conclusion.

> The ideal of trying to know what is actually in cache strikes me as an > almost certain non-starter.

Agreed on that point. Plan stability would go out the window. regards, tom lane
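For readers following the reference, the Mackert-Lohman approximation as quoted in the comments above index_pages_fetched() in costsize.c is, if memory serves (T = pages in the table, N = tuples, s = selectivity, b = effective_cache_size in pages, PF = pages fetched):

    PF =
    \begin{cases}
      \min\left(\frac{2TNs}{2T+Ns},\, T\right) & T \le b \\
      \frac{2TNs}{2T+Ns} & T > b,\ Ns \le \frac{2Tb}{2T-b} \\
      b + \left(Ns - \frac{2Tb}{2T-b}\right)\frac{T-b}{T} & T > b,\ Ns > \frac{2Tb}{2T-b}
    \end{cases}

Worth double-checking against the source before relying on the exact form.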
On Fri, Nov 12, 2010 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think his point is that we already have a proven formula > (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air. > The problem is to figure out what numbers to apply the M-L formula to. I'm not sure that's really measuring the same thing, although I'm not opposed to using it if it produces reasonable answers. > I've been thinking that we ought to try to use it in the context of the > query as a whole rather than for individual table scans; the current > usage already has some of that flavor but we haven't taken it to the > logical conclusion. That's got a pretty severe chicken-and-egg problem though, doesn't it? You're going to need to know how much data you're touching to estimate the costs so you can pick the best plan, but you can't know how much data will ultimately be touched until you've got the whole plan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2010/11/12 Tom Lane <tgl@sss.pgh.pa.us>: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Nov 12, 2010 at 4:15 AM, Cédric Villemain >> <cedric.villemain.debian@gmail.com> wrote: >>>> I wondering if we could do something with a formula like 3 * >>>> amount_of_data_to_read / (3 * amount_of_data_to_read + >>>> effective_cache_size) = percentage NOT cached. That is, if we're >>>> reading an amount of data equal to effective_cache_size, we assume 25% >>>> caching, and plot a smooth curve through that point. In the examples >>>> above, we would assume that a 150MB read is 87% cached, a 1GB read is >>>> 50% cached, and a 3GB read is 25% cached. > >>> But isn't it already the behavior of effective_cache_size usage ? > >> No. > > I think his point is that we already have a proven formula > (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air. > The problem is to figure out what numbers to apply the M-L formula to. > > I've been thinking that we ought to try to use it in the context of the > query as a whole rather than for individual table scans; the current > usage already has some of that flavor but we haven't taken it to the > logical conclusion. > >> The ideal of trying to know what is actually in cache strikes me as an >> almost certain non-starter. > > Agreed on that point. Plan stability would go out the window.

The point is not to know the current cache contents, but, like ANALYZE, to run on a regular basis (probably triggered by something like the number of page reads/hits) a cache_analyze that reports stats the way ANALYZE does, and that may be adjusted per table the way auto-analyze is.

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
Hello,

Just a short thought:

Is it imaginable to compare the prognoses of the plans with the actual results and somehow log the worst cases?

a) to help the DBA locate bad statistics and queries
b) as an additional information source for the planner

This could possibly affect parameters of your formula on the fly.

best regards,

Marc Mamin
On Sat, Nov 13, 2010 at 1:32 AM, Marc Mamin <M.Mamin@intershop.de> wrote: > Hello, > > Just a short though: > > Is it imaginable to compare the prognoses of the plans with the actual > results > and somehow log the worst cases ? > > a) to help the DBA locate bad statistics and queries > b) as additional information source for the planner > > This could possibly affect parameters of your formula on the fly. > > best regards, > > Marc Mamin The contrib module auto_explain might help out here if you wanted to roll your own solution for plan comparison.
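A rough sketch of the auto_explain approach (run in a superuser session; the threshold is a placeholder):

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '5s';  -- log plans for anything slower than this
    SET auto_explain.log_analyze = on;         -- include actual row counts next to the estimates
    -- Run the workload; estimated vs. actual rows then show up in the server log.

Note that log_analyze adds instrumentation overhead to every statement that runs while it is enabled, so it is not free on a busy system.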
On Sat, Nov 13, 2010 at 4:32 AM, Marc Mamin <M.Mamin@intershop.de> wrote: > Hello, > > Just a short though: > > Is it imaginable to compare the prognoses of the plans with the actual > results > and somehow log the worst cases ? > > a) to help the DBA locate bad statistics and queries > b) as additional information source for the planner > > This could possibly affect parameters of your formula on the fly. Yeah, I've thought about this, but it's not exactly clear what would be most useful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tom Lane wrote: > Mladen Gogala <mladen.gogala@vmsinfo.com> writes: > > Again, having an optimizer which will choose the plan completely > > accurately is, at least in my opinion, less important than having a > > possibility of manual control, the aforementioned "knobs and buttons" > > and produce the same plan for the same statement. > > More knobs and buttons is the Oracle way, and the end result of that > process is that you have something as hard to use as Oracle. That's > generally not thought of as desirable in this community.

Late reply, but Mladen, you might want to look at my blog entry explaining why knobs are often not useful because they are only used by a small percentage of users (and confuse the rest): http://momjian.us/main/blogs/pgblog/2009.html#January_10_2009

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Robert Haas wrote: > On Thu, Nov 11, 2010 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> Yeah. For Kevin's case, it seems like we want the caching percentage > >> to vary not so much based on which table we're hitting at the moment > >> but on how much of it we're actually reading. > > > > Well, we could certainly take the expected number of pages to read and > > compare that to effective_cache_size. The thing that's missing in that > > equation is how much other stuff is competing for cache space. I've > > tried to avoid having the planner need to know the total size of the > > database cluster, but it's kind of hard to avoid that if you want to > > model this honestly. > > I'm not sure I agree with that. I mean, you could easily have a > database that is much larger than effective_cache_size, but only that > much of it is hot. Or, the hot portion could move around over time. > And for reasons of both technical complexity and plan stability, I > don't think we want to try to model that. It seems perfectly > reasonable to say that reading 25% of effective_cache_size will be > more expensive *per-page* than reading 5% of effective_cache_size, > independently of what the total cluster size is.

Late reply, but one idea is to have the executor store hit counts for later use by the optimizer. Only the executor knows how many pages it had to request from the kernel for a query. Perhaps getrusage could tell us how often we hit the disk.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
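Note that the executor already keeps per-table buffer counters, visible in pg_statio_user_tables; they only distinguish shared_buffers hits from blocks requested of the kernel, though, so they cannot tell whether a "read" was satisfied by the OS cache or by the disk, which is the gap Bruce is pointing at. For example:

    SELECT relname,
           heap_blks_read,   -- blocks requested from the kernel
           heap_blks_hit,    -- blocks found already in shared_buffers
           round(heap_blks_hit::numeric
                 / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
      FROM pg_statio_user_tables
     ORDER BY heap_blks_read + heap_blks_hit DESC
     LIMIT 10;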
Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Fri, Nov 12, 2010 at 4:15 AM, Cédric Villemain > > <cedric.villemain.debian@gmail.com> wrote: > >>> I wondering if we could do something with a formula like 3 * > >>> amount_of_data_to_read / (3 * amount_of_data_to_read + > >>> effective_cache_size) = percentage NOT cached. That is, if we're > >>> reading an amount of data equal to effective_cache_size, we assume 25% > >>> caching, and plot a smooth curve through that point. In the examples > >>> above, we would assume that a 150MB read is 87% cached, a 1GB read is > >>> 50% cached, and a 3GB read is 25% cached. > > >> But isn't it already the behavior of effective_cache_size usage ? > > > No. > > I think his point is that we already have a proven formula > (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air. > The problem is to figure out what numbers to apply the M-L formula to. > > I've been thinking that we ought to try to use it in the context of the > query as a whole rather than for individual table scans; the current > usage already has some of that flavor but we haven't taken it to the > logical conclusion.

Is there a TODO here?

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
2011/1/19 Bruce Momjian <bruce@momjian.us>: > Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >> > On Fri, Nov 12, 2010 at 4:15 AM, Cédric Villemain >> > <cedric.villemain.debian@gmail.com> wrote: >> >>> I wondering if we could do something with a formula like 3 * >> >>> amount_of_data_to_read / (3 * amount_of_data_to_read + >> >>> effective_cache_size) = percentage NOT cached. That is, if we're >> >>> reading an amount of data equal to effective_cache_size, we assume 25% >> >>> caching, and plot a smooth curve through that point. In the examples >> >>> above, we would assume that a 150MB read is 87% cached, a 1GB read is >> >>> 50% cached, and a 3GB read is 25% cached. >> >> >> But isn't it already the behavior of effective_cache_size usage ? >> >> > No. >> >> I think his point is that we already have a proven formula >> (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air. >> The problem is to figure out what numbers to apply the M-L formula to. >> >> I've been thinking that we ought to try to use it in the context of the >> query as a whole rather than for individual table scans; the current >> usage already has some of that flavor but we haven't taken it to the >> logical conclusion. > > Is there a TODO here? it looks like, yes. > > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > > + It's impossible for everything to be true. + > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Thu, Jan 20, 2011 at 4:17 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: >>> I think his point is that we already have a proven formula >>> (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air. >>> The problem is to figure out what numbers to apply the M-L formula to. >>> >>> I've been thinking that we ought to try to use it in the context of the >>> query as a whole rather than for individual table scans; the current >>> usage already has some of that flavor but we haven't taken it to the >>> logical conclusion. >> >> Is there a TODO here? > > it looks like, yes. "Modify the planner to better estimate caching effects"? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/1/20 Robert Haas <robertmhaas@gmail.com>: > On Thu, Jan 20, 2011 at 4:17 AM, Cédric Villemain > <cedric.villemain.debian@gmail.com> wrote: >>>> I think his point is that we already have a proven formula >>>> (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air. >>>> The problem is to figure out what numbers to apply the M-L formula to. >>>> >>>> I've been thinking that we ought to try to use it in the context of the >>>> query as a whole rather than for individual table scans; the current >>>> usage already has some of that flavor but we haven't taken it to the >>>> logical conclusion. >>> >>> Is there a TODO here? >> >> it looks like, yes. > > "Modify the planner to better estimate caching effects"? or "Estimate caching effect in the query context instead of per object" (the point above) and "Improve the estimate of the caching effects" (more or less M-L review, fine control of cache estimate) ? > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
2011/1/19 Bruce Momjian <bruce@momjian.us>: > Robert Haas wrote: >> On Thu, Nov 11, 2010 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > Robert Haas <robertmhaas@gmail.com> writes: >> >> Yeah. For Kevin's case, it seems like we want the caching percentage >> >> to vary not so much based on which table we're hitting at the moment >> >> but on how much of it we're actually reading. >> > >> > Well, we could certainly take the expected number of pages to read and >> > compare that to effective_cache_size. The thing that's missing in that >> > equation is how much other stuff is competing for cache space. I've >> > tried to avoid having the planner need to know the total size of the >> > database cluster, but it's kind of hard to avoid that if you want to >> > model this honestly. >> >> I'm not sure I agree with that. I mean, you could easily have a >> database that is much larger than effective_cache_size, but only that >> much of it is hot. Or, the hot portion could move around over time. >> And for reasons of both technical complexity and plan stability, I >> don't think we want to try to model that. It seems perfectly >> reasonable to say that reading 25% of effective_cache_size will be >> more expensive *per-page* than reading 5% of effective_cache_size, >> independently of what the total cluster size is. > > Late reply, but one idea is to have the executor store hit counts for > later use by the optimizer. Only the executor knows how many pages it > had to request from the kernel for a query. Perhaps getrusage could > tell us how often we hit the disk.

AFAIK getrusage does not provide access to real I/O counters, only filesystem-level ones. :-(

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
Robert Haas wrote: > On Thu, Jan 20, 2011 at 4:17 AM, Cédric Villemain > <cedric.villemain.debian@gmail.com> wrote: > >>> I think his point is that we already have a proven formula > >>> (Mackert-Lohmann) and shouldn't be inventing a new one out of thin air. > >>> The problem is to figure out what numbers to apply the M-L formula to. > >>> > >>> I've been thinking that we ought to try to use it in the context of the > >>> query as a whole rather than for individual table scans; the current > >>> usage already has some of that flavor but we haven't taken it to the > >>> logical conclusion. > >> > >> Is there a TODO here? > > > > it looks like, yes. > > "Modify the planner to better estimate caching effects"?

Added to TODO.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +