Thread: performance of IN (subquery)
I'm using PG 7.4.3 on Mac OS X.

I am disappointed with the performance of queries like 'select foo from bar where baz in (subquery)', or updates like 'update bar set foo = 2 where baz in (subquery)'. PG always seems to want to do a sequential scan of the bar table. I wish there were a way of telling PG, "use the index on baz in your plan, because I know that the subquery will return very few results". Where it really matters, I have been constructing dynamic queries by looping over the values for baz and building a separate query for each one and combining with a UNION (or just directly updating, in the update case). Depending on the size of the bar table, I can get speedups of hundreds or even more than a thousand times, but it is a big pain to have to do this.

Any tips?

Thanks,
Kevin Murphy

Illustrated:

The query I want to do is very slow:

select bundle_id from build.elements
where elementid in (
    SELECT superlocs_2.element_id
    FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
    WHERE bundle_superlocs_2.protobundle_id = 1);
-----------
      7644
      7644
(2 rows)
Time: 518.242 ms

The subquery is fast:

SELECT superlocs_2.element_id
FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
WHERE bundle_superlocs_2.protobundle_id = 1;
------------
     41209
     25047
(2 rows)
Time: 3.268 ms

And using indexes on the main table is fast:

select bundle_id from build.elements
where elementid in (41209, 25047);
-----------
      7644
      7644
(2 rows)
Time: 2.468 ms

The plan for the slow query:

egenome_test=# explain analyze select bundle_id from build.elements
where elementid in (
    SELECT superlocs_2.element_id
    FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
    WHERE bundle_superlocs_2.protobundle_id = 1);

                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=70.33..72.86 rows=25 width=4) (actual time=583.051..583.059 rows=2 loops=1)
   Hash Cond: ("outer".element_id = "inner".elementid)
   ->  HashAggregate  (cost=47.83..47.83 rows=25 width=4) (actual time=0.656..0.658 rows=2 loops=1)
         ->  Hash Join  (cost=22.51..47.76 rows=25 width=4) (actual time=0.615..0.625 rows=2 loops=1)
               Hash Cond: ("outer".superloc_id = "inner".superloc_id)
               ->  Seq Scan on superlocs_2  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.004..0.012 rows=9 loops=1)
               ->  Hash  (cost=22.50..22.50 rows=5 width=4) (actual time=0.076..0.076 rows=0 loops=1)
                     ->  Seq Scan on bundle_superlocs_2  (cost=0.00..22.50 rows=5 width=4) (actual time=0.024..0.033 rows=2 loops=1)
                           Filter: (protobundle_id = 1)
   ->  Hash  (cost=20.00..20.00 rows=1000 width=8) (actual time=581.802..581.802 rows=0 loops=1)
         ->  Seq Scan on elements  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.172..405.243 rows=185535 loops=1)
 Total runtime: 593.843 ms
(12 rows)
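For concreteness, the dynamic-query workaround described above might look something like the following sketch (it reuses the two element_id values from the example; the original post does not show its actual generated SQL, and the update case would similarly become one UPDATE per value):

    -- one indexed lookup per subquery result, glued together with UNION
    -- (UNION ALL if duplicate result rows matter)
    SELECT bundle_id FROM build.elements WHERE elementid = 41209
    UNION
    SELECT bundle_id FROM build.elements WHERE elementid = 25047;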
On Thu, 26 Aug 2004, Kevin Murphy wrote:

> I'm using PG 7.4.3 on Mac OS X.
>
> I am disappointed with the performance of queries like 'select foo from bar
> where baz in (subquery)', or updates like 'update bar set foo = 2 where baz
> in (subquery)'. PG always seems to want to do a sequential scan of the bar
> table. I wish there were a way of telling PG, "use the index on baz in your
> plan, because I know that the subquery will return very few results".
> [snip]
> The query I want to do is very slow:
>
> select bundle_id from build.elements
> where elementid in (
>     SELECT superlocs_2.element_id
>     FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
>     WHERE bundle_superlocs_2.protobundle_id = 1);
> -----------
>       7644
>       7644
> (2 rows)
> Time: 518.242 ms

What field type is protobundle_id? If you typecast the '1' to be the same, does the index get used?

Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
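For readers following along, the cast being suggested would look something like this (a sketch; it assumes protobundle_id is a smallint, which the original post doesn't say):

    SELECT superlocs_2.element_id
    FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
    WHERE bundle_superlocs_2.protobundle_id = 1::smallint;
    -- equivalently: WHERE bundle_superlocs_2.protobundle_id = CAST(1 AS smallint)

In PostgreSQL releases of that era the planner would not use an index on a smallint or bigint column to match a bare integer literal, so making the literal's type agree with the column's type could be the difference between an index scan and a seq scan.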
Kevin Murphy wrote:

> ------------------------------------------------------------------------
>  Hash Join  (cost=70.33..72.86 rows=25 width=4) (actual time=583.051..583.059 rows=2 loops=1)
>    Hash Cond: ("outer".element_id = "inner".elementid)
>    ->  HashAggregate  (cost=47.83..47.83 rows=25 width=4) (actual time=0.656..0.658 rows=2 loops=1)
>          ->  Hash Join  (cost=22.51..47.76 rows=25 width=4) (actual time=0.615..0.625 rows=2 loops=1)
>                Hash Cond: ("outer".superloc_id = "inner".superloc_id)
>                ->  Seq Scan on superlocs_2  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.004..0.012 rows=9 loops=1)
>                ->  Hash  (cost=22.50..22.50 rows=5 width=4) (actual time=0.076..0.076 rows=0 loops=1)
>                      ->  Seq Scan on bundle_superlocs_2  (cost=0.00..22.50 rows=5 width=4) (actual time=0.024..0.033 rows=2 loops=1)
>                            Filter: (protobundle_id = 1)
>    ->  Hash  (cost=20.00..20.00 rows=1000 width=8) (actual time=581.802..581.802 rows=0 loops=1)
>          ->  Seq Scan on elements  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.172..405.243 rows=185535 loops=1)
>  Total runtime: 593.843 ms
> (12 rows)

The planner thinks that the sequential scan on elements will return 1000 rows, but it actually returned 185000. Did you ANALYZE this table recently?

Afterthought: It would be nice if the database was smart enough to analyze a table of its own accord when a sequential scan returns more than, say, 20 times what it was supposed to.

Paul
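The check Paul is suggesting is a one-liner (using the table from the post):

    ANALYZE build.elements;
    -- then re-run the EXPLAIN ANALYZE above and compare the estimated
    -- row counts against the actual ones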
> Afterthought: It would be nice if the database was smart enough to
> analyze a table of its own accord when a sequential scan returns more
> than, say, 20 times what it was supposed to.

I've wondered on several occasions whether there is any good reason for PG not to automatically perform an analyze concurrently with a seq scan as it's happening. That way, no extra disk IO is needed and the stats could stay up-to-date for almost free.

Any hackers around who can say why this might be a bad idea, or is it one of those things that just needs a volunteer? (I'm not; at least not now.)
"Arthur Ward" <award-postgresql@dominionsciences.com> writes: > Any hackers around who can say why this might be a bad idea, or is it one > of those things that just needs a volunteer? (I'm not; at least not now.) a) that would make plans change spontaneously. I hate being paged in the middle of the night because some query is suddenly being slow when it had been performing fine before. b) Not all sequential scans will actually complete the scan. There could be a limit imposed or a the sequential scan could be inside a EXISTS. In that case the scan could be aborted at any point. What I do think would be easy to do would be to keep statistics on the expense of various components of the cost estimates. cpu_*_cost, random_page_cost effective_cache_size, ought to be values that can be solved for empirically from the timing results. But that still doesn't have to be done on every query. There's a trade-off between work done on every query to plan queries and the benefit. Gathering statistics and storing them on every sequential scan is way too much work slowing down every query for minimal gain. -- greg
Paul Tillotson <pntil@shentel.net> writes:
> The planner thinks that the sequential scan on elements will return 1000
> rows, but it actually returned 185000. Did you ANALYZE this table recently?

Or either of the other ones? All those scan costs look like defaults :-(

> Afterthought: It would be nice if the database was smart enough to
> analyze a table of its own accord when a sequential scan returns more
> than, say, 20 times what it was supposed to.

I've thought about this before. One simple trick would be to get rid of the current pg_class reltuples/relpages fields in favor of a tuples-per-page estimate, which could be multiplied by RelationGetNumberOfBlocks() during planning. In the absence of any ANALYZE data the tuples-per-page estimate might be pretty bogus, but it couldn't be off by more than an order of magnitude or so either way. And in any case we'd have a guaranteed up-to-date number of blocks.

The objections that could be raised to this are (AFAICS) two:

1. Adding at least an lseek() kernel call per table, and per index, to every planning operation. I'm not sure this would be significant, but I'm not sure it wouldn't be, either.

2. Instability of plans. Right now, the planner will not change plans underneath you --- you have to issue an explicit VACUUM or ANALYZE to change the terms of discussion. That would stop being true if physical file size were always taken into account. Maybe this is a problem, or maybe it isn't ... as someone who likes to be able to debug planner behavior without actually creating umpteen-gig test tables, my world view may be a bit skewed ...

It's certainly doable if we decide the pluses outweigh the minuses. Thoughts?

			regards, tom lane
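For reference, the per-table statistics Tom is talking about can be inspected directly in pg_class (these are real catalog columns; the table names are from Kevin's post):

    SELECT relname, relpages, reltuples
    FROM pg_class
    WHERE relname IN ('elements', 'superlocs_2', 'bundle_superlocs_2');

A never-analyzed table shows the default-looking estimates (e.g. reltuples = 1000) that appear in the plan above. Under Tom's sketch, the stored reltuples/relpages pair would be replaced by a tuples-per-page figure, scaled at plan time by the table's current size in blocks.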
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I've thought about this before. One simple trick would be to get rid of the
> current pg_class reltuples/relpages fields in favor of a tuples-per-page
> estimate, which could be multiplied by RelationGetNumberOfBlocks() during
> planning.

This would do something interesting to one of the problem cases I have now. I have trouble testing a particular batch job that generates a large amount of precalculated, denormalized data. That's because the table is empty when the job starts, so the plpgsql function that handles the job caches plans based on an empty table. But as the job proceeds the data grows, and I'm afraid the cached plans may start performing poorly. In order to test it I need to run it once, analyze, then reload the function, truncate the data, and run it another time -- and hope I generated good representative test data. I'm thinking of changing to a non-plpgsql implementation.

But that's only half the issue. I'm not about to run analyze in the middle of the data generation (which wouldn't work anyways since it's in a transaction). So I can't really get good statistics for this job, not until we're actually in a steady state in production.

Sometimes I wonder whether I wouldn't rather have a more predictable system with less focus on statistics. The resulting plans would be more predictable, and predictability is a good thing in production systems...

> In the absence of any ANALYZE data the tuples-per-page estimate might be
> pretty bogus, but it couldn't be off by more than an order of magnitude or
> so either way.

I don't see why it couldn't. If you have a table badly in need of vacuuming (or had one at the time of the last analyze) it could be off by way more than an order of magnitude. For that matter, a table that had undergone many deletes and then been vacuumed would not change length for a long time afterward even as many new inserts are performed. Until the table is analyzed, the estimated table size could be off by an arbitrary factor.

> The objections that could be raised to this are (AFAICS) two:
>
> 1. Adding at least an lseek() kernel call per table, and per index, to
>    every planning operation. I'm not sure this would be significant,
>    but I'm not sure it wouldn't be, either.

That seems like something that could be addressed with enough time. Couldn't a single value per table be stored in shared memory and only updated whenever the heap is extended or truncated? Even with thousands of tables that would only be a few kilobytes of shared memory.

> 2. Instability of plans. Right now, the planner will not change plans
>    underneath you --- you have to issue an explicit VACUUM or ANALYZE
>    to change the terms of discussion. That would stop being true if
>    physical file size were always taken into account. Maybe this is a
>    problem, or maybe it isn't ...

This is what I'm afraid of. As an OLTP application programmer -- web sites, I admit -- I care a lot more about plan stability than about finding optimal plans. A well-written OLTP application will only have a fixed number of queries that are executed repeatedly with different parameters. I don't care how long the queries take as long as they're always "fast enough". Every time a new plan is tried there's a chance that it will be wrong, and it only takes one wrong plan out of hundreds of queries to bring down the entire application.

Ideally I would want a guarantee that every query would *always* result in the same plan. Once I've tested them and approved the plans I want to know that only those approved plans will ever run, and I want to be present and be able to verify new plans before they go into production.

I doubt I'm going to convince anyone today, but I think there will be a gradual change in mindset as the new binary protocol becomes more popular and postgres takes over some of mysql's web mindshare.

--
greg
Tom Lane wrote:
> 2. Instability of plans. Right now, the planner will not change plans
>    underneath you --- you have to issue an explicit VACUUM or ANALYZE
>    to change the terms of discussion. That would stop being true if
>    physical file size were always taken into account. Maybe this is a
>    problem, or maybe it isn't ... as someone who likes to be able to
>    debug planner behavior without actually creating umpteen-gig test
>    tables, my world view may be a bit skewed ...

Did you forget the autovacuum daemon? I haven't seen anyone bitten, or paged during the night, because of the autovacuum daemon's work.

Regards
Gaetano Mendola
Tom Lane wrote:
> I've thought about this before. One simple trick would be to get rid
> of the current pg_class reltuples/relpages fields in favor of a
> tuples-per-page estimate, which could be multiplied by
> RelationGetNumberOfBlocks() during planning. In the absence of any
> ANALYZE data the tuples-per-page estimate might be pretty bogus, but
> it couldn't be off by more than an order of magnitude or so either way.
> And in any case we'd have a guaranteed up-to-date number of blocks.
>
> The objections that could be raised to this are (AFAICS) two:
> [snip]
> 2. Instability of plans. Right now, the planner will not change plans
>    underneath you --- you have to issue an explicit VACUUM or ANALYZE
>    to change the terms of discussion. That would stop being true if
>    physical file size were always taken into account. Maybe this is a
>    problem, or maybe it isn't ... as someone who likes to be able to
>    debug planner behavior without actually creating umpteen-gig test
>    tables, my world view may be a bit skewed ...
>
> It's certainly doable if we decide the pluses outweigh the minuses.
> Thoughts?

My first reaction is to wonder whether this would give performance exactly equal to running a true ANALYZE in every situation. If not, then you would end up with an automated pseudo-ANALYZE (performance-wise). In my opinion, it is almost a feature that non-ANALYZEd tables give such horrendous performance; it kicks you in the butt to do some thinking about when to correctly deal with ANALYZEing.

So, in short, I think it would be a huge win if we could have automatic ANALYZE with true ANALYZE performance, but a huge loss if the automatic ANALYZE performance is not exactly as good as a true ANALYZE.

--
Jon Lapham <lapham@jandr.org>              Rio de Janeiro, Brasil
Personal: http://www.jandr.org/
Thanks, all, for the reminders about analyzing, and I apologize for wasting everyone's time. The main table did indeed need to be analyzed (I had truncated it and repopulated it with "insert ... select" but forgotten to analyze). The other tables are very small temporary tables, and I assumed, whether correctly or not, that analyzing would not be helpful for them.

All this is happening in the context of an algorithm in a PL/pgSQL function. The three temporary tables are reused thousands of times. I wasn't sure whether it would be better to truncate them between uses to keep them small or to just allow them to grow. Unfortunately for the Tree of Knowledge, performance is now more than adequate, so I may not do this experiment.

Thanks,
Kevin Murphy
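For anyone who does want to run that experiment, the per-iteration housekeeping would amount to something like this (a sketch; superlocs_2 and bundle_superlocs_2 are the small work tables named in the earlier post):

    TRUNCATE superlocs_2;
    TRUNCATE bundle_superlocs_2;
    -- repopulate the work tables for this iteration, then:
    ANALYZE superlocs_2;
    ANALYZE bundle_superlocs_2;

Whether that overhead pays for itself over thousands of iterations is exactly the measurement Kevin describes skipping.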
> 2. Instability of plans. Right now, the planner will not change plans
>    underneath you --- you have to issue an explicit VACUUM or ANALYZE
>    to change the terms of discussion. That would stop being true if
>    physical file size were always taken into account. Maybe this is a
>    problem, or maybe it isn't ... as someone who likes to be able to
>    debug planner behavior without actually creating umpteen-gig test
>    tables, my world view may be a bit skewed ...
>
> It's certainly doable if we decide the pluses outweigh the minuses.
> Thoughts?

Even though I just proposed it, now I think that the idea of having postgres automatically gather statistics only when an estimate is grossly wrong is not a good one. This will just obscure the need to run analyze, and many people will then always be using moderately bad statistics without realizing it. (I think I used postgres in a production environment for about six months without even knowing about the ANALYZE command -- with small data sets it's not obvious -- and it was still blazingly fast compared to foxpro.)

If pgautovacuum does ANALYZE and we can get most people to use pgautovacuum, I think this problem will go away.

Another possibility is simply to make postgres generate a warning, saying:

WARNING: statistics for table xxxxxx are off by a factor of y. You should ANALYZE this table.

Paul
Jon Lapham <lapham@jandr.org> writes:
> Tom Lane wrote:
>> I've thought about this before. One simple trick would be to get rid
>> of the current pg_class reltuples/relpages fields in favor of a
>> tuples-per-page estimate, which could be multiplied by
>> RelationGetNumberOfBlocks() during planning.

> My first reaction is to wonder whether this would give performance exactly
> equal to running a true ANALYZE in every situation.

No, this is quite orthogonal to ANALYZE (and several orders of magnitude cheaper...). It would fix a lot of the simpler cases, though, since the first-order effects like table size would be right.

			regards, tom lane
Greg Stark <gsstark@mit.edu> writes:
> I'm not about to run analyze in the middle of the data generation
> (which wouldn't work anyways since it's in a transaction).

Since 7.3 or 7.4, you *can* run ANALYZE in the middle of a transaction. The cached-plan business is a problem, I agree, but I think it's orthogonal to this particular discussion (and you can always use EXECUTE if you have to).

>> In the absence of any ANALYZE data the tuples-per-page estimate might be
>> pretty bogus, but it couldn't be off by more than an order of magnitude or
>> so either way.

> I don't see why it couldn't. If you have a table badly in need of vacuuming
> (or had one at the time of the last analyze) it could be off by way more than
> an order of magnitude.

Well, I was actually thinking of the physical tuples-per-page stat (perhaps better expressed as an average tuple size), but you are right that the fraction of dead tuples is also something to think about. We don't model that explicitly ATM but maybe we should. The original VACUUM-based stats code couldn't really do much with it, since VACUUM would leave no dead tuples behind in the first place; but separate ANALYZE could definitely make an estimate of the fraction of dead tuples.

> Ideally I would want a guarantee that every query would *always*
> result in the same plan. Once I've tested them and approved the plans
> I want to know that only those approved plans will ever run, and I
> want to be present and be able to verify new plans before they go into
> production.
>
> I doubt I'm going to convince anyone today,

Nope, you aren't. The above seems to me to be a recipe for degradation of performance over time, precisely because the plans wouldn't change in the face of changes in the situation. I've resisted adding "planner hints" to the language for this reason, and I'm certainly not eager to offer any hard guarantees about plans not changing.

			regards, tom lane
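A minimal illustration of the ANALYZE-in-a-transaction point (a sketch; the table name is taken from Kevin's original post):

    BEGIN;
    -- ... bulk-load work on build.elements ...
    ANALYZE build.elements;    -- allowed inside a transaction block
    -- ... later statements in this transaction are planned against
    -- statistics that reflect the loaded data ...
    COMMIT;

VACUUM, by contrast, refuses to run inside a transaction block.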
Is there anything for the TODO here?

---------------------------------------------------------------------------

Tom Lane wrote:
> Well, I was actually thinking of the physical tuples-per-page stat
> (perhaps better expressed as an average tuple size), but you are right
> that the fraction of dead tuples is also something to think about.
> We don't model that explicitly ATM but maybe we should.
> [snip]

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> I'm not about to run analyze in the middle of the data generation
>> (which wouldn't work anyways since it's in a transaction).
>
> Since 7.3 or 7.4, you *can* run ANALYZE in the middle of a transaction.
> The cached-plan business is a problem, I agree, but I think it's
> orthogonal to this particular discussion (and you can always use EXECUTE
> if you have to).

How does EXECUTE solve the cached-plan business?
On Fri, 27 Aug 2004 11:09:26 -0400, Joseph Shraibman <jks@selectacast.net> wrote:
> How does EXECUTE solve the cached-plan business?

It re-plans the query at every run.

--
Markus Bertheau <mbertheau@gmail.com>
Have you thought about using existence checking:

    WHERE EXISTS (SELECT '1' FROM FOO2 WHERE BAZ = BAZ2)
If the index exists on BAZ2 you might get away with a quick index only check.
Duane
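Applied to Kevin's query, that rewrite would look something like this (an untested sketch using the tables from the original post):

    SELECT e.bundle_id
    FROM build.elements e
    WHERE EXISTS (
        SELECT 1
        FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
        WHERE bundle_superlocs_2.protobundle_id = 1
          AND superlocs_2.element_id = e.elementid);

Note that with this form the outer table is still scanned and the EXISTS probe runs once per row, so it relies on the subquery side being cheap (Duane's point about an index on the inner column); as the rest of the thread shows, the missing ANALYZE turned out to be the real fix for Kevin's case.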
According to the docs it specifically doesn't:
http://www.postgresql.org/docs/7.4/static/sql-prepare.html

    When the PREPARE statement is executed, the specified statement is
    parsed, rewritten, and planned. When an EXECUTE command is
    subsequently issued, the prepared statement need only be executed.
    Thus, the parsing, rewriting, and planning stages are only performed
    once, instead of every time the statement is executed.

Markus Bertheau wrote:
> On Fri, 27 Aug 2004 11:09:26 -0400, Joseph Shraibman
> <jks@selectacast.net> wrote:
>
>> How does EXECUTE solve the cached-plan business?
>
> It re-plans the query at every run.
Joseph Shraibman <jks@selectacast.net> writes:
> Markus Bertheau wrote:
>> It re-plans the query at every run.

> According to the docs it specifically doesn't.

You're confusing the SQL EXECUTE statement with the quite different plpgsql EXECUTE statement ...

			regards, tom lane
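To make the distinction concrete, here is a sketch of the two different EXECUTEs (the names are invented; the plpgsql body uses the 7.x string-quoted style):

    -- SQL-level PREPARE/EXECUTE: the statement is planned once, at PREPARE time
    PREPARE get_bundles (integer) AS
        SELECT bundle_id FROM build.elements WHERE elementid = $1;
    EXECUTE get_bundles(41209);    -- reuses the stored plan

    -- plpgsql EXECUTE: the query string is planned afresh on every execution
    CREATE OR REPLACE FUNCTION get_bundles_fn(integer) RETURNS SETOF integer AS '
    DECLARE
        rec RECORD;
    BEGIN
        FOR rec IN EXECUTE ''SELECT bundle_id FROM build.elements
                             WHERE elementid = '' || $1 LOOP
            RETURN NEXT rec.bundle_id;
        END LOOP;
        RETURN;
    END;
    ' LANGUAGE plpgsql;

Only the second, plpgsql form is what Markus's "re-plans the query at every run" applies to; the SQL-level form behaves exactly as the PREPARE documentation Joseph quoted says.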
Paul Tillotson wrote:
> If pgautovacuum does ANALYZE and we can get most people to use
> pgautovacuum, I think this problem will go away.

It does. We won't get most people to use it for 8.0 since it is still a contrib project, but hopefully it will be built-in for 8.1.
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Greg Stark <gsstark@mit.edu> writes:
>> I'm not about to run analyze in the middle of the data generation
>> (which wouldn't work anyways since it's in a transaction).
>
> Since 7.3 or 7.4, you *can* run ANALYZE in the middle of a transaction.
> The cached-plan business is a problem, I agree, but I think it's
> orthogonal to this particular discussion (and you can always use EXECUTE
> if you have to).

It's orthogonal. My point was that I have a bigger problem, but even if I address it by switching away from plpgsql, or I guess by using EXECUTE, I would still have a problem. I didn't realize you could run analyze in a transaction, but even so I wouldn't really want to have to do that repeatedly during the job.

This new approach would actually complete the fix; a perl or plpgsql EXECUTE implementation would gradually shift statistics during the job. Except that the first thing the job does is delete all the old records. This is inside a transaction. So an estimate based on the heap size would be off by a factor of two by the time the job is done.

> but separate ANALYZE could definitely make an estimate of the fraction of
> dead tuples.

With analyze in a transaction I'm not clear what the semantics should be, though. I suppose it should only count tuples visible to the transaction running the analyze?

> Nope, you aren't. The above seems to me to be a recipe for degradation
> of performance over time, precisely because the plans wouldn't change in
> the face of changes in the situation.

A gradual degradation is OK. A gradual degradation means I can schedule a nightly analyze, report on any changed plans, and either automatically accept them or manually approve them individually.

A sudden degradation is much more dangerous. Even if it's rare, a sudden degradation means an outage in prime time. As I said, it doesn't matter to me if every query is 10% slower than possible, as long as no query takes 1000% as long as necessary, even if that's a 1-in-1000 occurrence.

> I've resisted adding "planner hints" to the language for this reason, and
> I'm certainly not eager to offer any hard guarantees about plans not
> changing.

I just want to control _when_ they change. Eventually you'll come around. I think it'll be a slow, gradual change in thinking as the user base changes, though -- not something I'll change with a single argument in one day.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> It's orthogonal. My point was that I have a bigger problem, but even if I
> address it by switching away from plpgsql, or I guess by using EXECUTE, I
> would still have a problem. I didn't realize you could run analyze in a
> transaction, but even so I wouldn't really want to have to do that
> repeatedly during the job.

Why not? Given the sampling behavior that's been in there for a release or two, ANALYZE is pretty cheap on large tables; certainly much cheaper than any processing you might be doing that's going to grovel over the whole table.

> Except that the first thing the job does is delete all the old records. This
> is inside a transaction. So an estimate based on the heap size would be off by
> a factor of two by the time the job is done.

Could you use TRUNCATE? I dunno if locking the table is okay for you. It is transaction-safe, though.

> With analyze in a transaction I'm not clear what the semantics should be,
> though. I suppose it should only count tuples visible to the transaction
> running the analyze?

It currently uses SnapshotNow, so it would see committed tuples of other transactions plus uncommitted ones of the present transaction. This is not exactly the same thing as the transaction's snapshot, but close.

> A sudden degradation is much more dangerous. Even if it's rare, a sudden
> degradation means an outage in prime time.

[ shrug ]  You can get a sudden degradation with fixed plans, too. All it takes is an addition of a lot of rows in some table that had been small.

			regards, tom lane
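A quick illustration of the transaction-safety point (a sketch; the table name is hypothetical):

    BEGIN;
    TRUNCATE precalc_data;      -- hypothetical work table
    -- regenerate the rows here
    ROLLBACK;                   -- as of 7.4 the TRUNCATE is rolled back too,
                                -- so the old contents are still there

The price, as Tom notes, is that TRUNCATE holds an exclusive lock on the table for the rest of the transaction.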
Greg Stark wrote:
> Ideally I would want a guarantee that every query would *always* result in the
> same plan. Once I've tested them and approved the plans I want to know that
> only those approved plans will ever run, and I want to be present and be able
> to verify new plans before they go into production.

What you are saying is "never run an ANALYZE" -- or if you do, you have to re-test all your plans. "*Always* the same plan" is nonsense, because the plan depends on the data distribution; do you test your plans for each given histogram slice?

> I doubt I'm going to convince anyone today...

For sure not me.

Regards
Gaetano Mendola
Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Except that the first thing the job does is delete all the old records. This
>> is inside a transaction. So an estimate based on the heap size would be off by
>> a factor of two by the time the job is done.
>
> Could you use TRUNCATE? I dunno if locking the table is okay for you.
> It is transaction-safe, though.

Well, if necessary I could, but if I can do it without downtime all the better. In any case I think I'll be OK with a factor-of-two misestimation. I was just giving an example use case for you to chew on when analyzing this new proposal.

I'm not sure where I stand with the idea. I like the idea that table sizes would always be fairly reasonable even without statistics. But I also have a really strong desire for plan stability.

> [ shrug ] You can get a sudden degradation with fixed plans, too. All
> it takes is an addition of a lot of rows in some table that had been
> small.

Well, presumably I should be aware if my data distribution is changing drastically. That's under my control, and at least the performance change will be proportionate to the distribution change. With plans changing on the fly, I could have a query that degrades 1% for every row added and then suddenly becomes 10x slower when I add a 17th extra row.

Of course such a system isn't perfectly tuned, or the optimizer issue should be found and fixed. But I would rather find out about it without having my application fail.

--
greg