Thread: Single table forcing sequential scans on query plans
I have a weird query execution plan problem I am trying to debug on PostgreSQL 8.2.6. I have a query that joins against a temporary table that has very few rows. If the temporary table is analyzed before the query runs, all is well. If it is not, the execution plan chosen by Postgres consists entirely of sequential scans, and the whole query can take many hours to execute.

In the application exhibiting this issue there is an "analyze tmpInstanceId" coded right before executing this query. But sometimes PostgreSQL chooses to execute the query as if the analyze had not been run, or its results had been discarded; I don't know why every once in a while this query goes off on the all-sequential-scan plan. (I have put some debugging code in the application: when the query takes longer to execute than expected, I print the explain right after the query finishes, and the plan it comes up with is the second one I'm including. When it's fast, it is the first one.)

My questions are:
- what would make the analyze operation "fail" in the eyes of the planner?
- why does joining to a single unanalyzed table disable any and all indexes on the other tables referenced in the query?
create temporary table tmpInstanceId(
    idx serial primary key,
    instanceId integer
);
create index tmpInstanceIdIdx on tmpInstanceId(instanceId);
insert into tmpInstanceId(instanceId) values (492121), (492125);
--analyze tmpInstanceId;
explain --analyze
select tmpInstanceId.idx, Items.item, Versions.version,
       Flavors.flavor, tt.flags, Nodes.timeStamps
from tmpInstanceId
     join TroveTroves as tt using(instanceId)
     join Instances on tt.includedId = Instances.instanceId
     join Items on Instances.itemId = Items.itemId
     join Versions on Instances.versionId = Versions.versionId
     join Flavors on Instances.flavorId = Flavors.flavorId
     join Nodes on Instances.itemId = Nodes.itemId
          and Instances.versionId = Nodes.versionId;

explain analyze when analyzing the temporary table first:

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=3258.50..28509.89 rows=217 width=333) (actual time=359.328..671.078 rows=10571 loops=1)
   ->  Nested Loop  (cost=3258.50..27812.10 rows=217 width=322) (actual time=331.048..557.858 rows=10571 loops=1)
         ->  Hash Join  (cost=3258.50..27112.73 rows=217 width=276) (actual time=330.947..503.239 rows=10571 loops=1)
               Hash Cond: (instances.flavorid = flavors.flavorid)
               ->  Nested Loop  (cost=3222.89..27074.13 rows=217 width=45) (actual time=329.451..489.734 rows=10571 loops=1)
                     ->  Hash Join  (cost=3222.89..14374.81 rows=3916 width=20) (actual time=329.295..412.439 rows=10571 loops=1)
                           Hash Cond: (instances.instanceid = tt.includedid)
                           ->  Seq Scan on instances  (cost=0.00..9317.19 rows=478819 width=16) (actual time=0.037..185.202 rows=478819 loops=1)
                           ->  Hash  (cost=3173.94..3173.94 rows=3916 width=12) (actual time=16.738..16.738 rows=10571 loops=1)
                                 ->  Nested Loop  (cost=0.00..3173.94 rows=3916 width=12) (actual time=0.159..11.248 rows=10571 loops=1)
                                       ->  Seq Scan on tmpinstanceid  (cost=0.00..1.02 rows=2 width=8) (actual time=0.005..0.007 rows=2 loops=1)
                                       ->  Index Scan using trovetrovesinstanceincluded_uq on trovetroves tt  (cost=0.00..1561.99 rows=1958 width=12) (actual time=0.102..3.400 rows=5286 loops=2)
                                             Index Cond: (tmpinstanceid.instanceid = tt.instanceid)
                     ->  Index Scan using nodesversionid_fk on nodes  (cost=0.00..3.23 rows=1 width=25) (actual time=0.005..0.006 rows=1 loops=10571)
                           Index Cond: ((instances.versionid = nodes.versionid) AND (instances.itemid = nodes.itemid))
               ->  Hash  (cost=28.05..28.05 rows=605 width=239) (actual time=1.406..1.406 rows=605 loops=1)
                     ->  Seq Scan on flavors  (cost=0.00..28.05 rows=605 width=239) (actual time=0.062..0.744 rows=605 loops=1)
         ->  Index Scan using versions_pkey on versions  (cost=0.00..3.21 rows=1 width=58) (actual time=0.003..0.004 rows=1 loops=10571)
               Index Cond: (instances.versionid = versions.versionid)
   ->  Index Scan using items_pkey on items  (cost=0.00..3.20 rows=1 width=23) (actual time=0.009..0.010 rows=1 loops=10571)
         Index Cond: (instances.itemid = items.itemid)
 Total runtime: 673.460 ms

explain without analyzing the temporary table first:

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=2380399.89..2448745.27 rows=210188 width=333)
   Hash Cond: (instances.itemid = items.itemid)
   ->  Hash Join  (cost=2379801.92..2443155.34 rows=210188 width=322)
         Hash Cond: (instances.versionid = versions.versionid)
         ->  Hash Join  (cost=2378320.77..2437733.16 rows=210188 width=276)
               Hash Cond: (instances.flavorid = flavors.flavorid)
               ->  Hash Join  (cost=2378285.15..2434807.46 rows=210188 width=45)
                     Hash Cond: (tmpinstanceid.instanceid = tt.instanceid)
                     ->  Seq Scan on tmpinstanceid  (cost=0.00..29.40 rows=1940 width=8)
                     ->  Hash  (cost=2260473.14..2260473.14 rows=5291201 width=45)
                           ->  Hash Join  (cost=95937.67..2260473.14 rows=5291201 width=45)
                                 Hash Cond: (tt.includedid = instances.instanceid)
                                 ->  Seq Scan on trovetroves tt  (cost=0.00..1753045.32 rows=95620632 width=12)
                                 ->  Hash  (cost=95606.47..95606.47 rows=26496 width=41)
                                       ->  Merge Join  (cost=89458.93..95606.47 rows=26496 width=41)
                                             Merge Cond: ((instances.versionid = nodes.versionid) AND (instances.itemid = nodes.itemid))
                                             ->  Sort  (cost=54491.66..55688.71 rows=478819 width=16)
                                                   Sort Key: instances.versionid, instances.itemid
                                                   ->  Seq Scan on instances  (cost=0.00..9317.19 rows=478819 width=16)
                                             ->  Sort  (cost=34967.27..35731.14 rows=305546 width=25)
                                                   Sort Key: nodes.versionid, nodes.itemid
                                                   ->  Seq Scan on nodes  (cost=0.00..7130.46 rows=305546 width=25)
               ->  Hash  (cost=28.05..28.05 rows=605 width=239)
                     ->  Seq Scan on flavors  (cost=0.00..28.05 rows=605 width=239)
         ->  Hash  (cost=946.07..946.07 rows=42807 width=58)
               ->  Seq Scan on versions  (cost=0.00..946.07 rows=42807 width=58)
   ->  Hash  (cost=354.65..354.65 rows=19465 width=23)
         ->  Seq Scan on items  (cost=0.00..354.65 rows=19465 width=23)
(28 rows)

Thanks,
Cristian
--
Cristian Gafton
rPath, Inc.
Cristian Gafton <gafton@rpath.com> writes:
> I have a weird query execution plan problem I am trying to debug on
> Postgresql 8.2.6. I have a query that joins against a temporary table
> that has very few rows.

Is it possible that the temp table ever has exactly zero rows?

> My questions are:
> - what would make the analyze operation "fail" in the eyes of the planner?
> - why does joining to a single unanalyzed table disable any and all
>   indexes on the other tables referenced in the query?

That's entirely the wrong way to think about it. The planner is choosing a good plan based on its estimates of table sizes, which are wildly different in the two cases:

> ->  Seq Scan on tmpinstanceid  (cost=0.00..1.02 rows=2 width=8) (actual time=0.005..0.007 rows=2 loops=1)

> ->  Seq Scan on tmpinstanceid  (cost=0.00..29.40 rows=1940 width=8)

If there actually were nearly 2000 rows in the temp table, that nested-loops plan would take about a thousand times longer than it does, and you'd not be nearly so pleased with it. The merge-and-hash-joins plan looks quite sane to me for that table size.

The larger estimate is coming from some heuristics that are applied when the table size recorded in pg_class.relpages & reltuples is exactly zero. It's intentionally not small, to keep us from choosing a plan with brittle performance behavior when we are looking at a table that's never been vacuumed or analyzed.

The only idea I have for how the planner could "ignore" a previous analyze result is if the analyze found the table to be of zero size. Then the heuristic would still be applied because relpages == 0.

			regards, tom lane
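The heuristic Tom describes can be sketched roughly as follows. This is a loose Python approximation, not the backend's actual C code; the block-size and header constants are PostgreSQL defaults, and the 42-byte per-tuple width is an assumed figure (row data plus per-tuple overhead) chosen to reproduce the 1940-row estimate seen in the second plan above:

```python
# Rough sketch of the planner's size guess for a table whose
# pg_class.relpages/reltuples are exactly zero (never vacuumed/analyzed):
# assume an arbitrary 10 pages and derive tuple density from the row width.

BLCKSZ = 8192          # default PostgreSQL block size
PAGE_HEADER = 24       # approximate per-page header overhead
DEFAULT_CURPAGES = 10  # heuristic page count used when relpages == 0

def estimate_rows(relpages, reltuples, tuple_width):
    """Sketch of the planner's row estimate for a table."""
    if relpages > 0:
        # Stats exist: trust them (heavily simplified here).
        return int(reltuples)
    # Never vacuumed/analyzed: assume 10 pages, density from tuple width.
    density = (BLCKSZ - PAGE_HEADER) // tuple_width
    return DEFAULT_CURPAGES * density

# Analyzed two-row temp table: small, sane estimate.
print(estimate_rows(1, 2, 42))   # -> 2
# Unanalyzed (relpages == 0): the heuristic kicks in.
print(estimate_rows(0, 0, 42))   # -> 1940, matching "rows=1940" above
```

This is why the estimate is deliberately not tiny: a plan that is merely mediocre for 1940 rows is much safer than a nested-loop plan that melts down if the unanalyzed table turns out to be large.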
On Sun, 16 Mar 2008, Tom Lane wrote:
> > I have a weird query execution plan problem I am trying to debug on
> > Postgresql 8.2.6. I have a query that joins against a temporary table
> > that has very few rows.
>
> Is it possible that the temp table ever has exactly zero rows?

Ah, that is indeed a possibility. If I am to understand correctly, there is no way to represent the difference between an un-analyzed table and a zero-sized analyzed table as far as the query planner is concerned?

Looks like I'll have to do a "select count(*)" before running the query to avoid entering this trap. (That feels a bit suboptimal, since the conary repository code does extensive work with/through temporary tables, and this could very well end up not being the only section affected...)

> That's entirely the wrong way to think about it. The planner is
> choosing a good plan based on its estimates of table sizes, which
> are wildly different in the two cases:
>
> > ->  Seq Scan on tmpinstanceid  (cost=0.00..1.02 rows=2 width=8) (actual time=0.005..0.007 rows=2 loops=1)
>
> > ->  Seq Scan on tmpinstanceid  (cost=0.00..29.40 rows=1940 width=8)

In this particular case it would be nice if there were a differentiation between "estimated size 0" and "estimated size unknown".

> The only idea I have for how the planner could "ignore" a previous
> analyze result is if the analyze found the table to be of zero size.
> Then the heuristic would still be applied because relpages == 0.

For now I will try to run with the assumption that the massive sequential scans are caused by joining against an empty table in the query and try to work my way around it - unless there is some trick to tell the planner that this is a query that would be much better optimized away instead of causing a massive IO storm.

Thanks,
Cristian
--
Cristian Gafton
rPath, Inc.
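The workaround Cristian describes could look something like this in application code. This is a hypothetical sketch: the cursor interface, `BIG_JOIN_SQL`, and the helper name are stand-ins for whatever DB access layer the conary repository actually uses; the point is simply to short-circuit before the planner ever sees an empty temp table:

```python
# Hypothetical guard around the expensive join: if the input set is empty,
# the join result must be empty too, so skip the query (and the planner's
# empty-table trap) entirely.

BIG_JOIN_SQL = "select ... from tmpInstanceId join TroveTroves ..."  # elided

def fetch_troves(cur, instance_ids):
    """Populate the temp table and run the join, guarding the empty case."""
    if not instance_ids:
        return []  # nothing can match; don't even hit the database
    cur.execute("truncate tmpInstanceId")
    cur.executemany("insert into tmpInstanceId(instanceId) values (%s)",
                    [(i,) for i in instance_ids])
    cur.execute("analyze tmpInstanceId")  # stats now reflect a non-empty table
    cur.execute(BIG_JOIN_SQL)
    return cur.fetchall()
```

Because the guard fires before any SQL is issued, ANALYZE is only ever run on a non-empty table, so relpages can never be recorded as zero by this code path.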
Cristian Gafton <gafton@rpath.com> writes:
> On Sun, 16 Mar 2008, Tom Lane wrote:
>> Is it possible that the temp table ever has exactly zero rows?
>
> Ah, that is indeed a possibility. If I am to understand correctly, there is
> no way to represent the difference between an un-analyzed table and a
> zero-sized analyzed table as far as the query planner is concerned?

While thinking about your report I was considering having VACUUM and ANALYZE always set relpages to at least 1. Then seeing relpages=0 would indeed indicate a never-analyzed table, whereas relpages=1 when physical table size is zero could be taken to indicate that we should trust the table to be really empty. I'm not sure though whether this sort of convention would confuse any existing code.

Another possibility (though not a back-patchable solution) is that we could just dispense with the heuristic size estimate and trust a zero-sized table to stay zero-sized. This would be relying on the assumption that autovacuum will kick in and update the stats, leading to invalidation of any existing plans that assume the table is small. I don't feel very comfortable about that though --- throwing a few hundred tuples into a table might not be enough to draw autovacuum's attention, but it could surely be enough to create a performance disaster for nestloop plans.

Could you confirm that your problem cases are actually caused by this effect and not something else?

			regards, tom lane
On Sun, 16 Mar 2008, Tom Lane wrote:
> > Ah, that is indeed a possibility. If I am to understand correctly, there is
> > no way to represent the difference between an un-analyzed table and a
> > zero-sized analyzed table as far as the query planner is concerned?
>
> While thinking about your report I was considering having VACUUM and
> ANALYZE always set relpages to at least 1. Then seeing relpages=0
> would indeed indicate a never-analyzed table, whereas relpages=1
> when physical table size is zero could be taken to indicate that
> we should trust the table to be really empty. I'm not sure though
> whether this sort of convention would confuse any existing code.

If having a discrepancy between relpages and table size is a concern, could relpages be a negative value to mark a non-analyzed table?

> Another possibility (though not a back-patchable solution) is that
> we could just dispense with the heuristic size estimate and trust a
> zero-sized table to stay zero-sized. This would be relying on the

I think improving the estimator would get us further, since in most cases it seems to get it relatively right.

> Could you confirm that your problem cases are actually caused by this
> effect and not something else?

Yes, confirmed. The runaway queries are all joining against an empty temporary table.

Thanks,
Cristian
--
Cristian Gafton
rPath, Inc.
Cristian Gafton <gafton@rpath.com> writes:
> On Sun, 16 Mar 2008, Tom Lane wrote:
>> While thinking about your report I was considering having VACUUM and
>> ANALYZE always set relpages to at least 1. Then seeing relpages=0
>> would indeed indicate a never-analyzed table, whereas relpages=1
>> when physical table size is zero could be taken to indicate that
>> we should trust the table to be really empty. I'm not sure though
>> whether this sort of convention would confuse any existing code.
>
> If having a discrepancy between relpages and table size is a concern,
> could relpages be a negative value to mark a non-analyzed table?

No, the value is really a uint32, though we don't declare it that way for lack of having any such SQL type :-(. (uint32)-1 is just as legal a value as 1, though perhaps a lot less likely. Anyway, client code looking at the column is probably more likely to get confused by a negative value for relpages than by a value that doesn't match underlying reality (which it can't easily see anyway).

>> Could you confirm that your problem cases are actually caused by this
>> effect and not something else?
>
> Yes, confirmed. The runaway queries all are joining against an empty
> temporary table.

Good, just wanted to be sure. If there are not objections, I'll put in the at-least-1 hack.

			regards, tom lane
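The at-least-1 convention Tom proposes can be sketched as follows. A hedged illustration of the idea, not actual backend code: VACUUM/ANALYZE would clamp relpages to a minimum of 1, reserving relpages == 0 to mean "never analyzed", so an analyzed-but-empty table (relpages = 1, reltuples = 0) becomes distinguishable and can be trusted:

```python
# Sketch of the proposed convention: relpages == 0 is reserved for
# never-analyzed tables, so the 10-page heuristic applies only to those.

def record_stats(physical_pages, tuples):
    """What VACUUM/ANALYZE would store under the at-least-1 convention."""
    relpages = max(physical_pages, 1)  # clamp: 0 is reserved for "no stats"
    return relpages, tuples

def stats_are_trustworthy(relpages):
    """relpages == 0 now unambiguously means 'apply the size heuristic'."""
    return relpages > 0

# Analyzing a genuinely empty table records (1, 0): trusted as empty.
print(record_stats(0, 0))             # -> (1, 0)
# A freshly created, never-analyzed table still has relpages == 0.
print(stats_are_trustworthy(0))       # -> False: heuristic applies
```

The appeal of this over a negative sentinel is that it never puts an out-of-range value into a column that is really a uint32, at the cost of relpages occasionally overstating a zero-page heap by one page.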
Tom Lane wrote:
> Another possibility (though not a back-patchable solution) is that
> we could just dispense with the heuristic size estimate and trust a
> zero-sized table to stay zero-sized. This would be relying on the
> assumption that autovacuum will kick in and update the stats, leading
> to invalidation of any existing plans that assume the table is small.
> I don't feel very comfortable about that though --- throwing a few
> hundred tuples into a table might not be enough to draw autovacuum's
> attention, but it could surely be enough to create a performance
> disaster for nestloop plans.

FWIW, autovacuum fires an analyze when the 51st tuple is inserted into a table, under 8.3's default configuration.

--
Alvaro Herrera                       http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
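Alvaro's figure follows from the autoanalyze trigger condition, sketched here with 8.3's default settings (autovacuum_analyze_threshold = 50, autovacuum_analyze_scale_factor = 0.1): an analyze fires once the number of changed tuples exceeds threshold + scale_factor * reltuples, which for a table last recorded at zero tuples works out to exactly the 51st insert.

```python
# Autoanalyze trigger condition with 8.3 default settings.
ANALYZE_THRESHOLD = 50      # autovacuum_analyze_threshold
ANALYZE_SCALE_FACTOR = 0.1  # autovacuum_analyze_scale_factor

def needs_autoanalyze(changed_tuples, reltuples):
    """True when autovacuum would schedule an ANALYZE for this table."""
    return changed_tuples > ANALYZE_THRESHOLD + ANALYZE_SCALE_FACTOR * reltuples

# Table last analyzed empty (reltuples = 0): threshold is 50 + 0.1*0 = 50.
print(needs_autoanalyze(50, 0))  # -> False: 50 inserts are not enough
print(needs_autoanalyze(51, 0))  # -> True: fires on the 51st tuple
```

Note this also quantifies Tom's concern: a few hundred inserts would comfortably clear the threshold, but the analyze is asynchronous, so nestloop plans built before it runs could still misfire badly.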