Thread: SeqScan costs

SeqScan costs

From
Simon Riggs
Date:
If we access a 1 block table using a SeqScan then it costs
seq_page_cost, or 1 by default.

If we access the same 1 block table using an IndexScan then the access
costs random_page_cost to access the index block and then
random_page_cost to access to the data block.

So the same block accessed for different reasons is judged to have two
different costs. But that clearly must be wrong, since the I/O is a 1
block I/O in either case.

This leads to favouring seq scans in cases where the actual index scan
time is less than the seq scan time.

Proposal: Make the first block of a seq scan cost random_page_cost, then
after that every additional block costs seq_page_cost.
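In other words (using the default seq_page_cost = 1.0 and
random_page_cost = 4.0 purely for illustration), the disk component of a
seq scan would become

  disk_cost = random_page_cost + (pages - 1) * seq_page_cost

rather than the current pages * seq_page_cost. A 1-block scan is then
charged 4.0 instead of 1.0, the same as the heap fetch in the index
case, while the cost of large scans is essentially unchanged.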

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: SeqScan costs

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Proposal: Make the first block of a seq scan cost random_page_cost, then
> after that every additional block costs seq_page_cost.

This is only going to matter for a table of 1 block (or at least very
few blocks), and for such a table it's highly likely that it's in RAM
anyway.  So I'm unconvinced that the proposed change represents a
better model of reality.

Perhaps more to the point, you haven't provided any actual evidence
that this is a better approach.  I'm disinclined to tinker with the
fundamental cost models on the basis of handwaving.
        regards, tom lane


Re: SeqScan costs

From
Simon Riggs
Date:
On Tue, 2008-08-12 at 15:46 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Proposal: Make the first block of a seq scan cost random_page_cost, then
> > after that every additional block costs seq_page_cost.
>
> This is only going to matter for a table of 1 block (or at least very
> few blocks), and for such a table it's highly likely that it's in RAM
> anyway.  So I'm unconvinced that the proposed change represents a
> better model of reality.

The access cost should be the same for a 1 block table, whether it's on
disk or in memory.

> Perhaps more to the point, you haven't provided any actual evidence
> that this is a better approach.  I'm disinclined to tinker with the
> fundamental cost models on the basis of handwaving.

I've written a simple test suite

psql -f seq.sql -v numblocks=x -v pkval=y -v filler=z

to investigate various costs and elapsed times.
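(seq.sql itself is in the attachment and isn't reproduced in the
archive; as a rough, hypothetical sketch, it is the kind of script that
does something like the following, where the table and column names are
purely illustrative and :numblocks, :pkval and :filler are the psql
variables passed in with -v:)

-- hypothetical sketch only; the real seq.sql is in the attachment
drop table if exists seqtest;
create table seqtest (pk integer primary key, filler text);
insert into seqtest
  select i, repeat('x', :filler)                 -- :filler controls row width
  from generate_series(1, :numblocks * 100) i;   -- row count is illustrative
vacuum analyze seqtest;
\timing
explain select * from seqtest where pk = :pkval;
select * from seqtest where pk = :pkval;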

AFAICS the cost cross-over is much higher than the actual elapsed time
cross-over for both narrow and wide tables.

That's why using SET enable_seqscan=off helps performance in many cases,
or why people reduce random_page_cost to force index selection.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: SeqScan costs

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Proposal: Make the first block of a seq scan cost
>> random_page_cost, then after that every additional block costs
>> seq_page_cost.

 Tom> This is only going to matter for a table of 1 block (or at least
 Tom> very few blocks), and for such a table it's highly likely that
 Tom> it's in RAM anyway.  So I'm unconvinced that the proposed change
 Tom> represents a better model of reality.

Simple example which demonstrates a 10x speed improvement for index
scan over seqscan for a 1-block table (on 8.3.3):

create table oneblock (id integer primary key, value text not null); 
insert into oneblock select i, 'row ' || i from generate_series(1,200) i;

test=> select pg_relation_size('oneblock');
 pg_relation_size 
------------------
             8192

analyze oneblock;

set enable_seqscan=true;

select (select value from oneblock where id = i) from generate_series(1,200) i, generate_series(1,5000) j;
Time: 25596.709 ms  (that's 25.6 us per row)

set enable_seqscan=false;

select (select value from oneblock where id = i) from generate_series(1,200) i, generate_series(1,5000) j;
Time: 2415.691 ms   (that's 2.4 us per row)

(removing the subselect entirely gives 0.4us per row, so it's actually
about a 12x speed difference for the subselect alone.)

The planner costs the seqscan at 3.50 and the indexscan at 8.27.
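(For reference, with the default cost settings the 3.50 is just
seq_page_cost 1.0 plus 200 rows * (cpu_tuple_cost 0.01 +
cpu_operator_cost 0.0025 for the filter) = 2.50, while the 8.27 is
dominated by two random_page_cost charges of 4.0, one for the index
page and one for the heap page, plus assorted smaller CPU charges.)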

-- 
Andrew (irc:RhodiumToad)


Re: SeqScan costs

From
Gregory Stark
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:

> On Tue, 2008-08-12 at 15:46 -0400, Tom Lane wrote:
>> Simon Riggs <simon@2ndquadrant.com> writes:
>> > Proposal: Make the first block of a seq scan cost random_page_cost, then
>> > after that every additional block costs seq_page_cost.
>> 
>> This is only going to matter for a table of 1 block (or at least very
>> few blocks), and for such a table it's highly likely that it's in RAM
>> anyway.  So I'm unconvinced that the proposed change represents a
>> better model of reality.

I think the first block of a sequential scan is clearly a random access. If
that doesn't represent reality well then perhaps we need to tackle both
problems together. Somehow we need to discount scan i/o cost based on how much
of the table we expect to be in cache. For 1-block tables if we should expect
them to be in cache we should be zeroing out all the i/o cost whether random
or sequential.

> The access cost should be the same for a 1 block table, whether its on
> disk or in memory.

Uhm, huh? That can't be what you meant to write?

> AFAICS the cost cross-over is much higher than the actual elapsed time
> cross-over for both narrow and wide tables. 
>
> Thats why using SET enable_seqscan=off helps performance in many cases,
> or why people reduce random_page_cost to force index selection.

People lower random_page_cost because we're not doing a good job estimating
how much of a table is in cache. I think that would be a great target for some
careful analysis. If you can come up with specific places and reasonable
heuristics to discount i/o costs based on effective_cache_size and then
demonstrate cases where it produces consistently better cost estimates that
would be a huge help.

I've been running benchmarks where I see accurate random_page_costs of 13-80
on uncached data on a moderate sized raid array. But of course when some of
the data is cached the effective random_page_cost is much much lower than
that.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!


Re: SeqScan costs

From
Jeff Davis
Date:
On Tue, 2008-08-12 at 23:58 +0100, Gregory Stark wrote:
> People lower random_page_cost because we're not doing a good job
>  estimating how much of a table is in cache. 

Is it because of a bad estimate about how much of a table is in cache,
or a bad assumption about the distribution of access to a table?

If the planner were to know, for example, that 10-20% of the table is
likely to be in cache, would that really make a difference in the plan? I
suspect that it would mostly only matter when the entire table is
cached, the correlation is low, and the query is somewhat selective
(which is a possible use case, but fairly narrow).

I suspect that this has more to do with the fact that some data is
naturally going to be accessed much more frequently than other data in a
large table. But how do you determine, at plan time, whether the query
will be mostly accessing hot data, or cold data?

Regards,
	Jeff Davis




Re: SeqScan costs

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
>> On Tue, 2008-08-12 at 15:46 -0400, Tom Lane wrote:
>>> This is only going to matter for a table of 1 block (or at least very
>>> few blocks), and for such a table it's highly likely that it's in RAM
>>> anyway.  So I'm unconvinced that the proposed change represents a
>>> better model of reality.

> I think the first block of a sequential scan is clearly a random access. If
> that doesn't represent reality well then perhaps we need to tackle both
> problems together.

The point I was trying to make (evidently not too well) is that fooling
around with fundamental aspects of the cost models is not something that
should be done without any evidence.  We've spent ten years getting the
system to behave reasonably well with the current models, and it's quite
possible that changing them to be "more accurate" according to a
five-minute analysis is going to make things markedly worse overall.

I'm not necessarily opposed to making this change --- it does sound
kinda plausible --- but I want to see some hard evidence that it does
more good than harm before we put it in.

> People lower random_page_cost because we're not doing a good job estimating
> how much of a table is in cache.

Agreed, the elephant in the room is that we lack enough data to model
caching effects with any degree of realism.
        regards, tom lane


Re: SeqScan costs

From
Simon Riggs
Date:
On Tue, 2008-08-12 at 23:58 +0100, Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> 
> > On Tue, 2008-08-12 at 15:46 -0400, Tom Lane wrote:
> >> Simon Riggs <simon@2ndquadrant.com> writes:
> >> > Proposal: Make the first block of a seq scan cost random_page_cost, then
> >> > after that every additional block costs seq_page_cost.
> >> 
> >> This is only going to matter for a table of 1 block (or at least very
> >> few blocks), and for such a table it's highly likely that it's in RAM
> >> anyway.  So I'm unconvinced that the proposed change represents a
> >> better model of reality.
> 
> I think the first block of a sequential scan is clearly a random access. 

Agreed

> If
> that doesn't represent reality well then perhaps we need to tackle both
> problems together. 

It does represent reality.

> Somehow we need to discount scan i/o cost based on how much
> of the table we expect to be in cache. For 1-block tables if we should expect
> them to be in cache we should be zeroing out all the i/o cost whether random
> or sequential.

This isn't relevant. 

> > The access cost should be the same for a 1 block table, whether its on
> > disk or in memory.
> 
> Uhm, huh? That can't be what you meant to write?

It can be. The *comparative* access cost (in the optimizer) between
random and sequential access should be exactly the same for a 1 block
table whether the block is on disk or in memory. In reality the block
access time is identical for a 1 block table at each level of the
storage hierarchy:

* in shared_buffers (ReadBuffer)
* in filesystem cache (read() to cache)
* on disk (read() to disk)

> > AFAICS the cost cross-over is much higher than the actual elapsed time
> > cross-over for both narrow and wide tables. 
> >
> > Thats why using SET enable_seqscan=off helps performance in many cases,
> > or why people reduce random_page_cost to force index selection.
> 
> People lower random_page_cost because we're not doing a good job estimating
> how much of a table is in cache. I think that would be a great target for some
> careful analysis. If you can come up with specific places and reasonable
> heuristics to discount i/o costs based on effective_cache_size and then
> demonstrate cases where it produces consistently better cost estimates that
> would be a huge help.
> 
> I've been running benchmarks where I see accurate random_page_costs of 13-80
> on uncached data on a moderate sized raid array. But of course when a some of
> the data is cached the effective random_page_cost is much much lower than
> that.

Agreed, but that's a harder problem and nothing I wish to raise here.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: SeqScan costs

From
Simon Riggs
Date:
On Tue, 2008-08-12 at 23:22 -0400, Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
> >> On Tue, 2008-08-12 at 15:46 -0400, Tom Lane wrote:
> >>> This is only going to matter for a table of 1 block (or at least very
> >>> few blocks), and for such a table it's highly likely that it's in RAM
> >>> anyway.  So I'm unconvinced that the proposed change represents a
> >>> better model of reality.
> 
> > I think the first block of a sequential scan is clearly a random access. If
> > that doesn't represent reality well then perhaps we need to tackle both
> > problems together.
> 
> The point I was trying to make (evidently not too well) is that fooling
> around with fundamental aspects of the cost models is not something that
> should be done without any evidence.  We've spent ten years getting the
> system to behave reasonably well with the current models, and it's quite
> possible that changing them to be "more accurate" according to a
> five-minute analysis is going to make things markedly worse overall.
> 
> I'm not necessarily opposed to making this change --- it does sound
> kinda plausible --- but I want to see some hard evidence that it does
> more good than harm before we put it in.

psql -f seq.sql -v numblocks=2 -v pkval=Anything -v filler=Varies

When numblocks=2 I consistently see that an index scan is actually
faster than a seqscan, yet the planner chooses a seqscan in all cases. 

This is true for any value of pkval and values of filler up to 4-500
bytes. We already take into account the length of rows because we
estimate the CPU costs per row not per block. That is not what I wish to
change.

This same situation occurs for all small tables. What I conclude is that
the "disk cost" swamps the CPU costs and so we end up with a seq scan
when we really want an index scan.

There are two ways of looking at this:
* we work out a complex scheme for knowing when to remove disk costs
* we realise that the "disk cost" is actually the same on the *first*
block whether we are in memory or on disk. 

If we take the second way, then we have a small but crucial correction
factor that produces better plans in most cases on small tables. Doing
it this way allows us to not worry about caching, but just to have a
scheme that balances the access costs better, so that although they are
still present in the total cost, the final plan choice is less dependent
upon the disk cost and more dependent upon the CPU costs.

This analysis is the result of experience, then measurement, not theory.
I've been looking for an easy and justifiable way to nudge the cost
factors so that they work better for small tables.
 run_cost += random_page_cost + seq_page_cost * (baserel->pages - 1);
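(For illustration with the default settings, that charges 4.0 rather
than 1.0 for a 1-block table, which would move the seqscan estimate in
Andrew's oneblock example from 3.50 to 6.50, much closer to the 8.27
estimated for the index scan.)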


> > People lower random_page_cost because we're not doing a good job estimating
> > how much of a table is in cache.
> 
> Agreed, the elephant in the room is that we lack enough data to model
> caching effects with any degree of realism.

I'm specifically talking about a proposal that works whether or not the
first block of the table is in cache, because I see a problem with small
table access.

I'm not suggesting that we model caching effects (though we may choose
to later). If you did, you might need to consider cross-statement
effects, such as the likelihood that an UPDATE .. WHERE CURRENT OF CURSOR
will find the block in cache, or other effects such as certain MFVs being
more likely to be in cache than non-MFVs, making index scans against
them more attractive than they might appear.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: SeqScan costs

From
Zeugswetter Andreas OSB sIT
Date:
> > > Proposal: Make the first block of a seq scan cost random_page_cost, then
> > > after that every additional block costs seq_page_cost.

+1

> AFAICS the cost cross-over is much higher than the actual elapsed time
> cross-over for both narrow and wide tables.

Which makes absolute sense, since readahead can only reduce cost when you
read more than one page (and more than a few when lacking fadvise tech).

I am wondering about your test though. It was all cached, so it seems
we also underestimate the CPU cost of accessing and comparing all the rows
during seq scan.

> Thats why using SET enable_seqscan=off helps performance in many cases,
> or why people reduce random_page_cost to force index selection.

Sequential scans that cause more IO than an alternate index path are often
counterproductive in highly concurrent scenarios.
In such scenarios it is really reasonable to lower random_page_cost.

Andreas


Re: SeqScan costs

From
Decibel!
Date:
On Aug 12, 2008, at 4:52 PM, Andrew Gierth wrote:
>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>>> Proposal: Make the first block of a seq scan cost
>>> random_page_cost, then after that every additional block costs
>>> seq_page_cost.
>
>  Tom> This is only going to matter for a table of 1 block (or at least
>  Tom> very few blocks), and for such a table it's highly likely that
>  Tom> it's in RAM anyway.  So I'm unconvinced that the proposed change
>  Tom> represents a better model of reality.
>
> Simple example which demonstrates a 10x speed improvement for index
> scan over seqscan for a 1-block table (on 8.3.3):
>
> create table oneblock (id integer primary key, value text not null);
> insert into oneblock select i, 'row ' || i from generate_series(1,200) i;
>
> test=> select pg_relation_size('oneblock');
>  pg_relation_size
> ------------------
>              8192
>
> analyze oneblock;
>
> set enable_seqscan=true;
>
> select (select value from oneblock where id = i)
>   from generate_series(1,200) i, generate_series(1,5000) j;
> Time: 25596.709 ms  (that's 25.6 us per row)
>
> set enable_seqscan=false;
>
> select (select value from oneblock where id = i)
>   from generate_series(1,200) i, generate_series(1,5000) j;
> Time: 2415.691 ms   (that's 2.4 us per row)

Roughly what I get on my MBP (I'm about a factor of 2 slower). This  
makes me think it's an issue of having to slog through an entire page  
one row at a time vs just using the index. To test this I tested  
selecting i=200 (remember we start filling data at the back of the  
page, so 200 would actually be the front, and I'm assuming the first  
value that would be hit) vs i=1. With seqscans, I saw about a 10%  
difference. With index scans the difference was moot, but also note  
that now index scans are in-between seqscans in performance.

decibel@platter.local=# explain analyze select (select value from oneblock where id = 200)
  from generate_series(1,200) i, generate_series(1,500000) j;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=17.00..20029.50 rows=1000000 width=0) (actual time=270.867..65821.373 rows=100000000 loops=1)
   InitPlan
     ->  Seq Scan on oneblock  (cost=0.00..3.50 rows=1 width=7) (actual time=0.052..0.053 rows=1 loops=1)
           Filter: (id = 200)
   ->  Function Scan on generate_series i  (cost=0.00..12.50 rows=1000 width=0) (actual time=0.062..0.351 rows=200 loops=1)
   ->  Materialize  (cost=13.50..23.50 rows=1000 width=0) (actual time=1.368..164.634 rows=500000 loops=200)
         ->  Function Scan on generate_series j  (cost=0.00..12.50 rows=1000 width=0) (actual time=270.743..459.335 rows=500000 loops=1)
 Total runtime: 79055.822 ms
(8 rows)

decibel@platter.local=# explain analyze select (select value from oneblock where id = 1)
  from generate_series(1,200) i, generate_series(1,500000) j;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=17.00..20029.50 rows=1000000 width=0) (actual time=261.941..72937.226 rows=100000000 loops=1)
   InitPlan
     ->  Seq Scan on oneblock  (cost=0.00..3.50 rows=1 width=7) (actual time=0.025..0.056 rows=1 loops=1)
           Filter: (id = 1)
   ->  Function Scan on generate_series i  (cost=0.00..12.50 rows=1000 width=0) (actual time=0.060..0.346 rows=200 loops=1)
   ->  Materialize  (cost=13.50..23.50 rows=1000 width=0) (actual time=1.375..182.474 rows=500000 loops=200)
         ->  Function Scan on generate_series j  (cost=0.00..12.50 rows=1000 width=0) (actual time=261.815..448.652 rows=500000 loops=1)
 Total runtime: 87702.315 ms
(8 rows)

decibel@platter.local=# set enable_seqscan =off;
SET
decibel@platter.local=# explain analyze select (select value from oneblock where id = 200)
  from generate_series(1,200) i, generate_series(1,500000) j;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=21.77..20034.27 rows=1000000 width=0) (actual time=262.219..69851.786 rows=100000000 loops=1)
   InitPlan
     ->  Index Scan using oneblock_pkey on oneblock  (cost=0.00..8.27 rows=1 width=7) (actual time=0.024..0.026 rows=1 loops=1)
           Index Cond: (id = 200)
   ->  Function Scan on generate_series i  (cost=0.00..12.50 rows=1000 width=0) (actual time=0.062..0.355 rows=200 loops=1)
   ->  Materialize  (cost=13.50..23.50 rows=1000 width=0) (actual time=1.325..174.314 rows=500000 loops=200)
         ->  Function Scan on generate_series j  (cost=0.00..12.50 rows=1000 width=0) (actual time=262.119..449.383 rows=500000 loops=1)
 Total runtime: 83024.952 ms
(8 rows)

decibel@platter.local=# explain analyze select (select value from oneblock where id = 1)
  from generate_series(1,200) i, generate_series(1,500000) j;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=21.77..20034.27 rows=1000000 width=0) (actual time=262.175..68943.985 rows=100000000 loops=1)
   InitPlan
     ->  Index Scan using oneblock_pkey on oneblock  (cost=0.00..8.27 rows=1 width=7) (actual time=0.023..0.025 rows=1 loops=1)
           Index Cond: (id = 1)
   ->  Function Scan on generate_series i  (cost=0.00..12.50 rows=1000 width=0) (actual time=0.062..0.339 rows=200 loops=1)
   ->  Materialize  (cost=13.50..23.50 rows=1000 width=0) (actual time=1.325..176.056 rows=500000 loops=200)
         ->  Function Scan on generate_series j  (cost=0.00..12.50 rows=1000 width=0) (actual time=262.079..454.692 rows=500000 loops=1)
 Total runtime: 82598.556 ms
(8 rows)

decibel@platter.local=#
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: SeqScan costs

From
Andrew Gierth
Date:
>>>>> "Decibel!" == Decibel!  <decibel@decibel.org> writes:
 Decibel> Roughly what I get on my MBP (I'm about a factor of 2
 Decibel> slower). This makes me think it's an issue of having to slog
 Decibel> through an entire page one row at a time vs just using the
 Decibel> index. To test this I tested selecting i=200 (remember we
 Decibel> start filling data at the back of the page, so 200 would
 Decibel> actually be the front, and I'm assuming the first value that
 Decibel> would be hit) vs i=1. With seqscans, I saw about a 10%
 Decibel> difference. With index scans the difference was moot, but
 Decibel> also note that now index scans are in-between seqscans in
 Decibel> performance.

The problem is that by looking for a constant row, you're actually
eliminating the entire effect being tested, because the uncorrelated
subselect is run ONCE as an initplan, and the entire query time is
then spent elsewhere. The differences in runtime you're seeing are
pure noise (the fact that you had to increase the iteration count so
much should have been a clue here).

-- 
Andrew (irc:RhodiumToad)


Re: SeqScan costs

From
Decibel!
Date:
On Wed, Aug 13, 2008 at 07:33:40PM +0100, Andrew Gierth wrote:
> The following message is a courtesy copy of an article
> that has been posted to pgsql.hackers as well.
>
> >>>>> "Decibel!" == Decibel!  <decibel@decibel.org> writes:
>
>  Decibel> Roughly what I get on my MBP (I'm about a factor of 2
>  Decibel> slower). This makes me think it's an issue of having to slog
>  Decibel> through an entire page one row at a time vs just using the
>  Decibel> index. To test this I tested selecting i=200 (remember we
>  Decibel> start filling data at the back of the page, so 200 would
>  Decibel> actually be the front, and I'm assuming the first value that
>  Decibel> would be hit) vs i=1. With seqscans, I saw about a 10%
>  Decibel> difference. With index scans the difference was moot, but
>  Decibel> also note that now index scans are in-between seqscans in
>  Decibel> performance.
>
> The problem is that by looking for a constant row, you're actually
> eliminating the entire effect being tested, because the uncorrelated
> subselect is run ONCE as an initplan, and the entire query time is
> then spent elsewhere. The differences in runtime you're seeing are
> pure noise (the fact that you had to increase the iteration count so
> much should have been a clue here).

Makes sense, and yeah, I was wondering a bit about that. I'll try to
fake it out with offset 0 later on if no one beats me to it; I do still
think we could just be seeing the effect of slogging through 200 tuples
instead of going directly to the one we want.
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Re: SeqScan costs

From
Gregory Stark
Date:
"Decibel!" <decibel@decibel.org> writes:

> Makes sense, and yeah, I was wondering a bit about that. I'll try to
> fake it out with offset 0 later on if no one beats me to it; I do still
> think we could just be seeing the effect of slogging through 200 tuples
> instead of going directly to the one we want.

Of course index lookups aren't magic. You don't get to go *directly* to the
one you want. You still have to slog through index tuples to find the right
pointer.

That means going to the index meta page, finding the fast root pointer,
looking up that page, following the single leaf page pointer, looking up
that page, and doing a binary search of the 200 leaf pointers. Once you
find the resulting match, look up the heap page and *then* go directly
to the right tuple.

So that means up to four trips to the buffer cache, trips through lwlocks,
etc. And it still means up to 9 btree comparisons. Still less than 200 but
it's not entirely free.
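(For scale: a binary search over ~200 index entries needs about
ceil(log2(200)) = 8 comparisons, consistent with the "up to 9" above.)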

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning


Re: SeqScan costs

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> That means going to the index meta page, find the fast root pointer, look up
> that page, look at the single leaf page pointer, look up that page, and do a
> binary search of the 200 leaf pointers. Once you find the resulting match,
> look up the heap page and *then* go directly to the right tuple.

Actually, the metapage access has been cached for some time, and there's
not going to be a separate root page if you only have 1 page worth of
index entries.  But yeah, for large indexes there are going to be
multiple page accesses before you find what you want.
        regards, tom lane


Re: SeqScan costs

From
Decibel!
Date:
On Aug 13, 2008, at 3:52 PM, Decibel! wrote:
>> The problem is that by looking for a constant row, you're actually
>> eliminating the entire effect being tested, because the uncorrelated
>> subselect is run ONCE as an initplan, and the entire query time is
>> then spent elsewhere. The differences in runtime you're seeing are
>> pure noise (the fact that you had to increase the iteration count so
>> much should have been a clue here).
>
> Makes sense, and yeah, I was wondering a bit about that. I'll try to
> fake it out with offset 0 later on if no one beats me to it; I do still
> think we could just be seeing the effect of slogging through 200 tuples
> instead of going directly to the one we want.


OK, ran the test again via this query:

explain analyze select (select value from oneblock where id = i) from  
generate_series(1,1) i, generate_series(1,100000) j;

changing 1,1 to 200,200 as needed. I don't see any meaningful  
differences between 1,1 and 200,200. The seqscan case is still  
notably slower than the index case (~5500ms vs ~800ms).

It'd be useful to get strace data on this, but OS X doesn't have  
that :/ (and I'm on 10.4 so no dtrace either). Can someone get an  
strace from this?
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: SeqScan costs

From
Tom Lane
Date:
Decibel! <decibel@decibel.org> writes:
> It'd be useful to get strace data on this, but OS X doesn't have  
> that :/ (and I'm on 10.4 so no dtrace either).

See ktrace.
        regards, tom lane


Re: SeqScan costs

From
Andrew Gierth
Date:
>>>>> "Decibel" == Decibel!  <decibel@decibel.org> writes:
 Decibel> OK, ran the test again via this query:

 Decibel> explain analyze select (select value from oneblock where id = i)
 Decibel> from generate_series(1,1) i, generate_series(1,100000) j;

 Decibel> changing 1,1 to 200,200 as needed. I don't see any
 Decibel> meaningful differences between 1,1 and 200,200.

Well of course you don't, since it scans all the rows regardless.
(Scalar subselects don't abort after the first row, they only abort if
they find _more_ than one row, and in this example there is only one,
so the whole of "oneblock" is scanned every time.)

You could likely expose a difference using LIMIT 1 in the subselect,
but that doesn't tell us anything we didn't already know (which is
that yes, index scan is much faster than seqscan even for 1-block
tables, except in the rare case when neither the index page nor the
table page are in cache, causing the indexscan to take two page
fetches rather than just one).
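That is, something along the lines of:

select (select value from oneblock where id = i limit 1)
  from generate_series(1,200) i, generate_series(1,5000) j;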

Oddly enough, when I try it with LIMIT 1, it _does_ show a significant
speed difference according to the row position, _but_ the index scan
is still twice as fast even when fetching only row 1 (which is indeed
physically first).

-- 
Andrew (irc:RhodiumToad)


Re: SeqScan costs

From
Decibel!
Date:
On Aug 13, 2008, at 10:45 PM, Andrew Gierth wrote:
> You could likely expose a difference using LIMIT 1 in the subselect,
> but that doesn't tell us anything we didn't already know (which is
> that yes, index scan is much faster than seqscan even for 1-block
> tables, except in the rare case when neither the index page nor the
> table page are in cache, causing the indexscan to take two page
> fetches rather than just one).
>
> Oddly enough, when I try it with LIMIT 1, it _does_ show a significant
> speed difference according to the row position, _but_ the index scan
> is still twice as fast even when fetching only row 1 (which is indeed
> physically first).


So the question is: why?? How can it be cheaper to hit 2 buffers than 1?

Though, unless we can improve the speed of seqscanning an entire page  
vs pulling the exact row we need it's probably still a moot point.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: SeqScan costs

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>>> On Tue, 2008-08-12 at 15:46 -0400, Tom Lane wrote:
>>>> This is only going to matter for a table of 1 block (or at least very
>>>> few blocks), and for such a table it's highly likely that it's in RAM
>>>> anyway.  So I'm unconvinced that the proposed change represents a
>>>> better model of reality.
>
>> I think the first block of a sequential scan is clearly a random access. If
>> that doesn't represent reality well then perhaps we need to tackle both
>> problems together.
>
> The point I was trying to make (evidently not too well) is that fooling
> around with fundamental aspects of the cost models is not something that
> should be done without any evidence.  We've spent ten years getting the
> system to behave reasonably well with the current models, and it's quite
> possible that changing them to be "more accurate" according to a
> five-minute analysis is going to make things markedly worse overall.
>
> I'm not necessarily opposed to making this change --- it does sound
> kinda plausible --- but I want to see some hard evidence that it does
> more good than harm before we put it in.

I don't want to see this thread completely drop because it also seems pretty
plausible to me too.

So what kind of evidence do we need? I'm thinking a query like

select (select count(*) from 1pagetable) as n1,
       (select count(*) from 2pagetable) as n2,
       (select count(*) from 3pagetable) as n3,
       ...
  from fairlylargetable

for various maximum size subquery tables would give an idea of how much cpu
time is spent thrashing through the sequential scans. If we raise the cost of
small sequential scans do the resulting costs get more accurate or do they get
out of whack?

Perhaps what's also needed here is to measure just how accurate the cpu_*
costs are. Perhaps they need to be raised somewhat if we're underestimating
the cost of digging through 200 tuples on a heap page and the benefit of a
binary search on the index tuples.
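(For reference, the knobs in question and their defaults are
cpu_tuple_cost = 0.01, cpu_index_tuple_cost = 0.005 and
cpu_operator_cost = 0.0025, so an experiment would be along the lines of

  set cpu_tuple_cost = 0.02;          -- values here are only illustrative
  explain analyze select count(*) from oneblock;

and then comparing how the estimates move against the actual times.)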

>> People lower random_page_cost because we're not doing a good job estimating
>> how much of a table is in cache.
>
> Agreed, the elephant in the room is that we lack enough data to model
> caching effects with any degree of realism.

It looks like we *do* discount the page accesses in index_pages_fetched based
on effective_cache_size. But that's the *only* place we use
effective_cache_size. We aren't discounting sequential scan or heap page
accesses even when the entire table is much smaller than effective_cache_size
and therefore hopefully cached.

We need to think about this. I'm a bit concerned that if we assume small
tables are always cached we'll run into problems on the poor but common
schema design that has hundreds of tiny tables. But that seems like a narrow
use case and not worth assuming we *never* get any cache effects on sequential
scans at all.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


Re: SeqScan costs

From
Simon Riggs
Date:
On Mon, 2008-08-18 at 16:44 +0100, Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
> 
> > Gregory Stark <stark@enterprisedb.com> writes:
> >>> On Tue, 2008-08-12 at 15:46 -0400, Tom Lane wrote:
> >>>> This is only going to matter for a table of 1 block (or at least
> very
> >>>> few blocks), and for such a table it's highly likely that it's in
> RAM
> >>>> anyway.  So I'm unconvinced that the proposed change represents a
> >>>> better model of reality.
> >
> >> I think the first block of a sequential scan is clearly a random
> access. If
> >> that doesn't represent reality well then perhaps we need to tackle
> both
> >> problems together.
> >
> > The point I was trying to make (evidently not too well) is that
> fooling
> > around with fundamental aspects of the cost models is not something
> that
> > should be done without any evidence.  We've spent ten years getting
> the
> > system to behave reasonably well with the current models, and it's
> quite
> > possible that changing them to be "more accurate" according to a
> > five-minute analysis is going to make things markedly worse overall.
> >
> > I'm not necessarily opposed to making this change --- it does sound
> > kinda plausible --- but I want to see some hard evidence that it
> does
> > more good than harm before we put it in.
> 
> I don't want to see this thread completely drop because it also seems
> pretty
> plausible to me too.
> 
> So what kind of evidence do we need? I'm thinking a query like
> 
> select (select count(*) from 1pagetable) as n1,
>        (select count(*) from 2pagetable) as n2,
>        (select count(*) from 3pagetable) as n3,
>        ...
>   from fairlylargetable
> 
> for various maximum size subquery tables would give an idea of how
> much cpu
> time is spent thrashing through the sequential scans. If we raise the
> cost of
> small sequential scans do the resulting costs get more accurate or do
> they get
> out of whack?

Sounds OK. I've not given up on this yet...

> Perhaps what's also needed here is to measure just how accurate the cpu_*
> costs are. Perhaps they need to be raised somewhat if we're underestimating
> the cost of digging through 200 tuples on a heap page and the benefit of a
> binary search on the index tuples.

Well, that's a can of worms you just opened. I'm trying to suggest a
specific fix to a specific problem.

> >> People lower random_page_cost because we're not doing a good job estimating
> >> how much of a table is in cache.
> >
> > Agreed, the elephant in the room is that we lack enough data to model
> > caching effects with any degree of realism.
> 
> It looks like we *do* discount the page accesses in index_pages_fetched based
> on effective_cache_size. But that's the *only* place we use
> effective_cache_size. We aren't discounting sequential scan or heap page
> accesses even when the entire table is much smaller than effective_cache_size
> and therefore hopefully cached.

That is about block reuse within the same scan, that's why it only
happens there. It doesn't assume the indexes are already cached, it just
says that they will be if we scan heap blocks in index order.

> We need to think about this. I'm a bit concerned that if we assume
> small tables are always cached 

I've not suggested that and we don't currently assume that.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: SeqScan costs

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> I'm not necessarily opposed to making this change --- it does sound
>> kinda plausible --- but I want to see some hard evidence that it does
>> more good than harm before we put it in.

> I don't want to see this thread completely drop because it also seems pretty
> plausible to me too.

> So what kind of evidence do we need? I'm thinking a query like

> select (select count(*) from 1pagetable) as n1,
>        (select count(*) from 2pagetable) as n2,
>        (select count(*) from 3pagetable) as n3,
>        ...
>   from fairlylargetable

Well, no, because there's no freedom of choice there.  What we want is
to find out how this change impacts seqscan vs indexscan choices for
small tables, and then whether those choices get better or worse.

> Perhaps what's also needed here is to measure just how accurate the cpu_*
> costs are. Perhaps they need to be raised somewhat if we're underestimating
> the cost of digging through 200 tuples on a heap page and the benefit of a
> binary search on the index tuples.

Possibly.  I doubt anyone's ever taken a hard look at the cpu_xxx
values.
        regards, tom lane


Re: SeqScan costs

From
Decibel!
Date:
On Aug 18, 2008, at 11:49 AM, Tom Lane wrote:
>> Perhaps what's also needed here is to measure just how accurate the cpu_*
>> costs are. Perhaps they need to be raised somewhat if we're underestimating
>> the cost of digging through 200 tuples on a heap page and the benefit of a
>> binary search on the index tuples.
>
> Possibly.  I doubt anyone's ever taken a hard look at the cpu_xxx
> values.


Josh Berkus indicated at PGCon that he's had luck *decreasing* the  
CPU costs, but IIRC that was mostly on OLAP systems. It seems we need  
some real data here.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828