Thread: again on index usage

again on index usage

From
Daniel Kalchev
Date:
Hello,

I have a table with ca 1.7 million records, with the following structure:

          Table "iplog_gate200112"
 Attribute |   Type    | Modifier
-----------+-----------+----------
 ipaddr    | inet      |
 input     | bigint    |
 output    | bigint    |
 router    | text      |
 ipdate    | timestamp |
Indices: iplog_gate200112_ipaddr_idx,
         iplog_gate200112_ipdate_idx

the following query 

explain
SELECT sum(input) FROM iplog_gate200112
WHERE
  '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND
  '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02' AND
  network_subeq(ipaddr, '193.68.240.0/20') AND 'uni-gw' ~ router;

results in 

NOTICE:  QUERY PLAN:

Aggregate  (cost=51845.51..51845.51 rows=1 width=8)
  ->  Seq Scan on iplog_gate200112  (cost=0.00..51845.04 rows=190 width=8)


Why would it not want to use index scan?

Statistics for the table are as follows (from pg_statistic s, pg_attribute a,
pg_class c
where starelid = c.oid and attrelid = c.oid and staattnum = attnum
and relname = 'iplog_gate200112'):

 attname | attdispersion |  starelid | staattnum | staop | stanullfrac | stacommonfrac |      stacommonval      |        staloval        |        stahival
---------+---------------+-----------+-----------+-------+-------------+---------------+------------------------+------------------------+------------------------
 ipaddr  |   8.85397e-05 | 190565949 |         1 |  1203 |           0 |   0.000441917 | 192.92.129.1           | 192.92.129.0           | 212.72.197.154
 input   |     0.0039343 | 190565949 |         2 |   412 |           0 |     0.0183278 | 0                      | 0                      | 5929816798
 output  |      0.724808 | 190565949 |         3 |   412 |           0 |      0.835018 | 0                      | 0                      | 2639435033
 router  |      0.222113 | 190565949 |         4 |   664 |           0 |      0.416541 | sofia5                 | bourgas1               | varna3
 ipdate  |      0.014311 | 190565949 |         5 |  1322 |           0 |     0.0580676 | 2001-12-04 00:00:00+02 | 2001-12-01 00:00:00+02 | 2001-12-31 00:00:00+02
(5 rows)

The query 

explain
SELECT sum(input) FROM iplog_gate200112 
WHERE  ipdate < '2001-12-01 00:00:00+02'  AND  network_subeq(ipaddr, '193.68.240.0/20') AND 'uni-gw' ~ router;

produces 

Aggregate  (cost=4.91..4.91 rows=1 width=8)
  ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112  (cost=0.00..4.91 rows=1 width=8)

Note there are no records with ipdate < '2001-12-01 00:00:00+02' in the table.

Could anyone shed some light? This is on 7.1.3.

Daniel



Re: again on index usage

From
Tom Lane
Date:
Daniel Kalchev <daniel@digsys.bg> writes:
> explain
> SELECT sum(input) FROM iplog_gate200112 
> WHERE 
>   '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND 
>   '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02' AND 
>    network_subeq(ipaddr, '193.68.240.0/20') AND 'uni-gw' ~ router;

> results in 

> NOTICE:  QUERY PLAN:

> Aggregate  (cost=51845.51..51845.51 rows=1 width=8)
>   ->  Seq Scan on iplog_gate200112  (cost=0.00..51845.04 rows=190 width=8)

> Why would it not want to use index scan?

It's difficult to tell from this what it thinks the selectivity of the
ipdate index would be, since the rows estimate includes the effect of
the ipaddr and router restrictions.  What do you get from just

explain
SELECT sum(input) FROM iplog_gate200112 
WHERE
  '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND
  '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02';

If you say "set enable_seqscan to off", does that change the plan?

BTW, the planner does not associate function calls with indexes.  If you
want to have the ipaddr index considered for this query, you need to write
ipaddr <<= '193.68.240.0/20' not network_subeq(ipaddr, '193.68.240.0/20').
(But IIRC, that only works in 7.2 anyway, not earlier releases :-()
        regards, tom lane


Re: again on index usage

From
Sean Chittenden
Date:
> It's difficult to tell from this what it thinks the selectivity of the
> ipdate index would be, since the rows estimate includes the effect of
> the ipaddr and router restrictions.  What do you get from just
> 
> explain
> SELECT sum(input) FROM iplog_gate200112 
> WHERE 
>   '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND 
>   '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02';
> 

I don't know if this'd help, but if you're summing the data and running
this query often, see if a functional index would help:

CREATE INDEX sum_input_fnc_idx ON iplog_gate200112 (sum(input));

-sc

-- 
Sean Chittenden


Re: again on index usage

From
Daniel Kalchev
Date:
>>>Tom Lane said:
> It's difficult to tell from this what it thinks the selectivity of the
> ipdate index would be, since the rows estimate includes the effect of
> the ipaddr and router restrictions.  What do you get from just
>
> explain
> SELECT sum(input) FROM iplog_gate200112
> WHERE
>   '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND
>   '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02';

Same result (sorry, should have included this originally):


Aggregate  (cost=47721.72..47721.72 rows=1 width=8)
  ->  Seq Scan on iplog_gate200112  (cost=0.00..47579.54 rows=56873 width=8)

> If you say "set enable_seqscan to off", does that change the plan?

Yes. As expected (I no longer have the problem of NaN estimates :)

Aggregate  (cost=100359.71..100359.71 rows=1 width=8)
  ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112  (cost=0.00..100217.52 rows=56873 width=8)

My belief is that the planner does not want to use the index due to the low
value dispersion of the indexed attribute. When the table is split into several
smaller tables, the index is used.

This bites me, because each such query takes at least 3 minutes and the script
that generates these needs to execute a few thousand queries.

> BTW, the planner does not associate function calls with indexes.  If you
> want to have the ipaddr index considered for this query, you need to write
> ipaddr <<= '193.68.240.0/20' not network_subeq(ipaddr, '193.68.240.0/20').
> (But IIRC, that only works in 7.2 anyway, not earlier releases :-()

This is what I thought too, but using the ipdate index would be sufficient.

I understand my complaint is not a bug, but rather a question of proper planner
optimization (it worked 'as expected' in 7.0). Perhaps the planner should
consider the total number of rows, as well as the dispersion factor. With the
dispersion being around 1.5% and 1.7 million total rows, this gives about 25k
rows with the same value - large enough to trigger a sequential scan, as far as
I understand it, but the cost of scanning 1.7 million rows sequentially is
just too high.
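[Editor's note: the "about 25k rows" figure above can be checked as rough arithmetic against the attdispersion value for ipdate shown in the pg_statistic output earlier in the thread. A sketch, treating attdispersion as an approximate fraction of rows sharing a value:

```python
# Back-of-envelope check of "about 25k rows with the same value".
total_rows = 1_700_000     # ca. table size from the thread
dispersion = 0.014311      # attdispersion of ipdate (pg_statistic above)

rows_per_value = total_rows * dispersion
print(f"~{rows_per_value:,.0f} rows per ipdate value")
```

This lands near 24k, consistent with Daniel's estimate.]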

By the way, the same query takes approx 10 sec with set enable_seqscan to off.

Daniel



Re: again on index usage

From
Tom Lane
Date:
Daniel Kalchev <daniel@digsys.bg> writes:
> Same result (sorry, should have included this originally):

> Aggregate  (cost=47721.72..47721.72 rows=1 width=8)
>   ->  Seq Scan on iplog_gate200112  (cost=0.00..47579.54 rows=56873 width=8)

>>> If you say "set enable_seqscan to off", does that change the plan?

> Aggregate  (cost=100359.71..100359.71 rows=1 width=8)
>   ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112  
> (cost=0.00..100217.52 rows=56873 width=8)

So, what we've got here is a difference of opinion: the planner thinks
that the seqscan will be faster.  How many rows are actually selected
by this WHERE clause?  How long does each plan actually take?
        regards, tom lane


Re: again on index usage

From
Daniel Kalchev
Date:
>>>Tom Lane said:
> Daniel Kalchev <daniel@digsys.bg> writes:
> > Same result (sorry, should have included this originally):
>
> > Aggregate  (cost=47721.72..47721.72 rows=1 width=8)
> >   ->  Seq Scan on iplog_gate200112  (cost=0.00..47579.54 rows=56873 width=8)
>
> >>> If you say "set enable_seqscan to off", does that change the plan?
>
> > Aggregate  (cost=100359.71..100359.71 rows=1 width=8)
> >   ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112
> > (cost=0.00..100217.52 rows=56873 width=8)
>
> So, what we've got here is a difference of opinion: the planner thinks
> that the seqscan will be faster.  How many rows are actually selected
> by this WHERE clause?  How long does each plan actually take?
>
>         regards, tom lane

3-5 minutes with sequential scan; 10-15 sec with index scan. The query returns 
4062 rows. Out of ca 1700000 rows.

With only the datetime constraints (which relates to the index), the number of 
rows is 51764.

In any case, sequential scan of millions of rows cannot be faster than index 
scan. The average number of records for each index key is around 25000 - 
perhaps the planner thinks because the number of tuples in this case is 
higher, it should prefer sequential scan. I guess the planner will do better 
if there is some scaling of these values with respect to the total number of 
rows.

Daniel




Re: again on index usage

From
Hiroshi Inoue
Date:
Daniel Kalchev wrote:
> 
> >>>Tom Lane said:
>  > Daniel Kalchev <daniel@digsys.bg> writes:
>  > > Same result (sorry, should have included this originally):
>  >
>  > >>> If you say "set enable_seqscan to off", does that change the plan?
>  >
>  > > Aggregate  (cost=100359.71..100359.71 rows=1 width=8)
>  > >   ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112
>  > > (cost=0.00..100217.52 rows=56873 width=8)
>  >
>  > So, what we've got here is a difference of opinion: the planner thinks
>  > that the seqscan will be faster.  How many rows are actually selected
>  > by this WHERE clause?  How long does each plan actually take?
>  >
>  >                      regards, tom lane
> 
> 3-5 minutes with sequential scan; 10-15 sec with index scan. The query returns
> 4062 rows. Out of ca 1700000 rows.
> 
> With only the datetime constraints (which relates to the index), the number of
> rows is 51764.

The planner estimates 56873 rows, which does not seem that bad.
A plausible reason is that your table is nearly clustered
on ipdate.

regards,
Hiroshi Inoue


Re: again on index usage

From
Tom Lane
Date:
Daniel Kalchev <daniel@digsys.bg> writes:
> Aggregate  (cost=47721.72..47721.72 rows=1 width=8)
> ->  Seq Scan on iplog_gate200112  (cost=0.00..47579.54 rows=56873 width=8)
>>> 
> If you say "set enable_seqscan to off", does that change the plan?
>>> 
> Aggregate  (cost=100359.71..100359.71 rows=1 width=8)
> ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112  
> (cost=0.00..100217.52 rows=56873 width=8)
>>> 
>>> So, what we've got here is a difference of opinion: the planner thinks
>>> that the seqscan will be faster.  How many rows are actually selected
>>> by this WHERE clause?  How long does each plan actually take?

> 3-5 minutes with sequential scan; 10-15 sec with index scan. The query returns 
> 4062 rows. Out of ca 1700000 rows.

> With only the datetime constraints (which relates to the index), the number of 
> rows is 51764.

Hm.  Okay, so the number-of-rows estimate is not too far off.  I concur
with Hiroshi's comment: the reason the indexscan is so fast must be that
the table is clustered (physical order largely agrees with index order).
This would obviously hold if the records were entered in order by
ipdate; is that true?

The 7.2 planner does try to estimate index order correlation, and would
be likely to price this indexscan much lower, so that it would make the
right choice.  I'd suggest updating to 7.2 as soon as we have RC1 out.
(Don't do it yet, though, since we've got some timestamp bugs to fix
that are probably going to require another initdb.)

> In any case, sequential scan of millions of rows cannot be faster than index 
> scan.

Snort.  If that were true, we could get rid of most of the complexity
in the planner.
        regards, tom lane


Re: again on index usage

From
Daniel Kalchev
Date:
>>>Tom Lane said:
> Hm.  Okay, so the number-of-rows estimate is not too far off.  I concur
> with Hiroshi's comment: the reason the indexscan is so fast must be that
> the table is clustered (physical order largely agrees with index order).
> This would obviously hold if the records were entered in order by
> ipdate; is that true?

Yes. But... do you want me to cluster it by ipaddr for example and try it 
again? I understand the clustering might help with sequential scans, but why 
would it help with index scans?

Daniel



Re: again on index usage

From
Tom Lane
Date:
Daniel Kalchev <daniel@digsys.bg> writes:
> I understand the clustering might help with sequential scans, but why 
> would it help with index scans?

No, the other way around: it makes no difference for seq scans, but can
speed up index scans quite a lot.  With a clustered table, successive
index-driven fetches tend to hit the same pages rather than hitting
random pages throughout the table.  That saves I/O.

Given the numbers you were quoting, if the table were in perfectly
random order by ipdate then there would probably have been about three
rows per page that the indexscan would've had to fetch.  This would mean
touching each page three times in some random order.  Unless the table
is small enough to fit in Postgres' shared buffer cache, that's going to
represent a lot of extra I/O --- a lot more than reading each page only
once, as a seqscan would do.  At the other extreme, if the table is
perfectly ordered by ipdate then the indexscan need only hit a small
number of pages (all the rows we want are in a narrow range) and we
touch each page many times before moving on to the next.  Very few I/O
requests in that case.
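[Editor's note: the three-rows-per-page figure can be reproduced from the table sizes quoted elsewhere in the thread (1,706,202 tuples in 17,722 pages from the VACUUM VERBOSE output, ~51,764 rows matched by the date range). A sketch, not part of the original message:

```python
# Rows-per-page arithmetic behind the random-vs-clustered comparison.
tuples, pages, matched = 1_706_202, 17_722, 51_764

rows_per_page = tuples / pages       # ~96 tuples per 8k page
matches_per_page = matched / pages   # ~3 matching rows per page

# Random order: each of the ~51,764 matches may need its own page fetch.
# Perfectly clustered: the matches fill only ~540 contiguous pages.
clustered_pages = matched / rows_per_page
print(round(rows_per_page), round(matches_per_page), round(clustered_pages))
```

So the clustered layout needs two orders of magnitude fewer page fetches than the random one, which is the whole argument of the paragraph above.]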

7.1 does not have any statistics about table order, so it uses the
conservative assumption that the ordering is random.  7.2 has more
statistical data and perhaps will make better estimates about the
cost of indexscans.
        regards, tom lane


Re: again on index usage

From
Daniel Kalchev
Date:
>>>Tom Lane said:
> Daniel Kalchev <daniel@digsys.bg> writes:
> > I understand the clustering might help with sequential scans, but why
> > would it help with index scans?
>
> No, the other way around: it makes no difference for seq scans, but can
> speed up index scans quite a lot.  With a clustered table, successive
> index-driven fetches tend to hit the same pages rather than hitting
> random pages throughout the table.  That saves I/O.

Ok, time to go home :-), but...
> Given the numbers you were quoting, if the table were in perfectly
> random order by ipdate then there would probably have been about three
> rows per page that the indexscan would've had to fetch.  This would mean
> touching each page three times in some random order.  Unless the table
> is small enough to fit in Postgres' shared buffer cache, that's going to
> represent a lot of extra I/O --- a lot more than reading each page only
> once, as a seqscan would do.  At the other extreme, if the table is
> perfectly ordered by ipdate then the indexscan need only hit a small
> number of pages (all the rows we want are in a narrow range) and we
> touch each page many times before moving on to the next.  Very few I/O
> requests in that case.

In any case, if we need to hit 50k pages (assuming the indexed data is 
randomly scattered in the file), and having to read these three times each, it 
will be less I/O than having to read 1.7 million records. The table will never 
be laid out sequentially on the disk, at least not in this case (the table grows
day after day - which is why the data is almost ordered by ipdate).

What I am arguing about is the scaling - are 50k random reads worse than 1.7
million sequential reads, considering the tuple size, disk block size, etc.?

I will wait patiently for 7.2 to be released and see how this same table performs.
:-)

Daniel



Re: again on index usage

From
Tom Lane
Date:
Daniel Kalchev <daniel@digsys.bg> writes:
> In any case, if we need to hit 50k pages (assuming the indexed data is 
> randomly scattered in the file), and having to read these three times each, it 
> will be less I/O than having to read 1.7 million records.

How do you arrive at that?  Assuming 100 records per page (probably the
right order of magnitude), the seqscan alternative is 17k page reads.
Yes, you examine more tuples, but CPUs are lots faster than disks.

That doesn't even address the fact that Unix systems reward sequential
reads and penalize random access.  In a seqscan, we can expect that the
kernel will schedule the next page read before we ask for it, so that
our CPU time to examine a page is overlapped with I/O for the next page.
In an indexscan that advantage goes away, because neither we nor the
kernel know which page will be touched next.  On top of the loss of
read-ahead, the filesystem is probably laid out in a way that rewards
sequential access with fewer and shorter seeks.

The tests I've made suggest that the penalty involved is about a factor
of four -- ie, a seqscan can scan four pages in the same amount of time
that it takes to bring in one randomly-accessed page.
        regards, tom lane
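[Editor's note: Tom's factor-of-four penalty can be combined with the page counts from the thread in a quick back-of-envelope comparison. A sketch; the 17k-page and 50k-fetch numbers are the estimates discussed above, not measurements:

```python
# Seqscan vs indexscan I/O, in "sequential page read" units.
seq_pages = 17_000        # ~1.7M rows at ~100 records per page
random_fetches = 50_000   # worst case: one fetch per matching row
random_page_cost = 4.0    # Tom's measured random-access penalty

seq_cost = seq_pages * 1.0                      # 17,000 units
index_cost = random_fetches * random_page_cost  # 200,000 units
print(seq_cost, index_cost)
```

On these raw assumptions the seqscan looks roughly 12x cheaper, so the planner's preference is understandable; caching and physical clustering (ignored here) are what make the index path win in practice.]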


Re: again on index usage

From
Daniel Kalchev
Date:
>>>Tom Lane said:
> Daniel Kalchev <daniel@digsys.bg> writes:
> > In any case, if we need to hit 50k pages (assuming the indexed data is
> > randomly scattered in the file), and having to read these three times each, it
> > will be less I/O than having to read 1.7 million records.
>
> How do you arrive at that?  Assuming 100 records per page (probably the
> right order of magnitude), the seqscan alternative is 17k page reads.
> Yes, you examine more tuples, but CPUs are lots faster than disks.

I tried this:

db=# select * into iplog_test from iplog_gate200112;
SELECT
db=# create index iplog_test_ipaddr_idx on iplog_test(ipaddr);
CREATE
db=# cluster iplog_test_ipaddr_idx on iplog_test;
CLUSTER
db=# create index iplog_test_ipdate_idx on iplog_test(ipdate);
CREATE
db=# vacuum verbose analyze iplog_test;
NOTICE:  --Relation iplog_test--
NOTICE:  Pages 17722: Changed 0, reaped 0, Empty 0, New 0; Tup 1706202: Vac 0, 
Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 80, MaxLen 88; Re-using: Free/Avail. 
Space 0/0; EndEmpty/Avail. Pages 0/0. CPU 1.48s/-1.86u sec.
NOTICE:  Index iplog_test_ipaddr_idx: Pages 5621; Tuples 1706202. CPU 
0.51s/1.80u sec.
NOTICE:  Index iplog_test_ipdate_idx: Pages 4681; Tuples 1706202. CPU 
0.36s/1.92u sec.
NOTICE:  --Relation pg_toast_253297758--
NOTICE:  Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0, Keep/VTL 
0/0, Crash 0, UnUsed 0, MinLen 0, MaxLen 0; Re-using: Free/Avail. Space 0/0; 
EndEmpty/Avail. Pages 0/0. CPU 0.00s/0.00u sec.
NOTICE:  Index pg_toast_253297758_idx: Pages 1; Tuples 0. CPU 0.00s/0.00u sec.
NOTICE:  Analyzing...
VACUUM
db=# explain
db-# SELECT sum(input), sum(output) FROM iplog_test 
db-# WHERE 
db-# '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' 
AND
db-# '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02' 
AND
db-# ipaddr <<= '193.68.240.0/20' AND 'uni-gw' ~ router;
NOTICE:  QUERY PLAN:

Aggregate  (cost=56112.97..56112.97 rows=1 width=16)
  ->  Seq Scan on iplog_test  (cost=0.00..56111.54 rows=284 width=16)

EXPLAIN

query runs for ca 3.5 minutes.

db=# set enable_seqscan to off;

the query plan is

Aggregate  (cost=100507.36..100507.36 rows=1 width=16)
  ->  Index Scan using iplog_test_ipdate_idx on iplog_test  (cost=0.00..100505.94 rows=284 width=16)

query runs for ca 2.2 minutes.

Moves closer to your point :-)

Anyway, the platform is a dual Pentium III 550 MHz (Intel) machine with 512
MB RAM, with a 15000 RPM Cheetah for the database, running BSD/OS 4.2. The
machine is reasonably loaded all the time, so this is very much a real-world test.

I agree, that with the 'wrong' clustering the index scan is not so much faster 
than the sequential scan.

Perhaps I need to tune this machine's costs to prefer more disk intensive 
operations over CPU intensive operations?

Let's see what 7.2 will result in.

Daniel



Re: again on index usage

From
"Zeugswetter Andreas SB SD"
Date:
Daniel wrote: (stripped to info I used)
> NOTICE:  Pages 17722: Changed 0, reaped 0, Empty 0, New 0; 
> Tup 1706202: Vac 0, 
> NOTICE:  Index iplog_test_ipaddr_idx: Pages 5621; Tuples 1706202. CPU 
> NOTICE:  Index iplog_test_ipdate_idx: Pages 4681; Tuples 1706202. CPU 

>   ->  Seq Scan on iplog_test  (cost=0.00..56111.54 rows=284 width=16)
> query runs for ca 3.5 minutes.

>   ->  Index Scan using iplog_test_ipdate_idx on iplog_test  
> (cost=0.00..100505.94 rows=284 width=16)
> query runs for ca 2.2 minutes.

I cannot really see how 284 rows can have an estimated index cost of 100506 ?

> 512 MB RAM, with 15000 RPM Cheetah for the database, running

> Perhaps I need to tune this machine's costs to prefer more 
> disk intensive operations over CPU intensive operations?

What is actually estimated wrong here seems to be the estimated
effective cache size, and thus the cache ratio of page fetches.
Most of your pages will be cached.

The tuning parameter is: effective_cache_size

With (an estimated) 50 % of 512 Mb for file caching that number would 
need to be:
effective_cache_size = 32768 # 8k pages

Can you try this and tell us what happens ?
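[Editor's note: spelled out, Andreas's 32768 comes from simple arithmetic; the 50% file-cache share is his stated assumption:

```python
# effective_cache_size is expressed in 8k pages, so 50% of 512 MB gives:
ram_bytes = 512 * 1024 * 1024   # 512 MB of RAM
file_cache = ram_bytes // 2     # assume ~50% used as OS file cache
page_size = 8192                # PostgreSQL's default page size
print(file_cache // page_size)  # 32768
```
]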

Andreas


Re: again on index usage

From
Daniel Kalchev
Date:
>>>"Zeugswetter Andreas SB SD" said:
> What is actually estimated wrong here seems to be the estimated
> effective cache size, and thus the cache ratio of page fetches.
> Most of your pages will be cached.
>
> The tuning parameter is: effective_cache_size
>
> With (an estimated) 50 % of 512 Mb for file caching that number would
> need to be:
> effective_cache_size = 32768 # 8k pages
>
> Can you try this and tell us what happens ?

I suspected this, but haven't really come to test it. On BSD/OS, the buffer 
cache is 10% of the RAM, in my case

buffer cache = 53522432 (51.04 MB)

I guess effective_cache_size = 6400 will be ok.
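[Editor's note: the same arithmetic applied to the 10% BSD/OS buffer cache quoted above, assuming 8k pages:

```python
# 53522432 bytes of buffer cache, expressed in 8k pages:
buffer_cache_bytes = 53_522_432
page_size = 8192
print(buffer_cache_bytes // page_size)  # 6533, so 6400 is a safe round figure
```
]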

Daniel



Re: again on index usage

From
Daniel Kalchev
Date:
[with the new effective_cache_size = 6400]

explain
SELECT sum(input), sum(output) FROM iplog_gate200112
WHERE 
'2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND 
'2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02' AND 
ipaddr <<= '193.68.240.0/20' AND 'uni-gw' ~ router;

gives

Aggregate  (cost=56111.97..56111.97 rows=1 width=16)
  ->  Seq Scan on iplog_gate200112  (cost=0.00..56110.54 rows=284 width=16)

takes 3 min to execute. (was 10 sec after fresh restart)

db=# set enable_seqscan to off;

Aggregate  (cost=84980.10..84980.10 rows=1 width=16)
  ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112  (cost=0.00..84978.68 rows=284 width=16)

takes 1.8 min to execute. (was 2 sec after fresh restart)

Still proves my point, but the fresh-restart performance is impressive. After a
few minutes the database takes its normal load and, in my opinion, the buffer
cache is too cluttered with pages from other tables.

Which brings up another question: with so much RAM in recent equipment, it may
be a good idea to add specific instructions to INSTALL on tuning the system
as soon as it is installed. Most people will stop there, especially after an
upgrade (as I did).

Daniel



Re: again on index usage

From
"Zeugswetter Andreas SB SD"
Date:
> [with the new effective_cache_size = 6400]

This seems way too low for a 512 Mb machine. Why does your OS
only use so little for file cache? Is the rest used for processes?
For the above number you need to consider OS cache and shared_buffers.
You can approximately add them together, minus a few %.

With the data you gave, a calculated value for effective_cache_size
would be 29370, assuming the random_page_cost is actually 4 on your
machine. 29370 might be a slight overestimate, since your new table
will probably still be somewhat sorted by date within one IP.

Try to measure IO/s during the seq scan and during the index path
and calculate the ratio. This should be done during an average workload
on the machine.

Andreas


Re: again on index usage

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> I cannot really see how 284 rows can have an estimated index cost of 100506 ?

The estimated number of indexscanned rows is more like 50k.  The number
you are looking at includes the estimated selectivity of the
non-indexable WHERE clauses, too.

> What is actually estimated wrong here seems to be the estimated
> effective cache size, and thus the cache ratio of page fetches.

Good point, but I think the estimates are only marginally sensitive
to estimated cache size (if they're not, we have a problem, considering
how poorly we can estimate the kernel's disk buffer size).  It would
be interesting for Daniel to try a few different settings of
effective_cache_size and see how much the EXPLAIN costs change.
        regards, tom lane


Re: again on index usage

From
Tom Lane
Date:
Daniel Kalchev <daniel@digsys.bg> writes:
> I agree, that with the 'wrong' clustering the index scan is not so
> much faster than the sequential scan.

It would be interesting to check whether there is any correlation
between ipaddr and ipdate in your test data.  Perhaps clustering
on ipaddr might not destroy the ordering on ipdate as much as you
thought.  A more clearly random-order test would go:

select * into iplog_test from iplog_gate200112 order by random();
create index iplog_test_ipdate_idx on iplog_test(ipdate);
vacuum verbose analyze iplog_test;
<< run queries >>

> Perhaps I need to tune this machine's costs to prefer more disk intensive 
> operations over CPU intensive operations?

Possibly.  I'm not sure there's much point in tuning the cost estimates
until the underlying model is more nearly right (ie, knows something
about correlation).  Do you care to try your dataset with 7.2 beta?
        regards, tom lane


Re: again on index usage

From
"Zeugswetter Andreas SB SD"
Date:
> > What is actually estimated wrong here seems to be the estimated
> > effective cache size, and thus the cache ratio of page fetches.
> 
> Good point, but I think the estimates are only marginally sensitive
> to estimated cache size (if they're not, we have a problem, considering
> how poorly we can estimate the kernel's disk buffer size).  It would
> be interesting for Daniel to try a few different settings of
> effective_cache_size and see how much the EXPLAIN costs change.

Well, the number I told him (29370) should clearly prefer the index.
The estimate is very sensitive to this value :-(
With 29370 (=229 Mb) the index cost is 1,364 instead of 3,887 with the 
default of 1000 pages ==> index scan.

229 Mb file cache with 512Mb Ram is a reasonable value, I have
a lot more here:
Memory    Real     Virtual
free        0 MB    218 MB
procs      95 MB    293 MB
files     159 MB
total     256 MB    512 MB

Andreas


Re: again on index usage

From
Daniel Kalchev
Date:
>>>"Zeugswetter Andreas SB SD" said:
> > [with the new effective_cache_size = 6400]
>
> This seems way too low for a 512 Mb machine. Why does your OS
> only use so little for file cache? Is the rest used for processes?
> For the above number you need to consider OS cache and shared_buffers.
> You can approximately add them together, minus a few %.

As far as I am aware, 10% for buffer space is the default for BSD operating 
systems... although I have seen buffer space = 50% on MacOS X. There is no 
problem to increase the buffer space in kernel, although I am not very 
confident this will give much better overall performance (well, more memory 
can be added as well).
> With the data you gave, a calculated value for effective_cache_size
> would be 29370, assuming the random_page_cost is actually 4 on your
> machine. 29370 might be a slight overestimate, since your new table
> will probably still be somewhat sorted by date within one IP.

random_page_cost is 4.

If the SELECT INTO followed by CLUSTER does this, then yes, it is possible,
but not guaranteed.

I will try with increased effective_cache_size.

Postmaster is started with -N 128 -B 256 -i -o "-e -S 10240" 

Daniel



Re: again on index usage

From
Daniel Kalchev
Date:
>>>"Zeugswetter Andreas SB SD" said:
> > > What is actually estimated wrong here seems to be the estimated
> > > effective cache size, and thus the cache ratio of page fetches.
> >
> > Good point, but I think the estimates are only marginally sensitive
> > to estimated cache size (if they're not, we have a problem, considering
> > how poorly we can estimate the kernel's disk buffer size).  It would
> > be interesting for Daniel to try a few different settings of
> > effective_cache_size and see how much the EXPLAIN costs change.
>
> Well, the number I told him (29370) should clearly prefer the index.
> The estimate is very sensitive to this value :-(
> With 29370 (=229 Mb) the index cost is 1,364 instead of 3,887 with the
> default of 1000 pages ==> index scan.

But... if I understand it right (effective_cache_size relates to the kernel
buffer space), it turns out that the estimates differ from reality - my
buffer cache is ca. 50 MB and I still get at least twice the performance with
an index scan instead of a sequential scan - whereas, as Tom explained, things
should be much worse.

I considered the possibility that the clustered table can still maintain some
ordering by ipdate after being clustered by ipaddr - but with over 65k IP
addresses, almost evenly spread, this should not be so significant.

Best Regards,
Daniel



Re: again on index usage

From
"Zeugswetter Andreas SB SD"
Date:
>  > > > What is actually estimated wrong here seems to be the estimated
>  > > > effective cache size, and thus the cache ratio of page fetches.
>  > > 
>  > > Good point, but I think the estimates are only marginally sensitive
>  > > to estimated cache size (if they're not, we have a problem, considering
>  > > how poorly we can estimate the kernel's disk buffer size).  It would
>  > > be interesting for Daniel to try a few different settings of
>  > > effective_cache_size and see how much the EXPLAIN costs change.
>  > 
>  > Well, the number I told him (29370) should clearly prefer the index.
>  > The estimate is very sensitive to this value :-(
>  > With 29370 (=229 Mb) the index cost is 1,364 instead of 3,887 with the 
>  > default of 1000 pages ==> index scan.
> 
> But... if I understand it right (effective_cache_size to be related to kernel 
> buffer space). it turns out that the estimates are different with reality - my 
> buffer cache is ca. 50 MB and I still get at least twice the performance with 
> index scan instead of sequential scan - where as Tom explained things should 
> be much worse.

Since pg only reads one 8k page at a time, the IO performance of a seq scan is
probably not nearly as good as it could be when a lot of other IO is done on the
same drive.

First thing you should verify is if there is actually a measurable difference
in IO throughput on the pg drive during the seq scan and the index scan. (iostat)
If there is not, then random_page_cost is too high in your scenario.
(All assuming your data is not still clustered like Tom suggested)

Andreas


Re: again on index usage

From
Bruce Momjian
Date:
This topic seems to come up a lot.  Is there something we are missing in
the FAQ?

---------------------------------------------------------------------------

Zeugswetter Andreas SB SD wrote:
> 
> >  > > > What is actually estimated wrong here seems to be the estimated
> >  > > > effective cache size, and thus the cache ratio of page fetches.
> >  > > 
> >  > > Good point, but I think the estimates are only marginally sensitive
> >  > > to estimated cache size (if they're not, we have a problem, considering
> >  > > how poorly we can estimate the kernel's disk buffer size).  It would
> >  > > be interesting for Daniel to try a few different settings of
> >  > > effective_cache_size and see how much the EXPLAIN costs change.
> >  > 
> >  > Well, the number I told him (29370) should clearly prefer the index.
> >  > The estimate is very sensitive to this value :-(
> >  > With 29370 (=229 Mb) the index cost is 1,364 instead of 3,887 with the 
> >  > default of 1000 pages ==> index scan.
> > 
> > But... if I understand it right (effective_cache_size is supposed to relate to kernel 
> > buffer space), it turns out that the estimates differ from reality - my 
> > buffer cache is ca. 50 MB and I still get at least twice the performance with 
> > an index scan instead of a sequential scan - whereas, as Tom explained, things 
> > should be much worse.
> 
> Since pg only reads one 8k page at a time, the IO performance of a seq scan is
> probably not nearly as good as it could be when a lot of other IO is done on the
> same drive.
> 
> First thing you should verify is if there is actually a measurable difference
> in IO throughput on the pg drive during the seq scan and the index scan. (iostat)
> If there is not, then random_page_cost is too high in your scenario.
> (All assuming your data is not still clustered like Tom suggested)
> 
> Andreas
> 

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: again on index usage

From
Daniel Kalchev
Date:
>>> "Zeugswetter Andreas SB SD" said:
 > First thing you should verify is if there is actually a measurable difference
 > in IO throughput on the pg drive during the seq scan and the index scan. (iostat)
 > If there is not, then random_page_cost is too high in your scenario.
 > (All assuming your data is not still clustered like Tom suggested)
 


At this idle time (got to handle another emergency at 5am in the office :) here is 
what I have (sd0 is the 'system' drive, sd1 is where postgres data lives):
tin tout   sps  tps msps   sps  tps msps  usr nic sys int idl
  0   39   831   12  2.0  8962  121  3.6    4  26   7   0  63
  0   13   215    4  7.7  9917  122  3.7    5  24   5   0  66
  0   13   216    3  6.1  7116  115  4.1    5  23   4   0  68
  0   13   220    3  5.0  9401  128  5.0    5  17   4   0  74
  0   13   226    3 12.2  9232  122  3.8    4  24   4   0  67
  0   13   536   26  8.5 11353  147  4.4   13  16   9   0  62
  0   13   259    5  5.8 12102  165  4.1    8  14   8   0  70
  0   13   492   20  7.2 13913  186  4.5    8   9   6   0  76
  0   13   185    2  4.7 11423  184  5.0   14   6   8   0  72
 

running index scan:
  0   13   274    8  4.9  5786  145  4.4   18  10   8   0  64
  0   13   210    3  8.1  5707  153  3.9   20   9   6   0  64
  0   13   286    8  7.7  6283  139  4.3   21   9   8   0  62
  0   13   212    3  9.7  5900  133  3.3   22  13   7   0  58
  0   13   222    3  6.0  5811  148  3.5   20  12   6   0  61
  0   13   350   16  7.5  5640  134  4.1   22  12   7   0  58
 

(seems to be slowing down other I/O :)

running seq scan:
  0   13    50    4  1.9  4787  101  3.8   24  12   7   0  57
  0   13    34    3  5.6  5533  105  3.4   24  12   6   0  58
  0   13    42    4  3.1  5414  103  3.0   25  12   6   0  58
  0   13    26    2  0.0  5542  102  3.9   28  12   6   0  54
  0   13    52    5  2.8  5644  112  4.1   24  11   7   0  58
  0   13    27    2  4.1  6462  122  4.0   26   8   7   0  60
  0   13    36    3  2.0  5616  128  4.2   22   8   7   0  63
 

I can't seem to find any difference... Perhaps this is because the 
'sequential' data is anyway scattered all around the disk.

I have done this test first, now I will try the random() clustering Tom 
suggested (although... isn't random not so random to trust it in this 
scenario? :)

Daniel



Re: again on index usage

From
Daniel Kalchev
Date:
>>> Tom Lane said:
 > It would be interesting to check whether there is any correlation
 > between ipaddr and ipdate in your test data.  Perhaps clustering
 > on ipaddr might not destroy the ordering on ipdate as much as you
 > thought.  A more clearly random-order test would go:
 > 
 > select * into iplog_test from iplog_gate200112 order by random();
 > create index iplog_test_ipdate_idx on iplog_test(ipdate);
 > vacuum verbose analyze iplog_test;
 > << run queries >>
 

NOTICE:  --Relation iplog_test--
NOTICE:  Pages 17761: Changed 17761, reaped 0, Empty 0, New 0; Tup 1706202: 
Vac 0, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 80, MaxLen 88; Re-using: 
Free/Avail. Space 0/0; EndEmpty/Avail. Pages 0/0. CPU 2.36s/0.32u sec.
NOTICE:  Index iplog_test_ipdate_idx: Pages 4681; Tuples 1706202. CPU 
0.26s/1.98u sec.
NOTICE:  --Relation pg_toast_275335644--
NOTICE:  Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0, Keep/VTL 
0/0, Crash 0, UnUsed 0, MinLen 0, MaxLen 0; Re-using: Free/Avail. Space 0/0; 
EndEmpty/Avail. Pages 0/0. CPU 0.00s/0.00u sec.
NOTICE:  Index pg_toast_275335644_idx: Pages 1; Tuples 0. CPU 0.00s/0.00u sec.
NOTICE:  Analyzing...


explain (with enable_seqscan to on)

Aggregate  (cost=56151.97..56151.97 rows=1 width=16)
  ->  Seq Scan on iplog_test  (cost=0.00..56150.54 rows=284 width=16)

average run time  4 minutes

(iostat before seq run)
tin tout   sps  tps msps   sps  tps msps  usr nic sys int idl
  5   44    94    6  5.9  6842  120  3.3   25  11   6   0  58
  1   14    27    1  5.9  5968  138  3.6   22  11   5   0  62
  5   44    58    2  6.3  5647  117  3.0   27   9   6   0  58
  1   13     7    1  5.0  5723  125  3.4   24  10   5   0  61
  0   15    50    2  4.5  5606  110  3.1   27  10   5   0  58
  5   44    90    5  9.1  4702   87  2.5   28  12   4   0  55
  1   15    52    1  6.3  5045  114  4.1   24  10   4   0  61
 

(iostat during seq run)
tin tout   sps  tps msps   sps  tps msps  usr nic sys int idl
  1   40    41    2  2.1  5847  123  3.6   25  11   4   0  60
  1   16   164   13  6.2  7280  128  3.2   28   8   8   0  57
  0   13    36    1  7.9  6059  147  3.9   25   8   5   0  62
  0   13    48    3  5.3  6691  126  3.4   26   8   7   0  59
  0   13    28    3  4.6  6473  103  2.3   28  11   7   0  54
  0   13   138   11  7.6 10848  151  4.9   19   6   6   0  69
  0   13    33    3  3.3  5568  100  3.6   21  16   3   0  59
  0   13    24    2  1.1  6752  144  3.4   22  12   2   0  64
 

(sometime at the end of query run)
tin tout   sps  tps msps   sps  tps msps  usr nic sys int idl
  0   38    20    2  5.5  5621   57  1.2   23  23   3   0  51
  0   13    89    7  7.7  8854  101  3.4   21  18   4   0  57
  0   13    72    6  7.3  9929   88  2.2   18  21   4   0  57
  0   13   129    6  9.6  4865   43  1.0   21  24   4   0  51
  0   13    72    3  4.2  5421   46  0.5   24  22   4   0  50
  0   13    52    2  3.5  5877   64  1.8   22  21   4   0  53
  0   13    50    3  6.1  5561   54  1.7   19  26   3   0  52
  0   13    42    1  3.8  5455   76  1.4   22  22   3   0  53
 

(see lower msps - perhaps other queries slowed down? - but then again returned 
to 'normal')
  0   13   244   20  6.6  6629  199  4.1   19   9   5   0  67
  0   13    68    4  6.2  6080  191  4.3   14  14   3   0  70
  0   13    75    3  5.9  6542  214  4.1   19   8   4   0  70
  0   13   615   18  5.0  5454  129  4.1   20  18   3   0  59
  0   13    88    2  5.7  3718   48  2.5   21  21   4   0  54
  0   13    46    3  3.1  4533   75  2.9   20  19   5   0  56
  0   13   143    7  5.1  4349   58  2.7   22  18   4   0  55
  0   13    58    2  9.9  4038   45  2.0   20  24   4   0  52
  0   13   111    4  5.8  4523   60  3.4   18  22   4   0  56
 

(with enable_seqscan to off)

Aggregate  (cost=85110.08..85110.08 rows=1 width=16)
  ->  Index Scan using iplog_test_ipdate_idx on iplog_test  (cost=0.00..85108.66 rows=284 width=16)

average run time 2 minutes (worst 5 minutes, but this happened only once)

(iostat output)
tin tout   sps  tps msps   sps  tps msps  usr nic sys int idl
  0   38     1    1  6.7  5110  224  3.2    7  16   3   0  73
  0   13    21    2  3.6  5249  219  4.1   12  13   2   0  73
  0   13     0    0 10.0  5341  216  3.9   12  11   4   0  73
  0   13     6    0  9.9  5049  218  3.5   10  14   3   0  72
  0   13     6    0  0.0  7654  216  3.8   10  10   2   0  78
  0   13     2    1  4.0  8758  222  4.1    6  11   4   0  80
  0   13     6    0  9.9  8594  219  4.4    6  10   3   0  81
  0   13     4    1  0.0  7290  210  4.3    6  10   4   0  80
  0   13    36    3  4.9  5912  196  4.7    9  10   4   0  76
  0   13     7    1 13.3  4787  209  3.7    9  17   2   0  72
  0   13     0    0 10.0  4691  209  3.8    9  17   2   0  72
 

This sort of proves that Tom is right about the need for random data. <grin>

Having the chance to play with it at such an idle time in the morning, when 
nobody is supposed to be working with the database, I dared to shut down all 
other sessions.

(idle iostat)

iostat 5
     tty             sd0             sd1                % cpu
tin tout   sps  tps msps   sps  tps msps  usr nic sys int idl
  0  559   148    2  7.7     8    1  3.3    0   0   0   0  99
  0  559   235    4  4.1    30    2  2.0    0   0   1   0  98
  0  559   189    2  8.7     3    0 10.0    0   0   1   0  99
 


seq scan (10 sec)
  0  575 12103  101  9.2 18798  164  1.4   19   0   8   0  73
  0  575 10118   82 11.2 30507  243  0.9   33   0  11   0  55
  0  559 12704  101  9.3  7642   61  0.7    9   0   8   0  83
 


index scan (45 sec)
  0   38     8    1  4.0  6179  375  2.5    1   0   2   0  98
  0   13     4    1  2.5  6208  378  2.4    1   0   2   0  97
  0   13     7    1  3.3  6323  382  2.4    1   0   2   0  97
  0   13     9    2  0.0  6310  389  2.4    1   0   2   0  97
  0   13     0    0  0.0  3648  226  2.3    0   0   1   0  98
 

This proves to me that 15000 RPM Cheetah drives are damn fast at sequential reads.
Why does the index scan win at 'normal' database load??? I would expect that if 
the disk subsystem is overloaded, it will be more overloaded seek-wise than 
throughput-wise. Perhaps this penalizes the 'expected to be faster' sequential 
reads much more than the random-page index reads.

I will try 4.2 with the same data as soon as it is available and see the differences 
- on a separate machine with about the same configuration (same OS, same 
postgres install, same disks, only one 733 MHz CPU instead of two 550 MHz 
CPUs).

Daniel



Re: again on index usage

From
"Zeugswetter Andreas SB SD"
Date:
> This topic seems to come up a lot.  Is there something we are missing in
> the FAQ?

Most of the reports we seem to see are the usual "but the seq scan is actually 
faster" case. Daniel actually has a case where the optimizer chooses a bad plan.

The difficulty seems to be that the optimizer chooses a correct plan for an idle
system, but with his workload the index path would be far better (2 vs. 4 minutes).

This is one of the main problems of the current optimizer, which imho rather 
aggressively chooses seq scans over index scans. During high load this does 
not pay off. My preference would actually be a way to make the optimizer 
choose a plan that causes minimal workload, and not shortest runtime 
(which will obviously only be fast with low overall workload).
The reasoning behind this is that during low workload your response times 
will be good enough with a "bad" plan, but during high workload your response 
times will be best with a plan that produces the least additional workload.

Andreas


Re: again on index usage

From
Bruce Momjian
Date:
Don Baccus wrote:
> Zeugswetter Andreas SB SD wrote:
> 
> 
> > This is one of the main problems of the current optimizer which imho rather 
> > aggressively chooses seq scans over index scans. During high load this does 
> > not pay off.
> 
> 
> Bingo ... dragging huge tables through the buffer cache via a sequential 
> scan guarantees that a) the next query sequentially scanning the same 
> table will have to read every block again (if the table's longer than 
> available PG and OS cache) b) on a high-concurrency system other queries 
> end up doing extra I/O, too.
> 
> Oracle partially mitigates the second effect by refusing to trash its 
> entire buffer cache on any given sequential scan.  Or so I've been told 
> by people who know Oracle well.  A repeat of the sequential scan will 
> still have to reread the entire table but that's true anyway if the 
> table's at least one block longer than available cache.

That is on our TODO list, at least.



Re: again on index usage

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> My preference would actually be a way to make the optimizer
> choose a plan that causes minimal workload, and not shortest runtime 

?? I am not sure that I see the difference.

What I think you are saying is that when there's lots of competing work,
seqscans have less advantage over indexscans because the
sequential-access locality advantage is lost when the disk drive has to
go off and service some other request.  If that's the mechanism, I think
that the appropriate answer is just to reduce random_page_cost.  It's
true that the current default of 4.0 was chosen using measurements on
otherwise-unloaded systems.  If you assume that the system is (a) too
busy to do read-aheads for you, and (b) has to move the disk arm to
service other requests between each of your requests, then it's not
clear that sequential reads have any performance advantage at all :-(.
I don't think I'd go as far as to lower random_page_cost to 1.0, but
certainly there's a case for using an intermediate value.
        regards, tom lane
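Tom's point about how finely balanced the choice is can be illustrated with a deliberately simplified cost comparison. This is not the planner's actual arithmetic, and the selectivity figures (15000 heap pages fetched for 500000 matching rows) are invented; it only shows how lowering random_page_cost moves the estimated index cost toward and below the seq scan cost:

```python
# Toy seq-vs-index cost comparison (NOT PostgreSQL's real cost model).
def seq_scan_cost(pages, tuples, cpu_tuple_cost=0.01):
    # Every page read sequentially at unit cost, plus per-tuple CPU.
    return pages * 1.0 + tuples * cpu_tuple_cost

def index_scan_cost(pages_fetched, rows, random_page_cost,
                    cpu_tuple_cost=0.01):
    # Every heap page fetched pays the random-access penalty.
    return pages_fetched * random_page_cost + rows * cpu_tuple_cost

pages, tuples = 17761, 1706202        # roughly Daniel's iplog_test
pages_fetched, rows = 15000, 500000   # invented selectivity for the qual

for rpc in (4.0, 2.0, 1.0):
    seq = seq_scan_cost(pages, tuples)
    idx = index_scan_cost(pages_fetched, rows, rpc)
    print(f"random_page_cost={rpc}: seq={seq:.0f} index={idx:.0f} "
          f"-> {'index' if idx < seq else 'seq'} scan")
```

With these made-up numbers the plans swap somewhere between random_page_cost 2.0 and 1.0; the point is only that the decision is a near thing, which is exactly the regime where load effects matter.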


Re: again on index usage

From
Don Baccus
Date:
Zeugswetter Andreas SB SD wrote:


> This is one of the main problems of the current optimizer which imho rather 
> aggressively chooses seq scans over index scans. During high load this does 
> not pay off.


Bingo ... dragging huge tables through the buffer cache via a sequential 
scan guarantees that a) the next query sequentially scanning the same 
table will have to read every block again (if the table's longer than 
available PG and OS cache) b) on a high-concurrency system other queries 
end up doing extra I/O, too.

Oracle partially mitigates the second effect by refusing to trash its 
entire buffer cache on any given sequential scan.  Or so I've been told 
by people who know Oracle well.  A repeat of the sequential scan will 
still have to reread the entire table but that's true anyway if the 
table's at least one block longer than available cache.

Of course, Oracle picks sequential scans in horribly and obviously wrong 
cases as well.  On one project over the summer I had a query Oracle 
refused to use an available index on until I told it to do so explicitly, 
and when I did, it sped up by a factor of about 100.

All optimizers will fail miserably for certain queries and datasets.

-- 
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org



Re: again on index usage

From
Don Baccus
Date:
Bruce Momjian wrote:


>>Oracle partially mitigates the second effect by refusing to trash its 
>>entire buffer cache on any given sequential scan.  Or so I've been told 
>>by people who know Oracle well.  A repeat of the sequential scan will 
>>still have to reread the entire table but that's true anyway if the 
>>table's at least one block longer than available cache.
>>
> 
> That is on our TODO list, at least.


I didn't realize this, it's good news.  (I don't follow PG development 
closely these days).

BTW overall I think the cost-estimating portion of the PG optimizer does 
about as well as Oracle's.   Oracle is a lot smarter about doing 
transformations of certain types of queries (turning "scalar in (select 
...)" into something akin to an "exists") but of course this has nothing 
to do with estimating the cost of index vs. sequential scans.


-- 
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org



Re: again on index usage

From
Daniel Kalchev
Date:
>>> Tom Lane said:
 > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
 > > My preference would actually be a way to make the optimizer
 > > choose a plan that causes minimal workload, and not shortest runtime
 > 
 > ?? I am not sure that I see the difference.
 

There can be a difference only if the optimizer takes into account already 
executing plans (by other backends).
 > What I think you are saying is that when there's lots of competing work,
 > seqscans have less advantage over indexscans because the
 > sequential-access locality advantage is lost when the disk drive has to
 > go off and service some other request.
 

This is exactly my point. The primary goal of the optimizer in my opinion 
should be to avoid thrashing. :-) Now, it is not easy to figure out when the 
system starts thrashing - at least not in a portable way I can think of 
immediately.
 > I don't think I'd go as far as to lower random_page_cost to 1.0, but
 > certainly there's a case for using an intermediate value.
 

The question is: how does one find the proper value? That is, is it possible 
to design a planner benchmarking utility to aid in tuning Postgres? One that 
does not just execute a single query and optimize on an idle system.
 

Daniel
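One crude way to approach Daniel's question is to measure the sequential/random read-time ratio directly, which is roughly what random_page_cost models. The sketch below is only a starting point, not a substitute for Tom's methodology: the demo file is tiny, so on any real machine it sits entirely in the OS buffer cache and the ratio collapses toward 1. A trustworthy number needs a dataset much larger than RAM and a cold cache, which is exactly why the original measurements took days of constant disk-banging:

```python
# Rough micro-benchmark for the seq-vs-random read ratio that
# random_page_cost models.  Caveat: this demo file is far too small
# to defeat the OS buffer cache, so treat the ratio as illustrative.
import os
import random
import tempfile
import time

PAGE = 8192
N_PAGES = 2048                       # 16 MB demo file

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\0" * (PAGE * N_PAGES))

def timed_reads(offsets):
    """Read one 8k page at each offset; return elapsed seconds."""
    with open(path, "rb") as f:
        t0 = time.perf_counter()
        for off in offsets:
            f.seek(off)
            f.read(PAGE)
        return time.perf_counter() - t0

seq_t = timed_reads([i * PAGE for i in range(N_PAGES)])

rand_order = [i * PAGE for i in range(N_PAGES)]
random.shuffle(rand_order)
rand_t = timed_reads(rand_order)

os.unlink(path)
print(f"sequential: {seq_t:.4f}s  random: {rand_t:.4f}s  "
      f"ratio (~random_page_cost): {rand_t / seq_t:.2f}")
```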



Re: again on index usage

From
"Ross J. Reedstrom"
Date:
On Fri, Jan 11, 2002 at 11:42:43AM -0500, Bruce Momjian wrote:
> Don Baccus wrote:
> > Zeugswetter Andreas SB SD wrote:
> > 
> > 
> > > This is one of the main problems of the current optimizer which imho rather 
> > > aggressively chooses seq scans over index scans. During high load this does 
> > > not pay off.
> > 
> > 
> > Bingo ... dragging huge tables through the buffer cache via a sequential 
> > scan guarantees that a) the next query sequentially scanning the same 
> > table will have to read every block again (if the table's longer than 
> > available PG and OS cache) b) on a high-concurrency system other queries 
> > end up doing extra I/O, too.
> > 
> > Oracle partially mitigates the second effect by refusing to trash its 
> > entire buffer cache on any given sequential scan.  Or so I've been told 
> > by people who know Oracle well.  A repeat of the sequential scan will 
> > still have to reread the entire table but that's true anyway if the 
> > table's at least one block longer than available cache.
> 
> That is on our TODO list, at least.
> 

Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
scans) would have interesting load-sensitive behavior: since Linux uses
all not-otherwise-in-use RAM as buffer cache, if you've got a low-load
system, even very large tables will be cached. Once other processes start
competing for RAM, your buffers go away. Bruce, what does xBSD do?

I like it, since anything that needs to be sensitive to system wide
information, like the total load on the machine, should be handled by
the system, not a particular app.

Ross


Re: again on index usage

From
Tom Lane
Date:
Daniel Kalchev <daniel@digsys.bg> writes:
>>> I don't think I'd go as far as to lower random_page_cost to 1.0, but
>>> certainly there's a case for using an intermediate value.

> The question is: how does one find the proper value? That is, is it
> possible to design planner benchmarking utility to aid in tuning
> Postgres?

The trouble is that getting trustworthy numbers requires huge test
cases, because you have to swamp out the effects of the kernel's own
buffer caching.  I spent about a week running 24-hour-constant-disk-
banging experiments when I came up with the 4.0 number we use now,
and even then I didn't feel that I had a really solid range of test
cases to back it up.

My advice to you is just to drop it to 2.0 and see if you like the plans
you get any better.
        regards, tom lane


Re: again on index usage

From
Don Baccus
Date:
Ross J. Reedstrom wrote:


> Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
> scans) would have interesting load-sensitive behavior: since Linux uses
> all not-otherwise-in-use RAM as buffer cache, if you've got a low-load
> system, even very large tables will be cached. Once other processes start
> competing for RAM, your buffers go away. Bruce, what does xBSD do?


For people who run dedicated database services simply not using pg 
buffers for sequential scans is probably too simplistic.  Assuming one 
allocates a very large pg buffer space, as I tend to do.  Remember that 
shuffling data between a pg buffer and OS cache buffer takes cycles, and 
modern machines tend to be starved for memory bandwidth - perhaps 
another reason why Oracle requested and got writes that bypass the OS 
cache entirely?  This bypasses the byte-shuffling.

Of course, Oracle's preferred approach is to have you set up your OS 
environment so that Oracle pretty much eats the machine.  They tell you 
to set SHMMAX to 4GB in the installation docs, for instance, then the 
installer by default will configure Oracle to use about 1/3 of your 
available RAM for its buffer cache.  Books on tuning generally tell you 
that's far too low.

Anyway, I've been told that Oracle's approach is more along the lines of 
"don't cache sequential scans that eat up more than N% of our own cache 
space".

Then shorter tables still get fully cached when sequentially scanned, 
while humongous tables don't wipe out the cache causing dirty pages to 
be flushed to the platter and other concurrent processes to do file I/O 
reads because everything but the huge table's disappeared.

Someone in an earlier post mentioned "thrashing" and that's what 
dragging a table bigger than cache causes on busy systems.


> 
> I like it, since anything that needs to be sensitive to system wide
> information, like the total load on the machine, should be handled by
> the system, not a particular app.
> 
> Ross
> 



-- 
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org



Re: again on index usage

From
Bruce Momjian
Date:
Daniel Kalchev wrote:
> >>>Tom Lane said:
>  > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
>  > > My preference would actually be a way to make the optimizer
>  > > choose a plan that causes minimal workload, and not shortest runtime 
>  > 
>  > ?? I am not sure that I see the difference.
> 
> There can be difference only if the optimizer takes into account already 
> executing plans (by other backends).
> 
>  > What I think you are saying is that when there's lots of competing work,
>  > seqscans have less advantage over indexscans because the
>  > sequential-access locality advantage is lost when the disk drive has to
>  > go off and service some other request.
> 
> This is exactly my point. The primary goal of the optimizer in my opinion 
> should be to avoid thrashing. :-) Now, it is not easy to figure out when the 
> system starts thrashing - at least not in a portable way I can think of 
> immediately.

I have always felt some feedback mechanism from the executor back to the
optimizer was required but I was never sure quite how to implement it.



Re: again on index usage

From
Bruce Momjian
Date:
> > That is on our TODO list, at least.
> > 
> 
> Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
> scans) would have interesting load-sensitive behavior: since Linux uses
> all not-otherwise-in-use RAM as buffer cache, if you've got a low-load
> system, even very large tables will be cached. Once other processes start
> competing for RAM, your buffers go away. Bruce, what does xBSD do?

FreeBSD does, NetBSD will soon, not sure about the others.  I believe
NetBSD will be tunable because page cache vs. i/o cache is not always
best done with a FIFO setup.

Also, when we pull from the kernel cache we have to read into our shared
buffers;  much faster than disk i/o but slower than if they were already
in the cache.  For me the idea of doing non-cached sequential scans came
from a Solaris internals book I was reading.  I think it will be
possible to mark large sequential scan buffer i/o with lower priority
caching that may help performance.  However, if others are also doing
sequential scans of the same table _only_, our normal caching may be
best.  As you can see, this gets quite complicated and I am doubtful
there will be a general solution to this problem --- more of a feedback
loop may be the best bet --- determine which sequential scans are wiping
the cache with little other purpose and start not caching them.
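Bruce's "lower priority caching" for large scans can be pictured with a toy buffer pool in which a sequential scan recycles a small private ring of buffers instead of churning the shared LRU list. This is an illustration of the general idea only; it is not PostgreSQL code, and all the sizes are arbitrary:

```python
# Toy buffer pool (illustrative only, not PostgreSQL code): a sequential
# scan is confined to a small ring of buffers so it cannot flush the
# whole cache.
from collections import OrderedDict

class BufferPool:
    def __init__(self, n_buffers, seq_ring=4):
        self.cache = OrderedDict()      # page -> None, LRU order
        self.n_buffers = n_buffers
        self.seq_ring = seq_ring        # max buffers one seq scan may hold
        self.seq_pages = []             # pages the current seq scan holds

    def read(self, page, seq_scan=False):
        hit = page in self.cache
        if hit:
            self.cache.move_to_end(page)        # refresh recency
        elif seq_scan:
            if len(self.seq_pages) >= self.seq_ring:
                victim = self.seq_pages.pop(0)  # recycle the scan's own buffer
                self.cache.pop(victim, None)
            elif len(self.cache) >= self.n_buffers:
                self.cache.popitem(last=False)  # evict true LRU
            self.seq_pages.append(page)
            self.cache[page] = None
        else:
            if len(self.cache) >= self.n_buffers:
                self.cache.popitem(last=False)
            self.cache[page] = None
        return hit

for name, use_ring in (("ring-limited seq scan", True),
                       ("cache-flooding seq scan", False)):
    pool = BufferPool(n_buffers=100)
    for p in range(50):                 # a small, hot working set
        pool.read(p)
    for p in range(1000, 6000):         # huge table scanned once
        pool.read(p, seq_scan=use_ring)
    hits = sum(pool.read(p) for p in range(50))
    print(f"{name}: {hits}/50 hot pages survived")
```

The hot working set survives the big scan when the scan is confined to its ring, and is wiped out when the scan goes through the normal LRU path, which is the effect Don described.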



Re: again on index usage

From
mlw
Date:
Tom Lane wrote:
> What I think you are saying is that when there's lots of competing work,
> seqscans have less advantage over indexscans because the
> sequential-access locality advantage is lost when the disk drive has to
> go off and service some other request.  If that's the mechanism, I think
> that the appropriate answer is just to reduce random_page_cost.  It's
> true that the current default of 4.0 was chosen using measurements on
> otherwise-unloaded systems.  If you assume that the system is (a) too
> busy to do read-aheads for you, and (b) has to move the disk arm to
> service other requests between each of your requests, then it's not
> clear that sequential reads have any performance advantage at all :-(.
> I don't think I'd go as far as to lower random_page_cost to 1.0, but
> certainly there's a case for using an intermediate value.

It is even more interesting than simple sequential vs random scan
thinking. Depending on the maker of the drive, even an unloaded system
will move the head randomly. Modern drives have almost no resemblance to
their predecessors. Sectors are mapped however the OEM sees fit. A
numerically sequential read from a hard disk may have the drive heads
bouncing all over the disk because the internal configuration of the
disk has almost nothing to do with the external representation.

Think about a RAID device. What does a sequential scan mean to a RAID
system? Very little depending on how the image is constructed. Storage
devices are now black boxes. The only predictable advantage a
"sequential scan" can have on a modern computer is OS level caching.


Re: again on index usage

From
Tom Lane
Date:
mlw <markw@mohawksoft.com> writes:
> ... Storage
> devices are now black boxes. The only predictable advantage a
> "sequential scan" can have on a modern computer is OS level caching.

You mean read-ahead.  True enough, but that "only advantage" is very
significant.  The 4.0 number did not come out of the air, it came
from actual measurements.

I think the real point in this thread is that measurements on an idle
system might not extrapolate very well to measurements on a heavily
loaded system.  I can see the point, but I don't really have time to
investigate it right now.  I'd be willing to reduce the default value of
random_page_cost to something around 2, if someone can come up with
experimental evidence justifying it ...
        regards, tom lane


Re: again on index usage

From
Don Baccus
Date:
Bruce Momjian wrote:


> 
> I have always felt some feedback mechanism from the executor back to the
> optimizer was required but I was never sure quite how to implement it.


The folks at DEC (rdb???) wrote a paper on it a long time ago (duh, back 
when DEC existed).  I ran across it in the Tufts library about a year 
ago, back when my girlfriend was in grad school.


-- 
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org



Re: again on index usage

From
Don Baccus
Date:
Bruce Momjian wrote:


> Also, when we pull from the kernel cache we have to read into our shared
> buffers;  much faster than disk i/o but slower than if they were already
> in the cache.


Yes ... this is one reason why folks like Oracle want to be able to 
bypass the kernel cache.

>  For me the idea of doing non-cached sequential scans came
> from a Solaris internals book I was reading.  I think it will be
> possible to mark large sequential scan buffer i/o with lower priority
> caching that may help performance.  However, if others are also doing
> sequential scans of the same table _only_, our normal caching may be
> best.  As you can see, this gets quite complicated and I am doubtful
> there will be a general solution to this problem --- more of a feedback
> loop may be the best bet --- determine which sequential scans are wiping
> the cache with little other purpose and start not caching them.


It would be interesting to learn more about the actual heuristic Oracle 
uses (a straight percentage of the buffer cache?  Something based on 
concurrency?  I have no idea).  The Oracle folks have got tons and tons 
of data on real-world big, busy db installations to draw from when they 
investigate such things.



-- 
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org



Re: again on index usage

From
Daniel Kalchev
Date:
>>> Tom Lane said:
 > mlw <markw@mohawksoft.com> writes:
 > > ... Storage
 > > devices are now black boxes. The only predictable advantage a
 > > "sequential scan" can have on a modern computer is OS level caching.
 > 
 > You mean read-ahead.  True enough, but that "only advantage" is very
 > significant.  The 4.0 number did not come out of the air, it came
 > from actual measurements.
 

On what OS? Linux? Windows? BSD? OSF/1? System V? All these differ 
significantly in how the buffer cache is managed. For example, the BSD 'soft 
updates' will not penalize large directory updates, but will not do any good for 
sequential reads (considering what was said already about modern disks). SCSI 
tagged queueing will significantly improve raw disk reads ('sequential' again) 
because of the low overhead of host<->SCSI subsystem communication - any 
decent SCSI host adapter will do bus-master DMA, without the interference of 
the processor (simplified as much as to illustrate it :). Today's IDE drives on 
PC hardware don't do that! Which is not to say that only SCSI drive 
controllers are intelligent enough - I still remember an older Motorola VME 
based UNIX system (that now can only serve the purpose of a coffee table :), 
where an MFM controller board had all the intelligence of the SCSI subsystem, 
although it operated with 'dumb' MFM disks. Many such examples can be given here.
 > I think the real point in this thread is that measurements on an idle
 > system might not extrapolate very well to measurements on a heavily
 > loaded system.  I can see the point, but I don't really have time to
 > investigate it right now.  I'd be willing to reduce the default value of
 > random_page_cost to something around 2, if someone can come up with
 > experimental evidence justifying it ...
 

Agreed. My preference would be that, if you have reasonable enough test data 
that can be shared, many people on different platforms could run performance 
tests and come up with an array of recommended values for their particular 
OS/hardware configuration. I believe these two items are the most significant 
for the tuning of an installation.

Daniel Kalchev



Re: again on index usage

From
Hannu Krosing
Date:
Don Baccus wrote:
> 
> Zeugswetter Andreas SB SD wrote:
> 
> > This is one of the main problems of the current optimizer which imho rather
> > aggressively chooses seq scans over index scans. During high load this does
> > not pay off.
> 
> Bingo ... dragging huge tables through the buffer cache via a sequential
> scan guarantees that a) the next query sequentially scanning the same
> table will have to read every block again (if the table's longer than
> available PG and OS cache) b) on a high-concurrency system other queries
> end up doing extra I/O, too.
> 
> Oracle partially mitigates the second effect by refusing to trash its
> entire buffer cache on any given sequential scan.  Or so I've been told
> by people who know Oracle well.  A repeat of the sequential scan will
> still have to reread the entire table but that's true anyway if the
> table's at least one block longer than available cache.

One radical way to get better-than-average cache behaviour in such 
pathological cases would be to discard a _random_ page instead of the 
LRU page (perhaps tuned to not select from the 1/N of pages that are
MRU)
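The effect can be sketched with a toy cache simulation (a made-up model, not PostgreSQL's actual buffer manager): with repeated sequential scans of a table larger than the cache, strict LRU gets zero hits, while random eviction keeps part of the table resident.

```python
# Toy buffer-cache simulation (illustrative only): repeated sequential
# scans of a 150-page table through a 100-page cache.  Strict LRU always
# evicts exactly the page the scan will want next, so it never hits;
# evicting a random page instead retains a useful fraction of the table.
import random

CACHE, TABLE, SCANS = 100, 150, 10

def run(evict_random, seed=42):
    rng = random.Random(seed)
    cache, clock, stamp, hits = set(), 0, {}, 0
    for _ in range(SCANS):
        for page in range(TABLE):
            clock += 1
            if page in cache:
                hits += 1
            else:
                if len(cache) >= CACHE:
                    if evict_random:
                        victim = rng.choice(sorted(cache))
                    else:  # LRU: evict the page with the oldest access
                        victim = min(cache, key=stamp.get)
                    cache.remove(victim)
                cache.add(page)
            stamp[page] = clock
    return hits

print("LRU hits:   ", run(evict_random=False))  # 0
print("random hits:", run(evict_random=True))
```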

-------------
Hannu


Re: again on index usage

From
Don Baccus
Date:
Hannu Krosing wrote:


> One radical way to get better-than-average cache behaviour in such 
> pathological cases would be to discard a _random_ page instead of the 
> LRU page (perhaps tuned to not select from the 1/N of pages that are
> MRU)


Yep, that's one of the ways to improve performance when the same table's 
being scanned sequentially multiple times, or where different queries 
sometimes scan it sequentially, other times by index.  MRU would help if 
you're constantly doing sequential scans.

So would flipping the scan order depending on what's in the cache :)

But none of these would mitigate the effects on other concurrent queries 
that don't query the large table at all.
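The MRU point can be sketched the same way (again a toy model, not real PostgreSQL code): once the cache fills, evicting the most-recently-used page leaves the front of the table resident, so every subsequent sequential scan hits on those pages.

```python
# Toy illustration of MRU vs. LRU eviction under repeated sequential
# scans.  With MRU, after the first scan the cache keeps the first
# CACHE-1 pages of the table, so later scans hit on most accesses;
# strict LRU gets zero hits on the same workload.
CACHE, TABLE, SCANS = 100, 150, 10

def run(policy):  # policy: 'lru' or 'mru'
    cache, clock, stamp, hits = set(), 0, {}, 0
    for _ in range(SCANS):
        for page in range(TABLE):
            clock += 1
            if page in cache:
                hits += 1
            else:
                if len(cache) >= CACHE:
                    pick = max if policy == 'mru' else min
                    victim = pick(cache, key=stamp.get)
                    cache.remove(victim)
                cache.add(page)
            stamp[page] = clock
    return hits

print("LRU hits:", run('lru'))  # 0: the scan defeats LRU completely
print("MRU hits:", run('mru'))
```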



-- 
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org



Re: again on index usage

From
"Zeugswetter Andreas SB SD"
Date:
Tom wrote:
> > The question is: how does one find the proper value? That is, is it
> > possible to design planner benchmarking utility to aid in tuning
> > Postgres?
> 
> The trouble is that getting trustworthy numbers requires huge test
> cases, because you have to swamp out the effects of the kernel's own
> buffer caching.  I spent about a week running 24-hour-constant-disk-
> banging experiments when I came up with the 4.0 number we use now,
> and even then I didn't feel that I had a really solid range of test
> cases to back it up.

Well, I certainly think you did a great job on your test cases :-)
If you look at Daniel's idle system he has a measured factor of ~4.8.

The number is also (imho correctly) rather an underestimate than an 
overestimate. That is, I haven't been able to empirically prove mlw's
point about modern disks not being sensitive to scans with larger blocks
vs. random 8k reads.
(My tests typically used raw devices, circumventing the OS completely,
since I do the tests for Informix servers.)

> My advice to you is just to drop it to 2.0 and see if you like the plans
> you get any better.

Yup, maybe even lower. I cannot see a difference in disk IO throughput 
between seq and index scans in Daniel's test case during his normal workload 
test. This is because his system had a CPU bottleneck during that test.
(Shows again how badly OSes distribute processes: 1 CPU serves 2 backends
that have a CPU bottleneck while 1 CPU sits idle.)

Andreas


Re: again on index usage

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Daniel Kalchev
> 
> I tried this:
> 
> db=# select * into iplog_test from iplog_gate200112;
> SELECT
> db=# create index iplog_test_ipaddr_idx on iplog_test(ipaddr);
> CREATE
> db=# cluster iplog_test_ipaddr_idx on iplog_test;
> CLUSTER
> db=# create index iplog_test_ipdate_idx on iplog_test(ipdate);
> CREATE
> db=# explain
> db-# SELECT sum(input), sum(output) FROM iplog_test 

> db-# WHERE 
> db-# '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 
> 00:00:00+02' 

Is there only one ipdate value which satisfies the above where clause ?

regards,
Hiroshi Inoue


Re: again on index usage

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From:  Hiroshi Inoue
> 
> > -----Original Message-----
> > From: Daniel Kalchev
> > 
> > I tried this:
> > 
> > db=# explain
> > db-# SELECT sum(input), sum(output) FROM iplog_test 
> 
> > db-# WHERE 
> > db-# '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 
> > 00:00:00+02' 
> 
> Is there only one ipdate value which satisfies the above where clause ?

If '2001-12-01 00:00:00+02' is the only ipdate value which satisfies
'2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' 
and CREATE INDEX preserves the physical order of equal keys,
the IndexScan would see physically ordered tuples. There's nothing strange
about such a scan being faster than a sequential scan.
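The underlying cost argument can be sketched with a toy model (made up for illustration, not the planner's actual cost function): count the heap-page switches an index scan makes when it visits tuples in index order.

```python
# Toy model of index-scan heap access cost: count page switches when
# visiting tuples in index order.  If the heap is stored in index order
# (clustered), the scan touches each page once, like a sequential scan;
# if the heap order is random, nearly every tuple fetch lands on a
# different page.
import random

TUPLES_PER_PAGE, NPAGES = 100, 1000
pages_in_order = [t // TUPLES_PER_PAGE for t in range(TUPLES_PER_PAGE * NPAGES)]

def page_switches(visit_order):
    switches, last = 0, None
    for p in visit_order:
        if p != last:
            switches += 1
            last = p
    return switches

clustered = page_switches(pages_in_order)   # heap in index order

shuffled = pages_in_order[:]
random.Random(0).shuffle(shuffled)
scattered = page_switches(shuffled)         # heap order random

print("clustered:", clustered)   # 1000 page reads
print("scattered:", scattered)   # close to 100000 page reads
```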

regards,
Hiroshi Inoue 


Re: again on index usage

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Daniel Kalchev [mailto:daniel@digsys.bg]
> 
> Well, we already played around this by first creating new table 
> ordered by 
> random().

This seems to have little meaning if there's only one ipdate value
which satisfies the WHERE clause and the index order matches
the physical order of the heap tuples.

I tried a very simple test case.
I have a table with 100000 tuples and made an index on a
column whose value is always '' (i.e. the same value).
Contrary to my expectation, CREATE INDEX doesn't seem to
preserve the physical (input) order. However, the result is sufficiently
ordered that a scan using the index could probably be expected
to be as fast as a sequential scan.

The result (distribution of corresponding CTIDs) is as follows.

(3448, 17)
(3542, 26) -- (5172, 24)  consecutive 94539 tuples 
(3542, 25) -- (3448, 18)  consecutive 5460 tuples

regards,
Hiroshi Inoue