Thread: Index Scans become Seq Scans after VACUUM ANALYSE
Hello,

While trying to optimise a query I found that running VACUUM ANALYSE changed all the Index Scans to Seq Scans and that the only way to revert to Index Scans was to add "enable_seqscan = 0" in postgresql.conf. Seq Scans are much slower for that specific query. Why does Postgres switch to that method?

PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.4

Output with "enable_seqscan = 0":

gesci5=# explain select p.id_prospect, p.position_prospect, initcap(p1.nom) as nom, initcap(p1.prenom) as prenom, a1.no_tel, a1.no_portable, p.dernier_contact, cn.id_contact, cn.id_vendeur, cn.id_operation, case when p.dernier_contact is not null then cn.date_contact::abstime::int4 else p.cree_le::abstime::int4 end as date_contact, cn.type_contact, cn.nouveau_rdv::abstime::int4 as nouveau_rdv, cn.date_echeance::abstime::int4 as date_echeance, cn.date_reponse::abstime::int4 as date_reponse from prospect p left join personne p1 on (p1.id_personne = p.id_personne1) left join adresse a1 on (a1.id_adresse = p1.id_adresse_principale) left join contact cn on (p.dernier_contact = cn.id_contact) where (p.abandon is null or p.abandon != 'O') order by cn.date_contact desc;
NOTICE:  QUERY PLAN:

Sort  (cost=49442.99..49442.99 rows=24719 width=123)
  ->  Hash Join  (cost=14146.79..46656.05 rows=24719 width=123)
        ->  Merge Join  (cost=9761.33..40724.83 rows=24719 width=66)
              ->  Sort  (cost=9761.33..9761.33 rows=24719 width=49)
                    ->  Merge Join  (cost=0.00..7485.53 rows=24719 width=49)
                          ->  Index Scan using prospect_personne1 on prospect p  (cost=0.00..4322.18 rows=24719 width=22)
                          ->  Index Scan using personne_pkey on personne p1  (cost=0.00..2681.90 rows=44271 width=27)
              ->  Index Scan using adresse_pkey on adresse a1  (cost=0.00..30354.16 rows=95425 width=17)
        ->  Hash  (cost=3242.09..3242.09 rows=30224 width=57)
              ->  Index Scan using contact_pkey on contact cn  (cost=0.00..3242.09 rows=30224 width=57)

Output with "enable_seqscan = 1":

Sort  (cost=18622.67..18622.67 rows=24719 width=123)
  ->  Hash Join  (cost=10034.30..15835.73 rows=24719 width=123)
        ->  Hash Join  (cost=8074.99..12330.65 rows=24719 width=66)
              ->  Hash Join  (cost=2088.54..4629.65 rows=24719 width=49)
                    ->  Seq Scan on prospect p  (cost=0.00..1289.35 rows=24719 width=22)
                    ->  Hash  (cost=1106.71..1106.71 rows=44271 width=27)
                          ->  Seq Scan on personne p1  (cost=0.00..1106.71 rows=44271 width=27)
              ->  Hash  (cost=2561.25..2561.25 rows=95425 width=17)
                    ->  Seq Scan on adresse a1  (cost=0.00..2561.25 rows=95425 width=17)
        ->  Hash  (cost=1036.24..1036.24 rows=30224 width=57)
              ->  Seq Scan on contact cn  (cost=0.00..1036.24 rows=30224 width=57)

--
OENONE: Rebelle à tous nos soins, sourde à tous nos discours,
Voulez-vous sans pitié laisser finir vos jours?
        (Phèdre, J-B Racine, acte 1, scène 3)
Louis-David Mitterrand <vindex@apartia.org> writes:
> While trying to optimise a query I found that running VACUUM ANALYSE
> changed all the Index Scans to Seq Scans and that the only way to revert
> to Index Scans was to add "enable_seqscan = 0" in postgresql.conf.

EXPLAIN ANALYZE output would be more interesting than just EXPLAIN. Also, what does the pg_stats view show for these tables?

regards, tom lane
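For readers following along, a minimal way to pull the relevant rows out of pg_stats (a sketch assuming the 7.2 pg_stats columns and the table names used in the query above):

    -- per-column statistics the planner works from
    SELECT tablename, attname, n_distinct, correlation
      FROM pg_stats
     WHERE tablename IN ('prospect', 'personne', 'adresse', 'contact');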
On Tue, Apr 16, 2002 at 10:41:57AM -0400, Tom Lane wrote:
> Louis-David Mitterrand <vindex@apartia.org> writes:
> > While trying to optimise a query I found that running VACUUM ANALYSE
> > changed all the Index Scans to Seq Scans and that the only way to revert
> > to Index Scans was to add "enable_seqscan = 0" in postgresql.conf.
>
> EXPLAIN ANALYZE output would be more interesting than just EXPLAIN.
> Also, what does the pg_stats view show for these tables?

Thanks, pg_stats output is rather big so I attached it in a separate file. Here are the EXPLAIN ANALYZE outputs:

*********************************
* 1) With "enable_seqscan = 0": *
*********************************

gesci5=# explain analyse select p.id_prospect, p.position_prospect, initcap(p1.nom) as nom, initcap(p1.prenom) as prenom, a1.no_tel, a1.no_portable, p.dernier_contact, cn.id_contact, cn.id_vendeur, cn.id_operation, case when p.dernier_contact is not null then cn.date_contact::abstime::int4 else p.cree_le::abstime::int4 end as date_contact, cn.type_contact, cn.nouveau_rdv::abstime::int4 as nouveau_rdv, cn.date_echeance::abstime::int4 as date_echeance, cn.date_reponse::abstime::int4 as date_reponse from prospect p left join personne p1 on (p1.id_personne = p.id_personne1) left join adresse a1 on (a1.id_adresse = p1.id_adresse_principale) left join contact cn on (p.dernier_contact = cn.id_contact) where (p.abandon is null or p.abandon != 'O') order by cn.date_contact desc;
NOTICE:  QUERY PLAN:

Sort  (cost=49442.99..49442.99 rows=24719 width=123) (actual time=7281.98..7319.91 rows=23038 loops=1)
  ->  Hash Join  (cost=14146.79..46656.05 rows=24719 width=123) (actual time=2619.85..6143.47 rows=23038 loops=1)
        ->  Merge Join  (cost=9761.33..40724.83 rows=24719 width=66) (actual time=2061.31..3362.49 rows=23038 loops=1)
              ->  Sort  (cost=9761.33..9761.33 rows=24719 width=49) (actual time=1912.73..1961.61 rows=23038 loops=1)
                    ->  Merge Join  (cost=0.00..7485.53 rows=24719 width=49) (actual time=42.98..1264.63 rows=23038 loops=1)
                          ->  Index Scan using prospect_personne1 on prospect p  (cost=0.00..4322.18 rows=24719 width=22) (actual time=0.28..528.42 rows=23038 loops=1)
                          ->  Index Scan using personne_pkey on personne p1  (cost=0.00..2681.90 rows=44271 width=27) (actual time=0.18..384.11 rows=44302 loops=1)
              ->  Index Scan using adresse_pkey on adresse a1  (cost=0.00..30354.16 rows=95425 width=17) (actual time=0.44..738.99 rows=95456 loops=1)
        ->  Hash  (cost=3242.09..3242.09 rows=30224 width=57) (actual time=557.04..557.04 rows=0 loops=1)
              ->  Index Scan using contact_pkey on contact cn  (cost=0.00..3242.09 rows=30224 width=57) (actual time=0.26..457.97 rows=30224 loops=1)
Total runtime: 7965.74 msec

EXPLAIN

*********************************
* 2) With "enable_seqscan = 1": *
*********************************

NOTICE:  QUERY PLAN:

Sort  (cost=18622.67..18622.67 rows=24719 width=123) (actual time=10329.09..10367.06 rows=23039 loops=1)
  ->  Hash Join  (cost=10034.30..15835.73 rows=24719 width=123) (actual time=1644.04..9397.53 rows=23039 loops=1)
        ->  Hash Join  (cost=8074.99..12330.65 rows=24719 width=66) (actual time=1110.05..6475.65 rows=23039 loops=1)
              ->  Hash Join  (cost=2088.54..4629.65 rows=24719 width=49) (actual time=385.33..2763.91 rows=23039 loops=1)
                    ->  Seq Scan on prospect p  (cost=0.00..1289.35 rows=24719 width=22) (actual time=0.34..361.31 rows=23039 loops=1)
                    ->  Hash  (cost=1106.71..1106.71 rows=44271 width=27) (actual time=381.91..381.91 rows=0 loops=1)
                          ->  Seq Scan on personne p1  (cost=0.00..1106.71 rows=44271 width=27) (actual time=0.15..246.32 rows=44272 loops=1)
              ->  Hash  (cost=2561.25..2561.25 rows=95425 width=17) (actual time=723.15..723.15 rows=0 loops=1)
                    ->  Seq Scan on adresse a1  (cost=0.00..2561.25 rows=95425 width=17) (actual time=0.17..452.55 rows=95427 loops=1)
        ->  Hash  (cost=1036.24..1036.24 rows=30224 width=57) (actual time=532.87..532.87 rows=0 loops=1)
              ->  Seq Scan on contact cn  (cost=0.00..1036.24 rows=30224 width=57) (actual time=2.54..302.49 rows=30225 loops=1)
Total runtime: 10901.85 msec

EXPLAIN

--
HIPPOLYTE: Mais quels soins désormais peuvent me retarder ?
Assez dans les forêts mon oisive jeunesse
Sur de vils ennemis a montré son adresse.
        (Phèdre, J-B Racine, acte 3, scène 5)
Reading all of this discussion lately about how the planner seems to prefer seqscans in a lot of places where indexes would be better starts making me wonder if some of the assumptions or calculations made to figure costs are wrong...

Anyone have any ideas?

Louis-David Mitterrand wrote:
> On Tue, Apr 16, 2002 at 10:41:57AM -0400, Tom Lane wrote:
>> Louis-David Mitterrand <vindex@apartia.org> writes:
>>> While trying to optimise a query I found that running VACUUM ANALYSE
>>> changed all the Index Scans to Seq Scans and that the only way to revert
>>> to Index Scans was to add "enable_seqscan = 0" in postgresql.conf.
>>
>> EXPLAIN ANALYZE output would be more interesting than just EXPLAIN.
>> Also, what does the pg_stats view show for these tables?
>
> Thanks, pg_stats output is rather big so I attached it in a separate
> file. Here are the EXPLAIN ANALYZE outputs:
>
> ... SNIP ...
I know I know, replying to myself is bad and probably means I'm going insane, but I thought of one other thing...

Realistically the system should choose *ANY* index over a sequential table scan. Above a fairly low number of records any indexed query should be much faster than a seqscan. Am I right, or did I miss something? (wouldn't be the first time I missed something)...

Right now the planner seems to think that index queries are more expensive with a larger width than doing a seqscan on (possibly) more rows with a narrower width.

Michael Loftis wrote:
> Reading all of this discussion lately about how the planner seems to
> prefer seqscans in a lot of places where indexes would be better
> starts making me wonder if some of the assumptions or calculations made to
> figure costs are wrong...
>
> Anyone have any ideas?
>
> Louis-David Mitterrand wrote:
>
>> On Tue, Apr 16, 2002 at 10:41:57AM -0400, Tom Lane wrote:
>>
>>> Louis-David Mitterrand <vindex@apartia.org> writes:
>>>
>>>> While trying to optimise a query I found that running VACUUM ANALYSE
>>>> changed all the Index Scans to Seq Scans and that the only way to
>>>> revert to Index Scans was to add "enable_seqscan = 0" in postgresql.conf.
>>>
>>> EXPLAIN ANALYZE output would be more interesting than just EXPLAIN.
>>> Also, what does the pg_stats view show for these tables?
>>
>> Thanks, pg_stats output is rather big so I attached it in a separate
>> file. Here are the EXPLAIN ANALYZE outputs:
>>
>> ... SNIP ...
On Tue, 16 Apr 2002, Michael Loftis wrote:

> I know I know, replying to myself is bad and probably means I'm going
> insane, but I thought of one other thing...
>
> Realistically the system should choose *ANY* index over a sequential
> table scan. Above a fairly low number of records any indexed query
> should be much faster than a seqscan. Am I right, or did I miss
> something? (wouldn't be the first time I missed something)...

Because the validity information is stored with the row and not in the index, you have to read the rows for any potential hit in the index. Depending on the clustering of the table, the width of the rows and the percentage of the table being hit by the scan (or estimated to be hit), you may read most or all of the table as well as the index, and be paying a penalty for doing it randomly as opposed to sequentially.

IIRC, there are some settings in the configuration that let you play around with the relative costs the estimator uses (the random page cost and cpu costs for dealing with index entries and such).
Michael Loftis <mloftis@wgops.com> writes:
> Reading all of this discussion lately about how the planner seems to
> prefer seqscans in a lot of places where indexes would be better starts
> making me wonder if some of the assumptions or calculations made to figure costs
> are wrong...

Could well be. The sources are open, feel free to take a look ... src/backend/optimizer/path/costsize.c is the epicenter ...

regards, tom lane
Michael Loftis <mloftis@wgops.com> writes:
> Realistically the system should choose *ANY* index over a sequential
> table scan.

Sorry, I do not accept that. You might as well say that we should rip out any attempt at cost estimation, and instead put in two or three lines of brain-dead heuristics. If it were that simple we'd all be using MySQL ;-)

> Above a fairly low number of records any indexed query
> should be much faster than a seqscan.

Isn't that exactly backwards?

regards, tom lane
Louis-David Mitterrand <vindex@apartia.org> writes:
> While trying to optimise a query I found that running VACUUM ANALYSE
> changed all the Index Scans to Seq Scans and that the only way to revert
> to Index Scans was to add "enable_seqscan = 0" in postgresql.conf.
>>
>> EXPLAIN ANALYZE output would be more interesting than just EXPLAIN.
>> Also, what does the pg_stats view show for these tables?

> Thanks, pg_stats output is rather big so I attached it in a separate
> file. Here are the EXPLAIN ANALYZE outputs:

Tell you the truth, I'm having a real hard time getting excited over a bug report that says the planner chose a plan taking 10.90 seconds in preference to one taking 7.96 seconds.

Any time the planner's estimates are within a factor of 2 of reality, I figure it's done very well. The inherent unknowns are so large that anything better really amounts to divination. We can't expect to choose a perfect plan every time --- if we can avoid choosing a truly stupid plan (say, one that takes a couple orders of magnitude more time than the best possible plan) then we ought to be happy.

But having said that, it would be interesting to see if adjusting some of the planner cost parameters would yield better results in your situation. The coarsest of these is random_page_cost, which is presently 4.0 by default. Although I have done some moderately extensive measurements to get that figure, other folks have reported that lower numbers like 3.0 or even less seem to suit their platforms better. In general a lower random_page_cost will favor indexscans...

regards, tom lane
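As a concrete illustration, this knob can be changed per session with SET, so no postgresql.conf edit or restart is needed to experiment; the 3.0 here is only an illustrative value, not a recommendation:

    -- try a lower random_page_cost for this session only
    SET random_page_cost = 3.0;
    -- re-run the EXPLAIN ANALYZE from the original post here and compare plans
    -- then go back to the server default
    RESET random_page_cost;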
Let me add that people's expectations of the optimizer and the "it isn't using the index" questions are getting very old. I have beefed up the FAQ item on this a month ago, but that hasn't reduced the number of questions. I almost want to require people to read a specific FAQ item 4.8 before we will reply to anything.

Maybe that FAQ item needs more info. Tom can't be running around trying to check all these optimizer reports when >90% are just people not understanding the basics of optimization or query performance.

Maybe we need an optimizer FAQ that will answer the basic questions for people.

---------------------------------------------------------------------------

Tom Lane wrote:
> Louis-David Mitterrand <vindex@apartia.org> writes:
> > While trying to optimise a query I found that running VACUUM ANALYSE
> > changed all the Index Scans to Seq Scans and that the only way to revert
> > to Index Scans was to add "enable_seqscan = 0" in postgresql.conf.
> >>
> >> EXPLAIN ANALYZE output would be more interesting than just EXPLAIN.
> >> Also, what does the pg_stats view show for these tables?
>
> > Thanks, pg_stats output is rather big so I attached it in a separate
> > file. Here are the EXPLAIN ANALYZE outputs:
>
> Tell you the truth, I'm having a real hard time getting excited over
> a bug report that says the planner chose a plan taking 10.90 seconds
> in preference to one taking 7.96 seconds.
>
> Any time the planner's estimates are within a factor of 2 of reality,
> I figure it's done very well. The inherent unknowns are so large that
> anything better really amounts to divination. We can't expect to choose a
> perfect plan every time --- if we can avoid choosing a truly stupid plan
> (say, one that takes a couple orders of magnitude more time than the best
> possible plan) then we ought to be happy.
>
> But having said that, it would be interesting to see if adjusting some
> of the planner cost parameters would yield better results in your
> situation. The coarsest of these is random_page_cost, which is
> presently 4.0 by default. Although I have done some moderately
> extensive measurements to get that figure, other folks have reported
> that lower numbers like 3.0 or even less seem to suit their platforms
> better. In general a lower random_page_cost will favor indexscans...
>
> regards, tom lane

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,  |  830 Blythe Avenue
  +  Christ can be your backup.     |  Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
>
> Let me add that people's expectations of the optimizer and the "it isn't using
> the index" questions are getting very old. I have beefed up the FAQ
> item on this a month ago, but that hasn't reduced the number of
> questions. I almost want to require people to read a specific FAQ item
> 4.8 before we will reply to anything.
>
> Maybe that FAQ item needs more info. Tom can't be running around trying
> to check all these optimizer reports when >90% are just people not
> understanding the basics of optimization or query performance.
>
> Maybe we need an optimizer FAQ that will answer the basic questions for
> people.

I think you are missing a huge point: people are confused by the operation of PostgreSQL. You admit that there are a lot of questions about this topic. This means that something is happening which is non-intuitive. Bruce, you are an expert in PostgreSQL, but most people who use it are not. The unexpected behavior is just that, unexpected, or a surprise. Business people, accountants, and engineers do not like surprises.

PostgreSQL's behavior on index usage is totally confusing. If I can paraphrase correctly, PostgreSQL wants to have a good reason to use an index. Most people expect a database to need an undeniable reason NOT to use an index. I would also say, if a DBA created an index, there is a strong indication that there is a need for one! (DBA knowledge vs statistics)

That is the difference; in another post Tom said he could not get excited about a 10.9 second execution time over a 7.96 second execution time. Damn!!! I would. That is wrong.

I have bitched about the index stuff for a while, and always have bumped up against this problem. If I can sway anyone's opinion, I would say, unless there is (using Tom's words) a "factor of 2" planner difference against, I would use an index. Rather than needing clear evidence to use an index, I would say you need clear evidence not to.
mlw <markw@mohawksoft.com> writes: > That is the difference, in another post Tom said he could not get > excited about 10.9 second execution time over a 7.96 execution > time. Damn!!! I would. That is wrong. Sure. Show us how to make the planner's estimates 2x more accurate (on average) than they are now, and I'll get excited too. But forcing indexscan to be chosen over seqscan does not count as making it more accurate. (If you think it does, then you don't need to be in this thread at all; set enable_seqscan = 0 and stop bugging us ;-)) regards, tom lane
> I have bitched about the index stuff for a while, and always have
> bumped up against this problem. If I can sway anyone's opinion, I would say,
> unless there is (using Tom's words) a "factor of 2" planner difference against,
> I would use an index. Rather than needing clear evidence to use an index,
> I would say you need clear evidence not to.

I spend a lot of time answering questions on various database forums and I find that the single thing that most newbies just cannot understand is that a sequential scan is often a lot faster than an index scan. They just cannot comprehend that an index can be slower. Ever. For any query. That is not our problem...

What we could offer though, is more manual control over the planner. People can do this to a mild extent by disabling sequential scans, but it looks like it should be extended...

Chris
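For completeness, the existing override Chris mentions can also be applied per session rather than server-wide in postgresql.conf; a minimal sketch:

    -- heavily penalise sequential scans for this session only
    SET enable_seqscan = off;
    -- ... run the problem query and check the plan ...
    SET enable_seqscan = on;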
Tom Lane wrote:
>
> mlw <markw@mohawksoft.com> writes:
> > That is the difference; in another post Tom said he could not get
> > excited about a 10.9 second execution time over a 7.96 second execution
> > time. Damn!!! I would. That is wrong.
>
> Sure. Show us how to make the planner's estimates 2x more accurate
> (on average) than they are now, and I'll get excited too.
>
> But forcing indexscan to be chosen over seqscan does not count as
> making it more accurate. (If you think it does, then you don't
> need to be in this thread at all; set enable_seqscan = 0 and
> stop bugging us ;-))

Oh, come on Tom, surely I have been around long enough to lend credence to my wish to have a positive effect on PostgreSQL development. enable_seqscan=0 disallows sequential scans; that is not what I am saying. This is a problem I (and others) have been yapping about for a long time. (Please remember, I USE PostgreSQL, I have a vested interest in it being the best RDBMS available.)

I just think there is sufficient evidence to suggest that if a DBA creates an index, there is strong evidence (better than statistics) that the index needs to be used. In the event that an index exists, there is a strong indication that, absent overwhelming evidence to the contrary, the index should be used. You have admitted that statistics suck, but the existence of an index must weigh (heavily) on the evaluation of whether or not to use an index.
Christopher Kings-Lynne wrote:
>
> > I have bitched about the index stuff for a while, and always have
> > bumped up against this problem. If I can sway anyone's opinion, I would say,
> > unless there is (using Tom's words) a "factor of 2" planner difference against,
> > I would use an index. Rather than needing clear evidence to use an index,
> > I would say you need clear evidence not to.
>
> I spend a lot of time answering questions on various database forums and I
> find that the single thing that most newbies just cannot understand is that
> a sequential scan is often a lot faster than an index scan. They just
> cannot comprehend that an index can be slower. Ever. For any query. That
> is not our problem...

Here is the problem, in a single paragraph.

If the DBA notices that there is a problem with a query, he adds an index, he notices that there is no difference, then he notices that PostgreSQL is not using his index. First and foremost he gets mad at PostgreSQL for not using his index. If PostgreSQL decided to use an index which increases execution time, the DBA would delete the index. If PostgreSQL does not use an index, he has to modify the postgresql.conf file, which disallows PostgreSQL from using a sequential scan even when the index would be a clear loser.

My assertion is this: "If a DBA creates an index, he has a basis for his actions."
On Wed, 2002-04-17 at 06:51, mlw wrote: > I just think there is sufficient evidence to suggest that if a DBA creates an > index, there is strong evidence (better than statistics) that the index need be > used. In the event that an index exists, there is a strong indication that, > without overwhelming evidence, that the index should be used. You have admitted > that statistics suck, but the existence of an index must weight (heavily) on > the evaluation on whether or not to use an index. But indexes are not, for the most part, there because of a specific choice to have an index, but as the implementation of PRIMARY KEY and UNIQUE. Therefore the main part of your argument fails. -- Oliver Elphick Oliver.Elphick@lfix.co.uk Isle of Wight http://www.lfix.co.uk/oliver GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C "But as many as received him, to them gave he power to become the sons of God, even to them that believe on his name." John 1:12
Oliver Elphick wrote:
>
> On Wed, 2002-04-17 at 06:51, mlw wrote:
> > I just think there is sufficient evidence to suggest that if a DBA creates an
> > index, there is strong evidence (better than statistics) that the index needs to be
> > used. In the event that an index exists, there is a strong indication that,
> > absent overwhelming evidence to the contrary, the index should be used. You have
> > admitted that statistics suck, but the existence of an index must weigh (heavily)
> > on the evaluation of whether or not to use an index.
>
> But indexes are not, for the most part, there because of a specific
> choice to have an index, but as the implementation of PRIMARY KEY and
> UNIQUE. Therefore the main part of your argument fails.

Let's talk about the primary key; that will not exhibit the borderline behavior that we see. I have had first hand experience (and frustration) with PostgreSQL's choice of whether to use an index. The primary key and UNIQUE constraint will only exhibit reduced performance on REALLY small tables, in which case the reduced performance is minimal if not nonexistent.
> If the DBA notices that there is a problem with a query, he adds an index,
> he notices that there is no difference, then he notices that PostgreSQL is not
> using his index. First and foremost he gets mad at PostgreSQL for not using his
> index. If PostgreSQL decided to use an index which increases execution time,
> the DBA would delete the index. If PostgreSQL does not use an index, he has to
> modify the postgresql.conf file, which disallows PostgreSQL from using a
> sequential scan even when the index would be a clear loser.
>
> My assertion is this: "If a DBA creates an index, he has a basis for his
> actions."

What about a GUC parameter

    prefer_indexes = yes/no

which, when set to yes, assumes the DBA knows what he's doing. Unless the table is really small, in which case it'll still scan.

But then again, if the DBA sets up a huge table (a million rows) and does a select over an indexed field that will return 1/6 of all the rows, then postgres would be nuts to use the index... But then if the DBA does a query to return just 1 of the rows, postgres would be nuts NOT to use the index. How do you handle this situation?

Chris
> > On Wed, 2002-04-17 at 06:51, mlw wrote:
> > > I just think there is sufficient evidence to suggest that if a DBA creates an
> > > index, there is strong evidence (better than statistics) that the index needs to be
> > > used. In the event that an index exists, there is a strong indication that,
> > > absent overwhelming evidence to the contrary, the index should be used. You have
> > > admitted that statistics suck, but the existence of an index must weigh (heavily)
> > > on the evaluation of whether or not to use an index.

In my own limited experience I think this could be solved by decreasing random_page_cost: if you would prefer to use indexes rather than seq scans, you can lower random_page_cost to a point at which postgres works as you want. So the planner would prefer indexes where under standard conditions it would prefer seq scans.

Regards
My opinion. Expose some of the cost factors via run-time settings (or start-time settings). This would allow those who wanted to 'tweak' the planner to do so and those that felt the defaults were fine or didn't know to leave them alone. Comments?
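Most of these knobs do already exist as run-time settings (as noted a bit later in the thread); a sketch of the main ones, with the 7.2 defaults quoted from memory as comments — verify with SHOW on your own server:

    SHOW random_page_cost;                -- default 4.0
    SHOW cpu_tuple_cost;                  -- default 0.01
    SHOW cpu_index_tuple_cost;            -- default 0.001
    SHOW cpu_operator_cost;               -- default 0.0025
    SHOW effective_cache_size;            -- default 1000 (8K pages)
    SET cpu_index_tuple_cost = 0.0005;    -- example per-session tweak only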
Oliver Elphick wrote:
> On Wed, 2002-04-17 at 06:51, mlw wrote:
>
>> I just think there is sufficient evidence to suggest that if a DBA creates an
>> index, there is strong evidence (better than statistics) that the index needs to be
>> used. In the event that an index exists, there is a strong indication that,
>> absent overwhelming evidence to the contrary, the index should be used. You have
>> admitted that statistics suck, but the existence of an index must weigh (heavily)
>> on the evaluation of whether or not to use an index.
>
> But indexes are not, for the most part, there because of a specific
> choice to have an index, but as the implementation of PRIMARY KEY and
> UNIQUE. Therefore the main part of your argument fails.

That is not my experience. Fully 3/4 of the indices in PeopleSoft, SAP, and Clarify (on top of Oracle 8 and 8i backends) are there solely for performance reasons; the remaining 1/4 are there because of uniqueness and primary key responsibilities. In many of the cases where an index backs a primary key it is also there to ensure fast lookups when referenced as a foreign key. Or for joins.
> Here is the problem, in a single paragraph.
>
> If the DBA notices that there is a problem with a query, he adds an index, he
> notices that there is no difference, then he notices that PostgreSQL is not
> using his index. First and foremost he gets mad at PostgreSQL for not using his
> index. If PostgreSQL decided to use an index which increases execution time,
> the DBA would delete the index. If PostgreSQL does not use an index, he has to
> modify the postgresql.conf file, which disallows PostgreSQL from using a
> sequential scan even when the index would be a clear loser.
>
> My assertion is this: "If a DBA creates an index, he has a basis for his
> actions."

I agree with Mark. I jump on this thread to ask some questions:

1) When a DBA creates an index, it is mainly to optimize. But when an index is created, we need to run a VACUUM ANALYZE in order to give the PG optimizer (totally guessed, Tom may correct this affirmation?) knowledge of it. My 1st question is: couldn't we create a kind of trigger to run an automatic VACUUM ANALYZE on the table when a new index is created on it?

Here is an example from a practical optimisation day (taken from an optimisation journal I keep every time I need to optimise a customer's database):

«
Line 962

EXPLAIN SELECT t12_bskid, t12_pnb, t12_lne, t12_tck
FROM T12_20011231
WHERE t12_bskid >= 1
ORDER BY t12_bskid, t12_pnb, t12_tck, t12_lne;

Sort  (cost=1348.70..1348.70 rows=8565 width=16)
  ->  Seq Scan on t12_20011231  (cost=0.00..789.20 rows=8565 width=16)

dbkslight=# create index t12_bskid_pnb_tck_lne on t12_20011231 (t12_bskid, t12_pnb, t12_tck, t12_lne);
CREATE
dbkslight=# EXPLAIN SELECT t12_bskid, t12_pnb, t12_lne, t12_tck
dbkslight-# FROM T12_20011231
dbkslight-# WHERE t12_bskid >= 1
dbkslight-# ORDER BY t12_bskid, t12_pnb, t12_tck, t12_lne;
NOTICE:  QUERY PLAN:

Sort  (cost=1348.70..1348.70 rows=8565 width=16)
  ->  Seq Scan on t12_20011231  (cost=0.00..789.20 rows=8565 width=16)

EXPLAIN
dbkslight=# vacuum analyze t12_20011231;
VACUUM
dbkslight=# EXPLAIN SELECT t12_bskid, t12_pnb, t12_lne, t12_tck
dbkslight-# FROM T12_20011231
dbkslight-# WHERE t12_bskid >= 1
dbkslight-# ORDER BY t12_bskid, t12_pnb, t12_tck, t12_lne;
NOTICE:  QUERY PLAN:

Index Scan using t12_bskid_pnb_tck_lne on t12_20011231  (cost=0.00..2232.11 rows=25693 width=16)

;-))
»

end of example............

2) We all know that indices on small tables have to be dropped, because the seq scan is always cheaper. I wonder about medium-sized tables that are accessed often: their data are mainly in the PG buffers, so a seq scan on a table whose data pages are in the buffers is cheap too :) Then, depending on the memory allowed to PG, can we say indices on medium tables also have to be dropped? I think so, because index maintenance has a cost too.

3) I have to say that queries sometimes have to be rewritten. It is very well explained in the "PostgreSQL Developer's Handbook" I have at home... (at work at the moment, will post complete references later, but surely you can find this book at techdocs.postgresql.org ...). I have experienced it myself many times: joins have to be rewritten... This is really true for outer joins (LEFT/RIGHT join). And it has to be tested with explain plans.

Hope this helps. Regards,

--
Jean-Paul ARGUDO                    IDEALX S.A.S
Consultant bases de données         15-17, av. de Ségur
http://www.idealx.com               F-75007 PARIS
On Wed, 17 Apr 2002, Bruce Momjian wrote:

> Let me add that people's expectations of the optimizer and the "it isn't using
> the index" questions are getting very old. I have beefed up the FAQ
> item on this a month ago, but that hasn't reduced the number of
> questions. I almost want to require people to read a specific FAQ item
> 4.8 before we will reply to anything.
>
> Maybe we need an optimizer FAQ that will answer the basic questions for
> people.

Perhaps you could also include some kind of (possibly nonsensical) keyword like "flerbage" in the FAQ, asking the user to include it in his question to prove he has actually read the relevant section?

Just my 0,02 Euro

Cheers, Tycho

--
Tycho Fruru tycho.fruru@conostix.com
"Prediction is extremely difficult. Especially about the future."
  - Niels Bohr
On Wed, 2002-04-17 at 11:00, mlw wrote:
>
> Here is the problem, in a single paragraph.
>
> If the DBA notices that there is a problem with a query, he adds an index, he
> notices that there is no difference, then he notices that PostgreSQL is not
> using his index.

Perhaps a notice from the backend:

NOTICE: I see the DBA has created a useless index ...

;) Or would this make the DBA even madder ;) ;)

> If PostgreSQL decided to use an index which increases execution time,
> the DBA would delete the index. If PostgreSQL does not use an index, he has to
> modify the postgresql.conf file,

Or just do

set enable_seqscan to off;

> which disallows PostgreSQL from using a sequential scan even when the index
> would be a clear loser.
>
> My assertion is this: "If a DBA creates an index, he has a basis for his
> actions."

The basis can be that "his boss told him to" ?

------------------
Hannu
... > Perhaps you could also include some kind of (possibly nonsensical) keyword > like "flerbage" in the faq, asking the user to include it in his question > to prove he has actually read the relevant section ? *rofl* But now we'll have to choose some other word. - Thomas
huge runtime difference between 2 almost identical queries (was: Re: Index Scans become Seq Scans after VACUUM ANALYSE)
From: Louis-David Mitterrand
On Wed, Apr 17, 2002 at 12:44:24AM -0400, Tom Lane wrote:
> Louis-David Mitterrand <vindex@apartia.org> writes:
> > While trying to optimise a query I found that running VACUUM ANALYSE
> > changed all the Index Scans to Seq Scans and that the only way to revert
> > to Index Scans was to add "enable_seqscan = 0" in postgresql.conf.
> >>
> >> EXPLAIN ANALYZE output would be more interesting than just EXPLAIN.
> >> Also, what does the pg_stats view show for these tables?
>
> > Thanks, pg_stats output is rather big so I attached it in a separate
> > file. Here are the EXPLAIN ANALYZE outputs:
>
> Tell you the truth, I'm having a real hard time getting excited over
> a bug report that says the planner chose a plan taking 10.90 seconds
> in preference to one taking 7.96 seconds.

Now using a reduced test request I have a huge difference in runtime (2317ms vs 4ms) on two almost identical queries. In both cases the where clause uses the same table and pattern, however the slower query's where-clause table appears at the end of the join:

gesci5=# explain analyse select p1.titre, p1.nom, p1.prenom, p2.titre, p2.nom, p2.prenom from personne p1 join prospect p on (p.id_personne1 = p1.id_personne) join personne p2 on (p.id_personne2 = p2.id_personne) join contact cn on (p.dernier_contact = cn.id_contact) where lower(p2.nom) like 'marl%' order by date_contact desc;
NOTICE:  QUERY PLAN:

Sort  (cost=5944.88..5944.88 rows=137 width=82) (actual time=2317.45..2317.45 rows=5 loops=1)
  ->  Nested Loop  (cost=2168.52..5940.00 rows=137 width=82) (actual time=1061.58..2317.28 rows=5 loops=1)
        ->  Hash Join  (cost=2168.52..5208.38 rows=137 width=70) (actual time=1061.23..2316.01 rows=5 loops=1)
              ->  Hash Join  (cost=1406.52..4238.55 rows=27482 width=41) (actual time=355.60..2267.44 rows=27250 loops=1)
                    ->  Seq Scan on personne p1  (cost=0.00..1102.00 rows=44100 width=29) (actual time=0.12..303.42 rows=44100 loops=1)
                    ->  Hash  (cost=1216.82..1216.82 rows=27482 width=12) (actual time=354.64..354.64 rows=0 loops=1)
                          ->  Seq Scan on prospect p  (cost=0.00..1216.82 rows=27482 width=12) (actual time=0.11..257.51 rows=27482 loops=1)
              ->  Hash  (cost=761.45..761.45 rows=220 width=29) (actual time=0.33..0.33 rows=0 loops=1)
                    ->  Index Scan using personne_nom on personne p2  (cost=0.00..761.45 rows=220 width=29) (actual time=0.07..0.29 rows=16 loops=1)
        ->  Index Scan using contact_pkey on contact cn  (cost=0.00..5.31 rows=1 width=12) (actual time=0.22..0.23 rows=1 loops=5)
Total runtime: 2317.77 msec

EXPLAIN

gesci5=# explain analyse select p1.titre, p1.nom, p1.prenom, p2.titre, p2.nom, p2.prenom from personne p1 join prospect p on (p.id_personne1 = p1.id_personne) join personne p2 on (p.id_personne2 = p2.id_personne) join contact cn on (p.dernier_contact = cn.id_contact) where lower(p1.nom) like 'marl%' order by date_contact desc;
NOTICE:  QUERY PLAN:

Sort  (cost=3446.49..3446.49 rows=137 width=82) (actual time=3.85..3.85 rows=5 loops=1)
  ->  Nested Loop  (cost=0.00..3441.61 rows=137 width=82) (actual time=1.86..3.55 rows=5 loops=1)
        ->  Nested Loop  (cost=0.00..2709.99 rows=137 width=70) (actual time=1.81..3.32 rows=5 loops=1)
              ->  Nested Loop  (cost=0.00..2018.40 rows=137 width=41) (actual time=0.58..2.41 rows=10 loops=1)
                    ->  Index Scan using personne_nom on personne p1  (cost=0.00..761.45 rows=220 width=29) (actual time=0.30..0.55 rows=16 loops=1)
                    ->  Index Scan using prospect_personne1 on prospect p  (cost=0.00..5.69 rows=1 width=12) (actual time=0.10..0.11 rows=1 loops=16)
              ->  Index Scan using personne_pkey on personne p2  (cost=0.00..5.02 rows=1 width=29) (actual time=0.08..0.08 rows=0 loops=10)
        ->  Index Scan using contact_pkey on contact cn  (cost=0.00..5.31 rows=1 width=12) (actual time=0.03..0.03 rows=1 loops=5)
Total runtime: 4.17 msec

--
PHEDRE: Il n'est plus temps. Il sait mes ardeurs insensées.
De l'austère pudeur les bornes sont passées.
        (Phèdre, J-B Racine, acte 3, scène 1)
mlw <markw@mohawksoft.com> writes: > If the DBA notices that there is a problem with a query, he adds an > index, he notices that there is no difference, then he notices that > PostgreSQL is not using his index. First and foremost he gets mad at > PostgreSQL for not using his index. If PostgreSQL decided to use an > index which increases execution time, the DBA would delete the > index. I don't buy that argument at all. It might be a unique index that he must have in place for data integrity reasons. It might be an index that he needs for a *different* query. If the table has more than one index available that might be usable with a particular query, how does your argument help? It doesn't. We still have to trust to statistics and cost estimates. So I intend to proceed on the path of improving the estimator, not in the direction of throwing it out in favor of rules-of-thumb. regards, tom lane
Tom Lane wrote:
> mlw <markw@mohawksoft.com> writes:
>
>> If the DBA notices that there is a problem with a query, he adds an
>> index, he notices that there is no difference, then he notices that
>> PostgreSQL is not using his index. First and foremost he gets mad at
>> PostgreSQL for not using his index. If PostgreSQL decided to use an
>> index which increases execution time, the DBA would delete the
>> index.
>
> I don't buy that argument at all. It might be a unique index that he
> must have in place for data integrity reasons. It might be an index
> that he needs for a *different* query.
>
> If the table has more than one index available that might be usable
> with a particular query, how does your argument help? It doesn't.
> We still have to trust to statistics and cost estimates. So I intend
> to proceed on the path of improving the estimator, not in the direction
> of throwing it out in favor of rules-of-thumb.

On this point I fully agree. A cost estimator helps massively in cases where there are multiple candidate indices. My only current complaint is that the current coster needs either a little optimisation work, or perhaps some of its internals (like random_page_cost) exposed in a malleable way. What is valid for one application will not (necessarily) be valid for another application, or set of applications. OLTP has entirely different goals from Data Warehousing.

Throwing out the coster would be throwing the baby out with the bathwater. The coster is *not* satan. It may be the root of a few evils though ]:>
OK, so maybe I'm on some crack, but looking at the docs most of the planner's internal cost estimator constants are exposed... It seems that a few short FAQ entries may be all that's needed to point out that you can flex these values a little to get the desired results.... I guess I should RTFM more :)
> Oh, come on Tom, surely I have been around long enough to lend credence
> to my wish to have a positive effect on PostgreSQL development.

:) Tom does have a way with words sometimes, eh?

> enable_seqscan=0 disallows sequential scans; that is not what I am saying. This
> is a problem I (and others) have been yapping about for a long time.

> I just think there is sufficient evidence to suggest that if a DBA creates an
> index, there is strong evidence (better than statistics) that the index needs to be
> used. In the event that an index exists, there is a strong indication that,
> absent overwhelming evidence to the contrary, the index should be used. You have
> admitted that statistics suck, but the existence of an index must weigh (heavily)
> on the evaluation of whether or not to use an index.

Tom is a mathematician by training, and is trying to balance the optimizer decisions right on the transition between the best and next-best possibility. Biasing it to one decision or another when all of his test cases clearly show the *other* choice would be better puts it in the realm of an arbitrary choice *not* supported by the data!

afaict there are *two* areas which might benefit from analysis or adjustments, but they have to be supported by real test cases:

1) the cost estimates used for each property of the data and storage.

2) the statistical sampling done on actual data during analysis.

The cost numbers have been tested on what is hopefully a representative set of machines. It may be possible to have a test suite which would allow folks to run the same data on many different platforms, and to contribute other test cases for consideration. Perhaps you would like to organize a test suite? Tom, do you already have cases you would like to see in this test?

The statistical sampling (using somewhere between 1 and 10% of the data afaicr) *could* get fooled by pathological storage topologies. So for cases which seem to have reached the "wrong" conclusion we should show *what* values for the above would have arrived at the correct result *without* biasing other potential results in the wrong direction.

Systems which have optimizing planners can *never* be guaranteed to generate the actual lowest-cost query plan. Any impression that Oracle, for example, actually does do that may come from a lack of visibility into the process, and a lack of forum for discussing these edge cases.

Please don't take Tom's claim that he doesn't get excited about wrong planner choices which are not too wrong as an indication that he isn't interested. The point is that for *edge cases* the "correct answer" can never be known until the query is actually run two different ways. And the planner is *never* allowed to do that. So tuning the optimizer over time is the only way to improve things, and with edge cases a factor of two in timing is, statistically, an indication that the results are close to optimal.

We rarely get reports that the planner made the best choice for a plan, but of course people usually don't consider optimal performance to be a reportable problem ;)

- Thomas
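On the sampling side, one related knob is the per-column statistics target, assuming the per-column target that came in with the 7.2 statistics code (the column name here is taken from the queries earlier in the thread, purely as an illustration):

    -- gather a larger sample for a column the planner keeps misjudging
    ALTER TABLE prospect ALTER COLUMN dernier_contact SET STATISTICS 100;
    ANALYZE prospect;

A larger target makes ANALYZE keep more most-common values and histogram buckets for that column, at the cost of a bigger pg_statistic entry and a slower ANALYZE.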
Thomas Lockhart wrote:
>
> Systems which have optimizing planners can *never* be guaranteed to
> generate the actual lowest-cost query plan. Any impression that Oracle,
> for example, actually does do that may come from a lack of visibility
> into the process, and a lack of forum for discussing these edge cases.

I wholly agree... Oracle has some fairly *sick* ideas at times about what to do in the face of partial ambiguity. (I've got a small set of queries that will drive any machine with PeopleSoft DBs loaded to near catatonia...) :)

As far as the 'planner benchmark suite' goes, so we can start gathering more statistical data about what the costs should be — that's an excellent idea.
Thomas Lockhart wrote:
> Systems which have optimizing planners can *never* be guaranteed to
> generate the actual lowest-cost query plan. Any impression that Oracle,
> for example, actually does do that may come from a lack of visibility
> into the process, and a lack of forum for discussing these edge cases.

And herein lies the crux of the problem. It isn't a purely logical/numerical formula. It is a probability estimate, nothing more. Currently, the statistics are used to calculate a probable best query, not a guaranteed best query. The presence of an index should be factored into the probability of a best query, should it not?
...
> I have experienced it myself many times: joins have to be rewritten...
> This is really true for outer joins (LEFT/RIGHT join). And it has to be
> tested with explain plans.

It is particularly true for "join syntax" as used in outer joins because *that bypasses the optimizer entirely*!!! I'd like to see that changed, since the choice of syntax should have no effect on performance. And although the optimizer can adjust query plans to choose the best one (most of the time anyway ;) there is likely to be *only one correct way to write a query using join syntax*. *Every* other choice for the query will be wrong, from a performance standpoint.

That is a bigger "foot gun" than the other things we are talking about, imho.

- Thomas

foot gun (n.) A tool built solely for shooting oneself or another person in the foot.
Re: huge runtime difference between 2 almost identical queries (was: Re: Index Scans become Seq Scans after VACUUM ANALYSE)
From: Tom Lane
Louis-David Mitterrand <vindex@apartia.org> writes:
> gesci5=# explain analyse select p1.titre, p1.nom, p1.prenom, p2.titre, p2.nom, p2.prenom from personne p1 join prospect p on (p.id_personne1 = p1.id_personne) join personne p2 on (p.id_personne2 = p2.id_personne) join contact cn on (p.dernier_contact = cn.id_contact) where lower(p2.nom) like 'marl%' order by date_contact desc;

> gesci5=# explain analyse select p1.titre, p1.nom, p1.prenom, p2.titre, p2.nom, p2.prenom from personne p1 join prospect p on (p.id_personne1 = p1.id_personne) join personne p2 on (p.id_personne2 = p2.id_personne) join contact cn on (p.dernier_contact = cn.id_contact) where lower(p1.nom) like 'marl%' order by date_contact desc;

But these aren't at *all* the same query --- the useful constraint is on p2 in the first case, and p1 in the second. Given the way you've written the join, the constraint on p2 can't be applied until after the p1/p join is formed --- see
http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/explicit-joins.html

I've always thought of the follow-the-join-structure rule as a stopgap measure until we think of something better; it's not intuitive that writing queries using INNER JOIN/ON isn't equivalent to writing FROM/WHERE. On the other hand it provides a useful "out" for those people who are joining umpteen tables and need to short-circuit the planner's search heuristics. If I take it out, I'll get beat up by the same camp that thinks they should be able to override the planner's ideas about whether to use an index ;-)

The EXPLAINs also remind me that we don't currently have any statistics that can be applied for clauses like "lower(p2.nom) like 'marl%'". We've talked in the past about having the system gather and use stats on the values of functional indexes --- for example, if you have an index on lower(p2.nom) then this would allow a rational estimate to be made about the selectivity of "lower(p2.nom) like 'marl%'". But I haven't had any time to pursue it myself. Anyway it doesn't appear that that's causing a bad choice of plan in this case.

regards, tom lane
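To make the difference concrete, here is the same join written both ways; under 7.2 the explicit-JOIN form is joined in exactly the order written, while the plain FROM/WHERE form leaves join order up to the planner (a sketch with an abbreviated select list):

    -- join order is fixed: p1 with p, then p2, then cn
    SELECT p1.nom, p2.nom
      FROM personne p1
      JOIN prospect p  ON (p.id_personne1 = p1.id_personne)
      JOIN personne p2 ON (p.id_personne2 = p2.id_personne)
      JOIN contact cn  ON (p.dernier_contact = cn.id_contact)
     WHERE lower(p2.nom) LIKE 'marl%';

    -- equivalent result, but the planner is free to start from p2
    SELECT p1.nom, p2.nom
      FROM personne p1, prospect p, personne p2, contact cn
     WHERE p.id_personne1 = p1.id_personne
       AND p.id_personne2 = p2.id_personne
       AND p.dernier_contact = cn.id_contact
       AND lower(p2.nom) LIKE 'marl%';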
At 07:15 AM 4/17/02 -0700, Thomas Lockhart wrote:
> Tom is a mathematician by training, and is trying to balance the
> optimizer decisions right on the transition between the best and next-best
> possibility. Biasing it to one decision or another when all of his test
> cases clearly show the *other* choice would be better puts it in the
> realm of an arbitrary choice *not* supported by the data!

I do agree with Mark that in most cases an index is added to increase performance; if the index is used but doesn't improve performance DBAs will drop it to improve insert/update performance (or they will leave it for when the table grows big). Thus the bias should be towards using the index (which may already be the case for most situations).

My guess on why one hears many complaints about Postgresql not using the index is because when things work fine you don't hear complaints :). I also suspect that when Postgresql wrongly uses the index instead of a sequential scan, not as many people bother dropping the index to test for a performance increase. But it may well be that the cost of wrongly using the index is typically not as high as wrongly doing a sequential scan, and that is why people don't get desperate enough to drop the index and grumble about it.

Weighing these factors, perhaps once we get one or two complaining about postgresql using an index vs 20 complaining about not using an index, then the optimizer values have reached a good compromise :). But maybe the ratio should be 1 vs 100? What do you think? ;).

Cheerio, Link.
> > Systems which have optimizing planners can *never* be guaranteed to
> > generate the actual lowest-cost query plan. Any impression that Oracle,
> > for example, actually does do that may come from a lack of visibility
> > into the process, and a lack of forum for discussing these edge cases.

> And herein lies the crux of the problem. It isn't a purely logical/numerical
> formula. It is a probability estimate, nothing more. Currently, the statistics
> are used to calculate a probable best query, not a guaranteed best query. The
> presence of an index should be factored into the probability of a best query,
> should it not?

Well, it is already. I *think* what you are saying is that the numbers should be adjusted to bias the choice toward an index; that *choosing* the index even if the statistics (and hence the average result) will produce a slower query is preferred to trying to choose the lowest cost plan.

afaict we could benefit from more test cases run on more machines. Perhaps we could also benefit from being able to (easily) run multiple versions of plans, so folks can see whether the system is actually choosing the correct one. But until we get better coverage of more test cases on more platforms, adjusting the planner based on a small number of "problem queries" is likely to lead to "problem queries" which weren't problems before!

That is why Tom gets excited about "factor of 10 problems", but not about factors of two. Because he knows that there are lots of queries which happen to fall on the other side of the fence, misestimating the costs by a factor of two *in the other direction*, which you will not notice because that happens to choose the correct plan anyway.

- Thomas
Tom Lane wrote:
> If the table has more than one index available that might be usable
> with a particular query, how does your argument help? It doesn't.
> We still have to trust to statistics and cost estimates. So I intend
> to proceed on the path of improving the estimator, not in the direction
> of throwing it out in favor of rules-of-thumb.

I'm not saying ignore the statistics. The cost analyzer is trying to create a good query based on the information about the table. Since the statistics are a summation of table characteristics, they will never be 100% accurate. Complex systems have behaviors which are far more unpredictable in practice; numbers alone will not predict behavior. Theoretically they can, of course, but the amount and complexity of the information which would need to be processed to make a good prediction would be prohibitive. Think about the work being done in weather prediction. "Rules-of-thumb" are quite important; PostgreSQL already has a number of them — what do you think cpu_tuple_cost and random_page_cost are?

How about a configuration option? Something like an index_weight ratio. A setting of 1.0 would tell the optimizer that if the index and the non-index lookup cost the same, it should use the index. A setting of 2.0 would tell the optimizer that the index cost would need to be twice that of the non-index lookup to avoid using the index.

How about that?
... > Weighing these factors, perhaps once we get one or two complaining about > postgresql using an index vs 20 complaining about not using an index, then > the optimizer values have reached a good compromise :). But maybe the ratio > should be 1 vs 100? :) So we should work on collecting those statistics, rather than statistics on data. What do you think Tom; should we work on a "mailing list based planner" which adjusts numbers from, say, a web site? That is just too funny :))) - Thomas
> Systems which have optimizing planners can *never* be guaranteed to > generate the actual lowest-cost query plan. Any impression that Oracle, > for example, actually does do that may come from a lack of visibility > into the process, and a lack of forum for discussing these edge cases. Hmmm...with PREPARE and EXECUTE, would it be possible to somehow get the planner to actually run a few different selects and then actually store the actual fastest plan? I'm being very fanciful here, of course... Chris
Y'all are having entirely too much fun with this :P

Though the headline... 'PostgreSQL with its proprietary bitch-rant-rating query planner storms the DB front.' does have a certain... entertainment value.

Thomas Lockhart wrote:
> ...
>
>> Weighing these factors, perhaps once we get one or two complaining about
>> postgresql using an index vs 20 complaining about not using an index, then
>> the optimizer values have reached a good compromise :). But maybe the ratio
>> should be 1 vs 100?
>
> :)
>
> So we should work on collecting those statistics, rather than statistics
> on data. What do you think Tom; should we work on a "mailing list based
> planner" which adjusts numbers from, say, a web site? That is just too
> funny :)))
>
> - Thomas
"Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> writes: > On my own few experience I think this could be solved decreasing > random_page_cost, if you would prefer to use indexes than seq scans, then > you can lower random_page_cost to a point in which postgres works as you > want. So the planner would prefer indexes when in standard conditions it > would prefer seq scans. It's entirely possible that the default value of random_page_cost is too high, at least for many modern machines. The experiments I did to get the 4.0 figure were done a couple years ago, on hardware that wasn't exactly new at the time. I have not heard of anyone else trying to measure it though. I don't think I have the source code I used anymore, but the principle is simple enough: 1. Make a large file (several times the size of your machine's RAM, to ensure you swamp out kernel disk buffering effects). Fill with random data. (NB: do not fill with zeroes, some filesystems optimize this away.) 2. Time reading the file sequentially, 8K per read request. Repeat enough to get a statistically trustworthy number. 3. Time reading randomly-chosen 8K pages from the file. Repeat enough to get a trustworthy number (the total volume of pages read should be several times the size of your RAM). 4. Divide. The only tricky thing about this is making sure you are measuring disk access times and not being fooled by re-accessing pages the kernel still has cached from a previous access. (The PG planner does try to account for caching effects, but that's a separate estimate; the value of random_page_cost isn't supposed to include caching effects.) AFAIK the only good way to do that is to use a large test, which means it takes awhile to run; and you need enough spare disk space for a big test file. It'd be interesting to get some numbers for this across a range of hardware, filesystems, etc ... regards, tom lane
From: mlw <markw@mohawksoft.com>
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE
To: Thomas Lockhart <thomas@fourpalms.org>

Thomas Lockhart wrote:
> > ...
> > Weighing these factors, perhaps once we get one or two complaining about
> > postgresql using an index vs 20 complaining about not using an index, then
> > the optimizer values have reached a good compromise :). But maybe the ratio
> > should be 1 vs 100?
>
> :)
>
> So we should work on collecting those statistics, rather than statistics
> on data. What do you think Tom; should we work on a "mailing list based
> planner" which adjusts numbers from, say, a web site? That is just too
> funny :)))

No, you miss the point! In borderline conditions, wrongly using an index does not result in performance as bad as wrongly not using an index, thus usage of an index should be weighted higher, because the risk of not using the index outweighs the risk of using it.
Re: huge runtime difference between 2 almost identical queries (was: Re: Index Scans become Seq Scans after VACUUM ANALYSE)
From: Louis-David Mitterrand
On Wed, Apr 17, 2002 at 10:38:15AM -0400, Tom Lane wrote: > > But these aren't at *all* the same query --- the useful constraint is on > p2 in the first case, and p1 in the second. Given the way you've > written the join, the constraint on p2 can't be applied until after > the p1/p join is formed --- see > http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/explicit-joins.html > > I've always thought of the follow-the-join-structure rule as a stopgap > measure until we think of something better; it's not intuitive that > writing queries using INNER JOIN/ON isn't equivalent to writing FROM/WHERE. > On the other hand it provides a useful "out" for those people who are > joining umpteen tables and need to short-circuit the planner's search > heuristics. If I take it out, I'll get beat up by the same camp that > thinks they should be able to override the planner's ideas about whether > to use an index ;-) Hmm, since 7.1 was released we have religiously converted all our joins to the new syntax, thinking it more politically correct ;-). But now all our beliefs are put into question. Back to old joins, in certain cases. Here is the rule of thumb we deduce from your message: only use explicit join syntax if a left|right|full join is involved OR if the conditional(s) can go into the ON() clause, ELSE use the old join syntax. Is that more or less correct? Preliminary tests converting the query I previously sent you to the old syntax are indeed very impressive: now in both cases (comparison on p1 or p2 takes ~ 1ms). THANKS A LOT FOR THE HEADS UP! -- THESEE: Il fallait, en fuyant, ne pas abandonner Le fer qui dans ses mains aide à te condamner ; (Phèdre, J-B Racine, acte 4, scène 2)
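A minimal sketch of the rule of thumb above, using hypothetical tables t1, t2 and t3 rather than the tables from the original query:

    -- Old-style inner join: the 7.2 planner is free to pick the join order.
    SELECT t1.id, t3.val
    FROM t1, t2, t3
    WHERE t2.t1_id = t1.id
      AND t3.t2_id = t2.id;

    -- Explicit JOIN syntax: in 7.2 the planner follows the join order as written,
    -- so keep it where an outer join is required or where that control is wanted.
    SELECT t1.id, t3.val
    FROM t1
    JOIN t2 ON t2.t1_id = t1.id
    LEFT JOIN t3 ON t3.t2_id = t2.id;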
Thomas Lockhart wrote: > > ... > > I experienced myself many times, joins have to be rewritten... > > This is really true for outer joins (LEFT/RIGHT join). And it has to be > > tested with explain plans. > > It is particularly true for "join syntax" as used in outer joins because > *that bypasses the optimizer entirely*!!! I'd like to see that changed, > since the choice of syntax should have no effect on performance. And > although the optimizer can adjust query plans to choose the best one > (most of the time anyway ;) there is likely to be *only one correct way > to write a query using join syntax*. *Every* other choice for query will > be wrong, from a performance standpoint. > > That is a bigger "foot gun" than the other things we are talking about, > imho. Please keep in mind that before the explicit join syntax "optimization" was available, my 20-way joins took PostgreSQL forever to complete, regardless of whether the genetic query optimizer was enabled or not. Using the explicit join syntax, they return results essentially instantaneously. Perhaps the optimizer can be used for LEFT/RIGHT joins, but I would still like the option to use explicit join orders in order to prohibit the exponential growth in time spent in the optimizer. I'd prefer to have a gun to shoot myself in the foot with than no gun at all. Or rather, I'd prefer to have a "foot gun" than a "server gun". server gun (n.) A tool built solely for shooting a PostgreSQL server when waiting for the completion of a query using a large number of joins ;-) Mike Mascari mascarm@mascari.com > > - Thomas > > foot gun (n.) A tool built solely for shooting oneself or another person > in the foot.
Thomas Lockhart <thomas@fourpalms.org> writes: > It is particularly true for "join syntax" as used in outer joins because > *that bypasses the optimizer entirely*!!! I'd like to see that changed, > since the choice of syntax should have no effect on performance. For an OUTER join, we do have to join in the specified order, don't we? (A left join B) left join C doesn't necessarily produce the same set of rows as A left join (B left join C). The fact that INNER JOIN syntax is currently treated the same way is partly an artifact of implementation convenience, but also partly a response to trouble reports we were receiving about the amount of planning time spent on many-way joins. If we cause the planner to treat A INNER JOIN B the same way it treats "FROM A,B WHERE", I think we'll be back in the soup with the folks using dozen-table joins. Would it make sense to flatten out INNER JOINs only when the total number of tables involved is less than some parameter N? N around six or eight would probably keep the complex-query crowd happy, while not causing unintuitive behavior for simple queries. Anybody who really likes the current behavior could set N=1 to force the system to obey his join order. (There already is a comparable heuristic used for deciding whether to pull up subqueries, but its control parameter isn't separately exposed at the moment.) regards, tom lane
mlw <markw@mohawksoft.com> writes: > On borderline conditions, wrongly using an index does not result in as bad > performance as wrongly not using an index, You're arguing from a completely false premise. It might be true on the particular cases you've looked at, but in general an indexscan-based plan can be many times worse than a seqscan plan. In particular this is likely to hold when the plan has to access most or all of the table. I still remember the first time I got my nose rubbed in this unfortunate fact. I had spent a lot of work improving the planner's handling of sort ordering to the point where it could use an indexscan in place of seqscan-and-sort to handle ORDER BY queries. I proudly committed it, and immediately got complaints that ORDER BY was slower than before on large tables. Considering how slow a large sort operation is, that should give you pause. As for "borderline conditions", how is the planner supposed to know what is borderline? I cannot see any rational justification for putting a thumb on the scales on the side of indexscan (or any other specific plan type) as you've proposed. Thomas correctly points out that you'll just move the planner failures from one area to another. If we can identify a reason why the planner tends to overestimate the costs of indexscan vs seqscan, by all means let's fix that. But let's not derive cost estimates that are the best we know how to make and then ignore them. regards, tom lane
Re: huge runtime difference between 2 almost identical queries (was: Re: Index Scans become Seq Scans after VACUUM ANALYSE)
From: Tom Lane
Louis-David Mitterrand <vindex@apartia.org> writes: > Hmm, since 7.1 released we have religiously converted all our joins to > the new syntax, thinking it more politically correct ;-). But now all > our beliefs are put into question. Back to old joins, in certain cases. > Here is the rule of thumb we deduct from your message: only use explicit > join syntax if a left|right|full join is involved OR if the > conditional(s) can go into the ON() clause, ELSE use the old join > syntax. I don't see that the ON clause has anything to do with it. You must use the JOIN syntax for any kind of outer join, of course. For an inner join, the planner currently has a better shot at choosing the right plan if you don't use JOIN syntax. See nearby thread for some discussion about tweaking this aspect of the planner's behavior. regards, tom lane
On Wed, 2002-04-17 at 17:16, Tom Lane wrote: > > It's entirely possible that the default value of random_page_cost is too > high, at least for many modern machines. The experiments I did to get > the 4.0 figure were done a couple years ago, on hardware that wasn't > exactly new at the time. I have not heard of anyone else trying to > measure it though. > > I don't think I have the source code I used anymore, but the principle > is simple enough: > > 1. Make a large file (several times the size of your machine's RAM, to > ensure you swamp out kernel disk buffering effects). Fill with random > data. (NB: do not fill with zeroes, some filesystems optimize this away.) People running postgres often already have large files of random data under $PGDATA directory :) > 2. Time reading the file sequentially, 8K per read request. > Repeat enough to get a statistically trustworthy number. > > 3. Time reading randomly-chosen 8K pages from the file. Repeat > enough to get a trustworthy number (the total volume of pages read > should be several times the size of your RAM). > > 4. Divide. > > The only tricky thing about this is making sure you are measuring disk > access times and not being fooled by re-accessing pages the kernel still > has cached from a previous access. (The PG planner does try to account > for caching effects, but that's a separate estimate; the value of > random_page_cost isn't supposed to include caching effects.) AFAIK the > only good way to do that is to use a large test, which means it takes > awhile to run; and you need enough spare disk space for a big test file. If you have the machine all for yourself you can usually tell it to use less RAM at boot time. On linux it is append=" mem=32M" switch in lilo.conf or just mem=32M on lilo boot command line. > It'd be interesting to get some numbers for this across a range of > hardware, filesystems, etc ... --------------- Hannu
On Wed, 2002-04-17 at 19:15, Hannu Krosing wrote: > On Wed, 2002-04-17 at 17:16, Tom Lane wrote: > > > > It's entirely possible that the default value of random_page_cost is too > > high, at least for many modern machines. The experiments I did to get > > the 4.0 figure were done a couple years ago, on hardware that wasn't > > exactly new at the time. I have not heard of anyone else trying to > > measure it though. > > > > I don't think I have the source code I used anymore, but the principle > > is simple enough: > > > > 1. Make a large file (several times the size of your machine's RAM, to > > ensure you swamp out kernel disk buffering effects). Fill with random > > data. (NB: do not fill with zeroes, some filesystems optimize this away.) > > People running postgres often already have large files of random data > under $PGDATA directory :) OTOH, it is also important where the file is on disk. As seen from disk speed test graphs on http://www.tomshardware.com , the speed difference of sequential reads is 1.5 to 2.5 between inner and outer tracks. > > 2. Time reading the file sequentially, 8K per read request. > > Repeat enough to get a statistically trustworthy number. > > > > 3. Time reading randomly-chosen 8K pages from the file. Repeat > > enough to get a trustworthy number (the total volume of pages read > > should be several times the size of your RAM). > > > > 4. Divide. > > > > The only tricky thing about this is making sure you are measuring disk > > access times and not being fooled by re-accessing pages the kernel still > > has cached from a previous access. (The PG planner does try to account > > for caching effects, but that's a separate estimate; Will it make the random and seq read cost equal when cache size > database size and enough queries are performed to assume that all data is in cache. Also, can it distinguish between data in pg internal cache (shared memory) and data in OS filesystem cache ? > > the value of > > random_page_cost isn't supposed to include caching effects.) AFAIK the > > only good way to do that is to use a large test, which means it takes > > awhile to run; and you need enough spare disk space for a big test file. > > If you have the machine all for yourself you can usually tell it to use > less RAM at boot time. > > On linux it is append=" mem=32M" switch in lilo.conf or just mem=32M on > lilo boot command line. > > > It'd be interesting to get some numbers for this across a range of > > hardware, filesystems, etc ... > > --------------- > Hannu > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org
Tom Lane wrote: > > mlw <markw@mohawksoft.com> writes: > > On borderline conditions, wrongly using an index does not result in as bad > > performance as wrongly not using an index, > > You're arguing from a completely false premise. It might be true on the > particular cases you've looked at, but in general an indexscan-based > plan can be many times worse than a seqscan plan. OK, I'll grant you that, but I am talking about the space between when it is clear that an index is useful and when it is clear that it is not. For some reason you seem to think I am saying "always use an index," when, in fact, I am saying more preference should be given to using an index than it currently gets. > As for "borderline conditions", how is the planner supposed to know what > is borderline? It need not know about borderline conditions. > > I cannot see any rational justification for putting a thumb on the > scales on the side of indexscan (or any other specific plan type) > as you've proposed. Thomas correctly points out that you'll just move > the planner failures from one area to another. I don't think this is true, and you yourself had said you are not too worried about a 10 vs 8 second difference. I have seen many instances of when PostgreSQL refuses to use an index because the data distribution is uneven. Making it more difficult for the planner to ignore an index would solve practically all the problems I have seen, and I bet the range of instances where it would incorrectly use an index would not impact performance as badly as those instances where it doesn't. > > If we can identify a reason why the planner tends to overestimate the > costs of indexscan vs seqscan, by all means let's fix that. But let's > not derive cost estimates that are the best we know how to make and > then ignore them. I don't think you can solve this with statistics. It is a far more complex problem than that. There are too many variables, there is no way a standardized summation will accurately characterize all possible tables. There must be a way to add heuristics to the cost based analyzer.
Hannu Krosing <hannu@tm.ee> writes: > OTOH, it is also important where the file is on disk. As seen from disk > speed test graphs on http://www.tomshardware.com , the speed difference > of sequential reads is 1.5 to 2.5 between inner and outer tracks. True. But if we use the same test file for both the sequential and random-access timings, hopefully the absolute speed of access will cancel out. (Again, it's the sort of thing that could use some real-world testing...) > (The PG planner does try to account > for caching effects, but that's a separate estimate; > Will it make the random and seq read cost equal when cache size > > database size and enough queries are performed to assume that all data > is in cache. There isn't any attempt to account for the effects of data having been read into cache by previous queries. I doubt that it would improve the model to try to keep track of what the recent queries were --- for one thing, do you really want your plans changing on the basis of activity of other backends? One place where this does fall down is in nestloops with inner index scans --- if we know that the inner query will be evaluated multiple times, then we should give it some kind of discount for cache effects. Investigating this is on the todo list... > Also, can it distinguish between data in pg internal cache (shared > memory) and data in OS filesystem cache ? Currently we treat those alike. Yeah, the OS cache is slower to get to, but in comparison to a physical disk read I think the difference is insignificant. regards, tom lane
mlw <markw@mohawksoft.com> writes: > ... I have seen many instances of when > PostgreSQL refuses to use an index because the data distribution is uneven. This is fixed, or at least attacked, in 7.2. Again, I do not see this as an argument for making the planner stupider instead of smarter. regards, tom lane
Tom Lane wrote: > > mlw <markw@mohawksoft.com> writes: > > ... I have seen many instances of when > > PostgreSQL refuses to use an index because the data distribution is uneven. > > This is fixed, or at least attacked, in 7.2. Again, I do not see this > as an argument for making the planner stupider instead of smarter. > You completely ignored the point I was trying to make. Statistics are a summation of the data, not the actual data. As such, they cannot possibly represent all possible configurations of tables. Adding heuristics, such as weighting for index scans, is not making the planner stupider. It is making it smarter and more flexible.
On Wed, Apr 17, 2002 at 12:35:13PM -0400, mlw wrote: > about a 10 vs 8 second difference. I have seen many instances of when > PostgreSQL refuses to use an index because the data distribution is uneven. > Making it more difficult for the planer to ignore an index would solve > practically all the problems I have seen, and I bet the range of instances > where it would incorrectly use an index would not impact performance as badly > as those instances where it doesn't. You bet, eh? Numbers, please. The best evidence that anyone has been able to generate is _already_ the basis for the choices the planner makes. If you can come up with other cases where it consistently makes the wrong choice, good: that's data to work with. Maybe it'll expose whatever it is that's wrong. But it is not a general case, anyway, so you can't draw any conclusion at all about other cases from your case. And Tom Lane is right: the repair is _not_ to use some rule of thumb that an index is probably there for a reason. Given the apparent infrequency of docs-consultation, I am considerably less sanguine than you are about the correctness of the choices many DBAs make. Poking at the planner to make it use an index more often strikes me as at least as likely to cause worse performance. > I don't think you can solve this with statistics. It is a far more > complex problem than that. Aw, you just need to take more stats courses ;) A -- ---- Andrew Sullivan 87 Mowat Avenue Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M6K 3E3 +1 416 646 3304 x110
mlw writes: > Adding huristics, such as weighting for index scans, is not making the planner > stupider. It is making it smarter and more flexable. If life was as simple as index or no index then this might make some sense. But in general the planner has a whole bunch of choices of join plans, sorts, scans, and the cost of an individual index scan is hidden down somewhere in the leaf nodes, so you can't simply say that plans of type X should be preferred when the cost estimates are close. -- Peter Eisentraut peter_e@gmx.net
Andrew Sullivan wrote: > Given the apparent infrequency of docs-consultation, I am > considerably less sanguine than you are about the correctness of the > choices many DBAs make. Poking at the planner to make it use an > index more often strikes me as at least as likely to cause worse > performance. I disagree :-) > > > I don't think you can solve this with statistics. It is a far more > > complex problem than that. > > Aw, you just need to take more stats courses ;) You need to move away from the view that everything is calculable and deterministic and move over to the more chaotic perspective where "more likely than not" is about the best one can hope for. The cost based optimizer is just such a system. There are so many things that can affect the performance of a query that there is no way to adequately model them. Disk performance, inner/outer tracks, RAID systems, concurrent system activity, and so on. Look at the pgbench utility. I can't run that program without a +- 10% variation from run to run, no matter how many times I run vacuum and checkpoint. When the estimated cost ranges of the different planner strategies overlap, I think that is a case where two approximations with indeterminate precision must be evaluated. In such cases, the variance between the numbers has little or no absolute relevance to one another. This is where heuristics and a bit of fuzziness needs to be applied. Favoring an index scan over a sequential scan would probably generate a better query.
Peter Eisentraut wrote: > > mlw writes: > > > Adding huristics, such as weighting for index scans, is not making the planner > > stupider. It is making it smarter and more flexable. > > If life was as simple as index or no index then this might make some > sense. But in general the planner has a whole bunch of choices of join > plans, sorts, scans, and the cost of an individual index scan is hidden > down somewhere in the leaf nodes, so you can't simply say that plans of > type X should be preferred when the cost estimates are close. > No doubt, no one is arguing that it is easy, but as I said in a branch of this discussion, when the planner has multiple choices, and the cost ranges overlap, the relative numbers are not so meaningful that heuristics would not improve the algorithm.
The fact that an index exists adds a choice -- so by no means is the index ignored. But just because a freeway exists across town doesn't make it faster than the side roads. It depends on the day of the week, time of day, and uncontrollable anomalies (accidents). -- Rod Taylor Your eyes are weary from staring at the CRT. You feel sleepy. Notice how restful it is to watch the cursor blink. Close your eyes. The opinions stated above are yours. You cannot imagine why you ever felt otherwise. ----- Original Message ----- From: "mlw" <markw@mohawksoft.com> To: "Thomas Lockhart" <thomas@fourpalms.org> Cc: "Tom Lane" <tgl@sss.pgh.pa.us>; "Bruce Momjian" <pgman@candle.pha.pa.us>; "Louis-David Mitterrand" <vindex@apartia.org>; <pgsql-hackers@postgresql.org> Sent: Wednesday, April 17, 2002 10:31 AM Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE > Thomas Lockhart wrote: > > Systems which have optimizing planners can *never* be guaranteed to > > generate the actual lowest-cost query plan. Any impression that Oracle, > > for example, actually does do that may come from a lack of visibility > > into the process, and a lack of forum for discussing these edge cases. > > And here in lies the crux of the problem. It isn't a purely logical/numerical > formula. It is a probability estimate, nothing more. Currently, the statistics > are used to calculate a probable best query, not a guaranteed best query. The > presence of an index should be factored into the probability of a best query, > should it not? > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >
Andrew Sullivan wrote: > You haven't shown anything except a couple of anecdotal reports as > evidence against his view. Anyone who asks you for more evidence > gets treated to a remark that statistics won't do everything in this > case. I do not, currently, have access to systems which exhibit the behavior, but I have worked with PostgreSQL quite a bit, and have done a number of projects with it, and have seen the issue first hand and have had to work around it. I have posted detailed data to this list in the past. > You'll need something stronger than an antipathy for > statistical methods to support your position. I do not have an antipathy for statistics at all, however statistics are a reduction of data. They represent a number of properties obtained from a larger group. For "statistics" to be useful, the trends or characteristics you calculate must apply to the problem. Oracle has a cost based optimizer, and they allow you to override it, offer hints as to what it should do, or use the rules based optimizer. They know that a cost based optimizer cannot generate the best query all the time. > The heuristic model > you propose is a fundamental shift from the current attempts to make > the planner choose better plans on the basis of what's in the > database. You're saying it can never know enough. And I say, prove > it. I say it is obvious it can never know enough, since statistics are a summation of the data set from which they were obtained, thus they cannot contain all the information about that data set unless they are at least as large as the data set. > > > to one another. This is where heuristics and a bit of fuzziness > > needs to be applied. Favoring an index scan over a sequential scan > > would probably generate a better query. > > Tom has argued, several times, with reference to actual cases, why > that is false. Repeating your view doesn't make it so. For some reason, it seems that Tom is under the impression that I am saying "always use an index" when that is not what I am saying at all. Here is the logical argument: (correct me if I am wrong) (1) The table statistics are a summation of the properties of the table which, among other things, are thought to affect query performance. (2) The planner uses the statistics to create estimates about how a strategy will perform. (3) The "estimates" based on the table statistics have a degree of uncertainty, because they are based on the statistical information about the table, not the table itself. The only way to be 100% sure you get the best query is to try all the permutations of the query. (4) Since the generated estimates have a degree of uncertainty, when multiple query paths are evaluated, the planner will choose a suboptimal query once in a while. Now my argument, based on my personal experience, and based on Tom's own statement that +- 2 seconds on a 10 second query is not something he gets excited about, is this: When the planner is presented with a number of possible plans, it must weigh the cost estimates. If there is a choice between two plans which are within some percent range of each other, we can fall into a risk analysis. For instance: say we have two similarly performing plans, close to one another, say within 20%, one plan uses an index, and one does not. It is unlikely that the index plan will perform substantially worse than the non-index plan, right? That is the point of the cost estimates, right?
Now, given the choice of the two strategies on a table, both pretty close to one another, the risk of poor performance for using the index scan is minimal based on the statistics, but the risk of poor performance for using the sequential scan is quite high on a large table. Does anyone disagree?
On Wed, Apr 17, 2002 at 04:28:03PM -0400, mlw wrote: > Oracle has a cost based optimizer, and they allow you to override > it, offer hints as to what it should do, or use the rules based > optimizer. They know that a cost based optimizer can not generate > the best query all the time. Oracle's the wrong competition to cite here. IBM's optimiser and planner in DB2 is rather difficult to override; IBM actively discourages doing so. That's because it's the best there is. It's _far_ better than Oracle's, and has ever been so. It just about _always_ gets it right. Without presuming to speak for him, I'd suggest that Tom probably wants to get the planner to that level, rather than adding band-aids. > I say it is obvious it can never know enough, since statistics are Enough for what? The idea is that the statistics will get you the best-bet plan. You're trying to redefine what the best bet is; and Tom and others have suggested that a simple rule of thumb, "All else being more or less equal, prefer an index," is not a good one. > Now, given the choice of the two strategies on a table, both pretty > close to one another, the risk of poor performance for using the > index scan is minimal based on the statistics, but the risk of poor > performance for using the sequential scan is quite high on a large > table. I thought that's what the various cost estimates were there to cover. If this is all you're saying, then the feature is already there. -- ---- Andrew Sullivan 87 Mowat Avenue Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M6K 3E3 +1 416 646 3304 x110
mlw wrote: > Now, given the choice of the two strategies on a table, both pretty close to > one another, the risk of poor performance for using the index scan is minimal > based on the statistics, but the risk of poor performance for using the > sequential scan is quite high on a large table. Wow, what did I start here? OK, let me see if I can explain why the idea of an index being present is not significant, and also explain why doing a sequential scan is _less_ risky than an index scan. First, if an admin creates an index, it does mean he thinks it will help, but is he right? You could say that if they create the index, use it, and if the admin finds it makes the query slower, he can then remove it, and this does give him some control over the optimizer. However, this assumes two things. First, it assumes the admin will actually check to see if the index helps, and if it doesn't, remove it, but more importantly, it assumes there is only one type of query for that table. That is the biggest fallacy. If I do: SELECT * FROM tab WHERE col = 0; I may be selecting 70% of the table, and an index scan will take forever if 70% of the table (plus index pages) is significantly larger than the cache size; every row lookup will have to hit the disk! However, if I do: SELECT * FROM tab WHERE col = 89823; and 89823 is a rare value, perhaps only one row in the table, then an index would be good to use, so yes, indexes can be added by admins to improve performance, but the admin is creating the index probably for the second query, and certainly doesn't want the index used for the first query. Also, these are simple queries. Add multiple tables and join methods, and the idea that an admin creating an index could in any way control these cases is implausible. My second point, that index scan is more risky than sequential scan, is outlined above. A sequential scan reads each page once, and uses the file system read-ahead code to prefetch the disk buffers. Index scans are random, and could easily re-read disk pages to plow through a significant portion of the table, and because the reads are random, the file system will not prefetch the rows so the index scan will have to wait for each non-cache-resident row to come in from disk. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
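A sketch of how one might check this, assuming Bruce's hypothetical table tab with an index on col and a skewed value distribution:

    -- Matches roughly 70% of the rows: a sequential scan should win here.
    EXPLAIN SELECT * FROM tab WHERE col = 0;

    -- Matches perhaps one row: the index scan should win here.
    EXPLAIN SELECT * FROM tab WHERE col = 89823;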
Andrew Sullivan wrote: > > Now, given the choice of the two strategies on a table, both pretty > > close to one another, the risk of poor performance for using the > > index scan is minimal based on the statistics, but the risk of poor > > performance for using the sequential scan is quite high on a large > > table. > > I thought that's what the various cost estimates were there to cover. > If this is all you're saying, then the feature is already there. The point is that if the index plan is < 20% more costly than the sequential scan, it is probably less risky.
mlw wrote: > Andrew Sullivan wrote: > > > > Now, given the choice of the two strategies on a table, both pretty > > > close to one another, the risk of poor performance for using the > > > index scan is minimal based on the statistics, but the risk of poor > > > performance for using the sequential scan is quite high on a large > > > table. > > > > I thought that's what the various cost estimates were there to cover. > > If this is all you're saying, then the feature is already there. > > The point is that if the index plan is < 20% more costly than the sequential > scan, it is probably less risky. I just posted on this topic. Index scan is more risky, no question about it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: > > mlw wrote: > > Now, given the choice of the two strategies on a table, both pretty close to > > one another, the risk of poor performance for using the index scan is minimal > > based on the statistics, but the risk of poor performance for using the > > sequential scan is quite high on a large table. > My second point, that index scan is more risky than sequential scan, is > outlined above. A sequential scan reads each page once, and uses the > file system read-ahead code to prefetch the disk buffers. Index scans > are random, and could easily re-read disk pages to plow through a > significant portion of the table, and because the reads are random, > the file system will not prefetch the rows so the index scan will have > to wait for each non-cache-resident row to come in from disk. That is a very interesting point, but shouldn't that be factored into the cost (random_tuple_cost?) In which case my point still stands.
mlw wrote: > Bruce Momjian wrote: > > > > mlw wrote: > > > Now, given the choice of the two strategies on a table, both pretty close to > > > one another, the risk of poor performance for using the index scan is minimal > > > based on the statistics, but the risk of poor performance for using the > > > sequential scan is quite high on a large table. > > > My second point, that index scan is more risky than sequential scan, is > > outlined above. A sequential scan reads each page once, and uses the > > file system read-ahead code to prefetch the disk buffers. Index scans > > are random, and could easily re-read disk pages to plow through a > > significant portion of the table, and because the reads are random, > > the file system will not prefetch the rows so the index scan will have > > to wait for each non-cache-resident row to come in from disk. > > That is a very interesting point, but shouldn't that be factored into the cost > (random_tuple_cost?) In which case my point still stands. Yes, I see your point. I think on the high end that index scans can get very expensive if you start to do lots of cache misses and have to wait for i/o. I know the random cost is 4, but I think that number is not linear. It can be much higher for lots of cache misses and waiting for I/O, and think that is why it feels more risky to do an index scan on a sample size that is not perfectly known. Actually, you pretty much can know sequential scan size because you know the number of blocks in the table. It is index scan that is more unknown because you don't know how many index lookups you will need, and how well they will stay in the cache. Does that help? Wow, this _is_ confusing. I am still looking for that holy grail that will allow this all to be codified so others can learn from it and we don't have to rehash this repeatedly, but frankly, this whole discussion is covering new ground that we haven't covered yet. (Maybe TODO.detail this discussion and point to it from the FAQ.) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Michael Loftis wrote: > As far as the 'planner benchmark suite' so we can start gathering more > statistical data about what costs should be, or are better at, that's an > excellent idea. People with different hardware have different random page costs, clearly. Even different workloads affect it. Added to TODO: * Add utility to compute accurate random_page_cost value -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: > My second point, that index scan is more risky than sequential scan, is > outlined above. A sequential scan reads each page once, and uses the > file system read-ahead code to prefetch the disk buffers. Index scans > are random, and could easily re-read disk pages to plow through a > significant portion of the table, and because the reads are random, > the file system will not prefetch the rows so the index scan will have > to wait for each non-cache-resident row to come in from disk. It took a bike ride to think about this one. The supposed advantage of a sequential read over a random read, in an active multitasking system, is a myth. If you are executing one query and the system is doing only that query, you may be right. Execute a number of queries at the same time, the expected benefit of a sequential scan goes out the window. The OS will be fetching blocks, more or less, at random.
mlw wrote: > Bruce Momjian wrote: > > My second point, that index scan is more risky than sequential scan, is > > outlined above. A sequential scan reads each page once, and uses the > > file system read-ahead code to prefetch the disk buffers. Index scans > > are random, and could easily re-read disk pages to plow through a > > significant portion of the table, and because the reads are random, > > the file system will not prefetch the rows so the index scan will have > > to wait for each non-cache-resident row to come in from disk. > > It took a bike ride to think about this one. The supposed advantage of a > sequential read over an random read, in an active multitasking system, is a > myth. If you are executing one query and the system is doing only that query, > you may be right. > > Execute a number of queries at the same time, the expected benefit of a > sequential scan goes out the window. The OS will be fetching blocks, more or > less, at random. OK, yes, sequential scan _can_ be as slow as index scan, but sometimes it is faster. Can you provide reasoning why index scan should be preferred, other than the admin created it, which I already addressed? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
mlw <markw@mohawksoft.com> writes: > It took a bike ride to think about this one. The supposed advantage of a > sequential read over an random read, in an active multitasking system, is a > myth. Disagree. > Execute a number of queries at the same time, the expected benefit of a > sequential scan goes out the window. The OS will be fetching blocks, more or > less, at random. If readahead is active (and it should be for sequential reads) there is still a pretty good chance that the next few disk blocks will be in cache next time you get scheduled. If your disk is thrashing that badly, you need more RAM and/or more spindles; using an index will just put even more load on the i/o system. -Doug -- Doug McNaught Wireboard Industries http://www.wireboard.com/ Custom software development, systems and network consulting. Java PostgreSQL Enhydra Python Zope Perl Apache LinuxBSD...
On April 17, 2002 05:44 pm, mlw wrote: > It took a bike ride to think about this one. The supposed advantage of a > sequential read over an random read, in an active multitasking system, is a > myth. If you are executing one query and the system is doing only that > query, you may be right. > > Execute a number of queries at the same time, the expected benefit of a > sequential scan goes out the window. The OS will be fetching blocks, more > or less, at random. If it does you should look for another OS. A good OS will work with your access requests to keep them as linear as possible. Of course it has a slight effect the other way as well but generally lots of sequential reads will be faster than lots of random ones. If you don't believe that then just run the test that Tom suggested to calculate random_tuple_cost on your own system. I bet your number is higher than 1. And when you are done, just plug the number into your configuration and get the plans that you are looking for. -- D'Arcy J.M. Cain <darcy@{druid|vex}.net> | Democracy is three wolves http://www.druid.net/darcy/ | and a sheep voting on +1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.
Bruce Momjian wrote: > > mlw wrote: > > Bruce Momjian wrote: > > > > > > mlw wrote: > > > > Now, given the choice of the two strategies on a table, both pretty close to > > > > one another, the risk of poor performance for using the index scan is minimal > > > > based on the statistics, but the risk of poor performance for using the > > > > sequential scan is quite high on a large table. > > > > > My second point, that index scan is more risky than sequential scan, is > > > outlined above. A sequential scan reads each page once, and uses the > > > file system read-ahead code to prefetch the disk buffers. Index scans > > > are random, and could easily re-read disk pages to plow through a > > > significant portion of the table, and because the reads are random, > > > the file system will not prefetch the rows so the index scan will have > > > to wait for each non-cache-resident row to come in from disk. > > > > That is a very interesting point, but shouldn't that be factored into the cost > > (random_tuple_cost?) In which case my point still stands. > > Yes, I see your point. I think on the high end that index scans can get > very expensive if you start to do lots of cache misses and have to wait > for i/o. I know the random cost is 4, but I think that number is not > linear. It can be much higher for lots of cache misses and waiting for > I/O, and think that is why it feels more risky to do an index scan on a > sample size that is not perfectly known. In an active system, sequential scans are still OS random access to a file. Two or more queries running at the same time will blow out most of the expected gain. > > Actually, you pretty much can know sequential scan size because you know > the number of blocks in the table. It is index scan that is more > unknown because you don't know how many index lookups you will need, and > how well they will stay in the cache. Again, shouldn't that be factored into the cost? > > Does that help? Wow, this _is_ confusing. I am still looking for that > holy grail that will allow this all to be codified so others can learn > from it and we don't have to rehash this repeatedly, but frankly, this > whole discussion is covering new ground that we haven't covered yet. Path planning by probabilities derived from statistical analysis is always big science, regardless of application. The cost based optimizer will *never* be finished because it can never be perfect. When all is said and done, it could very well be as good as it ever needs to be, and that a method for giving hints to the optimizer, ala Oracle, is the answer.
-----Original Message----- From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] Sent: Wednesday, April 17, 2002 2:39 PM To: mlw Cc: Andrew Sullivan; PostgreSQL-development; Tom Lane Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE mlw wrote: > Bruce Momjian wrote: > > > > mlw wrote: > > > Now, given the choice of the two strategies on a table, both pretty close to > > > one another, the risk of poor performance for using the index scan is minimal > > > based on the statistics, but the risk of poor performance for using the > > > sequential scan is quite high on a large table. > > > My second point, that index scan is more risky than sequential scan, is > > outlined above. A sequential scan reads each page once, and uses the > > file system read-ahead code to prefetch the disk buffers. Index scans > > are random, and could easily re-read disk pages to plow through a > > significant portion of the table, and because the reads are random, > > the file system will not prefetch the rows so the index scan will have > > to wait for each non-cache-resident row to come in from disk. > > That is a very interesting point, but shouldn't that be factored into the cost > (random_tuple_cost?) In which case my point still stands. Yes, I see your point. I think on the high end that index scans can get very expensive if you start to do lots of cache misses and have to wait for i/o. I know the random cost is 4, but I think that number is not linear. It can be much higher for lots of cache misses and waiting for I/O, and think that is why it feels more risky to do an index scan on a sample size that is not perfectly known. Actually, you pretty much can know sequential scan size because you know the number of blocks in the table. It is index scan that is more unknown because you don't know how many index lookups you will need, and how well they will stay in the cache. Does that help? Wow, this _is_ confusing. I am still looking for that holy grail that will allow this all to be codified so others can learn from it and we don't have to rehash this repeatedly, but frankly, this whole discussion is covering new ground that we haven't covered yet. (Maybe TODO.detail this discussion and point to it from the FAQ.) >> General rules of thumb (don't know if they apply to postgres or not): Index scans are increasingly costly when the data is only a few types. For instance, an index on a single bit makes for a very expensive scan. After 5% of the data, it would be cheaper to scan the whole table without using the index. Now, if you have a clustered index (where the data is *physically* ordered by the index order) you can use index scans much more cheaply than when the data is not ordered in this way. A golden rule of thumb when accessing data in a relational database is to use the unique clustered index whenever possible (even if it is not the primary key). These decisions are always heuristic in nature. If a table is small it may be cheapest to load the whole table into memory and sort it. If the vacuum command could categorize statistically (some RDBMS systems do this) then you can look at the statistical data and make much smarter choices for how to use the index relations. The statistical information saved could be as simple as the min, max, mean, median, mode, and standard deviation, or it might also have quartiles or deciles, or some other measure to show even more data about the actual distribution. 
You could save what is essentially a binned histogram of the data that is present in the table. A bit of imagination will quickly show how useful this could be (some commercial database systems actually do this). Another notion is to super optimize some queries. By this, I mean that if someone says that a particular query is very important, it might be worthwhile to actually try a dozen or so different plans (of the potentially billions that are possible) against the query and store the best one. They could also run the super optimize feature again later or even automatically if the vacuum operation detects that the data distributions or cardinality have changed in some significant manner. Better yet, let them store the plan and hand edit it, if need be. Rdb is an example of a commercial database that allows this. A maintenance nightmare when dimwits do it, of course, but such is life. <<
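For what it's worth, 7.2 already keeps a small binned histogram and a most-common-values list per column after ANALYZE; a sketch of how to inspect them, using a hypothetical table name:

    SELECT attname, n_distinct, most_common_vals, most_common_freqs, histogram_bounds
    FROM pg_stats
    WHERE tablename = 'tab';   -- hypothetical table; run ANALYZE on it first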
Bruce Momjian wrote: > > OK, yes, sequential scan _can_ be as slow as index scan, but sometimes > it is faster. Can you provide reasoning why index scan should be > preferred, other than the admin created it, which I already addressed? If you have a choice between two or more sub-plans that are similar in cost, say within 20% of one another, choosing a plan which uses an index has a chance of improved performance if the estimates are wrong, whereas choosing the sequential scan will always have the full cost.
mlw wrote: > Bruce Momjian wrote: > > > > > OK, yes, sequential scan _can_ be as slow as index scan, but sometimes > > it is faster. Can you provide reasoning why index scan should be > > preferred, other than the admin created it, which I already addressed? > > If you have a choice between two or more sub-plans, similar in cost, say within > 20% of one another. Choosing a plan which uses an index has a chance of > improved performance if the estimates are wrong where as choosing the > sequential scan will always have the full cost. And the chance of reduced performance if the estimate was too low. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
D'Arcy J.M. Cain wrote: > On April 17, 2002 05:44 pm, mlw wrote: > > It took a bike ride to think about this one. The supposed advantage of a > > sequential read over an random read, in an active multitasking system, is a > > myth. If you are executing one query and the system is doing only that > > query, you may be right. > > > > Execute a number of queries at the same time, the expected benefit of a > > sequential scan goes out the window. The OS will be fetching blocks, more > > or less, at random. > > If it does you should look for another OS. A good OS will work with your > access requests to keep them as linear as possible. Of course it has a > slight effect the other way as well but generally lots of sequential reads > will be faster than lots of random ones. If you don't believe that then just > run the test that Tom suggested to calculate random_tuple_cost on your own > system. I bet your number is higher than 1. The two backends would have to be hitting the same table at different spots to turn off read-ahead, but it is possible. If the backends are hitting different tables, then they don't turn off read-ahead. Of course, for both backends to be hitting the disk, they both would have not found their data in the postgres or kernel cache. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Wed, 2002-04-17 at 22:43, Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > OTOH, it is also important where the file is on disk. As seen from disk > > speed test graphs on http://www.tomshardware.com , the speed difference > > of sequential reads is 1.5 to 2.5 between inner and outer tracks. > > True. But if we use the same test file for both the sequential and > random-access timings, hopefully the absolute speed of access will > cancel out. (Again, it's the sort of thing that could use some > real-world testing...) What I was trying to say was that if you test on one end you will get wrong data for the other end of the same disk. > > (The PG planner does try to account > > for caching effects, but that's a separate estimate; > > > Will it make the random and seq read cost equal when cache size > > > database size and enough queries are performed to assume that all data > > is in cache. > > There isn't any attempt to account for the effects of data having been > read into cache by previous queries. I doubt that it would improve the > model to try to keep track of what the recent queries were Perhaps some simple thing, like number of pages read * cache size / database size Or perhaps use some additional bookkeeping in cache logic, perhaps even on a per-table basis. If this can be made to use the same locks as cache loading/invalidation it may be quite cheap. It may even exist in some weird way already inside the LRU mechanism. >--- for one > thing, do you really want your plans changing on the basis of activity > of other backends? If I want the best plans then yes. The other backends do affect performance so the best plan would be to account for their activities. If another backend is swapping like crazy the best plan may even be to wait for it to finish before proceeding :) ---------------- Hannu
Tom Lane wrote: >mlw <markw@mohawksoft.com> writes: > >>That is the difference, in another post Tom said he could not get >>excited about 10.9 second execution time over a 7.96 execution >>time. Damn!!! I would. That is wrong. >> > >Sure. Show us how to make the planner's estimates 2x more accurate >(on average) than they are now, and I'll get excited too. > >But forcing indexscan to be chosen over seqscan does not count as >making it more accurate. (If you think it does, then you don't >need to be in this thread at all; set enable_seqscan = 0 and >stop bugging us ;-)) > > regards, tom lane > >---------------------------(end of broadcast)--------------------------- >TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > Do we have a tool that can analyze a table and indexes to allow the DBA to choose when to add an index or when not to? DB2 has an index analyzer like this. Given a specific query and the current table stats it can tell you which indexes would be most beneficial. Do we have something like this already? At least we could point those DBAs to a utility like this and then they would not be too surprised when the optimizer didn't use the index. - Bill
> > ... I have seen many instances of when > > PostgreSQL refuses to use an index because the data distribution is uneven. > > This is fixed, or at least attacked, in 7.2. Again, I do not see > this as an argument for making the planner stupider instead of > smarter. Could someone fork out some decent criteria for these "stats" so that someone could generate a small app that would recommend these values on a per-site basis. Having them hardwired and stuffed into a system catalog does no good to the newbie DBA. Iterating over a set of SQL statements, measuring the output, and then sending the user the results in the form of recommended values would be huge. <dumb_question>Where could I look for an explanation of all of these values?</dumb_question> -sc -- Sean Chittenden
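A sketch of where one could start looking: the planner's cost settings are ordinary run-time parameters (described under run-time configuration in the 7.2 docs), and their current values can be shown directly; the parameter names below are the 7.2 ones:

    SHOW random_page_cost;        -- cost of a non-sequential page fetch relative to a sequential one
    SHOW cpu_tuple_cost;          -- CPU cost of processing one tuple
    SHOW cpu_index_tuple_cost;    -- CPU cost of processing one index tuple
    SHOW cpu_operator_cost;       -- CPU cost of evaluating one operator
    SHOW effective_cache_size;    -- planner's assumption about the kernel disk cache, in disk pages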
I threw together the attached program (compiles fine with gcc 2.95.2 on Solaris 2.6 and egcs-2.91.66 on RedHat Linux 6.2) and ran it a few times. Data is below. Usual disclaimers about hastily written code etc :)

Machine = ghoul (generic intel, 384mb ram, dual p3-800, ide disk running dma)

Sequential
Bytes Read    Time     Bytes / Sec
536870912     27.14    19783933.74
536870912     27.14    19783990.60
536870912     27.11    19801872.14
536870912     26.92    19942928.41
536870912     27.31    19657408.43
                       19794026.66 (avg)

Random
Bytes Read    Time     Bytes / Sec
1073741824    519.57   2066589.21
1073741824    517.78   2073751.44
1073741824    516.92   2077193.23
1073741824    513.18   2092333.29
1073741824    510.68   2102579.88
                       2082489.41 (avg)

Machine = jedi (Sun E420, 3gb ram, dual 400s, test on single scsi disk)

Sequential
Bytes Read    Time     Bytes / Sec
2097152000    65.19    32167675.28
2097152000    65.22    32154114.65
2097152000    65.16    32182561.99
2097152000    65.12    32206105.12
2097152000    64.67    32429463.26
                       32227984.06 (avg)

Random
Bytes Read    Time     Bytes / Sec
4194304000    1522.22  2755394.79
4194304000    278.18   15077622.05
4194304000    91.43    45874730.07
4194304000    61.43    68273795.19
4194304000    54.55    76890231.51
                       41774354.72 (avg)

If I interpret Tom's "divide" instruction correctly, is that a factor of 10 on the linux box? On Thu, 2002-04-18 at 01:16, Tom Lane wrote: > "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> writes: > > On my own few experience I think this could be solved decreasing > > random_page_cost, if you would prefer to use indexes than seq scans, then > > you can lower random_page_cost to a point in which postgres works as you > > want. So the planner would prefer indexes when in standard conditions it > > would prefer seq scans. > > It's entirely possible that the default value of random_page_cost is too > high, at least for many modern machines. The experiments I did to get > the 4.0 figure were done a couple years ago, on hardware that wasn't > exactly new at the time. I have not heard of anyone else trying to > measure it though. > > I don't think I have the source code I used anymore, but the principle > is simple enough: > > 1. Make a large file (several times the size of your machine's RAM, to > ensure you swamp out kernel disk buffering effects). Fill with random > data. (NB: do not fill with zeroes, some filesystems optimize this away.) > > 2. Time reading the file sequentially, 8K per read request. > Repeat enough to get a statistically trustworthy number. > > 3. Time reading randomly-chosen 8K pages from the file. Repeat > enough to get a trustworthy number (the total volume of pages read > should be several times the size of your RAM). > > 4. Divide. > > The only tricky thing about this is making sure you are measuring disk > access times and not being fooled by re-accessing pages the kernel still > has cached from a previous access. (The PG planner does try to account > for caching effects, but that's a separate estimate; the value of > random_page_cost isn't supposed to include caching effects.) AFAIK the > only good way to do that is to use a large test, which means it takes > awhile to run; and you need enough spare disk space for a big test file. > > It'd be interesting to get some numbers for this across a range of > hardware, filesystems, etc ... > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >
> Would it make sense to flatten out INNER JOINs only when the total > number of tables involved is less than some parameter N? N > around six or eight would probably keep the complex-query crowd > happy, while not causing unintuitive behavior for simple queries. > Anybody who really likes the current behavior could set N=1 to force > the system to obey his join order. I'd like to see the "reorder, or not to reorder" to happen as a settable parameter, *not* as a side effect of choosing a particular should-be-equivalent syntax for a query. If that were exposed, then folks could have additional control over the optimizer no matter what syntax they prefer to use. And in fact could alter the behavior without having to completely rewrite their query. One could also think about a threshold mechanism as you mention above, but istm that allowing explicit control over reordering (fundamentally different than, say, control over whether particular kinds of scans are used) is the best first step. Not solely continuing to hide that control behind heuristics involving query style and numbers of tables. - Thomas
Thomas Lockhart wrote: > <snip> > If that were exposed, then folks could have additional control over the > optimizer no matter what syntax they prefer to use. And in fact could > alter the behavior without having to completely rewrite their query. > > One could also think about a threshold mechanism as you mention above, > but istm that allowing explicit control over reordering (fundamentally > different than, say, control over whether particular kinds of scans are > used) is the best first step. Not solely continuing to hide that control > behind heuristics involving query style and numbers of tables. A la Oracle... here we come.... :-/ If we go down this track, although it would be beneficial in the short term, is it the best long term approach? I'm of a belief that *eventually* we really can take enough of the variables into consideration for planning the best query every time. I didn't say it was gunna be soon, nor easy though. + Justin > - Thomas > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html -- "My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there." - Indira Gandhi
... > I'm of a belief that *eventually* we really can take enough of the > variables into consideration for planning the best query every time. I > didn't say it was gunna be soon, nor easy though. I agree. But I'd like to eliminate the optimizer variability which depends solely on the syntactical differences between traditional and "join syntax" inner join queries. If the reason for these differences are to allow explicit control over join order, let's get another mechanism for doing that. - Thomas
Thomas Lockhart wrote: > > ... > > I'm of a belief that *eventually* we really can take enough of the > > variables into consideration for planning the best query every time. I > > didn't say it was gunna be soon, nor easy though. > > I agree. But I'd like to eliminate the optimizer variability which > depends solely on the syntactical differences between traditional and > "join syntax" inner join queries. If the reason for these differences > are to allow explicit control over join order, let's get another > mechanism for doing that. Ok. I see what you mean now. That makes more sense. :) + Justin > - Thomas -- "My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there." - Indira Gandhi
> Look at the pgbench utility. I can't run that program without a +- 10%
> variation from run to run, no matter how many times I run vacuum and checkpoint.

It's pgbench's fault: TPC-B was replaced with TPC-C because it is not accurate enough. We run a pseudo TPC-H and it has almost no variation from one run to another.

Regards
On 04/17/2002 01:44:46 PM Michael Loftis wrote:
> In many of the cases where it is a primary key it is also there to
> ensure fast lookups when referenced as a foreign key. Or for joins.

Don't know if the optimizer takes this into consideration, but a query that uses a primary and/or unique key in the where-clause should always choose to use the related indices (assuming the table size is above a certain threshold). Since a primary key/unique index always restricts the resultset to a single row.....

Somebody else mentioned that after creating an index, he still had to run analyze in order to get the optimizer to choose to use the index. I thought that 'create index' also updated pg_stats?

Maarten

----

Maarten Boekhold, maarten.boekhold@reuters.com

Reuters Consulting
Dubai Media City
Building 1, 5th Floor
PO Box 1426
Dubai, United Arab Emirates
tel:+971(0)4 3918300 ext 249
fax:+971(0)4 3918333
mob:+971(0)505526539
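On the last question: as far as I know CREATE INDEX does not populate pg_stats at all; only ANALYZE (or VACUUM ANALYZE) does, which is why a freshly built index can still be ignored by the planner. A minimal sketch, with made-up table and column names:

  CREATE INDEX orders_customer_idx ON orders (customer_id);
  -- At this point pg_stats still has nothing useful for customer_id.
  ANALYZE orders;
  -- Now the planner has a distribution to cost the index scan against.
  SELECT attname, n_distinct FROM pg_stats WHERE tablename = 'orders';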
On Thu, 18 Apr 2002 Maarten.Boekhold@reuters.com wrote:
>
> On 04/17/2002 01:44:46 PM Michael Loftis wrote:
> > In many of the cases where it is a primary key it is also there to
> > ensure fast lookups when referenced as a foreign key. Or for joins.
>
> Don't know if the optimizer takes this into consideration, but a query that
> uses a primary and/or unique key in the where-clause should always choose to
> use the related indices (assuming the table size is above a certain
> threshold). Since a primary key/unique index always restricts the resultset
> to a single row.....

I don't think so.

eg. table with primary key "pk", taking values from 1 to 1000000 (so
1000000 records)

select * from table where pk > 5

should probably not use the index ...

Cheers
Tycho

--
Tycho Fruru    tycho.fruru@conostix.com
"Prediction is extremely difficult. Especially about the future."
  - Niels Bohr
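Tycho's point in plan form, as a rough sketch with a hypothetical million-row table "bigtable" keyed on pk (exact costs will of course differ):

  EXPLAIN SELECT * FROM bigtable WHERE pk = 5;
  -- Unique match: the primary-key index scan should normally win.
  EXPLAIN SELECT * FROM bigtable WHERE pk > 5;
  -- Matches nearly every row: a seq scan is the sensible plan here,
  -- index or no index.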
Apologies for the naff double post, but I meant to add that obviously the figures for the solaris box are bogus after the first run...imagine a file system cache of an entire 2gb file. I tried creating a file of 4gb on this box, but it bombed with a "file too large error". Unfortunately, I can't rip memory out of this box as I don't have exclusive access. On Thu, 2002-04-18 at 11:49, Mark Pritchard wrote: > I threw together the attached program (compiles fine with gcc 2.95.2 on > Solaris 2.6 and egcs-2.91.66 on RedHat Linux 6.2) and ran it a few > times. Data is below. Usual disclaimers about hastily written code etc > :) > > Machine = ghoul (generic intel, 384mb ram, dual p3-800, ide disk running > dma) > > Sequential > Bytes Read Time Bytes / Sec > 536870912 27.14 19783933.74 > 536870912 27.14 19783990.60 > 536870912 27.11 19801872.14 > 536870912 26.92 19942928.41 > 536870912 27.31 19657408.43 > 19794026.66 (avg) > > Random > Bytes Read Time Bytes / Sec > 1073741824 519.57 2066589.21 > 1073741824 517.78 2073751.44 > 1073741824 516.92 2077193.23 > 1073741824 513.18 2092333.29 > 1073741824 510.68 2102579.88 > 2082489.41 (avg) > > Machine = jedi (Sun E420, 3gb ram, dual 400s, test on single scsi disk) > > Sequential > Bytes Read Time Bytes / Sec > 2097152000 65.19 32167675.28 > 2097152000 65.22 32154114.65 > 2097152000 65.16 32182561.99 > 2097152000 65.12 32206105.12 > 2097152000 64.67 32429463.26 > 32227984.06 (avg) > > Random > Bytes Read Time Bytes / Sec > 4194304000 1522.22 2755394.79 > 4194304000 278.18 15077622.05 > 4194304000 91.43 45874730.07 > 4194304000 61.43 68273795.19 > 4194304000 54.55 76890231.51 > 41774354.72 > > If I interpret Tom's "divide" instruction correctly, is that a factor of > 10 on the linux box? > > On Thu, 2002-04-18 at 01:16, Tom Lane wrote: > > "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> writes: > > > On my own few experience I think this could be solved decreasing > > > random_page_cost, if you would prefer to use indexes than seq scans, then > > > you can lower random_page_cost to a point in which postgres works as you > > > want. So the planner would prefer indexes when in standard conditions it > > > would prefer seq scans. > > > > It's entirely possible that the default value of random_page_cost is too > > high, at least for many modern machines. The experiments I did to get > > the 4.0 figure were done a couple years ago, on hardware that wasn't > > exactly new at the time. I have not heard of anyone else trying to > > measure it though. > > > > I don't think I have the source code I used anymore, but the principle > > is simple enough: > > > > 1. Make a large file (several times the size of your machine's RAM, to > > ensure you swamp out kernel disk buffering effects). Fill with random > > data. (NB: do not fill with zeroes, some filesystems optimize this away.) > > > > 2. Time reading the file sequentially, 8K per read request. > > Repeat enough to get a statistically trustworthy number. > > > > 3. Time reading randomly-chosen 8K pages from the file. Repeat > > enough to get a trustworthy number (the total volume of pages read > > should be several times the size of your RAM). > > > > 4. Divide. > > > > The only tricky thing about this is making sure you are measuring disk > > access times and not being fooled by re-accessing pages the kernel still > > has cached from a previous access. 
(The PG planner does try to account > > for caching effects, but that's a separate estimate; the value of > > random_page_cost isn't supposed to include caching effects.) AFAIK the > > only good way to do that is to use a large test, which means it takes > > awhile to run; and you need enough spare disk space for a big test file. > > > > It'd be interesting to get some numbers for this across a range of > > hardware, filesystems, etc ... > > > > regards, tom lane > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > > > > > ---- > > #include <errno.h> > #include <stdio.h> > #include <stdlib.h> > #include <time.h> > #include <sys/stat.h> > #include <sys/time.h> > > /** > * Constants > */ > > #define BLOCK_SIZE (8192) > > /** > * Prototypes > */ > > // Creates the test file filled with random data > void createTestFile(char *testFileName, long long fileSize); > > // Handles runtime errors by displaying the function, activity and error number > void handleError(char *functionName, char *activity); > > // Standard entry point > int main(int argc, char *args[]); > > // Prints correct usage and quits > void printUsageAndQuit(); > > // Tests performance of random reads of the given file > void testRandom(char *testFileName, long long amountToRead); > > // Tests performance of sequential reads of the given file > void testSeq(char *testFileName); > > /** > * Definitions > */ > > /** > * createTestFile() > */ > void createTestFile(char *testFileName, long long fileSize) > { > FILE *testFile; > long long reps, i, j, bufferReps; > time_t timetmp; > long long *buffer; > size_t written; > > // Indicate op > printf("Creating test file %s of %lld mb\n",testFileName,fileSize); > > // Adjust file size to bytes > fileSize *= (1024*1024); > > // Allocate a buffer for writing out random long longs > if (!(buffer = malloc(BLOCK_SIZE))) > handleError("createTestFile()","malloc"); > > // Open the file for writing > if (!(testFile = fopen(testFileName, "wb"))) > handleError("createTestFile()","fopen"); > > // Initialise the random number generator > srandom(time(NULL)); > > // Write data > reps = fileSize / BLOCK_SIZE; > bufferReps = BLOCK_SIZE / sizeof(long long); > for (i = 0; i < reps; i++) > { > // Fill buffer with random data > for (j = 0; j < bufferReps; j++) > buffer[j] = random(); > > // Write > written = fwrite(buffer, sizeof(long long), bufferReps, testFile); > if (written != bufferReps) > handleError("createTestFile()","fwrite"); > } > > // Flush and close > if (fflush(testFile)) > handleError("createTestFile()","fflush"); > if (fclose(testFile)) > handleError("createTestFile()","fclose"); > > // Free buffer > free(buffer); > } > > /** > * handleError() > */ > void handleError(char *functionName, char *activity) > { > fprintf(stderr, "Error in %s while attempting %s. 
Error %d (%s)\n", functionName, activity, errno, strerror(errno)); > exit(1); > } > > /** > * main() > */ > int main(int argc, char *argv[]) > { > // Print usage and quit if argument count is definitely incorrect > if (argc < 3) > { > // Definitely wrong > printUsageAndQuit(); > } > else > { > // Dispatch > if (!strcmp(argv[1], "create")) > { > if (argc != 4) > printUsageAndQuit(); > > // Create the test file of the specified size > createTestFile(argv[2], atol(argv[3])); > } > else if (!strcmp(argv[1], "seqtest")) > { > if (argc != 3) > printUsageAndQuit(); > > // Test performance of sequential reads > testSeq(argv[2]); > } > else if (!strcmp(argv[1], "rndtest")) > { > if (argc != 4) > printUsageAndQuit(); > > // Test performance of random reads > testRandom(argv[2], atol(argv[3])); > } > else > { > // Unknown command > printUsageAndQuit(); > } > } > > return 0; > } > > /** > * printUsageAndQuit() > */ > void printUsageAndQuit() > { > puts("USAGE: rndpgcst [create <file> <size_in_mb>] | [seqtest <file>] | [rndtest <file> <read_in_mb>]"); > > exit(1); > } > > /** > * testSeq() > */ > void testSeq(char *testFileName) > { > FILE *testFile; > char *buffer; > long long reps, totalRead, thisRead, timeTaken; > struct timeval startTime, endTime; > struct timezone timezoneDiscard; > > // Indicate op > printf("Sequential read test of %s\n",testFileName); > > // Grab a buffer > buffer = malloc(BLOCK_SIZE); > > // Open the file for reading > if (!(testFile = fopen(testFileName, "rb"))) > handleError("testSeq()","fopen"); > > // Start timer > if (gettimeofday(&startTime, &timezoneDiscard) == -1) > handleError("testSeq()", "gettimeofday start"); > > // Read all data from file > totalRead = 0; > while ((thisRead = fread(buffer, 1, BLOCK_SIZE, testFile)) != 0) > totalRead += thisRead; > > // End timer > if (gettimeofday(&endTime, &timezoneDiscard) == -1) > handleError("testSeq()", "gettimeofday start"); > > // Close > if (fclose(testFile)) > handleError("testSeq()","fclose"); > > // Free the buffer > free(buffer); > > // Display time taken > timeTaken = (endTime.tv_sec - startTime.tv_sec) * 1000000; > timeTaken += (endTime.tv_usec - startTime.tv_usec); > printf("%lld bytes read in %f seconds\n", totalRead, (double) timeTaken / (double) 1000000); > } > > /** > * testRandom() > */ > void testRandom(char *testFileName, long long amountToRead) > { > FILE *testFile; > long long reps, i, fileSize, timeTaken, totalRead, readPos, thisRead, offsetMax; > struct stat fileStats; > char *buffer; > struct timeval startTime, endTime; > struct timezone timezoneDiscard; > > // Indicate op > printf("Random read test of %s for %lld mb\n", testFileName, amountToRead); > > // Initialise the random number generator > srandom(time(NULL)); > > // Adjust amount to read > amountToRead *= (1024*1024); > > // Determine file size > if (stat(testFileName, &fileStats) == -1) > handleError("testRandom()", "stat"); > fileSize = fileStats.st_size; > > // Grab a buffer > buffer = malloc(BLOCK_SIZE); > > // Open the file for reading > if (!(testFile = fopen(testFileName, "rb"))) > handleError("testRandom()","fopen"); > > // Start timer > if (gettimeofday(&startTime, &timezoneDiscard) == -1) > handleError("testRandom()", "gettimeofday start"); > > // Read data from file > reps = amountToRead / BLOCK_SIZE; > offsetMax = fileSize / BLOCK_SIZE; > for (i = 0; i < reps; i++) > { > // Determine read position > readPos = (random() % offsetMax) * BLOCK_SIZE; > > // Seek and read > if (fseek(testFile, readPos, SEEK_SET) == -1) > 
handleError("testRandom()","fseek"); > if ((thisRead = fread(buffer, 1, BLOCK_SIZE, testFile)) != BLOCK_SIZE) > handleError("testRandom()","fread"); > } > > // End timer > if (gettimeofday(&endTime, &timezoneDiscard) == -1) > handleError("testRandom()", "gettimeofday start"); > > // Close > if (fclose(testFile)) > handleError("testRandom()","fclose"); > > // Free the buffer > free(buffer); > > // Display time taken > timeTaken = (endTime.tv_sec - startTime.tv_sec) * 1000000; > timeTaken += (endTime.tv_usec - startTime.tv_usec); > printf("%lld bytes read in %f seconds\n", amountToRead, (double) timeTaken / (double) 1000000); > }
mlw <markw@mohawksoft.com> writes: > For instance: say we have two similarly performing plans, close to one another, > say within 20%, one plan uses an index, and one does not. It is unlikely that > the index plan will perform substantially worse than the non-index plan, right? This seems to be the crux of the argument ... but I've really seen no evidence to suggest that it's true. The downside of improperly picking an indexscan plan is *not* any less than the downside of improperly picking a seqscan plan, in my experience. It does seem (per Thomas' earlier observation) that we get more complaints about failure to use an index scan than the other case. Prior to 7.2 it was usually pretty obviously traceable to overestimates of the number of rows to be retrieved (due to inadequate data statistics). In 7.2 that doesn't seem to be the bottleneck anymore. I think now that there may be some shortcoming in the planner's cost model or in the adjustable parameters for same. But my reaction to that is to try to figure out how to fix the cost model. I certainly do not feel that we've reached a dead end in which the only answer is to give up and stop trusting the cost-based optimization approach. > Now, given the choice of the two strategies on a table, both pretty close to > one another, the risk of poor performance for using the index scan is minimal > based on the statistics, but the risk of poor performance for using the > sequential scan is quite high on a large table. You keep asserting that, and you keep providing no evidence. regards, tom lane
Tom Lane wrote: > > mlw <markw@mohawksoft.com> writes: > > For instance: say we have two similarly performing plans, close to one another, > > say within 20%, one plan uses an index, and one does not. It is unlikely that > > the index plan will perform substantially worse than the non-index plan, right? > > This seems to be the crux of the argument ... but I've really seen no > evidence to suggest that it's true. The downside of improperly picking > an indexscan plan is *not* any less than the downside of improperly > picking a seqscan plan, in my experience. Our experiences differ. I have fought with PostgreSQL on a number of occasions when it would not use an index. Inevitably, I would have to set "enable_seqscan = false." I don't like doing that because it forces the use of an index when it doesn't make sense. I don't think we will agree, we have seen different behaviors, and our experiences seem to conflict. This however does not mean that either of us is in error, it just may mean that we use data with very different characteristics. This thread is kind of frustrating for me because over the last couple years I have seen this problem many times and the answer is always the same, "The statistics need to be improved." Tom, you and I have gone back and forth about this more than once. I submit to you that the statistics will probably *never* be right. They will always need improvement here and there. Perhaps instead of fighting over an algorithmic solution, and forcing the users to work around problems with choosing an index, should we not just allow the developer to place hints in the SQL, as: select /*+ INDEX(a_id, b_id) */ * from a, b where a.id = b.id; That way if there is a performance issue with using or not using an index, the developer can have better control over the evaluation of the query.
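For what it's worth, the nearest thing available right now is not a comment hint but flipping the planner switch around the one statement that needs it, since SET is scoped to the session. A blunt sketch, reusing the shape of the query above:

  set enable_seqscan = false;
  select * from a, b where a.id = b.id;
  set enable_seqscan = true;
  -- As far as I know the "off" setting is really a very large cost penalty
  -- on seq scans rather than a hard prohibition, so plans that have no
  -- other choice still work.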
mlw <markw@mohawksoft.com> writes: > should we not just allow the developer to place hints in the > SQL, as: > select /*+ INDEX(a_id, b_id) */ * from a, b where a.id = b.id; <<itch>> People have suggested that sort of thing from time to time, but I have a couple of problems with it: 1. It's unobvious how to tag the source in a way that is helpful for any but the most trivial queries. Moreover, reasonable sorts of tags would provide only partial specification of the exact query plan, which is a recipe for trouble --- an upgraded optimizer might make different choices, leading to a pessimized plan if some points are pinned down when others aren't. 2. The tag approach presumes that the query programmer is smarter than the planner. This might be true under ideal circumstances, but I have a hard time crediting that the planner looking at today's stats is dumber than the junior programmer who left two years ago, and no one's updated his query since then. The planner may not be very bright, but it doesn't get bored, tired, or sick, nor move on to the next opportunity. It will pick the best plan it can on the basis of current statistics and the specific values appearing in the given query. Every time. A tag-forced query plan doesn't have that adaptability. By and large this argument reminds me of the "compiler versus hand- programmed assembler" argument. Which was pretty much a dead issue when I was an undergrad, more years ago than I care to admit in a public forum. Yes, a competent programmer who's willing to work hard can out-code a compiler over small stretches of code. But no one tries to write large systems in assembler anymore. Hand-tuned SQL is up against that same John-Henry-vs-the-steam-hammer logic. Maybe the current PG optimizer isn't quite in the steam hammer league yet, but it will get there someday. I'm more interested in revving up the optimizer than in betting on John Henry. regards, tom lane
From: Adrian 'Dagurashibanipal' von Bidder
On Wed, 2002-04-17 at 19:43, Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > OTOH, it is also important where the file is on disk. As seen from disk > > speed test graphs on http://www.tomshardware.com , the speed difference > > of sequential reads is 1.5 to 2.5 between inner and outer tracks. > > True. But if we use the same test file for both the sequential and > random-access timings, hopefully the absolute speed of access will > cancel out. (Again, it's the sort of thing that could use some > real-world testing...) Not so sure about that. Random access basically measures latency, sequential access measures transfer speed. I'd argue that latency is more or less constant across the disk as it depends on head movement and the spindle turning. cheers -- vbi
Tom Lane wrote: > By and large this argument reminds me of the "compiler versus hand- > programmed assembler" argument. Which was pretty much a dead issue > when I was an undergrad, more years ago than I care to admit in a > public forum. Yes, a competent programmer who's willing to work > hard can out-code a compiler over small stretches of code. But no > one tries to write large systems in assembler anymore. Hand-tuned > SQL is up against that same John-Henry-vs-the-steam-hammer logic. > Maybe the current PG optimizer isn't quite in the steam hammer > league yet, but it will get there someday. I'm more interested > in revving up the optimizer than in betting on John Henry. I am not suggesting that anyone is going to write each and every query with hints, but a few select queries, yes, people will want to hand tune them. You are right no one uses assembler to create big systems, but big systems often have spot optimizations in assembler. Even PostgreSQL has assembler in it. No generic solution can be perfect for every specific application. There will always be times when hand tuning a query will produce better results, and sometimes that will make the difference between using PostgreSQL or use something else. For the two years I have been subscribed to this list, this is a fairly constant problem, and the answer is always the same, in effect, "we're working on it." If PostgreSQL had the ability to accept hints, one could say, "We are always working to improve it, but in your case you may want to give the optimizer a hint as to what you expect it to do." It may not be the "best" solution in your mind, but speaking as a long time user of PostgreSQL, it would be a huge help to me, and I'm sure I am not alone.
mlw wrote: > I don't think we will agree, we have seen different behaviors, and our > experiences seem to conflict. This however does not mean that either of us is > in error, it just may mean that we use data with very different > characteristics. > > This thread is kind of frustrating for me because over the last couple years I > have seen this problem many times and the answer is always the same, "The > statistics need to be improved." Tom, you and I have gone back and forth about > this more than once. > Have you tried reducing 'random_page_cost' in postgresql.conf. That should solve most of your problems if you would like more index scans. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
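A per-session experiment is usually enough to see whether that helps before editing postgresql.conf. The table name and value below are only illustrative:

  SHOW random_page_cost;           -- 4 is the shipped default
  SET random_page_cost = 2;        -- random I/O now looks cheaper, so index scans score better
  EXPLAIN SELECT * FROM bigtable WHERE indexed_col = 42;  -- hypothetical query: compare plans before and after
  SET random_page_cost = 4;        -- restore the default when done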
Tom Lane wrote:
> 2. The tag approach presumes that the query programmer is smarter
> than the planner. This might be true under ideal circumstances,
> but I have a hard time crediting that the planner looking at today's
> stats is dumber than the junior programmer who left two years ago,
> and no one's updated his query since then. The planner may not be
> very bright, but it doesn't get bored, tired, or sick, nor move on
> to the next opportunity. It will pick the best plan it can on the
> basis of current statistics and the specific values appearing in
> the given query. Every time. A tag-forced query plan doesn't
> have that adaptability.

Add to this that hand tuning would happen mostly on queries where the two cost estimates are fairly close. Add the variability of a multi-user environment, and a hard-coded plan may turn out to be faster only some of the time, and could quickly turn into something slower if the table changes. My point is that very close cases are the ones most likely to change over time.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Finally someone writes down what's been itching at my brain for a while.

In a multi-tasking system it's always cheaper to fetch fewer blocks, no matter where they are. Because, as you said, it will end up more or less random on a system experiencing a larger number of queries.

mlw wrote:

>Bruce Momjian wrote:
>
>>My second point, that index scan is more risky than sequential scan, is
>>outlined above. A sequential scan reads each page once, and uses the
>>file system read-ahead code to prefetch the disk buffers. Index scans
>>are random, and could easily re-read disk pages to plow through a
>>significant portion of the table, and because the reads are random,
>>the file system will not prefetch the rows so the index scan will have
>>to wait for each non-cache-resident row to come in from disk.
>>
>
>It took a bike ride to think about this one. The supposed advantage of a
>sequential read over an random read, in an active multitasking system, is a
>myth. If you are executing one query and the system is doing only that query,
>you may be right.
>
>Execute a number of queries at the same time, the expected benefit of a
>sequential scan goes out the window. The OS will be fetching blocks, more or
>less, at random.
Somethings wrong with the random numbers from the sun... re-run them, that first sample is insane.... Caching looks like it's affecctign your results alot... Mark Pritchard wrote: >I threw together the attached program (compiles fine with gcc 2.95.2 on >Solaris 2.6 and egcs-2.91.66 on RedHat Linux 6.2) and ran it a few >times. Data is below. Usual disclaimers about hastily written code etc >:) > >Machine = ghoul (generic intel, 384mb ram, dual p3-800, ide disk running >dma) > >Sequential >Bytes Read Time Bytes / Sec >536870912 27.14 19783933.74 >536870912 27.14 19783990.60 >536870912 27.11 19801872.14 >536870912 26.92 19942928.41 >536870912 27.31 19657408.43 > 19794026.66 (avg) > >Random >Bytes Read Time Bytes / Sec >1073741824 519.57 2066589.21 >1073741824 517.78 2073751.44 >1073741824 516.92 2077193.23 >1073741824 513.18 2092333.29 >1073741824 510.68 2102579.88 > 2082489.41 (avg) > >Machine = jedi (Sun E420, 3gb ram, dual 400s, test on single scsi disk) > >Sequential >Bytes Read Time Bytes / Sec >2097152000 65.19 32167675.28 >2097152000 65.22 32154114.65 >2097152000 65.16 32182561.99 >2097152000 65.12 32206105.12 >2097152000 64.67 32429463.26 > 32227984.06 (avg) > >Random >Bytes Read Time Bytes / Sec >4194304000 1522.22 2755394.79 >4194304000 278.18 15077622.05 >4194304000 91.43 45874730.07 >4194304000 61.43 68273795.19 >4194304000 54.55 76890231.51 > 41774354.72 > >If I interpret Tom's "divide" instruction correctly, is that a factor of >10 on the linux box? >
Bruce Momjian wrote:
>
> mlw wrote:
> > I don't think we will agree, we have seen different behaviors, and our
> > experiences seem to conflict. This however does not mean that either of us is
> > in error, it just may mean that we use data with very different
> > characteristics.
> >
> > This thread is kind of frustrating for me because over the last couple years I
> > have seen this problem many times and the answer is always the same, "The
> > statistics need to be improved." Tom, you and I have gone back and forth about
> > this more than once.
>
> Have you tried reducing 'random_page_cost' in postgresql.conf. That
> should solve most of your problems if you would like more index scans.

My random page cost is 1 :-)

I had a database where I had to have "enable_seqscan=false" in the config file. The nature of the data always makes the statistics bogus, and it always refused to use the index.

It is frustrating because sometimes it *is* a problem for some unknown number of users (including myself), as evidenced by the perennial "why isn't postgres using my index" posts, and for the last two years you guys keep saying it isn't a problem, or that the statistics just need improvement. Sorry for my tone, but I have pulled out my hair numerous times over this very problem.

This whole process has led me to change my mind. I don't think adding weight to an index scan is the answer; I think having the ability to submit hints to the planner is the only way to really address this or any future issues.

Just so you understand my perspective, I am not thinking of the average web monkey. I am thinking of the expert DBA or architect who wants to deploy a system and needs real control over performance in critical areas.

My one most important experience (I've had more than one) with this whole topic is DMN's music database: when PostgreSQL uses the index, the query executes in a fraction of a second. With "enable_seqscan=true" PostgreSQL refuses to use the index, and the query takes about a minute. No matter how much I analyze, I have to disable sequential scan for the system to work correctly.

cdinfo=# set enable_seqscan=false ;
SET VARIABLE
cdinfo=# explain select * from ztitles, zsong where ztitles.muzenbr = zsong.muzenbr and ztitles.artistid = 100 ;
NOTICE:  QUERY PLAN:

Merge Join  (cost=3134.95..242643.42 rows=32426 width=356)
  ->  Sort  (cost=3134.95..3134.95 rows=3532 width=304)
        ->  Index Scan using ztitles_artistid on ztitles  (cost=0.00..3126.62 rows=3532 width=304)
  ->  Index Scan using zsong_muzenbr on zsong  (cost=0.00..237787.51 rows=4298882 width=52)

EXPLAIN
cdinfo=# set enable_seqscan=true ;
SET VARIABLE
cdinfo=# explain select * from ztitles, zsong where ztitles.muzenbr = zsong.muzenbr and ztitles.artistid = 100 ;
NOTICE:  QUERY PLAN:

Hash Join  (cost=3126.97..61889.37 rows=32426 width=356)
  ->  Seq Scan on zsong  (cost=0.00..52312.66 rows=4298882 width=52)
  ->  Hash  (cost=3126.62..3126.62 rows=3532 width=304)
        ->  Index Scan using ztitles_artistid on ztitles  (cost=0.00..3126.62 rows=3532 width=304)

EXPLAIN
cdinfo=# select count(*) from zsong ;
  count
---------
 4298882
(1 row)
Michael Loftis <mloftis@wgops.com> writes: > Somethings wrong with the random numbers from the sun... re-run them, > that first sample is insane.... Caching looks like it's affecctign your > results alot... Yeah; it looks like the test case is not large enough to swamp out caching effects on the Sun box. It is on the Linux box, evidently, since the 10:1 ratio appears very repeatable. regards, tom lane
... > My one most important experience (I've had more than one) with this whole topic > is DMN's music database, when PostgreSQL uses the index, the query executes in > a fraction of a second. When "enable_seqscan=true" PostgreSQL refuses to use > the index, and the query takes a about a minute. No matter how much I analyze, > I have to disable sequential scan for the system to work correctly. How about contributing the data and a query? We've all got things that we would like to change or adjust in the PostgreSQL feature set. If you can't contribute code, how about organizing some choice datasets for testing purposes? If the accumulated set is too big for postgresql.org (probably not, but...) I can host them on my machine. Most folks seem to not have to manipulate the optimizer to get good results nowadays. So to make more progress we need to have test cases... - Thomas
mlw wrote: > Bruce Momjian wrote: > > > > mlw wrote: > > > I don't think we will agree, we have seen different behaviors, and our > > > experiences seem to conflict. This however does not mean that either of us is > > > in error, it just may mean that we use data with very different > > > characteristics. > > > > > > This thread is kind of frustrating for me because over the last couple years I > > > have seen this problem many times and the answer is always the same, "The > > > statistics need to be improved." Tom, you and I have gone back and forth about > > > this more than once. > > > > > > > Have you tried reducing 'random_page_cost' in postgresql.conf. That > > should solve most of your problems if you would like more index scans. > > My random page cost is 1 :-) Have you tried < 1. Seems that may work well for your case. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
mlw <markw@mohawksoft.com> writes: > My one most important experience (I've had more than one) with this > whole topic is DMN's music database, when PostgreSQL uses the index, > the query executes in a fraction of a second. When > "enable_seqscan=true" PostgreSQL refuses to use the index, and the > query takes a about a minute. No matter how much I analyze, I have to > disable sequential scan for the system to work correctly. It would be useful to see "explain analyze" not just "explain" for these cases. Also, what stats does pg_stats show for the variables used? regards, tom lane
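In case it saves a round trip, this is roughly what is being asked for, using the table and column names from the plans posted earlier (adjust to taste):

  EXPLAIN ANALYZE SELECT * FROM ztitles, zsong
    WHERE ztitles.muzenbr = zsong.muzenbr AND ztitles.artistid = 100;
  -- Actual row counts and times, not just estimates.
  SELECT tablename, attname, n_distinct, correlation
  FROM pg_stats
  WHERE tablename IN ('ztitles', 'zsong')
    AND attname IN ('muzenbr', 'artistid');
  -- The statistics behind the estimates the planner used for those columns.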
Numbers being run on a BSD box now...

FreeBSD 4.3-p27, 512MB RAM, 2 x PIII-600 Xeon, on a 4-disk RAID 5 array on a dedicated ICP Vortex card. Sorry, no single drives on this box; I have an outboard Silicon Gear Mercury on a motherboard-based Adaptec controller I can test as well. I'll post when the tests on the Vortex are done. I'm using 2GB files ATM; I'll look at the code and see if it can be made to work with large files. At least for FreeBSD the change will mostly be doing s/fseek/fseeko/g and s/size_t/off_t/g or something similar. FreeBSD seems to prefer the Open Unix standard in this regard... This will make it usable for much larger test files.

Tom Lane wrote:

>Michael Loftis <mloftis@wgops.com> writes:
>
>>Somethings wrong with the random numbers from the sun... re-run them,
>>that first sample is insane.... Caching looks like it's affecctign your
>>results alot...
>>
>
>Yeah; it looks like the test case is not large enough to swamp out
>caching effects on the Sun box. It is on the Linux box, evidently,
>since the 10:1 ratio appears very repeatable.
>
> regards, tom lane
Adrian 'Dagurashibanipal' von Bidder wrote: > > On Wed, 2002-04-17 at 19:43, Tom Lane wrote: > > Hannu Krosing <hannu@tm.ee> writes: > > > OTOH, it is also important where the file is on disk. As seen from disk > > > speed test graphs on http://www.tomshardware.com , the speed difference > > > of sequential reads is 1.5 to 2.5 between inner and outer tracks. > > > > True. But if we use the same test file for both the sequential and > > random-access timings, hopefully the absolute speed of access will > > cancel out. (Again, it's the sort of thing that could use some > > real-world testing...) > > Not so sure about that. Random access basically measures latency, > sequential access measures transfer speed. I'd argue that latency is > more or less constant across the disk as it depends on head movement and > the spindle turning. The days when "head movement" is relevant are long over. Not a single drive sold today, or in the last 5 years, is a simple spindle/head system. Many are RLE encoded, some have RAID features across the various platters inside the drive. Many have dynamic remapping of sectors, all of them have internal caching, some of them have predictive read ahead, some even use compression. The assumption that sequentially reading a file from a modern disk drive means that the head will move less often is largely bogus. Now, factor in a full RAID system where you have 8 of these disks. Random access of a drive may be slower than sequential access, but this has less to do with the drive, and more to do with OS level caching and I/O channel hardware. Factor in a busy multitasking system, you have no way to really predict the state of a drive from one read to the next. (Rotational speed of the drive is still important in that it affects internal rotational alignment and data transfer.)
Indeed - I had a delayed post (sent from the wrong email address) which mentioned that the cache is obviously at play here. I still find it amazing that the file system would cache 2gb :)

The numbers are definitely correct though... they are actually the second set. I'm running a test with a larger file size to remove the cache effects (having realised that ulimit is the biz). Will post again shortly.

Tom - have you had a chance to look at the test program I wrote? Is it working as desired?

Cheers,

Mark

On Fri, 2002-04-19 at 00:56, Tom Lane wrote:
> Michael Loftis <mloftis@wgops.com> writes:
> > Somethings wrong with the random numbers from the sun... re-run them,
> > that first sample is insane.... Caching looks like it's affecctign your
> > results alot...
>
> Yeah; it looks like the test case is not large enough to swamp out
> caching effects on the Sun box. It is on the Linux box, evidently,
> since the 10:1 ratio appears very repeatable.
>
> regards, tom lane
Got some numbers now... You'll notice the Random reads are *really* slow. The reason for htis is the particular read sizes that are ebing used are the absolute worst-case for my particular configuration. (wiht a 32kb or 64kb block size I generally achieve much higher performance even on random I/O) Sequential I/O is most likely being limited atleast in part by the CPU power available... Sequential tests: 2147483648 bytes read in 39.716158 seconds 54070780.16 bytes/sec 2147483648 bytes read in 37.836187 seconds 56757401.27 bytes/sec 2147483648 bytes read in 38.081452 seconds 56391853.13 bytes/sec 2147483648 bytes read in 38.122105 seconds 56331717.46 bytes/sec 2147483648 bytes read in 38.303999 seconds 56064215.33 bytes/sec Total: 192.059901 seconds 279615967.4 (mumble) Ave: 38.4119802 seconds 55923193.47 bytes/sec Random tests: 2147483648 bytes read in 1744.002332 seconds 1231353.656 bytes/sec 2147483648 bytes read in 1744.797705 seconds 1230792.339 bytes/sec 2147483648 bytes read in 1741.577362 seconds 1233068.191 bytes/sec 2147483648 bytes read in 1741.497690 seconds 1233124.603 bytes/sec 2147483648 bytes read in 1739.773354 seconds 1234346.786 bytes/sec Total: 8711.648443 seconds 6162685.575 Ave: 1742.329689 seconds 1232537.115 bytes/sec So on this machine at that block I/O level (8kb block I believe it was) I have a ~55MB/sec Sequential Read rate and ~12MB/sec Random Read rate. Like I said though I'm fairly certain the randomread rates were worst case because of the particular block size in the configuration this system uses. But I feel that the results are respectable and valid nonetheless. Note how the random reads kept getting better... The ICP and drive caching firmware were starting to 'catch on' that this 2gb file was a hot spot so were preferring to cache things a little longer and pre-fetch in a different order than normal. I estimate that it would have dropped as low as 1700 if allowed to keep going. RAW output from my script... mloftis@free:/mnt/rz01/ml01/rndtst$ sh PAGECOST2GB.sh CREATING FILE Thu Apr 18 09:11:55 PDT 2002 Creating test file 2gb.test of 2048 mb 176.23 real 22.75 user 34.72 sys BEGINNING SEQUENTIAL TESTS Thu Apr 18 09:14:51 PDT 2002 Sequential read test of 2gb.test 2147483648 bytes read in 39.716158 seconds 39.73 real 1.52 user 23.87 sys Sequential read test of 2gb.test 2147483648 bytes read in 37.836187 seconds 37.83 real 1.44 user 23.68 sys Sequential read test of 2gb.test 2147483648 bytes read in 38.081452 seconds 38.08 real 1.62 user 23.51 sys Sequential read test of 2gb.test 2147483648 bytes read in 38.122105 seconds 38.12 real 1.63 user 23.50 sys Sequential read test of 2gb.test 2147483648 bytes read in 38.303999 seconds 38.30 real 1.32 user 23.83 sys Thu Apr 18 09:18:03 PDT 2002 BEGINNING RANDOM READ TESTS Random read test of 2gb.test for 2048 mb 2147483648 bytes read in 1744.002332 seconds 1744.01 real 4.33 user 36.47 sys Random read test of 2gb.test for 2048 mb 2147483648 bytes read in 1744.797705 seconds 1744.81 real 4.38 user 36.56 sys Random read test of 2gb.test for 2048 mb 2147483648 bytes read in 1741.577362 seconds 1741.58 real 4.58 user 36.18 sys Random read test of 2gb.test for 2048 mb 2147483648 bytes read in 1741.497690 seconds 1741.50 real 4.17 user 36.57 sys Random read test of 2gb.test for 2048 mb 2147483648 bytes read in 1739.773354 seconds 1739.78 real 4.41 user 36.36 sys TESTS COMPLETED Thu Apr 18 11:43:15 PDT 2002 Michael Loftis wrote: > Numbers being run on a BSD box now... 
> > FreeBSD 4.3-p27 512MB RAM 2xPiii600 Xeon ona 4 disk RAID 5 ARRAY on a > dedicated ICP Vortex card. Sorry no single drives on this box, I have > an outboard Silicon Gear Mercury on a motherboard based Adaptec > controller I can test as well. I'll post when the tests on the Vortex > are done. I'm using 2Gb files ATM, I'll look at the code and see if > it can be made to work with large files. Atleast for FreeBSD the > change will be mostly taking doing s/fseek/fseeko/g s/size_t/off_t/g > or something similar. FreeBSD seems ot prefer teh Open Unix standard > in this regard... > > This will make it usable for much larger test files. > > Tom Lane wrote: > >> Michael Loftis <mloftis@wgops.com> writes: >> >>> Somethings wrong with the random numbers from the sun... re-run >>> them, that first sample is insane.... Caching looks like it's >>> affecctign your results alot... >>> >> >> Yeah; it looks like the test case is not large enough to swamp out >> caching effects on the Sun box. It is on the Linux box, evidently, >> since the 10:1 ratio appears very repeatable. >> >> regards, tom lane >> > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org
On Thu, 18 Apr 2002, mlw wrote: > The days when "head movement" is relevant are long over. Not a single drive > sold today, or in the last 5 years, is a simple spindle/head system. .... > The assumption that sequentially reading a file from a modern disk drive means > that the head will move less often is largely bogus. Well, oddly enough, even with the head moving just as often, sequential I/O has always been much faster than random I/O on every drive I've owned in the past five years. So I guess I/O speed doesn't have a lot to do with head movement or something. Some of my drives have started to "chatter" quite noisily during random I/O, too. I thought that this was due to the head movement, but I guess not, since they're quite silent during sequential I/O. BTW, what sort of benchmarking did you do to determine that the head movement is similar during random and sequential I/O on drives in the last five years or so? cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
On Thu, 18 Apr 2002, Michael Loftis wrote:

> mlw wrote:
>
> >The supposed advantage of a sequential read over an random read, in
> >an active multitasking system, is a myth. If you are executing one
> >query and the system is doing only that query, you may be right.
> >
> >Execute a number of queries at the same time, the expected benefit
> >of a sequential scan goes out the window. The OS will be fetching
> >blocks, more or less, at random.

On a system that has neither read-ahead nor sorting of I/O requests, yes. Which systems are you using that provide neither of these facilities?

> In a multi-tasking system it's always cheaper to fetch less blocks, no
> matter where they are. Because, as you said, it will end up more or
> less random onf a system experiencing a larger number of queries.

Invariably a process or thread will lose its quantum when it submits an I/O request. (There's nothing left for it to do, since it's waiting for its data to be read, so there's nothing for it to execute.) It receives its next quantum when the data are available, and then it may begin processing the data. There are two possibilities at this point:

a) The process will complete its processing of the current blocks of data and submit an I/O request. In this case, you would certainly have seen better performance (assuming you're not CPU-bound--see below) had you read more, because you would have processed more in that quantum instead of stopping and waiting for more I/O.

b) In that quantum you cannot complete processing the blocks read because you don't have any more CPU time left. In this case there are two possibilities:

   i) You're CPU bound, in which case better disk performance makes no difference anyway, or

   ii) You are likely to find the blocks still in memory when you get your next quantum. (Unless you don't have enough memory in the system, in which case, you should fix that before you spend any more time or money on tuning disk performance.)

So basically, it's only cheaper to fetch fewer blocks all the time if you're doing large amounts of I/O and have relatively little memory. The latter case is getting more and more rare as time goes on. I'd say at this point that anybody interested in performance is likely to have at least 256 MB of memory, which means you're going to need a fairly large database and a lot of workload before that becomes the problem.

cjs
--
Curt Sampson  <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
On 04/18/2002 12:41:15 PM tycho wrote:
> > Don't know if the optimizer takes this into consideration, but a query that
> > uses a primary and/or unique key in the where-clause, should always choose to
> > use the related indices (assuming the table size is above a certain threshold).
> > Since a primary key/unique index always restricts the resultset to a single
> > row.....
>
> I don't think so.
>
> eg. table with primary key "pk", taking values from 1 to 1000000 (so
> 1000000 records)
>
> select * from table where pk > 5
>
> should probably not use the index ...

Oops, you're right of course. Rephrase the above as 'a query that uses a primary key to uniquely qualify a single row' (which pretty much restricts it to the = operator with a constant). Still, this is probably a fairly common case.....

Maarten

----

Maarten Boekhold, maarten.boekhold@reuters.com

Reuters Consulting
Dubai Media City
Building 1, 5th Floor
PO Box 1426
Dubai, United Arab Emirates
tel:+971(0)4 3918300 ext 249
fax:+971(0)4 3918333
mob:+971(0)505526539
At 10:48 AM 4/18/02 -0400, mlw wrote: >Bruce Momjian wrote: > > > > Have you tried reducing 'random_page_cost' in postgresql.conf. That > > should solve most of your problems if you would like more index scans. > >My random page cost is 1 :-) What happens when you set random page cost to 1? Between an index scan of 50% of a table and a full table scan which would the optimizer pick? With it at 1, what percentage would be the switchover point? Because I'm thinking that for _repeated_ queries when there is caching the random page cost for "small" selections may be very low after the first very costly select (may not be that costly for smart SCSI drives). So selecting 10% of a table randomly may not be that costly after the first select. Whereas for sequential scans 100% of the table must fit in the cache. If the cache is big enough then whichever results in selecting less should be faster ( noting that typically sequential RAM reads are faster than random RAM reads ). If the cache is not big enough then selecting less may be better up till the point where the total amount repeatedly selected cannot be cached, in which case sequential scans should be better. This is of course for queries in serial, not queries in parallel. How would one take these issues into account in an optimizer? Mark's problems with the optimizer seem to be something else tho: statistics off. >I had a database where I had to have "enable_seqscan=false" in the config >file. >The nature of the data always makes the statistics bogus, and it always >refused >to use the index. >My one most important experience (I've had more than one) with this whole >topic >is DMN's music database, when PostgreSQL uses the index, the query executes in >a fraction of a second. When "enable_seqscan=true" PostgreSQL refuses to use >the index, and the query takes a about a minute. No matter how much I analyze, >I have to disable sequential scan for the system to work correctly. I'm just wondering why not just use enable_seqscan=false for those problematic queries as a "hint"? Unless your query does need some seq scans as well? By the way, are updates treated the same as selects by the optimizer? Regards, Link.
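One way to locate the switchover point empirically is to widen a range predicate until EXPLAIN flips from the index plan to the seq scan. A sketch with a hypothetical million-row table "bigtable" keyed on pk:

  SET random_page_cost = 1;
  EXPLAIN SELECT * FROM bigtable WHERE pk < 1000;     -- small slice: expect the index scan
  EXPLAIN SELECT * FROM bigtable WHERE pk < 500000;   -- half the table: watch for the flip
  -- The predicate value where the plan changes is the break-even
  -- selectivity for the current cost settings.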
Lincoln Yeoh <lyeoh@pop.jaring.my> writes: > ...By the way, are updates treated the same as selects by the optimizer? Yeah. The writes must occur in any case, so I see no reason why the optimizer should worry about them. All it needs to consider are the cycles used by the various alternatives for fetching the data. So the problem is isomorphic to a SELECT. This assumption is really wired quite fundamentally into the optimizer, but I'm not sure if it's made clear anywhere in the documentation. Can anyone suggest the right place to describe it? regards, tom lane
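That assumption is easy to see from EXPLAIN itself: the plan printed for an UPDATE is just the plan for fetching the rows to be updated. A sketch with a hypothetical table:

  EXPLAIN UPDATE accounts SET balance = balance + 1 WHERE account_id = 42;
  -- The scan choice shown is the same one you would get for
  --   SELECT * FROM accounts WHERE account_id = 42;
  -- the write itself is not costed.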
Hi All.

I've been reading the whole thread and I want to add a few points:

You can set enable_seqscan=off for small or easy queries, but in large queries an index can speed up parts of the query and slow down others, so if you want Postgres to become a widely used DBMS the planner needs to be able to decide accurately. There is a point in the thread that might be useful: it would be very interesting if the planner could learn from previous executions. There could even be a warm-up policy to let the planner learn how the DB is working; this info could be stored with the DB data and could statistically show how index use or seq scans work out on every column of the DB.

I think it will be useful to hear all users and not be guided only by our own experience; the main objective is to make a versatile DBMS. It's easy to play down the need for better index usage by looking at single selects, but a lot of us are not doing single selects, so I think that point needs to be heard.

Regards
I will use this thread to raise a question I asked a while ago that was never resolved. I think it is very close to this thread, because we can't talk about indexes and seq scans without talking about index performance; I apologize if someone thinks it's off-topic.

Index usage has a big contention problem when running on large SMP boxes. Here are the results from running on 8 R10000 200MHz CPUs. I sized the database to be 100% cachable by the OS in order to compare in-memory performance of seq scans and index scans. I then reduced random_page_cost first to 0.5 and finally to 0.1 to force Postgres to use indexes. In both of those runs a single stream (on the left in the graph) is faster than with random_page_cost=4, but more than one stream results in a high contention rate.

These are results from TPC-H (a first stream of 22 queries followed by S parallel streams of the same queries in a different order, with refresh functions in progress). Orange shows CPU waiting for resources, which means stopped at a semaphore (odd, because all the queries are read-only). The first run is random_page_cost=4 (8 streams take less time than the first, unloaded stream), the second is 0.5 (the parallel part takes about twice as long as the first stream), the third is 0.1 (the parallel part takes five times as long as the first stream). I've marked where the first stream ends and the parallel test starts.
Lincoln Yeoh wrote: > At 10:48 AM 4/18/02 -0400, mlw wrote: > >Bruce Momjian wrote: > > > > > > Have you tried reducing 'random_page_cost' in postgresql.conf. That > > > should solve most of your problems if you would like more index scans. > > > >My random page cost is 1 :-) > > What happens when you set random page cost to 1? Between an index scan of > 50% of a table and a full table scan which would the optimizer pick? With > it at 1, what percentage would be the switchover point? > > Because I'm thinking that for _repeated_ queries when there is caching the > random page cost for "small" selections may be very low after the first > very costly select (may not be that costly for smart SCSI drives). So > selecting 10% of a table randomly may not be that costly after the first > select. Whereas for sequential scans 100% of the table must fit in the > cache. If the cache is big enough then whichever results in selecting less > should be faster ( noting that typically sequential RAM reads are faster > than random RAM reads ). If the cache is not big enough then selecting less > may be better up till the point where the total amount repeatedly selected > cannot be cached, in which case sequential scans should be better. This is > of course for queries in serial, not queries in parallel. How would one > take these issues into account in an optimizer? This is an interesting point, that an index scan may fit in the cache while a sequential scan may not. I can see cases where even a index scan of a large percentage of the table may win over an sequential scan. Interesting. Determining that, especially in a multi-user environment, is quite difficult. We do have 'effective_cache_size', which does try to determine how much of the I/O will have to go to disk and how much may fit in the cache, but it is quite a fuzzy number. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
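For reference, that fuzzy number is a settable parameter, and in 7.2 it is expressed in 8K disk pages, so the figure below (an assumption, not a recommendation) tells the planner to expect roughly 256MB of OS cache:

  SET effective_cache_size = 32768;   -- 32768 pages * 8KB = 256MB
  -- A larger value makes repeated index probes look cheaper, nudging the
  -- planner toward index scans for queries whose working set should stay
  -- cached between visits.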
Luis Alberto Amigo Navarro wrote: > Hi All. > I've been reading all the thread and I want to add a few points: > > You can set enable_seqscan=off in small or easy queries, but in large > queries index can speed parts of the query and slow other, so I think it is > neccesary if you want Postgres to become a Wide-used DBMS that the planner > could be able to decide accuratelly, in the thread there is a point that > might be useful, it will be very interesting that the planner could learn > with previous executions, even there could be a warm-up policy to let > planner learn about how the DB is working, this info could be stored with DB > data, and could statistically show how use of index or seqscan works on > every column of the DB. Yes, I have always felt it would be good to feed back information from the executor to the optimizer to help with later estimates. Of course, I never figured out how to do it. :-) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Lincoln Yeoh wrote:
>
> At 10:48 AM 4/18/02 -0400, mlw wrote:
> >Bruce Momjian wrote:
> > >
> > > Have you tried reducing 'random_page_cost' in postgresql.conf. That
> > > should solve most of your problems if you would like more index scans.
> >
> >My random page cost is 1 :-)
>
> What happens when you set random page cost to 1? Between an index scan of
> 50% of a table and a full table scan which would the optimizer pick? With
> it at 1, what percentage would be the switchover point?

I am no longer working on the project. Alas, the company is no more. Anyone want to buy it? :-)

> I'm just wondering why not just use enable_seqscan=false for those
> problematic queries as a "hint"? Unless your query does need some seq scans
> as well?

I am the architect, and thus only one of the developers. It was easier, and safer, to make sure sequential scans did not get executed on a global basis. It would be disastrous if the development version of the database did not do a sequential scan, but the live version did. (This did happen to us once; another moment of PostgreSQL-versus-index frustration.) The risk was minimal if a live query erroneously used an index, but the consequences, at least in our application, would be a 1-2 minute PostgreSQL query.
Curt Sampson wrote:
>
> On Thu, 18 Apr 2002, Michael Loftis wrote:
>
> > mlw wrote:
> >
> > >The supposed advantage of a sequential read over an random read, in
> > >an active multitasking system, is a myth. If you are executing one
> > >query and the system is doing only that query, you may be right.
> > >
> > >Execute a number of queries at the same time, the expected benefit
> > >of a sequential scan goes out the window. The OS will be fetching
> > >blocks, more or less, at random.
>
> On a system that has neither read-ahead nor sorting of I/O requests,
> yes. Which systems are you using that provide neither of these
> facilities?

This only happens if the OS can organize the I/O requests in such a manner. It is a non-trivial function.

> > In a multi-tasking system it's always cheaper to fetch less blocks, no
> > matter where they are. Because, as you said, it will end up more or
> > less random on a system experiencing a larger number of queries.
>
> Invariably a process or thread will lose its quantum when it submits
> an I/O request. (There's nothing left for it to do, since it's waiting
> for its data to be read, so there's nothing for it to execute.)

This statement is verifiably false. What a program does after it submits an I/O request is VERY OS and state specific. If an I/O request is made for a disk block which is in the read-ahead cache, a number of operating systems may return right away.
On Tue, 23 Apr 2002, mlw wrote:

> > On a system that has neither read-ahead nor sorting of I/O requests,
> > yes. Which systems are you using that provide neither of these
> > facilities?
>
> This only happens if the OS can organize the I/O requests in such a
> manner. It is a non-trivial function.

Well, if you call less than 200 lines of code (including lots of comments) "non-trivial," yes. Have a look at NetBSD's src/sys/kern/subr_disk.c for one example implementation.

But trivial or not, if all operating systems on which Postgres runs are doing this, your point is, well, pointless. So, once again, which systems are you using that do *not* do this?

> > Invariably a process or thread will lose its quantum when it submits
> > an I/O request. (There's nothing left for it to do, since it's waiting
> > for its data to be read, so there's nothing for it to execute.)
>
> This statement is verifiably false. What a program does after it
> submits an I/O request is VERY OS and state specific. If an I/O
> request is made for a disk block which is in the read-ahead cache, a
> number of operating systems may return right away.

Sorry, we were working at different levels. You are thinking of generating an I/O request on the logical level, via a system call. I was referring to generating a physical I/O request, which a logical I/O request may or may not do.

So if you would please go back and tackle my argument again, based on my clarifications above....

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
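For readers who haven't seen a "disksort" routine, the idea Curt points at is simply to keep the pending request queue ordered by block number so the head sweeps across the platter instead of bouncing around. The following is only a toy sketch, not NetBSD's actual subr_disk.c code; a real implementation also handles the elevator wrap-around so requests behind the current head position wait for the next sweep.

    /* Toy sketch of the "disksort" idea: keep pending I/O requests
     * ordered by starting block number.  Not NetBSD's actual code. */
    #include <stdio.h>
    #include <stdlib.h>

    struct ioreq {
        long          blkno;    /* starting disk block of the request */
        struct ioreq *next;
    };

    /* Insert req into a queue kept sorted by ascending block number. */
    static void disksort(struct ioreq **head, struct ioreq *req)
    {
        struct ioreq **p = head;

        while (*p != NULL && (*p)->blkno <= req->blkno)
            p = &(*p)->next;
        req->next = *p;
        *p = req;
    }

    int main(void)
    {
        long          blknos[] = { 900, 12, 450, 7, 451 };
        struct ioreq *q = NULL;
        int           i;

        for (i = 0; i < 5; i++) {
            struct ioreq *r = malloc(sizeof(*r));
            r->blkno = blknos[i];
            disksort(&q, r);
        }
        for (; q != NULL; q = q->next)
            printf("%ld ", q->blkno);   /* prints: 7 12 450 451 900 */
        printf("\n");
        return 0;
    }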
At 12:41 PM 4/23/02 -0400, Bruce Momjian wrote:
>This is an interesting point, that an index scan may fit in the cache
>while a sequential scan may not. I can see cases where even an index
>scan of a large percentage of the table may win over a sequential scan.
>Interesting.

Yes, and if it fits in the cache the random access costs drop by orders of magnitude, as shown by a recent benchmark someone posted where a Solaris box cached gigs of data [1].

That's why it might be useful to know what the crossover points for index scans vs sequential scans are for various random page cost values, e.g. random page cost set to 1 means the optimizer will use a sequential scan if it thinks an index scan will return 50% or more of the rows; set to 0.5, for 75% or more, and so on. That's probably very simplistic, but basically some idea of what the optimizer will do given a random page cost could be helpful.

Thanks,
Link.

[1] Mark Pritchard's benchmark, where you can see that from the 3rd try onwards random is actually faster than sequential after caching (TWICE as fast too!).

Machine = jedi (Sun E420, 3gb ram, dual 400s, test on single scsi disk)

    Sequential
    Bytes Read      Time     Bytes / Sec
    2097152000      65.19    32167675.28
    2097152000      65.22    32154114.65
    2097152000      65.16    32182561.99
    2097152000      65.12    32206105.12
    2097152000      64.67    32429463.26
                             32227984.06 (avg)

    Random
    Bytes Read      Time     Bytes / Sec
    4194304000      1522.22   2755394.79
    4194304000       278.18  15077622.05
    4194304000        91.43  45874730.07
    4194304000        61.43  68273795.19
    4194304000        54.55  76890231.51
                             41774354.72 (avg)
At 12:41 PM 4/23/02 -0400, Bruce Momjian wrote:

>This is an interesting point, that an index scan may fit in the cache
>while a sequential scan may not.

If so, I would expect that the number of pages read is significantly smaller than it was with a sequential scan. If that's the case, doesn't that mean that the optimizer made the wrong choice anyway?

BTW, I just did a quick walk down this chain of code to see what happens during a sequential scan:

    access/heap/heapam.c
    storage/buffer/bufmgr.c
    storage/smgr/smgr.c
    storage/smgr/md.c

and it looks to me like individual reads are being done in BLCKSZ chunks, whether we're scanning or not.

During a sequential scan, I've heard that it's more efficient to read in multiples of your blocksize, say, 64K chunks rather than 8K chunks, for each read operation you pass to the OS. Does anybody have any experience to know if this is indeed the case? Has anybody ever added this to postgresql and benchmarked it?

Certainly if there's a transaction-based limit on disk I/O, as well as a throughput limit, it would be better to read in larger chunks.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
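As a concrete illustration of the question raised above, here is a rough sketch (not PostgreSQL code; the file name is a placeholder) that reads 64K of a file first as eight block-sized requests and then, for the next 64K, as a single large request, timing each. The larger request issues one syscall instead of eight; whether that matters in the presence of kernel read-ahead is exactly the point under debate.

    /* Sketch: eight 8K reads vs. one 64K read of the next part of a file.
     * "big_file1" is a placeholder; use a file much larger than RAM (or a
     * raw device) if you want to avoid measuring the buffer cache. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    static double seconds(struct timeval a, struct timeval b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
    }

    int main(void)
    {
        char           buf[8 * BLCKSZ];
        struct timeval t0, t1;
        int            i, fd = open("big_file1", O_RDONLY);

        if (fd < 0) { perror("open"); return 1; }

        gettimeofday(&t0, NULL);
        for (i = 0; i < 8; i++)              /* eight block-sized requests */
            read(fd, buf, BLCKSZ);
        gettimeofday(&t1, NULL);
        printf("8 x 8K reads: %.6f s\n", seconds(t0, t1));

        gettimeofday(&t0, NULL);
        read(fd, buf, sizeof(buf));          /* one 64K request, next 64K of file */
        gettimeofday(&t1, NULL);
        printf("1 x 64K read: %.6f s\n", seconds(t0, t1));

        close(fd);
        return 0;
    }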
Curt Sampson wrote: > > On Tue, 23 Apr 2002, mlw wrote: > > > > On a system that has neither read-ahead nor sorting of I/O requests, > > > yes. Which systems are you using that provide neither of these > > > facilities? > > > > This only happens if the OS can organize the I/O requests in such a manner. It > > is a non-trivial function. > > Well, if you call less than 200 lines of code (including lots of > comments), "non-trivial," yes. Have a look at NetBSD's > src/sys/kern/subr_disk.c for one example implementation. > > But trivial or not, if all operating systems on which Postgres runs > are doing this, your point is, well, pointless. So, once again, which > systems are you using that do *not* do this? I am not arguing about whether or not they do it, I am saying it is not always possible. I/O requests do not remain in queue waiting for reordering indefinitely.
Curt Sampson wrote: > At 12:41 PM 4/23/02 -0400, Bruce Momjian wrote: > > >This is an interesting point, that an index scan may fit in the cache > >while a sequential scan may not. > > If so, I would expect that the number of pages read is significantly > smaller than it was with a sequential scan. If that's the case, > doesn't that mean that the optimizer made the wrong choice anyway? > > BTW, I just did a quick walk down this chain of code to see what happens > during a sequential scan: > > access/heap/heapam.c > storage/buffer/bufmgr.c > storage/smgr/smgr.c > storage/smgr/md.c > > and it looks to me like individual reads are being done in BLKSIZE > chunks, whether we're scanning or not. > > During a sequential scan, I've heard that it's more efficient to > read in multiples of your blocksize, say, 64K chunks rather than > 8K chunks, for each read operation you pass to the OS. Does anybody > have any experience to know if this is indeed the case? Has anybody > ever added this to postgresql and benchmarked it? > > Certainly if there's a transaction based limit on disk I/O, as well > as a throughput limit, it would be better to read in larger chunks. We expect the file system to do re-aheads during a sequential scan. This will not happen if someone else is also reading buffers from that table in another place. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
I was thinking in something independent from the executor, simply a variable that recommends or not the use of a particular index, it could be obtained from user, and so it could be improved(a factor lower than 1) on planner. How about something like this? ----- Original Message ----- From: "Bruce Momjian" <pgman@candle.pha.pa.us> To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> Cc: "Lincoln Yeoh" <lyeoh@pop.jaring.my>; "Tom Lane" <tgl@sss.pgh.pa.us>; "mlw" <markw@mohawksoft.com>; "Andrew Sullivan" <andrew@libertyrms.info>; "PostgreSQL-development" <pgsql-hackers@postgresql.org> Sent: Tuesday, April 23, 2002 6:42 PM Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE > Luis Alberto Amigo Navarro wrote: > > Hi All. > > I've been reading all the thread and I want to add a few points: > > > > You can set enable_seqscan=off in small or easy queries, but in large > > queries index can speed parts of the query and slow other, so I think it is > > neccesary if you want Postgres to become a Wide-used DBMS that the planner > > could be able to decide accuratelly, in the thread there is a point that > > might be useful, it will be very interesting that the planner could learn > > with previous executions, even there could be a warm-up policy to let > > planner learn about how the DB is working, this info could be stored with DB > > data, and could statistically show how use of index or seqscan works on > > every column of the DB. > > Yes, I have always felt it would be good to feed back information from > the executor to the optimizer to help with later estimates. Of course, > I never figured out how to do it. :-) > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 >
On Wed, 24 Apr 2002, Bruce Momjian wrote: > We expect the file system to do re-aheads during a sequential scan. > This will not happen if someone else is also reading buffers from that > table in another place. Right. The essential difficulties are, as I see it: 1. Not all systems do readahead. 2. Even systems that do do it cannot always reliably detect that they need to. 3. Even when the read-ahead does occur, you're still doing more syscalls, and thus more expensive kernel/userland transitions,than you have to. Has anybody considered writing a storage manager that uses raw partitions and deals with its own buffer caching? This has the potential to be a lot more efficient, since the database server knows much more about its workload than the operating system can guess. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
Curt Sampson wrote: > On Wed, 24 Apr 2002, Bruce Momjian wrote: > > > We expect the file system to do re-aheads during a sequential scan. > > This will not happen if someone else is also reading buffers from that > > table in another place. > > Right. The essential difficulties are, as I see it: > > 1. Not all systems do readahead. If they don't, that isn't our problem. We expect it to be there, and if it isn't, the vendor/kernel is at fault. > 2. Even systems that do do it cannot always reliably detect that > they need to. Yes, seek() in file will turn off read-ahead. Grabbing bigger chunks would help here, but if you have two people already reading from the same file, grabbing bigger chunks of the file may not be optimal. > 3. Even when the read-ahead does occur, you're still doing more > syscalls, and thus more expensive kernel/userland transitions, than > you have to. I would guess the performance impact is minimal. > Has anybody considered writing a storage manager that uses raw > partitions and deals with its own buffer caching? This has the potential > to be a lot more efficient, since the database server knows much more > about its workload than the operating system can guess. We have talked about it, but rejected it. Look in TODO.detail in optimizer and performance for 'raw'. Also interesting info there about optimizer cost estimates we have been talking about. Specificially see: http://candle.pha.pa.us/mhonarc/todo.detail/performance/msg00009.html Also see: http://candle.pha.pa.us/mhonarc/todo.detail/optimizer/msg00011.html -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Wed, 24 Apr 2002, Bruce Momjian wrote:

> > 1. Not all systems do readahead.
>
> If they don't, that isn't our problem. We expect it to be there, and if
> it isn't, the vendor/kernel is at fault.

It is your problem when another database kicks Postgres' ass performance-wise. And at that point, *you're* at fault. You're the one who's knowingly decided to do things inefficiently.

Sorry if this sounds harsh, but this "Oh, someone else is to blame" attitude gets me steamed. It's one thing to say, "We don't support this." That's fine; there are often good reasons for that. It's a completely different thing to say, "It's an unrelated entity's fault we don't support this."

At any rate, relying on the kernel to guess how to optimise for the workload will never work as well as software that knows the workload doing the optimization.

The lack of support thing is no joke. Sure, lots of systems nowadays support a unified buffer cache and read-ahead. But how many, besides Solaris, support free-behind, which is also very important to avoid blowing out your buffer cache when doing sequential reads? And who at all supports read-ahead for reverse scans? (Or does Postgres not do those, anyway? I can see the support is there.)

And even when the facilities are there, you create problems by using them. Look at the OS buffer cache, for example. Not only do we lose efficiency by using two layers of caching, but (as people have pointed out recently on the lists) the optimizer can't even know how much or what is being cached, and thus can't make decisions based on that.

> Yes, seek() in file will turn off read-ahead. Grabbing bigger chunks
> would help here, but if you have two people already reading from the
> same file, grabbing bigger chunks of the file may not be optimal.

Grabbing bigger chunks is always optimal, AFICT, if they're not *too* big and you use the data. A single 64K read takes very little longer than a single 8K read.

> > 3. Even when the read-ahead does occur, you're still doing more
> > syscalls, and thus more expensive kernel/userland transitions, than
> > you have to.
>
> I would guess the performance impact is minimal.

If it were minimal, people wouldn't work so hard to build multi-level thread systems, where multiple userland threads are scheduled on top of kernel threads.

However, it does depend on how much CPU your particular application is using. You may have it to spare.

> http://candle.pha.pa.us/mhonarc/todo.detail/performance/msg00009.html

Well, this message has some points in it that I feel are just incorrect.

1. It is *not* true that you have no idea where data is when using a storage array or other similar system. While you certainly ought not worry about things such as head positions and so on, it's been a given for a long, long time that two blocks that have close index numbers are going to be close together in physical storage.

2. Raw devices are quite standard across Unix systems (except in the unfortunate case of Linux, which I think has been remedied, hasn't it?). They're very portable, and have just as well--if not better--defined write semantics as a filesystem.

3. My observations of OS performance tuning over the past six or eight years contradict the statement, "There's a considerable cost in complexity and code in using "raw" storage too, and it's not a one off cost: as the technologies change, the "fast" way to do things will change and the code will have to be updated to match." While optimizations have been removed over the years, the basic optimizations (order reads by block number, do larger reads rather than smaller, cache the data) have remained unchanged for a long, long time.

4. "Better to leave this to the OS vendor where possible, and take advantage of the tuning they do." Well, sorry guys, but have a look at the tuning they do. It hasn't changed in years, except to remove now-unnecessary complexity related to really, really old and slow disk devices, and to add a few things that guess at the workload but still do a worse job than if the workload generator just did its own optimisations in the first place.

> http://candle.pha.pa.us/mhonarc/todo.detail/optimizer/msg00011.html

Well, this one, with statements like "Postgres does have control over its buffer cache," I don't know what to say. You can interpret the statement however you like, but in the end Postgres has very little control at all over how data is moved between memory and disk.

BTW, please don't take me as saying that all control over physical IO should be done by Postgres. I just think that Postgres could do a better job of managing data transfer between disk and memory than the OS can. The rest of the things (using raw partitions, read-ahead, free-behind, etc.) just drop out of that one idea.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
Curt Sampson <cjs@cynic.net> writes: > Grabbing bigger chunks is always optimal, AFICT, if they're not > *too* big and you use the data. A single 64K read takes very little > longer than a single 8K read. Proof? regards, tom lane
> Curt Sampson <cjs@cynic.net> writes:
> > Grabbing bigger chunks is always optimal, AFICT, if they're not
> > *too* big and you use the data. A single 64K read takes very little
> > longer than a single 8K read.
>
> Proof?

A long time ago I tested with a 32k block size and got a 1.5-2x speedup over the ordinary 8k block size in the sequential scan case. FYI, in case it is relevant here.
--
Tatsuo Ishii
Well, this is a very interesting email. Let me comment on some points. --------------------------------------------------------------------------- Curt Sampson wrote: > On Wed, 24 Apr 2002, Bruce Momjian wrote: > > > > 1. Not all systems do readahead. > > > > If they don't, that isn't our problem. We expect it to be there, and if > > it isn't, the vendor/kernel is at fault. > > It is your problem when another database kicks Postgres' ass > performance-wise. > > And at that point, *you're* at fault. You're the one who's knowingly > decided to do things inefficiently. It is just hard to imagine an OS not doing read-ahead, at least in simple cases. > Sorry if this sounds harsh, but this, "Oh, someone else is to blame" > attitude gets me steamed. It's one thing to say, "We don't support > this." That's fine; there are often good reasons for that. It's a > completely different thing to say, "It's an unrelated entity's fault we > don't support this." Well, we are guilty of trying to push as much as possible on to other software. We do this for portability reasons, and because we think our time is best spent dealing with db issues, not issues then can be deal with by other existing software, as long as the software is decent. > At any rate, relying on the kernel to guess how to optimise for > the workload will never work as well as well as the software that > knows the workload doing the optimization. Sure, that is certainly true. However, it is hard to know what the future will hold even if we had perfect knowledge of what was happening in the kernel. We don't know who else is going to start doing I/O once our I/O starts. We may have a better idea with kernel knowledge, but we still don't know 100% what will be cached. > The lack of support thing is no joke. Sure, lots of systems nowadays > support unified buffer cache and read-ahead. But how many, besides > Solaris, support free-behind, which is also very important to avoid We have free-behind on our list. I think LRU-K will do this quite well and be a nice general solution for more than just sequential scans. > blowing out your buffer cache when doing sequential reads? And who > at all supports read-ahead for reverse scans? (Or does Postgres > not do those, anyway? I can see the support is there.) > > And even when the facilities are there, you create problems by > using them. Look at the OS buffer cache, for example. Not only do > we lose efficiency by using two layers of caching, but (as people > have pointed out recently on the lists), the optimizer can't even > know how much or what is being cached, and thus can't make decisions > based on that. Again, are you going to know 100% anyway? > > > Yes, seek() in file will turn off read-ahead. Grabbing bigger chunks > > would help here, but if you have two people already reading from the > > same file, grabbing bigger chunks of the file may not be optimal. > > Grabbing bigger chunks is always optimal, AFICT, if they're not > *too* big and you use the data. A single 64K read takes very little > longer than a single 8K read. There may be validity in this. It is easy to do (I think) and could be a win. > > > 3. Even when the read-ahead does occur, you're still doing more > > > syscalls, and thus more expensive kernel/userland transitions, than > > > you have to. > > > > I would guess the performance impact is minimal. > > If it were minimal, people wouldn't work so hard to build multi-level > thread systems, where multiple userland threads are scheduled on > top of kernel threads. 
> > However, it does depend on how much CPU your particular application > is using. You may have it to spare. I assume those apps are doing tons of kernel calls. I don't think we really do that many. > > http://candle.pha.pa.us/mhonarc/todo.detail/performance/msg00009.html > > Well, this message has some points in it that I feel are just incorrect. > > 1. It is *not* true that you have no idea where data is when > using a storage array or other similar system. While you > certainly ought not worry about things such as head positions > and so on, it's been a given for a long, long time that two > blocks that have close index numbers are going to be close > together in physical storage. SCSI drivers, for example, are pretty smart. Not sure we can take advantage of that from user-land I/O. > 2. Raw devices are quite standard across Unix systems (except > in the unfortunate case of Linux, which I think has been > remedied, hasn't it?). They're very portable, and have just as > well--if not better--defined write semantics as a filesystem. Yes, but we are seeing some db's moving away from raw I/O. Our performance numbers beat most of the big db's already, so we must be doing something right. In fact, our big failing is more is missing features and limitations of our db, rather than performance. > 3. My observations of OS performance tuning over the past six > or eight years contradict the statement, "There's a considerable > cost in complexity and code in using "raw" storage too, and > it's not a one off cost: as the technologies change, the "fast" > way to do things will change and the code will have to be > updated to match." While optimizations have been removed over > the years the basic optimizations (order reads by block number, > do larger reads rather than smaller, cache the data) have > remained unchanged for a long, long time. Yes, but do we spend our time doing that. Is the payoff worth it, vs. working on other features. Sure it would be great to have all these fancy things, but is this where our time should be spent, considering other items on the TODO list? > 4. "Better to leave this to the OS vendor where possible, and > take advantage of the tuning they do." Well, sorry guys, but > have a look at the tuning they do. It hasn't changed in years, > except to remove now-unnecessary complexity realated to really, > really old and slow disk devices, and to add a few thing that > guess workload but still do a worse job than if the workload > generator just did its own optimisations in the first place. > > > http://candle.pha.pa.us/mhonarc/todo.detail/optimizer/msg00011.html > > Well, this one, with statements like "Postgres does have control > over its buffer cache," I don't know what to say. You can interpret > the statement however you like, but in the end Postgres very little > control at all over how data is moved between memory and disk. > > BTW, please don't take me as saying that all control over physical > IO should be done by Postgres. I just think that Posgres could do > a better job of managing data transfer between disk and memory than > the OS can. The rest of the things (using raw paritions, read-ahead, > free-behind, etc.) just drop out of that one idea. Yes, clearly there is benefit in these, and some of them, like free-behind, have already been tested, though not committed. Jumping in and doing the I/O ourselves is a big undertaking, and looking at our TODO list, I am not sure if it is worth it right now. 
Of course, if we had 4 TODO items, I would be much more interested in at least trying to see how much gain we could get.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
On Wed, 24 Apr 2002, mlw wrote:

> I am not arguing about whether or not they do it, I am saying it is
> not always possible. I/O requests do not remain in queue waiting for
> reordering indefinitely.

It doesn't matter. When they go out to the disk they go out in order. On every Unix-based OS I know of, and Novell Netware, if you submit a single read request for consecutive blocks, those blocks *will* be read sequentially, no matter what the system load.

So to get back to the original argument:

> > >The supposed advantage of a sequential read over an random read, in
> > >an active multitasking system, is a myth. If you are executing one
> > >query and the system is doing only that query, you may be right.

No, it's very real, because your sequential read will not be broken up.

If you think it will, let me know which operating systems this happens on, and how exactly it happens.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
Tom Lane wrote:

>Curt Sampson <cjs@cynic.net> writes:
>
>>Grabbing bigger chunks is always optimal, AFICT, if they're not
>>*too* big and you use the data. A single 64K read takes very little
>>longer than a single 8K read.
>
>Proof?

I'd qualify that statement: it's optimal up to a point. I know that my system settles into its best read speeds at 32K or 64K chunks. 8K chunks are far below optimal for my system. Most systems I work on do far better at 16K than at 8K, and most don't see any degradation when going to 32K chunks. (This is across numerous OSes and configs -- the results are interpretations of bonnie disk I/O marks.)

Depending on what you're doing, it is more efficient to read bigger blocks, up to a point. If you're multi-threaded or reading in non-blocking mode, take as big a chunk as you can handle or are ready to process in quick order. If you're picking up a bunch of little chunks here and there and know you're not using them again, then choose a size that will hopefully cause some of the reads to overlap; failing that, pick the smallest usable read size. The OS can never do that stuff for you.
A block-sized read will not be broken up. But if you're reading in a size bigger than the underlying system's block size, then it can get broken up.

So yes, a sequential read will get broken up. A single read request for a block may or may not get broken up. If you're fread()ing with set block sizes you'll see the set sizes of blocks come through, but what the underlying OS does is undefined; the same goes for writing. If the underlying block size is 8KB and you dump 4MB down on it, the OS may (and in many cases does) decide to write part of it, do a read on a nearby sector, then write the rest. This happens when doing long writes that end up spanning block groups, because the inodes must be allocated. So the write WILL get broken up. Reads are under the same gun. It all depends on how big. The application may or may not see this (probably not, unless you're set non-blocking, because the kernel will just sleep you until your data is ready).

Further, large read requests can of course be re-ordered by hardware: Tagged Command Queueing on SCSI drives and RAIDs. The ICP Vortex cards I use in a number of systems have 64MB of on-board cache. They quite happily, and often, re-order reads and writes when queueing them to keep things moving as fast as possible. (Intel didn't buy them for their cards, they use the i960 as it is; Intel swiped them for their IP rights.) The OS also tags commands it fires to the ICP, which can be re-ordered in block-sized chunks.

Curt Sampson wrote:

>On Wed, 24 Apr 2002, mlw wrote:
>
>>I am not arguing about whether or not they do it, I am saying it is
>>not always possible. I/O requests do not remain in queue waiting for
>>reordering indefinitely.
>
>It doesn't matter. When they go out to the disk they go out in
>order. On every Unix-based OS I know of, and Novell Netware, if
>you submit a single read request for consecutive blocks, those
>blocks *will* be read sequentially, no matter what the system load.
>
>So to get back to the original argument:
>
>>>>The supposed advantage of a sequential read over an random read, in
>>>>an active multitasking system, is a myth. If you are executing one
>>>>query and the system is doing only that query, you may be right.
>
>No, it's very real, because your sequential read will not be broken up.
>
>If you think it will, let me know which operating systems this
>happens on, and how exactly it happens.
>
>cjs
On Wed, 24 Apr 2002, Michael Loftis wrote: > A Block-sized read will not be rboken up. But if you're reading ina > size bigger than the underlying systems block sizes then it can get > broken up. In which operating systems, and under what circumstances? I'll agree that some OSs may not coalesce adjacent reads into a single read, but even so, submitting a bunch of single reads for consecutive blocks is going to be much, much faster than if other, random I/O occured between those reads. > If the underlying > block size is 8KB and you dump 4MB down on it, the OS may (and in many > cases does) decide to write part of it, do a read ona nearby sector, > then write the rest. This happens when doing long writes that end up > spanning block groups because the inodes must be allocated. Um...we're talking about 64K vs 8K reads here, not 4 MB reads. I am certainly not suggesting Posgres ever submit 4 MB read requests to the OS. I agree that any single-chunk reads or writes that cause non-adjacent disk blocks to be accessed may be broken up. But in my sense, they're "broken up" anyway, in that you have no choice but to take a performance hit. > Further large read requests can of course be re-ordered by hardware. > ...The OS also tags ICP, which can be re-ordered on block-sized chunks. Right. All the more reason to read in larger chunks when we know what we need in advance, because that will give the OS, controllers, etc. more advance information, and let them do the reads more efficiently. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
On Wed, 24 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > Grabbing bigger chunks is always optimal, AFICT, if they're not
> > *too* big and you use the data. A single 64K read takes very little
> > longer than a single 8K read.
>
> Proof?

Well, there are various sorts of "proof" for this assertion. What sort do you want? Here's a few samples; if you're looking for something different to satisfy you, let's discuss it.

1. Theoretical proof: two components of the delay in retrieving a block from disk are the disk arm movement and the wait for the right block to rotate under the head.

When retrieving, say, eight adjacent blocks, these will be spread across no more than two cylinders (with luck, only one). The worst-case access time for a single block is the disk arm movement plus the full rotational wait; this is the same as the worst case for eight blocks if they're all on one cylinder. If they're not on one cylinder, they're still on adjacent cylinders, requiring a very short seek.

2. Proof by others using it: SQL Server uses 64K reads when doing table scans, as they say that their research indicates that the major limitation is usually the number of I/O requests, not the I/O capacity of the disk. BSD's FFS explicitly separates the optimum allocation size for storage (1K fragments) and the optimum read size (8K blocks) because they found performance to be much better when a larger block size was read. Most file system vendors, too, do read-ahead for this very reason.

3. Proof by testing. I wrote a little ruby program to seek to a random point in the first 2 GB of my raw disk partition and read 1-8 8K blocks of data. (This was done as one I/O request.) (Using the raw disk partition I avoid any filesystem buffering.) Here are typical results:

    125 reads of 16x8K blocks:  1.9 sec,  66.04 req/sec.  15.1 ms/req, 0.946 ms/block
    250 reads of 8x8K blocks:   1.9 sec, 132.3  req/sec.   7.56 ms/req, 0.945 ms/block
    500 reads of 4x8K blocks:   2.5 sec, 199    req/sec.   5.03 ms/req, 1.26 ms/block
    1000 reads of 2x8K blocks:  3.8 sec, 261.6  req/sec.   3.82 ms/req, 1.91 ms/block
    2000 reads of 1x8K blocks:  6.4 sec, 310.4  req/sec.   3.22 ms/req, 3.22 ms/block

The ratios of data retrieval speed per read for groups of adjacent 8K blocks, assuming a single 8K block reads in 1 time unit, are:

    1 block     1.00
    2 blocks    1.18
    4 blocks    1.56
    8 blocks    2.34
    16 blocks   4.68

At less than 20% more expensive, certainly two-block read requests could be considered to cost "very little more" than one-block read requests. Even four-block read requests are only half again as expensive. And if you know you're really going to be using the data, read in 8-block chunks and your cost per block (in terms of time) drops to less than a third of the cost of single-block reads.

Let me put paid to comments about multiple simultaneous readers making this invalid. Here's a typical result I get with four instances of the program running simultaneously:

    125 reads of 16x8K blocks:  4.4 sec,  28.21 req/sec.  35.4 ms/req, 2.22 ms/block
    250 reads of 8x8K blocks:   3.9 sec,  64.88 req/sec.  15.4 ms/req, 1.93 ms/block
    500 reads of 4x8K blocks:   5.8 sec,  86.52 req/sec.  11.6 ms/req, 2.89 ms/block
    1000 reads of 2x8K blocks:  10 sec,  100.2  req/sec.   9.98 ms/req, 4.99 ms/block
    2000 reads of 1x8K blocks:  18 sec,  110    req/sec.   9.09 ms/req, 9.09 ms/block

Here's the ratio table again, with another column comparing the aggregate number of requests per second for one process and four processes:

    1 block     1.00    310 : 440
    2 blocks    1.10    262 : 401
    4 blocks    1.28    199 : 346
    8 blocks    1.69    132 : 260
    16 blocks   3.89     66 : 113

Note that here the relative increase in performance for increasing sizes of reads is even *better* until we get past 64K chunks. The overall throughput is better, of course, because with more requests per second coming in, the disk seek ordering code has more to work with and the average time spent seeking vs. reading will be reduced.

You know, this is not rocket science; I'm sure there must be papers all over the place about this. If anybody still disagrees that it's a good thing to read chunks up to 64K or so when the blocks are adjacent and you know you'll need the data, I'd like to see some tangible evidence to support that.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
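Curt's Ruby program itself isn't in the thread; a rough C equivalent of the test he describes (pick a random 8K-aligned offset in the first 2 GB of a raw partition, then fetch N adjacent 8K blocks with a single read()) might look like the sketch below. The device path and the default counts are placeholders, not values from his test.

    /* Sketch of the benchmark described above: random-offset reads of
     * N contiguous 8K blocks, issued as one read() each.  The device
     * path is a placeholder; point it at a raw partition (or a file much
     * larger than RAM) so the buffer cache doesn't hide the I/O cost. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define BLCKSZ  8192L
    #define NBLOCKS 262144L          /* 2 GB / 8K */

    int main(int argc, char **argv)
    {
        int            blocks_per_read = (argc > 1) ? atoi(argv[1]) : 8;
        int            nreads = (argc > 2) ? atoi(argv[2]) : 250;
        char          *buf = malloc(blocks_per_read * BLCKSZ);
        int            i, fd = open("/dev/rsd0d", O_RDONLY);   /* placeholder */
        struct timeval t0, t1;
        double         sec;

        if (fd < 0 || buf == NULL) { perror("setup"); return 1; }
        srandom(getpid());

        gettimeofday(&t0, NULL);
        for (i = 0; i < nreads; i++) {
            off_t off = (off_t) (random() % (NBLOCKS - blocks_per_read)) * BLCKSZ;
            lseek(fd, off, SEEK_SET);
            read(fd, buf, blocks_per_read * BLCKSZ);
        }
        gettimeofday(&t1, NULL);

        sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%d reads of %dx8K blocks: %.2f sec, %.2f req/sec\n",
               nreads, blocks_per_read, sec, nreads / sec);
        return 0;
    }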
On Thu, 25 Apr 2002, Bruce Momjian wrote: > Well, we are guilty of trying to push as much as possible on to other > software. We do this for portability reasons, and because we think our > time is best spent dealing with db issues, not issues then can be deal > with by other existing software, as long as the software is decent. That's fine. I think that's a perfectly fair thing to do. It was just the wording (i.e., "it's this other software's fault that blah de blah") that got to me. To say, "We don't do readahead becase most OSes supply it, and we feel that other things would help more to improve performance," is fine by me. Or even, "Well, nobody feels like doing it. You want it, do it yourself," I have no problem with. > Sure, that is certainly true. However, it is hard to know what the > future will hold even if we had perfect knowledge of what was happening > in the kernel. We don't know who else is going to start doing I/O once > our I/O starts. We may have a better idea with kernel knowledge, but we > still don't know 100% what will be cached. Well, we do if we use raw devices and do our own caching, using pages that are pinned in RAM. That was sort of what I was aiming at for the long run. > We have free-behind on our list. Uh...can't do it, if you're relying on the OS to do the buffering. How do you tell the OS that you're no longer going to use a page? > I think LRU-K will do this quite well > and be a nice general solution for more than just sequential scans. LRU-K sounds like a great idea to me, as does putting pages read for a table scan at the LRU end of the cache, rather than the MRU (assuming we do something to ensure that they stay in cache until read once, at any rate). But again, great for your own cache, but doesn't work with the OS cache. And I'm a bit scared to crank up too high the amount of memory I give Postgres, lest the OS try to too aggressively buffer all that I/O in what memory remains to it, and start blowing programs (like maybe the backend binary itself) out of RAM. But maybe this isn't typically a problem; I don't know. > There may be validity in this. It is easy to do (I think) and could be > a win. It didn't look to difficult to me, when I looked at the code, and you can see what kind of win it is from the response I just made to Tom. > > 1. It is *not* true that you have no idea where data is when > > using a storage array or other similar system. While you > > certainly ought not worry about things such as head positions > > and so on, it's been a given for a long, long time that two > > blocks that have close index numbers are going to be close > > together in physical storage. > > SCSI drivers, for example, are pretty smart. Not sure we can take > advantage of that from user-land I/O. Looking at the NetBSD ones, I don't see what they're doing that's so smart. (Aside from some awfully clever workarounds for stupid hardware limitations that would otherwise kill performance.) What sorts of "smart" are you referring to? > Yes, but we are seeing some db's moving away from raw I/O. Such as whom? And are you certain that they're moving to using the OS buffer cache, too? MS SQL server, for example, uses the filesystem, but turns off all buffering on those files. > Our performance numbers beat most of the big db's already, so we must > be doing something right. Really? Do the performance numbers for simple, bulk operations (imports, exports, table scans) beat the others handily? My intuition says not, but I'll happily be convinced otherwise. 
> Yes, but do we spend our time doing that. Is the payoff worth it, vs.
> working on other features. Sure it would be great to have all these
> fancy things, but is this where our time should be spent, considering
> other items on the TODO list?

I agree that these things need to be assessed.

> Jumping in and doing the I/O ourselves is a big undertaking, and looking
> at our TODO list, I am not sure if it is worth it right now.

Right. I'm not trying to say this is a critical priority, I'm just trying to determine what we do right now, what we could do, and the potential performance increase that would give us.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
At 12:19 PM 4/25/02 +0900, Curt Sampson wrote: >Grabbing bigger chunks is always optimal, AFICT, if they're not >*too* big and you use the data. A single 64K read takes very little >longer than a single 8K read. Yes I agree that if sequential scans are done reading ahead helps. And often doesn't cost much more- whilst waiting for the first block you ask for sometimes the other blocks are going to spin past first and often the subsystems will read and cache them anyway. At least that was what a disk caching program I wrote years ago did (it had a track cache and an O/S metadata cache[1]), I'm sure most modern HDDs will do the track caching amongst even more advanced stuff. > 3. My observations of OS performance tuning over the past six > or eight years contradict the statement, "There's a considerable > cost in complexity and code in using "raw" storage too, and > it's not a one off cost: as the technologies change, the "fast" > way to do things will change and the code will have to be > updated to match." While optimizations have been removed over > the years the basic optimizations (order reads by block number, > do larger reads rather than smaller, cache the data) have > remained unchanged for a long, long time. >BTW, please don't take me as saying that all control over physical >IO should be done by Postgres. I just think that Posgres could do >a better job of managing data transfer between disk and memory than >the OS can. The rest of the things (using raw paritions, read-ahead, >free-behind, etc.) just drop out of that one idea. I think the raw partitions will be more trouble than they are worth. Reading larger chunks at appropriate circumstances seems to be the "low hanging fruit". If postgresql prefers sequential scans so much it should do them better ;) (just being naughty!). Cheerio, Link. [1] The theory was the drive typically has to jump around a lot more for metadata than for files. In practice it worked pretty well, if I do say so myself :). Not sure if modern HDDs do specialized O/S metadata caching (wonder how many megabytes would typically be needed for 18GB drives :) ).
On Thu, 25 Apr 2002, Curt Sampson wrote:

> Here's the ratio table again, with another column comparing the
> aggregate number of requests per second for one process and four
> processes:

Just for interest, I ran this again with 20 processes working simultaneously. I did six runs at each blockread size and summed the tps for each process to find the aggregate number of reads per second during the test. I dropped the highest and the lowest ones and averaged the rest. Here's the new table:

                1 proc   4 procs   20 procs
    1 block      310       440       260
    2 blocks     262       401       481
    4 blocks     199       346       354
    8 blocks     132       260       250
    16 blocks     66       113       116

I'm not sure at all why performance gets so much *worse* with a lot of contention on the 1-block reads. This could have something to do with NetBSD, or its buffer cache, or my laptop's crappy little disk drive.... Or maybe I'm just running out of CPU.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
On Thu, 25 Apr 2002, Lincoln Yeoh wrote: > I think the raw partitions will be more trouble than they are worth. > Reading larger chunks at appropriate circumstances seems to be the "low > hanging fruit". That's certainly a good start. I don't know if the raw partitions would be more trouble than they are worth, but it certainly would be a lot more work, yes. One could do pretty much as well, I think, by using the "don't buffer blocks for this file" option on those OSes that have it. > [1] The theory was the drive typically has to jump around a lot more for > metadata than for files. In practice it worked pretty well, if I do say so > myself :). Not sure if modern HDDs do specialized O/S metadata caching > (wonder how many megabytes would typically be needed for 18GB drives :) ). Sure they do, though they don't necessarially read it all. Most unix systems have special cache for namei lookups (turning a filename into an i-node number), often one per-process as well as a system-wide one. And on machines with a unified buffer cache for file data, there's still a separate metadata cache. But in fact, at least with BSD's FFS, there's probably not quite as much jumping as you'd think. An FFS filesystem is divided into "cylinder groups" (though these days the groups don't necessarially match the physical cylinder boundaries on the disk) and a file's i-node entry is kept in the same cylinder group as the file's data, or at least the first part of the it. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
Curt Sampson <cjs@cynic.net> writes: > 1. Theoretical proof: two components of the delay in retrieving a > block from disk are the disk arm movement and the wait for the > right block to rotate under the head. > When retrieving, say, eight adjacent blocks, these will be spread > across no more than two cylinders (with luck, only one). Weren't you contending earlier that with modern disk mechs you really have no idea where the data is? You're asserting as an article of faith that the OS has been able to place the file's data blocks optimally --- or at least well enough to avoid unnecessary seeks. But just a few days ago I was getting told that random_page_cost was BS because there could be no such placement. I'm getting a tad tired of sweeping generalizations offered without proof, especially when they conflict. > 3. Proof by testing. I wrote a little ruby program to seek to a > random point in the first 2 GB of my raw disk partition and read > 1-8 8K blocks of data. (This was done as one I/O request.) (Using > the raw disk partition I avoid any filesystem buffering.) And also ensure that you aren't testing the point at issue. The point at issue is that *in the presence of kernel read-ahead* it's quite unclear that there's any benefit to a larger request size. Ideally the kernel will have the next block ready for you when you ask, no matter what the request is. There's been some talk of using the AIO interface (where available) to "encourage" the kernel to do read-ahead. I don't foresee us writing our own substitute filesystem to make this happen, however. Oracle may have the manpower for that sort of boondoggle, but we don't... regards, tom lane
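To make the AIO remark above concrete: on platforms that provide POSIX AIO, a scan could start an asynchronous read of the next block while it processes the current one, which gives the kernel an explicit read-ahead hint without writing a custom filesystem. This is only a sketch of the idea, not actual smgr code; the constants are assumptions, and on some systems the program must be linked with -lrt.

    /* Sketch: overlap processing of the current block with the read of
     * the next one using POSIX AIO.  Illustrative only. */
    #include <aio.h>
    #include <string.h>
    #include <sys/types.h>

    #define BLCKSZ 8192

    static char next_block[BLCKSZ];

    /* Kick off an asynchronous read of block next_blkno from fd. */
    static void start_prefetch(int fd, long next_blkno, struct aiocb *cb)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_buf    = next_block;
        cb->aio_nbytes = BLCKSZ;
        cb->aio_offset = (off_t) next_blkno * BLCKSZ;
        aio_read(cb);                      /* returns at once; I/O proceeds */
    }

    /* Later, when the block is actually needed, wait for it. */
    static ssize_t finish_prefetch(struct aiocb *cb)
    {
        const struct aiocb *list[1];

        list[0] = cb;
        aio_suspend(list, 1, NULL);        /* block until the read completes */
        if (aio_error(cb) != 0)
            return -1;
        return aio_return(cb);             /* number of bytes read */
    }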
Curt Sampson wrote: > 3. Proof by testing. I wrote a little ruby program to seek to a > random point in the first 2 GB of my raw disk partition and read > 1-8 8K blocks of data. (This was done as one I/O request.) (Using > the raw disk partition I avoid any filesystem buffering.) Here are > typical results: > > 125 reads of 16x8K blocks: 1.9 sec, 66.04 req/sec. 15.1 ms/req, 0.946 ms/block > 250 reads of 8x8K blocks: 1.9 sec, 132.3 req/sec. 7.56 ms/req, 0.945 ms/block > 500 reads of 4x8K blocks: 2.5 sec, 199 req/sec. 5.03 ms/req, 1.26 ms/block > 1000 reads of 2x8K blocks: 3.8 sec, 261.6 req/sec. 3.82 ms/req, 1.91 ms/block > 2000 reads of 1x8K blocks: 6.4 sec, 310.4 req/sec. 3.22 ms/req, 3.22 ms/block > > The ratios of data retrieval speed per read for groups of adjacent > 8K blocks, assuming a single 8K block reads in 1 time unit, are: > > 1 block 1.00 > 2 blocks 1.18 > 4 blocks 1.56 > 8 blocks 2.34 > 16 blocks 4.68 You mention you are reading from a raw partition. It is my understanding that raw partition reads have no kernel read-ahead. (I assume this because raw devices are devices, not file system files.) If this is true, could we get numbers with kernel read-ahead active? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Michael Loftis wrote: > A Block-sized read will not be broken up. But if you're reading ina > size bigger than the underlying systems block sizes then it can get > broken up. > > So yes a sequential read will get broken up. A single read request for > a block may or may not get broken up. If you're freading with set block > sizes you'll see the set sizes of blocks come through, but what the > underlying OS does is undefined, same for writing. If the underlying > block size is 8KB and you dump 4MB down on it, the OS may (and in many > cases does) decide to write part of it, do a read ona nearby sector, > then write the rest. This happens when doing long writes that end up > spanning block groups because the inodes must be allocated. Also keep in mind most disks have 512 byte blocks, so even if the file system is 8k, the disk block sizes are different. A given 8k or 1k file system block may not even be all in the same cylinder. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Actually, this brings up a different point. We use 8k blocks now because at the time PostgreSQL was developed, it used BSD file systems, and those prefer 8k blocks, and there was some concept that an 8k write was atomic, though with 512 byte disk blocks, that was incorrect. (We knew that at the time too, but we didn't have any options, so we just hoped.) In fact, we now write pre-modified pages to WAL specifically because we can't be sure an 8k page write to disk will be atomic. Part of the page may make it to disk, and part may not. Now, with larger RAM and disk sizes, it may be time to consider larger page sizes, like 32k pages. That reduces the granularity of the cache, but it may have other performance advantages that would be worth it. What people are actually suggesting with the read-ahead for sequential scans is basically a larger block size for sequential scans than for index scans. While this makes sense, it may be better to just increase the block size overall. --------------------------------------------------------------------------- Curt Sampson wrote: > On Wed, 24 Apr 2002, Michael Loftis wrote: > > > A Block-sized read will not be rboken up. But if you're reading ina > > size bigger than the underlying systems block sizes then it can get > > broken up. > > In which operating systems, and under what circumstances? > > I'll agree that some OSs may not coalesce adjacent reads into a > single read, but even so, submitting a bunch of single reads for > consecutive blocks is going to be much, much faster than if other, > random I/O occured between those reads. > > > If the underlying > > block size is 8KB and you dump 4MB down on it, the OS may (and in many > > cases does) decide to write part of it, do a read ona nearby sector, > > then write the rest. This happens when doing long writes that end up > > spanning block groups because the inodes must be allocated. > > Um...we're talking about 64K vs 8K reads here, not 4 MB reads. I am > certainly not suggesting Posgres ever submit 4 MB read requests to the OS. > > I agree that any single-chunk reads or writes that cause non-adjacent > disk blocks to be accessed may be broken up. But in my sense, > they're "broken up" anyway, in that you have no choice but to take > a performance hit. > > > Further large read requests can of course be re-ordered by hardware. > > ...The OS also tags ICP, which can be re-ordered on block-sized chunks. > > Right. All the more reason to read in larger chunks when we know what we > need in advance, because that will give the OS, controllers, etc. more > advance information, and let them do the reads more efficiently. > > cjs > -- > Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org > Don't you know, in this new Dark Age, we're all light. --XTC > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
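To put rough numbers on the block-size trade-off raised above: a full sequential scan of a 1 GB table issues about 131,072 read() calls at the current 8K page size, but only 32,768 at 32K, a fourfold reduction in per-request overhead. The flip side of the coarser granularity is that a buffer cache of a given size holds only a quarter as many pages, so workloads that touch just a few rows per page keep less useful data per byte of cache. (These figures are simple arithmetic, not measurements.)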
On Thu, 2002-04-25 at 12:47, Curt Sampson wrote:
> On Thu, 25 Apr 2002, Lincoln Yeoh wrote:
>
> > I think the raw partitions will be more trouble than they are worth.
> > Reading larger chunks at appropriate circumstances seems to be the "low
> > hanging fruit".
>
> That's certainly a good start. I don't know if the raw partitions
> would be more trouble than they are worth, but it certainly would
> be a lot more work, yes. One could do pretty much as well, I think,
> by using the "don't buffer blocks for this file" option on those
> OSes that have it.

I was on a short DB2 tuning course and was told that on Win NT turning off the cache gives about a 15-20% speedup. (I don't know what exactly is sped up :)

> > [1] The theory was the drive typically has to jump around a lot more for
> > metadata than for files. In practice it worked pretty well, if I do say so
> > myself :). Not sure if modern HDDs do specialized O/S metadata caching
> > (wonder how many megabytes would typically be needed for 18GB drives :) ).
>
> Sure they do, though they don't necessarially read it all. Most
> unix systems

Do modern HDDs have unix inside them ;)

> have special cache for namei lookups (turning a filename
> into an i-node number), often one per-process as well as a system-wide
> one. And on machines with a unified buffer cache for file data,
> there's still a separate metadata cache.

-----------
Hannu
Bruce Momjian wrote: > Now, with larger RAM and disk sizes, it may be time to consider larger > page sizes, like 32k pages. That reduces the granularity of the cache, > but it may have other performance advantages that would be worth it. > > What people are actually suggesting with the read-ahead for sequential > scans is basically a larger block size for sequential scans than for > index scans. While this makes sense, it may be better to just increase > the block size overall. I have seen performance improvements by using 16K blocks over 8K blocks in sequential scans of large tables. I am investigating the performance difference between 16K and 8K block sizes on one of my systems. I'll let you know what I see. I am using pgbench for generic performance levels. If you would like to see any extra data, just let me know.
Tom Lane wrote:
> ...
> Curt Sampson <cjs@cynic.net> writes:
> > 3. Proof by testing. I wrote a little ruby program to seek to a
> > random point in the first 2 GB of my raw disk partition and read
> > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> > the raw disk partition I avoid any filesystem buffering.)
>
> And also ensure that you aren't testing the point at issue.
> The point at issue is that *in the presence of kernel read-ahead*
> it's quite unclear that there's any benefit to a larger request size.
> Ideally the kernel will have the next block ready for you when you
> ask, no matter what the request is.
> ...

I have to agree with Tom. I think the numbers below show that with kernel read-ahead, block size isn't an issue.

The big_file1 file used below is 2.0 gig of random data, and the machine has 512 mb of main memory. This ensures that we're not just getting cached data.

    foreach i (4k 8k 16k 32k 64k 128k)
      echo $i
      time dd bs=$i if=big_file1 of=/dev/null
    end

and the results:

    bs      user    kernel  elapsed
    4k:     0.260   7.740   1:27.25
    8k:     0.210   8.060   1:30.48
    16k:    0.090   7.790   1:30.88
    32k:    0.060   8.090   1:32.75
    64k:    0.030   8.190   1:29.11
    128k:   0.070   9.830   1:28.74

so with kernel read-ahead, we have basically the same elapsed (wall time) regardless of block size. Sure, user time drops to a low at 64k blocksize, but kernel time is increasing.

You could argue that this is a contrived example, no other I/O is being done. Well I created a second 2.0g file (big_file2) and did two simultaneous reads from the same disk. Sure performance went to hell but it shows blocksize is still irrelevant in a multi I/O environment with sequential read-ahead.

    foreach i ( 4k 8k 16k 32k 64k 128k )
      echo $i
      time dd bs=$i if=big_file1 of=/dev/null &
      time dd bs=$i if=big_file2 of=/dev/null &
      wait
    end

    bs      user    kernel  elapsed
    4k:     0.480   8.290   6:34.13   bigfile1
            0.320   8.730   6:34.33   bigfile2
    8k:     0.250   7.580   6:31.75
            0.180   8.450   6:31.88
    16k:    0.150   8.390   6:32.47
            0.100   7.900   6:32.55
    32k:    0.190   8.460   6:24.72
            0.060   8.410   6:24.73
    64k:    0.060   9.350   6:25.05
            0.150   9.240   6:25.13
    128k:   0.090   10.610  6:33.14
            0.110   11.320  6:33.31

the differences in read times are basically in the mud. Blocksize just doesn't matter much with the kernel doing readahead.

-Kyle
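For anyone who wants to repeat the test without csh and dd, a rough C equivalent of the loop above; the file name defaults to big_file1 and the buffer size is taken from the command line, both just placeholders:

    /* Sequential-read benchmark: read a large file with a caller-chosen
     * buffer size and report the elapsed time, so different block sizes
     * can be compared under kernel read-ahead. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(int argc, char **argv)
    {
        size_t bs = (argc > 2) ? (size_t) atol(argv[2]) : 8192;
        int fd = open(argc > 1 ? argv[1] : "big_file1", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(bs);
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);

        ssize_t n;
        long long total = 0;
        while ((n = read(fd, buf, bs)) > 0)   /* sequential, OS does read-ahead */
            total += n;

        gettimeofday(&t1, NULL);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%lld bytes in %.2f s with bs=%zu\n", total, secs, bs);
        free(buf);
        close(fd);
        return 0;
    }

Run it as, say, ./readbench big_file1 65536 and compare elapsed times across buffer sizes.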
Nice test. Would you test simultaneous 'dd' on the same file, perhaps with a slight delay between to the two so they don't read each other's blocks? seek() in the file will turn off read-ahead in most OS's. I am not saying this is a major issue for PostgreSQL but the numbers would be interesting. --------------------------------------------------------------------------- Kyle wrote: > Tom Lane wrote: > > ... > > Curt Sampson <cjs@cynic.net> writes: > > > 3. Proof by testing. I wrote a little ruby program to seek to a > > > random point in the first 2 GB of my raw disk partition and read > > > 1-8 8K blocks of data. (This was done as one I/O request.) (Using > > > the raw disk partition I avoid any filesystem buffering.) > > > > And also ensure that you aren't testing the point at issue. > > The point at issue is that *in the presence of kernel read-ahead* > > it's quite unclear that there's any benefit to a larger request size. > > Ideally the kernel will have the next block ready for you when you > > ask, no matter what the request is. > > ... > > I have to agree with Tom. I think the numbers below show that with > kernel read-ahead, block size isn't an issue. > > The big_file1 file used below is 2.0 gig of random data, and the > machine has 512 mb of main memory. This ensures that we're not > just getting cached data. > > foreach i (4k 8k 16k 32k 64k 128k) > echo $i > time dd bs=$i if=big_file1 of=/dev/null > end > > and the results: > > bs user kernel elapsed > 4k: 0.260 7.740 1:27.25 > 8k: 0.210 8.060 1:30.48 > 16k: 0.090 7.790 1:30.88 > 32k: 0.060 8.090 1:32.75 > 64k: 0.030 8.190 1:29.11 > 128k: 0.070 9.830 1:28.74 > > so with kernel read-ahead, we have basically the same elapsed (wall > time) regardless of block size. Sure, user time drops to a low at 64k > blocksize, but kernel time is increasing. > > > You could argue that this is a contrived example, no other I/O is > being done. Well I created a second 2.0g file (big_file2) and did two > simultaneous reads from the same disk. Sure performance went to hell > but it shows blocksize is still irrelevant in a multi I/O environment > with sequential read-ahead. > > foreach i ( 4k 8k 16k 32k 64k 128k ) > echo $i > time dd bs=$i if=big_file1 of=/dev/null & > time dd bs=$i if=big_file2 of=/dev/null & > wait > end > > bs user kernel elapsed > 4k: 0.480 8.290 6:34.13 bigfile1 > 0.320 8.730 6:34.33 bigfile2 > 8k: 0.250 7.580 6:31.75 > 0.180 8.450 6:31.88 > 16k: 0.150 8.390 6:32.47 > 0.100 7.900 6:32.55 > 32k: 0.190 8.460 6:24.72 > 0.060 8.410 6:24.73 > 64k: 0.060 9.350 6:25.05 > 0.150 9.240 6:25.13 > 128k: 0.090 10.610 6:33.14 > 0.110 11.320 6:33.31 > > > the differences in read times are basically in the mud. Blocksize > just doesn't matter much with the kernel doing readahead. > > -Kyle > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Thu, 25 Apr 2002, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > 1. Theoretical proof: two components of the delay in retrieving a > > block from disk are the disk arm movement and the wait for the > > right block to rotate under the head. > > > When retrieving, say, eight adjacent blocks, these will be spread > > across no more than two cylinders (with luck, only one). > > Weren't you contending earlier that with modern disk mechs you really > have no idea where the data is? No, that was someone else. I contend that with pretty much any large-scale storage mechanism (i.e., anything beyond ramdisks), you will find that accessing two adjacent blocks is almost always 1) close to as fast as accessing just the one, and 2) much, much faster than accessing two blocks that are relatively far apart. There will be the odd case where the two adjacent blocks are physically far apart, but this is rare. If this idea doesn't hold true, the whole idea that sequential reads are faster than random reads falls apart, and the optimizer shouldn't even have the option to make random reads cost more, much less have it set to four rather than one (or whatever it's set to). > You're asserting as an article of > faith that the OS has been able to place the file's data blocks > optimally --- or at least well enough to avoid unnecessary seeks. So are you, in the optimizer. But that's all right; the OS often can and does do this placement; the FFS filesystem is explicitly designed to do this sort of thing. If the filesystem isn't empty and the files grow a lot they'll be split into large fragments, but the fragments will be contiguous. > But just a few days ago I was getting told that random_page_cost > was BS because there could be no such placement. I've been arguing against that point as well. > And also ensure that you aren't testing the point at issue. > The point at issue is that *in the presence of kernel read-ahead* > it's quite unclear that there's any benefit to a larger request size. I will test this. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
On Thu, 25 Apr 2002, Bruce Momjian wrote: > Also keep in mind most disks have 512 byte blocks, so even if the file > system is 8k, the disk block sizes are different. A given 8k or 1k file > system block may not even be all in the same cylinder. Right. Though much of the time they will be in the same cylinder, sometimes they will be in adjacent cylinders if the drive manufacturer has made cylinder sizes that are not multiples of 8K. I don't think this is terribly frequent, but there's no way to substantiate that assumption without knowing the real geometries of the drive, which generally are not given out. (What is reported to the OS has not been the real geometry for years now, because drive manufacturers long ago started putting more blocks on the outer cylinders than the inner ones.) However, even that they will be in adjacent cylinders doesn't always hold: depending on how the disk subsystems are partitioned, you might be crossing a boundary where two partitions are joined together, necessitating a seek. But this case would be quite rare. You can always find conditions, in modern drive subsystems, where the "read close together" idea doesn't hold, but in the vast, vast majority of circumstances it does. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
On Thu, 25 Apr 2002, Bruce Momjian wrote: > Actually, this brings up a different point. We use 8k blocks now > because at the time PostgreSQL was developed, it used BSD file systems, > and those prefer 8k blocks, and there was some concept that an 8k write > was atomic, though with 512 byte disk blocks, that was incorrect. (We > knew that at the time too, but we didn't have any options, so we just > hoped.) MS SQL Server has an interesting way of dealing with this. They have a "torn" bit in each 512-byte chunk of a page, and this bit is set the same for each chunk. When they are about to write out a page, they first flip all of the torn bits and then do the write. If the write does not complete due to a system crash or whatever, this can be detected later because the torn bits won't match across the entire page. > Now, with larger RAM and disk sizes, it may be time to consider larger > page sizes, like 32k pages. That reduces the granularity of the cache, > but it may have other performance advantages that would be worth it. It really depends on the block size your underlying layer is using. Reading less than that is never useful as you pay for that entire block anyway. (E.g., on an FFS filesystem with 8K blocks, the OS always reads 8K even if you ask for only 4K.) On the other hand, reading more does have a tangible cost, as you saw from the benchmark I posted; reading 16K on my system cost 20% more than reading 8K, and used twice the buffer space. If I'm doing lots of really random reads, this would result in a performance loss (due to doing more I/O, and having less chance that the next item I want is in the buffer cache). For some reason I thought we had the ability to change the block size that postgres uses on a table-by-table basis, but I can't find anything in the docs about that. Maybe it's just because I saw some support in the code for it. But this feature would be a nice addition for those cases where a larger block size would help. But I think that 8K is a pretty good default, and I think that 32K blocks would result in a quite noticable performance reduction for apps that did a lot of random I/O. > What people are actually suggesting with the read-ahead for sequential > scans is basically a larger block size for sequential scans than for > index scans. While this makes sense, it may be better to just increase > the block size overall. I don't think so, because the smaller block size is definitely better for random I/O. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
Curt Sampson wrote: > On Thu, 25 Apr 2002, Bruce Momjian wrote: > > > Actually, this brings up a different point. We use 8k blocks now > > because at the time PostgreSQL was developed, it used BSD file systems, > > and those prefer 8k blocks, and there was some concept that an 8k write > > was atomic, though with 512 byte disk blocks, that was incorrect. (We > > knew that at the time too, but we didn't have any options, so we just > > hoped.) > > MS SQL Server has an interesting way of dealing with this. They have a > "torn" bit in each 512-byte chunk of a page, and this bit is set the > same for each chunk. When they are about to write out a page, they first > flip all of the torn bits and then do the write. If the write does not > complete due to a system crash or whatever, this can be detected later > because the torn bits won't match across the entire page. I was wondering, how does knowing the block is corrupt help MS SQL? Right now, we write changed pages to WAL, then later write them to disk. I have always been looking for a way to prevent these WAL writes. The 512-byte bit seems interesting, but how does it help? And how does the bit help them with partial block writes? Is the bit at the end of the block? Is that reliable? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Thu, 20 Jun 2002, Bruce Momjian wrote:

> > MS SQL Server has an interesting way of dealing with this. They have a
> > "torn" bit in each 512-byte chunk of a page, and this bit is set the
> > same for each chunk. When they are about to write out a page, they first
> > flip all of the torn bits and then do the write. If the write does not
> > complete due to a system crash or whatever, this can be detected later
> > because the torn bits won't match across the entire page.
>
> I was wondering, how does knowing the block is corrupt help MS SQL?

I'm trying to recall, but I can't off hand. I'll have to look it up in my Inside SQL Server book, which is at home right now, unfortunately. I'll bring the book into work and let you know the details later.

> Right now, we write changed pages to WAL, then later write them to disk.

Ah. You write the entire page? MS writes only the changed tuple. And DB2, in fact, goes one better and writes only the part of the tuple up to the change, IIRC. Thus, if you put smaller and/or more frequently changed columns first, you'll have smaller logs.

> I have always been looking for a way to prevent these WAL writes. The
> 512-byte bit seems interesting, but how does it help?

Well, this would at least let you reduce the write to the 512-byte chunk that changed, rather than writing the entire 8K page.

> And how does the bit help them with partial block writes? Is the bit at
> the end of the block? Is that reliable?

The bit is somewhere within every 512 byte "disk page" within the 8192 byte "filesystem/database page." So an 8KB page is divided up like this:

    | <----------------------- 8 Kb ----------------------> |
    | 512b | 512b | 512b | 512b | 512b | 512b | 512b | 512b |

Thus, the tear bits start out like this:

    | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

After a successful write of the entire page, you have this:

    | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

If the write is unsuccessful, you end up with something like this:

    | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |

And now you know which parts of your page got written, and which parts didn't.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
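A minimal C sketch of the detection logic described above. Where SQL Server really keeps the collected bits is not specified here; this sketch simply reserves the last byte of each 512-byte sector for the flag, which is an assumption made purely for illustration:

    /* Torn-write detection sketch.  The flag placement (last byte of
     * each sector) is an assumption for illustration only; the point is
     * the flip-before-write / compare-after-read logic. */
    #include <stdbool.h>

    #define PAGE_SIZE   8192
    #define SECTOR_SIZE 512
    #define NSECTORS    (PAGE_SIZE / SECTOR_SIZE)   /* 16 sectors per 8K page */

    /* Before writing the page out, flip the flag in every sector so that
     * all sectors carry the same, new value. */
    void set_torn_flags(unsigned char *page)
    {
        unsigned char new_flag = page[SECTOR_SIZE - 1] ^ 1;   /* flip */
        for (int i = 0; i < NSECTORS; i++)
            page[i * SECTOR_SIZE + SECTOR_SIZE - 1] = new_flag;
    }

    /* After reading the page back, a torn write shows up as sectors that
     * disagree about the flag value. */
    bool page_is_torn(const unsigned char *page)
    {
        unsigned char first = page[SECTOR_SIZE - 1];
        for (int i = 1; i < NSECTORS; i++)
            if (page[i * SECTOR_SIZE + SECTOR_SIZE - 1] != first)
                return true;
        return false;
    }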
Curt Sampson <cjs@cynic.net> writes: > And now you know which parts of your page got written, and which > parts didn't. Yes ... and what do you *do* about it? regards, tom lane
On Fri, 21 Jun 2002, Tom Lane wrote:
> Curt Sampson <cjs@cynic.net> writes:
> > And now you know which parts of your page got written, and which
> > parts didn't.
>
> Yes ... and what do you *do* about it?

Ok. Here's the extract from _Inside Microsoft SQL Server 7.0_, page 207:

    torn page detection  When TRUE, this option causes a bit to be
    flipped for each 512-byte sector in a database page (8KB) whenever
    the page is written to disk. This option allows SQL Server to
    detect incomplete I/O operations caused by power failures or other
    system outages. If a bit is in the wrong state when the page is
    later read by SQL Server, this means the page was written
    incorrectly; a torn page has been detected. Although SQL Server
    database pages are 8KB, disks perform I/O operations using 512-byte
    sectors. Therefore, 16 sectors are written per database page. A
    torn page can occur if the system crashes (for example, because of
    power failure) between the time the operating system writes the
    first 512-byte sector to disk and the completion of the 8-KB I/O
    operation. If the first sector of a database page is successfully
    written before the crash, it will appear that the database page on
    disk was updated, although it might not have succeeded. Using
    battery-backed disk caches can ensure that data is [sic]
    successfully written to disk or not written at all. In this case,
    don't set torn page detection to TRUE, as it isn't needed. If a
    torn page is detected, the database will need to be restored from
    backup because it will be physically inconsistent.

As I understand it, this is not a problem for postgres because the entire page is written to the log. So postgres is safe, but quite inefficient. (It would be much more efficient to write just the changed tuple, or even just the changed values within the tuple, to the log.)

Adding these torn bits would allow postgres at least to write to the log just the 512-byte sectors that have changed, rather than the entire 8 KB page.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
On Thu, 2002-06-20 at 21:58, Bruce Momjian wrote: > I was wondering, how does knowing the block is corrupt help MS SQL? > Right now, we write changed pages to WAL, then later write them to disk. > I have always been looking for a way to prevent these WAL writes. The > 512-byte bit seems interesting, but how does it help? > > And how does the bit help them with partial block writes? Is the bit at > the end of the block? Is that reliable? > My understanding of this is as follows: 1) On most commercial systems, if you get a corrupted block (from partial write or whatever) you need to restore the file(s) from the most recent backup, and replay the log from the log archive (usually only the damaged files will be written to during replay). 2) If you can't deal with the downtime to recover the file, then EMC, Sun, or IBM will sell you an expensive disk array with an NVRAM cache that will do atomic writes. Some plain-vanilla SCSI disks are also capable of atomic writes, though usually they don't use NVRAM to do it. The database must then make sure that each page-write gets translated into exactly one SCSI-level write. This is one reason why ORACLE and Sybase recommend that you use raw disk partitions for high availability. Some operating systems support this through the filesystem, but it is OS dependent. I think Solaris 7 & 8 has support for this, but I'm not sure. PostgreSQL has trouble because it can neither archive logs for replay, nor use raw disk partitions. One other point: Page pre-image logging is fundamentally the same as what Jim Grey's book[1] would call "careful writes". I don't believe they should be in the XLOG, because we never need to keep the pre-images after we're sure the buffer has made it to the disk. Instead, we should have the buffer IO routines implement ping-pong writes of some kind if we want protection from partial writes. Does any of this make sense? ;jrnield [1] Grey, J. and Reuter, A. (1993). "Transaction Processing: Conceptsand Techniques". Morgan Kaufmann. -- J. R. Nield jrnield@usol.com
J. R. Nield wrote: > One other point: > > Page pre-image logging is fundamentally the same as what Jim Grey's > book[1] would call "careful writes". I don't believe they should be in > the XLOG, because we never need to keep the pre-images after we're sure > the buffer has made it to the disk. Instead, we should have the buffer > IO routines implement ping-pong writes of some kind if we want > protection from partial writes. Ping-pong writes to where? We have to fsync, and rather than fsync that area and WAL, we just do WAL. Not sure about a win there. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: > J. R. Nield wrote: > > One other point: > > > > Page pre-image logging is fundamentally the same as what Jim Grey's > > book[1] would call "careful writes". I don't believe they should be in > > the XLOG, because we never need to keep the pre-images after we're sure > > the buffer has made it to the disk. Instead, we should have the buffer > > IO routines implement ping-pong writes of some kind if we want > > protection from partial writes. > > Ping-pong writes to where? We have to fsync, and rather than fsync that > area and WAL, we just do WAL. Not sure about a win there. > The key question is: do we have some method to ensure that the OS doesn't do the writes in parallel? If the OS will ensure that one of the two block writes of a ping-pong completes before the other starts, then we don't need to fsync() at all. The only thing we are protecting against is the possibility of both writes being partial. If neither is done, that's fine because WAL will protect us. If the first write is partial, we will detect that and use the old data from the other, then recover from WAL. If the first is complete but the second is partial, then we detect that and use the newer block from the first write. If the second is complete but the first is partial, we detect that and use the newer block from the second write. So does anyone know a way to prevent parallel writes in one of the common unix standards? Do they say anything about this? It would seem to me that if the same process does both ping-pong writes, then there should be a cheap way to enforce a serial order. I could be wrong though. As to where the first block of the ping-pong should go, maybe we could reserve a file with nBlocks space for them, and write the information about which block was being written to the XLOG for use in recovery. There are many other ways to do it. ;jrnield -- J. R. Nield jrnield@usol.com
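A rough sketch of the ping-pong sequence being described, assuming a pre-allocated scratch file for the first copy and using fdatasync() as the serialization point between the two writes; the function and file layout are invented for illustration:

    /* Ping-pong (double) write sketch: the scratch copy is forced to
     * disk before the real block is touched, so at most one of the two
     * copies can be torn, and recovery can always pick the intact one. */
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    int pingpong_write(int scratch_fd, int data_fd, off_t blkno, const char *page)
    {
        /* 1. First copy into the scratch area, and force it to disk. */
        if (pwrite(scratch_fd, page, BLCKSZ, blkno * (off_t) BLCKSZ) != BLCKSZ)
            return -1;
        if (fdatasync(scratch_fd) != 0)        /* serializes the two writes */
            return -1;

        /* 2. Only now overwrite the real block.  If this write is torn,
         *    the intact scratch copy is used at recovery; if the scratch
         *    copy was the torn one, the real block was never touched. */
        if (pwrite(data_fd, page, BLCKSZ, blkno * (off_t) BLCKSZ) != BLCKSZ)
            return -1;
        return fdatasync(data_fd);
    }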
On 23 Jun 2002, J. R. Nield wrote: > On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: > > J. R. Nield wrote: > > > One other point: > > > > > > Page pre-image logging is fundamentally the same as what Jim Grey's > > > book[1] would call "careful writes". I don't believe they should be in > > > the XLOG, because we never need to keep the pre-images after we're sure > > > the buffer has made it to the disk. Instead, we should have the buffer > > > IO routines implement ping-pong writes of some kind if we want > > > protection from partial writes. > > > > Ping-pong writes to where? We have to fsync, and rather than fsync that > > area and WAL, we just do WAL. Not sure about a win there. Presumably the win is that, "we never need to keep the pre-images after we're sure the buffer has made it to the disk." So the pre-image log can be completely ditched when we shut down the server, so a full system sync, or whatever. This keeps the log file size down, which means faster recovery, less to back up (when we start getting transaction logs that can be backed up), etc. This should also allow us to disable completely the ping-pong writes if we have a disk subsystem that we trust. (E.g., a disk array with battery backed memory.) That would, in theory, produce a nice little performance increase when lots of inserts and/or updates are being committed, as we have much, much less to write to the log file. Are there stats that track, e.g., the bandwidth of writes to the log file? I'd be interested in knowing just what kind of savings one might see by doing this. > The key question is: do we have some method to ensure that the OS > doesn't do the writes in parallel?... > It would seem to me that if the same process does both ping-pong writes, > then there should be a cheap way to enforce a serial order. I could be > wrong though. Well, whether or not there's a cheap way depends on whether you consider fsync to be cheap. :-) cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
Curt Sampson <cjs@cynic.net> writes: > This should also allow us to disable completely the ping-pong writes > if we have a disk subsystem that we trust. If we have a disk subsystem we trust, we just disable fsync on the WAL and the performance issue largely goes away. I concur with Bruce: the reason we keep page images in WAL is to minimize the number of places we have to fsync, and thus the amount of head movement required for a commit. Putting the page images elsewhere cannot be a win AFAICS. > Well, whether or not there's a cheap way depends on whether you consider > fsync to be cheap. :-) It's never cheap :-( regards, tom lane
On Sun, 23 Jun 2002, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > This should also allow us to disable completely the ping-pong writes > > if we have a disk subsystem that we trust. > > If we have a disk subsystem we trust, we just disable fsync on the > WAL and the performance issue largely goes away. No, you can't do this. If you don't fsync(), there's no guarantee that the write ever got out of the computer's buffer cache and to the disk subsystem in the first place. > I concur with Bruce: the reason we keep page images in WAL is to > minimize the number of places we have to fsync, and thus the amount of > head movement required for a commit. An fsync() does not necessarially cause head movement, or any real disk writes at all. If you're writing to many external disk arrays, for example, the fsync() ensures that the data are in the disk array's non-volatile or UPS-backed RAM, no more. The array might hold the data for quite some time before it actually writes it to disk. But you're right that it's faster, if you're going to write out changed pages and have have the ping-pong file and the transaction log on the same disk, just to write out the entire page to the transaction log. So what we would really need to implement, if we wanted to be more efficient with trusted disk subsystems, would be the option of writing to the log only the changed row or changed part of the row, or writing the entire changed page. I don't know how hard this would be.... > > Well, whether or not there's a cheap way depends on whether you consider > > fsync to be cheap. :-) > > It's never cheap :-( Actually, with a good external RAID system with non-volatile RAM, it's a good two to four orders of magnitude cheaper than writing to a directly connected disk that doesn't claim the write is complete until it's physically on disk. I'd say that it qualifies as at least "not expensive." Not that you want to do it more often than you have to anyway.... cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
On Sun, 2002-06-23 at 11:19, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > This should also allow us to disable completely the ping-pong writes > > if we have a disk subsystem that we trust. > > If we have a disk subsystem we trust, we just disable fsync on the > WAL and the performance issue largely goes away. It wouldn't work because the OS buffering interferes, and we need those WAL records on disk up to the greatest LSN of the Buffer we will be writing. We already buffer WAL ourselves. We also already buffer regular pages. Whenever we write a Buffer out of the buffer cache, it is because we really want that page on disk and wanted to start an IO. If thats not the case, then we should have more block buffers! So since we have all this buffering designed especially to meet our needs, and since the OS buffering is in the way, can someone explain to me why postgresql would ever open a file without the O_DSYNC flag if the platform supports it? > > I concur with Bruce: the reason we keep page images in WAL is to > minimize the number of places we have to fsync, and thus the amount of > head movement required for a commit. Putting the page images elsewhere > cannot be a win AFAICS. Why not put all the page images in a single pre-allocated file and treat it as a ring? How could this be any worse than flushing them in the WAL log? Maybe fsync would be slower with two files, but I don't see how fdatasync would be, and most platforms support that. What would improve performance would be to have a dbflush process that would work in the background flushing buffers in groups and trying to stay ahead of ReadBuffer requests. That would let you do the temporary side of the ping-pong as a huge O_DSYNC writev(2) request (or fdatasync() once) and then write out the other buffers. It would also tend to prevent the other backends from blocking on write requests. A dbflush could also support aio_read/aio_write on platforms like Solaris and WindowsNT that support it. Am I correct that right now, buffers only get written when they get removed from the free list for reuse? So a released dirty buffer will sit in the buffer free list until it becomes the Least Recently Used buffer, and will then cause a backend to block for IO in a call to BufferAlloc? This would explain why we like using the OS buffer cache, and why our performance is troublesome when we have to do synchronous IO writes, and why fsync() takes so long to complete. All of the backends block for each call to BufferAlloc() after a large table update by a single backend, and then the OS buffers are always full of our "written" data. Am I reading the bufmgr code correctly? I already found an imaginary race condition there once :-) ;jnield > > > Well, whether or not there's a cheap way depends on whether you consider > > fsync to be cheap. :-) > > It's never cheap :-( > -- J. R. Nield jrnield@usol.com
On Sun, 2002-06-23 at 12:10, Curt Sampson wrote: > > So what we would really need to implement, if we wanted to be more > efficient with trusted disk subsystems, would be the option of writing > to the log only the changed row or changed part of the row, or writing > the entire changed page. I don't know how hard this would be.... > We already log that stuff. The page images are in addition to the "Logical Changes", so we could just stop logging the page images. -- J. R. Nield jrnield@usol.com
On 23 Jun 2002, J. R. Nield wrote:

> So since we have all this buffering designed especially to meet our
> needs, and since the OS buffering is in the way, can someone explain to
> me why postgresql would ever open a file without the O_DSYNC flag if the
> platform supports it?

It's more code, if there are platforms out there that don't support O_DSYNC. (We still have to keep the old fsync code.) On the other hand, O_DSYNC could save us a disk arm movement over fsync() because it appears to me that fsync is also going to force a metadata update, which means that the inode blocks have to be written as well.

> Maybe fsync would be slower with two files, but I don't see how
> fdatasync would be, and most platforms support that.

Because, if both files are on the same disk, you still have to move the disk arm from the cylinder at the current log file write point to the cylinder at the current ping-pong file write point. And then back again to the log file write point cylinder.

In the end, having a ping-pong file as well seems to me unnecessary complexity, especially when anyone interested in really good performance is going to buy a disk subsystem that guarantees no torn pages and thus will want to turn off the ping-pong file writes entirely, anyway.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
On Sun, 23 Jun 2002, Bruce Momjian wrote: > Yes, I don't see writing to two files vs. one to be any win, especially > when we need to fsync both of them. What I would really like is to > avoid the double I/O of writing to WAL and to the data file; improving > that would be a huge win. You mean, the double I/O of writing the block to the WAL and data file? (We'd still have to write the changed columns or whatever to the WAL, right?) I'd just add an option to turn it off. If you need it, you need it; there's no way around that except to buy hardware that is really going to guarantee your writes (which then means you don't need it). cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
I am getting lots of errors on pgadmin.postgresql.org Dave
On Sun, 2002-06-23 at 21:29, J. R. Nield wrote: > If is impossible to do what you want. You can not protect against... Wow. The number of typo's in that last one was just amazing. I even started with one. Have an nice weekend everybody :-) ;jrnield -- J. R. Nield jrnield@usol.com
On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote: > Yes, I don't see writing to two files vs. one to be any win, especially > when we need to fsync both of them. What I would really like is to > avoid the double I/O of writing to WAL and to the data file; improving > that would be a huge win. > If is impossible to do what you want. You can not protect against partial writes without writing pages twice and calling fdatasync between them while going through a generic filesystem. The best disk array will not protect you if the operating system does not align block writes to the structure of the underlying device. Even with raw devices, you need special support or knowledge of the operating system and/or the disk device to ensure that each write request will be atomic to the underlying hardware. All other systems rely on the fact that you can recover a damaged file using the log archive. This means downtime in the rare case, but no data loss. Until PostgreSQL can do this, then it will not be acceptable for real critical production use. This is not to knock PostgreSQL, because it is a very good database system, and clearly the best open-source one. It even has feature advantages over the commercial systems. But at the end of the day, unless you have complete understanding of the I/O system from write(2) through to the disk system, the only sure ways to protect against partial writes are by "careful writes" (in the WAL log or elsewhere, writing pages twice), or by requiring (and allowing) users to do log-replay recovery when a file is corrupted by a partial write. As long as there is a UPS, and the operating system doesn't crash, then there still should be no partial writes. If we log pages to WAL, they are useless when archived (after a checkpoint). So either we have a separate "log" for them (the ping-pong file), or we should at least remove them when archived, which makes log archiving more complex but is perfectly doable. Finally, I would love to hear why we are using the operating system buffer manager at all. The OS is acting as a secondary buffer manager for us. Why is that? What flaw in our I/O system does this reveal? I know that: >We sync only WAL, not the other pages, except for the sync() call we do > during checkpoint when we discard old WAL files. But this is probably not a good thing. We should only be writing blocks when they need to be on disk. We should not be expecting the OS to write them "sometime later" and avoid blocking (as long) for the write. If we need that, then our buffer management is wrong and we need to fix it. The reason we are doing this is because we expect the OS buffer manager to do asynchronous I/O for us, but then we don't control the order. That is the reason why we have to call fdatasync(), to create "sequence points". The reason we have performance problems with either D_OSYNC or fdatasync on the normal relations is because we have no dbflush process. This causes an unacceptable amount of I/O blocking by other transactions. The ORACLE people were not kidding when they said that they could not certify Linux for production use until it supported O_DSYNC. Can you explain why that was the case? Finally, let me apologize if the above comes across as somewhat belligerent. I know very well that I can't compete with you guys for knowledge of the PosgreSQL system. I am still at a loss when I look at the optimizer and executor modules, and it will take some time before I can follow discussion of that area. 
Even then, I doubt my ability to compare with people like Mr. Lane and Mr. Momjian in experience and general intelligence, or in the field of database programming and software development in particular. However, this discussion and a search of the pgsql-hackers archives reveals this problem to be the KEY area of PostgreSQL's failing, and general misunderstanding, when compared to its commercial competitors. Sincerely, J. R. Nield -- J. R. Nield jrnield@usol.com
J. R. Nield wrote: > On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote: > > Yes, I don't see writing to two files vs. one to be any win, especially > > when we need to fsync both of them. What I would really like is to > > avoid the double I/O of writing to WAL and to the data file; improving > > that would be a huge win. > > > > If is impossible to do what you want. You can not protect against > partial writes without writing pages twice and calling fdatasync between > them while going through a generic filesystem. The best disk array will > not protect you if the operating system does not align block writes to > the structure of the underlying device. Even with raw devices, you need > special support or knowledge of the operating system and/or the disk > device to ensure that each write request will be atomic to the > underlying hardware. Yes, I suspected it was impossible, but that doesn't mean I want it any less. ;-) > All other systems rely on the fact that you can recover a damaged file > using the log archive. This means downtime in the rare case, but no data > loss. Until PostgreSQL can do this, then it will not be acceptable for > real critical production use. This is not to knock PostgreSQL, because > it is a very good database system, and clearly the best open-source one. > It even has feature advantages over the commercial systems. But at the > end of the day, unless you have complete understanding of the I/O system > from write(2) through to the disk system, the only sure ways to protect > against partial writes are by "careful writes" (in the WAL log or > elsewhere, writing pages twice), or by requiring (and allowing) users to > do log-replay recovery when a file is corrupted by a partial write. As > long as there is a UPS, and the operating system doesn't crash, then > there still should be no partial writes. You are talking point-in-time recovery, a major missing feature right next to replication, and I agree it makes PostgreSQL unacceptable for some applications. Point taken. And the interesting thing you are saying is that with point-in-time recovery, we don't need to write pre-write images of pages because if we detect a partial page write, we then abort the database and tell the user to do a point-in-time recovery, basically meaning we are using the previous full backup as our pre-write page image and roll forward using the logical logs. This is clearly a nice thing to be able to do because it let's you take a pre-write image of the page once during full backup, keep it offline, and bring it back in the rare case of a full page write failure. I now can see how the MSSQL tearoff-bits would be used, not for recovery, but to detect a partial write and force a point-in-time recovery from the administrator. > If we log pages to WAL, they are useless when archived (after a > checkpoint). So either we have a separate "log" for them (the ping-pong > file), or we should at least remove them when archived, which makes log > archiving more complex but is perfectly doable. Yes, that is how we will do point-in-time recovery; remove the pre-write page images and archive the rest. It is more complex, but having the fsync all in one file is too big a win. > Finally, I would love to hear why we are using the operating system > buffer manager at all. The OS is acting as a secondary buffer manager > for us. Why is that? What flaw in our I/O system does this reveal? 
I > know that: > > >We sync only WAL, not the other pages, except for the sync() call we do > > during checkpoint when we discard old WAL files. > > But this is probably not a good thing. We should only be writing blocks > when they need to be on disk. We should not be expecting the OS to write > them "sometime later" and avoid blocking (as long) for the write. If we > need that, then our buffer management is wrong and we need to fix it. > The reason we are doing this is because we expect the OS buffer manager > to do asynchronous I/O for us, but then we don't control the order. That > is the reason why we have to call fdatasync(), to create "sequence > points". Yes. I think I understand. It is true we have to fsync WAL because we can't control the individual writes by the OS. > The reason we have performance problems with either D_OSYNC or fdatasync > on the normal relations is because we have no dbflush process. This > causes an unacceptable amount of I/O blocking by other transactions. Uh, that would force writes all over the disk. Why do we really care how the OS writes them? If we are going to fsync, let's just do the one file and be done with it. What would a separate flusher process really buy us if it has to use fsync too. The main backend doesn't have to wait for the fsync, but then again, we can't say the transaction is committed until it hits the disk, so how does a flusher help? > The ORACLE people were not kidding when they said that they could not > certify Linux for production use until it supported O_DSYNC. Can you > explain why that was the case? I don't see O_DSYNC as very different from write/fsync(or fdatasync). > Finally, let me apologize if the above comes across as somewhat > belligerent. I know very well that I can't compete with you guys for > knowledge of the PostgreSQL system. I am still at a loss when I look at > the optimizer and executor modules, and it will take some time before I > can follow discussion of that area. Even then, I doubt my ability to > compare with people like Mr. Lane and Mr. Momjian in experience and > general intelligence, or in the field of database programming and > software development in particular. However, this discussion and a > search of the pgsql-hackers archives reveals this problem to be the KEY > area of PostgreSQL's failing, and general misunderstanding, when > compared to its commercial competitors. We appreciate your ideas. Few of us are professional db folks so we are always looking for good ideas. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Curt Sampson wrote: > On 23 Jun 2002, J. R. Nield wrote: > > > So since we have all this buffering designed especially to meet our > > needs, and since the OS buffering is in the way, can someone explain to > > me why postgresql would ever open a file without the O_DSYNC flag if the > > platform supports it? > > It's more code, if there are platforms out there that don't support > O_DYSNC. (We still have to keep the old fsync code.) On the other hand, > O_DSYNC could save us a disk arm movement over fsync() because it > appears to me that fsync is also going to force a metadata update, which > means that the inode blocks have to be written as well. Again, see postgresql.conf: #wal_sync_method = fsync # the default varies across platforms: # # fsync, fdatasync, open_sync, or open_datasync > > > Maybe fsync would be slower with two files, but I don't see how > > fdatasync would be, and most platforms support that. > > Because, if both files are on the same disk, you still have to move > the disk arm from the cylinder at the current log file write point > to the cylinder at the current ping-pong file write point. And then back > again to the log file write point cylinder. > > In the end, having a ping-pong file as well seems to me unnecessary > complexity, especially when anyone interested in really good > performance is going to buy a disk subsystem that guarantees no > torn pages and thus will want to turn off the ping-pong file writes > entirely, anyway. Yes, I don't see writing to two files vs. one to be any win, especially when we need to fsync both of them. What I would really like is to avoid the double I/O of writing to WAL and to the data file; improving that would be a huge win. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
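For reference, a simplified illustration of what those wal_sync_method settings boil down to: either the segment is opened with a synchronous-write flag, or an explicit fdatasync()/fsync() follows each write. This is only a sketch, not the actual xlog code:

    /* Sketch of the two flavours of WAL syncing: open-time flag versus
     * explicit sync call after each write. */
    #include <fcntl.h>
    #include <unistd.h>

    int open_wal_segment(const char *path, int use_open_datasync)
    {
        int flags = O_RDWR;
    #ifdef O_DSYNC
        if (use_open_datasync)
            flags |= O_DSYNC;     /* each write() returns only when on disk */
    #endif
        return open(path, flags);
    }

    ssize_t wal_write(int fd, const void *buf, size_t len, int use_open_datasync)
    {
        ssize_t n = write(fd, buf, len);
        if (n == (ssize_t) len && !use_open_datasync)
            (void) fdatasync(fd); /* the fdatasync/fsync flavour instead */
        return n;
    }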
On 23 Jun 2002, J. R. Nield wrote:

> If is impossible to do what you want. You can not protect against
> partial writes without writing pages twice and calling fdatasync
> between them while going through a generic filesystem.

I agree with this.

> The best disk array will not protect you if the operating system does
> not align block writes to the structure of the underlying device.

This I don't quite understand. Assuming you're using a SCSI drive (and this mostly applies to ATAPI/IDE, too), you can do naught but align block writes to the structure of the underlying device. When you initiate a SCSI WRITE command, you start by telling the device at which block to start writing and how many blocks you intend to write. Then you start passing the data.

(See http://www.danbbs.dk/~dino/SCSI/SCSI2-09.html#9.2.21 for parameter details for the SCSI WRITE(10) command. You may find the SCSI 2 specification, at http://www.danbbs.dk/~dino/SCSI/ to be a useful reference here.)

> Even with raw devices, you need special support or knowledge of the
> operating system and/or the disk device to ensure that each write
> request will be atomic to the underlying hardware.

Well, so here I guess you're talking about two things:

    1. When you request, say, an 8K block write, will the OS really
    write it to disk in a single 8K or multiple of 8K SCSI write
    command?

    2. Does the SCSI device you're writing to consider these writes to
    be transactional. That is, if the write is interrupted before being
    completed, does the SCSI device guarantee that the partially-sent
    data is not written, and the old data is maintained? And of course,
    does it guarantee that, when it acknowledges a write, that write is
    now in stable storage and will never go away?

Both of these are not hard to guarantee, actually. For a BSD-based OS, for example, just make sure that your filesystem block size is the same as or a multiple of the database block size. BSD will never write anything other than a block or a sequence of blocks to a disk in a single SCSI transaction (unless you've got a really odd SCSI driver). And for your disk, buy a Baydel or Clarion disk array, or something similar.

Given that it's not hard to set up a system that meets these criteria, and this is in fact commonly done for database servers, it would seem a good idea for postgres to have the option to take advantage of the time and money spent and adjust its performance upward appropriately.

> All other systems rely on the fact that you can recover a damaged file
> using the log archive.

Not exactly. For MS SQL Server, at any rate, if it detects a page tear you cannot restore based on the log file alone. You need a full or partial backup that includes that entire torn block.

> This means downtime in the rare case, but no data loss. Until
> PostgreSQL can do this, then it will not be acceptable for real
> critical production use.

It seems to me that it is doing this right now. In fact, it's more reliable than some commercial systems (such as SQL Server) because it can recover from a torn block with just the logfile.

> But at the end of the day, unless you have complete understanding of
> the I/O system from write(2) through to the disk system, the only sure
> ways to protect against partial writes are by "careful writes" (in
> the WAL log or elsewhere, writing pages twice), or by requiring (and
> allowing) users to do log-replay recovery when a file is corrupted by
> a partial write.
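As a concrete illustration of the "starting block plus block count" point, here is a sketch of the ten-byte WRITE(10) command descriptor block as I read the SCSI-2 draft linked above; treat the field layout as an approximation rather than a definitive reference:

    /* Sketch of the SCSI WRITE(10) CDB: the initiator names the first
     * logical block and the number of blocks up front, so the request
     * is inherently aligned to the device's block structure. */
    #include <stdint.h>

    struct scsi_write10_cdb {
        uint8_t opcode;          /* 0x2A = WRITE(10) */
        uint8_t flags;           /* DPO/FUA bits, etc. */
        uint8_t lba[4];          /* starting logical block address, big-endian */
        uint8_t reserved;
        uint8_t length[2];       /* number of blocks to transfer, big-endian */
        uint8_t control;
    };

    /* Example: describe a write of 16 512-byte blocks (one 8K page)
     * starting at logical block 'lba'. */
    static void fill_write10(struct scsi_write10_cdb *cdb,
                             uint32_t lba, uint16_t nblocks)
    {
        cdb->opcode = 0x2A;
        cdb->flags = 0;
        cdb->lba[0] = lba >> 24; cdb->lba[1] = lba >> 16;
        cdb->lba[2] = lba >> 8;  cdb->lba[3] = lba;
        cdb->reserved = 0;
        cdb->length[0] = nblocks >> 8; cdb->length[1] = nblocks;
        cdb->control = 0;
    }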
I don't understand how, without a copy of the old data that was in the torn block, you can restore that block from just log file entries. Can you explain this to me? Take, as an example, a block with ten tuples, only one of which has been changed "recently." (I.e., only that change is in the log files.)

> If we log pages to WAL, they are useless when archived (after a
> checkpoint). So either we have a separate "log" for them (the
> ping-pong file), or we should at least remove them when archived,
> which makes log archiving more complex but is perfectly doable.

Right. That seems to me a better option, since we've now got only one write point on the disk rather than two.

> Finally, I would love to hear why we are using the operating system
> buffer manager at all. The OS is acting as a secondary buffer manager
> for us. Why is that? What flaw in our I/O system does this reveal?

It's acting as a "second-level" buffer manager, yes, but to say it's "secondary" may be a bit misleading. On most of the systems I've set up, the OS buffer cache is doing the vast majority of the work, and the postgres buffering is fairly minimal. There are some good (and some perhaps not-so-good) reasons to do it this way. I'll list them more or less in the order of best to worst:

    1. The OS knows where the blocks physically reside on disk, and
    postgres does not. Therefore it's in the interest of postgresql to
    dispatch write responsibility back to the OS as quickly as possible
    so that the OS can prioritize requests appropriately. Most operating
    systems use an "elevator" algorithm to minimize disk head movement;
    but if the OS does not have a block that it could write while the
    head is "on the way" to another request, it can't write it in that
    head pass.

    2. Postgres does not know about any "bank-switching" tricks for
    mapping more physical memory than it has address space. Thus, on
    32-bit machines, postgres might be limited to mapping 2 or 3 GB of
    memory, even though the machine has, say, 6 GB of physical RAM. The
    OS can use all of the available memory for caching; postgres cannot.

    3. A lot of work has been put into the seek algorithms, read-ahead
    algorithms, block allocation algorithms, etc. in the OS. Why
    duplicate all that work again in postgres?

When you say things like the following:

> We should only be writing blocks when they need to be on disk. We
> should not be expecting the OS to write them "sometime later" and
> avoid blocking (as long) for the write. If we need that, then our
> buffer management is wrong and we need to fix it.

you appear to be making the argument that we should take the route of other database systems, and use raw devices and our own management of disk block allocation. If so, you might want first to look back through the archives at the discussion I and several others had about this a month or two ago.

After looking in detail at what NetBSD, at least, does in terms of its disk I/O algorithms and buffering, I've pretty much come around, at least for the moment, to the attitude that we should stick with using the OS. I wouldn't mind seeing postgres be able to manage all of this stuff, but it's a *lot* of work for not all that much benefit that I can see.

> The ORACLE people were not kidding when they said that they could not
> certify Linux for production use until it supported O_DSYNC. Can you
> explain why that was the case?

I'm suspecting it's because Linux at the time had no raw devices, so O_DSYNC was the only other possible method of making sure that disk writes actually got to disk.
You certainly don't want to use O_DSYNC if you can use another method, because O_DSYNC still goes through the the operating system's buffer cache, wasting memory and double-caching things. If you're doing your own management, you need either to use a raw device or open files with the flag that indicates that the buffer cache should not be used at all for reads from and writes to that file. > However, this discussion and a search of the pgsql-hackers archives > reveals this problem to be the KEY area of PostgreSQL's failing, and > general misunderstanding, when compared to its commercial competitors. No, I think it's just that you're under a few minor misapprehensions here about what postgres and the OS are actually doing. As I said, I went through this whole exact argument a month or two ago, on this very list, and I came around to the idea that what postgres is doing now works quite well, at least on NetBSD. (Most other OSes have disk I/O algorithms that are pretty much as good or better.) There might be a very slight advantage to doing all one's own I/O management, but it's a huge amount of work, and I think that much effort could be much more usefully applied to other areas. Just as a side note, I've been a NetBSD developer since about '96, and have been delving into the details of OS design since well before that time, so I'm coming to this with what I hope is reasonably good knowledge of how disks work and how operating systems use them. (Not that this should stop you from pointing out holes in my arguments. :-)) cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
J. R. Nield wrote: > So since we have all this buffering designed especially to meet our > needs, and since the OS buffering is in the way, can someone explain to > me why postgresql would ever open a file without the O_DSYNC flag if the > platform supports it? We sync only WAL, not the other pages, except for the sync() call we do during checkpoint when we discard old WAL files. > > I concur with Bruce: the reason we keep page images in WAL is to > > minimize the number of places we have to fsync, and thus the amount of > > head movement required for a commit. Putting the page images elsewhere > > cannot be a win AFAICS. > > > Why not put all the page images in a single pre-allocated file and treat > it as a ring? How could this be any worse than flushing them in the WAL > log? > > Maybe fsync would be slower with two files, but I don't see how > fdatasync would be, and most platforms support that. We have fdatasync option for WAL in postgresql.conf. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
should already be fixed ... On 23 Jun 2002, Dave Cramer wrote: > I am getting lots of errors on pgadmin.postgresql.org > > Dave > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > > >
> On Sun, 23 Jun 2002, Bruce Momjian wrote: >> Yes, I don't see writing to two files vs. one to be any win, especially >> when we need to fsync both of them. What I would really like is to >> avoid the double I/O of writing to WAL and to the data file; improving >> that would be a huge win. I don't believe it's possible to eliminate the double I/O. Keep in mind though that in the ideal case (plenty of shared buffers) you are only paying two writes per modified block per checkpoint interval --- one to the WAL during the first write of the interval, and then a write to the real datafile issued by the checkpoint process. Anything that requires transaction commits to write data blocks will likely result in more I/O not less, at least for blocks that are modified by several successive transactions. The only thing I've been able to think of that seems like it might improve matters is to make the WAL writing logic aware of the layout of buffer pages --- specifically, to know that our pages generally contain an uninteresting "hole" in the middle, and not write the hole. Optimistically this might reduce the WAL data volume by something approaching 50%; though pessimistically (if most pages are near full) it wouldn't help much. This was not very feasible when the WAL code was designed because the buffer manager needed to cope with both normal pages and pg_log pages, but as of 7.2 I think it'd be safe to assume that all pages have the standard layout. regards, tom lane
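A sketch of the "don't write the hole" idea, using the pd_lower/pd_upper fields of the standard page layout; the struct below is a simplified stand-in for the real page header, so take it as illustration only:

    /* With the standard page layout, everything between pd_lower and
     * pd_upper is free space, so a WAL page image only needs the two
     * ends of the page plus the hole's offset and length. */
    #include <string.h>
    #include <stdint.h>

    #define BLCKSZ 8192

    typedef struct {
        uint16_t pd_lower;      /* offset to start of free space */
        uint16_t pd_upper;      /* offset to end of free space */
        /* ... other header fields omitted in this sketch ... */
    } SketchPageHeader;

    /* Copy the page into a WAL buffer, leaving out the hole.  Returns
     * the number of bytes actually placed in 'dest'. */
    size_t copy_page_image_without_hole(char *dest, const char *page)
    {
        const SketchPageHeader *hdr = (const SketchPageHeader *) page;
        size_t head = hdr->pd_lower;            /* header + line pointers */
        size_t tail = BLCKSZ - hdr->pd_upper;   /* the tuples at the end */

        memcpy(dest, page, head);
        memcpy(dest + head, page + hdr->pd_upper, tail);
        return head + tail;     /* hole length = BLCKSZ - (head + tail) */
    }

On replay, the two pieces would go back to their original offsets and the hole could simply be zero-filled.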
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> The only thing I've been able to think of that seems like it might
>> improve matters is to make the WAL writing logic aware of the layout
>> of buffer pages --- specifically, to know that our pages generally
>> contain an uninteresting "hole" in the middle, and not write the hole.
>> Optimistically this might reduce the WAL data volume by something
>> approaching 50%; though pessimistically (if most pages are near full)
>> it wouldn't help much.

> Good idea. How about putting the page through our TOAST compression
> routine before writing it to WAL? Should be pretty easy and fast and
> doesn't require any knowledge of the page format.

Easy, maybe, but fast definitely NOT. The compressor is not speedy. Given that we have to be holding various locks while we build WAL records, I do not think it's a good idea to add CPU time there. Also, compressing already-compressed data is not a win ...

regards, tom lane
Tom Lane wrote: > > On Sun, 23 Jun 2002, Bruce Momjian wrote: > >> Yes, I don't see writing to two files vs. one to be any win, especially > >> when we need to fsync both of them. What I would really like is to > >> avoid the double I/O of writing to WAL and to the data file; improving > >> that would be a huge win. > > I don't believe it's possible to eliminate the double I/O. Keep in mind > though that in the ideal case (plenty of shared buffers) you are only > paying two writes per modified block per checkpoint interval --- one to > the WAL during the first write of the interval, and then a write to the > real datafile issued by the checkpoint process. Anything that requires > transaction commits to write data blocks will likely result in more I/O > not less, at least for blocks that are modified by several successive > transactions. > > The only thing I've been able to think of that seems like it might > improve matters is to make the WAL writing logic aware of the layout > of buffer pages --- specifically, to know that our pages generally > contain an uninteresting "hole" in the middle, and not write the hole. > Optimistically this might reduce the WAL data volume by something > approaching 50%; though pessimistically (if most pages are near full) > it wouldn't help much. Good idea. How about putting the page through our TOAST compression routine before writing it to WAL? Should be pretty easy and fast and doesn't require any knowledge of the page format. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
"J. R. Nield" <jrnield@usol.com> writes: > Also, postgreSQL can't recover from any other type of block corruption, > while the commercial systems can. Say again? > Would it not be the case that things like read-ahead, grouping writes, > and caching written data are probably best done by PostgreSQL, because > only our buffer manager can understand when they will be useful or when > they will thrash the cache? I think you have been missing the point. No one denies that there will be some incremental gain if we do all that. However, the conclusion of everyone who has thought much about it (and I see Curt has joined that group) is that the effort would be far out of proportion to the probable gain. There are a lot of other things we desperately need to spend time on that would not amount to re-engineering large quantities of OS-level code. Given that most Unixen have perfectly respectable disk management subsystems, we prefer to tune our code to make use of that stuff, rather than follow the "conventional wisdom" that databases need to bypass it. Oracle can afford to do that sort of thing because they have umpteen thousand developers available. Postgres does not. regards, tom lane
On Sun, 2002-06-23 at 23:40, Curt Sampson wrote: > On 23 Jun 2002, J. R. Nield wrote: > > > If is impossible to do what you want. You can not protect against > > partial writes without writing pages twice and calling fdatasync > > between them while going through a generic filesystem. > > I agree with this. > > > The best disk array will not protect you if the operating system does > > not align block writes to the structure of the underlying device. > > This I don't quite understand. Assuming you're using a SCSI drive > (and this mostly applies to ATAPI/IDE, too), you can do naught but > align block writes to the structure of the underlying device. When you > initiate a SCSI WRITE command, you start by telling the device at which > block to start writing and how many blocks you intend to write. Then you > start passing the data. > All I'm saying is that the entire postgresql block write must be converted into exactly one SCSI write command in all cases, and I don't know a portable way to ensure this. > > Even with raw devices, you need special support or knowledge of the > > operating system and/or the disk device to ensure that each write > > request will be atomic to the underlying hardware. > > Well, so here I guess you're talking about two things: > > 1. When you request, say, an 8K block write, will the OS really > write it to disk in a single 8K or multiple of 8K SCSI write > command? > > 2. Does the SCSI device you're writing to consider these writes to > be transactional. That is, if the write is interrupted before being > completed, does the SCSI device guarantee that the partially-sent > data is not written, and the old data is maintained? And of course, > does it guarantee that, when it acknowledges a write, that write is > now in stable storage and will never go away? > > Both of these are not hard to guarantee, actually. For a BSD-based OS, > for example, just make sure that your filesystem block size is the > same as or a multiple of the database block size. BSD will never write > anything other than a block or a sequence of blocks to a disk in a > single SCSI transaction (unless you've got a really odd SCSI driver). > And for your disk, buy a Baydel or Clarion disk array, or something > similar. > > Given that it's not hard to set up a system that meets these criteria, > and this is in fact commonly done for database servers, it would seem a > good idea for postgres to have the option to take advantage of the time > and money spent and adjust its performance upward appropriately. I agree with this. My point was only that you need to know what guarantees your operating system/hardware combination provides on a case-by-case basis, and there is no standard way for a program to discover this. Most system administrators are not going to know this either, unless databases are their main responsibility. > > > All other systems rely on the fact that you can recover a damaged file > > using the log archive. > > Not exactly. For MS SQL Server, at any rate, if it detects a page tear > you cannot restore based on the log file alone. You need a full or > partial backup that includes that entire torn block. > I should have been more specific: you need a backup of the file from some time ago, plus all the archived logs from then until the current log sequence number. > > This means downtime in the rare case, but no data loss. Until > > PostgreSQL can do this, then it will not be acceptable for real > > critical production use. > > It seems to me that it is doing this right now. 
In fact, it's more > reliable than some commerial systems (such as SQL Server) because it can > recover from a torn block with just the logfile. Again, what I meant to say is that the commercial systems can recover with an old file backup + logs. How old the backup can be depends only on how much time you are willing to spend playing the logs forward. So if you do a full backup once a week, and multiplex and backup the logs, then even if a backup tape gets destroyed you can still survive. It just takes longer. Also, postgreSQL can't recover from any other type of block corruption, while the commercial systems can. That's what I meant by the "critical production use" comment, which was sort-of unfair. So I would say they are equally reliable for torn pages (but not bad blocks), and the commercial systems let you trade potential recovery time for not having to write the blocks twice. You do need to back-up the log archives though. > > > But at the end of the day, unless you have complete understanding of > > the I/O system from write(2) through to the disk system, the only sure > > ways to protect against partial writes are by "careful writes" (in > > the WAL log or elsewhere, writing pages twice), or by requiring (and > > allowing) users to do log-replay recovery when a file is corrupted by > > a partial write. > > I don't understand how, without a copy of the old data that was in the > torn block, you can restore that block from just log file entries. Can > you explain this to me? Take, as an example, a block with ten tuples, > only one of which has been changed "recently." (I.e., only that change > is in the log files.) > > > > If we log pages to WAL, they are useless when archived (after a > > checkpoint). So either we have a separate "log" for them (the > > ping-pong file), or we should at least remove them when archived, > > which makes log archiving more complex but is perfectly doable. > > Right. That seems to me a better option, since we've now got only one > write point on the disk rather than two. OK. I agree with this now. > > > Finally, I would love to hear why we are using the operating system > > buffer manager at all. The OS is acting as a secondary buffer manager > > for us. Why is that? What flaw in our I/O system does this reveal? > > It's acting as a "second-level" buffer manager, yes, but to say it's > "secondary" may be a bit misleading. On most of the systems I've set > up, the OS buffer cache is doing the vast majority of the work, and the > postgres buffering is fairly minimal. > > There are some good (and some perhaps not-so-good) reasons to do it this > way. I'll list them more or less in the order of best to worst: > > 1. The OS knows where the blocks physically reside on disk, and > postgres does not. Therefore it's in the interest of postgresql to > dispatch write responsibility back to the OS as quickly as possible > so that the OS can prioritize requests appropriately. Most operating > systems use an "elevator" algorithm to minimize disk head movement; > but if the OS does not have a block that it could write while the > head is "on the way" to another request, it can't write it in that > head pass. > > 2. Postgres does not know about any "bank-switching" tricks for > mapping more physical memory than it has address space. Thus, on > 32-bit machines, postgres might be limited to mapping 2 or 3 GB of > memory, even though the machine has, say, 6 GB of physical RAM. The > OS can use all of the available memory for caching; postgres cannot. > > 3. 
A lot of work has been put into the seek algorithms, read-ahead > algorithms, block allocation algorithms, etc. in the OS. Why > duplicate all that work again in postgres? > > When you say things like the following: > > > We should only be writing blocks when they need to be on disk. We > > should not be expecting the OS to write them "sometime later" and > > avoid blocking (as long) for the write. If we need that, then our > > buffer management is wrong and we need to fix it. > > you appear to be making the arugment that we should take the route of > other database systems, and use raw devices and our own management of > disk block allocation. If so, you might want first to look back through > the archives at the discussion I and several others had about this a > month or two ago. After looking in detail at what NetBSD, at least, does > in terms of its disk I/O algorithms and buffering, I've pretty much come > around, at least for the moment, to the attitude that we should stick > with using the OS. I wouldn't mind seeing postgres be able to manage all > of this stuff, but it's a *lot* of work for not all that much benefit > that I can see. I'll back off on that. I don't know if we want to use the OS buffer manager, but shouldn't we try to have our buffer manager group writes together by files, and pro-actively get them out to disk? Right now, it looks like all our write requests are delayed as long as possible and the order in which they are written is pretty-much random, as is the backend that writes the block, so there is no locality of reference even when the blocks are adjacent on disk, and the write calls are spread-out over all the backends. Would it not be the case that things like read-ahead, grouping writes, and caching written data are probably best done by PostgreSQL, because only our buffer manager can understand when they will be useful or when they will thrash the cache? I may likely be wrong on this, and I haven't done any performance testing. I shouldn't have brought this up alongside the logging issues, but there seemed to be some question about whether the OS was actually doing all these things behind the scene. > > > The ORACLE people were not kidding when they said that they could not > > certify Linux for production use until it supported O_DSYNC. Can you > > explain why that was the case? > > I'm suspecting it's because Linux at the time had no raw devices, so > O_DSYNC was the only other possible method of making sure that disk > writes actually got to disk. > > You certainly don't want to use O_DSYNC if you can use another method, > because O_DSYNC still goes through the the operating system's buffer > cache, wasting memory and double-caching things. If you're doing your > own management, you need either to use a raw device or open files with > the flag that indicates that the buffer cache should not be used at all > for reads from and writes to that file. Would O_DSYNC|O_RSYNC turn off the cache? > > > However, this discussion and a search of the pgsql-hackers archives > > reveals this problem to be the KEY area of PostgreSQL's failing, and > > general misunderstanding, when compared to its commercial competitors. > > No, I think it's just that you're under a few minor misapprehensions > here about what postgres and the OS are actually doing. As I said, I > went through this whole exact argument a month or two ago, on this very > list, and I came around to the idea that what postgres is doing now > works quite well, at least on NetBSD. 
(Most other OSes have disk I/O > algorithms that are pretty much as good or better.) There might be a > very slight advantage to doing all one's own I/O management, but it's > a huge amount of work, and I think that much effort could be much more > usefully applied to other areas. I will look for that discussion in the archives. The logging issue is a key one I think. At least I would be very nervous as a DBA if I were running a system where any damaged file would cause data loss. Does anyone know what the major barriers to infinite log replay are in PostgreSQL? I'm trying to look for everything that might need to be changed outside xlog.c, but surely this has come up before. Searching the archives hasn't revealed much. As to the I/O issue: Since you know a lot about NetBSD internals, I'd be interested in hearing about what postgresql looks like to the NetBSD buffer manager. Am I right that strings of successive writes get randomized? What do our cache-hit percentages look like? I'm going to do some experimenting with this. > > Just as a side note, I've been a NetBSD developer since about '96, > and have been delving into the details of OS design since well before > that time, so I'm coming to this with what I hope is reasonably good > knowledge of how disks work and how operating systems use them. (Not > that this should stop you from pointing out holes in my arguments. :-)) > This stuff is very difficult to get right. Glad to know you follow this list. > cjs > -- > Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org > Don't you know, in this new Dark Age, we're all light. --XTC > -- J. R. Nield jrnield@usol.com
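As a small illustration of the alignment condition Curt describes (the filesystem block size and the database page size dividing each other evenly), a hypothetical sanity check, assuming a platform with statvfs(). It says nothing about whether the drive itself writes a block atomically; it only flags the case where a page write must span filesystem blocks:

    #include <sys/statvfs.h>
    #include <stdio.h>

    #define DB_BLCKSZ 8192          /* stand-in for the configured page size */

    /* Hypothetical check, not in PostgreSQL. */
    static void
    check_block_alignment(const char *datadir)
    {
        struct statvfs vfs;

        if (statvfs(datadir, &vfs) != 0)
            return;

        if (vfs.f_bsize % DB_BLCKSZ != 0 && DB_BLCKSZ % vfs.f_bsize != 0)
            fprintf(stderr,
                    "warning: fs block size %lu vs. %d-byte pages -- "
                    "page writes may span filesystem blocks\n",
                    (unsigned long) vfs.f_bsize, DB_BLCKSZ);
    }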
J. R. Nield wrote: > > This I don't quite understand. Assuming you're using a SCSI drive > > (and this mostly applies to ATAPI/IDE, too), you can do naught but > > align block writes to the structure of the underlying device. When you > > initiate a SCSI WRITE command, you start by telling the device at which > > block to start writing and how many blocks you intend to write. Then you > > start passing the data. > > > > All I'm saying is that the entire postgresql block write must be > converted into exactly one SCSI write command in all cases, and I don't > know a portable way to ensure this. ... > I agree with this. My point was only that you need to know what > guarantees your operating system/hardware combination provides on a > case-by-case basis, and there is no standard way for a program to > discover this. Most system administrators are not going to know this > either, unless databases are their main responsibility. Yes, agreed. >1% are going to know the answer to this question so we have to assume worst case. > > It seems to me that it is doing this right now. In fact, it's more > > reliable than some commerial systems (such as SQL Server) because it can > > recover from a torn block with just the logfile. > > Again, what I meant to say is that the commercial systems can recover > with an old file backup + logs. How old the backup can be depends only > on how much time you are willing to spend playing the logs forward. So > if you do a full backup once a week, and multiplex and backup the logs, > then even if a backup tape gets destroyed you can still survive. It just > takes longer. > > Also, postgreSQL can't recover from any other type of block corruption, > while the commercial systems can. That's what I meant by the "critical > production use" comment, which was sort-of unfair. > > So I would say they are equally reliable for torn pages (but not bad > blocks), and the commercial systems let you trade potential recovery > time for not having to write the blocks twice. You do need to back-up > the log archives though. Yes, good tradeoff analysis. We recover from partial writes quicker, and don't require saving of log files, _but_ we don't recover from bad disk blocks. Good summary. > I'll back off on that. I don't know if we want to use the OS buffer > manager, but shouldn't we try to have our buffer manager group writes > together by files, and pro-actively get them out to disk? Right now, it > looks like all our write requests are delayed as long as possible and > the order in which they are written is pretty-much random, as is the > backend that writes the block, so there is no locality of reference even > when the blocks are adjacent on disk, and the write calls are spread-out > over all the backends. > > Would it not be the case that things like read-ahead, grouping writes, > and caching written data are probably best done by PostgreSQL, because > only our buffer manager can understand when they will be useful or when > they will thrash the cache? The OS should handle all of this. We are doing main table writes but no sync until checkpoint, so the OS can keep those blocks around and write them at its convenience. It knows the size of the buffer cache and when stuff is forced to disk. We can't second-guess that. > I may likely be wrong on this, and I haven't done any performance > testing. I shouldn't have brought this up alongside the logging issues, > but there seemed to be some question about whether the OS was actually > doing all these things behind the scene. 
It had better. Looking at the kernel source is the way to know. > Does anyone know what the major barriers to infinite log replay are in > PostgreSQL? I'm trying to look for everything that might need to be > changed outside xlog.c, but surely this has come up before. Searching > the archives hasn't revealed much. This has been brought up. Could we just save WAL files and get replay? I believe some things have to be added to WAL to allow this, but it seems possible. However, the pg_dump is just a data dump and does not have the file offsets and things. Somehow you would need a tar-type backup of the database, and with a running db, it is hard to get a valid snapshot of that. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: >> Does anyone know what the major barriers to infinite log replay are in >> PostgreSQL? I'm trying to look for everything that might need to be >> changed outside xlog.c, but surely this has come up before. Searching >> the archives hasn't revealed much. > This has been brought up. Could we just save WAL files and get replay? > I believe some things have to be added to WAL to allow this, but it > seems possible. The Red Hat group has been looking at this somewhat; so far there seem to be some minor tweaks that would be needed, but no showstoppers. > Somehow you would need a tar-type > backup of the database, and with a running db, it is hard to get a valid > snapshot of that. But you don't *need* a "valid snapshot", only a correct copy of every block older than the first checkpoint in your WAL log series. Any inconsistencies in your tar dump will look like repairable damage; replaying the WAL log will fix 'em. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > >> Does anyone know what the major barriers to infinite log replay are in > >> PostgreSQL? I'm trying to look for everything that might need to be > >> changed outside xlog.c, but surely this has come up before. Searching > >> the archives hasn't revealed much. > > > This has been brought up. Could we just save WAL files and get replay? > > I believe some things have to be added to WAL to allow this, but it > > seems possible. > > The Red Hat group has been looking at this somewhat; so far there seem > to be some minor tweaks that would be needed, but no showstoppers. Good. > > Somehow you would need a tar-type > > backup of the database, and with a running db, it is hard to get a valid > > snapshot of that. > > But you don't *need* a "valid snapshot", only a correct copy of > every block older than the first checkpoint in your WAL log series. > Any inconsistencies in your tar dump will look like repairable damage; > replaying the WAL log will fix 'em. Yes, my point was that you need physical file backups, not pg_dump, and you have to be tricky about the files changing during the backup. You _can_ work around changes to the files during backup. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Mon, 2002-06-24 at 17:16, Tom Lane wrote: > I think you have been missing the point... Yes, this appears to be the case. Thanks especially to Curt for clearing things up for me. -- J. R. Nield jrnield@usol.com
On 24 Jun 2002, J. R. Nield wrote: > All I'm saying is that the entire postgresql block write must be > converted into exactly one SCSI write command in all cases, and I don't > know a portable way to ensure this. No, there's no portable way. All you can do is give the admin who is able to set things up safely the ability to turn off the now-unneeded (and expensive) safety-related stuff that postgres does. > I agree with this. My point was only that you need to know what > guarantees your operating system/hardware combination provides on a > case-by-case basis, and there is no standard way for a program to > discover this. Most system administrators are not going to know this > either, unless databases are their main responsibility. Certainly this is true of pretty much every database system out there. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
On Mon, 24 Jun 2002, Tom Lane wrote: > There are a lot of other things we desperately need to spend time > on that would not amount to re-engineering large quantities of OS-level > code. Given that most Unixen have perfectly respectable disk management > subsystems, we prefer to tune our code to make use of that stuff, rather > than follow the "conventional wisdom" that databases need to bypass it. > ... > Oracle can afford to do that sort of thing because they have umpteen > thousand developers available. Postgres does not. Well, Oracle also started out, a long long time ago, on systems without unified buffer cache and so on, and so they *had* to write this stuff because otherwise data would not be cached. So Oracle can also afford to maintain it now because the code already exists. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
I'm splitting off this buffer management stuff into a separate thread. On 24 Jun 2002, J. R. Nield wrote: > I'll back off on that. I don't know if we want to use the OS buffer > manager, but shouldn't we try to have our buffer manager group writes > together by files, and pro-actively get them out to disk? The only way the postgres buffer manager can "get [data] out to disk" is to do an fsync(). For data files (as opposed to log files), this can only slow down overall system throughput, as this would only disrupt the OS's write management. > Right now, it > looks like all our write requests are delayed as long as possible and > the order in which they are written is pretty-much random, as is the > backend that writes the block, so there is no locality of reference even > when the blocks are adjacent on disk, and the write calls are spread-out > over all the backends. It doesn't matter. The OS will introduce locality of reference with its write algorithms. Take a look at http://www.cs.wisc.edu/~solomon/cs537/disksched.html for an example. Most OSes use the elevator or one-way elevator algorithm. So it doesn't matter whether it's one back-end or many writing, and it doesn't matter in what order they do the write. > Would it not be the case that things like read-ahead, grouping writes, > and caching written data are probably best done by PostgreSQL, because > only our buffer manager can understand when they will be useful or when > they will thrash the cache? Operating systems these days are not too bad at guessing what you're doing. Pretty much every OS I've seen will do read-ahead when it detects you're doing sequential reads, at least in the forward direction. And Solaris is even smart enough to mark the pages you've read as "not needed" so that they quickly get flushed from the cache, rather than blowing out your entire cache if you go through a large file. > Would O_DSYNC|O_RSYNC turn off the cache? No. I suppose there's nothing to stop it doing so, in some implementations, but the interface is not designed for direct I/O. > Since you know a lot about NetBSD internals, I'd be interested in > hearing about what postgresql looks like to the NetBSD buffer manager. Well, looks like pretty much any program, or group of programs, doing a lot of I/O. :-) > Am I right that strings of successive writes get randomized? No; as I pointed out, they in fact get de-randomized as much as possible. The more processes you have throwing out requests, the better the throughput will be in fact. > What do our cache-hit percentages look like? I'm going to do some > experimenting with this. Well, that depends on how much memory you have and what your working set is. :-) cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
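Where a program does know its access pattern up front, newer systems also let it say so explicitly. A sketch using posix_fadvise(), which -- assuming the platform provides it at all -- is only a hint and changes no semantics; the OS read-ahead and cache-replacement code still does the actual work:

    #include <sys/types.h>
    #include <fcntl.h>

    /* Hint, don't manage: tell the kernel the scan pattern and let its
     * read-ahead machinery take it from there. */
    static void
    hint_sequential_scan(int fd, off_t length)
    {
    #ifdef POSIX_FADV_SEQUENTIAL
        posix_fadvise(fd, 0, length, POSIX_FADV_SEQUENTIAL);
    #endif
    }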
Curt Sampson wrote: > On Mon, 24 Jun 2002, Tom Lane wrote: > > > There are a lot of other things we desperately need to spend time > > on that would not amount to re-engineering large quantities of OS-level > > code. Given that most Unixen have perfectly respectable disk management > > subsystems, we prefer to tune our code to make use of that stuff, rather > > than follow the "conventional wisdom" that databases need to bypass it. > > ... > > Oracle can afford to do that sort of thing because they have umpteen > > thousand developers available. Postgres does not. > > Well, Oracle also started out, a long long time ago, on systems without > unified buffer cache and so on, and so they *had* to write this stuff > because otherwise data would not be cached. So Oracle can also afford to > maintain it now because the code already exists. Well, actually, it isn't unified buffer cache that is the issue, but rather the older SysV file system had pretty poor performance so bypassing it was a bigger win than it is today. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
So, while we're at it, what's the current state of people's thinking on using mmap rather than shared memory for data file buffers? I see some pretty powerful advantages to this approach, and I'm not (yet :-)) convinced that the disadvantages are as bad as people think. I think I can address most of the concerns in doc/TODO.detail/mmap. Is this worth pursuing a bit? (I.e., should I spend an hour or two writing up the advantages and thoughts on how to get around the problems?) Anybody got objections that aren't in doc/TODO.detail/mmap? cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson <cjs@cynic.net> writes: > So, while we're at it, what's the current state of people's thinking > on using mmap rather than shared memory for data file buffers? There seem to be a couple of different threads in doc/TODO.detail/mmap. One envisions mmap as a one-for-one replacement for our current use of SysV shared memory, the main selling point being to get out from under kernels that don't have SysV support or have it configured too small. This might be worth doing, and I think it'd be relatively easy to do now that the shared memory support is isolated in one file and there's provisions for selecting a shmem implementation at configure time. The only thing you'd really have to think about is how to replace the current behavior that uses shmem attach counts to discover whether any old backends are left over from a previous crashed postmaster. I dunno if mmap offers any comparable facility. The other discussion seemed to be considering how to mmap individual data files right into backends' address space. I do not believe this can possibly work, because of loss of control over visibility of data changes to other backends, timing of write-backs, etc. But as long as you stay away from interpretation #2 and go with mmap-as-a-shmget-substitute, it might be worthwhile. (Hey Marc, can one do mmap in a BSD jail?) regards, tom lane
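A minimal sketch of the shmget-substitute idea: the postmaster maps one shared, anonymous region before forking any backends, so children inherit it across fork(). The dead-backend detection question raised above is left unanswered here, and some platforms spell the flag MAP_ANON:

    #include <sys/mman.h>
    #include <stddef.h>

    /* Created once in the postmaster; backends inherit the mapping.
     * Error handling and attach-count bookkeeping omitted. */
    static void *
    create_shared_region(size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        return (p == MAP_FAILED) ? NULL : p;
    }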
On Tue, 2002-06-25 at 09:09, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > So, while we're at it, what's the current state of people's thinking > > on using mmap rather than shared memory for data file buffers? > [snip] > > (Hey Marc, can one do mmap in a BSD jail?) I believe the answer is YES. I can send you the man pages if you want. > > regards, tom lane -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 E-Mail: ler@lerctr.org US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
On Tue, 25 Jun 2002, Tom Lane wrote: > The only thing you'd really have to think about is how to replace the > current behavior that uses shmem attach counts to discover whether any > old backends are left over from a previous crashed postmaster. I dunno > if mmap offers any comparable facility. Sure. Just mmap a file, and it will be persistent. > The other discussion seemed to be considering how to mmap individual > data files right into backends' address space. I do not believe this > can possibly work, because of loss of control over visibility of data > changes to other backends, timing of write-backs, etc. I don't understand why there would be any loss of visibility of changes. If two backends mmap the same block of a file, and it's shared, that's the same block of physical memory that they're accessing. Changes don't even need to "propagate," because the memory is truly shared. You'd keep your locks in the page itself as well, of course. Can you describe the problem in more detail? > But as long as you stay away from interpretation #2 and go with > mmap-as-a-shmget-substitute, it might be worthwhile. It's #2 that I was really looking at. :-) cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > So, while we're at it, what's the current state of people's thinking > > on using mmap rather than shared memory for data file buffers? > > There seem to be a couple of different threads in doc/TODO.detail/mmap. > > One envisions mmap as a one-for-one replacement for our current use of > SysV shared memory, the main selling point being to get out from under > kernels that don't have SysV support or have it configured too small. > This might be worth doing, and I think it'd be relatively easy to do > now that the shared memory support is isolated in one file and there's > provisions for selecting a shmem implementation at configure time. > The only thing you'd really have to think about is how to replace the > current behavior that uses shmem attach counts to discover whether any > old backends are left over from a previous crashed postmaster. I dunno > if mmap offers any comparable facility. > > The other discussion seemed to be considering how to mmap individual > data files right into backends' address space. I do not believe this > can possibly work, because of loss of control over visibility of data > changes to other backends, timing of write-backs, etc. Agreed. Also, there was an interesting thread that mmap'ing /dev/zero is the same as anonmap for OS's that don't have anonmap. That should cover most of them. The only downside I can see is that SysV shared memory is locked into RAM on some/most OS's while mmap anon probably isn't. Locking in RAM is good in most cases, bad in others. This will also work well when we have non-SysV semaphore support, like Posix semaphores, so we would be able to run with no SysV stuff. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
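The /dev/zero trick mentioned above, sketched for platforms without MAP_ANONYMOUS: a MAP_SHARED mapping of /dev/zero yields zero-filled memory that children share after fork(), and the mapping survives closing the descriptor:

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stddef.h>

    static void *
    map_dev_zero(size_t size)
    {
        int   fd = open("/dev/zero", O_RDWR);
        void *p;

        if (fd < 0)
            return NULL;
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                  /* the mapping outlives the descriptor */
        return (p == MAP_FAILED) ? NULL : p;
    }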
Tom Lane writes: > There seem to be a couple of different threads in > doc/TODO.detail/mmap. > [ snip ] A place where mmap could be easily used and would offer a good performance increase is for COPY FROM. Lee.
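A sketch of that idea: map the COPY input file read-only and scan it in place rather than read()ing it into a buffer. parse_copy_line() is a hypothetical stand-in for the real COPY parsing:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    extern void parse_copy_line(const char *line, size_t len);  /* hypothetical */

    static void
    copy_from_mmap(const char *path)
    {
        int         fd = open(path, O_RDONLY);
        struct stat st;
        char       *base;
        const char *line, *end, *limit;

        if (fd < 0)
            return;
        if (fstat(fd, &st) == 0)
        {
            base = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (base != MAP_FAILED)
            {
                limit = base + st.st_size;
                for (line = base; line < limit; line = end + 1)
                {
                    end = memchr(line, '\n', limit - line);
                    if (end == NULL)
                        end = limit;
                    parse_copy_line(line, end - line);
                }
                munmap(base, (size_t) st.st_size);
            }
        }
        close(fd);
    }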
On Tue, 25 Jun 2002, Bruce Momjian wrote: > The only downside I can see is that SysV shared memory is > locked into RAM on some/most OS's while mmap anon probably isn't. It is if you mlock() it. :-) cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
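For reference, the call in question; it pins the given range in physical memory, subject to privilege and memory-lock resource limits:

    #include <sys/mman.h>
    #include <stddef.h>

    /* Pin an mmap'd region in RAM; returns 0 on success, -1 on failure
     * (e.g. insufficient privilege or over RLIMIT_MEMLOCK). */
    static int
    pin_shared_region(void *addr, size_t len)
    {
        return mlock(addr, len);
    }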
Curt Sampson <cjs@cynic.net> writes: > On Tue, 25 Jun 2002, Tom Lane wrote: >> The other discussion seemed to be considering how to mmap individual >> data files right into backends' address space. I do not believe this >> can possibly work, because of loss of control over visibility of data >> changes to other backends, timing of write-backs, etc. > I don't understand why there would be any loss of visibility of changes. > If two backends mmap the same block of a file, and it's shared, that's > the same block of physical memory that they're accessing. Is it? You have a mighty narrow conception of the range of implementations that's possible for mmap. But the main problem is that mmap doesn't let us control when changes to the memory buffer will get reflected back to disk --- AFAICT, the OS is free to do the write-back at any instant after you dirty the page, and that completely breaks the WAL algorithm. (WAL = write AHEAD log; the log entry describing a change must hit disk before the data page change itself does.) regards, tom lane
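The rule in miniature, with placeholder types and functions rather than the real bufmgr/xlog API: a dirty data page may be written only after WAL has been flushed at least up to that page's LSN. With mmap'd data files the kernel can perform the final write on its own schedule, so the ordering cannot be enforced:

    typedef unsigned long Lsn;

    typedef struct BufferSketch
    {
        Lsn    page_lsn;            /* LSN of last WAL record touching the page */
        int    relfile;
        int    blockno;
        char  *data;
    } BufferSketch;

    /* Hypothetical helpers, for illustration only */
    extern Lsn  wal_flushed_up_to(void);
    extern void wal_flush(Lsn upto);
    extern void write_block(int relfile, int blockno, const char *data);

    static void
    flush_data_page(const BufferSketch *buf)
    {
        /* Write-ahead rule: log first ... */
        if (wal_flushed_up_to() < buf->page_lsn)
            wal_flush(buf->page_lsn);

        /* ... data page second.  Under mmap the kernel could do this
         * step on its own, before the log record is flushed. */
        write_block(buf->relfile, buf->blockno, buf->data);
    }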
Bruce Momjian <pgman@candle.pha.pa.us> writes: > This will also work well when we have non-SysV semaphore support, like > Posix semaphores, so we would be able to run with no SysV stuff. You do realize that we can use Posix semaphores today? The Darwin (OS X) port uses 'em now. That's one reason I am more interested in mmap as a shmget substitute than I used to be. regards, tom lane
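A sketch of the POSIX flavour, using an unnamed semaphore placed in the shared region; some platforms only provide the named sem_open() variant, which is presumably what a port without process-shared sem_init() would fall back to:

    #include <semaphore.h>

    /* Unnamed process-shared semaphore living inside the shared memory
     * region; pshared=1 makes it usable across processes. */
    static int
    init_shared_sema(sem_t *sem)
    {
        return sem_init(sem, 1, 1); /* binary semaphore, initially free */
    }

    static void
    lock_sema(sem_t *sem)
    {
        sem_wait(sem);
    }

    static void
    unlock_sema(sem_t *sem)
    {
        sem_post(sem);
    }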
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > This will also work well when we have non-SysV semaphore support, like > > Posix semaphores, so we would be able to run with no SysV stuff. > > You do realize that we can use Posix semaphores today? The Darwin (OS X) > port uses 'em now. That's one reason I am more interested in mmap as > a shmget substitute than I used to be. No, I didn't realize we had gotten that far. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > On Tue, 25 Jun 2002, Tom Lane wrote: > >> The other discussion seemed to be considering how to mmap individual > >> data files right into backends' address space. I do not believe this > >> can possibly work, because of loss of control over visibility of data > >> changes to other backends, timing of write-backs, etc. > > > I don't understand why there would be any loss of visibility of changes. > > If two backends mmap the same block of a file, and it's shared, that's > > the same block of physical memory that they're accessing. > > Is it? You have a mighty narrow conception of the range of > implementations that's possible for mmap. > > But the main problem is that mmap doesn't let us control when changes to > the memory buffer will get reflected back to disk --- AFAICT, the OS is > free to do the write-back at any instant after you dirty the page, and > that completely breaks the WAL algorithm. (WAL = write AHEAD log; > the log entry describing a change must hit disk before the data page > change itself does.) Can we mmap WAL without problems? Not sure if there is any gain to it because we just write it and rarely read from it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Can we mmap WAL without problems? Not sure if there is any gain to it > because we just write it and rarely read from it. Perhaps, but I don't see any point to it. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Can we mmap WAL without problems? Not sure if there is any gain to it > > because we just write it and rarely read from it. > > Perhaps, but I don't see any point to it. Agreed. I have been poking around google looking for an article I read months ago saying that mmap of files is slightly faster in low memory usage situations, but much slower in high memory usage situations because the kernel doesn't know as much about the file access in mmap as it does with stdio. I will find it. :-) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Tue, 25 Jun 2002, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > > I don't understand why there would be any loss of visibility of changes. > > If two backends mmap the same block of a file, and it's shared, that's > > the same block of physical memory that they're accessing. > > Is it? You have a mighty narrow conception of the range of > implementations that's possible for mmap. It's certainly possible to implement something that you call mmap that is not. But if you are using the posix-defined MAP_SHARED flag, the behaviour above is what you see. It might be implemented slightly differently internally, but that's no concern of postgres. And I find it pretty unlikely that it would be implemented otherwise without good reason. Note that your proposal of using mmap to replace sysv shared memory relies on the behaviour I've described too. As well, if you're replacing sysv shared memory with an mmap'd file, you may end up doing excessive disk I/O on systems without the MAP_NOSYNC option. (Without this option, the update thread/daemon may ensure that every buffer is flushed to the backing store on disk every 30 seconds or so. You might be able to get around this by using a small file-backed area for things that need to persist after a crash, and a larger anonymous area for things that don't need to persist after a crash.) > But the main problem is that mmap doesn't let us control when changes to > the memory buffer will get reflected back to disk --- AFAICT, the OS is > free to do the write-back at any instant after you dirty the page, and > that completely breaks the WAL algorithm. (WAL = write AHEAD log; > the log entry describing a change must hit disk before the data page > change itself does.) Hm. Well, we could try not to write the data to the page until after we receive notification that our WAL data is committed to stable storage. However, the new data has to be available to all of the backends at the exact time that the commit happens. Perhaps a shared list of pending writes? Another option would be to just let it write, but on startup, scan all of the data blocks in the database for tuples that have a transaction ID later than the last one we updated to, and remove them. That could be pretty darn expensive on a large database, though. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
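The control asymmetry in a single call: msync() can force a mapped range out to disk at a moment of the program's choosing, but nothing in the interface lets the program hold a dirty page back, which is the half of the control the write-ahead rule needs:

    #include <sys/mman.h>
    #include <stddef.h>

    /* Force write-back of a dirty mapped range; MS_SYNC blocks until
     * the pages are on disk.  There is no corresponding "not yet". */
    static int
    force_writeback(void *addr, size_t len)
    {
        return msync(addr, len, MS_SYNC);
    }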
Curt Sampson <cjs@cynic.net> writes: > Note that your proposal of using mmap to replace sysv shared memory > relies on the behaviour I've described too. True, but I was not envisioning mapping an actual file --- at least on HPUX, the only way to generate an arbitrary-sized shared memory region is to use MAP_ANONYMOUS and not have the mmap'd area connected to any file at all. It's not farfetched to think that this aspect of mmap might work differently from mapping pieces of actual files. In practice of course we'd have to restrict use of any such implementation to platforms where mmap behaves reasonably ... according to our definition of "reasonably". regards, tom lane
Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > Note that your proposal of using mmap to replace sysv shared memory > > relies on the behaviour I've described too. > > True, but I was not envisioning mapping an actual file --- at least > on HPUX, the only way to generate an arbitrary-sized shared memory > region is to use MAP_ANONYMOUS and not have the mmap'd area connected > to any file at all. It's not farfetched to think that this aspect > of mmap might work differently from mapping pieces of actual files. > > In practice of course we'd have to restrict use of any such > implementation to platforms where mmap behaves reasonably ... according > to our definition of "reasonably". Yes, I am told mapping /dev/zero is the same as the anon map. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Wed, 26 Jun 2002, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > Note that your proposal of using mmap to replace sysv shared memory > > relies on the behaviour I've described too. > > True, but I was not envisioning mapping an actual file --- at least > on HPUX, the only way to generate an arbitrary-sized shared memory > region is to use MAP_ANONYMOUS and not have the mmap'd area connected > to any file at all. It's not farfetched to think that this aspect > of mmap might work differently from mapping pieces of actual files. I find it somewhat farfetched, for a couple of reasons: 1. Memory mapped with the MAP_SHARED flag is shared memory, anonymous or not. POSIX is pretty explicit about how this works, and the "standard" for mmap that predates POSIX is the same. Anonymous memory does not behave differently. You could just as well say that some systems might exist such that one process can write() a block to a file, and then another might read() it afterwards but not see the changes. Postgres should not try to deal with hypothetical systems that are so completely broken. 2. Mmap is implemented as part of a unified buffer cache system on all of today's operating systems that I know of. The memory is backed by swap space when anonymous, and by a specified file when not anonymous; but the way these two are handled is *exactly* the same internally. Even on older systems without unified buffer cache, the behaviour is the same between anonymous and file-backed mmap'd memory. And there would be no point in making it otherwise. Mmap is designed to let you share memory; why make a broken implementation under certain circumstances? > In practice of course we'd have to restrict use of any such > implementation to platforms where mmap behaves reasonably ... according > to our definition of "reasonably". Of course. As we do already with regular I/O. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC