Thread: Same query, same performance

Same query, same performance

From
"alexandre :: aldeia digital"
Date:
Hi all,

First, sorry for the long mail...

I have a system with 7 million records spread across 600 tables.
My current production machine is: P4 1.6G, 3 IDE 7200, 1GB PC133
My new production machine is: Dual Xeon 2.0G HT, 1GB DDR266 ECC,
3 SCSI disks with hardware RAID 5

The postgresql.conf is the SAME on both systems, and I test
with no other connections, only my local test.

shared_buffers = 80000
effective_cache_size = 60000
random_page_cost = 2.5
cpu_tuple_cost = 0.001
cpu_index_tuple_cost = 0.0001
cpu_operator_cost = 0.00025

My question is:

Why, if I execute the same query many times, is the
duration practically the same on both systems?

1) !       1.185424 elapsed 1.090000 user 0.100000 system sec
2) !       1.184415 elapsed 1.070000 user 0.120000 system sec
3) !       1.185209 elapsed 1.100000 user 0.080000 system sec

If the disks are not read directly, the system must be finding
the rows in RAM. If it finds them in RAM, why do such different machines
have the same execution times, and why don't the times go down???

The variations of the query shown below all take practically the same
time, and my system sends thousands of these queries, each taking
around 1.18 seconds... :(

Many thanks,

Alexandre


Query:
[postgres@host1 data]$ psql -c "explain SELECT T2.fi15emp05,
T2.fi15flagcf, T2.fi15codcf, T1.Fn06Emp07, T1.Fn06TipTit, T1.Fn06TitBan,
T1.Fn06Conta1, T1.Fn06NumTit, T1.Fn06Desdob, T1.Fn05CodPre, T1.Fn06eCli1,
T1.Fn06tCli1,   T1.Fn06cCli1, T2.fi15nome  FROM (FN06T T1 LEFT JOIN FI15T
T2 ON T2.fi15emp05 = T1.Fn06eCli1   AND T2.fi15flagcf = T1.Fn06tCli1 AND
T2.fi15codcf = T1.Fn06cCli1) WHERE ( T1.Fn06Emp07   = '1' AND
T1.Fn06TipTit = 'R' ) AND ( T1.Fn06TitBan = '002021001525
' ) ORDER BY T1.Fn06Emp07, T1.Fn06TipTit,   T1.Fn06NumTit, T1.Fn06Desdob,
T1.Fn05CodPre, T1.Fn06eCli1, T1.Fn06tCli1, T1.Fn06cCli1" Pro13Z
                                                                  QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=25875.53..25875.53 rows=15 width=155)
   Sort Key: t1.fn06emp07, t1.fn06tiptit, t1.fn06numtit, t1.fn06desdob,
t1.fn05codpre, t1.fn06ecli1, t1.fn06tcli1, t1.fn06ccli1
   ->  Nested Loop  (cost=0.00..25875.50 rows=15 width=155)
         ->  Seq Scan on fn06t t1  (cost=0.00..25808.30 rows=15 width=95)
               Filter: ((fn06emp07 = 1::smallint) AND (fn06tiptit =
'R'::bpchar) AND (fn06titban = '002021001525
  '::bpchar))
         ->  Index Scan using fi15t_pkey on fi15t t2  (cost=0.00..4.33
rows=1 width=60)
               Index Cond: ((t2.fi15emp05 = "outer".fn06ecli1) AND
(t2.fi15flagcf = "outer".fn06tcli1) AND (t2.fi15codcf =
"outer".fn06ccli1))
(7 rows)

*** AND FROM THE LOG when I execute the query:

2003-01-23 00:09:49 [3372]   LOG:  duration: 1.285900 sec
2003-01-23 00:09:49 [3372]   LOG:  QUERY STATISTICS
! system usage stats:
!       1.286001 elapsed 1.240000 user 0.040000 system sec
!       [1.250000 user 0.040000 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       50526/130 [50693/372] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! buffer usage stats:
!       Shared blocks:          0 read,          0 written, buffer hit
rate = 100.00%
!       Local  blocks:          0 read,          0 written, buffer hit
rate = 0.00%
!       Direct blocks:          0 read,          0 written




Re: Same query, same performance

From
Tomasz Myrta
Date:
alexandre :: aldeia digital wrote:

>Hi all,
>
>First, sorry for the long mail...
>
>I have a system with 7 million records spread across 600 tables.
>My current production machine is: P4 1.6G, 3 IDE 7200, 1GB PC133
>My new production machine is: Dual Xeon 2.0G HT, 1GB DDR266 ECC,
>3 SCSI disks with hardware RAID 5
>
>The postgresql.conf is the SAME on both systems, and I test
>with no other connections, only my local test.
>
>shared_buffers = 80000
>effective_cache_size = 60000
>random_page_cost = 2.5
>cpu_tuple_cost = 0.001
>cpu_index_tuple_cost = 0.0001
>cpu_operator_cost = 0.00025
>
>My question is:
>
>Why, if I execute the same query many times, is the
>duration practically the same on both systems?
>
>1) !       1.185424 elapsed 1.090000 user 0.100000 system sec
>2) !       1.184415 elapsed 1.070000 user 0.120000 system sec
>3) !       1.185209 elapsed 1.100000 user 0.080000 system sec
>
>If the disks are not read directly, the system must be finding
>the rows in RAM. If it finds them in RAM, why do such different machines
>have the same execution times, and why don't the times go down???

Here is your problem:
->  Seq Scan on fn06t t1  (cost=0.00..25808.30 rows=15 width=95)
               Filter: ((fn06emp07 = 1::smallint) AND (fn06tiptit =
'R'::bpchar) AND (fn06titban = '002021001525
  '::bpchar))

Probably the system has to read the whole table fn06t from disk each time,
because it doesn't use an index scan.

Do you have any indexes on table fn06t? How selective are the conditions
above? How big is this table? Can you use a multi-column index on this
table? It should help, because the conditions above return only 15 rows.

Regards,
Tomasz Myrta




Re: Same query, same performance

From
"Josh Berkus"
Date:
Alexandre,

> I have a system with 7 million records spread across 600 tables.
> My current production machine is: P4 1.6G, 3 IDE 7200, 1GB PC133
> My new production machine is: Dual Xeon 2.0G HT, 1GB DDR266 ECC,
> 3 SCSI disks with hardware RAID 5

Well, first of all, those two systems are almost equivalent as far as
Postgres is concerned for simple queries.   The extra processor power
will only help you with very complex queries.  3-disk RAID 5 is no
faster ... and sometimes slower ... than IDE for database purposes.
  The only real boost to the Xeon is the faster RAM ... which may not
help you if your drive array is the bottleneck.

>
> The postgresql.conf is the SAME in both systems and I test
> with no other connections, only my local test.
>
> shared_buffers = 80000
> effective_cache_size = 60000
> random_page_cost = 2.5
> cpu_tuple_cost = 0.001
> cpu_index_tuple_cost = 0.0001
> cpu_operator_cost = 0.00025

Not that it affects the query below, but what about SORT_MEM?

> Why, if I execute the same query many times, is the
> duration practically the same on both systems?
>
> 1) !       1.185424 elapsed 1.090000 user 0.100000 system sec
> 2) !       1.184415 elapsed 1.070000 user 0.120000 system sec
> 3) !       1.185209 elapsed 1.100000 user 0.080000 system sec
>
> If the disks are not read directly, the system must be finding
> the rows in RAM. If it finds them in RAM, why do such different machines
> have the same execution times, and why don't the times go down???

I'm pretty sure that PostgreSQL always checks on disk, even when the
same query is run repeatedly.  Tom?

> [postgres@host1 data]$ psql -c "explain SELECT T2.fi15emp05,
> T2.fi15flagcf, T2.fi15codcf, T1.Fn06Emp07, T1.Fn06TipTit,
> T1.Fn06TitBan,
> T1.Fn06Conta1, T1.Fn06NumTit, T1.Fn06Desdob, T1.Fn05CodPre,
> T1.Fn06eCli1,
> T1.Fn06tCli1,   T1.Fn06cCli1, T2.fi15nome  FROM (FN06T T1 LEFT JOIN
> FI15T
> T2 ON T2.fi15emp05 = T1.Fn06eCli1   AND T2.fi15flagcf = T1.Fn06tCli1
> AND
> T2.fi15codcf = T1.Fn06cCli1) WHERE ( T1.Fn06Emp07   = '1' AND
> T1.Fn06TipTit = 'R' ) AND ( T1.Fn06TitBan = '002021001525
>
> ' ) ORDER BY T1.Fn06Emp07, T1.Fn06TipTit,   T1.Fn06NumTit,
> T1.Fn06Desdob,
> T1.Fn05CodPre, T1.Fn06eCli1, T1.Fn06tCli1, T1.Fn06cCli1" Pro13Z

Actually, from your stats, Postgres is doing a pretty good job.   1.18
seconds to return 15 rows from a 7 million row table searching on
non-indexed columns?  I don't think you have anything to complain about.

If you want less-than-1-second response time: add some indexes and keep
the tables VACUUMed so the indexes work.  In particular, add a
multi-column index on ( T1.Fn06Emp07, T1.Fn06TipTit, T1.Fn06TitBan ).
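
For example, something along these lines (the index name is arbitrary; the
columns come straight from your WHERE clause):

    CREATE INDEX fn06t_emp_tip_ban_idx
        ON fn06t (fn06emp07, fn06tiptit, fn06titban);
    VACUUM ANALYZE fn06t;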

If you want single-digit-msec response: get a better disk set for
Postgres.  I recommend dual-channel RAID 1 (in addition to indexing).

-Josh Berkus



Re: Same query, same performance

From
"alexandre :: aldeia digital"
Date:
Tomasz,

>>1) !       1.185424 elapsed 1.090000 user 0.100000 system sec
>>2) !       1.184415 elapsed 1.070000 user 0.120000 system sec
>>3) !       1.185209 elapsed 1.100000 user 0.080000 system sec
>>
>>If the disks are not read directly, the system must be finding
>>the rows in RAM. If it finds them in RAM, why do such different machines
>>have the same execution times, and why don't the times go down???
>
> Here is your problem:
> ->  Seq Scan on fn06t t1  (cost=0.00..25808.30 rows=15 width=95)
>                Filter: ((fn06emp07 = 1::smallint) AND (fn06tiptit =
> 'R'::bpchar) AND (fn06titban = '002021001525
>   '::bpchar))

Really! I had not noticed that fn06t does not have an index
on fn06titban ... :)

Now the query times are < 0.02 sec on the P4
and < 0.05 sec on the Xeon.

Many thanks,

Alexandre



Re: Same query, same performance

From
"alexandre :: aldeia digital"
Date:
Josh,

> Alexandre,
>
>> I have a system with 7 million records spread across 600 tables.
>> My current production machine is: P4 1.6G, 3 IDE 7200, 1GB PC133
>> My new production machine is: Dual Xeon 2.0G HT, 1GB DDR266 ECC,
>> 3 SCSI disks with hardware RAID 5
>
> Well, first of all, those two systems are almost equivalent as far as
> Postgres is concerned for simple queries.   The extra processor power
> will only help you with very complex queries.  3-disk RAID 5 is no
> faster ... and sometimes slower ... than IDE for database purposes.
>   The only real boost to the Xeon is the faster RAM ... which may not
> help you if your drive array is the bottleneck.

Today I will add one more HD and set up RAID 10 ...
Next week I will report my tests to the list...

>
>>
>> The postgresql.conf is the SAME in both systems and I test
>> with no other connections, only my local test.
>>
>> shared_buffers = 80000
>> effective_cache_size = 60000
>> random_page_cost = 2.5
>> cpu_tuple_cost = 0.001
>> cpu_index_tuple_cost = 0.0001
>> cpu_operator_cost = 0.00025
>
> Not that it affects the query below, but what about SORT_MEM?

Sort_mem = 32000

> Actually, from your stats, Postgres is doing a pretty good job.   1.18
> seconds to return 15 rows from a 7 million row table searching on
> non-indexed columns?  I don't think you have anything to complain about.

The table has 300000 tuples; the entire database has 7 million.
Tomasz answered the question: a missing index on fn06t ...

But the query time difference between the systems continues.
I will change the disks and report to the list afterwards...

Thanks, Josh,


Alexandre






Does "correlation" mislead the optimizer on large tables?

From
Ron Mayer
Date:
Short summary:

  On large tables, I think the "correlation" pg_stats field as calculated
  by "vacuum analyze" or "analyze" can mislead the optimizer.

  By forcing index scans on some queries shown below, some queries
  in my database speed up from 197 seconds to under 30 seconds.

  I'd like feedback on whether or not having a smarter "analyze"
  function (which I think I could write as a separate utility) would
  help me in situations like this.

Longer:

  In particular, if I have a large table t with columns 'a','b','c', etc,
  and I cluster the table as follows:

    create table t_ordered as select * from t order by a,b;
    vacuum analyze t_ordered;

  Column "b" will (correctly) get a very low "correlation" in
  the pg_stats table -- but I think the optimizer would do better
  assuming a high correlation because similar 'b' values are still
  grouped closely on the same disk pages.



  Below is a real-world example of this issue.

  The table "fact" is a large one (reltuples = 1e8, relpages = 1082385)
  and contains about 1 years worth of data.  The data was loaded
  sequentialy (ordered by dat,tim).

    logs=# \d fact;
            Table "fact"
     Column |          Type          | Modifiers
    --------+------------------------+-----------
     dat    | date                   |
     tim    | time without time zone |
     ip_id  | integer                |
     bid_id | integer                |
     req_id | integer                |
     ref_id | integer                |
     uag_id | integer                |
    Indexes: i_fact_2__bid_id,
         i_fact_2__dat,
         i_fact_2__tim,
         i_fact_2__ip_id,
         i_fact_2__ref_id,
         i_fact_2__req_id


  With a table this large, each day's worth of data contains
  about 3000 pages; or conversely, each page contains only about
  a 30 second range of values for "tim".

  As shown in the queries below, the optimizer wanted to do
  a sequential scan when looking at a 10 minute part of the day.
  However also as shown, forcing an index scan did much better.

  I'm guessing this happened because the optimizer saw the
  horrible correlation, and decided it would have to read
  an enormous number of pages if it did an index scan.

===========================================

logs=# select tablename,attname,n_distinct,correlation from pg_stats where tablename='fact';
 tablename | attname | n_distinct | correlation
-----------+---------+------------+-------------
 fact      | dat     |        365 |           1
 fact      | tim     |      80989 | -0.00281447
 fact      | ip_id   |      44996 |    0.660689
 fact      | bid_id  |     742850 |    0.969026
 fact      | req_id  |       2778 |     0.67896
 fact      | ref_id  |        595 |    0.258023
 fact      | uag_id  |        633 |    0.234216
(7 rows)


logs=# explain analyze select * from fact where tim<'00:10:00';
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on fact  (cost=0.00..1949838.40 rows=526340 width=32) (actual time=0.39..197447.50 rows=402929 loops=1)
   Filter: (tim < '00:10:00'::time without time zone)
 Total runtime: 197810.01 msec
(3 rows)

logs=# explain analyze select * from fact where tim<'00:10:00';
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on fact  (cost=0.00..1949838.40 rows=526340 width=32) (actual time=15.25..156705.76 rows=402929 loops=1)
   Filter: (tim < '00:10:00'::time without time zone)
 Total runtime: 157089.15 msec
(3 rows)

logs=# set enable_seqscan = off;
SET
logs=# explain analyze select * from fact where tim<'00:10:00';
                                                               QUERY PLAN
                

----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using i__fact__tim on fact  (cost=0.00..2110978.39 rows=526340 width=32) (actual time=104.41..23307.84 rows=402929 loops=1)
   Index Cond: (tim < '00:10:00'::time without time zone)
 Total runtime: 23660.95 msec
(3 rows)

logs=# explain analyze select * from fact where tim<'00:10:00';
                                                             QUERY PLAN
             

-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using i__fact__tim on fact  (cost=0.00..2110978.39 rows=526340 width=32) (actual time=0.03..1477.35 rows=402929 loops=1)
   Index Cond: (tim < '00:10:00'::time without time zone)
 Total runtime: 1827.94 msec
(3 rows)



logs=#

*******************************************************************************
*******************************************************************************


So two questions:

  a) Am I on to something.... or is something else the reason why
     the optimizer chose the much slower sequential scan?

  b) If I did write an "analyze" that tried to set "correlation" values
     that took into account such local grouping of data, would anyone
     be interested?


       Ron






Re: Does "correlation" mislead the optimizer on large tables?

From
Tom Lane
Date:
Ron Mayer <ron@intervideo.com> writes:
>   On large tables, I think the "correlation" pg_stats field as calculated
>   by "vacuum analyze" or "analyze" can mislead the optimizer.

If you look in the pghackers archives, you will find some discussion
about changing the equation that cost_index() uses to estimate the
impact of correlation on indexscan cost.  The existing equation is
ad-hoc and surely wrong, but so far no one's proposed a replacement
that can be justified any better.  If you've got such a replacement
then we're all ears...

>   In particular, if I have a large table t with columns 'a','b','c', etc,
>   and I cluster the table as follows:
>     create table t_ordered as select * from t order by a,b;
>     vacuum analyze t_ordered;
>   Column "b" will (correctly) get a very low "correlation" in
>   the pg_stats table -- but I think the optimizer would do better
>   assuming a high correlation because similar 'b' values are still
>   grouped closely on the same disk pages.

How would that be?  They'll be separated by the stride of 'a'.

It seems likely to me that a one-dimensional correlation statistic may
be inadequate, but I haven't seen any proposals for better stats.

            regards, tom lane

Re: Does "correlation" mislead the optimizer on large

From
Stephan Szabo
Date:
On Fri, 24 Jan 2003, Tom Lane wrote:

> Ron Mayer <ron@intervideo.com> writes:
> >   In particular, if I have a large table t with columns 'a','b','c', etc,
> >   and I cluster the table as follows:
> >     create table t_ordered as select * from t order by a,b;
> >     vacuum analyze t_ordered;
> >   Column "b" will (correctly) get a very low "correlation" in
> >   the pg_stats table -- but I think the optimizer would do better
> >   assuming a high correlation because similar 'b' values are still
> >   grouped closely on the same disk pages.
>
> How would that be?  They'll be separated by the stride of 'a'.

I think it's a clumping effect.

For example, I made a table (ordered) with 20 values of a, 50 values of b
(each showing up in each a) and 100 values of c (not used; it just means 100
rows for each (a,b) combination).  It's got 541 pages, it looks like.  Analyze
sets the correlation to about 0.08 on the table, and so a query like
select * from test1 where b=1; prefers a sequential scan (1791 vs 2231),
while the index scan actually performs about 5 times better.
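
For reference, a rough sketch of a setup like the one described (generate_series
is newer syntax than the Postgres version discussed in this thread; the sketch
is only to make the shape of the data concrete):

    CREATE TABLE test1 AS
        SELECT a, b, c
          FROM generate_series(1, 20)  AS a,
               generate_series(1, 50)  AS b,
               generate_series(1, 100) AS c
         ORDER BY a, b;
    CREATE INDEX test1_b_idx ON test1 (b);
    ANALYZE test1;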

I guess the reason is that in general, the index scan *really* is reading
something on the order of 40 pages rather than the much larger estimate
(I'd guess something on the order of say 300-400?  I'm not sure how to
find that except by trying to reverse engineer the estimate number),
because pretty much each value of a will probably have 1 or 2 pages with
b=1.

I'm not really sure how to measure that, however.



Re: Does "correlation" mislead the optimizer on large

From
Ron Mayer
Date:
On Fri, 24 Jan 2003, Tom Lane wrote:
>
> Ron Mayer <ron@intervideo.com> writes:
> >   On large tables, I think the "correlation" pg_stats field as calculated
> >   by "vacuum analyze" or "analyze" can mislead the optimizer.
>
> If you look in the pghackers archives, you will find some discussion
> about changing the equation that cost_index() uses to estimate the
> impact of correlation on indexscan cost.  The existing equation is
> ad-hoc and surely wrong, but so far no one's proposed a replacement
> that can be justified any better.  If you've got such a replacement
> then we're all ears...

I've got a very slow one (full table scan perl script) that helps
my database...  I don't know if it's a good general purpose solution.

That's why I'm asking if the concept is good here.  :-)


> >   In particular, if I have a large table t with columns 'a','b','c', etc,
> >   and I cluster the table as follows:
> >     create table t_ordered as select * from t order by a,b;
> >     vacuum analyze t_ordered;
> >   Column "b" will (correctly) get a very low "correlation" in
> >   the pg_stats table -- but I think the optimizer would do better
> >   assuming a high correlation because similar 'b' values are still
> >   grouped closely on the same disk pages.
>
> How would that be?  They'll be separated by the stride of 'a'.


In the case of date/time (for the queries I showed) the issue was
that 'a's were not at all unique so I had data like this:

     dat     |  time    | value
 ------------|----------|--------------------------------
 2002-01-01  | 00:00:00 | whatever
 2002-01-01  | 00:00:00 |
 2002-01-01  | 00:00:00 |
 2002-01-01  | 00:00:01 |
 2002-01-01  | 00:00:01 |  [many pages of 12am]
 2002-01-01  | 00:00:01 |
 2002-01-01  | 00:00:01 |
 ... thousands more rows....
 2002-01-01  | 00:00:59 |
 2002-01-01  | 00:01:00 | [many pages of 1am]
 ... tens of thousands of rows.
 2002-01-01  | 23:59:59 |
 2002-01-01  | 23:59:59 |
 2002-01-01  | 23:59:59 | [many pages of 11pm]
 2002-01-02  | 00:00:00 | [many *MORE* pages of 12am]
 2002-01-02  | 00:00:00 |
 2002-01-02  | 00:00:00 |
 ... tens of thousands of rows...
 2002-01-02  | 23:59:59 | [many pages of 11pm]
 2002-01-03  | 00:00:00 | [many *MORE* pages of 12am]
 ... millions more rows ...


A similar problem actually shows up again in the dimension tables
of my database, where I bulk load many pages at a time (which can
easily be ordered to give a good correlation for a single load) ...
but then the next week's data gets appended to the end.

    id | value
     ------|----------------------------------
     1 | aalok mehta                      [many pages of all 'a's]
     2 | aamir khan
     3 | aaron beall
       | [...]
      6234 | axel rose
      6234 | austin wolf
      6123 | barbara boxer                    [many pages of all 'b's]
       | [...]
    123456 | young
    123457 | zebra
       | [...data loaded later..]
    123458 | aaron whatever                   [more pages of all 'a's]
    123458 | aaron something else
    123458 | aaron something else
           | [...]
    512344 | zelany


In this case I get many clustered blocks of "a" values, but these
clustered blocks happen at many different times across the table.


> It seems likely to me that a one-dimensional correlation statistic may
> be inadequate, but I haven't seen any proposals for better stats.

The idea is it walks the whole table and looks for more local
correlations and replaces the correlation value with a "good"
value if values "close" to each other on the disk are similar.

This way a single "correlation" value still works ... so I didn't
have to change the optimizer logic, just the "analyze" logic.


Basically if data within each block is highly correlated, it doesn't
matter as much (yeah, I know the issue about sequential reads vs. random
reads).


    Ron


Re: Does "correlation" mislead the optimizer on large

From
Ron Mayer
Date:
On Fri, 24 Jan 2003, Stephan Szabo wrote:
>
> I think it's a clumping effect.


Yup, I think that's exactly the effect.

A proposal.... (yes, I'm volunteering if people point me in the right
direction)... would be to have a "pluggable" set of analyze functions, so that a
huge database that runs analyze infrequently could choose a very slow
analyze that might work better for its data.

I see no reason the different analyze functions would have to be compiled into
the source code ... they could probably exist as PL/pgSQL functions.

The one thing compiling it in would help with is to let me know
the exact number of tuples on each individual page, but I guess
reltuples/relpages from pg_class is a good estimate.
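
For example, something like this gives that estimate (table name taken from the
"fact" example earlier in the thread):

    SELECT relname, reltuples, relpages,
           reltuples / relpages AS est_tuples_per_page
      FROM pg_class
     WHERE relname = 'fact';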


> For example, I made a table (ordered) with 20 values of a, 50 values of b
> (each showing up in each a) and 100 values of c (not used; it just means 100
> rows for each (a,b) combination).  It's got 541 pages, it looks like.  Analyze
> sets the correlation to about 0.08 on the table, and so a query like
> select * from test1 where b=1; prefers a sequential scan (1791 vs 2231),
> while the index scan actually performs about 5 times better.

That sounds like the same situation I was in. If my logic is right, this
means you had about 184 tuples/page (20*50*100/541), so it looks to me
like for each "a", you get half a page where "b=1".

If you had 'c' have 200 values, I think you'd get an even bigger speedup,
because half the page is still "wasted" on b=2 values.

If you had 'c' have 10000 values, I think you'd get a slightly bigger
speedup still, because you'd have so many b=1 pages next to each other that
you'd benefit from more sequential disk access.


> I guess the reason is that in general, the index scan *really* is reading
> something on the order of 40 pages rather than the much larger estimate
> (I'd guess something on the order of say 300-400?  I'm not sure how to
> find that except by trying to reverse engineer the estimate number),

Or by adding a printf()... I think it'd be in cost_index in costsize.c.

> because pretty much each value of a will probably have 1 or 2 pages with
> b=1.
>
> I'm not really sure how to measure that, however.


As I said... I'm happy to volunteer and experiment if people point
me in a good direction.

    Ron



Re: Does "correlation" mislead the optimizer on large

From
Tom Lane
Date:
Ron Mayer <ron@intervideo.com> writes:
> A proposal.... (yes I I'm volunteering if people point me in the right
> direction)... would be to have a "plugable" set of analyze functions so that a
> huge database that runs analyze infrequently could choose to have a very slow
> analyze that might work better for it's data.

I do not think ANALYZE is the problem here; at least, it's premature to
worry about that end of things until you've defined (a) what's to be
stored in pg_statistic, and (b) what computation the planner needs to
make to derive a cost estimate given the stats.

            regards, tom lane

Re: Does "correlation" mislead the optimizer on large

From
Noah Silverman
Date:
Hi!

Another fun question in our ongoing analysis on whether to switch from
mysql to postgres. (Just as an update, Postgres has performed
flawlessly on all of our stress tests so far.)

We have a situation where we will be creating two fairly large and
complex databases with many tables (thousands) each.  From what I
understand, postgres keeps everything in one big data directory.

Would there be an advantage to putting each of the two databases into a
separate directory and starting two instances of postgres?  Is it
better to just lump everything together.

In a perfect world, we would buy another database server and raid for
the second database, but being a small company, we just  don't have the
budget right now. The raid on our current server is much bigger than we
need.

Thanks,

-N


Multiple databases one directory

From
Noah Silverman
Date:
Hi!

Another fun question in our ongoing analysis on whether to switch from
mysql to postgres. (Just as an update, Postgres has performed
flawlessly on all of our stress tests so far.)

We have a situation where we will be creating two fairly large and
complex databases with many tables (thousands) each.  From what I
understand, postgres keeps everything in one big data directory.

Would there be an advantage to putting each of the two databases into a
separate directory and starting two instances of postgres?  Is it
better to just lump everything together.

In a perfect world, we would buy another database server and raid for
the second database, but being a small company, we just  don't have the
budget right now. The raid on our current server is much bigger than we
need.

Thanks,

-N


Re: Does "correlation" mislead the optimizer on large

From
Ron Mayer
Date:
On Fri, 24 Jan 2003, Tom Lane wrote:
>
> Ron Mayer <ron@intervideo.com> writes:
> > A proposal.... (yes I I'm volunteering if people point me in the right
> > direction)...
>
> I do not think ANALYZE is the problem here; at least, it's premature to
> worry about that end of things until you've defined (a) what's to be
> stored in pg_statistic, and (b) what computation the planner needs to
> make to derive a cost estimate given the stats.

Cool.  Thanks for a good starting point.  If I wanted to brainstorm
further, should I do so here, or should I encourage interested people
to take it offline with me (ron@intervideo.com) and I can post
a summary of the conversation?

       Ron

For those who do want to brainstorm with me, my starting point is this:

 With my particular table, I think the main issue is still that I have a
 lot of data that looks like:

  values:    aaaaaaaaaaabbbbbbbbccccccccddddddddddaaaabbbbbbbccccccccddddd...
  disk page: |page 1|page 2|page 3|page 4|page 5|page 6|page 7|page 8|page 9|

 The problem I'm trying to address is that the current planner guesses
 that most of the pages will need to be read; however the local clustering
 means that in fact only a small subset need to be accessed.  My first
 guess is that modifying the definition of "correlation" to account for
 page-sizes would be a good approach.

 I.e. Instead of the correlation across the whole table, for each row
 perform an auto-correlation
 (http://astronomy.swin.edu.au/~pbourke/analysis/correlate/)
 and keep only the values with a "delay" of less than 1 page-size.
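
 For concreteness, the usual sample autocorrelation at a lag of d rows is
 roughly

     r(d) = sum_i (x_i - mean)(x_{i+d} - mean) / sum_i (x_i - mean)^2

 and the idea (only a sketch) would be to keep only lags d corresponding to
 less than about one page's worth of tuples.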

If you want to share thoughts offline (ron@intervideo.com), I'll gladly
post a summary of responses here to save the bandwidth of the group.




Re: Multiple databases one directory

From
Josh Berkus
Date:
Noah,

> Would there be an advantage to putting each of the two databases into a
> separate directory and starting two instances of postgres?  Is it
> better to just lump everything together.

You can use the WITH LOCATION option in CREATE DATABASE to put the two
databases into separate directories *without* running two instances of
postgres.
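
For example, roughly (the target directory must first be prepared with
initlocation, and depending on how Postgres was built the path may have to be
supplied via an environment variable rather than as an absolute path; the names
here are only illustrative):

    CREATE DATABASE bigdb WITH LOCATION = '/disk2/pgdata';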

For that matter, the databases each have their own directories, by OID number.

Of course, this only helps you if the separate directories are on separate
disks/arrays/channels.  If everything is on the same disk or array, don't
bother.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


WEIRD CRASH?!?!

From
Noah Silverman
Date:
I think my server crashed and then restarted itself.  Does anybody know
what all this means:

2003-01-24 18:28:06 PANIC:  link from
/RAID/pgsql/pg_xlog/00000009000000BC to
/RAID/pgsql/pg_xlog/00000009000000C4 (initialization of log file 9,
segment 196) failed: File exists
2003-01-24 18:28:06 LOG:  server process (pid 1574) was terminated by
signal 6
2003-01-24 18:28:06 LOG:  terminating any other active server processes
2003-01-24 18:28:06 WARNING:  Message from PostgreSQL backend:
         The Postmaster has informed me that some other backend
         died abnormally and possibly corrupted shared memory.
         I have rolled back the current transaction and am
         going to terminate your database system connection and exit.
         Please reconnect to the database system and repeat your query.
2003-01-24 18:28:06 LOG:  all server processes terminated;
reinitializing shared memory and semaphores
2003-01-24 18:28:06 LOG:  database system was interrupted at 2003-01-24
18:28:06 EST
2003-01-24 18:28:06 LOG:  checkpoint record is at 9/C4574974
2003-01-24 18:28:06 LOG:  redo record is at 9/C200D144; undo record is
at 0/0; shutdown FALSE
2003-01-24 18:28:06 LOG:  next transaction id: 5159292; next oid:
50856954
2003-01-24 18:28:06 LOG:  database system was not properly shut down;
automatic recovery in progress
2003-01-24 18:28:06 LOG:  redo starts at 9/C200D144
2003-01-24 18:28:13 LOG:  ReadRecord: record with zero length at
9/C4578CC0
2003-01-24 18:28:13 LOG:  redo done at 9/C4578C9C
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000C0
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000C1
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000BC
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000BD
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000BE
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000BF
2003-01-24 18:29:02 LOG:  database system is ready





Re: WEIRD CRASH?!?!

From
Josh Berkus
Date:
Noah,

> I think my server crashed and then restarted itself.  Does anybody know
> what all this means:
>
> 2003-01-24 18:28:06 PANIC:  link from
> /RAID/pgsql/pg_xlog/00000009000000BC to
> /RAID/pgsql/pg_xlog/00000009000000C4 (initialization of log file 9,
> segment 196) failed: File exists
> 2003-01-24 18:28:06 LOG:  server process (pid 1574) was terminated by
> signal 6
> 2003-01-24 18:28:06 LOG:  terminating any other active server processes
> 2003-01-24 18:28:06 WARNING:  Message from PostgreSQL backend:
>          The Postmaster has informed me that some other backend
>          died abnormally and possibly corrupted shared memory.
>          I have rolled back the current transaction and am
>          going to terminate your database system connection and exit.
>          Please reconnect to the database system and repeat your query.

This means that somebody KILL -9'd a postgres process or the postmaster, and
Postgres restarted in order to clear the shared buffers.   If the database
started up again, you are fine.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: WEIRD CRASH?!?!

From
Noah Silverman
Date:
Yes,
but I'm the only one logged into this box, and I didn't kill anything.
It appears to have died all by itself.

Thanks,

-N


On Friday, January 24, 2003, at 07:03 PM, Josh Berkus wrote:

> Noah,
>
>> I think my server crashed and then restarted itself.  Does anybody
>> know
>> what all this means:
>>
>> 2003-01-24 18:28:06 PANIC:  link from
>> /RAID/pgsql/pg_xlog/00000009000000BC to
>> /RAID/pgsql/pg_xlog/00000009000000C4 (initialization of log file 9,
>> segment 196) failed: File exists
>> 2003-01-24 18:28:06 LOG:  server process (pid 1574) was terminated by
>> signal 6
>> 2003-01-24 18:28:06 LOG:  terminating any other active server
>> processes
>> 2003-01-24 18:28:06 WARNING:  Message from PostgreSQL backend:
>>          The Postmaster has informed me that some other backend
>>          died abnormally and possibly corrupted shared memory.
>>          I have rolled back the current transaction and am
>>          going to terminate your database system connection and exit.
>>          Please reconnect to the database system and repeat your
>> query.
>
> This means that somebody KILL -9'd a postgres process or the
> postmaster, and
> Postgres restarted in order to clear the shared buffers.   If the
> database
> started up again, you are fine.
>
> --
> -Josh Berkus
>  Aglio Database Solutions
>  San Francisco
>
>


Re: WEIRD CRASH?!?!

From
Josh Berkus
Date:
Noah,

> but I'm the only one logged into this box, and I didn't kill anything.
> It appears to have died all by itself.

I'd check your disk array, then.  It doesn't happen to be a Mylex, does it?

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: WEIRD CRASH?!?!

From
Noah Silverman
Date:
We are using a 3ware escalade on this box.

One clue.

I actually moved the pg_xlog directory to another drive and then
symbolically linked it back to the data directory.

Another idea is that Linux killed one of the processes because postgres
was using up too much memory.  I believe that part of the kernel is
called the "OOM killer".  We're not sure if this happened, just a guess.

Thanks,

-N



On Friday, January 24, 2003, at 07:15 PM, Josh Berkus wrote:

>
> Noah,
>
>> but I'm the only one logged into this box, and I didn't kill anything.
>> It appears to have died all by itself.
>
> I'd check your disk array, then.  It doesn't happen to be a Mylex,
> does it?
>
> --
> -Josh Berkus
>  Aglio Database Solutions
>  San Francisco
>
>


Re: WEIRD CRASH?!?!

From
"scott.marlowe"
Date:
On Fri, 24 Jan 2003, Noah Silverman wrote:

> Yes,
> but I'm the only one logged into this box, and I didn't kill anything.
> It appears to have died all by itself.
>

It certainly sounds that way.  Can you recreate the circumstances and make
it happen reliably?  If not, then it's likely just an isolated occurrence
and nothing to get too worried about.  Your data is still coherent; that's
why all the backends were forced to reset, to cleanse the buffers of
possible corruption.



Re: WEIRD CRASH?!?!

From
Stephan Szabo
Date:
On Fri, 24 Jan 2003, Josh Berkus wrote:

> Noah,
>
> > I think my server crashed and then restarted itself.  Does anybody know
> > what all this means:
> >
> > 2003-01-24 18:28:06 PANIC:  link from
> > /RAID/pgsql/pg_xlog/00000009000000BC to
> > /RAID/pgsql/pg_xlog/00000009000000C4 (initialization of log file 9,
> > segment 196) failed: File exists
> > 2003-01-24 18:28:06 LOG:  server process (pid 1574) was terminated by
> > signal 6
> > 2003-01-24 18:28:06 LOG:  terminating any other active server processes
> > 2003-01-24 18:28:06 WARNING:  Message from PostgreSQL backend:
> >          The Postmaster has informed me that some other backend
> >          died abnormally and possibly corrupted shared memory.
> >          I have rolled back the current transaction and am
> >          going to terminate your database system connection and exit.
> >          Please reconnect to the database system and repeat your query.
>
> This means that somebody KILL -9'd a postgres process or the postmaster, and
> Postgres restarted in order to clear the shared buffers.   If the database
> started up again, you are fine.

Actually, it looks like an abort() (signal 6) to me.  Probably from the
PANIC listed.

The question is why did it get confused and end up linking to a filename
that already existed?




Re: Multiple databases one directory

From
Tom Lane
Date:
Noah Silverman <noah@allresearch.com> writes:
> We have a situation where we will be creating two fairly large and
> complex databases with many tables (thousands) each.  From what I
> understand, postgres keeps everything in one big data directory.

Yeah.  You're kind of at the mercy of the operating system when you do
that: if it copes well with big directories, no problem, but if lookups
in big directories are slow then you'll take a performance hit.

The first thing I'd ask is *why* you think you need thousands of
tables.  How will you keep track of them?  Are there really thousands of
different table schemas?  Maybe you can combine tables by introducing
an extra key column.
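
A rough sketch of that kind of consolidation (all names here are illustrative):

    -- instead of one table per client (data_client1, data_client2, ...):
    CREATE TABLE data (
        client_id  integer NOT NULL,
        created_at timestamp,
        payload    text
    );
    CREATE INDEX data_client_idx ON data (client_id);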

Perhaps a little bit of rethinking will yield a small design screaming
to get out of this big one ...

            regards, tom lane

Re: Multiple databases one directory

From
Noah Silverman
Date:
Thanks,

We're considering this.

On an unrelated note, it looks like our crash was due to running out of
file descriptors for the bash shell.

Linux won't let me increase the limit for a user other than root.  Does
anyone know how to change this?  (We're running Slackware.)


Thanks,

-N


On Friday, January 24, 2003, at 07:50 PM, Tom Lane wrote:

> Noah Silverman <noah@allresearch.com> writes:
>> We have a situation where we will be creating two fairly large and
>> complex databases with many tables (thousands) each.  From what I
>> understand, postgres keeps everything in one big data directory.
>
> Yeah.  You're kind of at the mercy of the operating system when you do
> that: if it copes well with big directories, no problem, but if lookups
> in big directories are slow then you'll take a performance hit.
>
> The first thing I'd ask is *why* you think you need thousands of
> tables.  How will you keep track of them?  Are there really thousands
> of
> different table schemas?  Maybe you can combine tables by introducing
> an extra key column.
>
> Perhaps a little bit of rethinking will yield a small design screaming
> to get out of this big one ...
>
>             regards, tom lane
>


Re: WEIRD CRASH?!?!

From
Tom Lane
Date:
Noah Silverman <noah@allresearch.com> writes:
> One clue.
> I actually moved the pg_xlog directory to another drive and then
> symbolically linked it back to the data directory.

Uh, did you have the postmaster shut down while you did that?

This looks like a collision between two processes both trying to create
the next segment of the xlog at about the same time.  But there are
interlocks that are supposed to prevent that.

I don't think you need to worry about the integrity of your data; the
panic reset should put everything right.  But I'd sure be interested
if you can reproduce this problem.

            regards, tom lane

Re: WEIRD CRASH?!?!

From
Rod Taylor
Date:
On Fri, 2003-01-24 at 19:17, Noah Silverman wrote:
> We are using a 3ware escalade on this box.
>
> One clue.
>
> I actually moved the pg_xlog directory to another drive and then
> symbolically linked it back to the data directory.

You shut it down first right?

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc


Re: WEIRD CRASH?!?!

From
Tom Lane
Date:
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> The question is why did it get confused and end up linking to a filename
> that already existed?

The message comes from InstallXLogFileSegment(), which is careful to
ensure that the link() cannot fail, either by unlinking the previous
file, or searching for an unused name.  But it failed anyway.

It seems to me that there are only two possible explanations: a race
condition (but holding ControlFileLock should prevent that) or
BasicOpenFile() failed for a reason other than nonexistence of the file.

Hmm ... I wonder if Noah's machine could have been running out of kernel
file table slots, or something like that?  It does seem that it'd be
more robust to use something like stat(2) to probe for an existing file.

            regards, tom lane

Re: WEIRD CRASH?!?!

From
Noah Silverman
Date:
OF COURSE!

It actually looks like the problem was with file descriptors.  Our
shell only had 1024 set, and we also have mysql running and using up a
bunch of those.  We just upped the limit to 8000 to see if that would
give postgres more room to breathe.
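
For reference, a rough sketch of the kind of change involved (the values and
paths are illustrative; raising the hard limit generally has to be done as
root before switching to the postgres user):

    # as root, e.g. in the script that starts the postmaster:
    ulimit -n 8000
    su - postgres -c "pg_ctl start -D /RAID/pgsql"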

-N

On Friday, January 24, 2003, at 08:08 PM, Rod Taylor wrote:

> On Fri, 2003-01-24 at 19:17, Noah Silverman wrote:
>> We are using a 3ware escalade on this box.
>>
>> One clue.
>>
>> I actually moved the pg_xlog directory to another drive and then
>> symbolically linked it back to the data directory.
>
> You shut it down first right?
>
> --
> Rod Taylor <rbt@rbt.ca>
>
> PGP Key: http://www.rbt.ca/rbtpub.asc
> <signature.asc>


Re: WEIRD CRASH?!?!

From
Rod Taylor
Date:
On Fri, 2003-01-24 at 20:14, Noah Silverman wrote:
> OF COURSE!

Sorry, but I've seen people try to do that stuff before.


> On Friday, January 24, 2003, at 08:08 PM, Rod Taylor wrote:
>
> > On Fri, 2003-01-24 at 19:17, Noah Silverman wrote:
> >> We are using a 3ware escalade on this box.
> >>
> >> One clue.
> >>
> >> I actually moved the pg_xlog directory to another drive and then
> >> symbolically linked it back to the data directory.
> >
> > You shut it down first right?
> >
> > --
> > Rod Taylor <rbt@rbt.ca>
> >
> > PGP Key: http://www.rbt.ca/rbtpub.asc
> > <signature.asc>
--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc


Re: WEIRD CRASH?!?!

From
Tom Lane
Date:
Noah Silverman <noah@allresearch.com> writes:
> It actually looks like the problem was with file descriptors.  Our
> shell only had 1024 set, and we also have mysql running and using up a
> bunch of those.  We just upped the limit to 8000 to see if that would
> give postgres more room to breathe.

Ah-hah.  You might also want to set max_files_per_process (in
postgresql.conf) to something small enough to ensure Postgres can't run
you out of descriptors.  Linux has a bad habit of promising more than
it can deliver when Postgres asks how many FDs are okay to use.  The
max_files_per_process setting is useful to prevent Postgres from
believing whatever fairy-tale sysconf(3) tells it.
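
For example (the value is only illustrative; pick something comfortably under
the limit your kernel actually enforces):

    # in postgresql.conf:
    max_files_per_process = 1000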

            regards, tom lane

Re: WEIRD CRASH?!?!

From
Tom Lane
Date:
I wrote:
> Hmm ... I wonder if Noah's machine could have been running out of kernel
> file table slots, or something like that?  It does seem that it'd be
> more robust to use something like stat(2) to probe for an existing file.

I've applied a patch to do it that way in CVS HEAD.  After examining the
code further I'm inclined not to risk back-patching it into 7.3, though.
xlog.c is full of open() calls that will elog(PANIC) if they fail, so
I think there was only a very small window of opportunity for Noah to
see this failure and not another one.  The patch thus probably
contributes little real gain in reliability.

            regards, tom lane

Re: WEIRD CRASH?!?!

From
Andrew Sullivan
Date:
On Fri, Jan 24, 2003 at 07:08:29PM -0500, Noah Silverman wrote:
> Yes,
> but I'm the only one logged into this box, and I didn't kill anything.
> It appears to have died all by itself.

Is this on Linux, and were you short on memory?  Linux, in a
completely brain-dead design, runs around 'kill -9'-ing random
processes when it starts to think the machine is going to exhaust its
memory (or at least it used to.  I dunno if it still does).

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110