Thread: Same query, same performance
Hi all,

First, sorry for the long mail...

I have a system with 7 million records spread over 600 tables.
My current production machine is: P4 1.6G, 3 IDE 7200, 1GB PC133
My new production machine is: Dual Xeon 2.0G HT, 1GB DDR266 ECC,
3 SCSI disks with HW RAID 5

postgresql.conf is the SAME on both systems and I test
with no other connections, only my local test.

shared_buffers = 80000
effective_cache_size = 60000
random_page_cost = 2.5
cpu_tuple_cost = 0.001
cpu_index_tuple_cost = 0.0001
cpu_operator_cost = 0.00025

My question is:

If I execute the same query a lot of times, why is the duration
practically the same on both systems?

1) ! 1.185424 elapsed 1.090000 user 0.100000 system sec
2) ! 1.184415 elapsed 1.070000 user 0.120000 system sec
3) ! 1.185209 elapsed 1.100000 user 0.080000 system sec

If the disks are not being read directly, the system must be finding
the rows in RAM. If it finds them in RAM, why do such different machines
show the same execution times, and why don't the times go down?

The variations of the query shown below all take practically the same
time, and my system sends thousands of these queries, each taking about
1.18 seconds... :(

Many thanks,

Alexandre

Query:

[postgres@host1 data]$ psql -c "explain SELECT T2.fi15emp05,
T2.fi15flagcf, T2.fi15codcf, T1.Fn06Emp07, T1.Fn06TipTit, T1.Fn06TitBan,
T1.Fn06Conta1, T1.Fn06NumTit, T1.Fn06Desdob, T1.Fn05CodPre, T1.Fn06eCli1,
T1.Fn06tCli1, T1.Fn06cCli1, T2.fi15nome FROM (FN06T T1 LEFT JOIN FI15T T2
ON T2.fi15emp05 = T1.Fn06eCli1 AND T2.fi15flagcf = T1.Fn06tCli1 AND
T2.fi15codcf = T1.Fn06cCli1) WHERE ( T1.Fn06Emp07 = '1' AND
T1.Fn06TipTit = 'R' ) AND ( T1.Fn06TitBan = '002021001525 ' )
ORDER BY T1.Fn06Emp07, T1.Fn06TipTit, T1.Fn06NumTit, T1.Fn06Desdob,
T1.Fn05CodPre, T1.Fn06eCli1, T1.Fn06tCli1, T1.Fn06cCli1" Pro13Z

                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=25875.53..25875.53 rows=15 width=155)
   Sort Key: t1.fn06emp07, t1.fn06tiptit, t1.fn06numtit, t1.fn06desdob, t1.fn05codpre, t1.fn06ecli1, t1.fn06tcli1, t1.fn06ccli1
   ->  Nested Loop  (cost=0.00..25875.50 rows=15 width=155)
         ->  Seq Scan on fn06t t1  (cost=0.00..25808.30 rows=15 width=95)
               Filter: ((fn06emp07 = 1::smallint) AND (fn06tiptit = 'R'::bpchar) AND (fn06titban = '002021001525 '::bpchar))
         ->  Index Scan using fi15t_pkey on fi15t t2  (cost=0.00..4.33 rows=1 width=60)
               Index Cond: ((t2.fi15emp05 = "outer".fn06ecli1) AND (t2.fi15flagcf = "outer".fn06tcli1) AND (t2.fi15codcf = "outer".fn06ccli1))
(7 rows)

*** AND FROM THE LOG when I execute the query:

2003-01-23 00:09:49 [3372]   LOG:  duration: 1.285900 sec
2003-01-23 00:09:49 [3372]   LOG:  QUERY STATISTICS
! system usage stats:
!       1.286001 elapsed 1.240000 user 0.040000 system sec
!       [1.250000 user 0.040000 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       50526/130 [50693/372] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! buffer usage stats:
!       Shared blocks: 0 read, 0 written, buffer hit rate = 100.00%
!       Local blocks:  0 read, 0 written, buffer hit rate = 0.00%
!       Direct blocks: 0 read, 0 written
alexandre :: aldeia digital wrote:

>Hi all,
>
>First, sorry for the long mail...
>
>I have a system with 7 million records spread over 600 tables.
>My current production machine is: P4 1.6G, 3 IDE 7200, 1GB PC133
>My new production machine is: Dual Xeon 2.0G HT, 1GB DDR266 ECC,
>3 SCSI disks with HW RAID 5
>
>postgresql.conf is the SAME on both systems and I test
>with no other connections, only my local test.
>
>shared_buffers = 80000
>effective_cache_size = 60000
>random_page_cost = 2.5
>cpu_tuple_cost = 0.001
>cpu_index_tuple_cost = 0.0001
>cpu_operator_cost = 0.00025
>
>My question is:
>
>If I execute the same query a lot of times, why is the duration
>practically the same on both systems?
>
>1) ! 1.185424 elapsed 1.090000 user 0.100000 system sec
>2) ! 1.184415 elapsed 1.070000 user 0.120000 system sec
>3) ! 1.185209 elapsed 1.100000 user 0.080000 system sec
>
>If the disks are not being read directly, the system must be finding
>the rows in RAM. If it finds them in RAM, why do such different machines
>show the same execution times, and why don't the times go down?

Here is your problem:

->  Seq Scan on fn06t t1  (cost=0.00..25808.30 rows=15 width=95)
      Filter: ((fn06emp07 = 1::smallint) AND (fn06tiptit = 'R'::bpchar)
      AND (fn06titban = '002021001525 '::bpchar))

The system probably has to read the whole table fn06t from disk each
time, because it doesn't use an index scan.

Do you have any indexes on table fn06t? How selective are the conditions
above? How big is this table? Can you create a multi-column index on
this table? It should help, since the conditions above return only 15
rows.

Regards,
Tomasz Myrta
Alexandre,

> I have a system with 7 million records spread over 600 tables.
> My current production machine is: P4 1.6G, 3 IDE 7200, 1GB PC133
> My new production machine is: Dual Xeon 2.0G HT, 1GB DDR266 ECC,
> 3 SCSI disks with HW RAID 5

Well, first of all, those two systems are almost equivalent as far as
Postgres is concerned for simple queries. The extra processor power will
only help you with very complex queries. 3-disk RAID 5 is no faster ...
and sometimes slower ... than IDE for database purposes. The only real
boost on the Xeon is the faster RAM ... which may not help you if your
drive array is the bottleneck.

> postgresql.conf is the SAME on both systems and I test
> with no other connections, only my local test.
>
> shared_buffers = 80000
> effective_cache_size = 60000
> random_page_cost = 2.5
> cpu_tuple_cost = 0.001
> cpu_index_tuple_cost = 0.0001
> cpu_operator_cost = 0.00025

Not that it affects the query below, but what about SORT_MEM?

> If I execute the same query a lot of times, why is the duration
> practically the same on both systems?
>
> 1) ! 1.185424 elapsed 1.090000 user 0.100000 system sec
> 2) ! 1.184415 elapsed 1.070000 user 0.120000 system sec
> 3) ! 1.185209 elapsed 1.100000 user 0.080000 system sec
>
> If the disks are not being read directly, the system must be finding
> the rows in RAM. If it finds them in RAM, why do such different machines
> show the same execution times, and why don't the times go down?

I'm pretty sure that PostgreSQL always checks on disk, even when the same
query is run repeatedly. Tom?

> [postgres@host1 data]$ psql -c "explain SELECT T2.fi15emp05,
> T2.fi15flagcf, T2.fi15codcf, T1.Fn06Emp07, T1.Fn06TipTit, T1.Fn06TitBan,
> T1.Fn06Conta1, T1.Fn06NumTit, T1.Fn06Desdob, T1.Fn05CodPre, T1.Fn06eCli1,
> T1.Fn06tCli1, T1.Fn06cCli1, T2.fi15nome FROM (FN06T T1 LEFT JOIN FI15T
> T2 ON T2.fi15emp05 = T1.Fn06eCli1 AND T2.fi15flagcf = T1.Fn06tCli1 AND
> T2.fi15codcf = T1.Fn06cCli1) WHERE ( T1.Fn06Emp07 = '1' AND
> T1.Fn06TipTit = 'R' ) AND ( T1.Fn06TitBan = '002021001525 ' )
> ORDER BY T1.Fn06Emp07, T1.Fn06TipTit, T1.Fn06NumTit, T1.Fn06Desdob,
> T1.Fn05CodPre, T1.Fn06eCli1, T1.Fn06tCli1, T1.Fn06cCli1" Pro13Z

Actually, from your stats, Postgres is doing a pretty good job. 1.18
seconds to return 15 rows from a 7 million row table searching on
non-indexed columns? I don't think you have anything to complain about.

If you want less-than-1-second response time: add some indexes and keep
the tables VACUUMed so the indexes work. In particular, add a
multi-column index on ( T1.Fn06Emp07, T1.Fn06TipTit, T1.Fn06TitBan ),
as sketched below.

If you want single-digit-msec response: get a better disk set for
Postgres. I recommend dual-channel RAID 1 (in addition to indexing).

-Josh Berkus
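A minimal sketch of that index suggestion, using the lowercase column
names from the EXPLAIN output above (the index name itself is made up for
illustration):

    CREATE INDEX i_fn06t_emp_tip_titban
        ON fn06t (fn06emp07, fn06tiptit, fn06titban);

    -- Refresh planner statistics so the new index gets considered.
    ANALYZE fn06t;

With the WHERE clause matching the leading columns of the index, the
planner can replace the sequential scan on fn06t with an index scan.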
Tomasz,

>>1) ! 1.185424 elapsed 1.090000 user 0.100000 system sec
>>2) ! 1.184415 elapsed 1.070000 user 0.120000 system sec
>>3) ! 1.185209 elapsed 1.100000 user 0.080000 system sec
>>
>>If the disks are not being read directly, the system must be finding
>>the rows in RAM. If it finds them in RAM, why do such different machines
>>show the same execution times, and why don't the times go down?
>
> Here is your problem:
>
> ->  Seq Scan on fn06t t1  (cost=0.00..25808.30 rows=15 width=95)
>       Filter: ((fn06emp07 = 1::smallint) AND (fn06tiptit = 'R'::bpchar)
>       AND (fn06titban = '002021001525 '::bpchar))

Really! I hadn't noticed that fn06t does not have an index on
fn06titban ... :)

Now the query times are < 0.02 sec on the P4 and < 0.05 on the Xeon.

Many thanks,

Alexandre
Josh,

> Alexandre,
>
>> I have a system with 7 million records spread over 600 tables.
>> My current production machine is: P4 1.6G, 3 IDE 7200, 1GB PC133
>> My new production machine is: Dual Xeon 2.0G HT, 1GB DDR266 ECC,
>> 3 SCSI disks with HW RAID 5
>
> Well, first of all, those two systems are almost equivalent as far as
> Postgres is concerned for simple queries. The extra processor power
> will only help you with very complex queries. 3-disk RAID 5 is no
> faster ... and sometimes slower ... than IDE for database purposes.
> The only real boost on the Xeon is the faster RAM ... which may not
> help you if your drive array is the bottleneck.

Today I will add one more HD and rebuild the array as RAID 10 ...
Next week I will report my tests to the list...

>> postgresql.conf is the SAME on both systems and I test
>> with no other connections, only my local test.
>>
>> shared_buffers = 80000
>> effective_cache_size = 60000
>> random_page_cost = 2.5
>> cpu_tuple_cost = 0.001
>> cpu_index_tuple_cost = 0.0001
>> cpu_operator_cost = 0.00025
>
> Not that it affects the query below, but what about SORT_MEM?

sort_mem = 32000

> Actually, from your stats, Postgres is doing a pretty good job. 1.18
> seconds to return 15 rows from a 7 million row table searching on
> non-indexed columns? I don't think you have anything to complain about.

The table has 300,000 tuples; the entire database has 7 million.
Tomasz answered the question: a missing index on fn06t ...

But the query time difference between the systems remains. I will change
the disks and report back to the list afterwards...

Thanks, Josh.

Alexandre
Short summary:

On large tables, I think the "correlation" pg_stats field as calculated
by "vacuum analyze" or "analyze" can mislead the optimizer. By forcing
index scans on some queries shown below, some queries in my database
speed up from 197 seconds to under 30 seconds.

I'd like feedback on whether or not having a smarter "analyze" function
(which I think I could write as a separate utility) would help in
situations like this.

Longer:

In particular, if I have a large table t with columns 'a','b','c', etc.,
and I cluster the table as follows:

  create table t_ordered as select * from t order by a,b;
  vacuum analyze t_ordered;

Column "b" will (correctly) get a very low "correlation" in the pg_stats
table -- but I think the optimizer would do better assuming a high
correlation, because similar 'b' values are still grouped closely on the
same disk pages.

Below is a real-world example of this issue. The table "fact" is a large
one (reltuples = 1e8, relpages = 1082385) and contains about 1 year's
worth of data. The data was loaded sequentially (ordered by dat,tim).

logs=# \d fact;
              Table "fact"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 dat    | date                   |
 tim    | time without time zone |
 ip_id  | integer                |
 bid_id | integer                |
 req_id | integer                |
 ref_id | integer                |
 uag_id | integer                |
Indexes: i_fact_2__bid_id,
         i_fact_2__dat,
         i_fact_2__tim,
         i_fact_2__ip_id,
         i_fact_2__ref_id,
         i_fact_2__req_id

With a table this large, each day's worth of data spans about 3000 pages;
or conversely, each page contains only about a 30 second range of values
for "tim". As shown in the queries below, the optimizer wanted to do a
sequential scan when looking at a 10 minute part of the day. However,
also as shown, forcing an index scan did much better.

I'm guessing this happened because the optimizer saw the horrible
correlation and decided it would have to read an enormous number of pages
if it did an index scan.
===========================================

logs=# select tablename,attname,n_distinct,correlation from pg_stats where tablename='fact';
 tablename | attname | n_distinct | correlation
-----------+---------+------------+-------------
 fact      | dat     |        365 |           1
 fact      | tim     |      80989 | -0.00281447
 fact      | ip_id   |      44996 |    0.660689
 fact      | bid_id  |     742850 |    0.969026
 fact      | req_id  |       2778 |     0.67896
 fact      | ref_id  |        595 |    0.258023
 fact      | uag_id  |        633 |    0.234216
(7 rows)

logs=# explain analyze select * from fact where tim<'00:10:00';
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on fact  (cost=0.00..1949838.40 rows=526340 width=32) (actual time=0.39..197447.50 rows=402929 loops=1)
   Filter: (tim < '00:10:00'::time without time zone)
 Total runtime: 197810.01 msec
(3 rows)

logs=# explain analyze select * from fact where tim<'00:10:00';
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on fact  (cost=0.00..1949838.40 rows=526340 width=32) (actual time=15.25..156705.76 rows=402929 loops=1)
   Filter: (tim < '00:10:00'::time without time zone)
 Total runtime: 157089.15 msec
(3 rows)

logs=# set enable_seqscan = off;
SET
logs=# explain analyze select * from fact where tim<'00:10:00';
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using i__fact__tim on fact  (cost=0.00..2110978.39 rows=526340 width=32) (actual time=104.41..23307.84 rows=402929 loops=1)
   Index Cond: (tim < '00:10:00'::time without time zone)
 Total runtime: 23660.95 msec
(3 rows)

logs=# explain analyze select * from fact where tim<'00:10:00';
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using i__fact__tim on fact  (cost=0.00..2110978.39 rows=526340 width=32) (actual time=0.03..1477.35 rows=402929 loops=1)
   Index Cond: (tim < '00:10:00'::time without time zone)
 Total runtime: 1827.94 msec
(3 rows)

So two questions:

a) Am I on to something ... or is something else the reason why the
   optimizer chose the much slower sequential scan?

b) If I did write an "analyze" that tried to set "correlation" values
   that took into account such local grouping of data, would anyone be
   interested?

Ron
Ron Mayer <ron@intervideo.com> writes:
> On large tables, I think the "correlation" pg_stats field as calculated
> by "vacuum analyze" or "analyze" can mislead the optimizer.

If you look in the pghackers archives, you will find some discussion
about changing the equation that cost_index() uses to estimate the impact
of correlation on indexscan cost. The existing equation is ad-hoc and
surely wrong, but so far no one's proposed a replacement that can be
justified any better. If you've got such a replacement then we're all
ears...

> In particular, if I have a large table t with columns 'a','b','c', etc.,
> and I cluster the table as follows:
>   create table t_ordered as select * from t order by a,b;
>   vacuum analyze t_ordered;
> Column "b" will (correctly) get a very low "correlation" in
> the pg_stats table -- but I think the optimizer would do better
> assuming a high correlation because similar 'b' values are still
> grouped closely on the same disk pages.

How would that be? They'll be separated by the stride of 'a'.

It seems likely to me that a one-dimensional correlation statistic may be
inadequate, but I haven't seen any proposals for better stats.

			regards, tom lane
On Fri, 24 Jan 2003, Tom Lane wrote:

> Ron Mayer <ron@intervideo.com> writes:
> > In particular, if I have a large table t with columns 'a','b','c', etc.,
> > and I cluster the table as follows:
> >   create table t_ordered as select * from t order by a,b;
> >   vacuum analyze t_ordered;
> > Column "b" will (correctly) get a very low "correlation" in
> > the pg_stats table -- but I think the optimizer would do better
> > assuming a high correlation because similar 'b' values are still
> > grouped closely on the same disk pages.
>
> How would that be? They'll be separated by the stride of 'a'.

I think it's a clumping effect.

For example, I made a table (ordered) with 20 values of a, 50 values of b
(each showing up in each a) and 100 values of c (not used; it just means
100 rows for each (a,b) combination). It's got 541 pages, it looks like.
Analyze sets the correlation to about 0.08 on the table, and so a query
like
  select * from test1 where b=1;
prefers a sequence scan (1791 vs 2231), while the index scan actually
performs about 5 times better.

I guess the reason is that in general the index scan *really* is reading
something on the order of 40 pages, rather than the much larger estimate
(I'd guess something on the order of say 300-400? I'm not sure how to
find that except by trying to reverse engineer the estimate number),
because pretty much each value of a will probably have only 1 or 2 pages
with b=1.

I'm not really sure how to measure that, however.
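For anyone who wants to reproduce the clumping effect, here is a rough
sketch of such a test table. The names (test1, i_test1_b) are made up,
and it uses generate_series, so it assumes a newer PostgreSQL than the
7.3-era servers discussed in this thread:

    -- Ordered table: 20 values of a, 50 of b, 100 rows per (a,b) pair.
    CREATE TABLE test1 AS
        SELECT a, b, c
        FROM generate_series(1, 20)  AS a,
             generate_series(1, 50)  AS b,
             generate_series(1, 100) AS c
        ORDER BY a, b;

    CREATE INDEX i_test1_b ON test1 (b);
    ANALYZE test1;

    -- The table-wide correlation of b comes out near zero ...
    SELECT attname, correlation FROM pg_stats WHERE tablename = 'test1';

    -- ... yet the rows with b=1 sit on a small number of pages.
    EXPLAIN ANALYZE SELECT * FROM test1 WHERE b = 1;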
On Fri, 24 Jan 2003, Tom Lane wrote:

> Ron Mayer <ron@intervideo.com> writes:
> > On large tables, I think the "correlation" pg_stats field as calculated
> > by "vacuum analyze" or "analyze" can mislead the optimizer.
>
> If you look in the pghackers archives, you will find some discussion
> about changing the equation that cost_index() uses to estimate the
> impact of correlation on indexscan cost. The existing equation is
> ad-hoc and surely wrong, but so far no one's proposed a replacement
> that can be justified any better. If you've got such a replacement
> then we're all ears...

I've got a very slow one (a full-table-scan perl script) that helps my
database... I don't know if it's a good general-purpose solution. That's
why I'm asking if the concept is good here. :-)

> > In particular, if I have a large table t with columns 'a','b','c', etc.,
> > and I cluster the table as follows:
> >   create table t_ordered as select * from t order by a,b;
> >   vacuum analyze t_ordered;
> > Column "b" will (correctly) get a very low "correlation" in
> > the pg_stats table -- but I think the optimizer would do better
> > assuming a high correlation because similar 'b' values are still
> > grouped closely on the same disk pages.
>
> How would that be? They'll be separated by the stride of 'a'.

In the case of date/time (for the queries I showed), the issue was that
the 'a' values were not at all unique, so I had data like this:

 dat        | time     | value
------------|----------|--------------------------------
 2002-01-01 | 00:00:00 | whatever
 2002-01-01 | 00:00:00 |
 2002-01-01 | 00:00:00 |
 2002-01-01 | 00:00:01 |
 2002-01-01 | 00:00:01 |    [many pages of 12am]
 2002-01-01 | 00:00:01 |
 2002-01-01 | 00:00:01 |
   ... thousands more rows ...
 2002-01-01 | 00:00:59 |
 2002-01-01 | 00:01:00 |    [many pages of 1am]
   ... tens of thousands of rows ...
 2002-01-01 | 23:59:59 |
 2002-01-01 | 23:59:59 |
 2002-01-01 | 23:59:59 |    [many pages of 11pm]
 2002-01-02 | 00:00:00 |    [many *MORE* pages of 12am]
 2002-01-02 | 00:00:00 |
 2002-01-02 | 00:00:00 |
   ... tens of thousands of rows ...
 2002-01-02 | 23:59:59 |    [many pages of 11pm]
 2002-01-03 | 00:00:00 |    [many *MORE* pages of 12am]
   ... millions more rows ...

A similar problem actually shows up again in the dimension tables of my
database, where I bulk load many pages at a time (which can easily be
ordered to give a good correlation for a single load) ... but then the
next week's data gets appended to the end.

 id     | value
--------|----------------------------------
 1      | aalok mehta           [many pages of all 'a's]
 2      | aamir khan
 3      | aaron beall
        | [...]
 6234   | axel rose
 6234   | austin wolf
 6123   | barbara boxer         [many pages of all 'b's]
        | [...]
 123456 | young
 123457 | zebra
        | [...data loaded later...]
 123458 | aaron whatever        [more pages of all 'a's]
 123458 | aaron something else
 123458 | aaron something else
        | [...]
 512344 | zelany

In this case I get many clustered blocks of "a" values, but those
clustered blocks occur at many different places across the table.

> It seems likely to me that a one-dimensional correlation statistic may
> be inadequate, but I haven't seen any proposals for better stats.

The idea is that it walks the whole table and looks for more local
correlations, and replaces the correlation value with a "good" value if
values "close" to each other on the disk are similar. This way a single
"correlation" value still works ... so I didn't have to change the
optimizer logic, just the "analyze" logic.
Basically, if the data within each block is highly correlated, the
table-wide ordering doesn't matter as much (yeah, I know about the issue
of sequential reads vs. random reads).

Ron
On Fri, 24 Jan 2003, Stephan Szabo wrote:

> I think it's a clumping effect.

Yup, I think that's exactly the effect.

A proposal ... (yes, I'm volunteering if people point me in the right
direction) ... would be to have a "pluggable" set of analyze functions,
so that a huge database that runs analyze infrequently could choose a
very slow analyze that might work better for its data. I see no reason
different analyze functions would need to be compiled into the source
code ... they could probably exist as PL/pgSQL functions. The one thing
compiling them in would help with is knowing the exact number of tuples
on each individual page, but I guess reltuples/relpages from pg_class is
a good estimate.

> For example, I made a table (ordered) with 20 values of a, 50 values of b
> (each showing up in each a) and 100 values of c (not used; it just means
> 100 rows for each (a,b) combination). It's got 541 pages, it looks like.
> Analyze sets the correlation to about 0.08 on the table, and so a query
> like
>   select * from test1 where b=1;
> prefers a sequence scan (1791 vs 2231), while the index scan actually
> performs about 5 times better.

That sounds like the same situation I was in. If my logic is right, this
means you had about 184 tuples/page (20*50*100/541), so it looks to me
like for each "a" you get half a page where "b=1".

If 'c' had 200 values, I think you'd get an even bigger speedup, because
at 100 values half of each such page is still "wasted" on b=2 values.
If 'c' had 10000 values, I think you'd get a slightly bigger speedup
still, because you'd have so many b=1 pages next to each other that you'd
benefit from more sequential disk access.

> I guess the reason is that in general the index scan *really* is reading
> something on the order of 40 pages, rather than the much larger estimate
> (I'd guess something on the order of say 300-400? I'm not sure how to
> find that except by trying to reverse engineer the estimate number),

Or by adding a printf() ... I think it'd be in cost_index() in
costsize.c.

> because pretty much each value of a will probably have only 1 or 2 pages
> with b=1.
>
> I'm not really sure how to measure that, however.

As I said ... I'm happy to volunteer and experiment if people point me in
a good direction.

Ron
Ron Mayer <ron@intervideo.com> writes:
> A proposal ... (yes, I'm volunteering if people point me in the right
> direction) ... would be to have a "pluggable" set of analyze functions,
> so that a huge database that runs analyze infrequently could choose a
> very slow analyze that might work better for its data.

I do not think ANALYZE is the problem here; at least, it's premature to
worry about that end of things until you've defined (a) what's to be
stored in pg_statistic, and (b) what computation the planner needs to
make to derive a cost estimate given the stats.

			regards, tom lane
Hi!

Another fun question in our ongoing analysis of whether to switch from
mysql to postgres. (Just as an update, Postgres has performed flawlessly
on all of our stress tests so far.)

We have a situation where we will be creating two fairly large and
complex databases with many tables (thousands) each. From what I
understand, postgres keeps everything in one big data directory.

Would there be an advantage to putting each of the two databases into a
separate directory and starting two instances of postgres? Or is it
better to just lump everything together?

In a perfect world, we would buy another database server and RAID for the
second database, but being a small company, we just don't have the budget
right now. The RAID on our current server is much bigger than we need.

Thanks,

-N
On Fri, 24 Jan 2003, Tom Lane wrote:

> Ron Mayer <ron@intervideo.com> writes:
> > A proposal ... (yes, I'm volunteering if people point me in the right
> > direction) ...
>
> I do not think ANALYZE is the problem here; at least, it's premature to
> worry about that end of things until you've defined (a) what's to be
> stored in pg_statistic, and (b) what computation the planner needs to
> make to derive a cost estimate given the stats.

Cool, thanks for a good starting point. If I wanted to brainstorm
further, should I do so here, or should I encourage interested people to
take it offline with me (ron@intervideo.com) so I can post a summary of
the conversation?

Ron

For those who do want to brainstorm with me, my starting point is this:

With my particular table, I think the main issue is still that I have a
lot of data that looks like:

values:    aaaaaaaaaaabbbbbbbbccccccccddddddddddaaaabbbbbbbccccccccddddd...
disk page: |page 1|page 2|page 3|page 4|page 5|page 6|page 7|page 8|page 9|

The problem I'm trying to address is that the current planner guesses
that most of the pages will need to be read; however, the local
clustering means that in fact only a small subset need to be accessed.

My first guess is that modifying the definition of "correlation" to
account for page sizes would be a good approach. I.e., instead of the
correlation across the whole table, perform an autocorrelation over the
rows (http://astronomy.swin.edu.au/~pbourke/analysis/correlate/) and keep
only the values with a "delay" of less than one page size.

If you want to share thoughts offline (ron@intervideo.com), I'll gladly
post a summary of the responses here to save the bandwidth of the group.
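One quick way to see the local clustering Ron describes is to look at the
range of 'tim' values stored on each heap page. This is only an
illustrative sketch: it assumes a reasonably modern PostgreSQL where ctid
casts to text, and it scans the whole table:

    -- For each heap page, show the span of 'tim' values it holds.  On a
    -- table loaded in (dat, tim) order, each page covers only a few
    -- seconds of 'tim' even though the table-wide correlation of 'tim'
    -- is near zero.
    SELECT ltrim(split_part(ctid::text, ',', 1), '(')::int AS block,
           min(tim) AS tim_min,
           max(tim) AS tim_max
    FROM fact
    GROUP BY 1
    ORDER BY 1
    LIMIT 20;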
Noah,

> Would there be an advantage to putting each of the two databases into a
> separate directory and starting two instances of postgres? Or is it
> better to just lump everything together?

You can use the WITH LOCATION option in CREATE DATABASE to put the two
databases into separate directories *without* running two instances of
postgres. For that matter, the databases each already have their own
directories, named by OID number.

Of course, this only helps you if the separate directories are on
separate disks/arrays/channels. If everything is on the same disk or
array, don't bother.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
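A sketch of the WITH LOCATION route on a 7.3-era server. The alternative
path here is hypothetical; it has to be prepared first with the
initlocation utility, and the server must be configured to accept that
location:

    -- Prepared beforehand (as the postgres user, outside psql):
    --   initlocation /mnt/raid2/pgdata
    CREATE DATABASE warehouse2 WITH LOCATION = '/mnt/raid2/pgdata';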
I think my server crashed and then restarted itself. Does anybody know
what all this means:

2003-01-24 18:28:06 PANIC:  link from /RAID/pgsql/pg_xlog/00000009000000BC to /RAID/pgsql/pg_xlog/00000009000000C4 (initialization of log file 9, segment 196) failed: File exists
2003-01-24 18:28:06 LOG:  server process (pid 1574) was terminated by signal 6
2003-01-24 18:28:06 LOG:  terminating any other active server processes
2003-01-24 18:28:06 WARNING:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
2003-01-24 18:28:06 LOG:  all server processes terminated; reinitializing shared memory and semaphores
2003-01-24 18:28:06 LOG:  database system was interrupted at 2003-01-24 18:28:06 EST
2003-01-24 18:28:06 LOG:  checkpoint record is at 9/C4574974
2003-01-24 18:28:06 LOG:  redo record is at 9/C200D144; undo record is at 0/0; shutdown FALSE
2003-01-24 18:28:06 LOG:  next transaction id: 5159292; next oid: 50856954
2003-01-24 18:28:06 LOG:  database system was not properly shut down; automatic recovery in progress
2003-01-24 18:28:06 LOG:  redo starts at 9/C200D144
2003-01-24 18:28:13 LOG:  ReadRecord: record with zero length at 9/C4578CC0
2003-01-24 18:28:13 LOG:  redo done at 9/C4578C9C
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000C0
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000C1
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000BC
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000BD
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000BE
2003-01-24 18:29:02 LOG:  recycled transaction log file 00000009000000BF
2003-01-24 18:29:02 LOG:  database system is ready
Noah,

> I think my server crashed and then restarted itself. Does anybody know
> what all this means:
>
> 2003-01-24 18:28:06 PANIC:  link from
> /RAID/pgsql/pg_xlog/00000009000000BC to
> /RAID/pgsql/pg_xlog/00000009000000C4 (initialization of log file 9,
> segment 196) failed: File exists
> 2003-01-24 18:28:06 LOG:  server process (pid 1574) was terminated by
> signal 6
> 2003-01-24 18:28:06 LOG:  terminating any other active server processes
> 2003-01-24 18:28:06 WARNING:  Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am
>         going to terminate your database system connection and exit.
>         Please reconnect to the database system and repeat your query.

This means that somebody KILL -9'd a postgres process or the postmaster,
and Postgres restarted in order to clear the shared buffers. If the
database started up again, you are fine.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Yes, but I'm the only one logged into this box, and I didn't kill
anything. It appears to have died all by itself.

Thanks,

-N

On Friday, January 24, 2003, at 07:03 PM, Josh Berkus wrote:

> Noah,
>
>> I think my server crashed and then restarted itself. Does anybody know
>> what all this means:
>>
>> 2003-01-24 18:28:06 PANIC:  link from
>> /RAID/pgsql/pg_xlog/00000009000000BC to
>> /RAID/pgsql/pg_xlog/00000009000000C4 (initialization of log file 9,
>> segment 196) failed: File exists
>> 2003-01-24 18:28:06 LOG:  server process (pid 1574) was terminated by
>> signal 6
>> 2003-01-24 18:28:06 LOG:  terminating any other active server processes
>> 2003-01-24 18:28:06 WARNING:  Message from PostgreSQL backend:
>>         The Postmaster has informed me that some other backend
>>         died abnormally and possibly corrupted shared memory.
>>         I have rolled back the current transaction and am
>>         going to terminate your database system connection and exit.
>>         Please reconnect to the database system and repeat your query.
>
> This means that somebody KILL -9'd a postgres process or the postmaster,
> and Postgres restarted in order to clear the shared buffers. If the
> database started up again, you are fine.
>
> --
> -Josh Berkus
>  Aglio Database Solutions
>  San Francisco
Noah,

> but I'm the only one logged into this box, and I didn't kill anything.
> It appears to have died all by itself.

I'd check your disk array, then. It doesn't happen to be a Mylex, does
it?

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
We are using a 3ware Escalade on this box.

One clue: I actually moved the pg_xlog directory to another drive and
then symbolically linked it back to the data directory.

Another idea is that Linux killed one of the processes because postgres
was using up too much memory. I believe that part of the kernel is called
the "OOM killer". We're not sure if this happened; it's just a guess.

Thanks,

-N

On Friday, January 24, 2003, at 07:15 PM, Josh Berkus wrote:

> Noah,
>
>> but I'm the only one logged into this box, and I didn't kill anything.
>> It appears to have died all by itself.
>
> I'd check your disk array, then. It doesn't happen to be a Mylex,
> does it?
>
> --
> -Josh Berkus
>  Aglio Database Solutions
>  San Francisco
On Fri, 24 Jan 2003, Noah Silverman wrote:

> Yes, but I'm the only one logged into this box, and I didn't kill
> anything. It appears to have died all by itself.

It certainly sounds that way. Can you recreate the circumstances and
make it happen reliably? If not, it's likely just an isolated occurrence
and nothing to get too worried about. Your data is still coherent;
that's why all the backends were forced to reset, to cleanse the buffers
of possible corruption.
On Fri, 24 Jan 2003, Josh Berkus wrote:

> Noah,
>
> > I think my server crashed and then restarted itself. Does anybody know
> > what all this means:
> >
> > 2003-01-24 18:28:06 PANIC:  link from
> > /RAID/pgsql/pg_xlog/00000009000000BC to
> > /RAID/pgsql/pg_xlog/00000009000000C4 (initialization of log file 9,
> > segment 196) failed: File exists
> > 2003-01-24 18:28:06 LOG:  server process (pid 1574) was terminated by
> > signal 6
> > 2003-01-24 18:28:06 LOG:  terminating any other active server processes
> > 2003-01-24 18:28:06 WARNING:  Message from PostgreSQL backend:
> >         The Postmaster has informed me that some other backend
> >         died abnormally and possibly corrupted shared memory.
> >         I have rolled back the current transaction and am
> >         going to terminate your database system connection and exit.
> >         Please reconnect to the database system and repeat your query.
>
> This means that somebody KILL -9'd a postgres process or the postmaster,
> and Postgres restarted in order to clear the shared buffers. If the
> database started up again, you are fine.

Actually, it looks like an abort() (signal 6) to me, probably from the
PANIC listed. The question is: why did it get confused and end up
linking to a filename that already existed?
Noah Silverman <noah@allresearch.com> writes:
> We have a situation where we will be creating two fairly large and
> complex databases with many tables (thousands) each. From what I
> understand, postgres keeps everything in one big data directory.

Yeah. You're kind of at the mercy of the operating system when you do
that: if it copes well with big directories, no problem, but if lookups
in big directories are slow then you'll take a performance hit.

The first thing I'd ask is *why* you think you need thousands of tables.
How will you keep track of them? Are there really thousands of different
table schemas? Maybe you can combine tables by introducing an extra key
column.

Perhaps a little bit of rethinking will yield a small design screaming
to get out of this big one ...

			regards, tom lane
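A sketch of the extra-key-column idea Tom suggests; the table and column
names here are invented for illustration, not part of Noah's schema:

    -- Instead of thousands of per-customer tables (data_cust1,
    -- data_cust2, ...), keep one table and add the customer as a key.
    CREATE TABLE measurements (
        customer_id integer NOT NULL,
        recorded_at timestamp NOT NULL,
        value       numeric
    );

    CREATE INDEX i_measurements_cust
        ON measurements (customer_id, recorded_at);

    -- A query that used to go to its own table now filters on the key.
    SELECT recorded_at, value
      FROM measurements
     WHERE customer_id = 42
       AND recorded_at >= '2003-01-01';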
Thanks, we're considering this.

On an unrelated note, it looks like our crash was due to running out of
file descriptors for the bash shell. Linux won't let me increase the
limit for a user other than root. Does anyone know how to change this?
(We're running Slackware.)

Thanks,

-N

On Friday, January 24, 2003, at 07:50 PM, Tom Lane wrote:

> Noah Silverman <noah@allresearch.com> writes:
>> We have a situation where we will be creating two fairly large and
>> complex databases with many tables (thousands) each. From what I
>> understand, postgres keeps everything in one big data directory.
>
> Yeah. You're kind of at the mercy of the operating system when you do
> that: if it copes well with big directories, no problem, but if lookups
> in big directories are slow then you'll take a performance hit.
>
> The first thing I'd ask is *why* you think you need thousands of
> tables. How will you keep track of them? Are there really thousands
> of different table schemas? Maybe you can combine tables by introducing
> an extra key column.
>
> Perhaps a little bit of rethinking will yield a small design screaming
> to get out of this big one ...
>
> 			regards, tom lane
Noah Silverman <noah@allresearch.com> writes:
> One clue.
> I actually moved the pg_xlog directory to another drive and then
> symbolically linked it back to the data directory.

Uh, did you have the postmaster shut down while you did that?

This looks like a collision between two processes both trying to create
the next segment of the xlog at about the same time. But there are
interlocks that are supposed to prevent that.

I don't think you need to worry about the integrity of your data; the
panic reset should put everything right. But I'd sure be interested if
you can reproduce this problem.

			regards, tom lane
On Fri, 2003-01-24 at 19:17, Noah Silverman wrote:
> We are using a 3ware Escalade on this box.
>
> One clue.
>
> I actually moved the pg_xlog directory to another drive and then
> symbolically linked it back to the data directory.

You shut it down first, right?

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> The question is why did it get confused and end up linking to a filename
> that already existed?

The message comes from InstallXLogFileSegment(), which is careful to
ensure that the link() cannot fail, either by unlinking the previous file
or by searching for an unused name. But it failed anyway. It seems to me
that there are only two possible explanations: a race condition (but
holding ControlFileLock should prevent that) or BasicOpenFile() failing
for a reason other than nonexistence of the file.

Hmm ... I wonder if Noah's machine could have been running out of kernel
file table slots, or something like that? It does seem that it'd be more
robust to use something like stat(2) to probe for an existing file.

			regards, tom lane
OF COURSE!

It actually looks like the problem was with file descriptors. Our shell
limit was only set to 1024, and we also have mysql running and using up a
bunch of those. We just upped the limit to 8000 to see if that would give
postgres more room to breathe.

-N

On Friday, January 24, 2003, at 08:08 PM, Rod Taylor wrote:

> On Fri, 2003-01-24 at 19:17, Noah Silverman wrote:
>> We are using a 3ware Escalade on this box.
>>
>> One clue.
>>
>> I actually moved the pg_xlog directory to another drive and then
>> symbolically linked it back to the data directory.
>
> You shut it down first, right?
>
> --
> Rod Taylor <rbt@rbt.ca>
>
> PGP Key: http://www.rbt.ca/rbtpub.asc
On Fri, 2003-01-24 at 20:14, Noah Silverman wrote:
> OF COURSE!

Sorry, but I've seen people try to do that kind of thing before.

> On Friday, January 24, 2003, at 08:08 PM, Rod Taylor wrote:
>
> > On Fri, 2003-01-24 at 19:17, Noah Silverman wrote:
> >> We are using a 3ware Escalade on this box.
> >>
> >> One clue.
> >>
> >> I actually moved the pg_xlog directory to another drive and then
> >> symbolically linked it back to the data directory.
> >
> > You shut it down first, right?

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc
Noah Silverman <noah@allresearch.com> writes:
> It actually looks like the problem was with file descriptors. Our
> shell limit was only set to 1024, and we also have mysql running and
> using up a bunch of those. We just upped the limit to 8000 to see if
> that would give postgres more room to breathe.

Ah-hah. You might also want to set max_files_per_process (in
postgresql.conf) to something small enough to ensure Postgres can't run
you out of descriptors. Linux has a bad habit of promising more than it
can deliver when Postgres asks how many FDs are okay to use. The
max_files_per_process setting is useful to prevent Postgres from
believing whatever fairy-tale sysconf(3) tells it.

			regards, tom lane
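A minimal postgresql.conf sketch of that suggestion; the value of 500 is
an arbitrary example, not a figure from the thread, and the right cap
depends on the kernel limit and the expected number of backends:

    # postgresql.conf
    # Cap how many files each backend will hold open, so Postgres alone
    # cannot exhaust the per-user file-descriptor limit.
    max_files_per_process = 500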
I wrote:
> Hmm ... I wonder if Noah's machine could have been running out of kernel
> file table slots, or something like that? It does seem that it'd be
> more robust to use something like stat(2) to probe for an existing file.

I've applied a patch to do it that way in CVS HEAD. After examining the
code further, I'm inclined not to risk back-patching it into 7.3, though.
xlog.c is full of open() calls that will elog(PANIC) if they fail, so I
think there was only a very small window of opportunity for Noah to see
this failure and not another one. The patch thus probably contributes
little real gain in reliability.

			regards, tom lane
On Fri, Jan 24, 2003 at 07:08:29PM -0500, Noah Silverman wrote:
> Yes, but I'm the only one logged into this box, and I didn't kill
> anything. It appears to have died all by itself.

Is this on Linux, and were you short on memory? Linux, in a completely
brain-dead design, runs around 'kill -9'-ing random processes when it
starts to think the machine is going to exhaust its memory (or at least
it used to; I don't know if it still does).

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110