Thread: Bug #624: case sensitivity in pg_dumpall
Volker Klemann (volkerk@gfz-potsdam.de) reports a bug with a severity of 2.
(The lower the number, the more severe it is.)

Short Description
case sensitivity in pg_dumpall

Long Description
Dear people from postgres,

Don't know if I am the first to hit this bug: while updating from 7.1.2 to
7.2 I used pg_dumpall to dump my data, as recommended by the INSTALL manual.
When re-installing the database using

    psql -d template1 -f dumpfile

I got the error

    psql:/home/volkerk/postgres/backups/backup22.03.02:46: \connect: FATAL 1:
    Database "rsl" does not exist in the system catalog.

The corresponding lines in the script produced by pg_dumpall are:

    45 CREATE DATABASE "RSL" WITH TEMPLATE = template0;
    46 \connect RSL volkerk

So, while line 45 preserves upper and lower case by quoting the name, the
program missed the quotes in line 46 and tried to connect to rsl instead of
RSL.  Took me some time to find it, because at first I was searching for
inconsistencies in the new installation.

Best wishes,
Volker Klemann

Sample Code
No file was uploaded with this report
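For reference, unquoted identifiers are folded to lower case, which is why the
generated \connect line fails while the CREATE DATABASE line succeeds.  A
minimal sketch of the mismatch (the database name is taken from the report;
the quoted \connect form is the assumed fix):

    CREATE DATABASE "RSL" WITH TEMPLATE = template0;  -- quoted: case preserved
    \connect RSL volkerk      -- unquoted: folded to rsl, so the connect fails
    \connect "RSL" volkerk    -- quoting here should preserve the upper case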
pgsql-bugs@postgresql.org writes:
> case sensitivity in pg_dumpall

This is fixed in 7.2.1.

			regards, tom lane
While porting a pretty large (tens of millions of records) data warehouse
from Oracle to PostgreSQL, I found that once some of my tables got pretty
large, PostgreSQL suddenly stopped using indexes for expressions like
"col = value", decreasing performance by 20X.  This meant that my daily
reports started taking two days instead of 2 hours to run!  Interestingly,
when I rewrite the queries using >= and <= to produce identical results,
the index works fine.

Example queries in question include:

    select count(*) from fact where dat='2002-03-01';
    select count(*) from fact where dat<='2002-03-01' and dat>='2002-03-01';

The values in "dat" are roughly evenly distributed from '2002-01-01'
through '2002-03-25'.

Attached below are:
  A: Information about the table, including "\d" and "vacuum verbose analyze"
  B: Output of "explain analyze" for the above queries (showing the 20X slowdown)
  C: Version and configuration information.

Any suggestions on what I should look at next would be appreciated.

Thanks much,
  Ron

PS: As a quite perverse workaround, I rewrote all my queries to have
"col<=val and col>=val" everywhere I used to have "col=val" and everything
is running fine again... but that's just weird.

============================================================================
== A: Information about the table
============================================================================

logs2=# \d fact
              Table "fact"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 dat    | date                   |
 tim    | time without time zone |
 ip_id  | integer                |
 bid_id | integer                |
 req_id | integer                |
 ref_id | integer                |
Indexes: i_fact__bid_id,
         i_fact__dat,
         i_fact__ref_id,
         i_fact__req_id,
         i_fact__tim

logs2=# select count(*) from fact;
  count
----------
 18410778
(1 row)

logs2=# vacuum verbose analyze fact;
NOTICE:  --Relation fact--
NOTICE:  Pages 144967: Changed 0, Empty 0; Tup 18410778: Vac 0, Keep 0, UnUsed 0.
        Total CPU 11.56s/2.97u sec elapsed 71.91 sec.
NOTICE:  Analyzing fact
VACUUM

============================================================================
== B: Explain Analyze for the two queries.
== Note that the <=, >= one was over 20X faster.
============================================================================

logs2=# explain analyze select count(*) from fact where dat='2002-03-01';
NOTICE:  QUERY PLAN:

Aggregate  (cost=375631.14..375631.14 rows=1 width=0) (actual time=76689.42..76689.42 rows=1 loops=1)
  ->  Seq Scan on fact  (cost=0.00..375101.72 rows=211765 width=0) (actual time=20330.96..76391.94 rows=180295 loops=1)
Total runtime: 76707.92 msec

EXPLAIN
logs2=# explain analyze select count(*) from fact where dat<='2002-03-01' and dat>='2002-03-01';
NOTICE:  QUERY PLAN:

Aggregate  (cost=5.98..5.98 rows=1 width=0) (actual time=2921.39..2921.39 rows=1 loops=1)
  ->  Index Scan using i_fact__dat on fact  (cost=0.00..5.98 rows=1 width=0) (actual time=73.55..2583.53 rows=180295 loops=1)
Total runtime: 2921.55 msec

EXPLAIN
logs2=#

============================================================================
== C: Version and configuration information.
============================================================================

[17]localhost:~/apps/pgsql% psql -V
psql (PostgreSQL) 7.2
contains support for: readline, history
Portions Copyright (c) 1996-2001, PostgreSQL Global Development Group
Portions Copyright (c) 1996, Regents of the University of California
Read the file COPYRIGHT or use the command \copyright to see the usage and
distribution terms.
[17]localhost:/scratch/pgsql/data% diff -wiu postgresql.conf postgresql.conf.bak
--- postgresql.conf     Sat Mar 23 15:39:34 2002
+++ postgresql.conf.bak Tue Mar  5 19:33:54 2002
@@ -50,7 +50,7 @@
 #shared_buffers = 10000     # 2*max_connections, min 16
 ## goes to about 84 meg with 4000.
 #shared_buffers = 4000      # 2*max_connections, min 16
-shared_buffers = 10000      # 2*max_connections, min 16
+shared_buffers = 8000       # 2*max_connections, min 16
 #max_fsm_relations = 100    # min 10, fsm is free space map
 #max_fsm_pages = 10000      # min 1000, fsm is free space map

--
Ronald Mayer
Director of Web Business
InterVideo, Inc.
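For scale, assuming the default 8 kB block size, the settings in that diff
work out to roughly 10000 buffers * 8 kB = ~80 MB of buffer cache for the
current value and 8000 * 8 kB = ~64 MB for the earlier one; total shared
memory use will be somewhat higher than the buffer cache alone.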
Ron Mayer <ron@intervideo.com> writes:
> Once some of my tables started getting pretty large, PostgreSQL
> suddenly stopped using indexes when I use expressions like "col = value"
> decreasing performance by 20X.

Hmm.  The EXPLAIN shows that the planner is not doing too badly at
estimating the number of rows involved:

> logs2=# explain analyze select count(*) from fact where dat='2002-03-01';
> NOTICE:  QUERY PLAN:

> Aggregate  (cost=375631.14..375631.14 rows=1 width=0) (actual
> time=76689.42..76689.42 rows=1 loops=1)
>   ->  Seq Scan on fact  (cost=0.00..375101.72 rows=211765 width=0) (actual
> time=20330.96..76391.94 rows=180295 loops=1)
> Total runtime: 76707.92 msec

212K estimate for 180K real is not bad at all.  So the problem is in the
cost models not the initial row count estimation.

If you force an indexscan via "set enable_seqscan to off", what does
EXPLAIN ANALYZE report?

Also, what do you get from
	select * from pg_stats where tablename = 'fact';
I'm particularly interested in the correlation estimate for the dat
column.  (Would you happen to have an idea whether the data has been
inserted more-or-less in dat order?)

			regards, tom lane
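Pulled together, the diagnostic session being asked for looks roughly like
this; narrowing pg_stats to a few columns and to attname = 'dat' is just to
keep the output small, not part of the original request:

    -- Compare the forced index scan against the default plan:
    SET enable_seqscan TO off;
    EXPLAIN ANALYZE SELECT count(*) FROM fact WHERE dat = '2002-03-01';
    RESET enable_seqscan;

    -- Look at the planner's statistics for the dat column:
    SELECT attname, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'fact' AND attname = 'dat';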
On Tue, 26 Mar 2002, Tom Lane wrote:
>
> Ron Mayer <ron@intervideo.com> writes:
> > [...] pretty large, PostgreSQL suddenly stopped using indexes [...]
> [...]
>
> 212K estimate for 180K real is not bad at all.  So the problem is in the
> cost models not the initial row count estimation.
>
> If you force an indexscan via "set enable_seqscan to off", what does
> EXPLAIN ANALYZE report?

It then uses the index:

===================================================================
== logs2=# set enable_seqscan to off;
== SET VARIABLE
== logs2=# explain analyze select count(*) from fact where dat='2002-03-01';
==
== NOTICE:  QUERY PLAN:
==
== Aggregate  (cost=840488.03..840488.03 rows=1 width=0) (actual
== time=2753.82..2753.82 rows=1 loops=1)
==   ->  Index Scan using i_fact__dat on fact  (cost=0.00..839957.59 rows=212174
== width=0) (actual time=101.25..2434.00 rows=180295 loops=1)
== Total runtime: 2754.24 msec
===================================================================

> Also, what do you get from
> 	select * from pg_stats where tablename = 'fact';
> I'm particularly interested in the correlation estimate for the dat
> column.  (Would you happen to have an idea whether the data has been
> inserted more-or-less in dat order?)

I've attached that output as an attachment.

I believe much of February was loaded first, then we back-filled January,
and daily I've been adding March's results.  I don't believe the index usage
stopped when we did the January fill... something happened a few days ago
after a pretty routine daily load.

Oh... one more interesting thing...  There are a couple of big exceptions to
the even distribution of data.  Almost every day has between 190000 and
270000 records, except '2002-03-08', which has 404293 records, and
'2002-03-25', which has 6 records.  For that particular day, the "<= ... >="
trick doesn't work either.

===================================================================
== logs2=# explain select count(*) from fact where dat<='2002-03-08' and
== dat>='2002-03-08';
== NOTICE:  QUERY PLAN:
==
== Aggregate  (cost=422125.92..422125.92 rows=1 width=0)
==   ->  Seq Scan on fact  (cost=0.00..421128.67 rows=398900 width=0)
==
== EXPLAIN
== logs2=#
== logs2=# explain select count(*) from fact where dat<='2002-03-07' and
== dat>='2002-03-07';
== NOTICE:  QUERY PLAN:
==
== Aggregate  (cost=6.00..6.00 rows=1 width=0)
==   ->  Index Scan using i_fact__dat on fact  (cost=0.00..5.99 rows=1 width=0)
==
== EXPLAIN
===================================================================

I also believe that may have been the day when the index stopped working
for "=" for all dates.

  Ron
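One guess about why '2002-03-08' behaves differently is that such a heavy day
may appear in the column's most-common-values list, which feeds both the
equality and the range estimates.  A quick way to check (the attname filter is
only there to keep the output short):

    SELECT most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'fact' AND attname = 'dat';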
I had an issue where my index was not always used on a very large table.
The issue came down to the data distribution and not pulling in enough of a
random sample to get an accurate estimate.  (I think the default max was
around 3000 sample rows -- 300 * 10, the default statistics target; see
analyze.c.)

I fixed the issue by following Tom's advice and increasing the statistics
target on my table to pull in 300000 rows (300 * 1000).  I had to play with
the value, re-analyze, and check the stats in the pg_stats table until the
most_common_freqs on some values were all fairly close.  The explain plan
still shows me a cost and row value way above what is physically in the
table, but at least my indexes were being used.

	alter table table_name alter symbol_name set statistics 1000;

--Michael

Tom Lane wrote:

> Ron Mayer <ron@intervideo.com> writes:
>
>> Once some of my tables started getting pretty large, PostgreSQL
>> suddenly stopped using indexes when I use expressions like "col = value"
>> decreasing performance by 20X.
>
> Hmm.  The EXPLAIN shows that the planner is not doing too badly at
> estimating the number of rows involved:
>
>> logs2=# explain analyze select count(*) from fact where dat='2002-03-01';
>> NOTICE:  QUERY PLAN:
>
>> Aggregate  (cost=375631.14..375631.14 rows=1 width=0) (actual
>> time=76689.42..76689.42 rows=1 loops=1)
>>   ->  Seq Scan on fact  (cost=0.00..375101.72 rows=211765 width=0) (actual
>> time=20330.96..76391.94 rows=180295 loops=1)
>> Total runtime: 76707.92 msec
>
> 212K estimate for 180K real is not bad at all.  So the problem is in the
> cost models not the initial row count estimation.
>
> If you force an indexscan via "set enable_seqscan to off", what does
> EXPLAIN ANALYZE report?
>
> Also, what do you get from
> 	select * from pg_stats where tablename = 'fact';
> I'm particularly interested in the correlation estimate for the dat
> column.  (Would you happen to have an idea whether the data has been
> inserted more-or-less in dat order?)
>
> 			regards, tom lane
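Applied to the fact table from this thread, that workflow would look roughly
like the following; the target of 1000 is just the example value from
Michael's message, and the right number depends on the data:

    -- Raise the statistics target for the dat column, then re-gather stats:
    ALTER TABLE fact ALTER dat SET STATISTICS 1000;
    ANALYZE fact;

    -- See whether the larger sample changed the planner's picture of the column:
    SELECT n_distinct, correlation, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'fact' AND attname = 'dat';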
Ron Mayer <ron@intervideo.com> writes:
>> I'm particularly interested in the correlation estimate for the dat
>> column.  (Would you happen to have an idea whether the data has been
>> inserted more-or-less in dat order?)

> I believe much of February was loaded first, then we back-filled January,
> and daily I've been adding March's results.  I don't believe the index usage
> stopped when we did the January fill... something happened a few days ago
> after a pretty routine daily load.

The correlation estimate for dat is pretty low (0.086088), which I think
reflects the fact that on a macro level your data is not very orderly
(Feb/Jan/Mar).  However, if it's been loaded on a daily basis then the
records for any single day will be together --- which is why the indexed
probe for a single day is so fast.  I don't see any way that we can
expect the system to model this effect with only one ordering-correlation
number :-( ... so a proper fix will have to wait for some future release
when we can think about having more extensive stats about ordering.

In the meantime, it would be interesting to see if re-establishing the
big-picture order correlation would persuade the planner to do the right
thing.  Could you do something like this:

	CREATE TABLE foo AS SELECT * FROM fact ORDER BY dat;
	TRUNCATE TABLE fact;
	INSERT INTO fact SELECT * FROM foo;
	DROP TABLE foo;
	VACUUM ANALYZE fact;

(this should leave you in a state where pg_stats shows correlation 1.0
for fact.dat) and then see what you get from EXPLAIN?

			regards, tom lane
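An alternative way to get the same physical ordering is CLUSTER, though on
releases of this vintage it is worth checking the documentation first (older
CLUSTER implementations rebuilt the table in a way that could lose other
indexes and table attributes), so the explicit copy above may be the safer
route.  A sketch, assuming the i_fact__dat index from earlier in the thread:

    -- Rewrite the table in index order, then refresh the statistics:
    CLUSTER i_fact__dat ON fact;
    VACUUM ANALYZE fact;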
First off, thanks to everyone on the list who suggested useful workarounds --
with those workarounds in place my application is working wonderfully again.

Anyway, here's some more information about the "=" vs. "<= and >=" question
I had earlier today...

On Tue, 26 Mar 2002, Tom Lane wrote:
>
> Ron Mayer <ron@intervideo.com> writes:
> >> I'm particularly interested in the correlation estimate for the dat
> >> column. [...]
> >
> > [...]
>
> The correlation estimate for dat is pretty low (0.086088), which I think
> reflects the fact that on a macro level your data is not very orderly
> [...]
>
> In the meantime, it would be interesting to see if re-establishing the
> big-picture order correlation would persuade the planner to do the right
> thing. [...]
> (this should leave you in a state where pg_stats shows correlation 1.0
> for fact.dat) and then see what you get from EXPLAIN?

Correlation is 1.0, but the optimizer still does not want to use the index.

I tried two different extreme cases: one with the optimal ordering suggested
above, and one with an exceptionally poor ordering (sorted by time of day, so
that every day probably appears in every possible block).  As expected,
pg_stats shows the good ordering has a correlation of "1.0" and the poor
ordering has a correlation of "-0.00133352".

============================================================
= logs2=# CREATE TABLE fact_by_dat AS SELECT * FROM fact ORDER BY dat;
= SELECT
= logs2=# CREATE TABLE fact_by_tim AS SELECT * FROM fact ORDER BY tim;
= SELECT
= logs2=# CREATE INDEX fact_by_dat__dat ON fact_by_dat(dat);
= CREATE
= logs2=# CREATE INDEX fact_by_tim__dat ON fact_by_tim(dat);
= CREATE
= logs2=# vacuum analyze fact_by_dat;
= VACUUM
= logs2=# vacuum analyze fact_by_tim;
= VACUUM
= logs2=# explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
= NOTICE:  QUERY PLAN:
=
= Aggregate  (cost=380347.31..380347.31 rows=1 width=0) (actual time=77785.14..77785.14 rows=1 loops=1)
=   ->  Seq Scan on fact  (cost=0.00..379816.25 rows=212423 width=0) (actual time=20486.16..77420.05 rows=180295 loops=1)
= Total runtime: 77785.28 msec
= EXPLAIN
= logs2=# explain analyze select count(*) from fact_by_tim where dat='2002-03-01';
= NOTICE:  QUERY PLAN:
=
= Aggregate  (cost=380341.09..380341.09 rows=1 width=0) (actual time=79308.22..79308.22 rows=1 loops=1)
=   ->  Seq Scan on fact_by_tim  (cost=0.00..379816.25 rows=209934 width=0) (actual time=24.35..78929.68 rows=180295 loops=1)
= Total runtime: 79308.35 msec
= EXPLAIN
= logs2=# select correlation from pg_stats where tablename='fact_by_dat' and attname='dat';
=  correlation
= -------------
=            1
= (1 row)
=
= logs2=# select correlation from pg_stats where tablename='fact_by_tim' and attname='dat';
=  correlation
= -------------
=  -0.00133352
= (1 row)
============================================================

In neither case did it use the index.  However, as shown below, in the case
where the table was ordered by date the index would have helped a huge
amount, while in the case where it was ordered by time using the index hurt
a huge amount.
============================================================
= logs2=# set enable_seqscan to off;
= SET VARIABLE
= logs2=# explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
= NOTICE:  QUERY PLAN:
=
= Aggregate  (cost=5502.57..5502.57 rows=1 width=0) (actual time=2671.20..2671.20 rows=1 loops=1)
=   ->  Index Scan using fact_by_dat__dat on fact_by_dat  (cost=0.00..4974.99 rows=211036 width=0) (actual time=90.24..2339.64 rows=180295 loops=1)
= Total runtime: 2693.87 msec
=
= EXPLAIN
= logs2=# explain analyze select count(*) from fact_by_tim where dat='2002-03-01';
= NOTICE:  QUERY PLAN:
=
= Aggregate  (cost=837849.27..837849.27 rows=1 width=0) (actual time=410705.02..410705.02 rows=1 loops=1)
=   ->  Index Scan using fact_by_tim__dat on fact_by_tim  (cost=0.00..837324.43 rows=209934 width=0) (actual time=56.14..410271.50 rows=180295 loops=1)
= Total runtime: 410705.17 msec
=
= EXPLAIN
= logs2=#
=====================================================================

So with the ideally ordered table the index would have helped by a factor of
about 30 (2.7 seconds vs. 77 seconds)... but with the bad ordering it hurt by
a factor of about 5 (411 seconds vs. 79 seconds).  Very interesting...

Just for my own education, could you bear with me for a few questions from a
relative novice...

*) Should the optimizer choose a plan that uses the index if the correlation
   is high enough?

*) Instead of the overall correlation across the whole table, would a better
   metric be the average correlation for data within each page?  Then it
   could recognize that while I had a low overall correlation (because I
   loaded Feb, then Jan, then Mar), within each block the correlation was
   good.

*) If the optimizer sees one alternative that may take anywhere from 3 to
   400 seconds, and another that will produce a very consistent 80 seconds,
   is the first better because it could give a 30x speedup vs. only a 5x
   slowdown, or is the second better because it could be only 77 seconds
   faster vs. 320 seconds slower?  Or do I simply not understand optimizers
   at all. :-)

*) Any reason why "a=b" and "a<=b and a>=b" would act differently?  I'm not
   complaining, because it seems like a convenient way to give the optimizer
   the hint that helps in my case :-), but it seems surprising.

Anyway, as I said before, thanks for all the workarounds that got me up and
running again!  I'd be glad to keep looking at these queries if anyone thinks
more experimentation would be helpful.

Thanks,
  Ron
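On the first question, the knob the cost model exposes for this trade-off is
random_page_cost, whose long-standing default is 4; lowering it makes index
scans look cheaper relative to sequential scans.  A sketch of the experiment,
offered purely as an illustration rather than a recommended production
setting:

    -- Try a lower random-vs-sequential page cost ratio and compare plans:
    SET random_page_cost = 2;
    EXPLAIN ANALYZE SELECT count(*) FROM fact_by_dat WHERE dat = '2002-03-01';

    -- Put it back to the default afterwards:
    SET random_page_cost = 4;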
On Tue, 26 Mar 2002, Tom Lane wrote:
> Ron Mayer <ron@intervideo.com> writes:
> >> I'm particularly interested in the correlation estimate for the dat
> >> column.  (Would you happen to have an idea whether the data has been
> >> inserted more-or-less in dat order?)
>
> > I believe much of February was loaded first, then we back-filled January,
> > and daily I've been adding March's results.  I don't believe the index usage
> > stopped when we did the January fill... something happened a few days ago
> > after a pretty routine daily load.
>
> The correlation estimate for dat is pretty low (0.086088), which I think
> reflects the fact that on a macro level your data is not very orderly
> (Feb/Jan/Mar).  However, if it's been loaded on a daily basis then the
> records for any single day will be together --- which is why the indexed
> probe for a single day is so fast.  I don't see any way that we can
> expect the system to model this effect with only one ordering-correlation
> number :-( ... so a proper fix will have to wait for some future release
> when we can think about having more extensive stats about ordering.
>
> In the meantime, it would be interesting to see if re-establishing the
> big-picture order correlation would persuade the planner to do the right
> thing.  Could you do something like this:
>
> 	CREATE TABLE foo AS SELECT * FROM fact ORDER BY dat;
> 	TRUNCATE TABLE fact;
> 	INSERT INTO fact SELECT * FROM foo;
> 	DROP TABLE foo;
> 	VACUUM ANALYZE fact;
>
> (this should leave you in a state where pg_stats shows correlation 1.0
> for fact.dat) and then see what you get from EXPLAIN?
>
> 			regards, tom lane

I did quite a bit more playing with this, and no matter what the correlation
was (1, -0.001), it never seemed to have any effect at all on the execution
plan.

Should it?  With a high correlation the index scan is a much better choice.

  Ron

---
--- create the table with a correlation of "1".
---
logs2=# CREATE TABLE fact_by_dat AS SELECT * FROM fact ORDER BY dat;
SELECT
logs2=# CREATE INDEX fact_by_dat__dat ON fact_by_dat(dat);
CREATE
logs2=# vacuum analyze fact_by_dat;
VACUUM
logs2=# select correlation from pg_stats where tablename='fact_by_dat' and attname='dat';
 correlation
-------------
           1
(1 row)

---
--- Still does the "Seq Scan"
---
logs2=# explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
NOTICE:  QUERY PLAN:

Aggregate  (cost=380347.31..380347.31 rows=1 width=0) (actual time=77785.14..77785.14 rows=1 loops=1)
  ->  Seq Scan on fact  (cost=0.00..379816.25 rows=212423 width=0) (actual time=20486.16..77420.05 rows=180295 loops=1)
Total runtime: 77785.28 msec

EXPLAIN

---
--- Disable Seq Scan... 30 times faster.
---
logs2=# set enable_seqscan to off;
SET VARIABLE
logs2=# explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
NOTICE:  QUERY PLAN:

Aggregate  (cost=5502.57..5502.57 rows=1 width=0) (actual time=2671.20..2671.20 rows=1 loops=1)
  ->  Index Scan using fact_by_dat__dat on fact_by_dat  (cost=0.00..4974.99 rows=211036 width=0) (actual time=90.24..2339.64 rows=180295 loops=1)
Total runtime: 2693.87 msec
Ron Mayer <ron@intervideo.com> writes:
> I did quite a bit more playing with this, and no matter what the
> correlation was (1, -0.001), it never seemed to have any effect
> at all on the execution plan.

> Should it?  With a high correlation the index scan is a much better choice.

I'm confused.  Your examples show the planner correctly estimating the
indexscan as much cheaper than the seqscan.

> logs2=# explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
> NOTICE:  QUERY PLAN:

> Aggregate  (cost=380347.31..380347.31 rows=1 width=0) (actual time=77785.14..77785.14 rows=1 loops=1)
>   ->  Seq Scan on fact  (cost=0.00..379816.25 rows=212423 width=0) (actual time=20486.16..77420.05 rows=180295 loops=1)
> Total runtime: 77785.28 msec

Cut-and-paste mistake here somewhere, perhaps?  The plan refers to fact
not fact_by_dat.

			regards, tom lane
On Wed, 3 Apr 2002, Tom Lane wrote:
>
> I'm confused.  Your examples show the planner correctly estimating the
> indexscan as much cheaper than the seqscan.
> ...
> Cut-and-paste mistake here somewhere, perhaps?  The plan refers to fact
> not fact_by_dat.

My apologies... it was indeed doing the right thing on the table that was
ordered by date.  Sorry for the wasted bandwidth.

  Ron