Thread: query very slow when enable_seqscan=on
I have a very slow query when enable_seqscan=on and very fast when
enable_seqscan=off. My schema looks like this (relevant columns only):

create table organizations (
  organization_id serial primary key,
  organization varchar(200) not null,
  organization_location varchar(55) not null
  -- and several irrelevant columns
);
-- about 9000 records

create table persons (
  person_id serial primary key,
  surname varchar(50) not null,
  forename varchar(35) not null,
  organization_id int references organizations
  -- and several irrelevant columns
);
-- about 6500 records

create index persons_surname_forename_person_id on persons (
  surname, forename, lpad(person_id,10,'0')
);
-- I was hoping this would speed up array comparisons

The query looks for the position of a person with a given person_id in a
list sorted by surname, forename and person_id, filtered by some criteria.
In this example person_id=1, forename~*'to' (about 400 people) and
organization_location~*'warszawa' (about 2000 organizations):

select count(*) as position from
  (select person_id, surname, forename
     from persons natural left join organizations
    where forename~*'to'
      and organization_location~*'warszawa'
  ) as person_filter
where array[surname, forename, lpad(person_id,10,'0')]
    < (select array[surname, forename, lpad(person_id,10,'0')]
         from persons where person_id=1);

This query takes about 30 seconds when enable_seqscan=on and 65
milliseconds when off.

When enable_seqscan=on:

Aggregate  (cost=785.72..785.73 rows=1 width=0) (actual time=27948.955..27948.956 rows=1 loops=1)
  InitPlan
    ->  Index Scan using persons_pkey on persons  (cost=0.00..3.11 rows=1 width=26) (actual time=0.019..0.019 rows=0 loops=1)
          Index Cond: (person_id = 1)
  ->  Nested Loop  (cost=0.00..782.60 rows=1 width=0) (actual time=27948.939..27948.939 rows=0 loops=1)
        Join Filter: ("inner".organization_id = "outer".organization_id)
        ->  Seq Scan on organizations  (cost=0.00..480.95 rows=1 width=4) (actual time=0.071..69.702 rows=1892 loops=1)
              Filter: ((organization_location)::text ~* 'warszawa'::text)
        ->  Seq Scan on persons  (cost=0.00..296.10 rows=444 width=4) (actual time=14.720..14.720 rows=0 loops=1892)
              Filter: (((forename)::text ~* 'to'::text) AND (ARRAY[surname, forename, (lpad((person_id)::text, 10, '0'::text))::character varying] < $0))
Total runtime: 27949.106 ms

When enable_seqscan=off:

Aggregate  (cost=100001710.26..100001710.27 rows=1 width=0) (actual time=66.788..66.789 rows=1 loops=1)
  InitPlan
    ->  Index Scan using persons_pkey on persons  (cost=0.00..3.11 rows=1 width=26) (actual time=0.019..0.019 rows=0 loops=1)
          Index Cond: (person_id = 1)
  ->  Hash Join  (cost=100001408.81..100001707.14 rows=1 width=0) (actual time=66.756..66.756 rows=0 loops=1)
        Hash Cond: ("outer".organization_id = "inner".organization_id)
        ->  Seq Scan on persons  (cost=100000000.00..100000296.10 rows=444 width=4) (actual time=14.972..14.972 rows=0 loops=1)
              Filter: (((forename)::text ~* 'to'::text) AND (ARRAY[surname, forename, (lpad((person_id)::text, 10, '0'::text))::character varying] < $0))
        ->  Hash  (cost=1408.81..1408.81 rows=1 width=4) (actual time=51.763..51.763 rows=1892 loops=1)
              ->  Index Scan using organizations_pkey on organizations  (cost=0.00..1408.81 rows=1 width=4) (actual time=0.049..48.233 rows=1892 loops=1)
                    Filter: ((organization_location)::text ~* 'warszawa'::text)
Total runtime: 66.933 ms

The database is properly analyzed. postgresql-8.1.4 on Fedora Core 4.

Regards
Tometzky

PS. Actual table and column names are different (they're in Polish) but
I've translated them for better readability for English speakers.

PS. I wonder if it makes sense to set "enable_seqscan=off" for every
client if a database is small enough to fit in the OS cache.

--
...although Eating Honey was a very good thing to do, there was a moment
just before you began to eat it which was better than when you were...
                                                      Winnie the Pooh
Tomasz Ostrowski <tometzky@batory.org.pl> writes:
> I have a very slow query when enable_seqscan=on and very fast when
> enable_seqscan=off.

Here's your problem:

> ->  Seq Scan on organizations  (cost=0.00..480.95 rows=1 width=4) (actual time=0.071..69.702 rows=1892 loops=1)
>       Filter: ((organization_location)::text ~* 'warszawa'::text)

If it were estimating something like the actual number of rows matching
that filter, it'd never have chosen a nestloop plan like that. How many
rows are there in the organizations table?

This is probably the fault of the pattern-selectivity heuristic: it's far
too optimistic about long match strings eliminating a lot of rows. I think
there's been some discussion of modifying that logic but no one's really
stepped up with a better idea.

			regards, tom lane
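(For reference, the misestimate is easy to see in isolation by explaining
just that filter. A minimal sketch against the schema from the original
post; the exact numbers will of course differ:)

    EXPLAIN ANALYZE
    SELECT organization_id
      FROM organizations
     WHERE organization_location ~* 'warszawa';
    -- compare the planner's estimated "rows=" (1 in the plans above) with
    -- the actual row count (1892): that gap is the pattern-selectivity
    -- misestimate being discussed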
On Mon, 2006-07-03 at 22:31 +0200, Tomasz Ostrowski wrote:
> I have a very slow query when enable_seqscan=on and very fast when
> enable_seqscan=off. My schema looks like this (relevant columns only):

> PS. Actual table and column names are different (they're in Polish) but
> I've translated them for better readability for English speakers.

Thanks

> PS. I wonder if it makes sense to set "enable_seqscan=off" for every
> client if a database is small enough to fit in the OS cache.

You can set this for individual statements if you choose to.

> ->  Seq Scan on organizations  (cost=0.00..480.95 rows=1 width=4)
>     (actual time=0.071..69.702 rows=1892 loops=1)
>       Filter: ((organization_location)::text ~* 'warszawa'::text)

The issue is caused by the under-estimation of the number of rows in the
table as a result of the regular expression comparison. As a result the
planner thinks it can choose a nested loop join, though it ends up doing
1892 seq scans of persons, when it thought it would do only one.

The under-estimation is a known issue. Posting to -perform for the record.

--
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com
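(For reference, a minimal sketch of scoping the setting to one transaction
or one statement with standard SET LOCAL / SET / RESET; the query itself is
the one from the original post:)

    BEGIN;
    SET LOCAL enable_seqscan = off;  -- reverts automatically at COMMIT/ROLLBACK
    SELECT count(*) AS position FROM ...;  -- the slow query goes here
    COMMIT;

    -- or, for the rest of the session:
    SET enable_seqscan = off;
    -- ... run the query ...
    RESET enable_seqscan;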
On Mon, 03 Jul 2006, Tom Lane wrote:

> > ->  Seq Scan on organizations  (cost=0.00..480.95 rows=1 width=4)
> >     (actual time=0.071..69.702 rows=1892 loops=1)
> >       Filter: ((organization_location)::text ~* 'warszawa'::text)
>
> How many rows are there in the organizations table?

About 9000. And about 6500 persons. "Warszawa" is the biggest city in
Poland and its capital - many organizations are located there.

> This is probably the fault of the pattern-selectivity heuristic:
> it's far too optimistic about long match strings eliminating a lot
> of rows. I think there's been some discussion of modifying that
> logic but no one's really stepped up with a better idea.

I think that's because there is no good solution to this - no statistical
information is going to predict how much data will match a regular
expression.

Maybe in this situation the algorithm should be pessimistic - assume that
it will return all rows, or all non-null rows, or all rows no shorter than
the matching string (if it's a plain string and not, for example, a regex
like [abcdefghijklmnopqrstuvwxyz], which is long but will match basically
everything). In my opinion it is better to overestimate most of the time
than to risk underestimation by a factor of 1000 and more.

For now I'm turning off seqscans. This is the second time I've got terrible
performance with seqscans turned on because of bad estimation. And my
database will probably fit in cache.

Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a moment
just before you began to eat it which was better than when you were...
                                                      Winnie the Pooh
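(For reference, a sketch of how "turning off seqscans" can be persisted
beyond a single session with standard ALTER USER ... SET / ALTER DATABASE
... SET; the role and database names below are placeholders:)

    -- applies to new sessions of this role
    ALTER USER app_user SET enable_seqscan = off;

    -- applies to new sessions connecting to this database
    ALTER DATABASE app_db SET enable_seqscan = off;

It is a blunt instrument, though: it also penalizes every query that really
would be cheaper as a sequential scan, not just the misestimated ones.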
Tomasz Ostrowski <tometzky@batory.org.pl> writes:
> I think that's because there is no good solution to this - no statistical
> information is going to predict how much data will match a regular
> expression.

Well, it's certainly hard to imagine simple stats that would let the code
guess that, say, "warsa" and "warsaw" match nearly the same (large) number
of rows while "warsawq" matches nothing.

I think the real problem here is that regex matching is the wrong tool for
the job. Have you looked into a full-text index (tsearch2)? With something
like that, the index operator has at least got the correct conceptual
model, ie, looking for indexed words. I'm not sure if they have any decent
statistical support for it :-( but in theory that seems doable, whereas
regex estimation will always be a crapshoot.

			regards, tom lane
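(For reference, a rough sketch of what the tsearch2 approach looks like,
assuming the contrib/tsearch2 module is installed and a suitable default
configuration is selected; the added column and index names are made up for
illustration, and in practice a trigger would keep the column current:)

    -- add a tsvector column, populate it, and index it with GiST
    ALTER TABLE organizations ADD COLUMN location_tsv tsvector;
    UPDATE organizations
       SET location_tsv = to_tsvector(organization_location);
    CREATE INDEX organizations_location_tsv_idx
        ON organizations USING gist (location_tsv);

    -- word search then uses @@ instead of ~*
    SELECT organization_id
      FROM organizations
     WHERE location_tsv @@ to_tsquery('warszawa');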
On Tue, 04 Jul 2006, Tom Lane wrote:

> I think the real problem here is that regex matching is the wrong
> tool for the job. Have you looked into a full-text index
> (tsearch2)?

So much to do with so little time...

I've briefly looked into it but:
- it's complicated;
- it is not needed - a basic scan is good enough for the amount of data we
  have (if a sane query plan is chosen by the database);
- we have data in many languages (including ones based on the Cyrillic
  alphabet) - languages which use different forms of the same word based on
  context, for example:
	Warszawa
	Warszawy
	Warszawie
	Warszawę
	Warszawą
	Warszawo
  All of the above could be translated to "Warsaw". So we need to support
  matching parts of words ("warszaw"), which I haven't seen in tsearch2
  (maybe I've overlooked it). We also have words whose different forms look
  like this: "stół" "stole" "stołu" (Polish for "table") - when we need to
  find it we'd have to list every possible form (about 10) or use a regex
  like: 'st[oó][lł]'.

> With something like that, the index operator has at least got the
> correct conceptual model, ie, looking for indexed words. I'm not sure
> if they have any decent statistical support for it :-( but in theory
> that seems doable, whereas regex estimation will always be a crapshoot.

So why estimate regex expressions if there is no estimation possible?
Let's set this estimate to be pessimistic (match everything or everything
not null) and it will choose better plans. At least until somebody figures
out a better approach.

Regards
Tometzky
--
Best of prhn - the funniest posts of the Polish Usenet
http://prhn.dnsalias.org/
  Chaos always defeats order, because it is better organized.
						[ Terry Pratchett ]
On Tue, Jul 04, 2006 at 04:44:08PM +0200, Tomasz Ostrowski wrote:
> On Tue, 04 Jul 2006, Tom Lane wrote:
>
> > I think the real problem here is that regex matching is the wrong
> > tool for the job. Have you looked into a full-text index
> > (tsearch2)?
>
> So much to do with so little time...

For what it's worth, I've got pretty good results (at least considering the
little amount of work I put into it) with trigram indexes, courtesy of
$PGCONTRIB/pg_trgm.sql, just another amazing little piece by Bartunov and
Sigaev.

Let me know if you'd like to hear more.

Regards
-- tomás
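(For reference, a rough sketch of the trigram approach, assuming
contrib/pg_trgm is loaded; the index name is made up, and note that the %
operator does fuzzy similarity matching rather than substring or regex
matching:)

    -- trigram GiST index on the column that was being regex-scanned
    CREATE INDEX organizations_location_trgm_idx
        ON organizations USING gist (organization_location gist_trgm_ops);

    -- "close enough" matches; the similarity threshold defaults to 0.3
    -- and can be changed with set_limit(); similarity(a, b) returns a
    -- score between 0 and 1 if a ranking is needed
    SELECT organization_id, organization_location
      FROM organizations
     WHERE organization_location % 'warszawa';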
Tomasz Ostrowski <tometzky@batory.org.pl> writes:
> So why estimate regex expressions if there is no estimation possible?
> Let's set this estimate to be pessimistic (match everything or everything
> not null) and it will choose better plans.

Better plans for this specific example, worse plans for other cases.
Life is not that simple.

			regards, tom lane
On Tue, 04 Jul 2006, Tom Lane wrote:

> Tomasz Ostrowski <tometzky@batory.org.pl> writes:
> > So why estimate regex expressions if there is no estimation possible?
> > Let's set this estimate to be pessimistic (match everything or
> > everything not null) and it will choose better plans.
>
> Better plans for this specific example, worse plans for other cases.
> Life is not that simple.

It isn't. But these worse plans will be chosen only when pattern/regex
matching is used and will be, say, 2 times worse.

What I'm trying to point out is that some people use regular expressions
for filtering rows. When the program is written it is often impossible to
know what data will be put into it. And when a program is unexpectedly
2000 times slower than normal, that is much worse than being 2 times
slower but predictable.

I know Postgres uses a probabilistic approach, so there's always a
probability that the planner chooses very wrong. But normally this
probability is so small that it can be ignored. With pattern/regex
matching it is not.

Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a moment
just before you began to eat it which was better than when you were...
                                                      Winnie the Pooh