Thread: query plan with index having a btrim is different for strings of different length
query plan with index having a btrim is different for strings of different length
From
Richard Yen
Date:
Hi, I've discovered a peculiarity with using btrim in an index and was wondering if anyone has any input. My table is like this: Table "public.m_object_paper" Column | Type | Modifiers ---------------------+------------------------+------------------------ id | integer | not null title | character varying(200) | not null x_firstname | character varying(50) | x_lastname | character varying(50) | <...snip...> page_count | smallint | compare_to_database | bit varying | not null Indexes: "m_object_paper_pkey" PRIMARY KEY, btree (id) "last_name_fnc_idx" btree (lower(btrim(x_lastname::text))) "m_object_paper_assignment_idx" btree (assignment) "m_object_paper_owner_idx" btree (owner) CLUSTER <...snip to end...> My query is like this: SELECT m_object_paper.id FROM m_object_paper, m_assignment WHERE m_object_paper.assignment = m_assignment.id AND m_object_paper.owner=-1 AND m_assignment.class = 2450798 AND lower(btrim(x_firstname)) = lower(btrim($FIRSTNAME)) and lower(btrim(x_lastname)) = lower(btrim($LASTNAME)); Strangely, if $LASTNAME is 5 chars, the query plan looks like this: tii=# explain SELECT m_object_paper.id FROM m_object_paper, m_assignment WHERE m_object_paper.assignment = m_assignment.id AND m_object_paper.owner=-1 AND m_assignment.class = 2450798 AND lower(btrim(x_firstname)) = lower(btrim('Jordan')) and lower(btrim(x_lastname)) = lower(btrim('Smith')); QUERY PLAN --------------------------------------------------------------------------------------------------------------- Hash Join (cost=181704.85..291551.77 rows=1 width=4) Hash Cond: (m_object_paper.assignment = m_assignment.id) -> Bitmap Heap Scan on m_object_paper (cost=181524.86..291369.66 rows=562 width=8) Recheck Cond: ((lower(btrim((x_lastname)::text)) = 'smith'::text) AND (owner = (-1))) Filter: (lower(btrim((x_firstname)::text)) = 'jordan'::text) -> BitmapAnd (cost=181524.86..181524.86 rows=112429 width=0) -> Bitmap Index Scan on last_name_fnc_idx (cost=0.00..5468.29 rows=496740 width=0) Index Cond: (lower(btrim((x_lastname)::text)) = 'smith'::text) -> Bitmap Index Scan on m_object_paper_owner_idx (cost=0.00..176056.04 rows=16061244 width=0) Index Cond: (owner = (-1)) -> Hash (cost=177.82..177.82 rows=174 width=4) -> Index Scan using m_assignment_class_idx on m_assignment (cost=0.00..177.82 rows=174 width=4) Index Cond: (class = 2450798) (13 rows) However, if $LASTNAME is != 5 chars (1 char, 100 chars, doesn't make a difference), the query plan looks like this: tii=# explain SELECT m_object_paper.id FROM m_object_paper, m_assignment WHERE m_object_paper.assignment = m_assignment.id AND m_object_paper.owner=-1 AND m_assignment.class = 2450798 AND lower(btrim(x_firstname)) = lower(btrim('Jordan')) and lower(btrim(x_lastname)) = lower(btrim('Smithers')); QUERY PLAN --------------------------------------------------------------------------------------------------- Nested Loop (cost=0.00..10141.06 rows=1 width=4) -> Index Scan using last_name_fnc_idx on m_object_paper (cost=0.00..10114.24 rows=11 width=8) Index Cond: (lower(btrim((x_lastname)::text)) = 'smithers'::text) Filter: ((owner = (-1)) AND (lower(btrim((x_firstname)::text)) = 'jordan'::text)) -> Index Scan using m_assignment_pkey on m_assignment (cost=0.00..2.43 rows=1 width=4) Index Cond: (m_assignment.id = m_object_paper.assignment) Filter: (m_assignment.class = 2450798) (7 rows) In practice, the difference is 300+ seconds when $LASTNAME == 5 chars and <1 second when $LASTNAME != 5 chars. Would anyone know what's going on here? Is there something about the way btrim works, or perhaps with the way indexes are created? It's strange that the query plan would change for just one case ("Jones," "Smith," "Brown," etc., all cause the query plan to use that extra heap scan). Thanks for any help! --Richard
Re: query plan with index having a btrim is different for strings of different length
From
"David Wilson"
Date:
On Tue, Dec 9, 2008 at 2:56 PM, Richard Yen <dba@richyen.com> wrote: > In practice, the difference is 300+ seconds when $LASTNAME == 5 chars and <1 > second when $LASTNAME != 5 chars. > > Would anyone know what's going on here? Is there something about the way > btrim works, or perhaps with the way indexes are created? It's strange that > the query plan would change for just one case ("Jones," "Smith," "Brown," > etc., all cause the query plan to use that extra heap scan). Those are likely common names, and may be showing up in the table stats as common values, causing the planner to change things around. Does this hold even for non-existent 5-character lastname strings? Speaking of table statistics, might be worth upping the statistics target on that table/column, analyzing, and seeing if you get different results. -- - David T. Wilson david.t.wilson@gmail.com
Re: query plan with index having a btrim is different for strings of different length
From
Tom Lane
Date:
Richard Yen <dba@richyen.com> writes: > I've discovered a peculiarity with using btrim in an index and was > wondering if anyone has any input. What PG version is this? In particular, I'm wondering if it's one of the early 8.2.x releases, which had some bugs in and around choose_bitmap_and() that caused them to sometimes make weird choices of indexes in a BitmapAnd plan structure ... Like David, I'm pretty dubious that the behavior has anything to do with strings being 5 characters long exactly; but it could very much depend on whether the string you choose is a common last name or not. That would change the estimated number of matching rows and hence affect the apparent relative attractiveness of different indexes. regards, tom lane
Re: query plan with index having a btrim is different for strings of different length
From
Richard Yen
Date:
On Dec 9, 2008, at 3:27 PM, Tom Lane wrote: > Richard Yen <dba@richyen.com> writes: >> I've discovered a peculiarity with using btrim in an index and was >> wondering if anyone has any input. > > What PG version is this? This is running on 8.3.3 > In particular, I'm wondering if it's one of the early 8.2.x releases, > which had some bugs in and around choose_bitmap_and() that caused > them to sometimes make weird choices of indexes in a BitmapAnd plan > structure ... > > Like David, I'm pretty dubious that the behavior has anything to do > with > strings being 5 characters long exactly; but it could very much depend > on whether the string you choose is a common last name or not. That > would change the estimated number of matching rows and hence affect > the > apparent relative attractiveness of different indexes. You guys are right. I tried "Miller" and gave me the same result. Is there any way to tune this so that for the common last names, the query run time doesn't jump from <1s to >300s? Thanks for the help! --Richard
Re: query plan with index having a btrim is different for strings of different length
From
Tom Lane
Date:
Richard Yen <dba@richyen.com> writes: > You guys are right. I tried "Miller" and gave me the same result. Is > there any way to tune this so that for the common last names, the > query run time doesn't jump from <1s to >300s? If the planner's estimation is that far off then there must be something very weird about the table statistics, but you haven't given us any clue what. regards, tom lane
Re: query plan with index having a btrim is different for strings of different length
From
"Robert Haas"
Date:
> You guys are right. I tried "Miller" and gave me the same result. Is there > any way to tune this so that for the common last names, the query run time > doesn't jump from <1s to >300s? > Thanks for the help! Can you send the output of EXPLAIN ANALYZE for both cases? ...Robert
Re: query plan with index having a btrim is different for strings of different length
From
Tom Lane
Date:
BTW, if your queries typically constrain both lastname and firstname, it'd likely be worthwhile to make a 2-column index on lower(btrim(x_lastname)), lower(btrim(x_firstname)) regards, tom lane
Re: query plan with index having a btrim is different for strings of different length
From
Richard Yen
Date:
On Dec 10, 2008, at 11:34 AM, Tom Lane wrote: > Richard Yen <dba@richyen.com> writes: >> You guys are right. I tried "Miller" and gave me the same result. >> Is >> there any way to tune this so that for the common last names, the >> query run time doesn't jump from <1s to >300s? > > If the planner's estimation is that far off then there must be > something > very weird about the table statistics, but you haven't given us any > clue > what. Wow, thanks for helping me out here. I don't have much experience with deconstructing queries and working with stats, so here's what I could gather. If you need more information, please let me know. tii=# select * from pg_stat_all_tables where relname = 'm_object_paper' or relname = 'm_assignment'; -[ RECORD 1 ]----+------------------------------ relid | 17516 schemaname | public relname | m_assignment seq_scan | 274 seq_tup_read | 1039457272 idx_scan | 372379230 idx_tup_fetch | 2365235708 n_tup_ins | 5641638 n_tup_upd | 520684 n_tup_del | 30339 n_tup_hot_upd | 406929 n_live_tup | 5611665 n_dead_tup | 11877 last_vacuum | last_autovacuum | 2008-12-04 17:44:57.309717-08 last_analyze | 2008-10-20 15:09:50.943652-07 last_autoanalyze | 2008-08-15 17:16:14.588153-07 -[ RECORD 2 ]----+------------------------------ relid | 17792 schemaname | public relname | m_object_paper seq_scan | 83613 seq_tup_read | 184330159906 idx_scan | 685219945 idx_tup_fetch | 222892138627 n_tup_ins | 71564825 n_tup_upd | 27558792 n_tup_del | 3058 n_tup_hot_upd | 22410985 n_live_tup | 71559627 n_dead_tup | 585467 last_vacuum | 2008-10-24 14:36:45.134936-07 last_autovacuum | 2008-12-05 07:02:40.52712-08 last_analyze | 2008-11-25 14:42:04.185798-08 last_autoanalyze | 2008-08-15 17:20:28.42811-07 tii=# select * from pg_statio_all_tables where relname = 'm_object_paper' or relname = 'm_assignment'; -[ RECORD 1 ]---+--------------- relid | 17516 schemaname | public relname | m_assignment heap_blks_read | 22896372 heap_blks_hit | 1753777105 idx_blks_read | 7879634 idx_blks_hit | 1157729592 toast_blks_read | 0 toast_blks_hit | 0 tidx_blks_read | 0 tidx_blks_hit | 0 -[ RECORD 2 ]---+--------------- relid | 17792 schemaname | public relname | m_object_paper heap_blks_read | 2604944369 heap_blks_hit | 116307527781 idx_blks_read | 133534908 idx_blks_hit | 3601637440 toast_blks_read | 0 toast_blks_hit | 0 tidx_blks_read | 0 tidx_blks_hit | 0 Also, yes, we've kicked around the idea of doing an index on the concatenation of the first and last names--that would definitely be more unique, and I think we're actually going to move to that. Just thought I'd dig deeper here to learn more. Thanks! --Richard
Re: query plan with index having a btrim is different for strings of different length
From
Richard Yen
Date:
On Dec 10, 2008, at 11:39 AM, Robert Haas wrote: >> You guys are right. I tried "Miller" and gave me the same result. >> Is there >> any way to tune this so that for the common last names, the query >> run time >> doesn't jump from <1s to >300s? >> Thanks for the help! > > Can you send the output of EXPLAIN ANALYZE for both cases? tii=# explain analyze SELECT m_object_paper.id FROM m_object_paper, m_assignment WHERE m_object_paper.assignment = m_assignment.id AND m_object_paper.owner=-1 AND m_assignment.class = 2450798 AND lower(btrim(x_firstname)) = lower(btrim('Jordan')) and lower(btrim(x_lastname)) = lower(btrim('Smithers')); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=0.00..10141.07 rows=1 width=4) (actual time=33.004..33.004 rows=0 loops=1) -> Index Scan using last_name_fnc_idx on m_object_paper (cost=0.00..10114.25 rows=11 width=8) (actual time=33.003..33.003 rows=0 loops=1) Index Cond: (lower(btrim((x_lastname)::text)) = 'smithers'::text) Filter: ((owner = (-1)) AND (lower(btrim((x_firstname)::text)) = 'jordan'::text)) -> Index Scan using m_assignment_pkey on m_assignment (cost=0.00..2.43 rows=1 width=4) (never executed) Index Cond: (m_assignment.id = m_object_paper.assignment) Filter: (m_assignment.class = 2450798) Total runtime: 33.070 ms (8 rows) tii=# explain analyze SELECT m_object_paper.id FROM m_object_paper, m_assignment WHERE m_object_paper.assignment = m_assignment.id AND m_object_paper.owner=-1 AND m_assignment.class = 2450798 AND lower(btrim(x_firstname)) = lower(btrim('Jordan')) and lower(btrim(x_lastname)) = lower(btrim('Smith')); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Hash Join (cost=181867.87..291714.78 rows=1 width=4) (actual time=124746.426..139252.850 rows=1 loops=1) Hash Cond: (m_object_paper.assignment = m_assignment.id) -> Bitmap Heap Scan on m_object_paper (cost=181687.88..291532.67 rows=562 width=8) (actual time=124672.328..139248.919 rows=58 loops=1) Recheck Cond: ((lower(btrim((x_lastname)::text)) = 'smith'::text) AND (owner = (-1))) Filter: (lower(btrim((x_firstname)::text)) = 'jordan'::text) -> BitmapAnd (cost=181687.88..181687.88 rows=112429 width=0) (actual time=124245.890..124245.890 rows=0 loops=1) -> Bitmap Index Scan on last_name_fnc_idx (cost=0.00..5476.30 rows=496740 width=0) (actual time=16194.803..16194.803 rows=521382 loops=1) Index Cond: (lower(btrim((x_lastname)::text)) = 'smith'::text) -> Bitmap Index Scan on m_object_paper_owner_idx (cost=0.00..176211.04 rows=16061244 width=0) (actual time=107928.054..107928.054 rows=15494737 loops=1) Index Cond: (owner = (-1)) -> Hash (cost=177.82..177.82 rows=174 width=4) (actual time=3.764..3.764 rows=5 loops=1) -> Index Scan using m_assignment_class_idx on m_assignment (cost=0.00..177.82 rows=174 width=4) (actual time=0.039..3.756 rows=5 loops=1) Index Cond: (class = 2450798) Total runtime: 139255.109 ms (14 rows) This example doesn't have a > 300s run time, but there are a few in my log that are. In either case, I guess you get the picture. Thanks for the help! --Richard
Re: query plan with index having a btrim is different for strings of different length
From
"Vladimir Sitnikov"
Date:
tii=# explain analyze SELECT m_object_paper.id FROM m_object_paper, m_assignment WHERE m_object_paper.assignment = m_assignment.id AND m_object_paper.owner=-1 AND m_assignment.class = 2450798 AND lower(btrim(x_firstname)) = lower(btrim('Jordan')) and lower(btrim(x_lastname)) = lower(btrim('Smith'));
Is there an index on "m_object_paper.assignment"? It could solve the problem.
With current indices on "btrim(last_name)" and "owner" you are just throwing the rows away (you have 521382 rows with "smith", 15494737 with owner=-1 and only 58 of them have both "smith"/"jordan" and -1).
Consider creating index on m_object_paper using btree(lower(btrim(x_lastname))) where owner=-1; (it might add firstname column there as per Tom's suggestion)
Or just index on (owner, lower(...)) if you have other queries with different values for owner.
One more point that could improve bitmap scans is greater value for work_mem. You'll need 8*15494737 ~ 130Mb == 130000 for work_mem (however, that is way too high unless you have lots of RAM and just couple of active database sessions)
Regards,
Vladimir Sitnikov
Re: query plan with index having a btrim is different for strings of different length
From
Tom Lane
Date:
Richard Yen <dba@richyen.com> writes: > Is there any way to tune this so that for the common last names, the query > run time doesn't jump from <1s to >300s? Well, as near as I can tell there's factor of a couple hundred difference between the frequencies of 'smith' and 'smithers', so you shouldn't really expect similar runtimes for the two cases. Having said that, I still think you should try to index both first and last name. Also I wonder whether the index on owner is worth having at all. It definitely doesn't seem worthwhile to index the entries with owner = -1, since there are so many; so maybe you could make it a partial index that excludes those entries, in order to prevent the planner from trying to use it for this case. regards, tom lane
Re: query plan with index having a btrim is different for strings of different length
From
Richard Yen
Date:
On Dec 10, 2008, at 4:08 PM, Tom Lane wrote: > Richard Yen <dba@richyen.com> writes: >> Is there any way to tune this so that for the common last names, >> the query >> run time doesn't jump from <1s to >300s? > > Well, as near as I can tell there's factor of a couple hundred > difference between the frequencies of 'smith' and 'smithers', so > you shouldn't really expect similar runtimes for the two cases. > > Having said that, I still think you should try to index both first > and last name. Also I wonder whether the index on owner is worth > having at all. It definitely doesn't seem worthwhile to index the > entries with owner = -1, since there are so many; so maybe you could > make it a partial index that excludes those entries, in order to > prevent > the planner from trying to use it for this case. Created the 2-col index, and the query runs much faster on all permutations. Thanks much for all your help, --Richard