Thread: Understanding "seq scans"
Hi all, I'm doing some experiments to find the better layout for reimplementing an existing db (MySQL cough!) with PostgreSQL 9.4+. I noticed a strange plan coming out from a simple query joining two tables, both containing 10Mrecs (and both ANALYZEd): l10ntest=# \d master; Table "public.master" Column | Type | Modifiers --------+---------+------------------------------------------------------ num | integer | not null default nextval('master_num_seq'::regclass) Indexes: "master_pkey" PRIMARY KEY, btree (num) l10ntest=# \d master_l10n; Table "public.master_l10n" Column | Type | Modifiers --------+----------------------+----------------------------------------------------------- num | integer | not null default nextval('master_l10n_num_seq'::regclass) lang | character varying(2) | not null text | text | Indexes: "master_l10n_pkey" PRIMARY KEY, btree (num, lang) "l10n_text_index" btree (lower(text) text_pattern_ops) l10ntest=# EXPLAIN SELECT count(l.num) AS count_1 FROM master_l10n l WHERE l.lang = 'it' AND lower(l.text) LIKE 'quattro%'; QUERY PLAN ------------------------------------------------------------------------------------------------------ Aggregate (cost=309315.38..309315.39 rows=1 width=4) -> Bitmap Heap Scan on master_l10n l (cost=64700.56..307801.65 rows=605492 width=4) Filter: (((lang)::text = 'it'::text) AND (lower(text) ~~ 'quattro%'::text)) -> Bitmap Index Scan on l10n_text_index (cost=0.00..64549.19 rows=999662 width=0) Index Cond: ((lower(text) ~>=~ 'quattro'::text) AND (lower(text) ~<~ 'quattrp'::text)) (5 rows) Time: 1.665 ms l10ntest=# EXPLAIN SELECT count(m.num) AS count_1 FROM master_l10n l JOIN master m ON m.num = l.num WHERE l.lang = 'it'AND lower(l.text) LIKE 'quattro%'; QUERY PLAN ------------------------------------------------------------------------------------------------------------ Aggregate (cost=676558.14..676558.15 rows=1 width=4) -> Hash Join (cost=373011.02..675044.41 rows=605492 width=4) Hash Cond: (l.num = m.num) -> Bitmap Heap Scan on master_l10n l (cost=64700.56..307801.65 rows=605492 width=4) Filter: (((lang)::text = 'it'::text) AND (lower(text) ~~ 'quattro%'::text)) -> Bitmap Index Scan on l10n_text_index (cost=0.00..64549.19 rows=999662 width=0) Index Cond: ((lower(text) ~>=~ 'quattro'::text) AND (lower(text) ~<~ 'quattrp'::text)) -> Hash (cost=144247.76..144247.76 rows=9999976 width=4) -> Seq Scan on master m (cost=0.00..144247.76 rows=9999976 width=4) (9 rows) Time: 1.244 ms l10ntest=# SELECT count(l.num) AS count_1 FROM master_l10n l WHERE l.lang = 'it' AND lower(l.text) LIKE 'quattro%'; count_1 --------- 1101101 (1 row) Time: 1221.941 ms l10ntest=# SELECT count(m.num) AS count_1 FROM master_l10n l JOIN master m ON m.num = l.num WHERE l.lang = 'it' AND lower(l.text)LIKE 'quattro%'; count_1 --------- 1101101 (1 row) Time: 3541.852 ms Why does the join on the master table require a "Seq Scan on master"? I tried different kinds of "JOIN", but the resulting explanation remains the same. Am I missing something, or should I stop worrying about that sequential scan? Thanks in advance, bye, lele. -- nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia. lele@metapensiero.it | -- Fortunato Depero, 1929.
On Monday, October 12, 2015 2:52 PM, Lele Gaifax <lele@metapensiero.it> wrote: > I'm doing some experiments to find the better layout for > reimplementing an existing db (MySQL cough!) with PostgreSQL > 9.4+. > > I noticed a strange plan coming out from a simple query joining > two tables, both containing 10Mrecs (and both ANALYZEd): Your best bet to get good advice on this is to follow the steps outlined here: https://wiki.postgresql.org/wiki/SlowQueryQuestions The short answer to the question of why it is picking those plans is that that the optimizer looks at the various plans that can generate correct results, and picks the one with the lowest estimated cost based on your costing factors. To get a plan more like what you seem to be expecting you might need to adjust cost factors or create an index that allows more direct access to the data needed by the query. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10/12/2015 12:06 PM, Lele Gaifax wrote: > Hi all, > > I'm doing some experiments to find the better layout for reimplementing > an existing db (MySQL cough!) with PostgreSQL 9.4+. > > I noticed a strange plan coming out from a simple query joining two tables, > both containing 10Mrecs (and both ANALYZEd): > > l10ntest=# \d master; > Table "public.master" > Column | Type | Modifiers > --------+---------+------------------------------------------------------ > num | integer | not null default nextval('master_num_seq'::regclass) > Indexes: > "master_pkey" PRIMARY KEY, btree (num) > > l10ntest=# \d master_l10n; > Table "public.master_l10n" > Column | Type | Modifiers > --------+----------------------+----------------------------------------------------------- > num | integer | not null default nextval('master_l10n_num_seq'::regclass) > lang | character varying(2) | not null > text | text | > Indexes: > "master_l10n_pkey" PRIMARY KEY, btree (num, lang) > "l10n_text_index" btree (lower(text) text_pattern_ops) > > l10ntest=# EXPLAIN SELECT count(l.num) AS count_1 FROM master_l10n l WHERE l.lang = 'it' AND lower(l.text) LIKE 'quattro%'; > QUERY PLAN > ------------------------------------------------------------------------------------------------------ > Aggregate (cost=309315.38..309315.39 rows=1 width=4) > -> Bitmap Heap Scan on master_l10n l (cost=64700.56..307801.65 rows=605492 width=4) > Filter: (((lang)::text = 'it'::text) AND (lower(text) ~~ 'quattro%'::text)) > -> Bitmap Index Scan on l10n_text_index (cost=0.00..64549.19 rows=999662 width=0) > Index Cond: ((lower(text) ~>=~ 'quattro'::text) AND (lower(text) ~<~ 'quattrp'::text)) > (5 rows) > > Time: 1.665 ms > > l10ntest=# EXPLAIN SELECT count(m.num) AS count_1 FROM master_l10n l JOIN master m ON m.num = l.num WHERE l.lang ='it' AND lower(l.text) LIKE 'quattro%'; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------ > Aggregate (cost=676558.14..676558.15 rows=1 width=4) > -> Hash Join (cost=373011.02..675044.41 rows=605492 width=4) > Hash Cond: (l.num = m.num) > -> Bitmap Heap Scan on master_l10n l (cost=64700.56..307801.65 rows=605492 width=4) > Filter: (((lang)::text = 'it'::text) AND (lower(text) ~~ 'quattro%'::text)) > -> Bitmap Index Scan on l10n_text_index (cost=0.00..64549.19 rows=999662 width=0) > Index Cond: ((lower(text) ~>=~ 'quattro'::text) AND (lower(text) ~<~ 'quattrp'::text)) > -> Hash (cost=144247.76..144247.76 rows=9999976 width=4) > -> Seq Scan on master m (cost=0.00..144247.76 rows=9999976 width=4) > (9 rows) > > Time: 1.244 ms > > l10ntest=# SELECT count(l.num) AS count_1 FROM master_l10n l WHERE l.lang = 'it' AND lower(l.text) LIKE 'quattro%'; > count_1 > --------- > 1101101 > (1 row) > > Time: 1221.941 ms > > l10ntest=# SELECT count(m.num) AS count_1 FROM master_l10n l JOIN master m ON m.num = l.num WHERE l.lang = 'it' ANDlower(l.text) LIKE 'quattro%'; > count_1 > --------- > 1101101 > (1 row) > > Time: 3541.852 ms > > Why does the join on the master table require a "Seq Scan on master"? I tried > different kinds of "JOIN", but the resulting explanation remains the same. > > Am I missing something, or should I stop worrying about that sequential scan? Off hand I would say it is because of this --> count(m.num). Try count(l.num) instead and see what happens. As your queries above show they are the same number. > > Thanks in advance, > bye, lele. > -- Adrian Klaver adrian.klaver@aklaver.com
Adrian Klaver <adrian.klaver@aklaver.com> writes: > Off hand I would say it is because of this --> count(m.num). Try count(l.num) instead and see > what happens. As your queries above show they are the same number. No, that's another thing I already tried tweaking and should have mentioned. Neither count(*) nor count(l.num) have any influence on the plan. Following Kevin's advice (thanks!) I read the SlowQueryQuestions wiki page and learned about the "buffers" EXPLAIN option: EXPLAIN (analyze,buffers) SELECT count(l.num) AS count_1 FROM master_l10n l JOIN master m ON m.num = l.num WHERE l.lang= 'it' AND lower(l.text) LIKE 'quattro%'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=676558.14..676558.15 rows=1 width=4) (actual time=4133.991..4133.991 rows=1 loops=1) Buffers: shared hit=6 read=84710, temp read=32652 written=32398 -> Hash Join (cost=373011.02..675044.41 rows=605492 width=4) (actual time=1940.285..4074.654 rows=1101101 loops=1) Hash Cond: (l.num = m.num) Buffers: shared hit=6 read=84710, temp read=32652 written=32398 -> Bitmap Heap Scan on master_l10n l (cost=64700.56..307801.65 rows=605492 width=4) (actual time=201.132..1286.629rows=1101101 loops=1) Filter: (((lang)::text = 'it'::text) AND (lower(text) ~~ 'quattro%'::text)) Heap Blocks: exact=25621 Buffers: shared hit=1 read=40464 -> Bitmap Index Scan on l10n_text_index (cost=0.00..64549.19 rows=999662 width=0) (actual time=195.946..195.946rows=1101101 loops=1) Index Cond: ((lower(text) ~>=~ 'quattro'::text) AND (lower(text) ~<~ 'quattrp'::text)) Buffers: shared read=14844 -> Hash (cost=144247.76..144247.76 rows=9999976 width=4) (actual time=1738.180..1738.180 rows=9999999 loops=1) Buckets: 16384 Batches: 128 Memory Usage: 2778kB Buffers: shared hit=2 read=44246, temp written=29000 -> Seq Scan on master m (cost=0.00..144247.76 rows=9999976 width=4) (actual time=0.006..629.590 rows=9999999loops=1) Buffers: shared hit=2 read=44246 Planning time: 0.493 ms Execution time: 4134.144 ms (19 rows) # select version(); version -------------------------------------------------------------------------------------------------------- PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 5.2.1-21) 5.2.1 20151003, 64-bit (1 row) Thank you, ciao, lele. -- nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia. lele@metapensiero.it | -- Fortunato Depero, 1929.
Lele Gaifax wrote: > Hi all, > > I'm doing some experiments to find the better layout for reimplementing > an existing db (MySQL cough!) with PostgreSQL 9.4+. > > I noticed a strange plan coming out from a simple query joining two tables, > both containing 10Mrecs (and both ANALYZEd): > -> Bitmap Index Scan on l10n_text_index (cost=0.00..64549.19 rows=999662 width=0) > Index Cond: ((lower(text) ~>=~ 'quattro'::text) AND (lower(text) ~<~ 'quattrp'::text)) So 10% of your rows in the master_l10n table start with "quattro"? That's pretty odd, isn't it? How did you manufacture these data? I wonder if the l10n_text_index index should be on (lang, lower(text)). How often are you going to look for translated text without specifying a language? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Monday, October 12, 2015, Lele Gaifax <lele@metapensiero.it> wrote:
Adrian Klaver <adrian.klaver@aklaver.com> writes:
> Off hand I would say it is because of this --> count(m.num). Try count(l.num) instead and see
> what happens. As your queries above show they are the same number.
No, that's another thing I already tried tweaking and should have mentioned.
Neither count(*) nor count(l.num) have any influence on the plan.
Following Kevin's advice (thanks!) I read the SlowQueryQuestions wiki page and
learned about the "buffers" EXPLAIN option:
EXPLAIN (analyze,buffers) SELECT count(l.num) AS count_1 FROM master_l10n l JOIN master m ON m.num = l.num WHERE l.lang = 'it' AND lower(l.text) LIKE 'quattro%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=676558.14..676558.15 rows=1 width=4) (actual time=4133.991..4133.991 rows=1 loops=1)
Buffers: shared hit=6 read=84710, temp read=32652 written=32398
-> Hash Join (cost=373011.02..675044.41 rows=605492 width=4) (actual time=1940.285..4074.654 rows=1101101 loops=1)
Hash Cond: (l.num = m.num)
Buffers: shared hit=6 read=84710, temp read=32652 written=32398
-> Bitmap Heap Scan on master_l10n l (cost=64700.56..307801.65 rows=605492 width=4) (actual time=201.132..1286.629 rows=1101101 loops=1)
Filter: (((lang)::text = 'it'::text) AND (lower(text) ~~ 'quattro%'::text))
Heap Blocks: exact=25621
Buffers: shared hit=1 read=40464
-> Bitmap Index Scan on l10n_text_index (cost=0.00..64549.19 rows=999662 width=0) (actual time=195.946..195.946 rows=1101101 loops=1)
Index Cond: ((lower(text) ~>=~ 'quattro'::text) AND (lower(text) ~<~ 'quattrp'::text))
Buffers: shared read=14844
-> Hash (cost=144247.76..144247.76 rows=9999976 width=4) (actual time=1738.180..1738.180 rows=9999999 loops=1)
Buckets: 16384 Batches: 128 Memory Usage: 2778kB
Buffers: shared hit=2 read=44246, temp written=29000
-> Seq Scan on master m (cost=0.00..144247.76 rows=9999976 width=4) (actual time=0.006..629.590 rows=9999999 loops=1)
Buffers: shared hit=2 read=44246
Planning time: 0.493 ms
Execution time: 4134.144 ms
(19 rows)
# select version();
version
--------------------------------------------------------------------------------------------------------
PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 5.2.1-21) 5.2.1 20151003, 64-bit
(1 row)
Thank you,
ciao, lele.
--
Your data and indexes are organized such that an index is only marginally helpful, or so the planner thinks. Try:
1. Cranking effective_cache_size so the planner might think your data is cached.
2. Reducing random_page_cost to discourage random plans
3. Temporarily disabling seq scans
4. Composite index for better lookups.
merlin
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > So 10% of your rows in the master_l10n table start with "quattro"? > That's pretty odd, isn't it? How did you manufacture these data? Well, not a real scenario for sure, but definitely not odd: I just needed an "extremely" big dataset to test out several different strategies, both on table layout and indexes. The tables are populated by mechanically translating the integer primary key into the corresponding "in words" string (1 -> "one")... See https://gitlab.com/lelix/hstore4l10n for details. > I wonder if the l10n_text_index index should be on (lang, lower(text)). Oh wow, thank you for this enlightment! Sure, that's the problem indeed! > How often are you going to look for translated text without specifying a > language? Never. The most frequently used criteria is «LIKE '%word%'» in the context of a user session, and thus with a "preferred language". Thank you again, ciao, lele. -- nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia. lele@metapensiero.it | -- Fortunato Depero, 1929.
Merlin Moncure <mmoncure@gmail.com> writes: > Your data and indexes are organized such that an index is only marginally > helpful, or so the planner thinks. Yes, just wanted to get rid of that suspicious "seq scan", that was effectively caused by a bad index, as Alvaro pointed out. Thank you, ciao, lele. -- nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia. lele@metapensiero.it | -- Fortunato Depero, 1929.
Lele Gaifax wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > > So 10% of your rows in the master_l10n table start with "quattro"? > > That's pretty odd, isn't it? How did you manufacture these data? > > Well, not a real scenario for sure, but definitely not odd: I just needed an > "extremely" big dataset to test out several different strategies, both on > table layout and indexes. The tables are populated by mechanically translating > the integer primary key into the corresponding "in words" string (1 -> "one")... I imagined it would be something like that. It's not the most useful set of test data, precisely because it doesn't accurately reflect what you're going to have in practice. I suggest you enter some actual text, even if it's just text from Don Camillo or whatever. > > How often are you going to look for translated text without specifying a > > language? > > Never. The most frequently used criteria is «LIKE '%word%'» in the context of > a user session, and thus with a "preferred language". Be very careful with a % at the left. The index is not going to work at all there. It is not the same as looking for stuff without a % at the left. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I imagined it would be something like that. It's not the most useful > set of test data, precisely because it doesn't accurately reflect what > you're going to have in practice. I suggest you enter some actual text, > even if it's just text from Don Camillo or whatever. Sure, of course. The goal was more exercising the hstore type, which I have never used before, and to find a good index strategy. >> Never. The most frequently used criteria is «LIKE '%word%'» in the context of >> a user session, and thus with a "preferred language". > > Be very careful with a % at the left. The index is not going to work at > all there. It is not the same as looking for stuff without a % at the > left. Right, I know. I'm indeed surprised how fast it already is to scan the whole table, in particular in the JSONB and HSTORE cases, where I didn't create an index on the field! Just for fun, I will try to learn about full text searches, another thing I never used on PG, and if it works against an HSTORE field... I don't know yet if I will need that machinery (current DB is still well under the million records) though. Thank you, ciao, lele. -- nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia. lele@metapensiero.it | -- Fortunato Depero, 1929.
On Tue, Oct 13, 2015 at 11:54 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Lele Gaifax wrote: >> Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> >> > So 10% of your rows in the master_l10n table start with "quattro"? >> > That's pretty odd, isn't it? How did you manufacture these data? >> >> Well, not a real scenario for sure, but definitely not odd: I just needed an >> "extremely" big dataset to test out several different strategies, both on >> table layout and indexes. The tables are populated by mechanically translating >> the integer primary key into the corresponding "in words" string (1 -> "one")... > > I imagined it would be something like that. It's not the most useful > set of test data, precisely because it doesn't accurately reflect what > you're going to have in practice. I suggest you enter some actual text, > even if it's just text from Don Camillo or whatever. > >> > How often are you going to look for translated text without specifying a >> > language? >> >> Never. The most frequently used criteria is «LIKE '%word%'» in the context of >> a user session, and thus with a "preferred language". > > Be very careful with a % at the left. The index is not going to work at > all there. It is not the same as looking for stuff without a % at the > left. Note, you can optimize LIKE '%foo%' with pg_trgm/gin indexing. Trigram based indexing is kind of a mixed bag but is about to get a lot faster with recent enhancements so that it should mostly match or beat the brute force search. This is the preferred solution if you need to do partial string matching -- for most other cases of attribute searching I'd be looking at jsonb. Welcome to postgres OP! merlin
Merlin Moncure <mmoncure@gmail.com> writes: > On Tue, Oct 13, 2015 at 11:54 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Be very careful with a % at the left. The index is not going to work at >> all there. It is not the same as looking for stuff without a % at the >> left. > > Note, you can optimize LIKE '%foo%' with pg_trgm/gin indexing. > Trigram based indexing is kind of a mixed bag but is about to get a > lot faster with recent enhancements so that it should mostly match or > beat the brute force search. Right, it is indeed very efficient, in particular when trying it on PG 9.5b1! Thank you. As a bonus, it is case insensitive, so even ILIKE can take advantage of it. > This is the preferred solution if you need to do partial string matching -- > for most other cases of attribute searching I'd be looking at jsonb. Speaking of which, as this is exactly the goal of my experiments, I have now added one trigram index for each "key" of an hstore field, where the "key" is the user language and the value is a text in that language: CREATE INDEX "text_it_idx" ON test_hstore USING gin ((text->'it') gin_trgm_ops) CREATE INDEX "text_en_idx" ON test_hstore USING gin ((text->'en') gin_trgm_ops) Is this the right approach, or am I missing something clever that would allow me to have a single index? > Welcome to postgres OP! Thank you. Even if I'm not exactly new to PG, I'm very glad to have at least a little opportunity to convince my coworkers to replace an awful MySQL subsystem with a shiny new implementation based on PostgreSQL! bye, lele. -- nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia. lele@metapensiero.it | -- Fortunato Depero, 1929.