Thread: single table - fighting a seq scan

single table - fighting a seq scan
From: Radoslav Nedyalkov
Hi Forum,
I'm scratching my head over the following case:

te is an 80M-row, 100GB table. The query is a plain select over an indexed attribute of it.

EXPLAIN SELECT te.id FROM te WHERE te.current_pid IN (240900026,
 240900027,
 240900028,
 -- 200 entries ...

 Gather  (cost=1000.00..61517367.85 rows=3870 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on te  (cost=0.00..61515980.85 rows=1612 width=8)
         Filter: (current_pid = ANY ('{240900026,240900027,...240901129}'::bigint[]))
Execution time is about 5 minutes.


Reducing the number of current_pid values to 100 changes the plan to an index scan (101 entries still produces a seq scan).

 Index Scan using te_current_pid_idx on te  (cost=0.57..731.26 rows=3832 width=8) (actual time=0.566..1.667 rows=600 loops=1)
   Index Cond: (current_pid = ANY ('{240900026,240900027,...240900194}'::bigint[]))
 Planning Time: 3.152 ms
 Execution Time: 1.732 ms


Selecting the 200 pids rewritten with a CTE goes for the index too.

EXPLAIN ANALYZE
WITH cte AS (
  SELECT * FROM unnest(ARRAY[
    240900026,
    240900027,
    240900028,
    ...
    240901129
  ])
)
SELECT te.id FROM te JOIN cte ON te.current_pid = cte.unnest;

                                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1.58..1097.83 rows=3847 width=8) (actual time=0.882..14.927 rows=1468 loops=1)
   CTE cte
     ->  Function Scan on unnest  (cost=0.00..1.00 rows=100 width=4) (actual time=0.025..0.043 rows=205 loops=1)
   ->  CTE Scan on cte  (cost=0.00..2.00 rows=100 width=4) (actual time=0.027..0.083 rows=205 loops=1)
   ->  Index Scan using te_current_pid_idx on te  (cost=0.57..10.57 rows=38 width=16) (actual time=0.011..0.071 rows=7 loops=205)
         Index Cond: (current_pid = cte.unnest)
 Planning Time: 2.022 ms
 Execution Time: 15.044 ms



I tried random_page_cost=1 and a couple of combinations of very low
cpu_index_tuple_cost and cpu_operator_cost; that only managed to get an index scan for a few more IN entries.
I ran ANALYZE and bumped the statistics target for current_pid to 5000. It did not help.
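Roughly what I ran, from memory (the exact low cost values varied, so treat this as a sketch rather than the exact session):

SET random_page_cost = 1;
SET cpu_index_tuple_cost = 0.001;  -- default 0.005; tried a couple of low values
SET cpu_operator_cost = 0.001;     -- default 0.0025; likewise
ALTER TABLE te ALTER COLUMN current_pid SET STATISTICS 5000;
ANALYZE te;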

I'm out of ideas. What is the right approach to solve this?
Thank You!

Rado

Re: single table - fighting a seq scan
From: Michael Lewis
rows=3832
rows=3870

Your estimate changed very little when you included 100 values vs 200 values. That is interesting to me.

What does the query below give you? How many of those 200 values are found in the MCV list? If n_distinct is low, most of the values are NOT in the most-common-values list, and the fraction of the table covered by the MCVs is also low, then the planner will expect each of the 200 values to represent some deceptively high portion of the table.

You said there are 80 million rows, yes? It seems likely that n_distinct and the MCV list are not giving information that correlates well with reality. You may want to increase the minimum table size for a sequential scan to kick in; I cannot recall the name of that setting at the moment. You may also want to increase the stats target on that column at least, analyze, and explain the query again.

SELECT
  (SELECT SUM(x) FROM UNNEST(most_common_freqs) x) AS frac_mcv,
  tablename,
  attname,
  inherited,
  null_frac,
  n_distinct,
  array_length(most_common_vals, 1) AS n_mcv,
  array_length(histogram_bounds, 1) AS n_hist,
  correlation,
  *
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'te'
  AND attname = 'current_pid';
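To count how many of your 200 values fall into the MCV list, something along these lines should work (a rough sketch; the double cast is needed because most_common_vals is stored as anyarray):

SELECT count(*)
FROM unnest((SELECT most_common_vals::text::bigint[]
             FROM pg_stats
             WHERE schemaname = 'public'
               AND tablename = 'te'
               AND attname = 'current_pid')) AS mcv(val)
WHERE val IN (240900026, 240900027, 240900028 /* ... the 200 entries ... */);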

Re: single table - fighting a seq scan
From: Radoslav Nedyalkov
Hi Michael,
The full output from the query is attached; here is a version with the lists truncated.

frac_mcv               | 0.00306267
tablename              | te
attname                | current_pid
inherited              | f
null_frac              | 0.261823
n_distinct             | 1.59236e+07
n_mcv                  | 560
n_hist                 | 5001
correlation            | 0.995502
schemaname             | public
tablename              | te
attname                | current_pid
inherited              | f
null_frac              | 0.261823
avg_width              | 8
n_distinct             | 1.59236e+07
most_common_vals       | {15026003,24951186,220707698,223344198,224236736,224355865,224359830,224371584,224380154,224382722,224388639,224390209,224394943,224396835,228259607,232148477,232173137,232379194,232729185,232913953,236304699,236797618,237355501,238860629,239082658}
most_common_freqs      | {4.46667e-05,1.93333e-05,4.66667e-06,4.66667e-06,...,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06,4e-06}
histogram_bounds       | {13426761,13467316,13510844,..239215632,239302648,239371125,239466529,239532095,239571468,239611801,239634487}
correlation            | 0.995502
most_common_elems      | (null)
most_common_elem_freqs | (null)
elem_count_histogram   | (null)


Thank you very much for the response.
I'll try the settings you propose.

Rado


Re: single table - fighting a seq scan
From: Radoslav Nedyalkov
Ah, I may have messed up the examples I gave; the row numbers are different.
Here are the plans once again, sorry about that.

-- 200 entries

 Gather  (cost=1000.00..106905910.97 rows=7893 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on te  (cost=0.00..106904121.67 rows=3289 width=8)
         Filter: (current_pid = ANY ('{240900026,240900027,240900028,240900029,240900030,240900031,240900032,240900033,240900034,240900035,240900036,240900037,240900038,240900039,240900040,240900041,240900042,240900043,240900044,240900045,240900046,240900047,240900048,240900049,240900050,240900051,240900052,240900053,240900054,240900055,240900056,240900057,240900058,240900059,240900060,240900061,240900062,240900063,240900064,240900065,240900066,240900067,240900068,240900069,240900070,240900071,240900072,240900073,240900074,240900075,240900076,240900077,240900078,240900079,240900080,240900081,240900082,240900083,240900084,240900085,240900086,240900165,240900087,240900166,240900088,240900167,240900089,240900168,240900090,240900169,240900091,240900170,240900092,240900171,240900093,240900172,240900094,240900173,240900905,240900174,240900175,240900176,240900177,240900178,240900179,240900180,240900181,240900182,240900183,240900184,240900185,240900186,240900187,240900188,240900189,240900190,240900191,240900192,240900193,240900194,240900195,240900196,240900197,240900198,240900199,240900906,240900907,240900908,240900909,240900910,240900911,240900912,240900913,240900914,240900915,240900916,240900917,240900918,240900919,240900920,240900921,240900922,240900923,240900924,240900925,240900926,240900927,240900928,240901048,240901053,240901054,240901055,240901056,240901057,240901058,240901059,240901060,240901061,240901062,240901063,240901064,240901065,240901066,240901067,240901068,240901069,240901070,240901071,240901072,240901073,240901074,240901075,240901076,240901077,240901078,240901079,240901080,240901081,240901082,240901083,240901084,240901085,240901086,240901087,240901088,240901089,240901090,240901091,240901092,240901093,240901094,240901095,240901096,240901097,240901098,240901099,240901100,240901101,240901102,240901103,240901104,240901105,240901106,240901107,240901108,240901109,240901110,240901111,240901112,240901113,240901114,240901115,240901116,240901117,240901118,240901119,240901120,240901121,240901122,240901123,240901124,240901125,240901126,240901127,240901128,240901129}'::bigint[]))
(4 rows)

Time: 5.261 ms
========================================================================



-- 100 entries

 Index Scan using te_current_pid_idx on te  (cost=0.57..731.26 rows=3832 width=8) (actual time=1.244..15.897 rows=600 loops=1)
   Index Cond: (current_pid = ANY ('{240900026,240900027,240900028,240900029,240900030,240900031,240900032,240900033,240900034,240900035,240900036,240900037,240900038,240900039,240900040,240900041,240900042,240900043,240900044,240900045,240900046,240900047,240900048,240900049,240900050,240900051,240900052,240900053,240900054,240900055,240900056,240900057,240900058,240900059,240900060,240900061,240900062,240900063,240900064,240900065,240900066,240900067,240900068,240900069,240900070,240900071,240900072,240900073,240900074,240900075,240900076,240900077,240900078,240900079,240900080,240900081,240900082,240900083,240900084,240900085,240900086,240900165,240900087,240900166,240900088,240900167,240900089,240900168,240900090,240900169,240900091,240900170,240900092,240900171,240900093,240900172,240900094,240900173,240900905,240900174,240900175,240900176,240900177,240900178,240900179,240900180,240900181,240900182,240900183,240900184,240900185,240900186,240900187,240900188,240900189,240900190,240900191,240900192,240900193,240900194}'::bigint[]))
 Planning Time: 3.430 ms
 Execution Time: 15.954 ms
(4 rows)

Time: 20.257 ms




Re: single table - fighting a seq scan
From: Tom Lane
Radoslav Nedyalkov <rnedyalkov@gmail.com> writes:
> Ah, I may have messed up the examples I gave; the row numbers are different.
> Here are the plans once again, sorry about that.

Given that it works at 100 entries and not 101, I can't escape the
suspicion that you're being burnt by predtest.c's MAX_SAOP_ARRAY_SIZE
limit.  However, that only affects the planner's willingness to make
constraint proofs involving the large IN clause, and nothing you've
mentioned here explains why such a proof would be needed.  Is there
something you're not telling us about this table's schema?  (I'm
wondering if the index is partial, for instance, though one would
think that the CTE form of the query wouldn't work either if so.)
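(A quick way to check that, for instance, using the index name from your plans:

SELECT indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND indexname = 'te_current_pid_idx';
)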

            regards, tom lane



Re: single table - fighting a seq scan
From: Radoslav Nedyalkov
Shame on me. It's a partial index - WHERE current_pid IS NOT NULL.
With the IS NOT NULL predicate added to the query, the planner always goes for the index (tested with thousands of IN entries).
The CTE version always goes for the index even without IS NOT NULL, which led to a slight confusion.
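For the record, the form that now uses the index looks roughly like this (IN list abbreviated as above):

EXPLAIN
SELECT te.id
FROM te
WHERE te.current_pid IN (240900026, 240900027, 240900028 /* ... ~200 entries ... */)
  AND te.current_pid IS NOT NULL;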

Thanks Tom, Michael,
Best
Rado


Re: single table - fighting a seq scan
From: Tom Lane
Radoslav Nedyalkov <rnedyalkov@gmail.com> writes:
> Shame on me. It's a partial index - WHERE current_pid IS NOT NULL.
> With the IS NOT NULL predicate added to the query, the planner always goes
> for the index (tested with thousands of IN entries).
> The CTE version always goes for the index even without IS NOT NULL, which
> led to a slight confusion.

Ah.  That's actually something we fixed in v12 (see [1]).  In the CTE
version, the planner can prove "x is not null" from "x = cte_value" even
without knowing what the CTE output value is, just on the basis that "="
is strict.  In the IN form, it's likewise possible to prove "x is not
null" from "x IN (list)", but you need a special test to recognize that.
With a short IN list, the planner converts IN to "x = this OR x = that
OR x = the-other ..."  and can make the proof from that formulation.
But we prevent it from trying that on long IN lists, because it'd eat
lots of cycles and perhaps not be able to prove the desired partial index
qual anyway.

            regards, tom lane

[1] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=65ce07e02