Expected accuracy of planner statistics - Mailing list pgsql-general
From | Casey Duncan |
---|---|
Subject | Expected accuracy of planner statistics |
Date | |
Msg-id | 9C9F5092-98D5-4711-AF7A-C97E81DF8CFE@pandora.com Whole thread Raw |
Responses |
Re: Expected accuracy of planner statistics
Re: Expected accuracy of planner statistics |
List | pgsql-general |
I have some databases that have grown significantly over time (as databases do). As the databases have grown, I have noticed that the statistics have grown less and less accurate. In particular, the n_distinct values have become many OOM too small for certain foreign key columns. Predictably this leads to poor query plans. The databases in question were all using the default stats target value, so naturally the thing to do is to increase that and see what happens. First I'll show you one table in question: qa_full=# \d fk Table "public.fk" Column | Type | Modifiers --------------+-----------------------------+--------------- fk_id | bigint | not null st_id | bigint | not null is_positive | boolean | not null mc_id | character varying(20) | not null matching_seed | character varying(20) | ft_id | character varying(20) | s_title | text | not null a_summary | text | not null date_created | timestamp without time zone | default now() qx_id | bigint | Indexes: "fk_pkey" PRIMARY KEY, btree (fk_id) "fk_st_mc_id_idx" UNIQUE, btree (st_id, mc_id) "fk_date_created_is_positive_idx" btree (is_positive, date_created) "fk_st_id_idx" btree (st_id) Foreign-key constraints: "fk_qx_id_fkey" FOREIGN KEY (qx_id) REFERENCES st(st_id) ON DELETE RESTRICT "fk_st_id_fkey" FOREIGN KEY (st_id) REFERENCES st(st_id) ON DELETE RESTRICT qa_full=# select count(*) from fk; count ----------- 195555889 Here are the n_distinct stats on the st_id column with stock stats settings: qa_full=# select n_distinct from pg_stats where tablename='fk' and attname='st_id''; attname | n_distinct -----------+------------- st_id | 14910 here's the actual distinct count: qa_full=# select count(distinct st_id) from fk; count ---------- 15191387 (1 row) Here's what it looks like after turning the stats target up to 100: qa_full=# select n_distinct from pg_stats where tablename='fk' and attname='st_id''; attname | n_distinct -----------+------------- st_id | 136977 Still way off (3 OOM), so let's pull out the stops and go for 1000: qa_full=# select n_distinct from pg_stats where tablename='fk' and attname='st_id''; attname | n_distinct -----------+------------- st_id | 860796 Better, but still way off. Here's more of the pg_stats row for the curious with the stats target at 1000: schemaname | public tablename | fk attname | st_id null_frac | 0 avg_width | 8 n_distinct | 860796 most_common_vals | {9822972459012807,81553350123749183,50260420266636724,16953859416556337, 57992478091506908,6789385517968759,13155841310992808,4649594156182905,11 950505984130111,19815690615418387,23232929805154508,24940819255590358,25 304517086243633,30084673952005845,33845252828401578,36510232790970904,44 301350711321256,47572440754042499,66302045808587415,106949745150210138,7 948257888859857,11709841786637953,12034360925626832,17311819170902574,21 933556169120032,31401742852411043,37178443803282644,39714175315169346,42 699954975194688,63648700912541567,73785794393665562,...many elided..} most_common_freqs | {7.33333e-05,6.66667e-05,5.33333e-05,5e-05,5e-05,4.66667e-05,4.66667e-05 , 4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05, 4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05, 4.33333e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05, 4e-05,4e-05,4e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.6666 7e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.6666 7e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.33333e-05,3.33333e-05,3.3333 3e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.3333 3e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.3333 3e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.3333 3e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3e-05,3e-05,3e-05, 3e-05,3e-05,3e-05,3e-05,3e-05,3e-05,..many elided..} histogram_bounds | {9474697855526,186642098833097,425502410065792,655064117100237,917344884 999940,1135224280975580,1510900775316064,1919850381534192,23918286327044 65,2773745714634569,3197981109338899,3601128214604953,3887435029566307,4 289757501117626,4604286546172963,5030605000015434,5410915764179364,57126 62986537560,6096452674229658,6531206443844232,6761515475182966,692428185 0823004,7145897868348599,7357502317108796,7537560231072453,7737194605867 515,7923617661480232,8094845122681350,8304911973154200,8504211340608556, 8735469559703009,9008968782181381,9233161779966219,..many elided..} correlation | 0.770339 The correlation is likely high here because this table has been clustered on this column in the past. I don't know if that contributes to the n_distinct inaccuracy, I don't know if I have the patience to reorder the table to find out ;^) Note that new st_ids are also being added all the time, at a rate roughly proportional to fk rows (fk rows being added more frequently). So actually a fractional value for the n_distinct here would be more ideal. The docs hint that analyze will sometimes decide to use a fractional (negative) value. What triggers that? I was also trying to figure out how big the sample really is. Does a stats target of 1000 mean 1000 rows sampled? If the sample really is a fixed number of rows, it would seem to my naive eyes that sampling a fraction of the rows (like 0.1% or something) would be better (especially in cases like this), but maybe it already tries to do that. Any insights appreciated. -Casey
pgsql-general by date: