Thread: Problems with adding an IS NOT NULL clause to a query.
I reported this in the pgsql-general list and was instructed to send the
EXPLAIN ANALYZE outputs here.

I have this query; it runs reasonably quickly (but should be quicker, IMHO):

SELECT "consolidated_urls".* FROM "consolidated_urls"
INNER JOIN "topical_urls"
  ON "consolidated_urls".id = "topical_urls".consolidated_url_id
WHERE (("topical_urls".domain_id = 157)
  AND ((topical_urls.hidden = 'f')))
ORDER BY index_delta DESC LIMIT 10

Here is the EXPLAIN ANALYZE for it:

Limit  (cost=29903.44..29903.46 rows=10 width=1880) (actual time=44.730..44.730 rows=0 loops=1)
  ->  Sort  (cost=29903.44..29910.04 rows=2642 width=1880) (actual time=44.722..44.722 rows=0 loops=1)
        Sort Key: consolidated_urls.index_delta
        Sort Method:  quicksort  Memory: 17kB
        ->  Nested Loop  (cost=105.29..29846.34 rows=2642 width=1880) (actual time=44.639..44.639 rows=0 loops=1)
              ->  Bitmap Heap Scan on topical_urls  (cost=105.29..7494.32 rows=2642 width=4) (actual time=44.635..44.635 rows=0 loops=1)
                    Recheck Cond: (domain_id = 157)
                    Filter: (NOT hidden)
                    ->  Bitmap Index Scan on index_topical_urls_on_domain_id_and_consolidated_url_id  (cost=0.00..104.63 rows=2643 width=0) (actual time=44.629..44.629 rows=0 loops=1)
                          Index Cond: (domain_id = 157)
              ->  Index Scan using consolidated_urls_pkey on consolidated_urls  (cost=0.00..8.45 rows=1 width=1880) (never executed)
                    Index Cond: (consolidated_urls.id = topical_urls.consolidated_url_id)
Total runtime: 45.023 ms

I add another WHERE clause and it becomes completely unusable:

SELECT "consolidated_urls".* FROM "consolidated_urls"
INNER JOIN "topical_urls"
  ON "consolidated_urls".id = "topical_urls".consolidated_url_id
WHERE (("topical_urls".domain_id = 157)
  AND ((topical_urls.hidden = 'f')))
  AND (index_delta IS NOT NULL)
ORDER BY index_delta DESC LIMIT 10

Limit  (cost=0.00..20555.76 rows=10 width=1880) (actual time=3152032.072..3152032.072 rows=0 loops=1)
  ->  Nested Loop  (cost=0.00..5430830.93 rows=2642 width=1880) (actual time=3152032.065..3152032.065 rows=0 loops=1)
        ->  Index Scan Backward using index_consolidateds_url_on_index_delta on consolidated_urls  (cost=0.00..5316175.98 rows=15242 width=1880) (actual time=2928420.500..3151811.125 rows=14985 loops=1)
              Filter: (index_delta IS NOT NULL)
        ->  Index Scan using index_topical_urls_on_domain_id_and_consolidated_url_id on topical_urls  (cost=0.00..7.51 rows=1 width=4) (actual time=0.011..0.011 rows=0 loops=14985)
              Index Cond: ((topical_urls.domain_id = 157) AND (topical_urls.consolidated_url_id = consolidated_urls.id))
              Filter: (NOT topical_urls.hidden)
Total runtime: 3152066.335 ms
(8 rows)

I tried putting another indexed field in the query instead of index_delta,
but it didn't help. Hope this helps you guys figure out what the issue is.
Tim Uckun <timuckun@gmail.com> writes:
> I reported this in the pgsql-general list and was instructed to send
> the EXPLAIN ANALYZE outputs here.

This isn't a bug, it's just a poor choice of plan based on a bad
statistical estimate.  The planner is estimating that there are 2643
rows having domain_id = 157, when actually there are none whatsoever,
as can be seen here:

> ->  Bitmap Index Scan on index_topical_urls_on_domain_id_and_consolidated_url_id
>     (cost=0.00..104.63 rows=2643 width=0) (actual time=44.629..44.629 rows=0 loops=1)
>       Index Cond: (domain_id = 157)

Possibly the table's never been ANALYZEd ... do you have autovacuum
enabled?  If it has been analyzed reasonably recently, then it might be
necessary to crank up the statistics target to get a better estimate.
It's difficult to give detailed advice when you haven't mentioned what
PG version you're running.

			regards, tom lane
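One quick way to check whether the tables have been analyzed at all is the
standard statistics view (a sketch; the table names are taken from the plans
above):

SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('topical_urls', 'consolidated_urls');

NULL in both columns would mean the table has never been analyzed since the
statistics were last reset.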
Both queries use the same row estimate, and their costs are comparable,
but the execution times differ hugely: 0.044 s versus 3100 s. I think
the estimated cost of the backward index scan is too small.

On 1/15/11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tim Uckun <timuckun@gmail.com> writes:
>> I reported this in the pgsql-general list and was instructed to send
>> the EXPLAIN ANALYZE outputs here.
>
> This isn't a bug, it's just a poor choice of plan based on a bad
> statistical estimate.  The planner is estimating that there are 2643
> rows having domain_id = 157, when actually there are none whatsoever,
> as can be seen here:
>
>> ->  Bitmap Index Scan on index_topical_urls_on_domain_id_and_consolidated_url_id
>>     (cost=0.00..104.63 rows=2643 width=0) (actual time=44.629..44.629 rows=0 loops=1)
>>       Index Cond: (domain_id = 157)
>
> Possibly the table's never been ANALYZEd ... do you have autovacuum
> enabled?  If it has been analyzed reasonably recently, then it might be
> necessary to crank up the statistics target to get a better estimate.
> It's difficult to give detailed advice when you haven't mentioned what
> PG version you're running.
>
> 			regards, tom lane

------------
pasman
> Possibly the table's never been ANALYZEd ... do you have autovacuum
> enabled?  If it has been analyzed reasonably recently, then it might be
> necessary to crank up the statistics target to get a better estimate.
> It's difficult to give detailed advice when you haven't mentioned what
> PG version you're running.

I do have autovacuum enabled and I am running 8.4.

autovacuum = on
vacuum_cost_limit = 150
log_autovacuum_min_duration = 0
autovacuum_max_workers = 4
autovacuum_naptime = 5min
autovacuum_vacuum_cost_delay = 30ms
Tim Uckun <timuckun@gmail.com> writes:
>> Possibly the table's never been ANALYZEd ... do you have autovacuum
>> enabled?

> I do have autovacuum enabled and I am running 8.4.

Hmm, autovacuum *should* have been keeping track of things for you,
but it might still be worth doing a manual ANALYZE against that table
to see if the estimated rowcount changes.  If not, you'll need to raise
the statistics target for that column (and again ANALYZE).

			regards, tom lane
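A minimal sketch of that suggestion, assuming the table names from the
thread:

ANALYZE VERBOSE topical_urls;
ANALYZE VERBOSE consolidated_urls;

-- then re-check the planner's rowcount estimate for the problem predicate
EXPLAIN SELECT 1 FROM topical_urls WHERE domain_id = 157;

If the Bitmap Index Scan row still estimates a couple of thousand rows for a
domain_id that does not exist, raising the statistics target is the next step.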
> Hmm, autovacuum *should* have been keeping track of things for you,
> but it might still be worth doing a manual ANALYZE against that table
> to see if the estimated rowcount changes.  If not, you'll need to raise
> the statistics target for that column (and again ANALYZE).

I started a manual VACUUM ANALYZE on the table. It's been running for
about half an hour now. The table gets quite busy, so I am hoping there
is no real harm in letting it go to the end.
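For what it's worth, on 8.4 a long-running VACUUM ANALYZE can be watched from
another session (a sketch, using the pre-9.2 pg_stat_activity column names):

SELECT procpid, query_start, current_query
FROM pg_stat_activity
WHERE current_query LIKE 'VACUUM%';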
> Hmm, autovacuum *should* have been keeping track of things for you,
> but it might still be worth doing a manual ANALYZE against that table
> to see if the estimated rowcount changes.  If not, you'll need to raise
> the statistics target for that column (and again ANALYZE).

The ANALYZE finished. I re-ran the EXPLAIN ANALYZE, and the query was
still taking a very long time. I stopped it eventually.
On Sun, Jan 16, 2011 at 5:47 PM, Tim Uckun <timuckun@gmail.com> wrote:
>> Hmm, autovacuum *should* have been keeping track of things for you,
>> but it might still be worth doing a manual ANALYZE against that table
>> to see if the estimated rowcount changes.  If not, you'll need to raise
>> the statistics target for that column (and again ANALYZE).
>
> The ANALYZE finished. I re-ran the EXPLAIN ANALYZE, and the query was
> still taking a very long time. I stopped it eventually.

Hmm.  What do you get for:

SELECT relname, pg_relation_size(oid), reltuples, relpages
FROM pg_class
WHERE relname IN ('consolidated_urls', 'consolidated_urls_pkey');

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> Hmm.  What do you get for:
>
> SELECT relname, pg_relation_size(oid), reltuples, relpages
> FROM pg_class
> WHERE relname IN ('consolidated_urls', 'consolidated_urls_pkey');

        relname         | pg_relation_size |  reltuples  | relpages
------------------------+------------------+-------------+----------
 consolidated_urls      |       1303060480 | 1.80192e+06 |   159065
 consolidated_urls_pkey |        114745344 | 1.80192e+06 |    14007
(2 rows)
On Mon, Jan 17, 2011 at 8:23 AM, Tim Uckun <timuckun@gmail.com> wrote:
>> Hmm.  What do you get for:
>>
>> SELECT relname, pg_relation_size(oid), reltuples, relpages
>> FROM pg_class
>> WHERE relname IN ('consolidated_urls', 'consolidated_urls_pkey');
>
>         relname         | pg_relation_size |  reltuples  | relpages
> ------------------------+------------------+-------------+----------
>  consolidated_urls      |       1303060480 | 1.80192e+06 |   159065
>  consolidated_urls_pkey |        114745344 | 1.80192e+06 |    14007
> (2 rows)

Hmm, I was thinking that the table might be bloated, but it's not
obvious from this. I think you might have the same general kind of
problem reported here:

http://archives.postgresql.org/pgsql-bugs/2009-09/msg00259.php

and here:

http://archives.postgresql.org/pgsql-performance/2010-04/msg00139.php

...and there are other reports as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tim Uckun <timuckun@gmail.com> writes:
>         relname         | pg_relation_size |  reltuples  | relpages
> ------------------------+------------------+-------------+----------
>  consolidated_urls      |       1303060480 | 1.80192e+06 |   159065
>  consolidated_urls_pkey |        114745344 | 1.80192e+06 |    14007

With a table that large, you're probably going to need a larger stats
target in order to get reasonable estimates for low-frequency values.
Am I right in guessing that pg_stats.n_distinct is much too low for
the domain_id column?

			regards, tom lane
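That guess is easy to check from the statistics view (a sketch; domain_id
lives on topical_urls, per the thread):

SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename = 'topical_urls' AND attname = 'domain_id';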
> With a table that large, you're probably going to need a larger stats
> target in order to get reasonable estimates for low-frequency values.
> Am I right in guessing that pg_stats.n_distinct is much too low for
> the domain_id column?

The domain_id is in the topical_urls table. A SELECT COUNT of domains
shows that there are 700 domains, while pg_stats shows 170, which seems
kind of low but maybe is not out of bounds, because most domains are
probably inactive.

pg_stats on the consolidated_urls table shows a very low value for
index_delta. There are 1.8 million records, and the n_distinct for
index_delta is 151. It's a floating-point number, so that seems weird.
For the index column it's 818. Both of those numbers are wildly wrong,
I would think.
Tim Uckun <timuckun@gmail.com> writes:
>> Am I right in guessing that pg_stats.n_distinct is much too low for
>> the domain_id column?

> The domain_id is in the topical_urls table. A SELECT COUNT of domains
> shows that there are 700 domains, while pg_stats shows 170, which seems
> kind of low but maybe is not out of bounds, because most domains are
> probably inactive.

It sounds like you've got one of those long-tail distributions where
there are lots and lots of the first couple hundred domains, and not
many at all of the last few.  The problem with the stats as you've got
them is that the planner isn't aware of the long tail, so for this
specific domain id that's not even there at all, you're getting an
estimate of a couple of thousand matches --- which is why it goes for
the indexscan-according-to-ORDER-BY plan.  It's figuring it will hit
one of those matches and be able to end the scan after reading much
less than all of the table.

As I've stated repeatedly, your next move needs to be to increase the
stats target, at least for that column if not globally.  You probably
don't need to have it know about every last domain id, but you need to
have it know about enough that it realizes that domains not included in
the MCV list are going to appear less than a couple of thousand times.

			regards, tom lane
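The MCV (most-common-values) list Tom mentions is visible in the same view
(a sketch; table and column names as above):

SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE tablename = 'topical_urls' AND attname = 'domain_id';

For a value not present in most_common_vals, the planner roughly spreads the
leftover frequency evenly across the remaining n_distinct values, so an
n_distinct estimate that is too small inflates the rowcount guess for rare
or absent domains.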
> As I've stated repeatedly, your next move needs to be to increase the
> stats target, at least for that column if not globally.  You probably
> don't need to have it know about every last domain id, but you need to
> have it know about enough that it realizes that domains not included in
> the MCV list are going to appear less than a couple of thousand times.

OK, how do I go about doing this? More importantly, how do I go about
making sure the autovacuum process does this? Thanks, and please excuse
my dumb questions.
Tim Uckun <timuckun@gmail.com> writes:
>> As I've stated repeatedly, your next move needs to be to increase the
>> stats target, at least for that column if not globally.

> OK, how do I go about doing this?

If you want to do it globally for the whole database: change
default_statistics_target in postgresql.conf.  If you just want to
affect the one column, use "ALTER TABLE tab ALTER COLUMN col SET
STATISTICS n".

			regards, tom lane
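Applied to this thread's schema, that would look something like the
following (a sketch; the target of 1000 is an illustrative value, not a
recommendation from the thread, and 8.4 accepts targets up to 10000):

-- raise the per-column statistics target
ALTER TABLE topical_urls ALTER COLUMN domain_id SET STATISTICS 1000;

-- rebuild the statistics right away instead of waiting for autovacuum
ANALYZE topical_urls;

The per-column setting is stored on the column itself, so both a manual
ANALYZE and autovacuum's autoanalyze honor it from then on, which also
answers the question about keeping autovacuum in line.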