Thread: Odd behavior with indices
I've got a datavalue
table with ~200M rows or so, with indices on both site_id
and parameter_id
. I need to execute queries like "return all sites with data" and "return all parameters with data". The site
table has only 200 rows or so, and the parameter
table has only 100 or so rows.
The site
query is fast and uses the index:
EXPLAIN ANALYZE
select *
from site
where exists ( select 1 from datavalue where datavalue.site_id = site.id limit 1
);
Seq Scan on site (cost=0.00..64.47 rows=64 width=113) (actual time=0.046..1.106 rows=89 loops=1) Filter: (SubPlan 1) Rows Removed by Filter: 39 SubPlan 1 -> Limit (cost=0.44..0.47 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=128) -> Index Only Scan using ix_datavalue_site_id on datavalue (cost=0.44..8142.71 rows=248930 width=0) (actual time=0.008..0.008 rows=1 loops=128) Index Cond: (site_id = site.id) Heap Fetches: 0
Planning time: 0.361 ms
Execution time: 1.149 ms
The same query for parameters is rather slow and does NOT use the index:
EXPLAIN ANALYZE
select *
from parameter
where exists ( select 1 from datavalue where datavalue.parameter_id = parameter.id limit 1
);
Seq Scan on parameter (cost=0.00..20.50 rows=15 width=2648) (actual time=2895.972..21331.701 rows=15 loops=1) Filter: (SubPlan 1) Rows Removed by Filter: 6 SubPlan 1 -> Limit (cost=0.00..0.34 rows=1 width=0) (actual time=1015.790..1015.790 rows=1 loops=21) -> Seq Scan on datavalue (cost=0.00..502127.10 rows=1476987 width=0) (actual time=1015.786..1015.786 rows=1 loops=21) Filter: (parameter_id = parameter.id) Rows Removed by Filter: 7739355
Planning time: 0.123 ms
Execution time: 21331.736 ms
What the deuce is going on here? Alternatively, whats a good way to do this?
Any help/guidance appreciated!
Some of the table description:
\d datavalue
id BIGINT DEFAULT nextval('datavalue_id_seq'::regclass) NOT NULL,
value DOUBLE PRECISION NOT NULL,
site_id INTEGER NOT NULL,
parameter_id INTEGER NOT NULL,
deployment_id INTEGER,
instrument_id INTEGER,
invalid BOOLEAN,
Indexes: "datavalue_pkey" PRIMARY KEY, btree (id) "datavalue_datetime_utc_site_id_parameter_id_instrument_id_key" UNIQUE CONSTRAINT, btree (datetime_utc, site_id, parameter_id, instrument_id) "ix_datavalue_instrument_id" btree (instrument_id) "ix_datavalue_parameter_id" btree (parameter_id) "ix_datavalue_site_id" btree (site_id) "tmp_idx" btree (site_id, datetime_utc)
Foreign-key constraints: "datavalue_instrument_id_fkey" FOREIGN KEY (instrument_id) REFERENCES instrument(id) ON UPDATE CASCADE ON DELETE CASCADE "datavalue_parameter_id_fkey" FOREIGN KEY (parameter_id) REFERENCES parameter(id) ON UPDATE CASCADE ON DELETE CASCADE "datavalue_site_id_fkey" FOREIGN KEY (site_id) REFERENCES coastal.site(id) ON UPDATE CASCADE ON DELETE CASCADE "datavalue_statistic_type_id_fkey"
Also available on S.O.:I've got a
datavalue
table with ~200M rows or so, with indices on bothsite_id
andparameter_id
. I need to execute queries like "return all sites with data" and "return all parameters with data". Thesite
table has only 200 rows or so, and theparameter
table has only 100 or so rows.The
site
query is fast and uses the index:EXPLAIN ANALYZE select * from site where exists ( select 1 from datavalue where datavalue.site_id = site.id limit 1 ); Seq Scan on site (cost=0.00..64.47 rows=64 width=113) (actual time=0.046..1.106 rows=89 loops=1) Filter: (SubPlan 1) Rows Removed by Filter: 39 SubPlan 1 -> Limit (cost=0.44..0.47 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=128) -> Index Only Scan using ix_datavalue_site_id on datavalue (cost=0.44..8142.71 rows=248930 width=0) (actual time=0.008..0.008 rows=1 loops=128) Index Cond: (site_id = site.id) Heap Fetches: 0 Planning time: 0.361 ms Execution time: 1.149 ms
The same query for parameters is rather slow and does NOT use the index:
EXPLAIN ANALYZE select * from parameter where exists ( select 1 from datavalue where datavalue.parameter_id = parameter.id limit 1 ); Seq Scan on parameter (cost=0.00..20.50 rows=15 width=2648) (actual time=2895.972..21331.701 rows=15 loops=1) Filter: (SubPlan 1) Rows Removed by Filter: 6 SubPlan 1 -> Limit (cost=0.00..0.34 rows=1 width=0) (actual time=1015.790..1015.790 rows=1 loops=21) -> Seq Scan on datavalue (cost=0.00..502127.10 rows=1476987 width=0) (actual time=1015.786..1015.786 rows=1 loops=21) Filter: (parameter_id = parameter.id) Rows Removed by Filter: 7739355 Planning time: 0.123 ms Execution time: 21331.736 ms
What the deuce is going on here? Alternatively, whats a good way to do this?
Any help/guidance appreciated!
Some of the table description:
\d datavalue
id BIGINT DEFAULT nextval('datavalue_id_seq'::regclass) NOT NULL, value DOUBLE PRECISION NOT NULL, site_id INTEGER NOT NULL, parameter_id INTEGER NOT NULL, deployment_id INTEGER, instrument_id INTEGER, invalid BOOLEAN, Indexes: "datavalue_pkey" PRIMARY KEY, btree (id) "datavalue_datetime_utc_site_id_parameter_id_instrument_id_key" UNIQUE CONSTRAINT, btree (datetime_utc, site_id, parameter_id, instrument_id) "ix_datavalue_instrument_id" btree (instrument_id) "ix_datavalue_parameter_id" btree (parameter_id) "ix_datavalue_site_id" btree (site_id) "tmp_idx" btree (site_id, datetime_utc) Foreign-key constraints: "datavalue_instrument_id_fkey" FOREIGN KEY (instrument_id) REFERENCES instrument(id) ON UPDATE CASCADE ON DELETE CASCADE "datavalue_parameter_id_fkey" FOREIGN KEY (parameter_id) REFERENCES parameter(id) ON UPDATE CASCADE ON DELETE CASCADE "datavalue_site_id_fkey" FOREIGN KEY (site_id) REFERENCES coastal.site(id) ON UPDATE CASCADE ON DELETE CASCADE "datavalue_statistic_type_id_fkey"
Here's the distribution of parameter_id's
select count(parameter_id), parameter_id from datavalue group by parameter_id
88169 14
2889171 8
15805 17
8570 12
4257262 21
3947049 15
1225902 2
4091090 3
103877 10
633764 11
994442 18
49232 20
14935 4
563638 13
2955919 7
Also available on S.O.:I've got a
datavalue
table with ~200M rows or so, with indices on bothsite_id
andparameter_id
. I need to execute queries like "return all sites with data" and "return all parameters with data". Thesite
table has only 200 rows or so, and theparameter
table has only 100 or so rows.The
site
query is fast and uses the index:EXPLAIN ANALYZE select * from site where exists ( select 1 from datavalue where datavalue.site_id = site.id limit 1 ); Seq Scan on site (cost=0.00..64.47 rows=64 width=113) (actual time=0.046..1.106 rows=89 loops=1) Filter: (SubPlan 1) Rows Removed by Filter: 39 SubPlan 1 -> Limit (cost=0.44..0.47 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=128) -> Index Only Scan using ix_datavalue_site_id on datavalue (cost=0.44..8142.71 rows=248930 width=0) (actual time=0.008..0.008 rows=1 loops=128) Index Cond: (site_id = site.id) Heap Fetches: 0 Planning time: 0.361 ms Execution time: 1.149 ms
The same query for parameters is rather slow and does NOT use the index:
EXPLAIN ANALYZE select * from parameter where exists ( select 1 from datavalue where datavalue.parameter_id = parameter.id limit 1 ); Seq Scan on parameter (cost=0.00..20.50 rows=15 width=2648) (actual time=2895.972..21331.701 rows=15 loops=1) Filter: (SubPlan 1) Rows Removed by Filter: 6 SubPlan 1 -> Limit (cost=0.00..0.34 rows=1 width=0) (actual time=1015.790..1015.790 rows=1 loops=21) -> Seq Scan on datavalue (cost=0.00..502127.10 rows=1476987 width=0) (actual time=1015.786..1015.786 rows=1 loops=21) Filter: (parameter_id = parameter.id) Rows Removed by Filter: 7739355 Planning time: 0.123 ms Execution time: 21331.736 ms
What the deuce is going on here? Alternatively, whats a good way to do this?
Any help/guidance appreciated!
Some of the table description:
\d datavalue
id BIGINT DEFAULT nextval('datavalue_id_seq'::regclass) NOT NULL, value DOUBLE PRECISION NOT NULL, site_id INTEGER NOT NULL, parameter_id INTEGER NOT NULL, deployment_id INTEGER, instrument_id INTEGER, invalid BOOLEAN, Indexes: "datavalue_pkey" PRIMARY KEY, btree (id) "datavalue_datetime_utc_site_id_parameter_id_instrument_id_key" UNIQUE CONSTRAINT, btree (datetime_utc, site_id, parameter_id, instrument_id) "ix_datavalue_instrument_id" btree (instrument_id) "ix_datavalue_parameter_id" btree (parameter_id) "ix_datavalue_site_id" btree (site_id) "tmp_idx" btree (site_id, datetime_utc) Foreign-key constraints: "datavalue_instrument_id_fkey" FOREIGN KEY (instrument_id) REFERENCES instrument(id) ON UPDATE CASCADE ON DELETE CASCADE "datavalue_parameter_id_fkey" FOREIGN KEY (parameter_id) REFERENCES parameter(id) ON UPDATE CASCADE ON DELETE CASCADE "datavalue_site_id_fkey" FOREIGN KEY (site_id) REFERENCES coastal.site(id) ON UPDATE CASCADE ON DELETE CASCADE "datavalue_statistic_type_id_fkey"
I'm not great with the details but the short answer - aside from the fact that you should consider increasing the statistics on these columns - is that at a certain point querying the index and then subsequently checking the table for visibility is more expensive than simply scanning and then discarding the extra rows.The fact that you could perform an INDEX ONLY scan in the first query makes that cost go away since no subsequent heap check is required. In the parameters query the planner thinks it needs 1.5 million of the rows and will have to check each of them for visibility. It decided that scanning the entire table was more efficient.The LIMIT 1 in both queries should not be necessary. The planner is smart enough to stop once it finds what it is looking for. In fact the LIMIT's presence may be a contributing factor...but I cannot say for sure.A better query seems like it would be:WITH active_sites AS (SELECT DISTINCT site_id FROM datavalues;)SELECT *FROM sitesJOIN active_sites USING (site_id);David J.
Here's the distribution of parameter_id's
select count(parameter_id), parameter_id from datavalue group by parameter_id 88169 14 2889171 8 15805 17 8570 12 4257262 21 3947049 15 1225902 2 4091090 3 103877 10 633764 11 994442 18 49232 20 14935 4 563638 13 2955919 7
Em 26 de fev de 2016 4:44 PM, "joe meiring" <josephmeiring@gmail.com> escreveu:
>
> The same query for parameters is rather slow and does NOT use the index:
>
> EXPLAIN ANALYZE
> select *
> from parameter
> where exists (
> select 1 from datavalue
> where datavalue.parameter_id = parameter.id limit 1
> );
>
Please, could you execute both queries without the LIMIT 1 and show us the plans?
LIMIT in the inner query is like a fence and it caps some optimizations available for EXISTS, you'd better avoid it and see if you get a proper semi-join plan then.
Regards.
Matheus de Oliveira <matioli.matheus@gmail.com> writes: > Em 26 de fev de 2016 4:44 PM, "joe meiring" <josephmeiring@gmail.com> > escreveu: >> The same query for parameters is rather slow and does NOT use the index: >> >> EXPLAIN ANALYZE >> select * >> from parameter >> where exists ( >> select 1 from datavalue >> where datavalue.parameter_id = parameter.id limit 1 >> ); > Please, could you execute both queries without the LIMIT 1 and show us the > plans? > LIMIT in the inner query is like a fence and it caps some optimizations > available for EXISTS, you'd better avoid it and see if you get a proper > semi-join plan then. FWIW, PG >= 9.5 will ignore a LIMIT 1 inside an EXISTS, so that you get the same plan with or without it. But that does act as an optimization fence in earlier releases. regards, tom lane
On Mon, Feb 29, 2016 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > FWIW, PG >= 9.5 will ignore a LIMIT 1 inside an EXISTS, so that you get > the same plan with or without it. But that does act as an optimization > fence in earlier releases. Does 'offset 0' still work as it did? merlin
Merlin Moncure <mmoncure@gmail.com> writes: > On Mon, Feb 29, 2016 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> FWIW, PG >= 9.5 will ignore a LIMIT 1 inside an EXISTS, so that you get >> the same plan with or without it. But that does act as an optimization >> fence in earlier releases. > Does 'offset 0' still work as it did? Yes. regards, tom lane