Rewriting DISTINCT and losing performance - Mailing list pgsql-performance
From | Chuck D. |
---|---|
Subject | Rewriting DISTINCT and losing performance |
Date | |
Msg-id | 200705202228.32129.pgsql-performance@nullmx.com Whole thread Raw |
Responses |
Re: Rewriting DISTINCT and losing performance
Re: Rewriting DISTINCT and losing performance |
List | pgsql-performance |
Hi all, I know we've covered this before but I'm having trouble with it today. I have some geographic data in tables that I'm working with. I have a country, state and city table. I was selecting the country_name out of the country table but discovered that some countries (like Antarctica) didn't have cities in the city table. I resolved to query the country table for only country_name's which had country_id's in the city table - meaning the country had cities listed. The problem was I had a couple different sources (in separate tables) with some extraneous column data so I chose to consolidate the city tables from the different sources and column data that I don't need because I don't have the hardware to support it. That was the end of my query time. Here's the original table and query: # \d geo.world_city Table "geo.world_city" Column | Type | Modifiers ------------+------------------------+----------- city_id | integer | not null state_id | smallint | country_id | smallint | rc | smallint | latitude | numeric(9,7) | longitude | numeric(10,7) | dsg | character(5) | cc1 | character(2) | adm1 | character(2) | city_name | character varying(200) | Indexes: "world_city_pk" PRIMARY KEY, btree (city_id) "idx_world_city_cc1" btree (cc1) "idx_world_city_cc1_adm1" btree (cc1, adm1) "idx_world_city_country_id" btree (country_id) "idx_world_city_name_first_letter" btree (state_id, "substring"(lower(city_name::text), 1, 1)) "idx_world_city_state_id" btree (state_id) explain analyze SELECT country_id, country_name FROM geo.country WHERE country_id IN (select country_id FROM geo.world_city) ; QUERY PLAN ----------------------------------------------------------------------------- -------------------------------------------------------------------------- Nested Loop IN Join (cost=0.00..167.97 rows=155 width=15) (actual time=85.502..3479.449 rows=231 loops=1) -> Seq Scan on country (cost=0.00..6.44 rows=244 width=15) (actual time=0.089..0.658 rows=244 loops=1) -> Index Scan using idx_world_city_country_id on world_city (cost=0.00..8185.05 rows=12602 width=2) (actual time=14.250..14.250 rows=1 loops=244) Index Cond: (country.country_id = world_city.country_id) Total runtime: 3479.921 ms Odd that it took 3 seconds because every previous run has been much quicker. The next run was: QUERY PLAN ----------------------------------------------------------------------------- ------------------------------------------------------------------------ Nested Loop IN Join (cost=0.00..167.97 rows=155 width=15) (actual time=0.087..6.967 rows=231 loops=1) -> Seq Scan on country (cost=0.00..6.44 rows=244 width=15) (actual time=0.028..0.158 rows=244 loops=1) -> Index Scan using idx_world_city_country_id on world_city (cost=0.00..8185.05 rows=12602 width=2) (actual time=0.026..0.026 rows=1 loops=244) Index Cond: (country.country_id = world_city.country_id) Total runtime: 7.132 ms (5 rows) But that was irrelevant. I created a new table and eliminated the data and it looks like this: # \d geo.city Table "geo.city" Column | Type | Modifiers ------------+------------------------+----------- city_id | integer | not null state_id | smallint | country_id | smallint | latitude | numeric(9,7) | longitude | numeric(10,7) | city_name | character varying(100) | Indexes: "city_pk" PRIMARY KEY, btree (city_id) "idx_city_country_id" btree (country_id) CLUSTER Foreign-key constraints: "city_state_id_fk" FOREIGN KEY (state_id) REFERENCES geo.state(state_id) ON UPDATE CASCADE ON DELETE CASCADE explain analyze SELECT country_id, country_name FROM geo.country WHERE country_id IN (select country_id FROM geo.city) ; -- won't complete in a reasonable amount of time. This one won't use the country_id index. The two tables have almost the same number of rows: cmi=# select count(*) from geo.world_city; count --------- 1953314 (1 row) cmi=# select count(*) from geo.city; count --------- 2122712 (1 row) I tried to force it and didn't see any improvement. I've vacuummed, analyzed, clustered. Can someone help me to get only the countries who have cities in the city table in a reasonable amount of time? -------------------------------------------------------
pgsql-performance by date: