Rewriting DISTINCT and losing performance - Mailing list pgsql-performance

From Chuck D.
Subject Rewriting DISTINCT and losing performance
Date
Msg-id 200705202228.32129.pgsql-performance@nullmx.com
Whole thread Raw
Responses Re: Rewriting DISTINCT and losing performance
Re: Rewriting DISTINCT and losing performance
List pgsql-performance
Hi all,

I know we've covered this before but I'm having trouble with it today.

I have some geographic data in tables that I'm working with.  I have a
country, state and city table.  I was selecting the country_name out of the
country table but discovered that some countries (like Antarctica) didn't
have cities in the city table.

I resolved to query the country table for only country_name's which had
country_id's in the city table - meaning the country had cities listed.

The problem was I had a couple different sources (in separate tables) with
some extraneous column  data so I chose to consolidate the city tables from
the different sources and column data that I don't need because I don't have
the hardware to support it.

That was the end of my query time.

Here's the original table and query:

# \d geo.world_city
             Table "geo.world_city"
   Column   |          Type          | Modifiers
------------+------------------------+-----------
 city_id    | integer                | not null
 state_id   | smallint               |
 country_id | smallint               |
 rc         | smallint               |
 latitude   | numeric(9,7)           |
 longitude  | numeric(10,7)          |
 dsg        | character(5)           |
 cc1        | character(2)           |
 adm1       | character(2)           |
 city_name  | character varying(200) |
Indexes:
    "world_city_pk" PRIMARY KEY, btree (city_id)
    "idx_world_city_cc1" btree (cc1)
    "idx_world_city_cc1_adm1" btree (cc1, adm1)
    "idx_world_city_country_id" btree (country_id)
    "idx_world_city_name_first_letter" btree
(state_id, "substring"(lower(city_name::text), 1, 1))
    "idx_world_city_state_id" btree (state_id)

explain analyze
SELECT   country_id, country_name
FROM     geo.country
WHERE country_id IN
 (select country_id FROM geo.world_city)
;

                                                                      QUERY
PLAN
-----------------------------------------------------------------------------
--------------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..167.97 rows=155 width=15) (actual
time=85.502..3479.449 rows=231 loops=1)
   ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15) (actual
time=0.089..0.658 rows=244 loops=1)
   ->  Index Scan using idx_world_city_country_id on world_city
(cost=0.00..8185.05 rows=12602 width=2) (actual time=14.250..14.250 rows=1
loops=244)
         Index Cond: (country.country_id = world_city.country_id)
 Total runtime: 3479.921 ms

Odd that it took 3 seconds because every previous run has been much quicker.
The next run was:

                                                                     QUERY
PLAN
-----------------------------------------------------------------------------
------------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..167.97 rows=155 width=15) (actual
time=0.087..6.967 rows=231 loops=1)
   ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15) (actual
time=0.028..0.158 rows=244 loops=1)
   ->  Index Scan using idx_world_city_country_id on world_city
(cost=0.00..8185.05 rows=12602 width=2) (actual time=0.026..0.026 rows=1
loops=244)
         Index Cond: (country.country_id = world_city.country_id)
 Total runtime: 7.132 ms
(5 rows)


But that was irrelevant.  I created a new table and eliminated the data and
 it looks like this:

# \d geo.city
                Table "geo.city"
   Column   |          Type          | Modifiers
------------+------------------------+-----------
 city_id    | integer                | not null
 state_id   | smallint               |
 country_id | smallint               |
 latitude   | numeric(9,7)           |
 longitude  | numeric(10,7)          |
 city_name  | character varying(100) |
Indexes:
    "city_pk" PRIMARY KEY, btree (city_id)
    "idx_city_country_id" btree (country_id) CLUSTER
Foreign-key constraints:
    "city_state_id_fk" FOREIGN KEY (state_id) REFERENCES geo.state(state_id)
ON UPDATE CASCADE ON DELETE CASCADE

explain analyze
SELECT   country_id, country_name
FROM     geo.country
WHERE country_id IN
 (select country_id FROM geo.city)
;

-- won't complete in a reasonable amount of time.

This one won't use the country_id index. The two tables have almost the same
number of rows:

cmi=# select count(*) from geo.world_city;
  count
---------
 1953314
(1 row)

cmi=# select count(*) from geo.city;
  count
---------
 2122712
(1 row)


 I tried to force it and didn't see any improvement.  I've vacuummed,
analyzed, clustered.  Can someone help me to get only the countries who have
cities in the city table in a reasonable amount of time?

-------------------------------------------------------

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Ever Increasing IOWAIT
Next
From: "Ralph Mason"
Date:
Subject: Re: Ever Increasing IOWAIT