Thread: Re: Geographic data sources, queries and questions

Re: Geographic data sources, queries and questions

From

"btober"

Date:

24 May 2007, 10:57:54

>
> >
> > I don't believe this is good design.  You'll have to
have a trigger or
> > something to verify that the country_id+state_id on the
city table are
> > exactly equal to the country_id+state_id on the state
table.  If you
> > don't, you might have something like (using US city
names...) "country:
> > USA -> state: NY" -> "country: Zimbabwe -> state: NY ->
city: New
> > York".
>
> >
> > It isn't a problem of "any country and any state" on the
city table, but
> > a problem of "this state inside that particular
country".  I'd drop the
> > country column.
>
> You are right, this is a bad design.  The country_id on
the city table has to
> go.


I'm not sure it is a bad design. Country has a country_id.
That's the primary key. State has a state_id, and exactly
one country, so really state has a compound primary key,
namely (country_id, state_id). And similarly to produce
relational integrity between state and city, city needs to
reference the state primary key, which means state has to
have all three (country_id, state_id, city_id) as it's
primary key.

This ties in with a subject dear to my heart and discussed
at great length starting here:

"http://archives.postgresql.org/pgsql-general/2006-08/msg00535.php"

The tie-in is not that you necessarily need the infamous
"gap-less" sequence, but that you don't necessarily need
three sequences per se.

While it would be temptingly easy to simply declare all
three separately in each table as

country:
country_id SERIAL, ...

state:
country_id integer,
state_id SERIAL, ...

city:
country_id integer,
state_id integer,
city_id SERIAL, ...

with that naive approach, every row in state has a unique
state_id, and every row in city has a unique city_id. Then
you'll notice that values of country_id are repeated in
state, and state_id values are repeated in city.

And then you'll realize that really it is the combination of
(country_id, state_id) that defines a unique state, and
(country_id, state_id, city_id) that defines a unique city.
It would require the use of stored programs and triggers to
manage these compound keys. But that's what stored programs
and triggers are for.

If you allow the fact that two countries could lay claim to
the same geographic sub-region, then you need a separate
table for an n-m relation.

Then throw in postal codes. Not only can a city have
multiple postal codes, but a postal code can serve more than
one city. And the cities served might be in different
states! (I used to have an example of that, but I can't find
it right now.)

Same with telephone area codes. (You are going there,
eventually, right?)

Re: Geographic data sources, queries and questions

From

Michael Glaesemann

Date:

29 May 2007, 15:50:05

On May 24, 2007, at 8:57 , btober wrote:

> I'm not sure it is a bad design. Country has a country_id.
> That's the primary key. State has a state_id, and exactly
> one country, so really state has a compound primary key,
> namely (country_id, state_id).

While each state may have a single state_id and a single country,
that does not imply a composite (or compound) primary key. There's
been a lot written on database normalization, so I won't go into
depth here, but a quick way to think about it is how each row is
uniquely identified. For example;

Is each state uniquely identified by state_id? If so, that's a
(primary) key for the states table. That's often what people are
trying to do when they make a table of the form:

-- Listing 1
CREATE TABLE states
(
    state_id INTEGER PRIMARY KEY
    , state_name TEXT NOT NULL
);

If you are only dealing with one country, each state is (hopefully)
uniquely identified by its name as well, so you could add a UNIQUE
constraint to the state_name column, e..g,

-- Listing 2
CREATE TABLE states
(
    state_id INTEGER PRIMARY KEY
    , state_name TEXT NOT NULL UNIQUE
);

 From a logical point of view, PRIMARY KEY is equivalent to NOT NULL
UNIQUE, so there's no logical difference between Listing 2 and the
following:

-- Listing 3
CREATE TABLE states
(
    state_id INTEGER NOT NULL UNIQUE
    , state_name TEXT PRIMARY KEY
);

The state_id column is what is often referred to as a surrogate key:
it holds no information that really identifies the state in any real
sense. One integer is as good as another to identify the state. On
the other hand, the state_name column *is* associated with each state
in a real sense. Assigning arbitrary names to states would be less
than useful.

If you're handling more than one country, you'll most likely want to
associate the states with their respective countries.

-- Listing 4
CREATE TABLE countries
(
     country_id INTEGER PRIMARY KEY
);

CREATE TABLE states
(
     state_id INTEGER PRIMARY KEY
     , state_name TEXT NOT NULL
     , country_id INTEGER NOT NULL
         REFERENCES countries (country_id)
);

Note that there's no UNIQUE constraint on state_name. You may have
more than one state with the same state_name around the world so you
may want to make sure that for each country, each state_name is unique:

-- Listing 5
CREATE TABLE states
(
     state_id INTEGER PRIMARY KEY
     , state_name TEXT NOT NULL
     , country_id INTEGER NOT NULL
         REFERENCES countries (country_id)
     , UNIQUE (country_id, state_name)
);

> And similarly to produce
> relational integrity between state and city, city needs to
> reference the state primary key, which means state has to
> have all three (country_id, state_id, city_id) as it's
> primary key.

While each city does belong to a given state, each state in turn is
associated with a given country. To find out what country a given
city belongs to you'd just join through the states table. For example:

-- Listing 6
CREATE TABLE cities
(
     city_id INTEGER PRIMARY KEY
     , city_name TEXT NOT NULL
     , state_id INTEGER NOT NULL
         REFERENCES states (state_id)
     , UNIQUE (state_id, city_name)
);

I've also gone ahead and provided a UNIQUE constraint to prevent
city_name duplicates in the same state.

A PRIMARY KEY constraint of the form PRIMARY KEY (country_id,
state_id, city_id) would mean that the for each country_id and
state_id combination each city_id is unique. This means you could
potentially have the same city in multiple states in the same country
or in various countries and states. And there's nothing to prevent
something along the lines of (Mexico City, Nebraska, Canada). Note
that (Omaha, Nebraska, United States of America) would happily exist
in the same cities table!

To find the countries for each city:

-- Listing 7

SELECT city_name, state_name, country_name
FROM cities
NATURAL JOIN states
NATURAL JOIN countries;

> While it would be temptingly easy to simply declare all
> three separately in each table as
>
> country:
> country_id SERIAL, ...
>
> state:
> country_id integer,
> state_id SERIAL, ...
>
> city:
> country_id integer,
> state_id integer,
> city_id SERIAL, ...
>
> with that naive approach, every row in state has a unique
> state_id, and every row in city has a unique city_id. Then
> you'll notice that values of country_id are repeated in
> state, and state_id values are repeated in city.

This "naive" approach (other than adding country_id to the city
table) actually looks like proper normalization. The repetition you
have here is just providing the country for each state and the state
for each city. That's not duplication of information, if you want to
associate states with countries and cities with states.

> And then you'll realize that really it is the combination of
> (country_id, state_id) that defines a unique state, and
> (country_id, state_id, city_id) that defines a unique city.
> It would require the use of stored programs and triggers to
> manage these compound keys. But that's what stored programs
> and triggers are for.

As above, if you've got your database schema designed properly, you
don't need any stored procedures or triggers (other than those
provided under the covers by the foreign keys) to maintain the proper
referential integrity. Each city has a unique state (the state_id
column) and a unique country (joined through the states table). But
enforcing (city_id, state_id, country_id) uniqueness allows all kinds
of city/state/country mismatches.

> If you allow the fact that two countries could lay claim to
> the same geographic sub-region, then you need a separate
> table for an n-m relation.

This could be interesting :) Especially if the regions don't share
exactly the same borders! :)

> Then throw in postal codes. Not only can a city have
> multiple postal codes, but a postal code can serve more than
> one city.

-- Listing 8

CREATE TABLE postal_codes
(
    postal_code TEXT PRIMARY KEY
);

CREATE TABLE city_postal_codes
(
     city_id INTEGER NOT NULL
         REFERENCES cities (city_id)
     , postal_code TEXT NOT NULL
         REFERENCES postal_codes (postal_code)
     , PRIMARY KEY (city_id, postal_code)
);

> And the cities served might be in different
> states! (I used to have an example of that, but I can't find
> it right now.)

In Listing 8 there's nothing enforcing unique (state_id, postal_code)
associations, so you wouldn't run into trouble there. If you wanted
to find out which postal codes service which states you can use a join:

-- Listing 9

SELECT DISTINCT state_name, postal_code
FROM states
NATURAL JOIN city_postal_codes;

> Same with telephone area codes. (You are going there,
> eventually, right?)

Given the mobility of telephone numbers nowadays, you may not be
concerned with strict associations with telephone numbers,
subscribers, cities, and states.

Anyway, this has ended up much longer than I intended, but I didn't
want this to go unanswered. I've found Chris Date's books very
helpful, in particular "Introduction to Database Systems"[1] and
"Database in Depth: Relational Theory for Practice"[2].

Hope this helps.

Michael Glaesemann
grzm seespotcode net

[1](http://www.amazon.com/Introduction-Database-Systems-Eighth/dp/
0321197844/)
[2](http://www.amazon.com/Database-Depth-Relational-Theory-
Practitioners/dp/0596100124/)

Re: Geographic data sources, queries and questions

From

Oliver Elphick

Date:

29 May 2007, 16:50:58

On Tue, 2007-05-29 at 13:49 -0500, Michael Glaesemann wrote:
>
> If you're handling more than one country, you'll most likely want to
> associate the states with their respective countries.
>
> -- Listing 4
> CREATE TABLE countries
> (
>      country_id INTEGER PRIMARY KEY
> );
>
> CREATE TABLE states
> (
>      state_id INTEGER PRIMARY KEY
>      , state_name TEXT NOT NULL
>      , country_id INTEGER NOT NULL
>          REFERENCES countries (country_id)
> );
>
> Note that there's no UNIQUE constraint on state_name. You may have
> more than one state with the same state_name around the world so you
> may want to make sure that for each country, each state_name is
> unique:

You have assumed that state codes are unique integers, but for a
worldwide database that is probably a bad design.  The USA knows its
states by two-letter codes, as does India and one should surely not
invent a new set of codes for them.  I would make this field a
VARCHAR(3) with an upper-case constraint.  Furthermore, these codes are
not going to be unique.  For instance MH is the US abbreviation for the
Marshall Islands [US Post Office] and also the Indian abbreviation for
Maharashtra [Wikipedia].  In such a case I would always make the country
code part of the primary key and not just an attribute.  Again this
saves your having to invent a new set of codes when one exists already.

--
Oliver Elphick                                          olly@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
                 ========================================
   Do you want to know God?   http://www.lfix.co.uk/knowing_god.html

Re: Geographic data sources, queries and questions

From

"John D. Burger"

Date:

29 May 2007, 17:26:19

Oliver Elphick wrote:

> You have assumed that state codes are unique integers, but for a
> worldwide database that is probably a bad design.  The USA knows its
> states by two-letter codes, as does India and one should surely not
> invent a new set of codes for them.  I would make this field a
> VARCHAR(3) with an upper-case constraint.

In fact, the US postal codes are not what most govt. data sources use
- they are mandated to use FIPS codes, which are numeric and are not
guaranteed to be stable!!!

> Furthermore, these codes are
> not going to be unique.  For instance MH is the US abbreviation for
> the
> Marshall Islands [US Post Office] and also the Indian abbreviation for
> Maharashtra [Wikipedia].  In such a case I would always make the
> country
> code part of the primary key and not just an attribute.  Again this
> saves your having to invent a new set of codes when one exists
> already.

Even ISO country codes are not guaranteed to be stable - I think
Yugoslavia is one example where a code has been recycled recently.
As I said, we found the simplest approach was to use our own internal
IDs for these things, and have a table mapping these to the codes
used in various standards.

- John D. Burger
   MITRE

Re: Geographic data sources, queries and questions

From

Michael Glaesemann

Date:

29 May 2007, 19:36:35

On May 29, 2007, at 14:50 , Oliver Elphick wrote:

> On Tue, 2007-05-29 at 13:49 -0500, Michael Glaesemann wrote:
>>
>> If you're handling more than one country, you'll most likely want to
>> associate the states with their respective countries.
>>
>> -- Listing 4
>> CREATE TABLE countries
>> (
>>      country_id INTEGER PRIMARY KEY
>> );
>>
>> CREATE TABLE states
>> (
>>      state_id INTEGER PRIMARY KEY
>>      , state_name TEXT NOT NULL
>>      , country_id INTEGER NOT NULL
>>          REFERENCES countries (country_id)
>> );
>>
>> Note that there's no UNIQUE constraint on state_name. You may have
>> more than one state with the same state_name around the world so you
>> may want to make sure that for each country, each state_name is
>> unique:
>
> You have assumed that state codes are unique integers, but for a
> worldwide database that is probably a bad design.

Actually, my intent was to use state_id as a surrogate key for
state_name. I assumed unique state_names per country. If one wanted
state codes, (such as ISO 3166-2), you'd add columns for that.

> Furthermore, these codes are
> not going to be unique.  For instance MH is the US abbreviation for
> the
> Marshall Islands [US Post Office] and also the Indian abbreviation for
> Maharashtra [Wikipedia].  In such a case I would always make the
> country
> code part of the primary key and not just an attribute.  Again this
> saves your having to invent a new set of codes when one exists
> already.

This trends towards the discussion on whether or not to use surrogate
keys. In the schema I suggested, the natural key is (state_name,
country_id). If you wanted to use ISO 3166-1 codes in both countries
and states tables rather than the country_id surrogate key, that'd
work, too. Surrogate keys are tangential to the normalization issues
I was addressing.

Michael Glaesemann
grzm seespotcode net

Re: Geographic data sources, queries and questions

From

Michael Glaesemann

Date:

29 May 2007, 19:46:58

On May 29, 2007, at 15:28 , John D. Burger wrote:

> Even ISO country codes are not guaranteed to be stable

I'm not sure where the idea that primary keys must be stable comes
from. There's nothing necessarily wrong with updating a primary key.
All a primary key does is uniquely identify a row in a table. If that
id changes over time, that's fine, as long as the primary key columns
continue to uniquely identify each row in the table. SQL even
provides ON UPDATE CASCADE to make this convenient. There may be
performance arguments against updating a primary key (as the changes
need to propagate), but that depends on the needs of a particular
(benchmarked and tested) application environment.

Michael Glaesemann
grzm seespotcode net

Re: Geographic data sources, queries and questions

From

Ron Johnson

Date:

30 May 2007, 04:15:04

On 05/29/07 17:46, Michael Glaesemann wrote:
>
> On May 29, 2007, at 15:28 , John D. Burger wrote:
>
>> Even ISO country codes are not guaranteed to be stable
>
> I'm not sure where the idea that primary keys must be stable comes from.
> There's nothing necessarily wrong with updating a primary key. All a
> primary key does is uniquely identify a row in a table. If that id
> changes over time, that's fine, as long as the primary key columns
> continue to uniquely identify each row in the table.

And any archived data (for example, transaction detail that you must
keep for 7 years but don't still want in your database, since it
doubles your backup/restore times) will still have the old codes.

"Static" data needs to be static.

>                                                      SQL even provides
> ON UPDATE CASCADE to make this convenient. There may be performance
> arguments against updating a primary key (as the changes need to
> propagate), but that depends on the needs of a particular (benchmarked
> and tested) application environment.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!

Re: Geographic data sources, queries and questions

From

"John D. Burger"

Date:

30 May 2007, 13:00:36

>>> Even ISO country codes are not guaranteed to be stable
>>
>> I'm not sure where the idea that primary keys must be stable comes
>> from. There's nothing necessarily wrong with updating a primary
>> key. All a primary key does is uniquely identify a row in a table.
>> If that id changes over time, that's fine, as long as the primary
>> key columns continue to uniquely identify each row in the table.
>
> And any archived data (for example, transaction detail that you
> must keep for 7 years but don't still want in your database, since
> it doubles your backup/restore times) will still have the old codes.
>
> "Static" data needs to be static.

Yes, and then there is the question of what such a recycled code
actually =means= as a foreign key.

For example, CS used to be the code for Czechoslovakia, then it was
for Serbia and Montenegro, now it is in "transition" before being
deleted.  Czechoslovakia no longer has a code, since it no longer
exists, as far as ISO is concerned.  What do you want to do with your
biography database for 19th century Slavic poets, which indicate that
some people were born in Czechoslovakia.  Did those people move
(briefly) to Serbia and Montenegro?  Or did their birthplace change
to NULL?  If you want to give them a code, you have to find out what
part of Czechoslovakia they actually lived in, and what country that
region's now in.  Do you really want some external agency forcing you
to muck with you data like this?

Anyway, regardless of one's feelings along these lines, I thought
many might be implicitly assuming that all of these standards
guarantee such stability, and I wanted to disabuse folks of that.

- John Burger
   MITRE

Re: Geographic data sources, queries and questions

From

Gregory Stark

Date:

30 May 2007, 13:52:00

>>> I'm not sure where the idea that primary keys must be stable comes from.
>>> There's nothing necessarily wrong with updating a primary  key. All a
>>> primary key does is uniquely identify a row in a table.  If that id changes
>>> over time, that's fine, as long as the primary  key columns continue to
>>> uniquely identify each row in the table.

Firstly trying to update such a key you'll immediately bump into the practical
reasons why it doesn't work well. You have to update every record everywhere
in the database that references that key which represents a lot of potential
work.

But secondly, what happens to applications that have read that value into
their local state and then try to operate on it? When they return to perform a
second operation and use the primary key to specify the record they wish to
operate on they'll find it gone.

In short, you have to update every instance of the key, not only in the
database, but in every application and even in every other representation in
the real world. That could include changing people's bookmarks, notes in PDAs,
even paper reports sitting on people's desks -- a tall order for an SQL query.

In short primary keys that aren't static just aren't very useful.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: Geographic data sources, queries and questions

From

PFC

Date:

30 May 2007, 14:01:55


> In short, you have to update every instance of the key, not only in the
> database, but in every application and even in every other
> representation in
> the real world. That could include changing people's bookmarks, notes in
> PDAs,
> even paper reports sitting on people's desks -- a tall order for an SQL
> query.

    This also applies to misguided databases that REUSE values from
"auto_increment" columns.
    I once had two orders with the same PK value.

    One had been mistakenly deleted, then another one took its place, and all
hell broke loose.

Re: Geographic data sources, queries and questions

From

Ron Johnson

Date:

30 May 2007, 14:44:35

On 05/30/07 11:01, John D. Burger wrote:
>>>> Even ISO country codes are not guaranteed to be stable
>>>
>>> I'm not sure where the idea that primary keys must be stable comes
>>> from. There's nothing necessarily wrong with updating a primary key.
>>> All a primary key does is uniquely identify a row in a table. If that
>>> id changes over time, that's fine, as long as the primary key columns
>>> continue to uniquely identify each row in the table.
>>
>> And any archived data (for example, transaction detail that you must
>> keep for 7 years but don't still want in your database, since it
>> doubles your backup/restore times) will still have the old codes.
>>
>> "Static" data needs to be static.
>
> Yes, and then there is the question of what such a recycled code
> actually =means= as a foreign key.
>
> For example, CS used to be the code for Czechoslovakia, then it was for
> Serbia and Montenegro, now it is in "transition" before being deleted.
> Czechoslovakia no longer has a code, since it no longer exists, as far
> as ISO is concerned.  What do you want to do with your biography
> database for 19th century Slavic poets, which indicate that some people
> were born in Czechoslovakia.  Did those people move (briefly) to Serbia
> and Montenegro?  Or did their birthplace change to NULL?  If you want to
> give them a code, you have to find out what part of Czechoslovakia they
> actually lived in, and what country that region's now in.  Do you really
> want some external agency forcing you to muck with you data like this?

In situations like this (a toll road schedule) we add beginning and
expiring timestamps, and the expiring timestamp is part of the
natural PK.

> Anyway, regardless of one's feelings along these lines, I thought many
> might be implicitly assuming that all of these standards guarantee such
> stability, and I wanted to disabuse folks of that.

It's very interesting and useful to know.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!

Re: Geographic data sources, queries and questions

From

Michael Glaesemann

Date:

01 June 2007, 09:53:39

On May 30, 2007, at 11:51 , Gregory Stark wrote:

> Firstly trying to update such a key you'll immediately bump into
> the practical
> reasons why it doesn't work well. You have to update every record
> everywhere
> in the database that references that key which represents a lot of
> potential
> work.

As I mentioned earlier, if you plan on updating keys, you'd have ON
UPDATE CASCADE set, so there shouldn't be any work here. There may be
performance issues as those changes propagate, however.

> But secondly, what happens to applications that have read that
> value into
> their local state and then try to operate on it? When they return
> to perform a
> second operation and use the primary key to specify the record they
> wish to
> operate on they'll find it gone.

This is an interesting point. What this boils down to is having a
method of cascade the key changes outside of the database. Depending
on how long the data is cached and how gracefully the application
deals with errors when the lookup value is no longer present. In a
case where you know primary keys are going to be updating, you'd
probably want some additional key attribute that *is* stable for use
outside of the database. Anyway, thanks for the food for thought!

Michael Glaesemann
grzm seespotcode net