Thread: Using varchar primary keys.
Consider the following scenario.
I have a typical tagging structure. There is a table called tags, there is a table called taggings. The taggings table acts as a many to many join table between the taggers and the tags.
The tags table has two fields, id and tag. id is a serial data type.

The taggings table has a tag_id field along with the context, tagger_id etc.

I don't think there is even one query in this scenario which does not join the taggings table to the tags table, so I am wondering if there is any advantage at all of having that id field in the tags table. Why shouldn't I just put the tag itself as the primary key? The tags are quite short, so if pg is smart about it I would guess they would take up less space than an integer in some cases.

I guess the same scenario happens in many of my lookup tables. I have all these tables with just two fields in them: an id field and then some varchar field. Then every time I want to display that record I join with five other tables so I can look up all the associated lookup tables and display the text value to the user.

Given that you can still set foreign keys and cascade commands to adjust child records either way, the tables would be properly normalized, so I am wondering what I am gaining by using these serial ID fields.

Is this false economy?
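For concreteness, a minimal sketch of the two layouts being weighed; the context and tagger_id columns come from the description above, but their types are illustrative:

-- Variant A: serial surrogate key on tags
CREATE TABLE tags (
    id  serial PRIMARY KEY,
    tag text NOT NULL UNIQUE
);

CREATE TABLE taggings (
    tag_id    integer NOT NULL REFERENCES tags (id),
    tagger_id integer NOT NULL,
    context   text
);

-- Variant B: the tag itself as the key
CREATE TABLE tags_natural (
    tag text PRIMARY KEY
);

CREATE TABLE taggings_natural (
    tag       text NOT NULL REFERENCES tags_natural (tag),
    tagger_id integer NOT NULL,
    context   text
);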
On 3/31/13, Tim Uckun <timuckun@gmail.com> wrote:
> Given that you can still set foreign keys and cascade commands to adjust
> child records either way the tables would be properly normalized so I am
> wondering what I am gaining by using these serial ID fields.

You can. Arguably, if the natural key is relatively short and consists of only one column, E.F. Codd would probably have encouraged you to use it. I'd leave the serial column for the convenience of identifying individual records on the command line, especially if the varchar can store characters that are not easily typed in a console.

As for the primary key being a 'natural key' vs. a 'synthetic' one, that's a huge debate akin to emacs vs. vi. Depending on your situation, there's nothing wrong with using natural keys. It is discussed at length, and in some cases encouraged, here:

http://database-programmer.blogspot.com/2008/01/database-skills-sane-approach-to.html

-Modulok-
On 01/04/13 12:19, Modulok wrote:
> As for the primary key being a 'natural key' vs. a 'synthetic' one, that's
> a huge debate akin to emacs vs. vi. Depending on your situation, there's
> nothing wrong with using natural keys. It is discussed at length, and in
> some cases encouraged, here:
>
> http://database-programmer.blogspot.com/2008/01/database-skills-sane-approach-to.html

I disagree with almost everything in that blog post.

Shouldn't the first question be to figure out why even a bigserial (int8) would not suffice as a unique identifier for row data? There are cases where it won't; what are they? (There's a great blog post by a Twitter developer on the development of their unique identifier (PK); I couldn't find it.)

Natural Keys have a purpose but when do they exist in the database? For example, a Surrogate Key that is exposed to the world may get wrapped in a checkbit algorithm, or perhaps even an HMAC-encoded key, to validate the genuineness of incoming requests before an expensive database lookup is initiated. So is there a need to store the Natural Keys generated in the application? It might be useful to external parties, though they may even call it "foos_pks".

What about the Natural Keys of an external source? Should that be stored in a lookup table along with the integer based Surrogate Key? Maybe I'll call it "bars_pks".

As to the original question, my response may seem off topic, but exposing PKs to the world has some pitfalls (not all of them to do with security), and character-based keys have pitfalls compared to Surrogate Keys that stay out of the displayed data.
CREATE TABLE tags (tag text PRIMARY KEY);
CREATE TABLE child_tag (parent_tag text REFERENCES tags);

INSERT INTO tags VALUES ('fu');
INSERT INTO child_tag VALUES ('fu');

--Spelling error.
UPDATE tags SET tag = 'foo' WHERE tag = 'fu';

This will fail unless you ON UPDATE CASCADE. Some things to think about.

Regards,
Jules.
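For comparison, a minimal sketch of the cascading variant Jules alludes to: the same two tables, with the referential action spelled out so the child row follows the parent.

CREATE TABLE tags (tag text PRIMARY KEY);
CREATE TABLE child_tag (
    parent_tag text REFERENCES tags ON UPDATE CASCADE
);

INSERT INTO tags VALUES ('fu');
INSERT INTO child_tag VALUES ('fu');

-- Now the correction propagates instead of failing:
UPDATE tags SET tag = 'foo' WHERE tag = 'fu';
SELECT * FROM child_tag;   -- parent_tag is now 'foo'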
On 1/4/13 at 10:35 AM, Tim Uckun wrote:
> Consider the following scenario.

Since I don't consider myself an expert I am talking in basic terms below. Please don't misunderstand this plodding as anything other than me attempting to be as clear as possible. These comments reflect my understanding of Joe Celko's view as expressed in his books. I fully understand that many respectable SQL masters (respectfully) disagree with Celko in this. My take is to use his approach as a starting point and only deviate when a different approach is obviously much easier for me. (I doubt it's better. :) YMMV.

> I have a typical tagging structure. There is a table called tags, there is
> a table called taggings. The taggings table acts as a many to many join
> table between the taggers and the tags.
>
> The tags table has two fields id and tag. id is a serial data type.

Celko would suggest the table name is the plural of its primary key, so a table named tags has tag as the name of its primary key. And this is a big hint: the sequential index is not needed. He further (very strongly) suggests that primary keys assume their natural values and not be referenced by arbitrary (sequential) values, i.e., avoid pseudo-keys, as neither SQL nor modern databases need them.

> The taggings has a tag_id field along with the context, tagger_id etc.

So this would instead have two columns named "tag" and "tagger" (each referencing the column of the same name in tables "tags" and "taggers") and whatever is implied by the "etc". If the pair (tag, tagger) is meant to be unique then this is a good candidate for the primary key, otherwise a further term (or terms) needs to be dragged into the composite index (e.g., a timestamp). Once again there is no logical need for a sequential integer to be stored as well.

> ....
>
> Given that you can still set foreign keys and cascade commands to adjust
> child records either way the tables would be properly normalized so I am
> wondering what I am gaining by using these serial ID fields.
>
> Is this false economy?

I think so. If the tags table has both a sequential integer and the unique tag value then there is likely storage and I/O associated with two indexes, along with storage of both the tag and the associated integer.

If the list of tags is short enough it might be suitable to use an ENUM to enforce both integrity and brevity. If the list of tags is longer and/or might need updating then a single-column table will allow for this. In general, I think lists of things (e.g., tag values) should be kept within the dB so they can be used for integrity checking, etc. I don't like the idea of externalising this job to the interface application.

Also, as you have mentioned, you can elegantly handle table renovation, e.g., if a given tag needs its name changed, it can be done via a foreign key constraint with cascade. (Elegant here refers to the syntax; the actual I/O implications may be very ugly depending on the scale of data that needs rewriting. :)

In another situation the tags table might well contain columns such as: tag, tag_full_name, tag_origin, tag_authority_certificate/expiry/whatever. In such an instance the tag is better if it's human readable/meaningful, so reports can be useful without always back-joining the full name. Sequential integers rarely fulfil this role, as implied by the original question.

Regards
Gavan Schneider
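A minimal sketch of the layout Gavan describes, with tag and tagger as natural keys; the taggers table follows his wording, and the tagged_at column is an illustrative example of the extra term he mentions:

CREATE TABLE taggers (tagger text PRIMARY KEY);
CREATE TABLE tags    (tag    text PRIMARY KEY);

CREATE TABLE taggings (
    tag       text NOT NULL REFERENCES tags,
    tagger    text NOT NULL REFERENCES taggers,
    tagged_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tag, tagger)   -- or (tag, tagger, tagged_at) if the pair is not unique
);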
Hi Tim,
Debate "natural" vs "surrogate" key last for a too long time - with no official winner...
I think it is more "religional" then real issue...
Advice will be simple: pick approach what best fit your needs, taking into account now and tomorrow (probability of change)...
SQL standard and normal forms are fine with both religions... And it really has very small impact on: speed and storage size (real things) - that is not worth mentioning
I think, I personally could be put to "surrogate" religion - but in described scenario, having tables with just two columns: pk and unique - I would never suggest that, because of simply from practical, common sense, view - it doesn't make sense...
So probably in your scenario you should pick "natural" approach - for described scenarios...
Kind Regards,
Misa
Debate "natural" vs "surrogate" key last for a too long time - with no official winner...
I think it is more "religional" then real issue...
Advice will be simple: pick approach what best fit your needs, taking into account now and tomorrow (probability of change)...
SQL standard and normal forms are fine with both religions... And it really has very small impact on: speed and storage size (real things) - that is not worth mentioning
I think, I personally could be put to "surrogate" religion - but in described scenario, having tables with just two columns: pk and unique - I would never suggest that, because of simply from practical, common sense, view - it doesn't make sense...
So probably in your scenario you should pick "natural" approach - for described scenarios...
Kind Regards,
Misa
From: Tim Uckun
Sent: 01/04/2013 01:36
To: pgsql-general
Subject: [GENERAL] Using varchar primary keys.
Consider the following scenario.
I have a typical tagging structure. There is a table called tags, there is a table called taggings. The taggings table acts as a many to many join table between the taggers and the tags.
The tags table has two fields id and tag. id is a serial data type.I have a typical tagging structure. There is a table called tags, there is a table called taggings. The taggings table acts as a many to many join table between the taggers and the tags.
Given that you can still set foreign keys and cascade commands to adjust child records either way the tables would be properly normalized so I am wondering what I am gaining by using these serial ID fields.
> Natural Keys have a purpose but when do they exist in the database?
In my case it's the tags. Table tags has only two columns, id and tag. Each one has a unique index. I have many other similar lookup tables, for example things like order_status, item_type, etc.
> What about the Natural Keys of an external source? Should that be stored
> in a lookup table along with the integer based Surrogate Key?
> Maybe I'll call it "bars_pks".
I always keep those. The APIs I expose to clients and partners allow them to modify the records they send, and I let them refer to items by their own primary keys. This has always worked out well for me.
> --Spelling error.
> UPDATE tags SET tag = 'foo' WHERE tag = 'fu';
> This will fail unless you ON UPDATE CASCADE.
Yes, of course you'd need the ON UPDATE CASCADE. Then again, maybe I don't even need that tags table. Tags could just be a view (select distinct tag from child_tags). I am not saying that's efficient or desirable, but it's possible.
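A rough sketch of that idea, assuming the tag strings live directly in the join table (called taggings here; Tim writes child_tags above):

CREATE VIEW tags AS
    SELECT DISTINCT tag
    FROM taggings;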
On 2013-04-01, Tim Uckun <timuckun@gmail.com> wrote:
> In my case it's the tags. Table tags has only two columns, id and tag. Each
> one has a unique index. I have many other similar lookup tables, for example
> things like order_status, item_type, etc.

How about using an enum instead of this table?
> How about using an enum instead of this table?
That's an interesting idea. Are enums mutable?
On Mon, Apr 01, 2013 at 07:08:15PM +1300, Tim Uckun wrote:
>> How about using an enum instead of this table?
>
> That's an interesting idea. Are enums mutable?

Since 9.1 you can add values:

http://www.postgresql.org/docs/9.1/static/sql-altertype.html
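A minimal sketch of what that looks like for one of the lookup values mentioned earlier (order_status); the specific labels are illustrative:

CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id     serial PRIMARY KEY,
    status order_status NOT NULL DEFAULT 'pending'
);

-- Since 9.1, new labels can be appended or positioned relative to existing
-- ones (before PostgreSQL 12 this cannot be run inside a transaction block):
ALTER TYPE order_status ADD VALUE 'cancelled';
ALTER TYPE order_status ADD VALUE 'refunded' AFTER 'shipped';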
On Mon, Apr 1, 2013 at 1:22 AM, Jasen Betts <jasen@xnet.co.nz> wrote:
> On Mon, Apr 01, 2013 at 07:08:15PM +1300, Tim Uckun wrote:
>> That's an interesting idea. Are enums mutable?
>
> since 9.1 you can add values.
>
> http://www.postgresql.org/docs/9.1/static/sql-altertype.html

It's an interesting idea, but I don't think enums are designed to act as a primary key except in cases where the data is basically static and relatively small. For starters, any manipulation of the enum requires a lock.

enums can be a real life saver when you need custom ordering built into a string, especially if that ordering is floated over a composite index.

merlin
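A small sketch of that ordering point; the severity type and the events table are illustrative:

CREATE TYPE severity AS ENUM ('low', 'medium', 'high', 'critical');

CREATE TABLE events (
    logged_at timestamptz NOT NULL DEFAULT now(),
    severity  severity    NOT NULL,
    message   text
);

-- Enum values compare in declaration order, not alphabetically, so this
-- composite index can serve ORDER BY severity DESC, logged_at DESC directly:
CREATE INDEX events_severity_logged_at_idx ON events (severity, logged_at);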
On the topic of 'natural' versus 'synthetic' primary keys, I am generally in the camp that an extra ID field won't cost you too much, and while one may not need it for a simple table (i.e. id, name) one might add any number of columns later, and you'll be glad to have it.
I am, however, against using sequences (or serial integers in Postgres) for reasons of scaling and replication across multiple copies of a database running on different servers.

My preferred method is to give every table an ID column of UUID type and generate a UUID using the uuid-ossp contrib module. This also prevents someone not familiar with the database design from using an ID somewhere they should not (as is possible with natural PKs) or treating the ID as an integer, not an identifier (as is all too common with serial integers).
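A minimal sketch of that approach, assuming the uuid-ossp extension is available (uuid_generate_v4() is just one of that module's generators); the table reuses the thread's tags example:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE tags (
    id  uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
    tag text NOT NULL UNIQUE
);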
On Tue, Apr 2, 2013 at 8:35 AM, jesusthefrog <jesusthefrog@gmail.com> wrote:
> On the topic of 'natural' versus 'synthetic' primary keys, I am generally in the camp that an extra ID field won't cost you too much, and while one may not need it for a simple table (i.e. id, name) one might add any number of columns later, and you'll be glad to have it.
Nothing prevents you from adding more columns if you use varchar primary keys.
> My preferred method is to give every table an ID column of UUID type and generate a UUID using the uuid-ossp contrib module. This also prevents someone not familiar with the database design from using an ID somewhere they should not (as is possible with natural PKs) or treating the ID as an integer, not an identifier (as is all too common with serial integers).
This would be a concern if you had multi-master writes. As far as I know Postgres does not have a true multi-master replication system, so all the writes have to happen on one server, right?

As for UUIDs, I use them sometimes, but I tend to also use one serial column, because when I am communicating with people it makes it so much easier to say "dealer number X" than "dealer number SOME_HUGE_UUID". I often have to talk to people about the data, and UUIDs make it very difficult to communicate with humans.
On 02/04/13 06:35, jesusthefrog wrote:
> On the topic of 'natural' versus 'synthetic' primary keys, I am
> generally in the camp that an extra ID field won't cost you too much,
> and while one may not need it for a simple table (i.e. id, name) one
> might add any number of columns later, and you'll be glad to have it.

Definitely; a lookup table can be extended to hold all kinds of statistics and translations on a "tag". I didn't look too deeply into the actual usage (not sure if any real examples were given). But ON UPDATE CASCADE is not a feature I would recommend in this context (referencing a PK), if at all.

> I am, however, against using sequences (or serial integers in Postgres)
> for reasons of scaling and replication across multiple copies of a
> database running on different servers.
>
> My preferred method is to give every table an ID column of UUID type and
> generate a UUID using the uuid-ossp contrib module. This also prevents
> someone not familiar with the database design from using an ID somewhere
> they should not (as is possible with natural PKs) or treating the ID as
> an integer, not an identifier (as is all too common with serial integers).

It was Instagram's own implementation, but it mentions Twitter's "Snowflake". An interesting read; it still only requires a bigint. http://goo.gl/gRlXC

IMO uuid is convenient but large, and for all that data it doesn't hold anything useful. I have been tempted though.

Jules.
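For reference, a rough sketch of the kind of shard-aware bigint generator the linked post describes; the bit layout (41 bits of milliseconds since a custom epoch, 13 bits of shard id, 10 bits of per-shard sequence), the epoch value and the names are illustrative, not a copy of Instagram's code:

CREATE SEQUENCE shard_1_id_seq;

CREATE OR REPLACE FUNCTION next_shard_id(shard_id integer)
RETURNS bigint
LANGUAGE plpgsql
AS $$
DECLARE
    custom_epoch bigint := 1314220021721;   -- an arbitrary starting point, in milliseconds
    now_millis   bigint;
    seq_id       bigint;
BEGIN
    seq_id     := nextval('shard_1_id_seq') % 1024;                      -- 10 bits of sequence
    now_millis := floor(extract(epoch FROM clock_timestamp()) * 1000);   -- milliseconds since Unix epoch
    RETURN ((now_millis - custom_epoch) << 23)
           | (shard_id::bigint << 10)
           | seq_id;
END;
$$;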
On 02/04/13 08:35, jesusthefrog wrote:
> On the topic of 'natural' versus 'synthetic' primary keys, I am generally
> in the camp that an extra ID field won't cost you too much ...

I use synthetic primary keys because I want to minimise changes to the database when the user, directly or due to some law change, changes a 'natural' value. Using synthetic primary keys minimises changes to a database when a 'natural' value changes: if people's names are part of many natural keys, then when people change their name (like when a woman gets married), only one table needs to change. Likewise the customer number the manager swore would never change; now they want to change from a numeric key to an alphanumeric one.
Using 'natural' values for a primary key seems to be deliberately adding potential time bombs. Almost as bad as the misguided idea that using 'sudo' is safer than the alternative when executing commands as root!
Cheers,
Gavin
On Mon, Apr 1, 2013 at 1:11 PM, Tim Uckun <timuckun@gmail.com> wrote:
> As for UUIDs I use them sometimes but I tend to also use one serial column
> because when I am communicating with people it makes it so much easier to
> say "dealer number X" than "dealer number SOME_HUGE_UUID". I often have to
> talk to people about the data and UUIDs make it very difficult to
> communicate with humans.
I've been wishing for a smaller uuid type for a while. Say you need to assign an Order #. Customers might be referencing the number, so you don't want it to be too long. But you don't want Order #'s to be guessable, or to have the characters/digits be transposed accidentally.
I've been using a unique text column with a default of random_characters(12)
CREATE OR REPLACE FUNCTION public.random_characters(length integer)
RETURNS text
LANGUAGE sql
STABLE
AS $function$
    -- Build a string of $1 characters, each picked pseudo-randomly from a
    -- 31-character alphabet that avoids easily confused characters.
    SELECT array_to_string(array((
        SELECT SUBSTRING('abcdefghjkmnpqrstuvwxyz23456789'
                         FROM mod((random()*31)::int, 31)+1 FOR 1)
        FROM generate_series(1, $1))), '');
$function$;
This seems to work ok. I don't allow 1's or l's or 0's or O's, as they can easily be mistyped or misread.
On Tue, Apr 2, 2013 at 10:34 AM, Joe Van Dyk <joe@tanga.com> wrote:
> I've been wishing for a smaller uuid type for a while. Say you need to
> assign an Order #. Customers might be referencing the number, so you don't
> want it to be too long. But you don't want Order #'s to be guessable, or to
> have the characters/digits be transposed accidentally.
>
> I've been using a unique text column with a default of random_characters(12)
>
> This seems to work ok. I don't allow 1's or l's or 0's or O's, as they can
> easily be mistyped or misread.

This is pseudo random and can be guessed, which is maybe dangerous depending on circumstance. For stronger random stream go to pgcrypto.gen_random_bytes(). Also, now you have to worry about collisions -- the whole point of uuid is to try and keep you from having to deal with that.

My historical comments in this debate are noted. To summarize, I strongly believe that natural keys are often (but not always) better.

merlin
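A rough sketch of the same helper built on pgcrypto instead, as suggested above; the function name is illustrative, the alphabet is carried over from Joe's version, and the modulo mapping keeps a slight bias that is usually acceptable for this purpose:

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION public.random_characters_strong(length integer)
RETURNS text
LANGUAGE sql
VOLATILE
AS $function$
    -- One byte of cryptographically strong randomness per output character,
    -- mapped onto the same 31-character alphabet.
    SELECT string_agg(
               substr('abcdefghjkmnpqrstuvwxyz23456789',
                      (get_byte(gen_random_bytes(1), 0) % 31) + 1,
                      1),
               '')
    FROM generate_series(1, $1);
$function$;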
On Tue, Apr 2, 2013 at 2:01 PM, Samantha Atkins <sjatkins@me.com> wrote:
> Natural keys are in user data space. Thus they are not guaranteed invariant
> and therefore cannot serve as persistent identity.

That is true, but irrelevant in most real world cases. Also, nothing is keeping you from using an extra marker if/when you need to provide an invariant lookup.

> Also natural keys have the considerable defect of being of different type
> and arity per logical entity type. This means that very nice things like
> dynamic relationships and persistent collections of references to multiple
> types of things cannot be built. It also increases ORM and data cache
> complexity.

OO evangelism.

> These are considerable weaknesses.
>
> You cannot guess a uuid from say uuid.uuid4(). If you think you can I look
> forward to seeing your proof.

I never claimed that. I said that postgresql random() can be guessed, which it can, since it's based on lrand48.

merlin
On Tue, Apr 2, 2013 at 11:16 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> This is pseudo random and can be guessed, which is maybe dangerous
> depending on circumstance. For stronger random stream go to
> pgcrypto.gen_random_bytes(). Also, now you have to worry about
> collisions -- the whole point of uuid is to try and keep you from
> having to deal with that.
Right, but it's better than using serials as far as being guessable.

The probability of collisions is fairly low if you are using 12 or more characters (with 30 possible characters). I'm not sure what the math is on the probability of collisions (birthday problem) though, and you could have a trigger that checked for the existence of a matching key before inserts/updates.

And UUIDs would be too long for lots of purposes where people are working with the numbers, and where there might be external constraints on how long the IDs can be.
An example use case is a product URL where "cd8f90c81a" is the 'uuid' for that product. It's non-guessable enough, and I don't want to put a full UUID in the URL.
On 03/04/13 06:37, Merlin Moncure wrote:
> On Tue, Apr 2, 2013 at 2:01 PM, Samantha Atkins <sjatkins@me.com> wrote:
>> Natural keys are in user data space. Thus they are not guaranteed invariant
>> and therefore cannot serve as persistent identity.

Can't find Samantha's original post. I agree, but done right they can be (persistent and unique).

> That is true, but irrelevant in most real world cases. Also, nothing
> is keeping you from using an extra marker if/when you need to provide
> an invariant lookup.
>
>> Also natural keys have the considerable defect of being of different type
>> and arity per logical entity type. This means that very nice things like
>> dynamic relationships and persistent collections of references to multiple
>> types of things cannot be built. It also increases ORM and data cache
>> complexity.
>
> OO evangelism.
>
>> These are considerable weaknesses.

The Google cache is filled with "foo vs bar", i.e. natural vs surrogate, and it's amazing what you get; even "surrogate key" on Wikipedia comes with an "its sources remain unclear because it lacks inline citations" disclaimer. I consider it pretty much a non-debate.

>> You cannot guess a uuid from say uuid.uuid4(). If you think you can I look
>> forward to seeing your proof.
>
> I never claimed that. I said that postgresql random() can be guessed,
> which it can, since it's based on lrand48.

Trying to get back on topic with the original post: I have the ISO (?) country code table, and I have no problem with the PK being the char(2) country code. These natural codes/keys are thought out, debated and eventually decided by other people. But I also don't have a problem with adding an integer (serial) column and making that my PK, although that PK might be pretty useless to the rest of the world. So that's that: having to really think it out is probably a good sign that you should stick to a surrogate unless you are really sure. (Again, I don't advocate ON UPDATE CASCADE as a solution should you change your mind.)

As to the whole natural vs surrogate/synthetic key debate, as I mentioned in an earlier post I use them both. The question is when a key is worthy of being a PK:

1) cust_id=123 (surrogate: PK) vs
2) cust_id=1231 (natural: checkbit, such as barcode data etc.) vs
3) cust_id=<natural: uuencoded binary that spits out "123" after being decoded by the app>

For me, 2) is more likely to be a PK than 3), but it is entirely possible that neither would be a PK.

Global/universal unique identifiers are useful with replication and database partitioning (in my instance):

1) cust_id=<uuid> vs
2) cust_id=<shard_id>-<something_extra>-<cust_id>

1) will work, but 128 bits is a lot of random data that could be useful to the app. 2) is not as universally unique, but if that was ever a problem I could also wrap it in an encoded binary with a small change to the app and no change to the database, and now it resembles something truly random. The difference is that 2) is more useful and contains "routing" data.

These are all natural and exposed to the world. But the question still remains: are they worthy of being a PK within the database of origin? So far the answer from me is "doesn't have to be", but everyone else might think it is; they may even make it a PK.

Regards,
Jules.
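A minimal sketch of that country-code case; the customers table and its columns are illustrative:

CREATE TABLE countries (
    country_code char(2) PRIMARY KEY,   -- ISO 3166-1 alpha-2, maintained elsewhere
    country_name text NOT NULL
);

CREATE TABLE customers (
    cust_id      serial PRIMARY KEY,    -- internal surrogate key
    country_code char(2) NOT NULL REFERENCES countries
);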
Natural keys are in user data space. Thus they are not guaranteed invariant and therefore cannot serve as persistent identity.

Also natural keys have the considerable defect of being of different type and arity per logical entity type. This means that very nice things like dynamic relationships and persistent collections of references to multiple types of things cannot be built. It also increases ORM and data cache complexity.

These are considerable weaknesses.

You cannot guess a uuid from say uuid.uuid4(). If you think you can I look forward to seeing your proof.

- samantha

On Apr 2, 2013, at 11:16 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> This is pseudo random and can be guessed, which is maybe dangerous
> depending on circumstance. For stronger random stream go to
> pgcrypto.gen_random_bytes(). Also, now you have to worry about
> collisions -- the whole point of uuid is to try and keep you from
> having to deal with that.
>
> My historical comments in this debate are noted. To summarize, I
> strongly believe that natural keys are often (but not always) better.
On 3/4/13 at 1:49 PM, Julian wrote:
> ... having to really think it out is probably a good sign that you
> should stick to a surrogate unless you are really sure. (again I don't
> advocate ON UPDATE CASCADE as a solution should you change your mind)

OK, this is interesting. Why not cascade?

Assume someone makes the dB design as straightforward as possible, avoids obfuscation of key values (since this mostly only gets the present and the next developer into trouble, not the mythical external hacker), and has constraints with cascaded updates in place to keep it all consistent. Something changes in the real world, the DBA makes the dB reflect this change, and the cascade ensures everything is still consistent. Where is the problem with this?

When there is a lot of work involved this needs to be taken into account, but what is the basis for such a general prohibition on a modern SQL dB? Why not use the feature?

Regards
Gavan Schneider
On Wed, Apr 3, 2013 at 2:58 AM, Gavan Schneider <pg-gts@snkmail.com> wrote:
> Why not cascade?
>
> When there is a lot of work involved this needs to be taken into account,
> but what is the basis for such a general prohibition on a modern SQL dB?
> Why not use the feature?

Well, the disadvantages are:

*) you could end up cascading to a large number of records
*) with a consequently large number of locks
*) a lot of database developers are oblivious to how RI works
*) a schema-level change of a key is a lot easier to do (but really, the value of this is overstated)

Basically, it boils down to: more expensive updates, and schema changes are slightly easier.

At the end of the day, once you sift through the mountains of BS that tend to characterize the natural/surrogate debate (which I am going to very carefully define as using an automatically generated identifier such as a UUID for purposes of relating), what you end up with is a system that is optimized for updates at the expense of many other things, especially large data-mining queries (more indexes and more joins) and general readability. In many cases this is a fair, or even a good, trade-off... but not always, so we have to be flexible.

What seals the deal for me, though, is that there is a near-perfect correlation between 100% surrogate databases and what I would describe as gross modeling errors.

merlin
On 2013-04-02, Joe Van Dyk <joe@tanga.com> wrote:
> On Tue, Apr 2, 2013 at 11:16 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> This is pseudo random and can be guessed, which is maybe dangerous
>> depending on circumstance. For stronger random stream go to
>> pgcrypto.gen_random_bytes().
>
> Right, but it's better than using serials as far as being guessable.
>
> The probability of collisions is fairly low if you are using 12 or more
> characters (with 30 possible characters). Not sure what the math is on the
> probability of collisions (birthday problem) though..

For n samples of p possibles it's approximately n(n-1)/(2p), for n^2 < p. An alphabet of 30 symbols is almost 5 bits per symbol, so for 12 symbols you get about 60 bits, which is almost half as wide as a UUID (128 bits).

> and you could have a
> trigger that checked for the existence of a matching key before
> inserts/updates.

And the associated race condition... no thanks.

You could encrypt a serial type using some reversible encryption; e.g. there's a Feistel cipher implemented in plpgsql on the wiki somewhere.

>> My historical comments in this debate are noted. To summarize, I
>> strongly believe that natural keys are often (but not always) better.

I'll use natural keys where they are short enough to be useful and guaranteed constant.
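A quick sanity check of those numbers in plain SQL (illustrative: a 30-symbol alphabet, 12-symbol keys, one million generated keys):

SELECT 30.0 ^ 12                            AS possible_keys,            -- about 5.3e17
       round(12 * log(2.0, 30.0), 1)        AS bits_per_key,             -- about 58.9, "almost 60 bits"
       (1e6 * (1e6 - 1)) / (2 * 30.0 ^ 12)  AS approx_collision_chance;  -- about 9.4e-7 for a million keys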