Thread: Using varchar primary keys.
Consider the following scenario.
I have a typical tagging structure. There is a table called tags, there is a table called taggings. The taggings table acts as a many to many join table between the taggers and the tags.
The tags table has two fields, id and tag. id is a serial data type.

The taggings table has a tag_id field along with the context, tagger_id etc.

I don't think there is even one query in this scenario which does not join the taggings table to the tags table, so I am wondering if there is any advantage at all of having that id field in the tags table. Why shouldn't I just put the tag itself as the primary key? The tags are quite short, so if pg is smart about it I would guess they would take up less space than an integer in some cases.

I guess the same scenario happens in many of my lookup tables. I have all these tables with just two fields in them: an id field and then some varchar field. Then every time I want to display that record I join with five other tables so I can look up all the associated lookup tables and display the text value to the user.

Given that you can still set foreign keys and cascade commands to adjust child records either way, the tables would be properly normalized, so I am wondering what I am gaining by using these serial ID fields.

Is this false economy?
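For concreteness, a minimal sketch of the two layouts being weighed; the context and tagger_id columns come from the description above, but their types are illustrative:

-- Variant A: serial surrogate key on tags
CREATE TABLE tags (
    id  serial PRIMARY KEY,
    tag text NOT NULL UNIQUE
);

CREATE TABLE taggings (
    tag_id    integer NOT NULL REFERENCES tags (id),
    tagger_id integer NOT NULL,
    context   text
);

-- Variant B: the tag itself as the key
CREATE TABLE tags_natural (
    tag text PRIMARY KEY
);

CREATE TABLE taggings_natural (
    tag       text NOT NULL REFERENCES tags_natural (tag),
    tagger_id integer NOT NULL,
    context   text
);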
On 3/31/13, Tim Uckun <timuckun@gmail.com> wrote:
> Given that you can still set foreign keys and cascade commands to adjust
> child records either way the tables would be properly normalized so I am
> wondering what I am gaining by using these serial ID fields.

You can. Arguably, if the natural key is relatively short and consists of only one column, E.F. Codd would probably have encouraged you to use it. I'd leave the serial column for the convenience of identifying individual records on the command line, especially if the varchar can store characters that are not easily typed in a console.

As for the primary key being a 'natural key' vs. a 'synthetic' one, that's a huge debate akin to emacs vs. vi. Depending on your situation, there's nothing wrong with using natural keys. It is discussed at length, and in some cases encouraged, here:

http://database-programmer.blogspot.com/2008/01/database-skills-sane-approach-to.html

-Modulok-
On 01/04/13 12:19, Modulok wrote:
> As for the primary key being a 'natural key' vs. a 'synthetic' one, that's
> a huge debate akin to emacs vs. vi. Depending on your situation, there's
> nothing wrong with using natural keys. It is discussed at length, and in
> some cases encouraged, here:
>
> http://database-programmer.blogspot.com/2008/01/database-skills-sane-approach-to.html

I disagree with almost everything in that blog post.

Shouldn't the first question be to figure out why even a bigserial (int8) would not suffice as a unique identifier for row data? There are cases where it won't; what are they? (There's a great blog post by a Twitter developer on the development of their unique identifier (PK); I couldn't find it.)

Natural Keys have a purpose but when do they exist in the database? For example, a Surrogate Key that is exposed to the world may get wrapped in a checkbit algorithm, or perhaps even an HMAC-encoded key, to validate the genuineness of incoming requests before an expensive database lookup is initiated. So is there a need to store the Natural Keys generated in the application? It might be useful to external parties, though they may even call it "foos_pks".

What about the Natural Keys of an external source? Should that be stored in a lookup table along with the integer based Surrogate Key? Maybe I'll call it "bars_pks".

As to the original question, my response may seem off topic, but exposing PKs to the world has some pitfalls (not all of them to do with security), and character-based keys have pitfalls compared to Surrogate Keys that stay out of the displayed data.
CREATE TABLE tags (tag text PRIMARY KEY);
CREATE TABLE child_tag (parent_tag text REFERENCES tags);

INSERT INTO tags VALUES ('fu');
INSERT INTO child_tag VALUES ('fu');

--Spelling error.
UPDATE tags SET tag = 'foo' WHERE tag = 'fu';

This will fail unless you ON UPDATE CASCADE. Some things to think about.

Regards,
Jules.
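For comparison, a minimal sketch of the cascading variant Jules alludes to: the same two tables, with the referential action spelled out so the child row follows the parent.

CREATE TABLE tags (tag text PRIMARY KEY);
CREATE TABLE child_tag (
    parent_tag text REFERENCES tags ON UPDATE CASCADE
);

INSERT INTO tags VALUES ('fu');
INSERT INTO child_tag VALUES ('fu');

-- Now the correction propagates instead of failing:
UPDATE tags SET tag = 'foo' WHERE tag = 'fu';
SELECT * FROM child_tag;   -- parent_tag is now 'foo'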
On 1/4/13 at 10:35 AM, Tim Uckun wrote:
> Consider the following scenario.

Since I don't consider myself an expert I am talking in basic terms below. Please don't misunderstand this plodding as anything other than me attempting to be as clear as possible. These comments reflect my understanding of Joe Celko's view as expressed in his books. I fully understand that many respectable SQL masters (respectfully) disagree with Celko in this. My take is to use his approach as a starting point and only deviate when a different approach is obviously much easier for me. (I doubt it's better. :) YMMV.

> I have a typical tagging structure. There is a table called tags, there is
> a table called taggings. The taggings table acts as a many to many join
> table between the taggers and the tags.
>
> The tags table has two fields id and tag. id is a serial data type.

Celko would suggest the table name is the plural of its primary key, so a table named tags has tag as the name of its primary key. And this is a big hint: the sequential index is not needed. He further (very strongly) suggests that primary keys assume their natural values and not be referenced by arbitrary (sequential) values, i.e., avoid pseudo-keys, as neither SQL nor modern databases need them.

> The taggings has a tag_id field along with the context, tagger_id etc.

So this would instead have two columns named "tag" and "tagger" (each referencing the column of the same name in tables "tags" and "taggers") and whatever is implied by the "etc". If the pair (tag, tagger) is meant to be unique then this is a good candidate for the primary key, otherwise a further term (or terms) needs to be dragged into the composite index (e.g., a timestamp). Once again there is no logical need for a sequential integer to be stored as well.

> ....
>
> Given that you can still set foreign keys and cascade commands to adjust
> child records either way the tables would be properly normalized so I am
> wondering what I am gaining by using these serial ID fields.
>
> Is this false economy?

I think so. If the tags table has both a sequential integer and the unique tag value then there is likely storage and I/O associated with two indexes, along with storage of both the tag and the associated integer.

If the list of tags is short enough it might be suitable to use an ENUM to enforce both integrity and brevity. If the list of tags is longer and/or might need updating then a single-column table will allow for this. In general, I think lists of things (e.g., tag values) should be kept within the dB so they can be used for integrity checking, etc. I don't like the idea of externalising this job to the interface application.

Also, as you have mentioned, you can elegantly handle table renovation, e.g., if a given tag needs its name changed, it can be done via a foreign key constraint with cascade. (Elegant here refers to the syntax; the actual I/O implications may be very ugly depending on the scale of data that needs rewriting. :)

In another situation the tags table might well contain columns such as: tag, tag_full_name, tag_origin, tag_authority_certificate/expiry/whatever. In such an instance the tag is better if it's human readable/meaningful, so reports can be useful without always back-joining the full name. Sequential integers rarely fulfil this role, as implied by the original question.

Regards
Gavan Schneider
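A minimal sketch of the layout Gavan describes, with tag and tagger as natural keys; the taggers table follows his wording, and the tagged_at column is an illustrative example of the extra term he mentions:

CREATE TABLE taggers (tagger text PRIMARY KEY);
CREATE TABLE tags    (tag    text PRIMARY KEY);

CREATE TABLE taggings (
    tag       text NOT NULL REFERENCES tags,
    tagger    text NOT NULL REFERENCES taggers,
    tagged_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tag, tagger)   -- or (tag, tagger, tagged_at) if the pair is not unique
);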
Hi Tim,
Debate "natural" vs "surrogate" key last for a too long time - with no official winner...
I think it is more "religional" then real issue...
Advice will be simple: pick approach what best fit your needs, taking into account now and tomorrow (probability of change)...
SQL standard and normal forms are fine with both religions... And it really has very small impact on: speed and storage size (real things) - that is not worth mentioning
I think, I personally could be put to "surrogate" religion - but in described scenario, having tables with just two columns: pk and unique - I would never suggest that, because of simply from practical, common sense, view - it doesn't make sense...
So probably in your scenario you should pick "natural" approach - for described scenarios...
Kind Regards,
Misa
Debate "natural" vs "surrogate" key last for a too long time - with no official winner...
I think it is more "religional" then real issue...
Advice will be simple: pick approach what best fit your needs, taking into account now and tomorrow (probability of change)...
SQL standard and normal forms are fine with both religions... And it really has very small impact on: speed and storage size (real things) - that is not worth mentioning
I think, I personally could be put to "surrogate" religion - but in described scenario, having tables with just two columns: pk and unique - I would never suggest that, because of simply from practical, common sense, view - it doesn't make sense...
So probably in your scenario you should pick "natural" approach - for described scenarios...
Kind Regards,
Misa
From: Tim Uckun
Sent: 01/04/2013 01:36
To: pgsql-general
Subject: [GENERAL] Using varchar primary keys.
Consider the following scenario.
I have a typical tagging structure. There is a table called tags, there is a table called taggings. The taggings table acts as a many to many join table between the taggers and the tags.
The tags table has two fields id and tag. id is a serial data type.I have a typical tagging structure. There is a table called tags, there is a table called taggings. The taggings table acts as a many to many join table between the taggers and the tags.
Given that you can still set foreign keys and cascade commands to adjust child records either way the tables would be properly normalized so I am wondering what I am gaining by using these serial ID fields.
> Natural Keys have a purpose but when do they exist in the database?
In my case it's the tags. Table tags has only two columns, id and tag. Each one has a unique index. I have many other similar lookup tables, for example things like order_status, item_type, etc.
> What about the Natural Keys of an external source? Should that be stored
> in a lookup table along with the integer based Surrogate Key?
> Maybe I'll call it "bars_pks".
I always keep those. The APIs I expose to clients and partners allow them to modify the records they send, and I let them refer to items by their own primary keys. This has always worked out well for me.
> --Spelling error.
> UPDATE tags SET tag = 'foo' WHERE tag = 'fu';
> This will fail unless you ON UPDATE CASCADE.
Yes, of course you'd need the ON UPDATE CASCADE. Then again, maybe I don't even need that tags table. Tags could just be a view (select distinct tag from child_tags). I am not saying that's efficient or desirable, but it's possible.
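A rough sketch of that idea, assuming the tag strings live directly in the join table (called taggings here; Tim writes child_tags above):

CREATE VIEW tags AS
    SELECT DISTINCT tag
    FROM taggings;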
On 2013-04-01, Tim Uckun <timuckun@gmail.com> wrote:
> In my case it's the tags. Table tags has only two columns, id and tag. Each
> one has a unique index. I have many other similar lookup tables, for example
> things like order_status, item_type, etc.

How about using an enum instead of this table?
> How about using an enum instead of this table?
That's an interesting idea. Are enums mutable?
On Mon, Apr 01, 2013 at 07:08:15PM +1300, Tim Uckun wrote:
>> How about using an enum instead of this table?
>
> That's an interesting idea. Are enums mutable?

Since 9.1 you can add values:

http://www.postgresql.org/docs/9.1/static/sql-altertype.html
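A minimal sketch of what that looks like for one of the lookup values mentioned earlier (order_status); the specific labels are illustrative:

CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id     serial PRIMARY KEY,
    status order_status NOT NULL DEFAULT 'pending'
);

-- Since 9.1, new labels can be appended or positioned relative to existing
-- ones (before PostgreSQL 12 this cannot be run inside a transaction block):
ALTER TYPE order_status ADD VALUE 'cancelled';
ALTER TYPE order_status ADD VALUE 'refunded' AFTER 'shipped';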
On Mon, Apr 1, 2013 at 1:22 AM, Jasen Betts <jasen@xnet.co.nz> wrote:
> On Mon, Apr 01, 2013 at 07:08:15PM +1300, Tim Uckun wrote:
>> That's an interesting idea. Are enums mutable?
>
> since 9.1 you can add values.
>
> http://www.postgresql.org/docs/9.1/static/sql-altertype.html

It's an interesting idea, but I don't think enums are designed to act as a primary key except in cases where the data is basically static and relatively small. For starters, any manipulation of the enum requires a lock.

enums can be a real life saver when you need custom ordering built into a string, especially if that ordering is floated over a composite index.

merlin
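A small sketch of that ordering point; the severity type and the events table are illustrative:

CREATE TYPE severity AS ENUM ('low', 'medium', 'high', 'critical');

CREATE TABLE events (
    logged_at timestamptz NOT NULL DEFAULT now(),
    severity  severity    NOT NULL,
    message   text
);

-- Enum values compare in declaration order, not alphabetically, so this
-- composite index can serve ORDER BY severity DESC, logged_at DESC directly:
CREATE INDEX events_severity_logged_at_idx ON events (severity, logged_at);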
On the topic of 'natural' versus 'synthetic' primary keys, I am generally in the camp that an extra ID field won't cost you too much, and while one may not need it for a simple table (i.e. id, name) one might add any number of columns later, and you'll be glad to have it.
I am, however, against using sequences (or serial integers in Postgres) for reasons of scaling and replication across multiple copies of a database running on different servers.

My preferred method is to give every table an ID column of UUID type and generate a UUID using the uuid-ossp contrib module. This also prevents someone not familiar with the database design from using an ID somewhere they should not (as is possible with natural PKs) or treating the ID as an integer, not an identifier (as is all too common with serial integers).
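A minimal sketch of that approach, assuming the uuid-ossp extension is available (uuid_generate_v4() is just one of that module's generators); the table reuses the thread's tags example:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE tags (
    id  uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
    tag text NOT NULL UNIQUE
);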
On Tue, Apr 2, 2013 at 8:35 AM, jesusthefrog <jesusthefrog@gmail.com> wrote:
> On the topic of 'natural' versus 'synthetic' primary keys, I am generally in the camp that an extra ID field won't cost you too much, and while one may not need it for a simple table (i.e. id, name) one might add any number of columns later, and you'll be glad to have it.
Nothing prevents you from adding more columns if you use varchar primary keys.
> My preferred method is to give every table an ID column of UUID type and generate a UUID using the uuid-ossp contrib module. This also prevents someone not familiar with the database design from using an ID somewhere they should not (as is possible with natural PKs) or treating the ID as an integer, not an identifier (as is all too common with serial integers).
This would be a concern if you had multi-master writes. As far as I know Postgres does not have a true multi-master replication system, so all the writes have to happen on one server, right?

As for UUIDs, I use them sometimes, but I tend to also use one serial column, because when I am communicating with people it makes it so much easier to say "dealer number X" than "dealer number SOME_HUGE_UUID". I often have to talk to people about the data, and UUIDs make it very difficult to communicate with humans.
On 02/04/13 06:35, jesusthefrog wrote:
> On the topic of 'natural' versus 'synthetic' primary keys, I am
> generally in the camp that an extra ID field won't cost you too much,
> and while one may not need it for a simple table (i.e. id, name) one
> might add any number of columns later, and you'll be glad to have it.

Definitely; a lookup table can be extended to hold all kinds of statistics and translations on a "tag". I didn't look too deeply into the actual usage (not sure if any real examples were given). But ON UPDATE CASCADE is not a feature I would recommend in this context (referencing a PK), if at all.

> I am, however, against using sequences (or serial integers in Postgres)
> for reasons of scaling and replication across multiple copies of a
> database running on different servers.
>
> My preferred method is to give every table an ID column of UUID type and
> generate a UUID using the uuid-ossp contrib module. This also prevents
> someone not familiar with the database design from using an ID somewhere
> they should not (as is possible with natural PKs) or treating the ID as
> an integer, not an identifier (as is all too common with serial integers).

It was Instagram's own implementation, but it mentions Twitter's "Snowflake". An interesting read; it still only requires a bigint. http://goo.gl/gRlXC

IMO uuid is convenient but large, and for all that data it doesn't hold anything useful. I have been tempted though.

Jules.
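For reference, a rough sketch of the kind of shard-aware bigint generator the linked post describes; the bit layout (41 bits of milliseconds since a custom epoch, 13 bits of shard id, 10 bits of per-shard sequence), the epoch value and the names are illustrative, not a copy of Instagram's code:

CREATE SEQUENCE shard_1_id_seq;

CREATE OR REPLACE FUNCTION next_shard_id(shard_id integer)
RETURNS bigint
LANGUAGE plpgsql
AS $$
DECLARE
    custom_epoch bigint := 1314220021721;   -- an arbitrary starting point, in milliseconds
    now_millis   bigint;
    seq_id       bigint;
BEGIN
    seq_id     := nextval('shard_1_id_seq') % 1024;                      -- 10 bits of sequence
    now_millis := floor(extract(epoch FROM clock_timestamp()) * 1000);   -- milliseconds since Unix epoch
    RETURN ((now_millis - custom_epoch) << 23)
           | (shard_id::bigint << 10)
           | seq_id;
END;
$$;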
On 02/04/13 08:35, jesusthefrog wrote:
> On the topic of 'natural' versus 'synthetic' primary keys, I am generally
> in the camp that an extra ID field won't cost you too much ...

I use synthetic primary keys because I want to minimise changes to the database when the user, directly or due to some law change, changes a 'natural' value. Using synthetic primary keys minimises changes to a database when a 'natural' value changes: if people's names are part of many natural keys, then when people change their name (like when a woman gets married), only one table needs to change. Likewise the customer number the manager swore would never change; now they want to change from a numeric key to an alphanumeric one.
Using 'natural' values for a primary key seems to be deliberately adding potential time bombs. Almost as bad as the misguided idea that using 'sudo' is safer than the alternative when executing commands as root!
Cheers,
Gavin
On Mon, Apr 1, 2013 at 1:11 PM, Tim Uckun <timuckun@gmail.com> wrote:
> As for UUIDs I use them sometimes but I tend to also use one serial column
> because when I am communicating with people it makes it so much easier to
> say "dealer number X" than "dealer number SOME_HUGE_UUID". I often have to
> talk to people about the data and UUIDs make it very difficult to
> communicate with humans.
I've been wishing for a smaller uuid type for a while. Say you need to assign an Order #. Customers might be referencing the number, so you don't want it to be too long. But you don't want Order #'s to be guessable, or to have the characters/digits be transposed accidentally.
I've been using a unique text column with a default of random_characters(12)
CREATE OR REPLACE FUNCTION public.random_characters(length integer)
RETURNS text
LANGUAGE sql
STABLE
AS $function$
    -- Build a string of $1 characters, each picked pseudo-randomly from a
    -- 31-character alphabet that avoids easily confused characters.
    SELECT array_to_string(array((
        SELECT SUBSTRING('abcdefghjkmnpqrstuvwxyz23456789'
                         FROM mod((random()*31)::int, 31)+1 FOR 1)
        FROM generate_series(1, $1))), '');
$function$;
This seems to work ok. I don't allow 1's or l's or 0's or O's, as they can easily be mistyped or misread.
On Tue, Apr 2, 2013 at 10:34 AM, Joe Van Dyk <joe@tanga.com> wrote:
> I've been wishing for a smaller uuid type for a while. Say you need to
> assign an Order #. Customers might be referencing the number, so you don't
> want it to be too long. But you don't want Order #'s to be guessable, or to
> have the characters/digits be transposed accidentally.
>
> I've been using a unique text column with a default of random_characters(12)
>
> This seems to work ok. I don't allow 1's or l's or 0's or O's, as they can
> easily be mistyped or misread.

This is pseudo random and can be guessed, which is maybe dangerous depending on circumstance. For stronger random stream go to pgcrypto.gen_random_bytes(). Also, now you have to worry about collisions -- the whole point of uuid is to try and keep you from having to deal with that.

My historical comments in this debate are noted. To summarize, I strongly believe that natural keys are often (but not always) better.

merlin
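A rough sketch of the same helper built on pgcrypto instead, as suggested above; the function name is illustrative, the alphabet is carried over from Joe's version, and the modulo mapping keeps a slight bias that is usually acceptable for this purpose:

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION public.random_characters_strong(length integer)
RETURNS text
LANGUAGE sql
VOLATILE
AS $function$
    -- One byte of cryptographically strong randomness per output character,
    -- mapped onto the same 31-character alphabet.
    SELECT string_agg(
               substr('abcdefghjkmnpqrstuvwxyz23456789',
                      (get_byte(gen_random_bytes(1), 0) % 31) + 1,
                      1),
               '')
    FROM generate_series(1, $1);
$function$;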
On Tue, Apr 2, 2013 at 2:01 PM, Samantha Atkins <sjatkins@me.com> wrote:
> Natural keys are in user data space. Thus they are not guaranteed invariant
> and therefore cannot serve as persistent identity.

That is true, but irrelevant in most real world cases. Also, nothing is keeping you from using an extra marker if/when you need to provide an invariant lookup.

> Also natural keys have the considerable defect of being of different type
> and arity per logical entity type. This means that very nice things like
> dynamic relationships and persistent collections of references to multiple
> types of things cannot be built. It also increases ORM and data cache
> complexity.

OO evangelism.

> These are considerable weaknesses.
>
> You cannot guess a uuid from say uuid.uuid4(). If you think you can I look
> forward to seeing your proof.

I never claimed that. I said that postgresql random() can be guessed, which it can, since it's based on lrand48.

merlin
On Tue, Apr 2, 2013 at 11:16 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> This is pseudo random and can be guessed, which is maybe dangerous
> depending on circumstance. For stronger random stream go to
> pgcrypto.gen_random_bytes(). Also, now you have to worry about
> collisions -- the whole point of uuid is to try and keep you from
> having to deal with that.
Right, but it's better than using serials as far as being guessable.

The probability of collisions is fairly low if you are using 12 or more characters (with 30 possible characters). I'm not sure what the math is on the probability of collisions (birthday problem) though, and you could have a trigger that checked for the existence of a matching key before inserts/updates.

And UUIDs would be too long for lots of purposes where people are working with the numbers, and where there might be external constraints on how long the IDs can be.
An example use case is a product URL where "cd8f90c81a" is the 'uuid' for that product. It's non-guessable enough, and I don't want to put a full UUID in the URL.
On 03/04/13 06:37, Merlin Moncure wrote:
> On Tue, Apr 2, 2013 at 2:01 PM, Samantha Atkins <sjatkins@me.com> wrote:
>> Natural keys are in user data space. Thus they are not guaranteed invariant
>> and therefore cannot serve as persistent identity.

Can't find Samantha's original post. I agree, but done right they can be (persistent and unique).

> That is true, but irrelevant in most real world cases. Also, nothing
> is keeping you from using an extra marker if/when you need to provide
> an invariant lookup.
>
>> Also natural keys have the considerable defect of being of different type
>> and arity per logical entity type. This means that very nice things like
>> dynamic relationships and persistent collections of references to multiple
>> types of things cannot be built. It also increases ORM and data cache
>> complexity.
>
> OO evangelism.
>
>> These are considerable weaknesses.

The Google cache is filled with "foo vs bar", i.e. natural vs surrogate, and it's amazing what you get; even "surrogate key" on Wikipedia comes with an "its sources remain unclear because it lacks inline citations" disclaimer. I consider it pretty much a non-debate.

>> You cannot guess a uuid from say uuid.uuid4(). If you think you can I look
>> forward to seeing your proof.
>
> I never claimed that. I said that postgresql random() can be guessed,
> which it can, since it's based on lrand48.

Trying to get back on topic with the original post: I have the ISO (?) country code table, and I have no problem with the PK being the char(2) country code. These natural codes/keys are thought out, debated and eventually decided by other people. But I also don't have a problem with adding an integer (serial) column and making that my PK, although that PK might be pretty useless to the rest of the world. So that's that: having to really think it out is probably a good sign that you should stick to a surrogate unless you are really sure. (Again, I don't advocate ON UPDATE CASCADE as a solution should you change your mind.)

As to the whole natural vs surrogate/synthetic key debate, as I mentioned in an earlier post I use them both. The question is when a key is worthy of being a PK:

1) cust_id=123 (surrogate: PK) vs
2) cust_id=1231 (natural: checkbit, such as barcode data etc.) vs
3) cust_id=<natural: uuencoded binary that spits out "123" after being decoded by the app>

For me, 2) is more likely to be a PK than 3), but it is entirely possible that neither would be a PK.

Global/universal unique identifiers are useful with replication and database partitioning (in my instance):

1) cust_id=<uuid> vs
2) cust_id=<shard_id>-<something_extra>-<cust_id>

1) will work, but 128 bits is a lot of random data that could be useful to the app. 2) is not as universally unique, but if that was ever a problem I could also wrap it in an encoded binary with a small change to the app and no change to the database, and now it resembles something truly random. The difference is that 2) is more useful and contains "routing" data.

These are all natural and exposed to the world. But the question still remains: are they worthy of being a PK within the database of origin? So far the answer from me is "doesn't have to be", but everyone else might think it is; they may even make it a PK.

Regards,
Jules.
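A minimal sketch of that country-code case; the customers table and its columns are illustrative:

CREATE TABLE countries (
    country_code char(2) PRIMARY KEY,   -- ISO 3166-1 alpha-2, maintained elsewhere
    country_name text NOT NULL
);

CREATE TABLE customers (
    cust_id      serial PRIMARY KEY,    -- internal surrogate key
    country_code char(2) NOT NULL REFERENCES countries
);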
Natural keys are in user data space. Thus they are not guaranteed invariant and therefore cannot serve as persistent identity.

Also natural keys have the considerable defect of being of different type and arity per logical entity type. This means that very nice things like dynamic relationships and persistent collections of references to multiple types of things cannot be built. It also increases ORM and data cache complexity.

These are considerable weaknesses.

You cannot guess a uuid from say uuid.uuid4(). If you think you can I look forward to seeing your proof.

- samantha

On Apr 2, 2013, at 11:16 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> This is pseudo random and can be guessed, which is maybe dangerous
> depending on circumstance. For stronger random stream go to
> pgcrypto.gen_random_bytes(). Also, now you have to worry about
> collisions -- the whole point of uuid is to try and keep you from
> having to deal with that.
>
> My historical comments in this debate are noted. To summarize, I
> strongly believe that natural keys are often (but not always) better.
On 3/4/13 at 1:49 PM, Julian wrote:
> ... having to really think it out is probably a good sign that you
> should stick to a surrogate unless you are really sure. (again I don't
> advocate ON UPDATE CASCADE as a solution should you change your mind)

OK, this is interesting. Why not cascade?

Assume someone makes the dB design as straightforward as possible, avoids obfuscation of key values (since this mostly only gets the present and the next developer into trouble, not the mythical external hacker), and has constraints with cascaded updates in place to keep it all consistent. Something changes in the real world, the DBA makes the dB reflect this change, and the cascade ensures everything is still consistent. Where is the problem with this?

When there is a lot of work involved this needs to be taken into account, but what is the basis for such a general prohibition on a modern SQL dB? Why not use the feature?

Regards
Gavan Schneider
On Wed, Apr 3, 2013 at 2:58 AM, Gavan Schneider <pg-gts@snkmail.com> wrote:
> Why not cascade?
>
> When there is a lot of work involved this needs to be taken into account,
> but what is the basis for such a general prohibition on a modern SQL dB?
> Why not use the feature?

Well, the disadvantages are:

*) you could end up cascading to a large number of records
*) with a consequently large number of locks
*) a lot of database developers are oblivious to how RI works
*) a schema-level change of a key is a lot easier to do (but really, the value of this is overstated)

Basically, it boils down to: more expensive updates, and schema changes are slightly easier.

At the end of the day, once you sift through the mountains of BS that tend to characterize the natural/surrogate debate (which I am going to very carefully define as using an automatically generated identifier such as a UUID for purposes of relating), what you end up with is a system that is optimized for updates at the expense of many other things, especially large data-mining queries (more indexes and more joins) and general readability. In many cases this is a fair, or even a good, trade-off... but not always, so we have to be flexible.

What seals the deal for me, though, is that there is a near-perfect correlation between 100% surrogate databases and what I would describe as gross modeling errors.

merlin
On 2013-04-02, Joe Van Dyk <joe@tanga.com> wrote:
> On Tue, Apr 2, 2013 at 11:16 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> This is pseudo random and can be guessed, which is maybe dangerous
>> depending on circumstance. For stronger random stream go to
>> pgcrypto.gen_random_bytes().
>
> Right, but it's better than using serials as far as being guessable.
>
> The probability of collisions is fairly low if you are using 12 or more
> characters (with 30 possible characters). Not sure what the math is on the
> probability of collisions (birthday problem) though..

For n samples of p possibles it's approximately n(n-1)/(2p), for n^2 < p. An alphabet of 30 symbols is almost 5 bits per symbol, so for 12 symbols you get about 60 bits, which is almost half as wide as a UUID (128 bits).

> and you could have a
> trigger that checked for the existence of a matching key before
> inserts/updates.

And the associated race condition... no thanks.

You could encrypt a serial type using some reversible encryption; e.g. there's a Feistel cipher implemented in plpgsql on the wiki somewhere.

>> My historical comments in this debate are noted. To summarize, I
>> strongly believe that natural keys are often (but not always) better.

I'll use natural keys where they are short enough to be useful and guaranteed constant.
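A quick sanity check of those numbers in plain SQL (illustrative: a 30-symbol alphabet, 12-symbol keys, one million generated keys):

SELECT 30.0 ^ 12                            AS possible_keys,            -- about 5.3e17
       round(12 * log(2.0, 30.0), 1)        AS bits_per_key,             -- about 58.9, "almost 60 bits"
       (1e6 * (1e6 - 1)) / (2 * 30.0 ^ 12)  AS approx_collision_chance;  -- about 9.4e-7 for a million keys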