Thread: Basic Q on superfluous primary keys
Consider these two very similar schemas:
Schema 1:
CREATE TABLE foo (
  id serial PRIMARY KEY,
  frobnitz character varying(100) NOT NULL UNIQUE
);
CREATE TABLE bar (
  id serial PRIMARY KEY,
  foo_id int REFERENCES foo(id)
);
Schema 2:
CREATE TABLE foo (
  frobnitz character varying(100) PRIMARY KEY
);
CREATE TABLE bar (
  id serial PRIMARY KEY,
  frobnitz character varying(100) REFERENCES foo(frobnitz)
);
The two situations are semantically identical: each record in table bar refers to a record in table foo. The difference is that in the first schema, this referencing is done through an "artificial" serial-integer primary key, while in the second schema this reference is done through a data field that happens to be unique and not null, so it can serve as primary key.
I find Schema 1 awkward and unnatural; more specifically, foo.id seems unnecessary in light of the non-null uniqueness of foo.frobnitz. But I remember once reading that "long" fields like foo.frobnitz did not make good primary keys.
Is the field foo.id in Schema 1 superfluous? For example, wouldn't the referencing from bar to foo really be done "behind the scenes" through some hidden field (oid?) instead of through the frobnitz text field? Which of the two schemas would give better performance?
Thanks!
kj
In response to "Kynn Jones" <kynnjo@gmail.com>: > Consider these two very similar schemas: > > Schema 1: > > > CREATE TABLE foo ( > id serial PRIMARY KEY, > frobnitz character(varying 100) NOT NULL UNIQUE > ); > > > CREATE TABLE bar ( > id serial PRIMARY KEY, > foo_id int REFERENCES foo(id) > ) > > > Schema 2: > > > CREATE TABLE foo ( > frobnitz character(varying 100) PRIMARY KEY > ); > > > CREATE TABLE bar ( > id serial PRIMARY KEY, > frobnitz character(varying 100) REFERENCES foo(frobnitz) > ) > > > > > The two situations are semantically identical: each record in table bar > refers to a record in table foo. The difference is that in the first > schema, this referencing is done through an "artificial" serial-integer > primary key, while in the second schema this reference is done through a > data field that happens to be unique and not null, so it can serve as > primary key. The first case is call a "surrogate key". A little googling on that term will turn up a wealth of discussion -- both for and against. > I find Schema 1 awkward and unnatural; more specifically, foo.id seems > unnecessary in light of the non-null uniqueness of foo.frobnitz. But I > remember once reading that "long" fields like foo.frobnitz did not make good > primary keys. I had a discussion about this recently on the Drupal mailing lists, at the end of which I promised to do some benchmarking to determine whether or not text keys really do hurt performance of indexes. Unfortunately, I still haven't followed through on that promise -- maybe I'll get to it tomorrow. > Is the field foo.id in Schema 1 superfluous? For example, wouldn't the > referencing from bar to foo really be done "behind the scenes" through some > hidden field (oid?) instead of through the frobnitz text field? Which of > the two schemas would give better perfornance? -- Bill Moran Collaborative Fusion Inc. wmoran@collaborativefusion.com Phone: 412-422-3463x4023 **************************************************************** IMPORTANT: This message contains confidential information and is intended only for the individual named. If the reader of this message is not an intended recipient (or the individual responsible for the delivery of this message to an intended recipient), please be advised that any re-use, dissemination, distribution or copying of this message is prohibited. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. ****************************************************************
On 4/14/07, Bill Moran <wmoran@collaborativefusion.com> wrote:
> In response to "Kynn Jones" <kynnjo@gmail.com>:
> > The two situations are semantically identical: each record in table bar
> > refers to a record in table foo. The difference is that in the first
> > schema, this referencing is done through an "artificial" serial-integer
> > primary key, while in the second schema this reference is done through a
> > data field that happens to be unique and not null, so it can serve as
> > primary key.
>
> I had a discussion about this recently on the Drupal mailing lists, at the
> end of which I promised to do some benchmarking to determine whether or
> not text keys really do hurt performance of indexes. Unfortunately, I
> still haven't followed through on that promise -- maybe I'll get to it
> tomorrow.

The main reason why integer indexes are faster than natural counterparts
is that the index is smaller and puts less pressure on cache. This is
however offset by removing joins here and there, and you usually just end
up indexing the data anyways. Performance is kind of tangential to the
argument though -- I've seen databases using all natural keys and found
them to be very clean and performant.

Using surrogate keys is dangerous and can lead to very bad design habits
that are unfortunately so prevalent in the software industry they are
virtually taught in schools. Many software frameworks assume you use them
and refuse to work without them (avoid!). While there is nothing wrong
with them in principle (you are exchanging one key for another as a
performance optimization), they make it all too easy to create
denormalized designs and tables with no real identifying criteria, etc,
and the resultant stinky queries to put it all back together again (full
of unions, self joins, extraneous groups, case statements, etc.).

A good compromise in your designs is to identify your natural key but use
the surrogate if you have valid performance reasons:

CREATE TABLE foo (
  frobnitz_id int unique,
  frobnitz character varying(100) PRIMARY KEY
  [...]
);

frobnitz_id is of course optional and not necessary in all tables. It may
be a pain to relate a large table with a four or five part key, and
judicious use of surrogates may be justified for performance or even just
to keep your queries smaller:

create table order_line_item_discount (
  company_name text,
  order_no int,
  line_item_seq_no int,
  discount_code text,
  primary key (company_name, order_no, line_item_seq_no, discount_code)
)

becomes

create table order_line_item_discount (
  order_line_item_id int,
  discount_code text,
  primary key (order_line_item_id, discount_code)
)

merlin
Merlin Moncure wrote:
> Using surrogate keys is dangerous and can lead to very bad design
> habits that are unfortunately so prevalent in the software industry
> they are virtually taught in schools. ... While there is
> nothing wrong with them in principle (you are exchanging one key for
> another as a performance optimization), they make it all too easy to
> create denormalized designs and tables with no real identifying
> criteria, etc,...

Wow, that's the opposite of everything I've ever been taught, and all my
experience in the last few decades.

I can't recall ever seeing a "natural" key that was immutable. In my
business (chemistry), we've seen several disastrous situations where
companies picked keys they thought were natural and immutable, and years
down the road they discovered (for example) that chemical compounds they
thought were pure were in fact isotopic mixtures, or simply the wrong
molecule (as analytical techniques improved). Or during a corporate
takeover, they discovered that two companies using the same "natural" keys
had as much as 10% differences in their multi-million-compound databases.
These errors led to six-month to year-long delays, as each of the
conflicting chemical records had to be examined by hand by a PhD chemist
to reclassify it.

In other businesses, almost any natural identifier you pick is subject to
simple typographical errors. When you discover the errors in a field
you've used as a primary key, it can be quite hard to fix, particularly if
you have distributed data across several systems and schemas.

We've always recommended to our customers that all primary keys be
completely information free. They should not be based on any information
or combination of information from the data records. Every time the
customer has not followed this advice, they've later regretted it.

I'm sure there are situations where a natural key is appropriate, but I
haven't seen it in my work.

Craig
On 4/16/07, Craig A. James <cjames@modgraph-usa.com> wrote:
> Merlin Moncure wrote:
> > Using surrogate keys is dangerous and can lead to very bad design
> > habits that are unfortunately so prevalent in the software industry
> > they are virtually taught in schools. ... While there is
> > nothing wrong with them in principle (you are exchanging one key for
> > another as a performance optimization), they make it all too easy to
> > create denormalized designs and tables with no real identifying
> > criteria, etc,...
>
> I can't recall ever seeing a "natural" key that was immutable. In my
> business (chemistry), we've seen several disastrous situations where
> companies picked keys they thought were natural and immutable, and years
> down the road they discovered (for example) that chemical compounds they
> thought were pure were in fact isotopic mixtures, or simply the wrong
> molecule (as analytical techniques improved). Or during a corporate
> takeover, they discovered that two companies using the same "natural"
> keys had as much as 10% differences in their multi-million-compound
> databases. These errors led to six-month to year-long delays, as each of
> the conflicting chemical records had to be examined by hand by a PhD
> chemist to reclassify it.

while your example might be a good case study in proper classification, it
has nothing to do with key selection. it is especially unclear how adding
an integer to a table will somehow magically solve these problems. are you
claiming that a primary key can't be changed?

mutability is strictly a performance argument. since RI handles cascading
primary key changes, it's simply a matter of if you are willing to wait
for RI to do its work or not (if not, swap in the id key as in my
example). the performance argument really only applies to specific cases,
and can be considered an attribute of certain tables. extraordinary cases
do happen, like a company overhauling its numbering systems, but such
cases can be dealt with by a number of methods including letting RI do its
thing.

merlin
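[Merlin's "let RI do its work" refers to cascading foreign keys. A minimal sketch, using the hypothetical foo/bar tables from the start of the thread rather than anything Merlin posted:]

CREATE TABLE foo (
  frobnitz varchar(100) PRIMARY KEY
);

CREATE TABLE bar (
  id serial PRIMARY KEY,
  frobnitz varchar(100) NOT NULL
    REFERENCES foo (frobnitz)
    ON UPDATE CASCADE   -- a key change in foo is propagated to bar by RI
    ON DELETE RESTRICT
);

-- Changing the natural key is then a single statement; the referencing
-- rows in bar are rewritten by the cascade (that rewrite is the cost
-- Merlin is talking about being willing to wait for):
UPDATE foo SET frobnitz = 'new-value' WHERE frobnitz = 'old-value';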
Craig A. James wrote:
> Merlin Moncure wrote:
>> Using surrogate keys is dangerous and can lead to very bad design
>> habits that are unfortunately so prevalent in the software industry
>> they are virtually taught in schools. ... While there is
>> nothing wrong with them in principle (you are exchanging one key for
>> another as a performance optimization), they make it all too easy to
>> create denormalized designs and tables with no real identifying
>> criteria, etc,...
>
> Wow, that's the opposite of everything I've ever been taught, and all my
> experience in the last few decades.
>
> ...chemistry...two companies using the same "natural"
> keys had as much as 10% differences in their multi-million-compound
> databases. These errors led to six-month to year-long delays, as each
> of the conflicting chemical records had to be examined by hand by a PhD
> chemist to reclassify it.

That sounds almost like a feature, not a bug - it tells you which
assumptions behind the "natural key" need to be reconsidered. And I don't
see how it would have been improved by adding a surrogate key - except
that the data would have been just as messed up, though harder to see
where the mess-ups were.

> We've always recommended to our customers that all primary keys be
> completely information free. They should not be based on any
> information or combination of information from the data records. Every
> time the customer has not followed this advice, they've later regretted
> it.

Hmm... but then do you put a unique index on the
otherwise-would-have-been-natural-primary-key columns? If not, you tend
to get into the odd situation of multiple rows that only vary in their
surrogate key -- and it seems the surrogate key is redundant.

> I'm sure there are situations where a natural key is appropriate, but I
> haven't seen it in my work.

I've seen both - and indeed usually use surrogate keys for convenience;
but also find that having to fix incorrect assumptions in natural primary
keys tends to raise business issues that are worth addressing anyway.
On Mon, 16 Apr 2007, Merlin Moncure wrote:

> extraordinary cases do happen, like a company overhauling its numbering
> systems, but such cases can be dealt with by a number of methods
> including letting RI do its thing.

I think the point Craig was trying to make is that what you refer to here
as "extraordinary cases" are, in fact, rather common. I've never seen a
database built on natural keys that didn't at some point turn ugly when
some internal or external business need suddenly invalidated the believed
uniqueness of that key.

The last really bad one I saw was a manufacturing database that used a
combination of the customer code and the customer's part number as the
key. Surely if the customer changes their part number, we should switch
ours to match so the orders are easy to process, right? Where this got
fun was when one large customer who released products on a yearly schedule
decided to change the bill of material for many of their parts for the new
year, but re-used the same part number; oh, and they were still ordering
the old parts as well. Hilarity ensued.

> it is especially unclear how adding an integer to a table will somehow
> magically solve these problems.

If the key is an integer, it's always possible to figure out a trivial
map that renumbers the entire database programmatically in order to merge
two sets of data.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
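[A minimal sketch of the kind of programmatic renumbering Greg describes, for two datasets being merged whose rows are identified by integer surrogate keys; the schema names, tables, and offset are invented for illustration:]

-- Suppose both companies have an "item" table keyed by a serial id, and
-- the ids collide. Shift one side by a fixed offset while loading it into
-- the merged database, and apply the same map to its foreign keys.
BEGIN;

INSERT INTO merged.item (id, description)
SELECT id + 10000000, description        -- 10000000 is an arbitrary offset
FROM   company_b.item;

INSERT INTO merged.order_line (item_id, qty)
SELECT item_id + 10000000, qty           -- same map for referencing rows
FROM   company_b.order_line;

COMMIT;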
On 4/16/07, Greg Smith <gsmith@gregsmith.com> wrote:
> I think the point Craig was trying to make is that what you refer to here
> as "extraordinary cases" are, in fact, rather common. I've never seen a
> database built on natural keys that didn't at some point turn ugly when
> some internal or external business need suddenly invalidated the believed
> uniqueness of that key.

I don't think it's so terrible to add a field to a key...I too have worked
on an ERP system based on natural keys and was quite amazed at how well
organized the database was. When the company decided to re-number all the
items in the database, it was a minor pain. Extending a critical key would
be a project any way you organize the database IMO. Natural keys are most
common in manufacturing and accounting systems because of the COBOL
heritage, when natural keys were the only way to realistically do it.
Unfortunately SQL really missed the boat on keys...otherwise they would
behave more like a composite type.

> The last really bad one I saw was a manufacturing database that used a
> combination of the customer code and the customer's part number as the
> key. Surely if the customer changes their part number, we should switch
> ours to match so the orders are easy to process, right? Where this got
> fun was when one large customer who released products on a yearly
> schedule decided to change the bill of material for many of their parts
> for the new year, but re-used the same part number; oh, and they were
> still ordering the old parts as well. Hilarity ensued.

In the context of this debate, I see this argument all the time, with the
implied suffix: 'If only we used integer keys we would not have had this
problem...'. Either the customer identifies parts with a part number or
they don't...and if they do identify parts with a number and recycle the
numbers, you have a problem...period. Adding an integer key only moves the
confusion to a separate place, unless it is used by the user to identify
the part number and then *becomes* the key, or a part of it. If you hide
the id from the user, then I claim the data model is pretty much busted.

merlin
Merlin Moncure wrote:
> In the context of this debate, I see this argument all the time, with
> the implied suffix: 'If only we used integer keys we would not have
> had this problem...'. Either the customer identifies parts with a
> part number or they don't...and if they do identify parts with a
> number and recycle the numbers, you have a problem...period.

On the contrary. You create a new record with the same part number. You
mark the old part number "obsolete". Everything else (the part's
description, and all the relationships that it's in, such as order
history, catalog inclusion, revision history, etc.) is unaffected. New
orders are placed against the new part number's DB record; for safety the
old part number can have a trigger that prevents new orders from being
placed.

Since the part number is NOT the primary key, duplicate part numbers are
not a problem. If you had instead used the part number as the primary
key, you'd be dead in the water.

You can argue that the customer is making a dumb decision by reusing
catalog numbers, and I'd agree. But they do it, and as database designers
we have to handle it. In my particular system, we aggregate information
from several hundred companies, and this exact scenario happens
frequently. Since we're only aggregating information, we have no control
over the data that these companies provide. If we'd used catalog numbers
for primary keys, we'd have big problems.

Craig
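[A minimal sketch of the layout Craig describes, with hypothetical table and column names; the trigger is only stubbed in to show where the "no new orders against obsolete parts" rule would hang:]

CREATE TABLE part (
  part_id     serial PRIMARY KEY,          -- information-free surrogate key
  catalog_no  text NOT NULL,               -- customer's part number, may repeat
  obsolete    boolean NOT NULL DEFAULT false,
  description text
);

CREATE TABLE order_line (
  order_line_id serial PRIMARY KEY,
  part_id       int NOT NULL REFERENCES part (part_id),
  qty           int NOT NULL
);

-- Guard against new orders being placed against obsolete parts.
CREATE OR REPLACE FUNCTION reject_obsolete_part() RETURNS trigger AS $$
BEGIN
  IF (SELECT obsolete FROM part WHERE part_id = NEW.part_id) THEN
    RAISE EXCEPTION 'part % is obsolete', NEW.part_id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_line_check_part
  BEFORE INSERT ON order_line
  FOR EACH ROW EXECUTE PROCEDURE reject_obsolete_part();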
Craig A. James wrote: > Since we're only aggregating information, we have > no control over the data that these companies provide. And at the end of the day that's the root of the problem. It's easy to be lulled into "well it looks like a primary key" rather than being able to guarantee it. -- Richard Huxton Archonet Ltd
On Wed, 18 Apr 2007, Richard Huxton wrote: > And at the end of the day that's the root of the problem. It's easy to be > lulled into "well it looks like a primary key" rather than being able to > guarantee it. In some of these cases it is guaranteed to be a primary key given all available information at the time. The funny thing about unexpected changes to a business model is that you never expect them. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On 4/18/07, Craig A. James <cjames@modgraph-usa.com> wrote:
> Merlin Moncure wrote:
> > In the context of this debate, I see this argument all the time, with
> > the implied suffix: 'If only we used integer keys we would not have
> > had this problem...'. Either the customer identifies parts with a
> > part number or they don't...and if they do identify parts with a
> > number and recycle the numbers, you have a problem...period.
>
> On the contrary. You create a new record with the same part number. You
> mark the old part number "obsolete". Everything else (the part's
> description, and all the relationships that it's in, such as order
> history, catalog inclusion, revision history, etc.) is unaffected. New
> orders are placed against the new part number's DB record; for safety
> the old part number can have a trigger that prevents new orders from
> being placed.
>
> Since the part number is NOT the primary key, duplicate part numbers are
> not a problem. If you had instead used the part number as the primary
> key, you'd be dead in the water.

You are redefining the primary key to be (part_number, obsoletion_date).
Now, if you had not anticipated that in the original design (likely
enough), you do have to refactor queries that join on the table...so
what? If that's too much work, you can use a view to take care of the
problem (which may be a good idea anyways). *You have to refactor the
system anyways because you are now allowing duplicate part numbers where
previously (from the perspective of the user) they were unique.*

The hidden advantage of pushing the full key through the database is it
tends to expose holes in the application/business logic. Chances are some
query is not properly distinguishing obsoleted parts, and now the real
problems come...surrogate keys do not remove complexity, they simply
sweep it under the rug.

merlin
Merlin Moncure wrote:
>> Since the part number is NOT the primary key, duplicate part numbers
>> are not a problem. If you had instead used the part number as the
>> primary key, you'd be dead in the water.
>
> You are redefining the primary key to be (part_number,
> obsoletion_date). Now, if you had not anticipated that in the
> original design (likely enough), you do have to refactor queries that
> join on the table...so what? If that's too much work, you can use a
> view to take care of the problem (which may be a good idea anyways).
> *You have to refactor the system anyways because you are now allowing
> duplicate part numbers where previously (from the perspective of the
> user) they were unique.*
>
> The hidden advantage of pushing the full key through the database is
> it tends to expose holes in the application/business logic. Chances
> are some query is not properly distinguishing obsoleted parts, and now
> the real problems come...surrogate keys do not remove complexity, they
> simply sweep it under the rug.

This really boils down to an object-oriented perspective. I have an
object, a customer's catalog entry. It has properties such as catalog
number, description, etc, and whether it's obsolete or not. Management of
the object (its relation to other objects, its history, etc.) should NOT
depend on the object's specific definition.

This is true whether the object is represented in Lisp, C++, Perl, or (in
this case) an SQL schema. Good object oriented design abstracts the object
and its behavior from management of the object. In C++, Perl, etc., we
manage objects via a pointer or object reference. In SQL, we reference
objects by an *arbitrary* integer that is effectively a pointer to the
object.

What you're suggesting is that I should break the object-oriented
encapsulation by pulling out specific fields of the object, exposing those
internal object details to the applications, and spreading those details
across the whole schema. And I argue that this is wrong, because it breaks
encapsulation. By exposing the details of the object, if the details
change, *all* of your relationships break, and all of your applications
have to change. And I've never seen a system where breaking
object-oriented encapsulation was a good long-term solution. Systems
change, and object-oriented techniques were invented to help manage
change.

This is one of the reasons the Postgres project was started way back when:
to bring object-oriented techniques to the relational-database world.

Craig
I think a database with all natural keys is unrealistic. For example if you have a table that refers to people, are you going to use their name as a primary key? Names change all the time due to things like marriage, divorce, or trouble with the law. We have tables with 20 million rows which reference back to a table of people, and if I used the person's name as key, it would be a major pain when somebody's name changes. Even if there is referential integrity, one person might be referred to by 25% of the 20 million rows, so the update would take quite a long time. Also the table will be filled with dead rows and the indexes will likely be bloated. If I want to clean that up, it will take a vacuum full or a cluster which will lock the whole table and run for hours. If I use a surrogate key, I can change their name in one row and be done with it. Just my 2 cents. Dave
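[A quick sketch of the two update patterns Dave is contrasting, with invented table names; the point is only how many rows each name change has to touch:]

-- Surrogate key: a name change touches exactly one row.
CREATE TABLE person (
  person_id serial PRIMARY KEY,
  name      text NOT NULL
);
CREATE TABLE event (
  event_id  serial PRIMARY KEY,
  person_id int NOT NULL REFERENCES person (person_id)
);
UPDATE person SET name = 'New Name' WHERE person_id = 42;

-- Natural key: the same change must cascade through every referencing row
-- (potentially millions of them in Dave's example), with the dead-row and
-- index-bloat consequences he describes.
CREATE TABLE person_nk (
  name text PRIMARY KEY
);
CREATE TABLE event_nk (
  event_id serial PRIMARY KEY,
  name     text NOT NULL REFERENCES person_nk (name) ON UPDATE CASCADE
);
UPDATE person_nk SET name = 'New Name' WHERE name = 'Old Name';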
On Tue, 2007-04-17 at 21:06 -0700, Craig A. James wrote:
> Merlin Moncure wrote:
> > In the context of this debate, I see this argument all the time, with
> > the implied suffix: 'If only we used integer keys we would not have
> > had this problem...'. Either the customer identifies parts with a
> > part number or they don't...and if they do identify parts with a
> > number and recycle the numbers, you have a problem...period.
>
> On the contrary. You create a new record with the same part number. You
> mark the old part number "obsolete". Everything else (the part's
> description, and all the relationships that it's in, such as order
> history, catalog inclusion, revision history, etc.) is unaffected. New
> orders are placed against the new part number's DB record; for safety
> the old part number can have a trigger that prevents new orders from
> being placed.
>
> Since the part number is NOT the primary key, duplicate part numbers are
> not a problem. If you had instead used the part number as the primary
> key, you'd be dead in the water.
>
> You can argue that the customer is making a dumb decision by reusing
> catalog numbers, and I'd agree. But they do it, and as database
> designers we have to handle it. In my particular system, we aggregate
> information from several hundred companies, and this exact scenario
> happens frequently. Since we're only aggregating information, we have no
> control over the data that these companies provide. If we'd used catalog
> numbers for primary keys, we'd have big problems.

Storing data is easy. The difficulty lies in storing data in such a way
that your assumptions about the data remain valid and your queries still
answer the questions that you think they answer.

Because an internal ID field has no meaning outside of the database (some
auto-generated keys do have meaning outside the database, but I'm not
talking about those), you can't effectively query based on an internal id
any more than you can query by the ctid. So what do you query by then? You
query by natural keys anyway. Internal id fields are an implementation
detail related to performance (the real topic of this discussion).

If you have two parts with the same part id, what does that mean? Sure,
you can store the data, but then the queries that assumed that data was
unique no longer hold. Sometimes you need two parts with the same part id,
but you have to know the meaning in order to query based on that data.

Let me ask these questions:
 - Do you think that all of your relations have an internal id?
 - Do you think that all the internal ids you use are unique in the
   relations in which they appear?

If you answer "yes" to either question, consider that every query on that
data is also a relation and so are subselects and intermediate results. Do
those all have an id? If not, why not? How do you join a virtual relation
to a physical relation if the virtual relation has no internal id? Is the
id field still unique in the result of a join or Cartesian product?

Regards,
Jeff Davis
On 4/18/07, Dave Dutcher <dave@tridecap.com> wrote:
> I think a database with all natural keys is unrealistic. For example if
> you have a table that refers to people, are you going to use their name
> as a primary key? Names change all the time due to things like marriage,
> divorce, or trouble with the law. We have tables with 20 million rows
> which reference back to a table of people, and if I used the person's
> name as key, it would be a major pain when somebody's name changes. Even
> if there is referential integrity, one person might be referred to by
> 25% of the 20 million rows, so the update would take quite a long time.
> Also the table will be filled with dead rows and the indexes will likely
> be bloated. If I want to clean that up, it will take a vacuum full or a
> cluster which will lock the whole table and run for hours. If I use a
> surrogate key, I can change their name in one row and be done with it.

That's perfectly reasonable (I mentioned this upthread)...there are a
couple of corner cases where RI costs too much. Exchanging a surrogate for
a natural is a valid performance consideration. Usually, the performance
win is marginal at best (and your example suggests possible normalization
issues in the child table); sometimes there is no alternative...updating 5
million rows is obviously nasty.

That said -- if the cost of the update was zero, would you still do it
that way? I'm trying to separate performance-related issues, which are
reasonable and valid depending on the situation, from good design
principles.

merlin
I have a table with 2.5 million real[] arrays. (They are points in a time series.) Given a new array X, I'd like to find, say, the 25 closest to X in some sense--for simplification, let's just say in the usual vector norm. Speed is critical here, and everything I have tried has been too slow. I imported the cube contrib package, and I tried creating an index on a cube of the last 6 elements, which are the most important. Then I tested the 2.5MM rows for being contained within a tolerance of the last 6 elements of X, +/- 0.1 in each coordinate, figuring that would be an indexed search (which I CLUSTERED on). I then ran the sort on this smaller set. The index was used, but it was still too slow. I also tried creating new columns with rounded int2 values of the last 6 coordinates and made a multicolumn index. For each X the search is taking about 4-15 seconds which is above my target at least one order of magnitude. Absolute numbers are dependent on my hardware and settings, and some of this can be addressed with configuration tweaks, etc., but first I think I need to know the optimum data structure/indexing strategy. Is anyone on the list experienced with this sort of issue? Thanks. Andrew Lazarus andrew@pillette.com
On Fri, 2007-04-20 at 12:07 -0700, Andrew Lazarus wrote: > I have a table with 2.5 million real[] arrays. (They are points in a > time series.) Given a new array X, I'd like to find, say, the 25 > closest to X in some sense--for simplification, let's just say in the > usual vector norm. Speed is critical here, and everything I have tried > has been too slow. > > I imported the cube contrib package, and I tried creating an index on > a cube of the last 6 elements, which are the most important. Then I Have you tried just making normal indexes on each of the last 6 elements and see if postgresql will use a BitmapAnd to combine them? Regards, Jeff Davis
Jeff Davis wrote:
> On Fri, 2007-04-20 at 12:07 -0700, Andrew Lazarus wrote:
>> I have a table with 2.5 million real[] arrays. (They are points in a
>> time series.) Given a new array X, I'd like to find, say, the 25
>> closest to X in some sense--for simplification, let's just say in the
>> usual vector norm. Speed is critical here, and everything I have tried
>> has been too slow.
>>
>> I imported the cube contrib package, and I tried creating an index on
>> a cube of the last 6 elements, which are the most important. Then I
>
> Have you tried just making normal indexes on each of the last 6 elements
> and see if postgresql will use a BitmapAnd to combine them?

I don't think that will work for the vector norm i.e:

|x - y| = sqrt(sum over j ((x[j] - y[j])^2))

Cheers

Mark
Because I know the 25 closest are going to be fairly close in each
coordinate, I did try a multicolumn index on the last 6 columns and used a
+/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably inside
that hypercube on the distribution of data in question.)

This hypercube tended to have 10-20K records, and took at least 4 seconds
to retrieve. I was a little surprised by how long that took. So I'm
wondering if my data representation is off the wall.

I should mention I also tried a cube index using gist on all 114 elements,
but CREATE INDEX hadn't finished in 36 hours, when I killed it, and I
wasn't in retrospect sure an index that took something like 6GB by itself
would be helpful on a 2GB of RAM box.

MK> I don't think that will work for the vector norm i.e:
MK> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))

--
Sincerely,
Andrew Lazarus mailto:andrew@pillette.com
Andrew Lazarus wrote:
> Because I know the 25 closest are going to be fairly close in each
> coordinate, I did try a multicolumn index on the last 6 columns and used
> a +/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably
> inside that hypercube on the distribution of data in question.)
>
> This hypercube tended to have 10-20K records, and took at least 4
> seconds to retrieve. I was a little surprised by how long that took.
> So I'm wondering if my data representation is off the wall.
>
> I should mention I also tried a cube index using gist on all 114
> elements, but CREATE INDEX hadn't finished in 36 hours, when I killed
> it, and I wasn't in retrospect sure an index that took something like
> 6GB by itself would be helpful on a 2GB of RAM box.
>
> MK> I don't think that will work for the vector norm i.e:
>
> MK> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))

Sorry, in that case it probably *is* worth trying out 6 single column
indexes and seeing if they get bitmap and'ed together...

Mark
Andrew Lazarus <andrew@pillette.com> writes:
> Because I know the 25 closest are going to be fairly close in each
> coordinate, I did try a multicolumn index on the last 6 columns and used
> a +/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably
> inside that hypercube on the distribution of data in question.)
> This hypercube tended to have 10-20K records, and took at least 4
> seconds to retrieve. I was a little surprised by how long that took.
> So I'm wondering if my data representation is off the wall.

A multicolumn btree index isn't going to be helpful at all. Jeff's idea
of using six single-column indexes with the above query might work,
though.

regards, tom lane
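[For the archives, a sketch of what Jeff and Tom are suggesting. The table name "points", the column name "vec", and the assumption that the relevant elements are positions 109-114 of the 114-element array are all made up here; the original posts don't name them:]

-- Six single-column expression indexes, one per relevant array element.
CREATE INDEX pts_e109 ON points ((vec[109]));
CREATE INDEX pts_e110 ON points ((vec[110]));
CREATE INDEX pts_e111 ON points ((vec[111]));
CREATE INDEX pts_e112 ON points ((vec[112]));
CREATE INDEX pts_e113 ON points ((vec[113]));
CREATE INDEX pts_e114 ON points ((vec[114]));
ANALYZE points;

-- A query restricting each element to a small range can then be answered
-- by BitmapAnd-ing the six indexes (check the plan with EXPLAIN ANALYZE):
SELECT *
FROM points
WHERE vec[109] BETWEEN 0.4 AND 0.6
  AND vec[110] BETWEEN 0.4 AND 0.6
  AND vec[111] BETWEEN 0.4 AND 0.6
  AND vec[112] BETWEEN 0.4 AND 0.6
  AND vec[113] BETWEEN 0.4 AND 0.6
  AND vec[114] BETWEEN 0.4 AND 0.6;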
On Apr 20, 12:07 pm, and...@pillette.com (Andrew Lazarus) wrote:
> I have a table with 2.5 million real[] arrays. (They are points in a
> time series.) Given a new array X, I'd like to find, say, the 25
> closest to X in some sense--for simplification, let's just say in the
> usual vector norm. Speed is critical here, and everything I have tried
> has been too slow.
>
> I imported the cube contrib package, and I tried creating an index on
> a cube of the last 6 elements, which are the most important. Then I
> tested the 2.5MM rows for being contained within a tolerance of the
> last 6 elements of X, +/- 0.1 in each coordinate, figuring that would
> be an indexed search (which I CLUSTERED on). I then ran the sort on
> this smaller set. The index was used, but it was still too slow. I
> also tried creating new columns with rounded int2 values of the last 6
> coordinates and made a multicolumn index.
>
> For each X the search is taking about 4-15 seconds which is above my
> target at least one order of magnitude. Absolute numbers are dependent
> on my hardware and settings, and some of this can be addressed with
> configuration tweaks, etc., but first I think I need to know the
> optimum data structure/indexing strategy.
>
> Is anyone on the list experienced with this sort of issue?
>
> Thanks.
> Andrew Lazarus and...@pillette.com

Having worked in high dimensional spaces a lot in my career I think you'll
find that there are mathematical limits in terms of speed. In practical
terms, a seq_scan will be unavoidable since on first approximation you are
limited to doing an exhaustive search in 101-dimensional space unless you
make approximations or dimensionality reductions of some kind.

Read up on the Curse of Dimensionality:
http://en.wikipedia.org/wiki/Curse_of_dimensionality

Have you considered dimension reduction techniques such as Singular Value
Decomposition, Principal Components Analysis, etc.?
On 4/20/07, Andrew Lazarus <andrew@pillette.com> wrote:
> I have a table with 2.5 million real[] arrays. (They are points in a
> time series.) Given a new array X, I'd like to find, say, the 25
> closest to X in some sense--for simplification, let's just say in the
> usual vector norm. Speed is critical here, and everything I have tried
> has been too slow.

Let me chime in with the observation that this is a multidimensional
nearest neighbour problem (reverse nearest neighbour and its close cousin,
k-NN) that is well known in statistics, and particularly relevant to
statistical learning and classification. Knowing the jargon might help you
dig up efficient algorithms to mine your data; there are tons of
fascinating papers available through Citeseer.

In particular, I recommend the paper "Efficient k-NN Search on Vertically
Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here:
http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It
proposes an algorithm called BOND to drastically reduce the search space
by probabilistic means. They give an example using image histograms, but
the algorithm applies to any multidimensional data. Briefly put, it points
out that proximity comparison can be computed vertically, a few dimensions
at a time, and entire subsets can be thrown away when it's apparent that
they are below a statistically derived lower bound. The only gotcha is
that the algorithm derives much of its performance from the assumption
that your data is vertically decomposed, one table per dimension,
otherwise the search effectively incurs a sequential scan of the entire
dataset, and then you're pretty much back to square one.

The most common approach to nearest neighbour search is to use a spatial
data structure. The classic algorithm is the kd-tree
(http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B tree,
neither of which are available in PostgreSQL. If I remember correctly,
R-trees have also been shown to be useful for high numbers of dimensions;
with PostgreSQL you have R-trees and even better R-tree-equivalent support
through GiST. I have no idea whether you can actually munge your integer
vectors into something GiST can index and search, but it's a thought.
(GiST, presumably, can also theoretically index kd-trees and other spatial
trees.)

Alexander.
On Fri, 27 Apr 2007, Alexander Staubo wrote:

> [...]
>
> The most common approach to nearest neighbour search is to use a spatial
> data structure. The classic algorithm is the kd-tree
> (http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B tree,
> neither of which are available in PostgreSQL. If I remember correctly,
> R-trees have also been shown to be useful for high numbers of
> dimensions; with PostgreSQL you have R-trees and even better
> R-tree-equivalent support through GiST. I have no idea whether you can
> actually munge your integer vectors into something GiST can index and
> search, but it's a thought. (GiST, presumably, can also theoretically
> index kd-trees and other spatial trees.)

You're right, but currently only theoretically, due to interface
restrictions. We have a plan to improve it sometime. There was the SP-GiST
project, which could be used for a k-d-b tree, see
http://www.cs.purdue.edu/spgist/
I don't know if it works with the 8.2 version. Also, it doesn't support
concurrency and recovery.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On 21-4-2007 1:42 Mark Kirkwood wrote:
> I don't think that will work for the vector norm i.e:
>
> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))

I don't know if this is useful here, but I was able to rewrite that
algorithm for a set of very sparse vectors (i.e. they had very little
overlapping factors) to something like:

|x - y|^2 = sum over j (x[j]^2) + sum over j (y[j]^2)
            + for each j where x[j] and y[j] are both non-zero:
              - (x[j]^2 + y[j]^2) + (x[j] - y[j])^2

The first two sums can be calculated only once. So if you have very little
overlap, this is therefore much more efficient (if there is no overlap at
all you end up with the sum of x[j]^2 + y[j]^2 anyway).

Besides, this rewritten calculation allows you to store the X and Y
vectors using a trivial table layout vector(x, i, value) which is only
filled with non-zeros and which you can trivially self-join to find the
closest matches. You don't care about the j's where there is either no x
or y-value anyway with this rewrite.

I can compare over 1000 y's of on average 100 elements to two x's of over
1000 elements on just a single 1.8Ghz AMD processor. (I use it for a
bi-kmeans algorithm, so there are only two buckets to compare to.)

So it might be possible to rewrite your algorithm to be less
calculation-intensive. Obviously, with a dense matrix this isn't going to
work, but there may be other ways to circumvent parts of the algorithm or
to cache large parts of it. It might also help to extract only the 6
relevant columns into a separate temporary table which will have much
smaller records and thus can fit more records per page.

Best regards,

Arjen
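[A minimal sketch of the sparse layout and self-join Arjen describes, with a hypothetical table name; it computes the squared distance as |x|^2 + |y|^2 - 2(x . y), which is equivalent to his rewrite, with the per-vector sums precomputed as he suggests:]

-- Sparse storage: only non-zero components are stored.
CREATE TABLE sparse_vec (
  vec_id int,
  j      int,      -- component index
  val    real,
  PRIMARY KEY (vec_id, j)
);

-- Per-vector squared norms, computed once.
CREATE TABLE vec_norm AS
  SELECT vec_id, sum(val * val) AS norm2
  FROM sparse_vec
  GROUP BY vec_id;

-- Squared distance from vector 1 to every other vector; the dot product
-- only needs the components where both vectors are non-zero, which is
-- exactly what the self-join on j produces.
SELECT ny.vec_id,
       nx.norm2 + ny.norm2 - 2 * coalesce(d.dot, 0) AS dist2
FROM vec_norm nx
JOIN vec_norm ny ON ny.vec_id <> nx.vec_id
LEFT JOIN (
  SELECT a.vec_id AS x_id, b.vec_id AS y_id, sum(a.val * b.val) AS dot
  FROM sparse_vec a
  JOIN sparse_vec b ON b.j = a.j
  GROUP BY a.vec_id, b.vec_id
) d ON d.x_id = nx.vec_id AND d.y_id = ny.vec_id
WHERE nx.vec_id = 1
ORDER BY dist2
LIMIT 25;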
Let me just thank the list, especially for the references. (I found similar papers myself with Google: and to think I have a university library alumni card and barely need it any more!) I'll write again on the sorts of results I get.
On 5/1/07, Andrew Lazarus <andrew@pillette.com> wrote: > Let me just thank the list, especially for the references. (I found > similar papers myself with Google: and to think I have a university > library alumni card and barely need it any more!) > > I'll write again on the sorts of results I get. Looking forward to hearing about them. I have worked with such dataset problems, but never attempted to apply them to a relational database such as PostgreSQL. If you want speed, nothing beats in-memory vectors on a modern CPU architecture. Alexander.