Thread: Basic Q on superfluous primary keys
Consider these two very similar schemas:
Schema 1:
CREATE TABLE foo (
  id serial PRIMARY KEY,
  frobnitz character varying(100) NOT NULL UNIQUE
);
CREATE TABLE bar (
  id serial PRIMARY KEY,
  foo_id int REFERENCES foo(id)
);
Schema 2:
CREATE TABLE foo (
  frobnitz character varying(100) PRIMARY KEY
);
CREATE TABLE bar (
  id serial PRIMARY KEY,
  frobnitz character varying(100) REFERENCES foo(frobnitz)
);
The two situations are semantically identical: each record in table bar refers to a record in table foo. The difference is that in the first schema, this referencing is done through an "artificial" serial-integer primary key, while in the second schema this reference is done through a data field that happens to be unique and not null, so it can serve as primary key.
I find Schema 1 awkward and unnatural; more specifically, foo.id seems unnecessary in light of the non-null uniqueness of foo.frobnitz. But I remember once reading that "long" fields like foo.frobnitz did not make good primary keys.
Is the field foo.id in Schema 1 superfluous? For example, wouldn't the referencing from bar to foo really be done "behind the scenes" through some hidden field (oid?) instead of through the frobnitz text field? Which of the two schemas would give better performance?
Thanks!
kj
In response to "Kynn Jones" <kynnjo@gmail.com>: > Consider these two very similar schemas: > > Schema 1: > > > CREATE TABLE foo ( > id serial PRIMARY KEY, > frobnitz character(varying 100) NOT NULL UNIQUE > ); > > > CREATE TABLE bar ( > id serial PRIMARY KEY, > foo_id int REFERENCES foo(id) > ) > > > Schema 2: > > > CREATE TABLE foo ( > frobnitz character(varying 100) PRIMARY KEY > ); > > > CREATE TABLE bar ( > id serial PRIMARY KEY, > frobnitz character(varying 100) REFERENCES foo(frobnitz) > ) > > > > > The two situations are semantically identical: each record in table bar > refers to a record in table foo. The difference is that in the first > schema, this referencing is done through an "artificial" serial-integer > primary key, while in the second schema this reference is done through a > data field that happens to be unique and not null, so it can serve as > primary key. The first case is call a "surrogate key". A little googling on that term will turn up a wealth of discussion -- both for and against. > I find Schema 1 awkward and unnatural; more specifically, foo.id seems > unnecessary in light of the non-null uniqueness of foo.frobnitz. But I > remember once reading that "long" fields like foo.frobnitz did not make good > primary keys. I had a discussion about this recently on the Drupal mailing lists, at the end of which I promised to do some benchmarking to determine whether or not text keys really do hurt performance of indexes. Unfortunately, I still haven't followed through on that promise -- maybe I'll get to it tomorrow. > Is the field foo.id in Schema 1 superfluous? For example, wouldn't the > referencing from bar to foo really be done "behind the scenes" through some > hidden field (oid?) instead of through the frobnitz text field? Which of > the two schemas would give better perfornance? -- Bill Moran Collaborative Fusion Inc. wmoran@collaborativefusion.com Phone: 412-422-3463x4023 **************************************************************** IMPORTANT: This message contains confidential information and is intended only for the individual named. If the reader of this message is not an intended recipient (or the individual responsible for the delivery of this message to an intended recipient), please be advised that any re-use, dissemination, distribution or copying of this message is prohibited. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. ****************************************************************
On 4/14/07, Bill Moran <wmoran@collaborativefusion.com> wrote:
> In response to "Kynn Jones" <kynnjo@gmail.com>:
> > The two situations are semantically identical: each record in table bar
> > refers to a record in table foo. The difference is that in the first
> > schema, this referencing is done through an "artificial" serial-integer
> > primary key, while in the second schema this reference is done through a
> > data field that happens to be unique and not null, so it can serve as
> > primary key.
>
> I had a discussion about this recently on the Drupal mailing lists, at the
> end of which I promised to do some benchmarking to determine whether or
> not text keys really do hurt performance of indexes. Unfortunately, I
> still haven't followed through on that promise -- maybe I'll get to it
> tomorrow.

The main reason why integer indexes are faster than natural counterparts
is that the index is smaller and puts less pressure on cache. This is
however offset by removing joins here and there, and you usually just end
up indexing the data anyways. Performance is kind of tangential to the
argument though -- I've seen databases using all natural keys and found
them to be very clean and performant.

Using surrogate keys is dangerous and can lead to very bad design habits
that are unfortunately so prevalent in the software industry they are
virtually taught in schools. Many software frameworks assume you use them
and refuse to work without them (avoid!). While there is nothing wrong
with them in principle (you are exchanging one key for another as a
performance optimization), they make it all too easy to create
denormalized designs and tables with no real identifying criteria, etc,
and the resultant stinky queries to put it all back together again (full
of unions, self joins, extraneous groups, case statements, etc.).

A good compromise in your designs is to identify your natural key but use
the surrogate if you have valid performance reasons:

CREATE TABLE foo (
  frobnitz_id int unique,
  frobnitz character varying(100) PRIMARY KEY
  [...]
);

frobnitz_id is of course optional and not necessary in all tables. It may
be a pain to relate a large table with a four or five part key, and
judicious use of surrogates may be justified for performance or even just
to keep your queries smaller:

create table order_line_item_discount (
  company_name text,
  order_no int,
  line_item_seq_no int,
  discount_code text,
  primary key (company_name, order_no, line_item_seq_no, discount_code)
)

becomes

create table order_line_item_discount (
  order_line_item_id int,
  discount_code text,
  primary key (order_line_item_id, discount_code)
)

merlin
Merlin Moncure wrote:
> Using surrogate keys is dangerous and can lead to very bad design
> habits that are unfortunately so prevalent in the software industry
> they are virtually taught in schools. ... While there is
> nothing wrong with them in principle (you are exchanging one key for
> another as a performance optimization), they make it all too easy to
> create denormalized designs and tables with no real identifying
> criteria, etc,...

Wow, that's the opposite of everything I've ever been taught, and all my
experience in the last few decades.

I can't recall ever seeing a "natural" key that was immutable. In my
business (chemistry), we've seen several disastrous situations where
companies picked keys they thought were natural and immutable, and years
down the road they discovered (for example) that chemical compounds they
thought were pure were in fact isotopic mixtures, or simply the wrong
molecule (as analytical techniques improved). Or during a corporate
takeover, they discovered that two companies using the same "natural" keys
had as much as 10% differences in their multi-million-compound databases.
These errors led to six-month to year-long delays, as each of the
conflicting chemical records had to be examined by hand by a PhD chemist
to reclassify it.

In other businesses, almost any natural identifier you pick is subject to
simple typographical errors. When you discover the errors in a field
you've used as a primary key, it can be quite hard to fix, particularly if
you have distributed data across several systems and schemas.

We've always recommended to our customers that all primary keys be
completely information free. They should not be based on any information
or combination of information from the data records. Every time the
customer has not followed this advice, they've later regretted it.

I'm sure there are situations where a natural key is appropriate, but I
haven't seen it in my work.

Craig
On 4/16/07, Craig A. James <cjames@modgraph-usa.com> wrote:
> Merlin Moncure wrote:
> > Using surrogate keys is dangerous and can lead to very bad design
> > habits that are unfortunately so prevalent in the software industry
> > they are virtually taught in schools. ... While there is
> > nothing wrong with them in principle (you are exchanging one key for
> > another as a performance optimization), they make it all too easy to
> > create denormalized designs and tables with no real identifying
> > criteria, etc,...
>
> I can't recall ever seeing a "natural" key that was immutable. In my
> business (chemistry), we've seen several disastrous situations where
> companies picked keys they thought were natural and immutable, and years
> down the road they discovered (for example) that chemical compounds they
> thought were pure were in fact isotopic mixtures, or simply the wrong
> molecule (as analytical techniques improved). Or during a corporate
> takeover, they discovered that two companies using the same "natural"
> keys had as much as 10% differences in their multi-million-compound
> databases. These errors led to six-month to year-long delays, as each of
> the conflicting chemical records had to be examined by hand by a PhD
> chemist to reclassify it.

while your example might be a good case study in proper classification, it
has nothing to do with key selection. it is especially unclear how adding
an integer to a table will somehow magically solve these problems. are you
claiming that a primary key can't be changed?

mutability is strictly a performance argument. since RI handles cascading
primary key changes, it's simply a matter of if you are willing to wait
for RI to do its work or not (if not, swap in the id key as in my
example). the performance argument really only applies to specific cases,
and can be considered an attribute of certain tables. extraordinary cases
do happen, like a company overhauling its numbering systems, but such
cases can be dealt with by a number of methods including letting RI do its
thing.

merlin
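[Merlin's "let RI do its work" refers to cascading foreign keys. A minimal sketch, using the hypothetical foo/bar tables from the start of the thread rather than anything Merlin posted:]

CREATE TABLE foo (
  frobnitz varchar(100) PRIMARY KEY
);

CREATE TABLE bar (
  id serial PRIMARY KEY,
  frobnitz varchar(100) NOT NULL
    REFERENCES foo (frobnitz)
    ON UPDATE CASCADE   -- a key change in foo is propagated to bar by RI
    ON DELETE RESTRICT
);

-- Changing the natural key is then a single statement; the referencing
-- rows in bar are rewritten by the cascade (that rewrite is the cost
-- Merlin is talking about being willing to wait for):
UPDATE foo SET frobnitz = 'new-value' WHERE frobnitz = 'old-value';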
Craig A. James wrote:
> Merlin Moncure wrote:
>> Using surrogate keys is dangerous and can lead to very bad design
>> habits that are unfortunately so prevalent in the software industry
>> they are virtually taught in schools. ... While there is
>> nothing wrong with them in principle (you are exchanging one key for
>> another as a performance optimization), they make it all too easy to
>> create denormalized designs and tables with no real identifying
>> criteria, etc,...
>
> Wow, that's the opposite of everything I've ever been taught, and all my
> experience in the last few decades.
>
> ...chemistry...two companies using the same "natural"
> keys had as much as 10% differences in their multi-million-compound
> databases. These errors led to six-month to year-long delays, as each
> of the conflicting chemical records had to be examined by hand by a PhD
> chemist to reclassify it.

That sounds almost like a feature, not a bug - it tells you which
assumptions behind the "natural key" need to be reconsidered. And I don't
see how it would have been improved by adding a surrogate key - except
that the data would have been just as messed up, though harder to see
where the mess-ups were.

> We've always recommended to our customers that all primary keys be
> completely information free. They should not be based on any
> information or combination of information from the data records. Every
> time the customer has not followed this advice, they've later regretted
> it.

Hmm... but then do you put a unique index on the
otherwise-would-have-been-natural-primary-key columns? If not, you tend
to get into the odd situation of multiple rows that only vary in their
surrogate key -- and it seems the surrogate key is redundant.

> I'm sure there are situations where a natural key is appropriate, but I
> haven't seen it in my work.

I've seen both - and indeed usually use surrogate keys for convenience;
but also find that having to fix incorrect assumptions in natural primary
keys tends to raise business issues that are worth addressing anyway.
On Mon, 16 Apr 2007, Merlin Moncure wrote:

> extraordinary cases do happen, like a company overhauling its numbering
> systems, but such cases can be dealt with by a number of methods
> including letting RI do its thing.

I think the point Craig was trying to make is that what you refer to here
as "extraordinary cases" are, in fact, rather common. I've never seen a
database built on natural keys that didn't at some point turn ugly when
some internal or external business need suddenly invalidated the believed
uniqueness of that key.

The last really bad one I saw was a manufacturing database that used a
combination of the customer code and the customer's part number as the
key. Surely if the customer changes their part number, we should switch
ours to match so the orders are easy to process, right? Where this got
fun was when one large customer who released products on a yearly schedule
decided to change the bill of material for many of their parts for the new
year, but re-used the same part number; oh, and they were still ordering
the old parts as well. Hilarity ensued.

> it is especially unclear how adding an integer to a table will somehow
> magically solve these problems.

If the key is an integer, it's always possible to figure out a trivial
map that renumbers the entire database programmatically in order to merge
two sets of data.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
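[A minimal sketch of the kind of programmatic renumbering Greg describes, for two datasets being merged whose rows are identified by integer surrogate keys; the schema names, tables, and offset are invented for illustration:]

-- Suppose both companies have an "item" table keyed by a serial id, and
-- the ids collide. Shift one side by a fixed offset while loading it into
-- the merged database, and apply the same map to its foreign keys.
BEGIN;

INSERT INTO merged.item (id, description)
SELECT id + 10000000, description        -- 10000000 is an arbitrary offset
FROM   company_b.item;

INSERT INTO merged.order_line (item_id, qty)
SELECT item_id + 10000000, qty           -- same map for referencing rows
FROM   company_b.order_line;

COMMIT;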
On 4/16/07, Greg Smith <gsmith@gregsmith.com> wrote:
> I think the point Craig was trying to make is that what you refer to here
> as "extraordinary cases" are, in fact, rather common. I've never seen a
> database built on natural keys that didn't at some point turn ugly when
> some internal or external business need suddenly invalidated the believed
> uniqueness of that key.

I don't think it's so terrible to add a field to a key...I too have worked
on an ERP system based on natural keys and was quite amazed at how well
organized the database was. When the company decided to re-number all the
items in the database, it was a minor pain. Extending a critical key would
be a project any way you organize the database IMO. Natural keys are most
common in manufacturing and accounting systems because of the COBOL
heritage, when natural keys were the only way to realistically do it.
Unfortunately SQL really missed the boat on keys...otherwise they would
behave more like a composite type.

> The last really bad one I saw was a manufacturing database that used a
> combination of the customer code and the customer's part number as the
> key. Surely if the customer changes their part number, we should switch
> ours to match so the orders are easy to process, right? Where this got
> fun was when one large customer who released products on a yearly
> schedule decided to change the bill of material for many of their parts
> for the new year, but re-used the same part number; oh, and they were
> still ordering the old parts as well. Hilarity ensued.

In the context of this debate, I see this argument all the time, with the
implied suffix: 'If only we used integer keys we would not have had this
problem...'. Either the customer identifies parts with a part number or
they don't...and if they do identify parts with a number and recycle the
numbers, you have a problem...period. Adding an integer key only moves the
confusion to a separate place, unless it is used by the user to identify
the part number and then *becomes* the key, or a part of it. If you hide
the id from the user, then I claim the data model is pretty much busted.

merlin
Merlin Moncure wrote:
> In the context of this debate, I see this argument all the time, with
> the implied suffix: 'If only we used integer keys we would not have
> had this problem...'. Either the customer identifies parts with a
> part number or they don't...and if they do identify parts with a
> number and recycle the numbers, you have a problem...period.

On the contrary. You create a new record with the same part number. You
mark the old part number "obsolete". Everything else (the part's
description, and all the relationships that it's in, such as order
history, catalog inclusion, revision history, etc.) is unaffected. New
orders are placed against the new part number's DB record; for safety the
old part number can have a trigger that prevents new orders from being
placed.

Since the part number is NOT the primary key, duplicate part numbers are
not a problem. If you had instead used the part number as the primary
key, you'd be dead in the water.

You can argue that the customer is making a dumb decision by reusing
catalog numbers, and I'd agree. But they do it, and as database designers
we have to handle it. In my particular system, we aggregate information
from several hundred companies, and this exact scenario happens
frequently. Since we're only aggregating information, we have no control
over the data that these companies provide. If we'd used catalog numbers
for primary keys, we'd have big problems.

Craig
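[A minimal sketch of the layout Craig describes, with hypothetical table and column names; the trigger is only stubbed in to show where the "no new orders against obsolete parts" rule would hang:]

CREATE TABLE part (
  part_id     serial PRIMARY KEY,          -- information-free surrogate key
  catalog_no  text NOT NULL,               -- customer's part number, may repeat
  obsolete    boolean NOT NULL DEFAULT false,
  description text
);

CREATE TABLE order_line (
  order_line_id serial PRIMARY KEY,
  part_id       int NOT NULL REFERENCES part (part_id),
  qty           int NOT NULL
);

-- Guard against new orders being placed against obsolete parts.
CREATE OR REPLACE FUNCTION reject_obsolete_part() RETURNS trigger AS $$
BEGIN
  IF (SELECT obsolete FROM part WHERE part_id = NEW.part_id) THEN
    RAISE EXCEPTION 'part % is obsolete', NEW.part_id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_line_check_part
  BEFORE INSERT ON order_line
  FOR EACH ROW EXECUTE PROCEDURE reject_obsolete_part();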
Craig A. James wrote: > Since we're only aggregating information, we have > no control over the data that these companies provide. And at the end of the day that's the root of the problem. It's easy to be lulled into "well it looks like a primary key" rather than being able to guarantee it. -- Richard Huxton Archonet Ltd
On Wed, 18 Apr 2007, Richard Huxton wrote: > And at the end of the day that's the root of the problem. It's easy to be > lulled into "well it looks like a primary key" rather than being able to > guarantee it. In some of these cases it is guaranteed to be a primary key given all available information at the time. The funny thing about unexpected changes to a business model is that you never expect them. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On 4/18/07, Craig A. James <cjames@modgraph-usa.com> wrote:
> Merlin Moncure wrote:
> > In the context of this debate, I see this argument all the time, with
> > the implied suffix: 'If only we used integer keys we would not have
> > had this problem...'. Either the customer identifies parts with a
> > part number or they don't...and if they do identify parts with a
> > number and recycle the numbers, you have a problem...period.
>
> On the contrary. You create a new record with the same part number. You
> mark the old part number "obsolete". Everything else (the part's
> description, and all the relationships that it's in, such as order
> history, catalog inclusion, revision history, etc.) is unaffected. New
> orders are placed against the new part number's DB record; for safety
> the old part number can have a trigger that prevents new orders from
> being placed.
>
> Since the part number is NOT the primary key, duplicate part numbers are
> not a problem. If you had instead used the part number as the primary
> key, you'd be dead in the water.

You are redefining the primary key to be (part_number, obsoletion_date).
Now, if you had not anticipated that in the original design (likely
enough), you do have to refactor queries that join on the table...so
what? If that's too much work, you can use a view to take care of the
problem (which may be a good idea anyways). *You have to refactor the
system anyways because you are now allowing duplicate part numbers where
previously (from the perspective of the user) they were unique.*

The hidden advantage of pushing the full key through the database is it
tends to expose holes in the application/business logic. Chances are some
query is not properly distinguishing obsoleted parts, and now the real
problems come...surrogate keys do not remove complexity, they simply
sweep it under the rug.

merlin
Merlin Moncure wrote:
>> Since the part number is NOT the primary key, duplicate part numbers
>> are not a problem. If you had instead used the part number as the
>> primary key, you'd be dead in the water.
>
> You are redefining the primary key to be (part_number,
> obsoletion_date). Now, if you had not anticipated that in the
> original design (likely enough), you do have to refactor queries that
> join on the table...so what? If that's too much work, you can use a
> view to take care of the problem (which may be a good idea anyways).
> *You have to refactor the system anyways because you are now allowing
> duplicate part numbers where previously (from the perspective of the
> user) they were unique.*
>
> The hidden advantage of pushing the full key through the database is
> it tends to expose holes in the application/business logic. Chances
> are some query is not properly distinguishing obsoleted parts, and now
> the real problems come...surrogate keys do not remove complexity, they
> simply sweep it under the rug.

This really boils down to an object-oriented perspective. I have an
object, a customer's catalog entry. It has properties such as catalog
number, description, etc, and whether it's obsolete or not. Management of
the object (its relation to other objects, its history, etc.) should NOT
depend on the object's specific definition.

This is true whether the object is represented in Lisp, C++, Perl, or (in
this case) an SQL schema. Good object oriented design abstracts the object
and its behavior from management of the object. In C++, Perl, etc., we
manage objects via a pointer or object reference. In SQL, we reference
objects by an *arbitrary* integer that is effectively a pointer to the
object.

What you're suggesting is that I should break the object-oriented
encapsulation by pulling out specific fields of the object, exposing those
internal object details to the applications, and spreading those details
across the whole schema. And I argue that this is wrong, because it breaks
encapsulation. By exposing the details of the object, if the details
change, *all* of your relationships break, and all of your applications
have to change. And I've never seen a system where breaking
object-oriented encapsulation was a good long-term solution. Systems
change, and object-oriented techniques were invented to help manage
change.

This is one of the reasons the Postgres project was started way back when:
to bring object-oriented techniques to the relational-database world.

Craig
I think a database with all natural keys is unrealistic. For example if you have a table that refers to people, are you going to use their name as a primary key? Names change all the time due to things like marriage, divorce, or trouble with the law. We have tables with 20 million rows which reference back to a table of people, and if I used the person's name as key, it would be a major pain when somebody's name changes. Even if there is referential integrity, one person might be referred to by 25% of the 20 million rows, so the update would take quite a long time. Also the table will be filled with dead rows and the indexes will likely be bloated. If I want to clean that up, it will take a vacuum full or a cluster which will lock the whole table and run for hours. If I use a surrogate key, I can change their name in one row and be done with it. Just my 2 cents. Dave
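[A quick sketch of the two update patterns Dave is contrasting, with invented table names; the point is only how many rows each name change has to touch:]

-- Surrogate key: a name change touches exactly one row.
CREATE TABLE person (
  person_id serial PRIMARY KEY,
  name      text NOT NULL
);
CREATE TABLE event (
  event_id  serial PRIMARY KEY,
  person_id int NOT NULL REFERENCES person (person_id)
);
UPDATE person SET name = 'New Name' WHERE person_id = 42;

-- Natural key: the same change must cascade through every referencing row
-- (potentially millions of them in Dave's example), with the dead-row and
-- index-bloat consequences he describes.
CREATE TABLE person_nk (
  name text PRIMARY KEY
);
CREATE TABLE event_nk (
  event_id serial PRIMARY KEY,
  name     text NOT NULL REFERENCES person_nk (name) ON UPDATE CASCADE
);
UPDATE person_nk SET name = 'New Name' WHERE name = 'Old Name';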
On Tue, 2007-04-17 at 21:06 -0700, Craig A. James wrote:
> Merlin Moncure wrote:
> > In the context of this debate, I see this argument all the time, with
> > the implied suffix: 'If only we used integer keys we would not have
> > had this problem...'. Either the customer identifies parts with a
> > part number or they don't...and if they do identify parts with a
> > number and recycle the numbers, you have a problem...period.
>
> On the contrary. You create a new record with the same part number. You
> mark the old part number "obsolete". Everything else (the part's
> description, and all the relationships that it's in, such as order
> history, catalog inclusion, revision history, etc.) is unaffected. New
> orders are placed against the new part number's DB record; for safety
> the old part number can have a trigger that prevents new orders from
> being placed.
>
> Since the part number is NOT the primary key, duplicate part numbers are
> not a problem. If you had instead used the part number as the primary
> key, you'd be dead in the water.
>
> You can argue that the customer is making a dumb decision by reusing
> catalog numbers, and I'd agree. But they do it, and as database
> designers we have to handle it. In my particular system, we aggregate
> information from several hundred companies, and this exact scenario
> happens frequently. Since we're only aggregating information, we have no
> control over the data that these companies provide. If we'd used catalog
> numbers for primary keys, we'd have big problems.

Storing data is easy. The difficulty lies in storing data in such a way
that your assumptions about the data remain valid and your queries still
answer the questions that you think they answer.

Because an internal ID field has no meaning outside of the database (some
auto-generated keys do have meaning outside the database, but I'm not
talking about those), you can't effectively query based on an internal id
any more than you can query by the ctid. So what do you query by then? You
query by natural keys anyway. Internal id fields are an implementation
detail related to performance (the real topic of this discussion).

If you have two parts with the same part id, what does that mean? Sure,
you can store the data, but then the queries that assumed that data was
unique no longer hold. Sometimes you need two parts with the same part id,
but you have to know the meaning in order to query based on that data.

Let me ask these questions:
 - Do you think that all of your relations have an internal id?
 - Do you think that all the internal ids you use are unique in the
   relations in which they appear?

If you answer "yes" to either question, consider that every query on that
data is also a relation and so are subselects and intermediate results. Do
those all have an id? If not, why not? How do you join a virtual relation
to a physical relation if the virtual relation has no internal id? Is the
id field still unique in the result of a join or Cartesian product?

Regards,
Jeff Davis
On 4/18/07, Dave Dutcher <dave@tridecap.com> wrote:
> I think a database with all natural keys is unrealistic. For example if
> you have a table that refers to people, are you going to use their name
> as a primary key? Names change all the time due to things like marriage,
> divorce, or trouble with the law. We have tables with 20 million rows
> which reference back to a table of people, and if I used the person's
> name as key, it would be a major pain when somebody's name changes. Even
> if there is referential integrity, one person might be referred to by
> 25% of the 20 million rows, so the update would take quite a long time.
> Also the table will be filled with dead rows and the indexes will likely
> be bloated. If I want to clean that up, it will take a vacuum full or a
> cluster which will lock the whole table and run for hours. If I use a
> surrogate key, I can change their name in one row and be done with it.

That's perfectly reasonable (I mentioned this upthread)...there are a
couple of corner cases where RI costs too much. Exchanging a surrogate for
a natural is a valid performance consideration. Usually, the performance
win is marginal at best (and your example suggests possible normalization
issues in the child table); sometimes there is no alternative...updating 5
million rows is obviously nasty.

That said -- if the cost of the update was zero, would you still do it
that way? I'm trying to separate performance-related issues, which are
reasonable and valid depending on the situation, from good design
principles.

merlin
I have a table with 2.5 million real[] arrays. (They are points in a time series.) Given a new array X, I'd like to find, say, the 25 closest to X in some sense--for simplification, let's just say in the usual vector norm. Speed is critical here, and everything I have tried has been too slow. I imported the cube contrib package, and I tried creating an index on a cube of the last 6 elements, which are the most important. Then I tested the 2.5MM rows for being contained within a tolerance of the last 6 elements of X, +/- 0.1 in each coordinate, figuring that would be an indexed search (which I CLUSTERED on). I then ran the sort on this smaller set. The index was used, but it was still too slow. I also tried creating new columns with rounded int2 values of the last 6 coordinates and made a multicolumn index. For each X the search is taking about 4-15 seconds which is above my target at least one order of magnitude. Absolute numbers are dependent on my hardware and settings, and some of this can be addressed with configuration tweaks, etc., but first I think I need to know the optimum data structure/indexing strategy. Is anyone on the list experienced with this sort of issue? Thanks. Andrew Lazarus andrew@pillette.com
On Fri, 2007-04-20 at 12:07 -0700, Andrew Lazarus wrote: > I have a table with 2.5 million real[] arrays. (They are points in a > time series.) Given a new array X, I'd like to find, say, the 25 > closest to X in some sense--for simplification, let's just say in the > usual vector norm. Speed is critical here, and everything I have tried > has been too slow. > > I imported the cube contrib package, and I tried creating an index on > a cube of the last 6 elements, which are the most important. Then I Have you tried just making normal indexes on each of the last 6 elements and see if postgresql will use a BitmapAnd to combine them? Regards, Jeff Davis
Jeff Davis wrote:
> On Fri, 2007-04-20 at 12:07 -0700, Andrew Lazarus wrote:
>> I have a table with 2.5 million real[] arrays. (They are points in a
>> time series.) Given a new array X, I'd like to find, say, the 25
>> closest to X in some sense--for simplification, let's just say in the
>> usual vector norm. Speed is critical here, and everything I have tried
>> has been too slow.
>>
>> I imported the cube contrib package, and I tried creating an index on
>> a cube of the last 6 elements, which are the most important. Then I
>
> Have you tried just making normal indexes on each of the last 6 elements
> and see if postgresql will use a BitmapAnd to combine them?

I don't think that will work for the vector norm i.e:

|x - y| = sqrt(sum over j ((x[j] - y[j])^2))

Cheers

Mark
Because I know the 25 closest are going to be fairly close in each
coordinate, I did try a multicolumn index on the last 6 columns and used a
+/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably inside
that hypercube on the distribution of data in question.)

This hypercube tended to have 10-20K records, and took at least 4 seconds
to retrieve. I was a little surprised by how long that took. So I'm
wondering if my data representation is off the wall.

I should mention I also tried a cube index using gist on all 114 elements,
but CREATE INDEX hadn't finished in 36 hours, when I killed it, and I
wasn't in retrospect sure an index that took something like 6GB by itself
would be helpful on a 2GB of RAM box.

MK> I don't think that will work for the vector norm i.e:
MK> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))

--
Sincerely,
Andrew Lazarus mailto:andrew@pillette.com
Andrew Lazarus wrote:
> Because I know the 25 closest are going to be fairly close in each
> coordinate, I did try a multicolumn index on the last 6 columns and used
> a +/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably
> inside that hypercube on the distribution of data in question.)
>
> This hypercube tended to have 10-20K records, and took at least 4
> seconds to retrieve. I was a little surprised by how long that took.
> So I'm wondering if my data representation is off the wall.
>
> I should mention I also tried a cube index using gist on all 114
> elements, but CREATE INDEX hadn't finished in 36 hours, when I killed
> it, and I wasn't in retrospect sure an index that took something like
> 6GB by itself would be helpful on a 2GB of RAM box.
>
> MK> I don't think that will work for the vector norm i.e:
>
> MK> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))

Sorry, in that case it probably *is* worth trying out 6 single column
indexes and seeing if they get bitmap and'ed together...

Mark
Andrew Lazarus <andrew@pillette.com> writes:
> Because I know the 25 closest are going to be fairly close in each
> coordinate, I did try a multicolumn index on the last 6 columns and used
> a +/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably
> inside that hypercube on the distribution of data in question.)
> This hypercube tended to have 10-20K records, and took at least 4
> seconds to retrieve. I was a little surprised by how long that took.
> So I'm wondering if my data representation is off the wall.

A multicolumn btree index isn't going to be helpful at all. Jeff's idea
of using six single-column indexes with the above query might work,
though.

regards, tom lane
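[For the archives, a sketch of what Jeff and Tom are suggesting. The table name "points", the column name "vec", and the assumption that the relevant elements are positions 109-114 of the 114-element array are all made up here; the original posts don't name them:]

-- Six single-column expression indexes, one per relevant array element.
CREATE INDEX pts_e109 ON points ((vec[109]));
CREATE INDEX pts_e110 ON points ((vec[110]));
CREATE INDEX pts_e111 ON points ((vec[111]));
CREATE INDEX pts_e112 ON points ((vec[112]));
CREATE INDEX pts_e113 ON points ((vec[113]));
CREATE INDEX pts_e114 ON points ((vec[114]));
ANALYZE points;

-- A query restricting each element to a small range can then be answered
-- by BitmapAnd-ing the six indexes (check the plan with EXPLAIN ANALYZE):
SELECT *
FROM points
WHERE vec[109] BETWEEN 0.4 AND 0.6
  AND vec[110] BETWEEN 0.4 AND 0.6
  AND vec[111] BETWEEN 0.4 AND 0.6
  AND vec[112] BETWEEN 0.4 AND 0.6
  AND vec[113] BETWEEN 0.4 AND 0.6
  AND vec[114] BETWEEN 0.4 AND 0.6;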
On Apr 20, 12:07 pm, and...@pillette.com (Andrew Lazarus) wrote:
> I have a table with 2.5 million real[] arrays. (They are points in a
> time series.) Given a new array X, I'd like to find, say, the 25
> closest to X in some sense--for simplification, let's just say in the
> usual vector norm. Speed is critical here, and everything I have tried
> has been too slow.
>
> I imported the cube contrib package, and I tried creating an index on
> a cube of the last 6 elements, which are the most important. Then I
> tested the 2.5MM rows for being contained within a tolerance of the
> last 6 elements of X, +/- 0.1 in each coordinate, figuring that would
> be an indexed search (which I CLUSTERED on). I then ran the sort on
> this smaller set. The index was used, but it was still too slow. I
> also tried creating new columns with rounded int2 values of the last 6
> coordinates and made a multicolumn index.
>
> For each X the search is taking about 4-15 seconds which is above my
> target at least one order of magnitude. Absolute numbers are dependent
> on my hardware and settings, and some of this can be addressed with
> configuration tweaks, etc., but first I think I need to know the
> optimum data structure/indexing strategy.
>
> Is anyone on the list experienced with this sort of issue?
>
> Thanks.
> Andrew Lazarus and...@pillette.com

Having worked in high dimensional spaces a lot in my career I think you'll
find that there are mathematical limits in terms of speed. In practical
terms, a seq_scan will be unavoidable since on first approximation you are
limited to doing an exhaustive search in 101-dimensional space unless you
make approximations or dimensionality reductions of some kind.

Read up on the Curse of Dimensionality:
http://en.wikipedia.org/wiki/Curse_of_dimensionality

Have you considered dimension reduction techniques such as Singular Value
Decomposition, Principal Components Analysis, etc.?
On 4/20/07, Andrew Lazarus <andrew@pillette.com> wrote:
> I have a table with 2.5 million real[] arrays. (They are points in a
> time series.) Given a new array X, I'd like to find, say, the 25
> closest to X in some sense--for simplification, let's just say in the
> usual vector norm. Speed is critical here, and everything I have tried
> has been too slow.

Let me chime in with the observation that this is a multidimensional
nearest neighbour problem (reverse nearest neighbour and its close cousin,
k-NN) that is well known in statistics, and particularly relevant to
statistical learning and classification. Knowing the jargon might help you
dig up efficient algorithms to mine your data; there are tons of
fascinating papers available through Citeseer.

In particular, I recommend the paper "Efficient k-NN Search on Vertically
Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here:
http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It
proposes an algorithm called BOND to drastically reduce the search space
by probabilistic means. They give an example using image histograms, but
the algorithm applies to any multidimensional data. Briefly put, it points
out that proximity comparison can be computed vertically, a few dimensions
at a time, and entire subsets can be thrown away when it's apparent that
they are below a statistically derived lower bound. The only gotcha is
that the algorithm derives much of its performance from the assumption
that your data is vertically decomposed, one table per dimension,
otherwise the search effectively incurs a sequential scan of the entire
dataset, and then you're pretty much back to square one.

The most common approach to nearest neighbour search is to use a spatial
data structure. The classic algorithm is the kd-tree
(http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B tree,
neither of which are available in PostgreSQL. If I remember correctly,
R-trees have also been shown to be useful for high numbers of dimensions;
with PostgreSQL you have R-trees and even better R-tree-equivalent support
through GiST. I have no idea whether you can actually munge your integer
vectors into something GiST can index and search, but it's a thought.
(GiST, presumably, can also theoretically index kd-trees and other spatial
trees.)

Alexander.
On Fri, 27 Apr 2007, Alexander Staubo wrote:

> [...]
>
> The most common approach to nearest neighbour search is to use a spatial
> data structure. The classic algorithm is the kd-tree
> (http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B tree,
> neither of which are available in PostgreSQL. If I remember correctly,
> R-trees have also been shown to be useful for high numbers of
> dimensions; with PostgreSQL you have R-trees and even better
> R-tree-equivalent support through GiST. I have no idea whether you can
> actually munge your integer vectors into something GiST can index and
> search, but it's a thought. (GiST, presumably, can also theoretically
> index kd-trees and other spatial trees.)

You're right, but currently only theoretically, due to interface
restrictions. We have a plan to improve it sometime. There was the SP-GiST
project, which could be used for a k-d-b tree, see
http://www.cs.purdue.edu/spgist/
I don't know if it works with the 8.2 version. Also, it doesn't support
concurrency and recovery.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On 21-4-2007 1:42 Mark Kirkwood wrote:
> I don't think that will work for the vector norm i.e:
>
> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))

I don't know if this is useful here, but I was able to rewrite that
algorithm for a set of very sparse vectors (i.e. they had very little
overlapping factors) to something like:

|x - y|^2 = sum over j (x[j]^2) + sum over j (y[j]^2)
            + for each j where x[j] and y[j] are both non-zero:
              - (x[j]^2 + y[j]^2) + (x[j] - y[j])^2

The first two sums can be calculated only once. So if you have very little
overlap, this is therefore much more efficient (if there is no overlap at
all you end up with the sum of x[j]^2 + y[j]^2 anyway).

Besides, this rewritten calculation allows you to store the X and Y
vectors using a trivial table layout vector(x, i, value) which is only
filled with non-zeros and which you can trivially self-join to find the
closest matches. You don't care about the j's where there is either no x
or y-value anyway with this rewrite.

I can compare over 1000 y's of on average 100 elements to two x's of over
1000 elements on just a single 1.8Ghz AMD processor. (I use it for a
bi-kmeans algorithm, so there are only two buckets to compare to.)

So it might be possible to rewrite your algorithm to be less
calculation-intensive. Obviously, with a dense matrix this isn't going to
work, but there may be other ways to circumvent parts of the algorithm or
to cache large parts of it. It might also help to extract only the 6
relevant columns into a separate temporary table which will have much
smaller records and thus can fit more records per page.

Best regards,

Arjen
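[A minimal sketch of the sparse layout and self-join Arjen describes, with a hypothetical table name; it computes the squared distance as |x|^2 + |y|^2 - 2(x . y), which is equivalent to his rewrite, with the per-vector sums precomputed as he suggests:]

-- Sparse storage: only non-zero components are stored.
CREATE TABLE sparse_vec (
  vec_id int,
  j      int,      -- component index
  val    real,
  PRIMARY KEY (vec_id, j)
);

-- Per-vector squared norms, computed once.
CREATE TABLE vec_norm AS
  SELECT vec_id, sum(val * val) AS norm2
  FROM sparse_vec
  GROUP BY vec_id;

-- Squared distance from vector 1 to every other vector; the dot product
-- only needs the components where both vectors are non-zero, which is
-- exactly what the self-join on j produces.
SELECT ny.vec_id,
       nx.norm2 + ny.norm2 - 2 * coalesce(d.dot, 0) AS dist2
FROM vec_norm nx
JOIN vec_norm ny ON ny.vec_id <> nx.vec_id
LEFT JOIN (
  SELECT a.vec_id AS x_id, b.vec_id AS y_id, sum(a.val * b.val) AS dot
  FROM sparse_vec a
  JOIN sparse_vec b ON b.j = a.j
  GROUP BY a.vec_id, b.vec_id
) d ON d.x_id = nx.vec_id AND d.y_id = ny.vec_id
WHERE nx.vec_id = 1
ORDER BY dist2
LIMIT 25;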
Let me just thank the list, especially for the references. (I found similar papers myself with Google: and to think I have a university library alumni card and barely need it any more!) I'll write again on the sorts of results I get.
On 5/1/07, Andrew Lazarus <andrew@pillette.com> wrote: > Let me just thank the list, especially for the references. (I found > similar papers myself with Google: and to think I have a university > library alumni card and barely need it any more!) > > I'll write again on the sorts of results I get. Looking forward to hearing about them. I have worked with such dataset problems, but never attempted to apply them to a relational database such as PostgreSQL. If you want speed, nothing beats in-memory vectors on a modern CPU architecture. Alexander.