Re: Determining logically unique entities across many partially complete rows where at least one column matches - Mailing list pgsql-sql
From | Jasen Betts |
---|---|
Subject | Re: Determining logically unique entities across many partially complete rows where at least one column matches |
Date | |
Msg-id | h5rlkd$e88$1@reversiblemaps.ath.cx Whole thread Raw |
In response to | Determining logically unique entities across many partially complete rows where at least one column matches (Jamie Tufnell <diesql@googlemail.com>) |
List | pgsql-sql |
On 2009-08-11, Jamie Tufnell <diesql@googlemail.com> wrote: > Hi, > > I am faced with a modeling problem and thought I'd see if anyone has run > into something similar and can offer some advice. > > Basically my problem domain is cataloguing "snippets of information" about > "entities" which are loosely identified. > > Entities can be identified up to 3 different methods (email, phone or > openid.) > > Entities can have zero or many emails, phone numbers and openids. The > only restriction is they must have at least one value in one of those three > columns. > > > Some sample data: > > snippet #1 > email: null > phone: +1234567890 > openid: john@myopenid.net > information: This is snippet #1 > > snippet #2 > email: foo@bar.com > phone: null > openid: johnny.name > information: This is snippet #2 > > At this point snippet #1 and #2 could refer to different entities. > > snippet #3 > email: bar@baz.com > phone: +1234567890 > openid: johnny.name > information: This is snippet #3 > > But now all three snippets definitely refer to the same entity, as far as > we're concerned: > > Entity: 1 > OpenIDs: johnny.name, john@myopenid.net > Phones: +1234567890 > Emails: foo@bar.com, bar@baz.com > > So as far as modeling this goes, I'm stuck between: > > 1. Normalizing as usual with some serious triggers to maintain the > relationships. > 2. Just having a snippets table with these fields inline and make these > inferences at query time. > 3. Something in between. > 4. Using a document store like CouchDB. I think three tables openid,email, phone (phone's a bad one for a unique id IME, especially POTS lines) create table (entid integer, phone text unique) etc,etc... then at insert time you use a rule that runs a function with exception handling when the unique rules detect a match with some existing data and then in the exception code you do updates to replace the higher entid with the lower one > The kinds of queries I need to do right now (which will no doubt change): > > * Return all snippets. > * Return all distinct entities. > * Find all id for a distinct entity given a single piece of id. > * Find all snippets for a distinct entity. > > To do it in one table, I am thinking something like this: > > create table snippets ( > id serial not null primary key, > email text, > phone_number text, > openid text, > information text not null, > check (email is not null or > phone_number is not null or openid is not null) > ); > > with queries like: > > * Find all snippets for one distinct entity, searching by openid: > > select * from snippets > where phone_number = > (select phone_number from snippets where openid = 'john@myopenid.net') > or email = > (select email from snippets where openid = 'john@myopenid.net') > or openid in > (select openid from snippets > where phone_number = > (select phone_number from snippets where openid = 'john@myopenid.net') > or email = > (select email from snippets where openid = 'john@myopenid.net')); but that won't get them all. > Or if I was to model as usual I am thinking something like this: > > create table entities ( > id serial not null primary key > ); > > create table entity_has_email ( > entity_id integer not null > references entities (id) on delete cascade on update cascade, > email text not null unique > ); > > create table entity_has_phone_number ( > entity_id integer not null > references entities (id) on delete cascade on update cascade, > phone_number text not null unique > ); > > create table entity_has_openid ( > entity_id integer not null > references entities (id) on delete cascade on update cascade, > openid text not null unique > ); > > create table snippets ( > id serial not null primary key, > entity_id integer not null > references entities (id) on delete cascade on update cascade, > information text not null > ); > > (followed by a mass of on insert/update/delete triggers) > > select s.* from snippets s > join entity_has_email e on s.entity_id = e.id > join entity_has_phone_number p on s.entity_id = p.id > join entity_has_openid o on s.entity_id = o.id > where o.openid = 'john@myopenid.net'; looks like the wrong query for snippets to me. select s.* from snippets s join entity_has_openid o on s.entity_id = o.idwhere o.openid = 'john@myopenid.net'; > Another option, sort of half way between the two could be: > > create table snippets ( > id serial not null primary key, > entity_id integer not null > references entities (id) on delete cascade on update cascade, > information text not null > ); > > create table entities ( > id serial not null primary key, > email text, > phone_number text, > openid text, > check (email is not null or > phone_number is not null or openid is not null) > ); that's not going to work with your example data. (subject has two different email addresses) I guess you could use arrays for email, openid, and phone.