Re: Determining logically unique entities across many partially complete rows where at least one column matches - Mailing list pgsql-sql

From Jasen Betts
Subject Re: Determining logically unique entities across many partially complete rows where at least one column matches
Date
Msg-id h5rlkd$e88$1@reversiblemaps.ath.cx
Whole thread Raw
In response to Determining logically unique entities across many partially complete rows where at least one column matches  (Jamie Tufnell <diesql@googlemail.com>)
List pgsql-sql
On 2009-08-11, Jamie Tufnell <diesql@googlemail.com> wrote:
> Hi,
>
> I am faced with a modeling problem and thought I'd see if anyone has run
> into something similar and can offer some advice.
>
> Basically my problem domain is cataloguing "snippets of information" about
> "entities" which are loosely identified.
>
> Entities can be identified up to 3 different methods (email, phone or
> openid.)
>
> Entities can have zero or many emails, phone numbers and openids.  The
> only restriction is they must have at least one value in one of those three
> columns.
>
>
> Some sample data:
>
> snippet #1
> email: null
> phone: +1234567890
> openid: john@myopenid.net
> information: This is snippet #1
>
> snippet #2
> email: foo@bar.com
> phone: null
> openid: johnny.name
> information: This is snippet #2
>
> At this point snippet #1 and #2 could refer to different entities.
>
> snippet #3
> email: bar@baz.com
> phone: +1234567890
> openid: johnny.name
> information: This is snippet #3
>
> But now all three snippets definitely refer to the same entity, as far as
> we're concerned:
>
> Entity: 1
> OpenIDs: johnny.name, john@myopenid.net
> Phones: +1234567890
> Emails: foo@bar.com, bar@baz.com
>
> So as far as modeling this goes, I'm stuck between:
>
> 1. Normalizing as usual with some serious triggers to maintain the
>    relationships.
> 2. Just having a snippets table with these fields inline and make these
>    inferences at query time.
> 3. Something in between.
> 4. Using a document store like CouchDB.


I think three tables openid,email, phone  
(phone's a bad one for a unique id IME, especially POTS lines) 

create table (entid integer, phone text unique)
etc,etc...

then at insert time you use a rule that runs a function 
with exception handling when the unique rules detect a match with some
existing data and then in the exception code you do updates to replace
the higher entid with the lower one 


> The kinds of queries I need to do right now (which will no doubt change):
>
> * Return all snippets.
> * Return all distinct entities.
> * Find all id for a distinct entity given a single piece of id.
> * Find all snippets for a distinct entity.
>
> To do it in one table, I am thinking something like this:
>
> create table snippets (
>   id serial not null primary key,
>   email text,
>   phone_number text,
>   openid text,
>   information text not null,
>   check (email is not null or
>          phone_number is not null or openid is not null)
> );
>
> with queries like:
>
> * Find all snippets for one distinct entity, searching by openid:
>
> select * from snippets
> where phone_number =
>   (select phone_number from snippets where openid = 'john@myopenid.net')
> or email =
>   (select email from snippets where openid = 'john@myopenid.net')
> or openid in
>   (select openid from snippets
>    where phone_number =
>     (select phone_number from snippets where openid = 'john@myopenid.net')
>    or email =
>    (select email from snippets where openid = 'john@myopenid.net'));

but that won't get them all.

> Or if I was to model as usual I am thinking something like this:
>
> create table entities (
>   id serial not null primary key
> );
>
> create table entity_has_email (
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   email text not null unique
> );
>
> create table entity_has_phone_number (
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   phone_number text not null unique
> );
>
> create table entity_has_openid (
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   openid text not null unique
> );
>
> create table snippets (
>   id serial not null primary key,
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   information text not null
> );
>
> (followed by a mass of on insert/update/delete triggers)

>
> select s.* from snippets s
>   join entity_has_email e on s.entity_id = e.id
>   join entity_has_phone_number p on s.entity_id = p.id
>   join entity_has_openid o on s.entity_id = o.id
> where o.openid = 'john@myopenid.net';

looks like the wrong query for snippets to me.
select s.* from snippets s  join entity_has_openid o on s.entity_id = o.idwhere o.openid = 'john@myopenid.net';

> Another option, sort of half way between the two could be:
>
> create table snippets (
>   id serial not null primary key,
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   information text not null
> );
>
> create table entities (
>   id serial not null primary key,
>   email text,
>   phone_number text,
>   openid text,
>   check (email is not null or
>          phone_number is not null or openid is not null)
> );

that's not going to work with your example data.
(subject has two different email addresses)

I guess you could use arrays for email, openid, and phone.



pgsql-sql by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: Updating a specific number of rows in pl/pgsql
Next
From: Rob Sargent
Date:
Subject: Re: Determining logically unique entities across many partially complete rows where at least one column matches