Thread: Determining logically unique entities across many partially complete rows where at least one column matches

Hi,

I am faced with a modeling problem and thought I'd see if anyone has run
into something similar and can offer some advice.

Basically my problem domain is cataloguing "snippets of information" about
"entities" which are loosely identified.

Entities can be identified up to 3 different methods (email, phone or
openid.)

Entities can have zero or many emails, phone numbers and openids.  The
only restriction is they must have at least one value in one of those three
columns.


Some sample data:

snippet #1
email: null
phone: +1234567890
openid: john@myopenid.net
information: This is snippet #1

snippet #2
email: foo@bar.com
phone: null
openid: johnny.name
information: This is snippet #2

At this point snippet #1 and #2 could refer to different entities.

snippet #3
email: bar@baz.com
phone: +1234567890
openid: johnny.name
information: This is snippet #3

But now all three snippets definitely refer to the same entity, as far as
we're concerned:

Entity: 1
OpenIDs: johnny.name, john@myopenid.net
Phones: +1234567890
Emails: foo@bar.com, bar@baz.com

So as far as modeling this goes, I'm stuck between:

1. Normalizing as usual with some serious triggers to maintain the  relationships.
2. Just having a snippets table with these fields inline and make these  inferences at query time.
3. Something in between.
4. Using a document store like CouchDB.

The kinds of queries I need to do right now (which will no doubt change):

* Return all snippets.
* Return all distinct entities.
* Find all id for a distinct entity given a single piece of id.
* Find all snippets for a distinct entity.

To do it in one table, I am thinking something like this:

create table snippets ( id serial not null primary key, email text, phone_number text, openid text, information text
notnull, check (email is not null or        phone_number is not null or openid is not null)
 
);

with queries like:

* Find all snippets for one distinct entity, searching by openid:

select * from snippets
where phone_number = (select phone_number from snippets where openid = 'john@myopenid.net')
or email = (select email from snippets where openid = 'john@myopenid.net')
or openid in (select openid from snippets  where phone_number =   (select phone_number from snippets where openid =
'john@myopenid.net') or email =   (select email from snippets where openid = 'john@myopenid.net'));
 


Or if I was to model as usual I am thinking something like this:

create table entities ( id serial not null primary key
);

create table entity_has_email ( entity_id integer not null   references entities (id) on delete cascade on update
cascade,email text not null unique
 
);

create table entity_has_phone_number ( entity_id integer not null   references entities (id) on delete cascade on
updatecascade, phone_number text not null unique
 
);

create table entity_has_openid ( entity_id integer not null   references entities (id) on delete cascade on update
cascade,openid text not null unique
 
);

create table snippets ( id serial not null primary key, entity_id integer not null   references entities (id) on delete
cascadeon update cascade, information text not null
 
);

(followed by a mass of on insert/update/delete triggers)

with queries like:

* Find all snippets for a distinct entity, by one identifying field:

select s.* from snippets s join entity_has_email e on s.entity_id = e.id join entity_has_phone_number p on s.entity_id
=p.id join entity_has_openid o on s.entity_id = o.id
 
where o.openid = 'john@myopenid.net';

Another option, sort of half way between the two could be:

create table snippets ( id serial not null primary key, entity_id integer not null   references entities (id) on delete
cascadeon update cascade, information text not null
 
);

create table entities ( id serial not null primary key, email text, phone_number text, openid text, check (email is not
nullor        phone_number is not null or openid is not null)
 
);

* Find all snippets for a distinct entity, by openid = 'john@myopenid.net'

select * from snippets
where entity_id in ( select id from entities where phone_number =   (select phone_number from entities where openid =
'john@myopenid.net')or email =   (select email from entities where openid = 'john@myopenid.net') or openid in   (select
openidfrom entities    where phone_number =      (select phone_number from entities where openid = 'john@myopenid.net'
 or email =      (select email from entities where openid = 'john@myopenid.net') ));
 

At this point I am leaning towards that last method.

Has anyone had to model something similar?  Did you use one of these
methods or something else?

Any/all comments appreciated!

J


On 2009-08-11, Jamie Tufnell <diesql@googlemail.com> wrote:
> Hi,
>
> I am faced with a modeling problem and thought I'd see if anyone has run
> into something similar and can offer some advice.
>
> Basically my problem domain is cataloguing "snippets of information" about
> "entities" which are loosely identified.
>
> Entities can be identified up to 3 different methods (email, phone or
> openid.)
>
> Entities can have zero or many emails, phone numbers and openids.  The
> only restriction is they must have at least one value in one of those three
> columns.
>
>
> Some sample data:
>
> snippet #1
> email: null
> phone: +1234567890
> openid: john@myopenid.net
> information: This is snippet #1
>
> snippet #2
> email: foo@bar.com
> phone: null
> openid: johnny.name
> information: This is snippet #2
>
> At this point snippet #1 and #2 could refer to different entities.
>
> snippet #3
> email: bar@baz.com
> phone: +1234567890
> openid: johnny.name
> information: This is snippet #3
>
> But now all three snippets definitely refer to the same entity, as far as
> we're concerned:
>
> Entity: 1
> OpenIDs: johnny.name, john@myopenid.net
> Phones: +1234567890
> Emails: foo@bar.com, bar@baz.com
>
> So as far as modeling this goes, I'm stuck between:
>
> 1. Normalizing as usual with some serious triggers to maintain the
>    relationships.
> 2. Just having a snippets table with these fields inline and make these
>    inferences at query time.
> 3. Something in between.
> 4. Using a document store like CouchDB.


I think three tables openid,email, phone  
(phone's a bad one for a unique id IME, especially POTS lines) 

create table (entid integer, phone text unique)
etc,etc...

then at insert time you use a rule that runs a function 
with exception handling when the unique rules detect a match with some
existing data and then in the exception code you do updates to replace
the higher entid with the lower one 


> The kinds of queries I need to do right now (which will no doubt change):
>
> * Return all snippets.
> * Return all distinct entities.
> * Find all id for a distinct entity given a single piece of id.
> * Find all snippets for a distinct entity.
>
> To do it in one table, I am thinking something like this:
>
> create table snippets (
>   id serial not null primary key,
>   email text,
>   phone_number text,
>   openid text,
>   information text not null,
>   check (email is not null or
>          phone_number is not null or openid is not null)
> );
>
> with queries like:
>
> * Find all snippets for one distinct entity, searching by openid:
>
> select * from snippets
> where phone_number =
>   (select phone_number from snippets where openid = 'john@myopenid.net')
> or email =
>   (select email from snippets where openid = 'john@myopenid.net')
> or openid in
>   (select openid from snippets
>    where phone_number =
>     (select phone_number from snippets where openid = 'john@myopenid.net')
>    or email =
>    (select email from snippets where openid = 'john@myopenid.net'));

but that won't get them all.

> Or if I was to model as usual I am thinking something like this:
>
> create table entities (
>   id serial not null primary key
> );
>
> create table entity_has_email (
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   email text not null unique
> );
>
> create table entity_has_phone_number (
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   phone_number text not null unique
> );
>
> create table entity_has_openid (
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   openid text not null unique
> );
>
> create table snippets (
>   id serial not null primary key,
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   information text not null
> );
>
> (followed by a mass of on insert/update/delete triggers)

>
> select s.* from snippets s
>   join entity_has_email e on s.entity_id = e.id
>   join entity_has_phone_number p on s.entity_id = p.id
>   join entity_has_openid o on s.entity_id = o.id
> where o.openid = 'john@myopenid.net';

looks like the wrong query for snippets to me.
select s.* from snippets s  join entity_has_openid o on s.entity_id = o.idwhere o.openid = 'john@myopenid.net';

> Another option, sort of half way between the two could be:
>
> create table snippets (
>   id serial not null primary key,
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   information text not null
> );
>
> create table entities (
>   id serial not null primary key,
>   email text,
>   phone_number text,
>   openid text,
>   check (email is not null or
>          phone_number is not null or openid is not null)
> );

that's not going to work with your example data.
(subject has two different email addresses)

I guess you could use arrays for email, openid, and phone.



Seems to me that if you can safely identify which snippets correspond to 
a given entity you want a single id for the entity.  An entity-snippet 
relationship seems a must. I would not lean too heavily on a single 
table solution unless you're considering arrays for openid,email and 
phone.  (And given the one-to-many-real-people on phone I would be leery 
of this "identifier".)

Jamie Tufnell wrote:
> Hi,
>
> I am faced with a modeling problem and thought I'd see if anyone has run
> into something similar and can offer some advice.
>
> Basically my problem domain is cataloguing "snippets of information" about
> "entities" which are loosely identified.
>
> Entities can be identified up to 3 different methods (email, phone or
> openid.)
>
> Entities can have zero or many emails, phone numbers and openids.  The
> only restriction is they must have at least one value in one of those three
> columns.
>
>
> Some sample data:
>
> snippet #1
> email: null
> phone: +1234567890
> openid: john@myopenid.net
> information: This is snippet #1
>
> snippet #2
> email: foo@bar.com
> phone: null
> openid: johnny.name
> information: This is snippet #2
>
> At this point snippet #1 and #2 could refer to different entities.
>
> snippet #3
> email: bar@baz.com
> phone: +1234567890
> openid: johnny.name
> information: This is snippet #3
>
> But now all three snippets definitely refer to the same entity, as far as
> we're concerned:
>
> Entity: 1
> OpenIDs: johnny.name, john@myopenid.net
> Phones: +1234567890
> Emails: foo@bar.com, bar@baz.com
>
> So as far as modeling this goes, I'm stuck between:
>
> 1. Normalizing as usual with some serious triggers to maintain the
>    relationships.
> 2. Just having a snippets table with these fields inline and make these
>    inferences at query time.
> 3. Something in between.
> 4. Using a document store like CouchDB.
>
> The kinds of queries I need to do right now (which will no doubt change):
>
> * Return all snippets.
> * Return all distinct entities.
> * Find all id for a distinct entity given a single piece of id.
> * Find all snippets for a distinct entity.
>
> To do it in one table, I am thinking something like this:
>
> create table snippets (
>   id serial not null primary key,
>   email text,
>   phone_number text,
>   openid text,
>   information text not null,
>   check (email is not null or
>          phone_number is not null or openid is not null)
> );
>
> with queries like:
>
> * Find all snippets for one distinct entity, searching by openid:
>
> select * from snippets
> where phone_number =
>   (select phone_number from snippets where openid = 'john@myopenid.net')
> or email =
>   (select email from snippets where openid = 'john@myopenid.net')
> or openid in
>   (select openid from snippets
>    where phone_number =
>     (select phone_number from snippets where openid = 'john@myopenid.net')
>    or email =
>     (select email from snippets where openid = 'john@myopenid.net'));
>
>
> Or if I was to model as usual I am thinking something like this:
>
> create table entities (
>   id serial not null primary key
> );
>
> create table entity_has_email (
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   email text not null unique
> );
>
> create table entity_has_phone_number (
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   phone_number text not null unique
> );
>
> create table entity_has_openid (
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   openid text not null unique
> );
>
> create table snippets (
>   id serial not null primary key,
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   information text not null
> );
>
> (followed by a mass of on insert/update/delete triggers)
>
> with queries like:
>
> * Find all snippets for a distinct entity, by one identifying field:
>
> select s.* from snippets s
>   join entity_has_email e on s.entity_id = e.id
>   join entity_has_phone_number p on s.entity_id = p.id
>   join entity_has_openid o on s.entity_id = o.id
> where o.openid = 'john@myopenid.net';
>
> Another option, sort of half way between the two could be:
>
> create table snippets (
>   id serial not null primary key,
>   entity_id integer not null
>     references entities (id) on delete cascade on update cascade,
>   information text not null
> );
>
> create table entities (
>   id serial not null primary key,
>   email text,
>   phone_number text,
>   openid text,
>   check (email is not null or
>          phone_number is not null or openid is not null)
> );
>
> * Find all snippets for a distinct entity, by openid = 'john@myopenid.net'
>
> select * from snippets
> where entity_id in (
>   select id from entities
>   where phone_number =
>     (select phone_number from entities where openid = 'john@myopenid.net')
>   or email =
>     (select email from entities where openid = 'john@myopenid.net')
>   or openid in
>     (select openid from entities
>      where phone_number =
>        (select phone_number from entities where openid = 'john@myopenid.net'
>      or email =
>        (select email from entities where openid = 'john@myopenid.net') ));
>
> At this point I am leaning towards that last method.
>
> Has anyone had to model something similar?  Did you use one of these
> methods or something else?
>
> Any/all comments appreciated!
>
> J
>
>