Thread: A general database question!

A general database question!

From
Dmitry Tkach
Date:
Hi, everybody!

I have a general database design question for you...
Suppose, I have a database of entities, that have two attributes, say,
location and name. Location is numerical and name is a varchar.
I need to be able to search by both location and name (e.g. get all the
entries whose location is 1 and name is John etc...).
It is simple so far - I'd just create an an index on both location and
name...

Now, suppose, my entities may (or may not) have several names (aliases),
and I need my query to return the entries with any of the aliases,
matching the parameter...

Normally, I would just split the table into two - like:

create table location (id serial primary key, loc int);
create table names (id references location,
                     type int,
                     name varchar(100) not null,
                     unique (id,type,name)
                     );
Now I could do
select * from location l, names n where l.id=n.id and loc=1 and name =
'John';

...
But there are two problems with it:
The small problem is that it may return multiple rows for the same
entity (fine, suppose, I can deal with that).
The much bigger problem though is that I can't now create an index on
both location and name, and creating two separate indexes won't do me
much good as far as I understand - still only one of them will be used...

So, if there are, say 10 million entries in location 1, and none of them
is John, and there are 10 million Jons in other locations, my query
(that should be almost instantenious) would take forever :-(

The only way to get around it I can imagine is to stick location column
into the names table (and get rid of the other table) - so that I'll be
able to build an index on it...

But this is not normalized, wastes a lot of space and makes even a
simple query like select * ... where location = 1 return multiple rows
for the same entry, so that I'll have to scan the result set to remove
duplicates every time ...

I am wonderring if any of you database gurus out there have any better
ideas?

If not right away, any plans to implement multi-table indexes in
postgres? Or, perhaps, 'index intersection' (they say, SQL server has
this feature - you keep both tables clustered by the id, and the server
can then fetch two result sets from separate indexes, and merge them,
based on the fact that they are sorted by the id)...

Any input is greatly appreciated...

Dima


Re: A general database question!

From
Jeff Davis
Date:
> I have a general database design question for you...
> Suppose, I have a database of entities, that have two attributes, say,
> location and name. Location is numerical and name is a varchar.
> I need to be able to search by both location and name (e.g. get all the
> entries whose location is 1 and name is John etc...).
> It is simple so far - I'd just create an an index on both location and
> name...
>

I am not quite understanding why a multi-column index won't work. Can you
explain a little more?

Here is what I was thinking:
 create table entity(
    type int,
    location int,
    name varchar(100) not null,
 );
 create unique index my_index on entity(location,name);

It seems as though an "and" query would run quickly. It also seems as though
if you needed to get a few records that were one of several names, a query
such as the following would run quickly:
select * from entity where location = 5 and (name = 'joe' or name='joseph');

or are the aliases a seperate attribute? I am not clear on exactly where the
alias comes in.

Regards,
    Jeff

Re: A general database question!

From
Dmitry Tkach
Date:
It is simple so far - I'd just create an an index on=
 both location andname...

    I am not quite understanding why a multi-colu=
mn index won't work. Can you explain a little more?Here is what=
 I was thinking: create table entity(    type int,    location int,<b=
r>    name varchar(100) not null, ); create unique index my_index on e=
ntity(location,name);It seems as though an "and" query would run qu=
ickly. It also seems as though if you needed to get a few records that =
were one of several names, a query such as the following would run quic=
kly:select * from entity where location =3D 5 and (name =3D 'joe' or na=
me=3D'joseph');or are the aliases a seperate attribute? I am not cl=
ear on exactly where the alias comes in.

Well... Yes, that's the problem - aliases ARE a separate attribute. The SAME
entry could be known as both 'joe' and 'joseph', so, as far as I can see,
I have to choices to do this, as I explained earlier - either create two
tables, one having just entry id and location, and the other one, having
id, name and type (or 'nametype' if you will), liked to the first one, or,
I could have one table with multiple rows, corresponding to the same entry.=


The problem is that I don't =9Alike any of these solutions :-(
The first one isn't good enough because I can't create an index accross two
tables, so the search by location and name would be problematic...

The second one sucks, because it duplicates all those locations, and,=9A be=
cause
it is not normalized.

Your suggestion looks like the second of these solutions... First of all,
the table you suggest, would need another column, say, entityID, telling
which entity this name entry belongs to. This column can't even be a serial
or a primary key, because there will have to be multiple rows with the same
id (to hold aliases)... Now, suppose, a paritcular entity has 10 different
names - this would create 10 different rows in the table, all of which are
supposed to have the same id and location...
First of all, I'll waste 9*(sizeof(location)) bytes per each such entity
(10 identical location values, instead of only one really needed)... And
secondly, how am I going to enforce the consistency when an entity 'moves'
(location gets changed)? All those ten rows have to be updated... In other
words, I would need to have some means of ensuring that for every given pair
of rows in that table if row1.id=3Drow2.id then row1.location=3Drow2.locati=
on...
This calls for another complication of the schema - putting a trigger on
the table to ensure the consistency....

While this all seems to be doable, please keep in mind that what we are dis=
cussing
here is just a simple example... This gets more complicated, if you take
into consideration, for example, the fact, that name and location are likely
to be not the only two attributes of an entity - there could be many more
of them... Also, consider that you have LOTS of those entitites (tens of
millions), each having, say, 20 attributes - imagine how much space would
be wasted on duplication, and how much time would be wasted executing that
trigger... :-(

Does this clarify things?

Thanks!

Dima