Thread: Design question regarding arrays

Design question regarding arrays

From
A Gilmore
Date:
Hello,

Ill assume I have the correct mailing list since my question regards the
tip located in the 7.4 user docs on arrays.  It says that searching for
specific array elements can be a sign of bad design.  To fully quote the
documentation :

"Tip:  Arrays are not sets; searching for specific array elements may be
a sign of database misdesign. Consider using a separate table with a row
for each item that would be an array element. This will be easier to
search, and is likely to scale up better to large numbers of elements."

I was considering using an array to hold catagorization information in
my db.  So to use say books as an example, Id have a record for each
book and a text array column contain the list of catagories it belonged
to, such as:

{"sci-fi", "horror", "suspense"}

Regularly Ill have to search this db for records containing a specific
catagory.  For example get all books belonging to the sci-fi catagory,
for the most parts the searchs will only look for a single catagory.
They would look like this :

SELECT * FROM my_table WHERE 'sci-fi' = ANY (catagory);

Is that bad design?  What would be a better method of doing this?  The
number of catagories will number around 50, maybe more, and change
periodically, so creating individual BOOLEAN columns for each catagory
would be tedious and undesirable.  Although there are many different
catagories, rarely will more then 1-5 apply to a single record, so each
array wont be very large.

I don't know how large this database will grow, I would like to design
it to scale well just to be safe.

Thanks in advance for any help.
A Gilmore

Re: Design question regarding arrays

From
Michael Glaesemann
Date:
On Aug 8, 2004, at 2:03 PM, A Gilmore wrote:

> Regularly Ill have to search this db for records containing a specific
> catagory.  For example get all books belonging to the sci-fi catagory,
> for the most parts the searchs will only look for a single catagory.
> They would look like this :
>
> SELECT * FROM my_table WHERE 'sci-fi' = ANY (catagory);
>
> Is that bad design?

Probably. I'd set up a categories table that lists the possible
categories. Then create a books_categories table that lists the
categories for each book. For example:

create table books (
     book_id serial not null unique
     , title text not null
);

create table categories (
     category_name text not null unique
);

create table books_categories (
     book_id integer not null
         references books (book_id)
         on update cascade on delete cascade
     , category_name text not null
         references categories (category_name)
         on update cascade on delete cascade
     , unique (book_id, category_name)
);

insert into books (title) values ('Visual Explanations');
insert into books (title) values ('The Visual Display of Quantitative
Information');
insert into books (title) values ('The Mythical Man-Month');

select book_id, title from books;
  book_id |                     title
---------+------------------------------------------------
        1 | Visual Explanations
        2 | The Visual Display of Quantitative Information
        3 | The Mythical Man-Month
(3 rows)

insert into categories (category_name) values ('graphic design');
insert into categories (category_name) values ('non-fiction');
insert into categories (category_name) values ('fiction');
insert into categories (category_name) values ('project management');

insert into books_categories (book_id, category_name) values
(1,'graphic design');
insert into books_categories (book_id, category_name) values
(2,'graphic design');
insert into books_categories (book_id, category_name) values
(3,'project management');
insert into books_categories (book_id, category_name) values
(1,'non-fiction');
insert into books_categories (book_id, category_name) values
(2,'non-fiction');
insert into books_categories (book_id, category_name) values
(3,'non-fiction');

-- display books and their categories using explicit joins

select
     title
     , category_name
from books
join books_categories using (book_id)
order by title, category_name;
                      title                      |   category_name
------------------------------------------------+--------------------
  The Mythical Man-Month                         | non-fiction
  The Mythical Man-Month                         | project management
  The Visual Display of Quantitative Information | graphic design
  The Visual Display of Quantitative Information | non-fiction
  Visual Explanations                            | graphic design
  Visual Explanations                            | non-fiction
(6 rows)

-- or with implicit joins using the WHERE clause

select
     title
     , category_name
from books as b
     , books_categories as b_c
where b.book_id = b_c.book_id
order by title, category_name;
                      title                      |   category_name
------------------------------------------------+--------------------
  The Mythical Man-Month                         | non-fiction
  The Mythical Man-Month                         | project management
  The Visual Display of Quantitative Information | graphic design
  The Visual Display of Quantitative Information | non-fiction
  Visual Explanations                            | graphic design
  Visual Explanations                            | non-fiction
(6 rows)

-- selecting only graphic design books

select title
from books
join books_categories using (book_id)
where category_name = 'graphic design';
                      title
------------------------------------------------
  Visual Explanations
  The Visual Display of Quantitative Information
(2 rows)

For convenience, I'd probably also create a view that joins book
information and categories, such as:

create view books_categories_view as
select book_id, title, category_name
from books
join books_categories using (book_id);

Then you don't have to write the join each time. You use the like a
table:

select
     book_id
     , title
     , category_name
from books_categories_view
order by title, category_name;
  book_id |                     title                      |
category_name
---------+------------------------------------------------
+--------------------
        3 | The Mythical Man-Month                         | non-fiction
        3 | The Mythical Man-Month                         | project
management
        2 | The Visual Display of Quantitative Information | graphic
design
        2 | The Visual Display of Quantitative Information | non-fiction
        1 | Visual Explanations                            | graphic
design
        1 | Visual Explanations                            | non-fiction
(6 rows)

select title from books_categories_view where category_name = 'graphic
design';
                      title
------------------------------------------------
  Visual Explanations
  The Visual Display of Quantitative Information
(2 rows)

You might be wondering what the advantage of the categories table is,
as I haven't used it at all in any of the queries. One reason is that
it keeps a set list of categories. If you have the categories all in
one place, you are probably less like to set up similar categories,
such as 'Graphic Design' and 'graphic art design'. Also, if you have
more information about the categories, for example a longer description
of what the category covers, you keep that in the category table,
rather than duplicating it in the books_categories table. (An important
part of database design is normalization, which in essence is reducing
the amount of duplication in a database.) Some people would also prefer
setting up a category_id rather than using the text category_name. This
situation would look a little different:

create table categories (
     category_id serial not null unique
     , category_name text not null unique
     , category_description text not null unique
);

create table books_categories (
     book_id integer not null
         references books(book_id)
         on update cascade on delete cascade
     , category_id integer not null
         references categories(category_id)
         on update cascade on delete cascade
     , unique (book_id, category_id)
);

create view books_categories_view as
select book_id, title, category_id, category_name
from books
join books_categories using (book_id)
join categories using (category_id);

There'd be similar changes to the queries as well.

Unless you're dealing with data that is naturally an array (for
example, some math constructs with array data), I'd recommend sticking
with setting up tables and join tables. Personally, I have yet to use
an array type. Relational theory (upon which SQL is based) works well
for tables (relations). I'd recommend picking up a book on database
design, especially normalization. There's a lot of information
available on the web as well.

Michael Glaesemann
grzm myrealbox com


Re: Design question regarding arrays

From
Fredrik Jonson
Date:
On Sat  7 Aug 2004 22:03, A Gilmore wrote:

> I was considering using an array to hold catagorization information in
> my db.  So to use say books as an example, Id have a record for each
> book and a text array column contain the list of catagories it belonged
> to, such as:
>
> {"sci-fi", "horror", "suspense"}
>
> Regularly Ill have to search this db for records containing a specific
> catagory.  For example get all books belonging to the sci-fi catagory,
> for the most parts the searchs will only look for a single catagory.
> They would look like this :
>
> SELECT * FROM my_table WHERE 'sci-fi' = ANY (catagory);
>
> What would be a better method of doing this?

I think this is almost a classic problem of database design. Let's
see if I can get it right. =)

Create a separate table for your categories:

 category_names
  int id
  text name

And create a table which contain info about which items in my_table which
are related to which categories. I'm assuming you have a unique id in
the table 'my_table'

 book_category
 int my_table_id
 int category_names_id

Both of these of course are foreing keys from their respective tables.

This way, you can use a subquery to find all books of a category_name
which exist in the book_category, and then all the data about them in
my_table.

HTH, regards,
--
Fredrik Jonson

Re: Design question regarding arrays

From
A Gilmore
Date:
Michael Glaesemann wrote:
>
> On Aug 8, 2004, at 2:03 PM, A Gilmore wrote:
>
>> Regularly Ill have to search this db for records containing a
>> specific  catagory.  For example get all books belonging to the sci-fi
>> catagory,  for the most parts the searchs will only look for a single
>> catagory.  They would look like this :
>>
>> SELECT * FROM my_table WHERE 'sci-fi' = ANY (catagory);
>>
>> Is that bad design?
>
>
> Probably. I'd set up a categories table that lists the possible
> categories. Then create a books_categories table that lists the
> categories for each book. For example:
>
> create table books (
>     book_id serial not null unique
>     , title text not null
> );
>
> create table categories (
>     category_name text not null unique
> );
>
> create table books_categories (
>     book_id integer not null
>         references books (book_id)
>         on update cascade on delete cascade
>     , category_name text not null
>         references categories (category_name)
>         on update cascade on delete cascade
>     , unique (book_id, category_name)
> );
>
> Michael Glaesemann
> grzm myrealbox com
>

Thanks a lot of the detailed reply, its a huge help.  Does bring me to
one other question Id like to clarify.  Using this method Ill be doing
inserts into books_categories based on the ID of the latest INSERT into
the books table.  Since I cannot really have a INSERT return a value
such as the serial used without getting into PL/pgSQL, how can I get the
ID of the book just inserted?

According to board threads I have found doing something like :

select book_id from books where book_id = currval('book_id_seq');

Could lead to problems if multiple inserts are occuring at the same time
since currval() may have changed since the insert, and transactions do
not prevent that.  Is this correct, that transactions will not help, and
if so what is the most common way of dealing with this issue (since I
assume its common) ?

Thank you for any help.
A Gilmore

Re: Design question regarding arrays

From
Michael Glaesemann
Date:
On Aug 9, 2004, at 12:44 PM, A Gilmore wrote:

> Is this correct, that transactions will not help, and if so what is
> the most common way of dealing with this issue (since I assume its
> common) ?

In the PostgreSQL 7.4 documention on sequence manipulation functions I
think you'll find your answer.

<http://www.postgresql.org/docs/current/static/functions-sequence.html>

Michael Glaesemann
grzm myrealbox com


Re: Design question regarding arrays

From
Oliver Fromme
Date:
A Gilmore wrote:
 > Thanks a lot of the detailed reply, its a huge help.  Does bring me to
 > one other question Id like to clarify.  Using this method Ill be doing
 > inserts into books_categories based on the ID of the latest INSERT into
 > the books table.  Since I cannot really have a INSERT return a value
 > such as the serial used without getting into PL/pgSQL, how can I get the
 > ID of the book just inserted?
 >
 > According to board threads I have found doing something like :
 >
 > select book_id from books where book_id = currval('book_id_seq');
 >
 > Could lead to problems if multiple inserts are occuring at the same time
 > since currval() may have changed since the insert, and transactions do
 > not prevent that.  Is this correct, that transactions will not help, and
 > if so what is the most common way of dealing with this issue (since I
 > assume its common) ?

Well, I assume that all your book titles are unique, so after
INSERTing a title, you can immediately SELECT the row which
contains that title, so you will get the right ID, no matter
whether there are concurrent updates/inserts:

   INSERT books (title) VALUES ('PostgreSQL for Dummies')

   SELECT id FROM books WHERE title = 'PostgreSQL for Dummies'

If your titles are not unique, it will get a bit more difficult.
In that case, you can (in fact: must) set the "serializable"
isolation level (not the default which is "read committed"),
then you INSERT the new row in the books table and immediately
(within the same transaction) SELECT currval() from the serial
column.  The serializable isolation level will guarantee that
you will get the new value caused by your own most recent
INSERT within the same transaction.  However, be prepared that
the transaction might fail if a concurrent update occurs.  The
docs have much more information on the topic of isolation
levels:

http://www.postgresql.org/docs/current/static/transaction-iso.html

Best regards
   Oliver

--
Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"I made up the term 'object-oriented', and I can tell you
I didn't have C++ in mind."
        -- Alan Kay, OOPSLA '97

Re: Design question regarding arrays

From
Michael Glaesemann
Date:
On Aug 9, 2004, at 6:42 PM, Oliver Fromme wrote:

> Well, I assume that all your book titles are unique,

Realistically, probably not a good assumption. Not one I'd want to bank
by db schema on. You might be luckier with something like ISSN, but I
don't know enough about re-use of those types of numbers.

Michael Glaesemann
grzm myrealbox com


Re: Design question regarding arrays

From
Oliver Fromme
Date:
Michael Glaesemann wrote:
 > On Aug 9, 2004, at 6:42 PM, Oliver Fromme wrote:
 >
 > > Well, I assume that all your book titles are unique,
 >
 > Realistically, probably not a good assumption. Not one I'd want to bank
 > by db schema on. You might be luckier with something like ISSN, but I
 > don't know enough about re-use of those types of numbers.

My example was a bit simplified for the matter.  Probably
those rows contain more information than just the title,
but also the author, publisher, date, ISSN/ISBN etc.
I think it is a reasonable assumption that the whole rows
(excluding the serial IDs) are unique.

Best regards
   Oliver

--
Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"I have stopped reading Stephen King novels.
Now I just read C code instead."
        -- Richard A. O'Keefe

Re: Design question regarding arrays

From
"Sean Davis"
Date:
I'm a novice, too, but why not look up the value in the sequence and
increment it before doing any inserting?  If you are doing all the inserts
at the same time (more or less), then you can just use this value in the
insert into both tables.  Put all of this inside one transaction.

Sean
----- Original Message -----
From: "Oliver Fromme" <olli@lurza.secnetix.de>
To: "Michael Glaesemann" <grzm@myrealbox.com>
Cc: <pgsql-novice@postgresql.org>; "A Gilmore" <agilmore@shaw.ca>
Sent: Monday, August 09, 2004 6:06 AM
Subject: Re: [NOVICE] Design question regarding arrays


>
> Michael Glaesemann wrote:
>  > On Aug 9, 2004, at 6:42 PM, Oliver Fromme wrote:
>  >
>  > > Well, I assume that all your book titles are unique,
>  >
>  > Realistically, probably not a good assumption. Not one I'd want to bank
>  > by db schema on. You might be luckier with something like ISSN, but I
>  > don't know enough about re-use of those types of numbers.
>
> My example was a bit simplified for the matter.  Probably
> those rows contain more information than just the title,
> but also the author, publisher, date, ISSN/ISBN etc.
> I think it is a reasonable assumption that the whole rows
> (excluding the serial IDs) are unique.
>
> Best regards
>    Oliver
>
> --
> Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München
> Any opinions expressed in this message may be personal to the author
> and may not necessarily reflect the opinions of secnetix in any way.
>
> "I have stopped reading Stephen King novels.
> Now I just read C code instead."
>         -- Richard A. O'Keefe
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>                http://archives.postgresql.org
>



Re: Design question regarding arrays

From
Michael Glaesemann
Date:
On Aug 9, 2004, at 7:06 PM, Oliver Fromme wrote:

> Michael Glaesemann wrote:
>> On Aug 9, 2004, at 6:42 PM, Oliver Fromme wrote:
>>
>
> My example was a bit simplified for the matter.

Regardless, there's no reason not to use currval on the sequence. It's
transaction safe, as explained in both the link I posted previously and
in the FAQ.

http://www.postgresql.org/docs/faqs/FAQ.html (4.15.1 through 4.15.4)

Michael Glaesemann
grzm myrealbox com


Re: Design question regarding arrays

From
Michael Glaesemann
Date:
On Aug 9, 2004, at 7:14 PM, Sean Davis wrote:

> I'm a novice, too, but why not look up the value in the sequence and
> increment it before doing any inserting?  If you are doing all the
> inserts
> at the same time (more or less), then you can just use this value in
> the
> insert into both tables.  Put all of this inside one transaction.

Actually, I don't even think it has to be inside a transaction. Beyond
being transaction safe, currval is session safe (correcting something I
said in an earlier post). As long as you don't put another insert or
otherwise manipulate the sequence in the same session, you can trust
currval

Michael Glaesemann
grzm myrealbox com


Re: Design question regarding arrays

From
Oliver Fromme
Date:
Sean Davis wrote:
 > I'm a novice, too, but why not look up the value in the sequence and
 > increment it before doing any inserting?  If you are doing all the inserts
 > at the same time (more or less), then you can just use this value in the
 > insert into both tables.  Put all of this inside one transaction.

That won't work if another process is trying to do the same
thing at the same time.  They both get the same value when
they look it up, so there will be an ID collision.

Best regards
   Oliver

--
Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"Being really good at C++ is like being really good
at using rocks to sharpen sticks."
        -- Thant Tessman

Re: Design question regarding arrays

From
Stephan Szabo
Date:
On Mon, 9 Aug 2004, Oliver Fromme wrote:

> Sean Davis wrote:
>  > I'm a novice, too, but why not look up the value in the sequence and
>  > increment it before doing any inserting?  If you are doing all the inserts
>  > at the same time (more or less), then you can just use this value in the
>  > insert into both tables.  Put all of this inside one transaction.
>
> That won't work if another process is trying to do the same
> thing at the same time.  They both get the same value when
> they look it up, so there will be an ID collision.

It should be safe if you're using a sequence and using nextval() to get
the value that you are using in both insert statements. A second session
will not get the same value you just got from nextval barring a cycling
sequence generator that wraps back to the same number between the nextval
and insert.

Re: Design question regarding arrays

From
"Sean Davis"
Date:
That is exactly what I had in mind but wasn't clear on.

Sean
----- Original Message -----
From: "Stephan Szabo" <sszabo@megazone.bigpanda.com>
To: "Oliver Fromme" <olli@lurza.secnetix.de>
Cc: "Sean Davis" <sdavis2@mail.nih.gov>; "Michael Glaesemann"
<grzm@myrealbox.com>; <pgsql-novice@postgresql.org>; "A Gilmore"
<agilmore@shaw.ca>
Sent: Monday, August 09, 2004 10:22 AM
Subject: Re: [NOVICE] Design question regarding arrays


> On Mon, 9 Aug 2004, Oliver Fromme wrote:
>
> > Sean Davis wrote:
> >  > I'm a novice, too, but why not look up the value in the sequence and
> >  > increment it before doing any inserting?  If you are doing all the
inserts
> >  > at the same time (more or less), then you can just use this value in
the
> >  > insert into both tables.  Put all of this inside one transaction.
> >
> > That won't work if another process is trying to do the same
> > thing at the same time.  They both get the same value when
> > they look it up, so there will be an ID collision.
>
> It should be safe if you're using a sequence and using nextval() to get
> the value that you are using in both insert statements. A second session
> will not get the same value you just got from nextval barring a cycling
> sequence generator that wraps back to the same number between the nextval
> and insert.
>