Thread: Design question regarding arrays
Hello, Ill assume I have the correct mailing list since my question regards the tip located in the 7.4 user docs on arrays. It says that searching for specific array elements can be a sign of bad design. To fully quote the documentation : "Tip: Arrays are not sets; searching for specific array elements may be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale up better to large numbers of elements." I was considering using an array to hold catagorization information in my db. So to use say books as an example, Id have a record for each book and a text array column contain the list of catagories it belonged to, such as: {"sci-fi", "horror", "suspense"} Regularly Ill have to search this db for records containing a specific catagory. For example get all books belonging to the sci-fi catagory, for the most parts the searchs will only look for a single catagory. They would look like this : SELECT * FROM my_table WHERE 'sci-fi' = ANY (catagory); Is that bad design? What would be a better method of doing this? The number of catagories will number around 50, maybe more, and change periodically, so creating individual BOOLEAN columns for each catagory would be tedious and undesirable. Although there are many different catagories, rarely will more then 1-5 apply to a single record, so each array wont be very large. I don't know how large this database will grow, I would like to design it to scale well just to be safe. Thanks in advance for any help. A Gilmore
On Aug 8, 2004, at 2:03 PM, A Gilmore wrote: > Regularly Ill have to search this db for records containing a specific > catagory. For example get all books belonging to the sci-fi catagory, > for the most parts the searchs will only look for a single catagory. > They would look like this : > > SELECT * FROM my_table WHERE 'sci-fi' = ANY (catagory); > > Is that bad design? Probably. I'd set up a categories table that lists the possible categories. Then create a books_categories table that lists the categories for each book. For example: create table books ( book_id serial not null unique , title text not null ); create table categories ( category_name text not null unique ); create table books_categories ( book_id integer not null references books (book_id) on update cascade on delete cascade , category_name text not null references categories (category_name) on update cascade on delete cascade , unique (book_id, category_name) ); insert into books (title) values ('Visual Explanations'); insert into books (title) values ('The Visual Display of Quantitative Information'); insert into books (title) values ('The Mythical Man-Month'); select book_id, title from books; book_id | title ---------+------------------------------------------------ 1 | Visual Explanations 2 | The Visual Display of Quantitative Information 3 | The Mythical Man-Month (3 rows) insert into categories (category_name) values ('graphic design'); insert into categories (category_name) values ('non-fiction'); insert into categories (category_name) values ('fiction'); insert into categories (category_name) values ('project management'); insert into books_categories (book_id, category_name) values (1,'graphic design'); insert into books_categories (book_id, category_name) values (2,'graphic design'); insert into books_categories (book_id, category_name) values (3,'project management'); insert into books_categories (book_id, category_name) values (1,'non-fiction'); insert into books_categories (book_id, category_name) values (2,'non-fiction'); insert into books_categories (book_id, category_name) values (3,'non-fiction'); -- display books and their categories using explicit joins select title , category_name from books join books_categories using (book_id) order by title, category_name; title | category_name ------------------------------------------------+-------------------- The Mythical Man-Month | non-fiction The Mythical Man-Month | project management The Visual Display of Quantitative Information | graphic design The Visual Display of Quantitative Information | non-fiction Visual Explanations | graphic design Visual Explanations | non-fiction (6 rows) -- or with implicit joins using the WHERE clause select title , category_name from books as b , books_categories as b_c where b.book_id = b_c.book_id order by title, category_name; title | category_name ------------------------------------------------+-------------------- The Mythical Man-Month | non-fiction The Mythical Man-Month | project management The Visual Display of Quantitative Information | graphic design The Visual Display of Quantitative Information | non-fiction Visual Explanations | graphic design Visual Explanations | non-fiction (6 rows) -- selecting only graphic design books select title from books join books_categories using (book_id) where category_name = 'graphic design'; title ------------------------------------------------ Visual Explanations The Visual Display of Quantitative Information (2 rows) For convenience, I'd probably also create a view that joins book information and categories, such as: create view books_categories_view as select book_id, title, category_name from books join books_categories using (book_id); Then you don't have to write the join each time. You use the like a table: select book_id , title , category_name from books_categories_view order by title, category_name; book_id | title | category_name ---------+------------------------------------------------ +-------------------- 3 | The Mythical Man-Month | non-fiction 3 | The Mythical Man-Month | project management 2 | The Visual Display of Quantitative Information | graphic design 2 | The Visual Display of Quantitative Information | non-fiction 1 | Visual Explanations | graphic design 1 | Visual Explanations | non-fiction (6 rows) select title from books_categories_view where category_name = 'graphic design'; title ------------------------------------------------ Visual Explanations The Visual Display of Quantitative Information (2 rows) You might be wondering what the advantage of the categories table is, as I haven't used it at all in any of the queries. One reason is that it keeps a set list of categories. If you have the categories all in one place, you are probably less like to set up similar categories, such as 'Graphic Design' and 'graphic art design'. Also, if you have more information about the categories, for example a longer description of what the category covers, you keep that in the category table, rather than duplicating it in the books_categories table. (An important part of database design is normalization, which in essence is reducing the amount of duplication in a database.) Some people would also prefer setting up a category_id rather than using the text category_name. This situation would look a little different: create table categories ( category_id serial not null unique , category_name text not null unique , category_description text not null unique ); create table books_categories ( book_id integer not null references books(book_id) on update cascade on delete cascade , category_id integer not null references categories(category_id) on update cascade on delete cascade , unique (book_id, category_id) ); create view books_categories_view as select book_id, title, category_id, category_name from books join books_categories using (book_id) join categories using (category_id); There'd be similar changes to the queries as well. Unless you're dealing with data that is naturally an array (for example, some math constructs with array data), I'd recommend sticking with setting up tables and join tables. Personally, I have yet to use an array type. Relational theory (upon which SQL is based) works well for tables (relations). I'd recommend picking up a book on database design, especially normalization. There's a lot of information available on the web as well. Michael Glaesemann grzm myrealbox com
On Sat 7 Aug 2004 22:03, A Gilmore wrote: > I was considering using an array to hold catagorization information in > my db. So to use say books as an example, Id have a record for each > book and a text array column contain the list of catagories it belonged > to, such as: > > {"sci-fi", "horror", "suspense"} > > Regularly Ill have to search this db for records containing a specific > catagory. For example get all books belonging to the sci-fi catagory, > for the most parts the searchs will only look for a single catagory. > They would look like this : > > SELECT * FROM my_table WHERE 'sci-fi' = ANY (catagory); > > What would be a better method of doing this? I think this is almost a classic problem of database design. Let's see if I can get it right. =) Create a separate table for your categories: category_names int id text name And create a table which contain info about which items in my_table which are related to which categories. I'm assuming you have a unique id in the table 'my_table' book_category int my_table_id int category_names_id Both of these of course are foreing keys from their respective tables. This way, you can use a subquery to find all books of a category_name which exist in the book_category, and then all the data about them in my_table. HTH, regards, -- Fredrik Jonson
Michael Glaesemann wrote: > > On Aug 8, 2004, at 2:03 PM, A Gilmore wrote: > >> Regularly Ill have to search this db for records containing a >> specific catagory. For example get all books belonging to the sci-fi >> catagory, for the most parts the searchs will only look for a single >> catagory. They would look like this : >> >> SELECT * FROM my_table WHERE 'sci-fi' = ANY (catagory); >> >> Is that bad design? > > > Probably. I'd set up a categories table that lists the possible > categories. Then create a books_categories table that lists the > categories for each book. For example: > > create table books ( > book_id serial not null unique > , title text not null > ); > > create table categories ( > category_name text not null unique > ); > > create table books_categories ( > book_id integer not null > references books (book_id) > on update cascade on delete cascade > , category_name text not null > references categories (category_name) > on update cascade on delete cascade > , unique (book_id, category_name) > ); > > Michael Glaesemann > grzm myrealbox com > Thanks a lot of the detailed reply, its a huge help. Does bring me to one other question Id like to clarify. Using this method Ill be doing inserts into books_categories based on the ID of the latest INSERT into the books table. Since I cannot really have a INSERT return a value such as the serial used without getting into PL/pgSQL, how can I get the ID of the book just inserted? According to board threads I have found doing something like : select book_id from books where book_id = currval('book_id_seq'); Could lead to problems if multiple inserts are occuring at the same time since currval() may have changed since the insert, and transactions do not prevent that. Is this correct, that transactions will not help, and if so what is the most common way of dealing with this issue (since I assume its common) ? Thank you for any help. A Gilmore
On Aug 9, 2004, at 12:44 PM, A Gilmore wrote: > Is this correct, that transactions will not help, and if so what is > the most common way of dealing with this issue (since I assume its > common) ? In the PostgreSQL 7.4 documention on sequence manipulation functions I think you'll find your answer. <http://www.postgresql.org/docs/current/static/functions-sequence.html> Michael Glaesemann grzm myrealbox com
A Gilmore wrote: > Thanks a lot of the detailed reply, its a huge help. Does bring me to > one other question Id like to clarify. Using this method Ill be doing > inserts into books_categories based on the ID of the latest INSERT into > the books table. Since I cannot really have a INSERT return a value > such as the serial used without getting into PL/pgSQL, how can I get the > ID of the book just inserted? > > According to board threads I have found doing something like : > > select book_id from books where book_id = currval('book_id_seq'); > > Could lead to problems if multiple inserts are occuring at the same time > since currval() may have changed since the insert, and transactions do > not prevent that. Is this correct, that transactions will not help, and > if so what is the most common way of dealing with this issue (since I > assume its common) ? Well, I assume that all your book titles are unique, so after INSERTing a title, you can immediately SELECT the row which contains that title, so you will get the right ID, no matter whether there are concurrent updates/inserts: INSERT books (title) VALUES ('PostgreSQL for Dummies') SELECT id FROM books WHERE title = 'PostgreSQL for Dummies' If your titles are not unique, it will get a bit more difficult. In that case, you can (in fact: must) set the "serializable" isolation level (not the default which is "read committed"), then you INSERT the new row in the books table and immediately (within the same transaction) SELECT currval() from the serial column. The serializable isolation level will guarantee that you will get the new value caused by your own most recent INSERT within the same transaction. However, be prepared that the transaction might fail if a concurrent update occurs. The docs have much more information on the topic of isolation levels: http://www.postgresql.org/docs/current/static/transaction-iso.html Best regards Oliver -- Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München Any opinions expressed in this message may be personal to the author and may not necessarily reflect the opinions of secnetix in any way. "I made up the term 'object-oriented', and I can tell you I didn't have C++ in mind." -- Alan Kay, OOPSLA '97
On Aug 9, 2004, at 6:42 PM, Oliver Fromme wrote: > Well, I assume that all your book titles are unique, Realistically, probably not a good assumption. Not one I'd want to bank by db schema on. You might be luckier with something like ISSN, but I don't know enough about re-use of those types of numbers. Michael Glaesemann grzm myrealbox com
Michael Glaesemann wrote: > On Aug 9, 2004, at 6:42 PM, Oliver Fromme wrote: > > > Well, I assume that all your book titles are unique, > > Realistically, probably not a good assumption. Not one I'd want to bank > by db schema on. You might be luckier with something like ISSN, but I > don't know enough about re-use of those types of numbers. My example was a bit simplified for the matter. Probably those rows contain more information than just the title, but also the author, publisher, date, ISSN/ISBN etc. I think it is a reasonable assumption that the whole rows (excluding the serial IDs) are unique. Best regards Oliver -- Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München Any opinions expressed in this message may be personal to the author and may not necessarily reflect the opinions of secnetix in any way. "I have stopped reading Stephen King novels. Now I just read C code instead." -- Richard A. O'Keefe
I'm a novice, too, but why not look up the value in the sequence and increment it before doing any inserting? If you are doing all the inserts at the same time (more or less), then you can just use this value in the insert into both tables. Put all of this inside one transaction. Sean ----- Original Message ----- From: "Oliver Fromme" <olli@lurza.secnetix.de> To: "Michael Glaesemann" <grzm@myrealbox.com> Cc: <pgsql-novice@postgresql.org>; "A Gilmore" <agilmore@shaw.ca> Sent: Monday, August 09, 2004 6:06 AM Subject: Re: [NOVICE] Design question regarding arrays > > Michael Glaesemann wrote: > > On Aug 9, 2004, at 6:42 PM, Oliver Fromme wrote: > > > > > Well, I assume that all your book titles are unique, > > > > Realistically, probably not a good assumption. Not one I'd want to bank > > by db schema on. You might be luckier with something like ISSN, but I > > don't know enough about re-use of those types of numbers. > > My example was a bit simplified for the matter. Probably > those rows contain more information than just the title, > but also the author, publisher, date, ISSN/ISBN etc. > I think it is a reasonable assumption that the whole rows > (excluding the serial IDs) are unique. > > Best regards > Oliver > > -- > Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München > Any opinions expressed in this message may be personal to the author > and may not necessarily reflect the opinions of secnetix in any way. > > "I have stopped reading Stephen King novels. > Now I just read C code instead." > -- Richard A. O'Keefe > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org >
On Aug 9, 2004, at 7:06 PM, Oliver Fromme wrote: > Michael Glaesemann wrote: >> On Aug 9, 2004, at 6:42 PM, Oliver Fromme wrote: >> > > My example was a bit simplified for the matter. Regardless, there's no reason not to use currval on the sequence. It's transaction safe, as explained in both the link I posted previously and in the FAQ. http://www.postgresql.org/docs/faqs/FAQ.html (4.15.1 through 4.15.4) Michael Glaesemann grzm myrealbox com
On Aug 9, 2004, at 7:14 PM, Sean Davis wrote: > I'm a novice, too, but why not look up the value in the sequence and > increment it before doing any inserting? If you are doing all the > inserts > at the same time (more or less), then you can just use this value in > the > insert into both tables. Put all of this inside one transaction. Actually, I don't even think it has to be inside a transaction. Beyond being transaction safe, currval is session safe (correcting something I said in an earlier post). As long as you don't put another insert or otherwise manipulate the sequence in the same session, you can trust currval Michael Glaesemann grzm myrealbox com
Sean Davis wrote: > I'm a novice, too, but why not look up the value in the sequence and > increment it before doing any inserting? If you are doing all the inserts > at the same time (more or less), then you can just use this value in the > insert into both tables. Put all of this inside one transaction. That won't work if another process is trying to do the same thing at the same time. They both get the same value when they look it up, so there will be an ID collision. Best regards Oliver -- Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München Any opinions expressed in this message may be personal to the author and may not necessarily reflect the opinions of secnetix in any way. "Being really good at C++ is like being really good at using rocks to sharpen sticks." -- Thant Tessman
On Mon, 9 Aug 2004, Oliver Fromme wrote: > Sean Davis wrote: > > I'm a novice, too, but why not look up the value in the sequence and > > increment it before doing any inserting? If you are doing all the inserts > > at the same time (more or less), then you can just use this value in the > > insert into both tables. Put all of this inside one transaction. > > That won't work if another process is trying to do the same > thing at the same time. They both get the same value when > they look it up, so there will be an ID collision. It should be safe if you're using a sequence and using nextval() to get the value that you are using in both insert statements. A second session will not get the same value you just got from nextval barring a cycling sequence generator that wraps back to the same number between the nextval and insert.
That is exactly what I had in mind but wasn't clear on. Sean ----- Original Message ----- From: "Stephan Szabo" <sszabo@megazone.bigpanda.com> To: "Oliver Fromme" <olli@lurza.secnetix.de> Cc: "Sean Davis" <sdavis2@mail.nih.gov>; "Michael Glaesemann" <grzm@myrealbox.com>; <pgsql-novice@postgresql.org>; "A Gilmore" <agilmore@shaw.ca> Sent: Monday, August 09, 2004 10:22 AM Subject: Re: [NOVICE] Design question regarding arrays > On Mon, 9 Aug 2004, Oliver Fromme wrote: > > > Sean Davis wrote: > > > I'm a novice, too, but why not look up the value in the sequence and > > > increment it before doing any inserting? If you are doing all the inserts > > > at the same time (more or less), then you can just use this value in the > > > insert into both tables. Put all of this inside one transaction. > > > > That won't work if another process is trying to do the same > > thing at the same time. They both get the same value when > > they look it up, so there will be an ID collision. > > It should be safe if you're using a sequence and using nextval() to get > the value that you are using in both insert statements. A second session > will not get the same value you just got from nextval barring a cycling > sequence generator that wraps back to the same number between the nextval > and insert. >