Re: Use arrays or not? - Mailing list pgsql-sql
From | Achilleus Mantzios |
---|---|
Subject | Re: Use arrays or not? |
Date | |
Msg-id | Pine.LNX.4.44.0404300843040.32478-100000@matrix.gatewaynet.com Whole thread Raw |
In response to | Re: Use arrays or not? (Josh Berkus <josh@agliodbs.com>) |
List | pgsql-sql |
Just some comments from my experience: PgSQL Arrays are mainly for modeling arrays not relations. Arrays can be very useful if they are not ever gonna be treated as relations and if performance is an issue (e.g. dynamic graphs, on the fly statistics etc..). Also (besides other solutions) int[] arrays is a handy way of implementing tree structures in a database. For your case as you set it the XML arrays will make your life difficult. O kyrios Josh Berkus egrapse stis Apr 29, 2004 : > Roelant, > > Yours is not a performance question, so I'm crossing it over to SQL for advice > on database design. > > > I am building an application using postgresql to store XML-records. There > > is a debate within the group of developers about the best way to store our > > data. I hope you can help us make a decision. > > > > The data consists of XML-records, with a lot of XML-fields. I want to store > > the XML as it is, so taking the information from the XML-records and then > > storing it in a different-from-XML-format is not an option. > > > > Each XML-record describes data about one book. If an update of bookdata > > comes, the XML itself is not changed, but a new XML-record is stored with > > the updated data. Via a complex scheme of combining a base record and its > > updates, the final dataset is produced that is used in the application. > > > > There are different XML-formats that need to be combined. Right now, we can > > handle three different XML-formats, each with its own structure (but all > > describing book-data). > > > > Searching is done via a simple table lookup on three different fields: > > title, author and subject. The data for these fields is extracted from the > > database. Each book has a unique identifier (EAN13, derivative of ISBN). > > > > Here is one way to organize the database: > > table title: > > TITLE | EAN13, indexing on TITLE > > > > table author: > > AUTHOR | EAN13, indexing on AUTHOR > > > > table subject: > > SUBJECT | EAN13, indexing on SUBJECT. > > This is a *very* strange way of setting up your database. Are you new to > Relational Databases and SQL? If so, I'd recommend starting with a book on > relational database design. > > Either that, or you're a victim of UML design. > > If only one author, title and subject are allowed per book, you should have: > > table books > EAN13 | TITLE | AUTHOR | SUBJECT > > > Finally: > > table record: > > EAN13 | ARRAY OF XML-records. > > > > It's the last table that I am most curious (and worried) about, the > > question being mainly what the optimal way of structuring that table is. > > Option 1 is the given option: adding/deleting an XML-record for the same > > book requires adding/deleting it to/from the array of XML-records. > > > > Option 2 would be something like this: > > EAN13 | XML-record > > where, if a book has several records describing it, there are multiple > > entries of the EAN13|XML-record - pair. Adding an XML-record for the same > > book, requires adding a new entry to the table as a whole. > > In my mind, there is no question that this is the best way to do things. It > is a normalized data structure, as opposed to the arrays, which are now. > > > > > So, option 1-tables look like this: > > EAN13 | ARRAY OF XML-records > > 0001 | {<XML1>...</XML1>, <XML2>...</XML2>, ...} > > 0002 | {<XML1>...</XML1>, <XML2>...</XML2>, ...} > > > > Option-2 tables look like this: > > EAN13 | ARRAY OF XML-records > > 0001 | <XML1>...</XML1> > > 0001 | <XML2>...</XML2> > > 0002 | <XML1>...</XML1> > > 0002 | <XML2>...</XML2> > > > > We can't decide which one is best. These are some issues we can think of: > > > > Indexing: For option 1, the EAN13-index remains unique, even if you have > > multiple XML-records; for option 2 it does not, since multiple XML-records > > are stored as multiple tuples. On the other hand, an additional internal > > index can be used to link the several tuples of option 2 to the information > > in the `lookup'-tables (author, title, keyword). Does any of these two > > options increase query efficiency, ie. speed? > > > > Database growth: On average, the information about a book is updated three > > times per year. In option 1, this means that the length of the table does > > not increase, but the width does. If we choose option 2, if we have three > > updates per book each year, the length of the table triples, but the width > > does not. What is more costly to store for postgres, long arrays or long > > tables? > > > > Integrity: Option 1 means that our software needs to keep track of all the > > bookkeeping for arrays, since such support is quite rudimentary in > > postgres. For example, it is hard to take out a record from the middle of > > an array. Also, a multidimensional array, which contains for each record > > the record itself and its type, is even harder to maintain. Option 2 has a > > simpler datatype, so integrity can be easier inforced using the standard > > postgres-machinery of variable-types etc. > > > > Arrays are non-standard SQL, and I hear that PHP-support for postgres & > > arrays is rudimentary. So that might be an argument to avoid using them, > > and go for option 2. From the standpoint of performance (or wisdom), can > > you help me decide what I should choose? Or is there maybe an even better > > way to structure my data? > > > > Thanks for any contribution! > > > > Roelant. > > > -- -Achilleus