Re: Use arrays or not? - Mailing list pgsql-sql
From | Josh Berkus |
---|---|
Subject | Re: Use arrays or not? |
Date | |
Msg-id | 200404291023.58712.josh@agliodbs.com Whole thread Raw |
Responses |
Re: Use arrays or not?
|
List | pgsql-sql |
Roelant, Yours is not a performance question, so I'm crossing it over to SQL for advice on database design. > I am building an application using postgresql to store XML-records. There > is a debate within the group of developers about the best way to store our > data. I hope you can help us make a decision. > > The data consists of XML-records, with a lot of XML-fields. I want to store > the XML as it is, so taking the information from the XML-records and then > storing it in a different-from-XML-format is not an option. > > Each XML-record describes data about one book. If an update of bookdata > comes, the XML itself is not changed, but a new XML-record is stored with > the updated data. Via a complex scheme of combining a base record and its > updates, the final dataset is produced that is used in the application. > > There are different XML-formats that need to be combined. Right now, we can > handle three different XML-formats, each with its own structure (but all > describing book-data). > > Searching is done via a simple table lookup on three different fields: > title, author and subject. The data for these fields is extracted from the > database. Each book has a unique identifier (EAN13, derivative of ISBN). > > Here is one way to organize the database: > table title: > TITLE | EAN13, indexing on TITLE > > table author: > AUTHOR | EAN13, indexing on AUTHOR > > table subject: > SUBJECT | EAN13, indexing on SUBJECT. This is a *very* strange way of setting up your database. Are you new to Relational Databases and SQL? If so, I'd recommend starting with a book on relational database design. Either that, or you're a victim of UML design. If only one author, title and subject are allowed per book, you should have: table booksEAN13 | TITLE | AUTHOR | SUBJECT > Finally: > table record: > EAN13 | ARRAY OF XML-records. > > It's the last table that I am most curious (and worried) about, the > question being mainly what the optimal way of structuring that table is. > Option 1 is the given option: adding/deleting an XML-record for the same > book requires adding/deleting it to/from the array of XML-records. > > Option 2 would be something like this: > EAN13 | XML-record > where, if a book has several records describing it, there are multiple > entries of the EAN13|XML-record - pair. Adding an XML-record for the same > book, requires adding a new entry to the table as a whole. In my mind, there is no question that this is the best way to do things. It is a normalized data structure, as opposed to the arrays, which are now. > > So, option 1-tables look like this: > EAN13 | ARRAY OF XML-records > 0001 | {<XML1>...</XML1>, <XML2>...</XML2>, ...} > 0002 | {<XML1>...</XML1>, <XML2>...</XML2>, ...} > > Option-2 tables look like this: > EAN13 | ARRAY OF XML-records > 0001 | <XML1>...</XML1> > 0001 | <XML2>...</XML2> > 0002 | <XML1>...</XML1> > 0002 | <XML2>...</XML2> > > We can't decide which one is best. These are some issues we can think of: > > Indexing: For option 1, the EAN13-index remains unique, even if you have > multiple XML-records; for option 2 it does not, since multiple XML-records > are stored as multiple tuples. On the other hand, an additional internal > index can be used to link the several tuples of option 2 to the information > in the `lookup'-tables (author, title, keyword). Does any of these two > options increase query efficiency, ie. speed? > > Database growth: On average, the information about a book is updated three > times per year. In option 1, this means that the length of the table does > not increase, but the width does. If we choose option 2, if we have three > updates per book each year, the length of the table triples, but the width > does not. What is more costly to store for postgres, long arrays or long > tables? > > Integrity: Option 1 means that our software needs to keep track of all the > bookkeeping for arrays, since such support is quite rudimentary in > postgres. For example, it is hard to take out a record from the middle of > an array. Also, a multidimensional array, which contains for each record > the record itself and its type, is even harder to maintain. Option 2 has a > simpler datatype, so integrity can be easier inforced using the standard > postgres-machinery of variable-types etc. > > Arrays are non-standard SQL, and I hear that PHP-support for postgres & > arrays is rudimentary. So that might be an argument to avoid using them, > and go for option 2. From the standpoint of performance (or wisdom), can > you help me decide what I should choose? Or is there maybe an even better > way to structure my data? > > Thanks for any contribution! > > Roelant. -- Josh Berkus Aglio Database Solutions San Francisco