Re: Use arrays or not? - Mailing list pgsql-sql

From Josh Berkus
Subject Re: Use arrays or not?
Date
Msg-id 200404291023.58712.josh@agliodbs.com
Whole thread Raw
Responses Re: Use arrays or not?
List pgsql-sql
Roelant,

Yours is not a performance question, so I'm crossing it over to SQL for advice 
on database design.

> I am building an application using postgresql to store XML-records. There
> is a debate within the group of developers about the best way to store our
> data. I hope you can help us make a decision.
>
> The data consists of XML-records, with a lot of XML-fields. I want to store
> the XML as it is, so taking the information from the XML-records and then
> storing it in a different-from-XML-format is not an option.
>
> Each XML-record describes data about one book. If an update of bookdata
> comes, the XML itself is not changed, but a new XML-record is stored with
> the updated data. Via a complex scheme of combining a base record and its
> updates, the final dataset is produced that is used in the application.
>
> There are different XML-formats that need to be combined. Right now, we can
> handle three different XML-formats, each with its own structure (but all
> describing book-data).
>
> Searching is done via a simple table lookup on three different fields:
> title, author and subject. The data for these fields is extracted from the
> database. Each book has a unique identifier (EAN13, derivative of ISBN).
>
> Here is one way to organize the database:
> table title:
> TITLE | EAN13, indexing on TITLE
>
> table author:
> AUTHOR | EAN13, indexing on AUTHOR
>
> table subject:
> SUBJECT | EAN13, indexing on SUBJECT.

This is a *very* strange way of setting up your database.  Are you new to 
Relational Databases and SQL?   If so, I'd recommend starting with a book on 
relational database design.

Either that, or you're a victim of UML design.   

If only one author, title and subject are allowed per book, you should have:

table booksEAN13 | TITLE  | AUTHOR | SUBJECT

> Finally:
> table record:
> EAN13 | ARRAY OF XML-records.
>
> It's the last table that I am most curious (and worried) about, the
> question being mainly what the optimal way of structuring that table is.
> Option 1 is the given option: adding/deleting an XML-record for the same
> book requires adding/deleting it to/from the array of XML-records.
>
> Option 2 would be something like this:
> EAN13 | XML-record
> where, if a book has several records describing it, there are multiple
> entries of the EAN13|XML-record - pair. Adding an XML-record for the same
> book, requires adding a new entry to the table as a whole.

In my mind, there is no question that this is the best way to do things.  It 
is a normalized data structure, as opposed to the arrays, which are now.

>
> So, option 1-tables look like this:
> EAN13 | ARRAY OF XML-records
> 0001 | {<XML1>...</XML1>, <XML2>...</XML2>, ...}
> 0002 | {<XML1>...</XML1>, <XML2>...</XML2>, ...}
>
> Option-2 tables look like this:
> EAN13 | ARRAY OF XML-records
> 0001 | <XML1>...</XML1>
> 0001 | <XML2>...</XML2>
> 0002 | <XML1>...</XML1>
> 0002 | <XML2>...</XML2>
>
> We can't decide which one is best. These are some issues we can think of:
>
> Indexing: For option 1, the EAN13-index remains unique, even if you have
> multiple XML-records; for option 2 it does not, since multiple XML-records
> are stored as multiple tuples. On the other hand, an additional internal
> index can be used to link the several tuples of option 2 to the information
> in the `lookup'-tables (author, title, keyword). Does any of these two
> options increase query efficiency, ie. speed?
>
> Database growth: On average, the information about a book is updated three
> times per year. In option 1, this means that the length of the table does
> not increase, but the width does. If we choose option 2, if we have three
> updates per book each year, the length of the table triples, but the width
> does not. What is more costly to store for postgres, long arrays or long
> tables?
>
> Integrity: Option 1 means that our software needs to keep track of all the
> bookkeeping for arrays, since such support is quite rudimentary in
> postgres. For example, it is hard to take out a record from the middle of
> an array. Also, a multidimensional array, which contains for each record
> the record itself and its type, is even harder to maintain. Option 2 has a
> simpler datatype, so integrity can be easier inforced using the standard
> postgres-machinery of variable-types etc.
>
> Arrays are non-standard SQL, and I hear that PHP-support for postgres &
> arrays is rudimentary. So that might be an argument to avoid using them,
> and go for option 2. From the standpoint of performance (or wisdom), can
> you help me decide what I should choose? Or is there maybe an even better
> way to structure my data?
>
> Thanks for any contribution!
>
> Roelant.


-- 
Josh Berkus
Aglio Database Solutions
San Francisco


pgsql-sql by date:

Previous
From: Rob
Date:
Subject: Re: Equivalant of SQL Server's Nchar and NVARCHAR
Next
From: Marco Lazzeri
Date:
Subject: Check a value in array