Thread: Re: Use arrays or not?

Re: Use arrays or not?

From: Josh Berkus
Date: Apr 29, 2004
Roelant,

Yours is not a performance question, so I'm crossing it over to SQL for advice 
on database design.

> I am building an application using postgresql to store XML-records. There
> is a debate within the group of developers about the best way to store our
> data. I hope you can help us make a decision.
>
> The data consists of XML-records, with a lot of XML-fields. I want to store
> the XML as it is, so taking the information from the XML-records and then
> storing it in a different-from-XML-format is not an option.
>
> Each XML-record describes data about one book. If an update of bookdata
> comes, the XML itself is not changed, but a new XML-record is stored with
> the updated data. Via a complex scheme of combining a base record and its
> updates, the final dataset is produced that is used in the application.
>
> There are different XML-formats that need to be combined. Right now, we can
> handle three different XML-formats, each with its own structure (but all
> describing book-data).
>
> Searching is done via a simple table lookup on three different fields:
> title, author and subject. The data for these fields is extracted from the
> database. Each book has a unique identifier (EAN13, derivative of ISBN).
>
> Here is one way to organize the database:
> table title:
> TITLE | EAN13, indexing on TITLE
>
> table author:
> AUTHOR | EAN13, indexing on AUTHOR
>
> table subject:
> SUBJECT | EAN13, indexing on SUBJECT.

This is a *very* strange way of setting up your database.  Are you new to 
Relational Databases and SQL?   If so, I'd recommend starting with a book on 
relational database design.

Either that, or you're a victim of UML design.   

If only one author, title and subject are allowed per book, you should have:

table books
    EAN13 | TITLE  | AUTHOR | SUBJECT
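
For example, that table could be declared roughly as follows (a sketch only;
the column types are guesses, e.g. varchar(13) for the EAN13):

CREATE TABLE books (
    ean13    varchar(13) PRIMARY KEY,   -- one row per book
    title    text,
    author   text,
    subject  text
);
-- indexes on the fields that are actually searched
CREATE INDEX books_title_idx   ON books (title);
CREATE INDEX books_author_idx  ON books (author);
CREATE INDEX books_subject_idx ON books (subject);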

> Finally:
> table record:
> EAN13 | ARRAY OF XML-records.
>
> It's the last table that I am most curious (and worried) about, the
> question being mainly what the optimal way of structuring that table is.
> Option 1 is the given option: adding/deleting an XML-record for the same
> book requires adding/deleting it to/from the array of XML-records.
>
> Option 2 would be something like this:
> EAN13 | XML-record
> where, if a book has several records describing it, there are multiple
> entries of the EAN13|XML-record - pair. Adding an XML-record for the same
> book, requires adding a new entry to the table as a whole.

In my mind, there is no question that this is the best way to do things.  It 
is a normalized data structure, as opposed to the arrays, which are not.
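
Concretely, an option-2 table could look something like this (again a sketch;
I'm assuming the XML is simply kept verbatim in a text column):

CREATE TABLE record (
    ean13  varchar(13),   -- several rows per book are allowed
    xml    text           -- one XML-record, stored as-is
);
-- non-unique index, since one book can have many records
CREATE INDEX record_ean13_idx ON record (ean13);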

>
> So, option 1-tables look like this:
> EAN13 | ARRAY OF XML-records
> 0001 | {<XML1>...</XML1>, <XML2>...</XML2>, ...}
> 0002 | {<XML1>...</XML1>, <XML2>...</XML2>, ...}
>
> Option-2 tables look like this:
> EAN13 | XML-record
> 0001 | <XML1>...</XML1>
> 0001 | <XML2>...</XML2>
> 0002 | <XML1>...</XML1>
> 0002 | <XML2>...</XML2>
>
> We can't decide which one is best. These are some issues we can think of:
>
> Indexing: For option 1, the EAN13-index remains unique, even if you have
> multiple XML-records; for option 2 it does not, since multiple XML-records
> are stored as multiple tuples. On the other hand, an additional internal
> index can be used to link the several tuples of option 2 to the information
> in the `lookup'-tables (author, title, keyword). Does any of these two
> options increase query efficiency, ie. speed?
>
> Database growth: On average, the information about a book is updated three
> times per year. In option 1, this means that the length of the table does
> not increase, but the width does. If we choose option 2, if we have three
> updates per book each year, the length of the table triples, but the width
> does not. What is more costly to store for postgres, long arrays or long
> tables?
>
> Integrity: Option 1 means that our software needs to keep track of all the
> bookkeeping for arrays, since such support is quite rudimentary in
> postgres. For example, it is hard to take out a record from the middle of
> an array. Also, a multidimensional array, which contains for each record
> the record itself and its type, is even harder to maintain. Option 2 has a
> simpler datatype, so integrity can be more easily enforced using the standard
> postgres-machinery of variable-types etc.
>
> Arrays are non-standard SQL, and I hear that PHP-support for postgres &
> arrays is rudimentary. So that might be an argument to avoid using them,
> and go for option 2. From the standpoint of performance (or wisdom), can
> you help me decide what I should choose? Or is there maybe an even better
> way to structure my data?
>
> Thanks for any contribution!
>
> Roelant.


-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Use arrays or not?

From: Achilleus Mantzios
Date:
Just some comments from my experience:

PgSQL arrays are mainly for modeling arrays, not relations.
Arrays can be very useful if they are never going to be treated
as relations and if performance is an issue (e.g. dynamic graphs, on-the-fly
statistics, etc.).

Also (besides other solutions), int[] arrays are a handy way of implementing
tree structures in a database.
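
For example, a materialized-path tree can be kept in an int[] column like this
(just an illustration of the idea, with made-up table and column names):

-- each node stores the ids on the path from the root down to itself
CREATE TABLE category (
    id    int PRIMARY KEY,
    path  int[] NOT NULL          -- e.g. {1,3,7}
);

-- all descendants of the node whose path is {1,3}
SELECT id, path
FROM category
WHERE path[1:2] = ARRAY[1,3]
  AND array_length(path, 1) > 2;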

For your case, as you have set it up, the XML arrays will make your life difficult.

-- 
-Achilleus



Re: Use arrays or not?

From: Roelant Ossewaarde
Date:
Hi,

Thanks for your fast response. But I'm afraid I stated some things unclearly.

> >>The data consists of XML-records, with a lot of XML-fields. I want to store
> >>the XML as it is, so taking the information from the XML-records and then
> >>storing it in a different-from-XML-format is not an option.
> 
> Actually, your data consists of data. The XML is just scaffolding to 
> enable you to locate and understand your data. Since you are storing it 
> in a relational database, that will use relational scaffolding for its 
> organisation instead. At least partially, you will *have* to parse the 
> values out and organise them differently.
I do, and I know. But I want to store the XML-records as they are. So the
question is: given an author, a title and an XML-record related to them, how do
I store that? I have good reasons to store the XML-records as they are, without
parsing them any further.

> >>Each XML-record describes data about one book. If an update of bookdata
> >>comes, the XML itself is not changed, but a new XML-record is stored with
> >>the updated data. Via a complex scheme of combining a base record and its
> >>updates, the final dataset is produced that is used in the application.
> >>
> >>Searching is done via a simple table lookup on three different fields:
> >>title, author and subject. The data for these fields is extracted from the
> >>database. Each book has a unique identifier (EAN13, derivative of ISBN).
> >>
> >>Here is one way to organize the database:
> >>table title:
> >>TITLE | EAN13, indexing on TITLE
> >>
> >>table author:
> >>AUTHOR | EAN13, indexing on AUTHOR
> >>
> >>table subject:
> >>SUBJECT | EAN13, indexing on SUBJECT.
> >
> >
> >This is a *very* strange way of setting up your database.  Are you new to 
> >Relational Databases and SQL?   If so, I'd recommend starting with a book 
> >on relational database design.
> I agree with Josh - think about a book.

Thank you for the recommendations. But the above is just background
information; it will not be stored as such. The important question for me is
whether to use arrays or not. By "indexing" in the above examples I do not mean
an actual postgres index; I mean that those are the fields that are used in
searching. One never searches on an EAN13-number, only on author, title and
subject. And one never, by the way, searches for a specific XML-record; only
the complete set of stored XML-records per book is ever retrieved.
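
So the access pattern is roughly the following (a sketch using the option-2
layout; table and column names are only illustrative):

-- fetch all stored XML-records for the books by a given author
SELECT r.xml
FROM author a
JOIN record r ON r.ean13 = a.ean13
WHERE a.author = 'some author';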

> 
> >If only one author, title and subject are allowed per book, you should 
> >have:
> >
> >table books
> >    EAN13 | TITLE  | AUTHOR | SUBJECT
> 
> If, on the other hand you can have multiple authors (likely) you'll want 
>  something like:
> 
> CREATE TABLE author (
>   ean13        varchar(13), -- Guessing ean13 format
>   author_num   int4,
>   author_name  text,
>   PRIMARY KEY (ean13, author_num)
> );
> 
> Then you can have rows like:
> 
> ('my-ean-number-here', 1, 'Aaron Aardvark')
> ('my-ean-number-here', 2, 'Betty Bee')
> etc.

Yes, I have such a thing. There can be multiple titles, multiple authors and
multiple keywords per book. 

> 
> 
> >>Finally:
> >>table record:
> >>EAN13 | ARRAY OF XML-records.
> >>
> >>It's the last table that I am most curious (and worried) about, the
> >>question being mainly what the optimal way of structuring that table is.
> >>Option 1 is the given option: adding/deleting an XML-record for the same
> >>book requires adding/deleting it to/from the array of XML-records.
> >>
> >>Option 2 would be something like this:
> >>EAN13 | XML-record
> >>where, if a book has several records describing it, there are multiple
> >>entries of the EAN13|XML-record - pair. Adding an XML-record for the same
> >>book, requires adding a new entry to the table as a whole.
> >
> >
> >In my mind, there is no question that this is the best way to do things.  
> >It is a normalized data structure, as opposed to the arrays, which are now.
> 
> Although your option 2 doesn't go quite far enough. You'll also want to 
> know what order these come in. So, assuming you can't have two updates 
> at the same time:
> 
> CREATE TABLE book_history (
>   ean13  varchar(13), -- Guessing ean13 format
>   ts     timestamp with time zone,
>   xml    text,
>   PRIMARY KEY (ean13, ts)
> );

The order is not important here; the interpretation of the XML-records is done
by an external module. The order is determined by the content of the
XML-records, because they can come from different sources and can be combined
in different ways, depending on the application processing the XML-records.
Order is not determined at the moment the records are stored, but at the
moment they are interpreted.

> As for your other concerns:
> >>Indexing:
> >>Database growth:
> >>Integrity:
> Just worry about the integrity - if you keep the design simple, 
> PostgreSQL will manage quite large growth on quite small hardware.

What would be a situation in which one should use arrays then?


> Now... I don't think you want to do what you're trying to do. Don't take 
> this personally, but unless you're extremely pushed for time and 
> resources this is almost certainly a bad design choice.

> 1. Wrong tool for the job
> Basically you're taking a relational database and treating it like a 
> filesystem. All you need for what you're doing is a directory-tree to 
> represent the ean13 structure and one file per xml-record. Index the 
> author/title fields with dbm/SQLite. You could write the whole thing in 
> a day - simple, efficient, leverages existing unix tools.

That is correct. I think nothing would beat a dbm-style solution in terms of
performance, and I'm still considering using that. The added value of a system
like postgresql is the client/server interface and the near-universal support
in programming languages, and not least the familiarity of most people with
mysql/postgresql compared to dbm.

> 2. Wrong job for the tool
> How do I find out which publisher produced the most books in 2003 (I'm 
> assuming this is in your XML somewhere)? Which book is available in the 
> most languages?
> How many updates were applied last month? How many different books did 
> they affect? Why do the numbers not match - which books had multiple 
> changes?
> The first set of questions need you to write code, the second set don't. 
> Why? Because the second set rely on information stored simply and 
> explicitly in the database (book_history as it happens).

We know what set of questions will be asked: we only need to access the data
through author, title and subject keywords. That may be another reason to
choose a non-relational model.

> 3. The medium isn't the message
> You don't want to open up your XML records to store them in the 
> database, but I assume you have to in your PHP code, or you can't 
> process individual values. As it stands you're having to extract certain 
> information when an XML update arrives anyway. If the title of a book is 
> amended, then you'll need to remember to update the book_title table. If 
> it's simple to extract more, why not do so? If some of it is fiddly to 
> represent in an SQL database then at least extract everything that is 
> convenient.

There are good reasons for that. The XML records do not necessarily contain the
input to the author/title/subject tables (or columns). I just want to store the
XML and interpret it later. (Really, it makes sense!)

> Oh - and I would probably store a "current snapshot" of the book's 
> record separately too. Saves your application having to recalculate it 
> every time it's needed.

I *want* to recalculate it every time it's needed, because the different
XML-records can be combined in several ways, depending on the application
that is accessing the database client.

So, let me rephrase my questions:
1. When and why would anyone use arrays?
2. When designing the database, is it really true that there is no performance
difference between a table whose number of tuples grows by a factor of, say,
10, and a table whose tuples grow in size by a factor of, say, 10?

Thanks, 

Roelant.