Thread: Putting many related fields as an array

From: Ow Mun Heng
Hi,

I'm currently building some aggregate tables for some data. These
tables will be used for slice/dice analysis, and we want to be able to
play with/manipulate the data so that I can get means and standard
deviations.

E.g., for each original column, such as population_in_town, I derive:

- mean # of people in each town
- stddev # of people in each town (the stddev calculation already needs
  2 extra columns, for the sum of squares and the count)
- count of people
- count of rows where # of people is < 100 (to get a percentage of the
  population)
- count of rows where # of people is < 500

Hence, I'm seeing a 1:5 column growth here if I store them as
separate columns.

eg:
| sum of count | sum_of_count_squared | qty | qty < 100 | qty < 500 |


I'm thinking of lumping them into 1 column via an array instead of into
5 different columns. Not sure how to go about this, hence the email to
the list.

something like {244,455,1234,43,23}

queries can then be done like:

sum_of_count / qty = mean
sqrt( (qty * sum_of_count_squared - sum_of_count^2) / (qty * (qty-1)) ) = stddev
qty_lt_100 / qty = % < 100
qty_lt_500 / qty = % < 500
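
A minimal sketch of how this could look, with hypothetical table and
column names (PostgreSQL arrays are 1-based, so the five slots map to
the five derived values in the order above):

  -- stats[1] = sum_of_count, stats[2] = sum_of_count_squared,
  -- stats[3] = qty, stats[4] = qty_lt_100, stats[5] = qty_lt_500
  CREATE TABLE town_stats (
      town  text PRIMARY KEY,
      stats numeric[]
  );

  INSERT INTO town_stats VALUES ('example', '{244,455,1234,43,23}');

  SELECT town,
         stats[1] / stats[3]                    AS mean,
         sqrt( (stats[3] * stats[2] - stats[1]^2)
               / (stats[3] * (stats[3] - 1)) )  AS stddev,
         stats[4] / stats[3]                    AS pct_lt_100,
         stats[5] / stats[3]                    AS pct_lt_500
  FROM town_stats;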


Then there's also the issue of speed/responsiveness in querying it this way.


Any help would be appreciated.

Re: Putting many related fields as an array

From: Sam Mason
On Tue, May 12, 2009 at 01:23:14PM +0800, Ow Mun Heng wrote:
> | sum of count | sum_of_count_squared | qty | qty < 100 | qty < 500 |
>
>
> I'm thinking of lumping them into 1 column via an array instead of into
> 5 different columns. Not sure how to go about this, hence the email to
> the list.

The normal array constructor should work:

  SELECT ARRAY[MIN(v),MAX(v),AVG(v),STDDEV(v)]
  FROM (VALUES (1),(3),(4)) x(v);

Not sure why this is better than using separate columns though.  Maybe a
new datatype and a custom aggregate would be easier to work with?
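
For what it's worth, a minimal sketch of the composite-type idea (all
names here are hypothetical):

  -- One composite value carries the five derived numbers together, so
  -- the summary table needs one column per source column, not five.
  CREATE TYPE pop_stats AS (
      sum_count    numeric,
      sum_count_sq numeric,
      qty          bigint,
      qty_lt_100   bigint,
      qty_lt_500   bigint
  );

  -- Build one value per group; fields read back out as (s).qty etc.
  SELECT ROW( SUM(v),
              SUM(v * v),
              COUNT(v),
              SUM(CASE WHEN v < 100 THEN 1 ELSE 0 END),
              SUM(CASE WHEN v < 500 THEN 1 ELSE 0 END) )::pop_stats
  FROM (VALUES (1),(3),(4)) x(v);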

--
  Sam  http://samason.me.uk/

Re: Putting many related fields as an array

From: Ow Mun Heng
On Tue, May 12, 2009 at 01:23:14PM +0800, Ow Mun Heng wrote:
>> | sum of count | sum_of_count_squared | qty | qty < 100 | qty < 500 |
>>
>>
>> I'm thinking of lumping them into 1 column via an array instead of into
>> 5 different columns. Not sure how to go about this, hence the email to
>> the list.

>The normal array constructor should work:
>
>  SELECT ARRAY[MIN(v),MAX(v),AVG(v),STDDEV(v)]
>  FROM (VALUES (1),(3),(4)) x(v);
>
>Not sure why this is better than using separate columns though.  Maybe a
>new datatype and a custom aggregate would be easier to work with?

The issue here is the # of columns needed to populate the table.

The table I'm summarizing has between 50 and 100+ columns; if the
1:5 growth is used as a yardstick, the table will get awfully wide quickly.

I need to know how to do it first, then test accordingly for performance and
corner cases.



Re: Putting many related fields as an array

From
Sam Mason
Date:
On Tue, May 12, 2009 at 08:06:25PM +0800, Ow Mun Heng wrote:
> On Tue, May 12, 2009 at 01:23:14PM +0800, Ow Mun Heng wrote:
> >Not sure why this is better than using separate columns though.  Maybe a
> >new datatype and a custom aggregate would be easier to work with?
>
> The issue here is the # of columns needed to populate the table.
>
> The table I'm summarizing has between 50 and 100+ columns; if the
> 1:5 growth is used as a yardstick, the table will get awfully wide quickly.
>
> I need to know how to do it first, then test accordingly for performance and
> corner cases.

Yes, those are going to be pretty wide tables!  Maybe if you can make
the source tables a bit "narrower" it will help; PG has to read
entire rows from the table, so if your queries only touch a few
columns it will need a lot more disk bandwidth to fetch a given
number of rows from a wide table.
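
A minimal sketch of the narrower-tables idea, with hypothetical names:
split a wide table vertically on a shared key, so a query that touches
one group of columns only reads that table.

  CREATE TABLE town_summary_base (
      town_id integer PRIMARY KEY,
      qty     bigint
  );

  -- One narrow table per group of related derived columns:
  CREATE TABLE town_summary_population (
      town_id      integer PRIMARY KEY
                   REFERENCES town_summary_base,
      sum_count    numeric,
      sum_count_sq numeric,
      qty_lt_100   bigint,
      qty_lt_500   bigint
  );

  -- Queries needing only population stats never touch the other
  -- narrow tables; those needing more JOIN on town_id.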

--
  Sam  http://samason.me.uk/

Re: Putting many related fields as an array

From: Andrew Gould
On Tue, May 12, 2009 at 7:06 AM, Ow Mun Heng <ow.mun.heng@wdc.com> wrote:
> The issue here is the # of columns needed to populate the table.
>
> The table I'm summarizing has between 50 and 100+ columns; if the
> 1:5 growth is used as a yardstick, the table will get awfully wide
> quickly.


I apologize for coming into this conversation late. I used to do
analysis of a public-use data flat file that had one row per patient
and up to 24 diagnosis codes, each in a different column. Is this
analogous to your situation? I found it was worth the effort to
convert the flat file into a relational data model where the patients'
diagnosis codes were in one column in a separate table. This model
also makes more complex analysis easier.

Since there were several types of fields that needed to be combined
into their own tables, I found it took less time to convert the flat
file to the relational model using a script prior to importing the
data into the database server. A Python script would read the
original file and create 5 clean, tab-delimited files that were ready
to be imported.
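
A minimal sketch of that normalization in SQL, with hypothetical
names: the 24 diagnosis columns become rows in a separate table keyed
by patient.

  CREATE TABLE patient (
      patient_id integer PRIMARY KEY
      -- ... other per-patient fields
  );

  CREATE TABLE patient_diagnosis (
      patient_id     integer REFERENCES patient,
      diagnosis_code text NOT NULL,
      PRIMARY KEY (patient_id, diagnosis_code)
  );

  -- Counting patients per diagnosis is then a plain GROUP BY instead
  -- of a UNION across 24 columns:
  SELECT diagnosis_code, COUNT(*) AS patients
  FROM patient_diagnosis
  GROUP BY diagnosis_code;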

I hope this helps.

Andrew