Thread: Column as arrays.. more efficient than columns?

Column as arrays.. more efficient than columns?

From
Ow Mun Heng
Date:
Table is like

create table foo (
number int,
subset int,
value  int
)

select * from foo;
number | subset | value
1        1        1
1        2        2
1        3        10
1        4        3

current query is like

select number,
avg(case when subset = 1 then value else null end) as v1,
avg(case when subset = 2 then value else null end) as v2,
avg(case when subset = 3 then value else null end) as v3,
avg(case when subset = 4 then value else null end) as v4
from foo
group by number

results
------
number | v1 | v2 | v3 | v4
1         1   2    10   3


I'm thinking of denormalising it a bit and putting it either into an array or
just creating a new table with the end result like the above.

I just want to know which is more efficient. Users can just do a

select * from new_foo where number = 'X';
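
For concreteness, the two shapes I have in mind would be something like
this (a sketch only; the names are illustrative):

-- option 1: one wide row per number
create table foo_wide (
number int primary key,
v1 int,
v2 int,
v3 int,
v4 int
);

-- option 2: fold the per-subset values into an array
create table foo_arr (
number int primary key,
vals   int[]    -- vals[1] .. vals[4]
);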


Thanks.

Re: Column as arrays.. more efficient than columns?

From
Ow Mun Heng
Date:
Nobody has any comments on this??

On Thu, 2007-09-06 at 12:22 +0800, Ow Mun Heng wrote:
> Table is like
>
> create table foo (
> number int,
> subset int,
> value  int
> )
>
> select * from foo;
> number | subset | value
> 1        1        1
> 1        2        2
> 1        3        10
> 1        4        3
>
> current query is like
>
> select number,
> avg(case when subset = 1 then value else null end) as v1,
> avg(case when subset = 2 then value else null end) as v2,
> avg(case when subset = 3 then value else null end) as v3,
> avg(case when subset = 4 then value else null end) as v4
> from foo
> group by number
>
> results
> ------
> number | v1 | v2 | v3 | v4
> 1         1   2    10   4
>
>
> I'm thinking of denormalising it a bit and put it either as an array or
> just create a new table with the end result like the above.
>
> I just want to know which is more efficient. Users can just do a
>
> select * from new_foo where number = 'X';
>
>
> Thanks.
>

Re: Column as arrays.. more efficient than columns?

From
Michael Glaesemann
Date:
On Sep 6, 2007, at 19:37 , Ow Mun Heng wrote:

> Nobody has any comments on this??

Don't do it.

Michael Glaesemann
grzm seespotcode net



Re: Column as arrays.. more efficient than columns?

From
Ow Mun Heng
Date:
On Thu, 2007-09-06 at 19:52 -0500, Michael Glaesemann wrote:
> On Sep 6, 2007, at 19:37 , Ow Mun Heng wrote:
>
> > Nobody has any comments on this??
>
> Don't do it.

Don't do what?
Don't denormalise the table?
Don't put them into arrays?

Thing is, the end result always has to be in one row, and all the
columns are needed for data analysis.

So it's between a "select * from foo" vs. a "select avg(...), avg(...)
... group by ..." query.



Re: Column as arrays.. more efficient than columns?

From
"Joshua D. Drake"
Date:

Michael Glaesemann wrote:
>
> On Sep 6, 2007, at 19:37 , Ow Mun Heng wrote:
>
>> Nobody has any comments on this??
>
> Don't do it.

HAHAHAHAHAHAHA....

Joshua D. Drake

>
> Michael Glaesemann
> grzm seespotcode net




Re: Column as arrays.. more efficient than columns?

From
Michael Glaesemann
Date:
On Sep 6, 2007, at 19:58 , Ow Mun Heng wrote:

> Don't denormalise the table?

Yes. Don't denormalize the tables.

> don't put them into arrays?

Yes. Don't use arrays. Caveat: if the data is *naturally* an array
and you will not be doing any relational operations on individual
elements of the arrays, then it makes sense to use arrays. Treat
arrays as you would any other opaque type.
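
As a minimal sketch of that caveat (the table and column names here are
just illustrative): the array is only ever stored and fetched as a
whole, never filtered or joined on element by element.

create table spectrum (
sample_id int primary key,
readings  float8[]   -- treated as one opaque value
);

-- whole-array access only; no per-element predicates
select readings from spectrum where sample_id = 42;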

Michael Glaesemann
grzm seespotcode net



Re: Column as arrays.. more efficient than columns?

From
Ow Mun Heng
Date:
On Thu, 2007-09-06 at 20:20 -0500, Michael Glaesemann wrote:
> On Sep 6, 2007, at 19:58 , Ow Mun Heng wrote:
>
> > Don't denormalise the table?
>
> Yes. Don't denormalize the tables.

I would believe performance would be better with it being denormalised
(in this case).
>
> > don't put them into arrays?
>
> Yes. Don't use arrays. Caveat: if the data is *naturally* an array
> and you will not be doing any relational operations on individual
> elements of the arrays, then it makes sense to use arrays. Treat
> arrays as you would any other opaque type.

Data is naturally an array and will be used as an array in any case,
since there will not be queries where users select any one of the
values in that array; they always pull the whole array.

Data will be used in this form:

code   | v1 | v2 | v3 | v4
A         1    2   10   23
B         10   12  15   22
C         11   24  18   46
D         21   22  20   41

which will be imported into statistical software/excel for further
manipulation.

If I don't give it to them in the denormalised form, it'll take them an
additional 30 min or so to get it back into the form above.

And it'll make the queries more efficient too. With an index on code,
users can just do:

select * from foo where code = 'B';

By denormalising, I will also get the benefit of reducing the # of rows
by a factor of 20 (20 rows = 1 code).
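
In other words, something like this (a sketch only; column names are
illustrative):

create table foo_denorm (
code text,
v1 int,
v2 int,
v3 int,
v4 int
);

create index idx_foo_denorm_code on foo_denorm (code);

select * from foo_denorm where code = 'B';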


Re: Column as arrays.. more efficient than columns?

From
"Merlin Moncure"
Date:
On 9/6/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:
> Table is like
>
> create table foo (
> number int,
> subset int,
> value  int
> )
>
> select * from foo;
> number | subset | value
> 1        1        1
> 1        2        2
> 1        3        10
> 1        4        3
>
> current query is like
>
> select number,
> avg(case when subset = 1 then value else null end) as v1,
> avg(case when subset = 2 then value else null end) as v2,
> avg(case when subset = 3 then value else null end) as v3,
> avg(case when subset = 4 then value else null end) as v4
> from foo
> group by number

arrays are interesting and have some useful properties.  however, we
must first discuss the problems... first and foremost, if you need to
read any particular item off the array you must read the entire array
from disk, and you must write all items back to disk for writes.  also,
they cause some problems with constraints and other issues that come
up with de-normalization tactics.

however, if a particular piece of data is actually expressed as an array
of items (the polygon type comes to mind), then why not?

that said, let's look at a better way to express this query.  what
jumps out at me right away is:

select number, subset, avg(value) from foo group by subset;

does this give you the answer that you need?  If not we can proceed
and look at why arrays may or may not be appropriate (i suspect I am
not seeing the whole picture here).
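
For what it's worth, if the array form really is wanted, it can also be
built on the fly from the normalized table rather than stored; a sketch,
assuming subsets 1..4 and one averaged element per subset:

select f1.number,
       array(select avg(f2.value)
             from foo f2
             where f2.number = f1.number
             group by f2.subset
             order by f2.subset) as subset_avgs
from foo f1
group by f1.number;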

merlin

Re: Column as arrays.. more efficient than columns?

From
Michael Glaesemann
Date:
On Sep 6, 2007, at 20:46 , Ow Mun Heng wrote:

> On Thu, 2007-09-06 at 20:20 -0500, Michael Glaesemann wrote:
>> On Sep 6, 2007, at 19:58 , Ow Mun Heng wrote:
>>
>>> Don't denormalise the table?
>>
>> Yes. Don't denormalize the tables.
>
> I would believe performance would be better it being denormalised. (in
> this case)

I assume you've arrived at that conclusion because you have (a) shown
that the performance with a normalized schema does not meet your
needs; (b) benchmarked the normalized schema under production
conditions; (c) benchmarked the denormalized schema under production
conditions; and (d) shown that performance is improved in the
denormalized case. I'm interested to see the results of your
comparisons.

Regardless, it sounds like you've already made up your mind. Why ask
for comments?

Michael Glaesemann
grzm seespotcode net



Re: Column as arrays.. more efficient than columns?

From
Ow Mun Heng
Date:
On Thu, 2007-09-06 at 21:53 -0400, Merlin Moncure wrote:
> On 9/6/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:
> > Table is like
> >
> > create table foo (
> > number int,
> > subset int,
> > value  int
> > )
> >
> > select * from foo;
> > number | subset | value
> > 1        1        1
> > 1        2        2
> > 1        3        10
> > 1        4        3
> >
> > current query is like
> >
> > select number,
> > avg(case when subset = 1 then value else null end) as v1,
> > avg(case when subset = 2 then value else null end) as v2,
> > avg(case when subset = 3 then value else null end) as v3,
> > avg(case when subset = 4 then value else null end) as v4
> > from foo
> > group by number
>
> arrays are interesting and have some useful properties.  however, we
> must first discuss the problems... first and foremost, if you need to
> read any particular item off the array you must read the entire array
> from disk, and you must write all items back to disk for writes.  also,
> they cause some problems with constraints and other issues that come
> up with de-normalization tactics.

I see. Didn't know that.. Good to know.

> select number, subset, avg(value) from foo group by subset;
>
> does this give you the answer that you need?

No, it doesn't:

select * from foo order by subset;
 code | subset | value
------+--------+-------
 A    | 0      |    98
 A    | 1      |    20
 A    | 2      |    98
 A    | 3      |    98
 A    | 4      |    98


=> select code, subset, avg(value)  from foo  group by subset;
ERROR:  column "foo.code" must appear in the GROUP BY clause or be used
in an aggregate function

=> select code, subset, avg(value)  from foo  group by subset, code;
 code | subset |         avg
------+--------+---------------------
 A    | 3      | 98.0000000000000000
 A    | 1      | 20.0000000000000000
 A    | 4      | 98.0000000000000000
 A    | 0      | 98.0000000000000000
 A    | 2      | 98.0000000000000000


=> select code,
round(avg(case when subset = '0' then value else null end),0) as v0,
round(avg(case when subset = '1' then value else null end),0) as v1,
round(avg(case when subset = '2' then value else null end),0) as v2,
round(avg(case when subset = '3' then value else null end),0) as v3,
round(avg(case when subset = '4' then value else null end),0) as v4
from foo
group by code;
 code | v0 | v1 | v2 | v3 | v4
------+----+----+----+----+----
 A    | 98 | 20 | 98 | 98 | 98



Re: Column as arrays.. more efficient than columns?

From
Ow Mun Heng
Date:
On Thu, 2007-09-06 at 20:57 -0500, Michael Glaesemann wrote:
> On Sep 6, 2007, at 20:46 , Ow Mun Heng wrote:

> > I would believe performance would be better it being denormalised. (in
> > this case)
>
> I assume you've arrived at the conclusion because you have
> (a) shown
> that the performance with a normalized schema does not meet your
> needs;
> (b) benchmarked the normalized schema under production
> conditions;
> (c) benchmarked the denormalized schema under production
> conditions; and
> (d) shown that performance is improved in the
> denormalized case to arrive at that conclusion. I'm interested to see
> the results of your comparisons.

> Regardless, it sounds like you've already made up your mind. Why ask
> for comments?

You've assumed wrong. I've not arrived at any conclusion; I'm merely
exploring my options on which way would be the best to tread. I'm
asking the list because I'm new to PG, and after reading all those
articles on highscalability etc., the majority of them are using some
kind of denormalised tables.

Right now there are 8 million rows of data in this one table, growing
at a rapid rate of ~2 million/week. I can significantly reduce this
number down to 200K (I think) by denormalising it and shrink the table
size.

I would appreciate your guidance on this before I go knock my head on
the wall. :-)

Re: Column as arrays.. more efficient than columns?

From
Joe Conway
Date:
Ow Mun Heng wrote:
> => select code, subset, avg(value)  from foo  group by subset, code;
>  code | subset |         avg
> ------+--------+---------------------
>  A    | 3      | 98.0000000000000000
>  A    | 1      | 20.0000000000000000
>  A    | 4      | 98.0000000000000000
>  A    | 0      | 98.0000000000000000
>  A    | 2      | 98.0000000000000000
>

An alternative way to get the output below would be to feed your
aggregate query above to the crosstab() function in contrib/tablefunc.

Joe

> => select code, round(avg(case when subset = '0' then value else null
> end),0) as v0,
> round(avg(case when subset = '1' then value else null end),0) as v1,
> round(avg(case when subset = '2' then value else null end),0) as v2,
> round(avg(case when subset = '3' then value else null end),0) as v3,
> round(avg(case when subset = '4' then value else null end),0) as v4
> from foo
> group by code;
>  code | v0 | v1 | v2 | v3 | v4
> ------+----+----+----+----+----
>  A    | 98 | 20 | 98 | 98 | 98



Re: Column as arrays.. more efficient than columns?

From
Ow Mun Heng
Date:
On Thu, 2007-09-06 at 20:19 -0700, Joe Conway wrote:
> Ow Mun Heng wrote:

> > => select code, round(avg(case when subset = '0' then value else null
> > end),0) as v0,
> > round(avg(case when subset = '1' then value else null end),0) as v1,
> > round(avg(case when subset = '2' then value else null end),0) as v2,
> > round(avg(case when subset = '3' then value else null end),0) as v3,
> > round(avg(case when subset = '4' then value else null end),0) as v4
> > from foo
> > group by code;
> >  code | v0 | v1 | v2 | v3 | v4
> > ------+----+----+----+----+----
> >  A    | 98 | 20 | 98 | 98 | 98
>
> An alternative way to get the output below, would be to feed your
> aggregate query above to the crosstab() function in contrib/tablefunc.


I just looked at it, and it seems like the documentation says:

...
row_name and value must be of type text
...


Re: Column as arrays.. more efficient than columns?

From
Ron Johnson
Date:

On 09/06/07 21:26, Ow Mun Heng wrote:
> On Thu, 2007-09-06 at 20:57 -0500, Michael Glaesemann wrote:
>> On Sep 6, 2007, at 20:46 , Ow Mun Heng wrote:
>
>>> I would believe performance would be better it being denormalised. (in
>>> this case)
>> I assume you've arrived at the conclusion because you have
>> (a) shown
>> that the performance with a normalized schema does not meet your
>> needs;
>> (b) benchmarked the normalized schema under production
>> conditions;
>> (c) benchmarked the denormalized schema under production
>> conditions; and
>> (d) shown that performance is improved in the
>> denormalized case to arrive at that conclusion. I'm interested to see
>> the results of your comparisons.
>
>> Regardless, it sounds like you've already made up your mind. Why ask
>> for comments?
>
> You've assumed wrong. I've not arrived at any conclusion but merely
> exploring my options on which way would be the best to thread. I'm
> asking the list because I'm new in PG and after reading all those
> articles on highscalability etc.. majority of them are all using some
> kind of denormalised tables.

Correlation != causation.

There *might* be a causal relationship between high scalability and
table denormalization, but I seriously doubt it.

> Right now, there's 8 million rows of data in this one table, and growing
> at a rapid rate of ~2 million/week. I can significantly reduce this
> number down to 200K (i think by denormalising it) and shrink the table
> size.

Even presuming you only insert data SIX hours per day, that's only
13.3 inserts per second (roughly 2 million rows spread over a week's
worth of six-hour days).  Not very impressive.



Re: Column as arrays.. more efficient than columns?

From
Ron Johnson
Date:

On 09/06/07 20:53, Merlin Moncure wrote:
[snip]
>
> arrays are interesting and have some useful properties.  however, we
> must first discuss the problems... first and foremost, if you need to
> read any particular item off the array you must read the entire array
> from disk, and you must write all items back to disk for writes.

Reads and writes are done at the page level, so I'm not sure this is
valid.

>                                                                also,
> they cause some problems with constraints and other issues that come
> up with de-normalization tactics.

ACK.



Re: Column as arrays.. more efficient than columns?

From
Ow Mun Heng
Date:
On Fri, 2007-09-07 at 00:18 -0500, Ron Johnson wrote:
> On 09/06/07 21:26, Ow Mun Heng wrote:
> I've not arrived at any conclusion but merely
> > exploring my options on which way would be the best to thread. I'm
> > asking the list because I'm new in PG and after reading all those
> > articles on highscalability etc.. majority of them are all using some
> > kind of denormalised tables.
>
> Correlation != causation.
>
> There *might* be a causal relationship between high scalability and
> table denormalization, but I seriously doubt it.

I can't refute you on this since I have no experience in this arena,
only what I read on highscalability.com (IIRC).

> > Right now, there's 8 million rows of data in this one table, and growing
> > at a rapid rate of ~2 million/week. I can significantly reduce this
> > number down to 200K (i think by denormalising it) and shrink the table
> > size.
>
> Even presuming you only insert data SIX hours per day, that's only
> 13.3 inserts per second.  Not very impressive.

Data is inserted 24 hours a day, but not at the same rate each
second/minute. The problem isn't really the data insertion; it's already
inserted in a normalised manner. It's the selection of data (OLTP
datahouse) which takes a longer time and which is the area of worry.



Re: Column as arrays.. more efficient than columns?

From
Ron Johnson
Date:

On 09/07/07 02:49, Ow Mun Heng wrote:
> On Fri, 2007-09-07 at 00:18 -0500, Ron Johnson wrote:
>> On 09/06/07 21:26, Ow Mun Heng wrote:
>> I've not arrived at any conclusion but merely
>>> exploring my options on which way would be the best to thread. I'm
>>> asking the list because I'm new in PG and after reading all those
>>> articles on highscalability etc.. majority of them are all using some
>>> kind of denormalised tables.
>> Correlation != causation.
>>
>> There *might* be a causal relationship between high scalability and
>> table denormalization, but I seriously doubt it.
>
> I can't refute you on this since I have no experience in this arena,
> only what I read in highscalbility.com (IIRC)
>
>>> Right now, there's 8 million rows of data in this one table, and growing
>>> at a rapid rate of ~2 million/week. I can significantly reduce this
>>> number down to 200K (i think by denormalising it) and shrink the table
>>> size.
>> Even presuming you only insert data SIX hours per day, that's only
>> 13.3 inserts per second.  Not very impressive.
>
> Data is inserted 24 hours a day, but not at the same rate each
> sec/minute. The problem isn't really the data-insertion, it's already
> inserted in a normalised manner. It's the selection of data. (OLTP
> datahouse) which takes a longer time and which is the area of worry.

Datahouse or "data warehouse"?



Re: Column as arrays.. more efficient than columns?

From
"Merlin Moncure"
Date:
On 9/7/07, Ron Johnson <ron.l.johnson@cox.net> wrote:
>
> On 09/06/07 20:53, Merlin Moncure wrote:
> [snip]
> >
> > arrays are interesting and have some useful properties.  however, we
> > must first discuss the problems... first and foremost, if you need to
> > read any particular item off the array you must read the entire array
> > from disk, and you must write all items back to disk for writes.
>
> Reads and writes are done at the page level, so I'm not sure this is
> valid.

Sure it is... since the denormalized record is much larger (especially
in array scenarios), the tuple is much larger, meaning the page fills
up much more quickly, meaning more dead pages, more vacuuming,
etc.  Besides that, the server has to do some work presenting the
array as part of the read, which is overhead.  I didn't go into a lot
of detail, but the reasoning is sound.  Here is a quick example showing
the problem.


merlin

create table denormalized
(
  data int[]
);

create table normalized
(
  id int primary key,
  datum int
);

insert into normalized select v, v from generate_series(1, 100) v;
insert into denormalized select array(select generate_series(1,100));

create sequence rotator maxvalue 100 cycle;

-- bench denormalized (d.sql) --
update denormalized set data[n] = data[n] + 1
  from (select nextval('rotator') as n) q;

merlin@mernix:~$ pgbench -c 4 -t 1000 -f d.sql
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
number of clients: 4
number of transactions per client: 1000
number of transactions actually processed: 4000/4000
tps = 2452.188456 (including connections establishing)
tps = 2465.262905 (excluding connections establishing)

INFO:  "normalized": found 0 removable, 100 nonremovable row versions
in 38 pages


-- bench normalized (n.sql) --
update normalized set datum = datum + 1 where id = (select nextval('rotator'));

merlin@mernix:~$ pgbench -c 4 -t 1000 -f n.sql
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
number of clients: 4
number of transactions per client: 1000
number of transactions actually processed: 4000/4000
tps = 6494.402637 (including connections establishing)
tps = 6594.087741 (excluding connections establishing)

INFO:  "denormalized": found 0 removable, 1 nonremovable row versions
in 223 page

merlin

Re: Column as arrays.. more efficient than columns?

From
Ron Johnson
Date:

On 09/07/07 07:49, Merlin Moncure wrote:
> On 9/7/07, Ron Johnson <ron.l.johnson@cox.net> wrote:
>>
>> On 09/06/07 20:53, Merlin Moncure wrote:
>> [snip]
>>> arrays are interesting and have some useful properties.  however, we
>>> must first discuss the problems... first and foremost, if you need to
>>> read any particular item off the array you must read the entire array
>>> from disk, and you must write all items back to disk for writes.
>> Reads and writes are done at the page level, so I'm not sure this is
>> valid.
>
> sure it is...since the denormalized record is much larger (especially
> in array scenarios), the tuple is much larger meaning the page will
> fill up much more quickly meaning more dead pages, more vacuuming,
> etc.   Besides that, the server has to do some work presenting the
> array as part of the read which is overhead.  I didn't go into a lot
> of detail but the reasoning is sound.  Here is a quick example showing
> the problem.

We agree.

What I meant was that reads and writes are done at the page level no
matter whether the table is normalized or not.  Thus, to say "if you
need to read any particular item off the array you must read the
entire array from disk and you must write all items back to disk for
writes" is... irrelevant.

That's probably an imprecise word, but it's all I can think of at
the moment.

We also agree regarding big records filling pages faster.





CLOB support in postgresql

From
SHARMILA JOTHIRAJAH
Date:
What is the best way to store CLOB data in PostgreSQL?
Currently I store my CLOB data as text. The data consists of large XML files. When I access the text field using the getString method in JDBC, the program is not able to hold a huge string in memory. If I could stream this data to and from the database it would be good. Is there a way to do this?
Are there any other solutions to this problem?
Thanks in advance
shar joe



Re: Column as arrays.. more efficient than columns?

From
"Ow Mun Heng"
Date:

>Datahouse or "data warehouse"?

OLTP data warehouse.

Re: Column as arrays.. more efficient than columns?

From
"Ow Mun Heng"
Date:

On 9/7/07, Ron Johnson <ron.l.johnson@cox.net> wrote:

> On 09/06/07 20:53, Merlin Moncure wrote:
> [snip]
> >
> > arrays are interesting and have some useful properties.  however, we
> > must first discuss the problems... first and foremost, if you need to
> > read any particular item off the array you must read the entire array
> > from disk, and you must write all items back to disk for writes.
>
> Reads and writes are done at the page level, so I'm not sure this is
> valid.

>>sure it is...since the denormalized record is much larger (especially
>>in array scenarios), the tuple is much larger meaning the page will
>>fill up much more quickly meaning more dead pages, more vacuuming,
>>etc.   Besides that, the server has to do some work presenting the
>>array as part of the read which is overhead.  I didn't go into a lot
>>of detail but the reasoning is sound.  Here is a quick example showing
>>the problem.

[snip]

It's a data warehouse, so once the data has been denormalised it doesn't need to be updated again, so I would think that Merlin's TPS numbers don't really come into play here.

Anyway.. here are some stats on the table I was working on.

Denormalising the table reduced the # of rows quite a bit, and the time taken to return 2.8K rows' worth of results is only ~2 s on the denormalised table vs. 76 s on the original table.
AFAICT, this shows me better performance in terms of read-back.

Comments please, since I may not necessarily know what I'm doing.

BTW, I don't really get all the talk about dead tuples, pages filling up more quickly, etc. (The data below is based on denormalising the table into column form rather than as an array.)

normalised table = 8 million rows
denormalised table = 328K rows
join table sfoo = 3.6 million rows
join table dbar = 1.5 million rows
join table smallfoo = 108 rows

Nested Loop  (cost=0.00..15022.75 rows=1 width=280) (actual time=0.597..1345.239 rows=2872 loops=1)
  Join Filter: ((dbar.famid)::text = (fam.famid)::text)
  ->  Nested Loop  (cost=0.00..15017.32 rows=1 width=274) (actual time=0.310..247.265 rows=2872 loops=1)
        Join Filter: ((sfoo.date_time = denorm.date_time) AND (sfoo.ttype = denorm.ttype))
        ->  Nested Loop  (cost=0.00..5767.36 rows=71 width=281) (actual time=0.246..85.985 rows=2872 loops=1)
              ->  Index Scan using idx_dbar on dbar  (cost=0.00..1175.61 rows=332 width=28) (actual time=0.154..46.172 rows=482 loops=1)
                    Index Cond: ((code)::text = 'AAA71'::text)
              ->  Index Scan using idx_denorm on denorm   (cost=0.00..13.74 rows=7 width=253) (actual time=0.017..0.055 rows=6 loops=482)
                    Index Cond: ((denorm.snum)::text = (dbar.snum)::text)
        ->  Index Scan using idx_ts_sn on sfoo  (cost=0.00..129.48 rows=46 width=37) (actual time=0.010..0.022 rows=6 loops=2872)
              Index Cond: ((sfoo.snum)::text = (norm.snum)::text)
  ->  Seq Scan on fam  (cost=0.00..4.08 rows=108 width=18) (actual time=0.004..0.169 rows=108 loops=2872)
Total runtime: 1350.234 ms

returned 2.8K rows

HashAggregate  (cost=61819.46..61819.67 rows=1 width=73) (actual time=76251.012..76586.406 rows=2872 loops=1)
  ->  Nested Loop  (cost=20.55..61819.40 rows=1 width=73) (actual time=140.007..36979.539 rows=57440 loops=1)
        Join Filter: ((dbar.famid)::text = (fam.famid)::text)
        ->  Nested Loop  (cost=20.55..61813.97 rows=1 width=67) (actual time=139.585..3412.300 rows=57440 loops=1)
              Join Filter: (sfoo.ttype = norm.ttype)
              ->  Nested Loop  (cost=0.00..42351.18 rows=792 width=65) (actual time=0.117..464.893 rows=2275 loops=1)
                    ->  Index Scan using idx_dbar on bar (cost=0.00..1175.61 rows=332 width=28) (actual time=0.058..7.275 rows=482 loops=1)
                          Index Cond: ((code)::text = 'AAA71'::text)
                    ->  Index Scan using idx_sfoo on sfoo  (cost=0.00..123.45 rows=46 width=37) (actual time=0.761..0.929 rows=5 loops=482)
                          Index Cond: ((sfoo.snum)::text = (dbar.snum)::text)
              ->  Bitmap Heap Scan on norm  (cost=20.55..24.56 rows=1 width=46) (actual time=1.144..1.202 rows=25 loops=2275)
                    Recheck Cond: (((norm.snum)::text = (dbar.snum)::text) AND (sfoo.date_time = norm.date_time))
                    ->  BitmapAnd  (cost=20.55..20.55 rows=1 width=0) (actual time=0.929..0.929 rows=0 loops=2275)
                          ->  Bitmap Index Scan on idx_norm  (cost=0.00..6.70 rows=166 width=0) (actual time=0.056..0.056 rows=142 loops=2275)
                                Index Cond: ((norm.snum)::text = (dbar.snum)::text)
                          ->  Bitmap Index Scan on idx_trz_rundate  (cost=0.00..13.40 rows=604 width=0) (actual time=0.977..0.977 rows=55 loops=2021)
                                Index Cond: (sfoo.date_time = norm.date_time)
        ->  Seq Scan on fam fam_id  (cost=0.00..4.08 rows=108 width=18) (actual time=0.008..0.287 rows=108 loops=57440)
Total runtime: 76591.106 ms

Re: Column as arrays.. more efficient than columns?

From
Joe Conway
Date:
Ow Mun Heng wrote:
> On Thu, 2007-09-06 at 20:19 -0700, Joe Conway wrote:
>> An alternative way to get the output below, would be to feed your
>> aggregate query above to the crosstab() function in contrib/tablefunc.
>
> I just looked at it and seems like the
> ...
> row_name and value must be of type text

It doesn't say that, and apparently you didn't bother to test...

create table foo (number int, subset int, value int);
INSERT INTO foo VALUES(1,1,11),(1,2,22),(1,3,30),(1,4,43);
INSERT INTO foo VALUES(2,1,10),(2,2,23),(2,3,31),(2,4,46);
INSERT INTO foo VALUES(3,1,12),(3,2,24),(3,3,34),(3,4,47);
INSERT INTO foo VALUES(4,1,9),(4,2,23),(4,3,35),(4,4,42);
INSERT INTO foo VALUES(5,1,10),(5,2,22),(5,3,33),(5,4,45);

select number,
round(avg(case when subset = '0' then value else null end),0) as v0,
round(avg(case when subset = '1' then value else null end),0) as v1,
round(avg(case when subset = '2' then value else null end),0) as v2,
round(avg(case when subset = '3' then value else null end),0) as v3,
round(avg(case when subset = '4' then value else null end),0) as v4
from foo
group by number order by number;

  number | v0 | v1 | v2 | v3 | v4
--------+----+----+----+----+----
       1 |    | 11 | 22 | 30 | 43
       2 |    | 10 | 23 | 31 | 46
       3 |    | 12 | 24 | 34 | 47
       4 |    |  9 | 23 | 35 | 42
       5 |    | 10 | 22 | 33 | 45
(5 rows)

select * from crosstab(
   'select number, subset, round(avg(value))
    from foo group by number, subset order by number',
   'select * from (values(0),(1),(2),(3),(4)) as vc')
AS ct(code int, v0 int, v1 int, v2 int, v3 int, v4 int);

  code | v0 | v1 | v2 | v3 | v4
------+----+----+----+----+----
     1 |    | 11 | 22 | 30 | 43
     2 |    | 10 | 23 | 31 | 46
     3 |    | 12 | 24 | 34 | 47
     4 |    |  9 | 23 | 35 | 42
     5 |    | 10 | 22 | 33 | 45
(5 rows)

create table foo2 (number int, subset int, value float8);
INSERT INTO foo2 VALUES(1,1,1.5*11),(1,2,1.4*22),(1,3,1.3*30),(1,4,1.2*43);
INSERT INTO foo2 VALUES(2,1,1.5*10),(2,2,1.4*23),(2,3,1.3*31),(2,4,1.2*46);
INSERT INTO foo2 VALUES(3,1,1.5*12),(3,2,1.4*24),(3,3,1.3*34),(3,4,1.2*47);
INSERT INTO foo2 VALUES(4,1,1.5*9),(4,2,1.4*23),(4,3,1.3*35),(4,4,1.2*42);
INSERT INTO foo2 VALUES(5,1,1.5*10),(5,2,1.4*22),(5,3,1.3*33),(5,4,1.2*45);
INSERT INTO foo2 VALUES(1,1,2.5*11),(1,2,2.4*22),(1,3,2.3*30),(1,4,2.2*43);
INSERT INTO foo2 VALUES(2,1,2.5*10),(2,2,2.4*23),(2,3,2.3*31),(2,4,2.2*46);
INSERT INTO foo2 VALUES(3,1,2.5*12),(3,2,2.4*24),(3,3,2.3*34),(3,4,2.2*47);
INSERT INTO foo2 VALUES(4,1,2.5*9),(4,2,2.4*23),(4,3,2.3*35),(4,4,2.2*42);
INSERT INTO foo2 VALUES(5,1,2.5*10),(5,2,2.4*22),(5,3,2.3*33),(5,4,2.2*45);

select number,
avg(case when subset = '0' then value else null end) as v0,
avg(case when subset = '1' then value else null end) as v1,
avg(case when subset = '2' then value else null end) as v2,
avg(case when subset = '3' then value else null end) as v3,
avg(case when subset = '4' then value else null end) as v4
from foo2
group by number order by number;

  number | v0 | v1 |  v2  |  v3  |  v4
--------+----+----+------+------+------
       1 |    | 22 | 41.8 |   54 | 73.1
       2 |    | 20 | 43.7 | 55.8 | 78.2
       3 |    | 24 | 45.6 | 61.2 | 79.9
       4 |    | 18 | 43.7 |   63 | 71.4
       5 |    | 20 | 41.8 | 59.4 | 76.5
(5 rows)

select * from crosstab(
   'select number, subset, avg(value) from
    foo2 group by number, subset order by number',
   'select * from (values(0),(1),(2),(3),(4)) as vc')
AS ct(code int, v0 float8, v1 float8, v2 float8, v3 float8, v4 float8);

  code | v0 | v1 |  v2  |  v3  |  v4
------+----+----+------+------+------
     1 |    | 22 | 41.8 |   54 | 73.1
     2 |    | 20 | 43.7 | 55.8 | 78.2
     3 |    | 24 | 45.6 | 61.2 | 79.9
     4 |    | 18 | 43.7 |   63 | 71.4
     5 |    | 20 | 41.8 | 59.4 | 76.5
(5 rows)

Joe


Re: Column as arrays.. more efficient than columns?

From
Ron Johnson
Date:

On 09/07/07 09:00, Ow Mun Heng wrote:
>
>>Datahouse or "data warehouse"?
>
> OLTP data warehouse.

But OLTP & DW are diametrically opposed in how you design,
structure, load and use them.



Re: CLOB support in postgresql

From
"Scott Marlowe"
Date:
On 9/7/07, SHARMILA JOTHIRAJAH <sharmi_jo@yahoo.com> wrote:
> What is the best way to store clob data in postgresql?
> Currently I store my clob data as a text. The data consists of large xml
> files. When I access the text field using getString method in jdbc, the
> program is not able to hold a huge string in memory .If I can stream this
> data  to and from the database it will be good. Is there a way to do this?
> Are there any other solution to this problem?

Look into large objects.  They stream.
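
On the SQL side, the large-object route looks roughly like this (table
name and file path are illustrative; the actual streaming would then be
done from the client through the JDBC driver's large-object API rather
than getString):

-- store each XML document as a large object and keep its OID
create table xml_docs (
id       serial primary key,
body_oid oid
);

-- server-side import of a file into a large object
insert into xml_docs (body_oid)
values (lo_import('/tmp/big_document.xml'));

-- remember to lo_unlink() the object when its row is deleted
-- (e.g. via a trigger), or the large object will be orphaned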