Thread: [PERFORM] Using array instead of sub table (storage and speed)

[PERFORM] Using array instead of sub table (storage and speed)

From
Lutz Fischer
Date:
Hi,


I have two tables

s {
    id bigint NOT NULL PRIMARY KEY,
    ...
}

sp {
    id bigint PRIMARY KEY,
    sid bigint REFERENCES s (id),
    i numeric,
    m numeric,
    ...
}


Each entry in [s] has on average around 120 entries in [sp], and that
table has become the largest table in my database (about 8.8*10^9
rows).

Data in [sp] are never changed. I can probably reduce the size by
changing datatypes from numeric to float but I was wondering if it would
be more efficient - primarily in terms of storage -  to change the
structure to have two arrays in [s]. E.g.

s {
    id bigint NOT NULL PRIMARY KEY,
    i numeric[],
    m numeric[],
    ...
}


Switching the datatypes from numeric to float/double as well, the final
table would look like this:

s {
    id bigint NOT NULL PRIMARY KEY,
    i float[],
    m double[],
    ...
}


I haven't really found anything yet on how much space (e.g. how many
bytes) an array uses compared to a table row in PostgreSQL.
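Purely as a sketch of what that migration could look like (column and
table names taken from the thread; note that PostgreSQL spells the
4- and 8-byte float types "real" and "double precision"):

```sql
-- Sketch only: collapse the ~120 [sp] rows per [s] entry into two arrays.
ALTER TABLE s
    ADD COLUMN i real[],
    ADD COLUMN m double precision[];

UPDATE s
   SET i = agg.i_arr,
       m = agg.m_arr
  FROM (SELECT sid,
               array_agg(sp.i ORDER BY sp.id) AS i_arr,
               array_agg(sp.m ORDER BY sp.id) AS m_arr
          FROM sp
         GROUP BY sid) AS agg
 WHERE s.id = agg.sid;
```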


Thanks

Lutz






--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: [PERFORM] Using array instead of sub table (storage and speed)

From
Stephen Frost
Date:
Greetings,

* Lutz Fischer (l.fischer@ed.ac.uk) wrote:
> Data in [sp] are never changed. I can probably reduce the size by
> changing datatypes from numeric to float but I was wondering if it
> would be more efficient - primarily in terms of storage -  to change
> the structure to have two arrays in [s]. E.g.

The short answer is 'probably', but it really depends on how wide your
rows are.

> I haven't really found anything yet how much space (e.g. how many
> bytes) an array will use compared to a table row in postgresql.

There's a 24-byte overhead for each tuple.  If the width of the tuple's
columns ends up being less than 24 bytes then half (or more) of the
space used is for the tuple header.  Arrays have a bit of overhead
themselves but are then densely packed.

In testing that I've done, a table which looks like:

CREATE TABLE t1 (
  c1 int
);

Will end up with a couple hundred rows per 8k page (perhaps 250 or so),
meaning that you get ~1k of actual data for 8k of storage.  Changing
this to an array, like so:

CREATE TABLE t1 (
  c1 int[]
);

And then storing 3-4 tuples per 8k page (keeping each tuple under the 2k
TOAST limit) led to being able to store something like 450 ints per
tuple with a subsequent result of 1800 ints per page and ~7.2k worth of
actual data for 8k of storage, which was much more efficient for
storage.
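Those figures are easy to sanity-check with a little arithmetic (the
24-byte tuple header and 8 KB page size are PostgreSQL defaults; the
row counts are from the test above):

```python
# Rough check of the per-page storage arithmetic above.
PAGE = 8192   # default PostgreSQL page size in bytes
INT = 4       # size of an int column in bytes

# One int per row: ~250 rows/page -> ~1k of actual data per 8k page.
rows_per_page = 250
data_per_page = rows_per_page * INT
print(data_per_page)   # 1000 bytes, i.e. ~1k of data for 8k of storage

# int[] rows: ~4 tuples/page at ~450 ints each -> ~7.2k of data per page.
tuples_per_page = 4
ints_per_tuple = 450
array_data = tuples_per_page * ints_per_tuple * INT
print(array_data)      # 7200 bytes, i.e. ~7.2k of data for 8k of storage
```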

Of course, the tuple header is actually useful data in many
environments- if you go with this approach then you have to work out how
to deal with the fact that a given tuple is either visible or not, and
all the ints in the array for that tuple are all visible and that an
update to that tuple locks the entire tuple and that set of ints, etc.
If the data isn't changed after being loaded and you're able to load an
entire tuple all at once then this could work.

Note that arrays aren't more efficient than just using individual
columns, and this technique is only going to be helpful if the tuple
overhead in your situation is a large portion of the data and using this
technique allows you to reduce the number of tuples stored.

Thanks!

Stephen


Re: [PERFORM] Using array instead of sub table (storage and speed)

From
Lutz Fischer
Date:
Hi Stephen,


Thanks for your reply. The data in the sub table [sp] are only ever
read as a block: I always read all entries in [sp] that belong to one
entry in [s]. So I would not lose much in terms of what I can do with
the data in [sp], and I could save around 2.8K per entry in [s] (just
counting the overhead for each tuple in [sp]).
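That 2.8K figure follows directly from the 24-byte tuple header
Stephen mentioned:

```python
# Tuple-header overhead saved per [s] entry when ~120 [sp] rows
# are collapsed into arrays (figures from the thread).
ROWS_PER_S = 120   # average number of [sp] rows per [s] entry
HEADER = 24        # per-tuple header overhead in bytes
saved = ROWS_PER_S * HEADER
print(saved)       # 2880 bytes, i.e. roughly 2.8K per [s] entry
```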


One thing I still wonder about is how far this would affect the
performance of retrieving data from [s].

I often need some data from [s] where I don't care about [sp]. Would
having these arrays as part of [s] make those queries slower? Or would
it be better to store the array data in a separate table, i.e. keep [s]
as it is now and turn [sp] into an array-aggregated table?


Thanks,

Lutz



On 15/06/17 15:37, Stephen Frost wrote:
> Greetings,
>
> * Lutz Fischer (l.fischer@ed.ac.uk) wrote:
>> Data in [sp] are never changed. I can probably reduce the size by
>> changing datatypes from numeric to float but I was wondering if it
>> would be more efficient - primarily in terms of storage -  to change
>> the structure to have two arrays in [s]. E.g.
> The short answer is 'probably', but it really depends on how wide your
> rows are.
>
>> I haven't really found anything yet how much space (e.g. how many
>> bytes) an array will use compared to a table row in postgresql.
> There's a 24-byte overhead for each tuple.  If the width of the tuple's
> columns ends up being less than 24 bytes then half (or more) of the
> space used is for the tuple header.  Arrays have a bit of overhead
> themselves but are then densely packed.
>
> In testing that I've done, a table which looks like:
>
> CREATE TABLE t1 (
>    c1 int
> );
>
> Will end up with a couple hundred rows per 8k page (perhaps 250 or so),
> meaning that you get ~1k of actual data for 8k of storage.  Changing
> this to an array, like so:
>
> CREATE TABLE t1 (
>    c1 int[]
> );
>
> And then storing 3-4 tuples per 8k page (keeping each tuple under the 2k
> TOAST limit) led to being able to store something like 450 ints per
> tuple with a subsequent result of 1800 ints per page and ~7.2k worth of
> actual data for 8k of storage, which was much more efficient for
> storage.
>
> Of course, the tuple header is actually useful data in many
> environments- if you go with this approach then you have to work out how
> to deal with the fact that a given tuple is either visible or not, and
> all the ints in the array for that tuple are all visible and that an
> update to that tuple locks the entire tuple and that set of ints, etc.
> If the data isn't changed after being loaded and you're able to load an
> entire tuple all at once then this could work.
>
> Note that arrays aren't more efficient than just using individual
> columns, and this technique is only going to be helpful if the tuple
> overhead in your situation is a large portion of the data and using this
> technique allows you to reduce the number of tuples stored.
>
> Thanks!
>
> Stephen





Re: [PERFORM] Using array instead of sub table (storage and speed)

From
Stephen Frost
Date:
Greetings, Lutz!

Please don't top-post on the PG mailing lists; our style is to reply
in-line.

* Lutz Fischer (l.fischer@ed.ac.uk) wrote:
> I often need some data from [s] where I don't care about [sp]. Would
> having these arrays as part of [s] make those queries slower? Or
> would it be better to store the array data in a separate table, i.e.
> keep [s] as it is now and turn [sp] into an array-aggregated table?

If that's the case then you would probably be better off putting the
arrays into an independent table, yes.
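For illustration only (the table name sp_agg is hypothetical, the
columns are modeled on the thread's [sp]), such an independent array
table could look like:

```sql
-- Sketch: keep [s] unchanged and replace [sp] with one
-- array-aggregated row per [s] entry.
CREATE TABLE sp_agg AS
SELECT sid,
       array_agg(i ORDER BY id) AS i,
       array_agg(m ORDER BY id) AS m
  FROM sp
 GROUP BY sid;

ALTER TABLE sp_agg ADD PRIMARY KEY (sid);
```

Queries that only touch [s] then never pay for the array data, while
the per-entry block read becomes a single-row fetch from sp_agg.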

Thanks!

Stephen
