Re: [PERFORM] Using array instead of sub table (storage and speed) - Mailing list pgsql-performance

From Stephen Frost
Subject Re: [PERFORM] Using array instead of sub table (storage and speed)
Date
Msg-id 20170615133704.GQ1769@tamriel.snowman.net
Whole thread Raw
In response to [PERFORM] Using array instead of sub table (storage and speed)  (Lutz Fischer <l.fischer@ed.ac.uk>)
Responses Re: [PERFORM] Using array instead of sub table (storage and speed)  (Lutz Fischer <l.fischer@ed.ac.uk>)
List pgsql-performance
Greetings,

* Lutz Fischer (l.fischer@ed.ac.uk) wrote:
> Data in [sp] are never changed. I can probably reduce the size by
> changing datatypes from numeric to float but I was wondering if it
> would be more efficient - primarily in terms of storage -  to change
> the structure to have two arrays in [s]. E.g.

The short answer is 'probably', but it really depends on how wide your
rows are.

> I haven't really found anything yet how much space (e.g. how many
> bytes) an array will use compared to a table row in postgresql.

There's a 24-byte overhead for each tuple.  If the width of the tuple's
columns ends up being less than 24 bytes then half (or more) of the
space used is for the tuple header.  Arrays have a bit of overhead
themsleves but are then densely packed.

In testing that I've done, a table which looks like:

CREATE TABLE t1 (
  c1 int
);

Will end up with a couple hundred rows per 8k page (perhaps 250 or so),
meaning that you get ~1k of actual data for 8k of storage.  Changing
this to an array, like so:

CREATE TABLE t1 (
  c1 int[]
);

And then storing 3-4 tuples per 8k page (keeping each tuple under the 2k
TOAST limit) lead to being able to store something like 450 ints per
tuple with a subsequent result of 1800 ints per page and ~7.2k worth of
actual data for 8k of storage, which was much more efficient for
storage.

Of course, the tuple header is actually useful data in many
environments- if you go with this approach then you have to work out how
to deal with the fact that a given tuple is either visible or not, and
all the ints in the array for that tuple are all visible and that an
update to that tuple locks the entire tuple and that set of ints, etc.
If the data isn't changed after being loaded and you're able to load an
entire tuple all at once then this could work.

Note that arrays aren't more efficient than just using individual
columns, and this technique is only going to be helpful if the tuple
overhead in your situation is a large portion of the data and using this
technique allows you to reduce the number of tuples stored.

Thanks!

Stephen

Attachment

pgsql-performance by date:

Previous
From: Lutz Fischer
Date:
Subject: [PERFORM] Using array instead of sub table (storage and speed)
Next
From: "ldh@laurent-hasson.com"
Date:
Subject: [PERFORM] Sudden drastic change in performance