Home > mailing lists

Re: [PERFORM] Using array instead of sub table (storage and speed) - Mailing list pgsql-performance

From	Stephen Frost
Subject	Re: [PERFORM] Using array instead of sub table (storage and speed)
Date	June 15, 2017 16:37:04
Msg-id	20170615133704.GQ1769@tamriel.snowman.net Whole thread
In response to	[PERFORM] Using array instead of sub table (storage and speed) (Lutz Fischer <l.fischer@ed.ac.uk>)
Responses	Re: [PERFORM] Using array instead of sub table (storage and speed)
List	pgsql-performance

Tree view

Greetings,

* Lutz Fischer (l.fischer@ed.ac.uk) wrote:
> Data in [sp] are never changed. I can probably reduce the size by
> changing datatypes from numeric to float but I was wondering if it
> would be more efficient - primarily in terms of storage -  to change
> the structure to have two arrays in [s]. E.g.

The short answer is 'probably', but it really depends on how wide your
rows are.

> I haven't really found anything yet how much space (e.g. how many
> bytes) an array will use compared to a table row in postgresql.

There's a 24-byte overhead for each tuple.  If the width of the tuple's
columns ends up being less than 24 bytes then half (or more) of the
space used is for the tuple header.  Arrays have a bit of overhead
themsleves but are then densely packed.

In testing that I've done, a table which looks like:

CREATE TABLE t1 (
  c1 int
);

Will end up with a couple hundred rows per 8k page (perhaps 250 or so),
meaning that you get ~1k of actual data for 8k of storage.  Changing
this to an array, like so:

CREATE TABLE t1 (
  c1 int[]
);

And then storing 3-4 tuples per 8k page (keeping each tuple under the 2k
TOAST limit) lead to being able to store something like 450 ints per
tuple with a subsequent result of 1800 ints per page and ~7.2k worth of
actual data for 8k of storage, which was much more efficient for
storage.

Of course, the tuple header is actually useful data in many
environments- if you go with this approach then you have to work out how
to deal with the fact that a given tuple is either visible or not, and
all the ints in the array for that tuple are all visible and that an
update to that tuple locks the entire tuple and that set of ints, etc.
If the data isn't changed after being loaded and you're able to load an
entire tuple all at once then this could work.

Note that arrays aren't more efficient than just using individual
columns, and this technique is only going to be helpful if the tuple
overhead in your situation is a large portion of the data and using this
technique allows you to reduce the number of tuples stored.

Thanks!

Stephen

Attachment

signature.asc

pgsql-performance by date:

From: Lutz Fischer
Date: 15 June 2017, 11:06:39
Subject: [PERFORM] Using array instead of sub table (storage and speed)

From: "ldh@laurent-hasson.com"
Date: 15 June 2017, 17:53:44
Subject: [PERFORM] Sudden drastic change in performance

Re: [PERFORM] Using array instead of sub table (storage and speed) - Mailing list pgsql-performance

Attachment

Previous

Next