Re: [PERFORM] Using array instead of sub table (storage and speed) - Mailing list pgsql-performance

From Lutz Fischer
Subject Re: [PERFORM] Using array instead of sub table (storage and speed)
Date
Msg-id 867b4235-95ba-6910-8224-824f75f7ce99@ed.ac.uk
In response to Re: [PERFORM] Using array instead of sub table (storage and speed)  (Stephen Frost <sfrost@snowman.net>)
Responses Re: [PERFORM] Using array instead of sub table (storage and speed)  (Stephen Frost <sfrost@snowman.net>)
List pgsql-performance
Hi Stephen,


Thanks for your reply. The data in the sub table [sp] are only ever read
in as a block, meaning I always read in all entries in [sp] that belong
to one entry in [s]. So I would not lose much in terms of what I can do
with the data in [sp], and I could save around 2.8K per entry in [s]
(just counting the overhead for each tuple in [sp]).


One thing I still wonder about is how much this would affect the
performance of retrieving data from [s].

I often need some data from [s] where I don't care about [sp]. Would
having these arrays as part of [s] make those queries slower? Or would
it be better to store the array data in a separate table, e.g. keep [s]
as it is now but turn [sp] into an array-aggregated table?
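A rough sketch of that separate array-aggregated table might look like
the following. Since the actual columns of [s] and [sp] aren't shown
here, all of the table and column names below are hypothetical:

```sql
-- Hypothetical array-aggregated version of [sp]: one row per [s]
-- entry, with the per-row values of [sp] packed into arrays.
CREATE TABLE sp_agg (
    s_id    bigint PRIMARY KEY REFERENCES s (id),
    vals_a  float4[],
    vals_b  float4[]
);

-- Populate it from the existing sub table, using a stable element
-- order so the two arrays stay aligned with each other.
INSERT INTO sp_agg (s_id, vals_a, vals_b)
SELECT s_id,
       array_agg(val_a ORDER BY sp_id),
       array_agg(val_b ORDER BY sp_id)
FROM   sp
GROUP  BY s_id;
```

Keeping the arrays in a side table like this would mean queries that
only touch [s] never pay for the (likely TOASTed) array columns.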


Thanks,

Lutz



On 15/06/17 15:37, Stephen Frost wrote:
> Greetings,
>
> * Lutz Fischer (l.fischer@ed.ac.uk) wrote:
>> Data in [sp] are never changed. I can probably reduce the size by
>> changing datatypes from numeric to float but I was wondering if it
>> would be more efficient - primarily in terms of storage -  to change
>> the structure to have two arrays in [s]. E.g.
> The short answer is 'probably', but it really depends on how wide your
> rows are.
>
>> I haven't really found anything yet how much space (e.g. how many
>> bytes) an array will use compared to a table row in postgresql.
> There's a 24-byte overhead for each tuple.  If the width of the tuple's
> columns ends up being less than 24 bytes then half (or more) of the
> space used is for the tuple header.  Arrays have a bit of overhead
> themselves but are then densely packed.
>
> In testing that I've done, a table which looks like:
>
> CREATE TABLE t1 (
>    c1 int
> );
>
> Will end up with a couple hundred rows per 8k page (perhaps 250 or so),
> meaning that you get ~1k of actual data for 8k of storage.  Changing
> this to an array, like so:
>
> CREATE TABLE t1 (
>    c1 int[]
> );
>
> And then storing 3-4 tuples per 8k page (keeping each tuple under the 2k
> TOAST limit) led to being able to store something like 450 ints per
> tuple with a subsequent result of 1800 ints per page and ~7.2k worth of
> actual data for 8k of storage, which was much more efficient for
> storage.
>
> Of course, the tuple header is actually useful data in many
> environments- if you go with this approach then you have to work out how
> to deal with the fact that a given tuple is either visible or not, and
> all the ints in the array for that tuple are all visible and that an
> update to that tuple locks the entire tuple and that set of ints, etc.
> If the data isn't changed after being loaded and you're able to load an
> entire tuple all at once then this could work.
>
> Note that arrays aren't more efficient than just using individual
> columns, and this technique is only going to be helpful if the tuple
> overhead in your situation is a large portion of the data and using this
> technique allows you to reduce the number of tuples stored.
>
> Thanks!
>
> Stephen
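
For reference, the packing arithmetic Stephen describes above can be
sanity-checked with a quick query (all figures are his approximate
estimates, not exact on-disk numbers):

```sql
-- Rough check of the packing figures quoted above (approximate):
--  * plain table: ~250 single-int rows per 8 kB page
--  * array table: ~4 tuples per page, ~450 ints each
--    (each tuple kept under the 2 kB TOAST threshold)
SELECT 250 * 4      AS plain_data_bytes_per_page,  -- ~1000 (~1 kB of 8 kB)
       4 * 450      AS ints_per_page_with_arrays,  -- ~1800
       4 * 450 * 4  AS array_data_bytes_per_page;  -- ~7200 (~7.2 kB of 8 kB)
```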


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


