Thread: [PERFORM] Using array instead of sub table (storage and speed)
Hi,

I have two tables:

  s  { id bigint NOT NULL PRIMARY KEY, ... }
  sp { id bigint PRIMARY KEY, sid bigint REFERENCES s (id), i numeric, m numeric, ... }

For each entry in [s] I have on average around 120 entries in [sp], and that table has become the largest table in my database (8.81565*10^09 entries). Data in [sp] are never changed.

I was wondering if it would be more efficient - primarily in terms of storage - to change the structure to have two arrays in [s], e.g.

  s { id bigint NOT NULL PRIMARY KEY, i numeric[], m numeric[], ... }

I can probably also reduce the size by changing datatypes from numeric to float/double, so the final table would look like this:

  s { id bigint NOT NULL PRIMARY KEY, i float[], m double[], ... }

I haven't really found anything yet on how much space (e.g. how many bytes) an array will use compared to a table row in PostgreSQL.

Thanks,
Lutz

-- 
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Greetings,

* Lutz Fischer (l.fischer@ed.ac.uk) wrote:
> Data in [sp] are never changed. I can probably reduce the size by
> changing datatypes from numeric to float but I was wondering if it
> would be more efficient - primarily in terms of storage - to change
> the structure to have two arrays in [s]. E.g.

The short answer is 'probably', but it really depends on how wide your rows are.

> I haven't really found anything yet how much space (e.g. how many
> bytes) an array will use compared to a table row in postgresql.

There's a 24-byte overhead for each tuple. If the width of the tuple's columns ends up being less than 24 bytes then half (or more) of the space used is for the tuple header. Arrays have a bit of overhead themselves but are then densely packed.

In testing that I've done, a table which looks like:

  CREATE TABLE t1 (
      c1 int
  );

will end up with a couple hundred rows per 8k page (perhaps 250 or so), meaning that you get ~1k of actual data for 8k of storage. Changing this to an array, like so:

  CREATE TABLE t1 (
      c1 int[]
  );

and then storing 3-4 tuples per 8k page (keeping each tuple under the 2k TOAST limit) led to being able to store something like 450 ints per tuple, with a subsequent result of 1800 ints per page and ~7.2k worth of actual data for 8k of storage, which was much more efficient for storage.

Of course, the tuple header is actually useful data in many environments - if you go with this approach then you have to work out how to deal with the fact that a given tuple is either visible or not, that all the ints in the array for that tuple are visible together, that an update to that tuple locks the entire tuple and that set of ints, etc. If the data isn't changed after being loaded and you're able to load an entire tuple all at once then this could work.
Note that arrays aren't more efficient than just using individual columns, and this technique is only going to be helpful if the tuple overhead in your situation is a large portion of the data and using this technique allows you to reduce the number of tuples stored.

Thanks!

Stephen
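Stephen's per-page figures can be reproduced on a scratch database. The following is a rough sketch (table names t1/t1_arr, the row count, and the 450-element chunking are illustrative assumptions, not from his test):

```
-- Rough sketch for reproducing the storage comparison above.
-- Table names and row counts are illustrative.

CREATE TABLE t1 (c1 int);
INSERT INTO t1 SELECT g FROM generate_series(1, 100000) g;

CREATE TABLE t1_arr (c1 int[]);
-- 450 ints per array keeps each tuple under the ~2k TOAST threshold
INSERT INTO t1_arr
SELECT array_agg(g)
FROM generate_series(1, 100000) g
GROUP BY (g - 1) / 450;

-- Compare total on-disk size of the two layouts
SELECT pg_size_pretty(pg_relation_size('t1'))     AS per_row_size,
       pg_size_pretty(pg_relation_size('t1_arr')) AS array_size;
```

The per-row table should come out several times larger than the array table, since in the first layout the 24-byte tuple header dominates each 4-byte int.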
Hi Stephen,

Thanks for your reply.

The data in the sub table [sp] are only read in as a block: I will always read in all entries in [sp] that belong to one entry in [s]. So I would not lose much in terms of what I can do with the data in [sp], and I could save around 2.8K per entry in [s] just counting the tuple overhead (~120 tuples x 24 bytes) for each entry's rows in [sp].

One thing I still wonder is how far this would affect the performance of retrieving data from [s]. I often need some data from [s] where I don't care about [sp], so how much slower would these queries become with the arrays as part of [s]? Or would it be better to store the array data in a separate table, i.e. keep [s] as it is now but turn [sp] into an array-aggregated table?

Thanks,
Lutz

On 15/06/17 15:37, Stephen Frost wrote:
> Greetings,
>
> * Lutz Fischer (l.fischer@ed.ac.uk) wrote:
>> Data in [sp] are never changed. I can probably reduce the size by
>> changing datatypes from numeric to float but I was wondering if it
>> would be more efficient - primarily in terms of storage - to change
>> the structure to have two arrays in [s]. E.g.
> The short answer is 'probably', but it really depends on how wide your
> rows are.
>
>> I haven't really found anything yet how much space (e.g. how many
>> bytes) an array will use compared to a table row in postgresql.
> There's a 24-byte overhead for each tuple. If the width of the tuple's
> columns ends up being less than 24 bytes then half (or more) of the
> space used is for the tuple header. Arrays have a bit of overhead
> themselves but are then densely packed.
>
> In testing that I've done, a table which looks like:
>
> CREATE TABLE t1 (
>     c1 int
> );
>
> will end up with a couple hundred rows per 8k page (perhaps 250 or so),
> meaning that you get ~1k of actual data for 8k of storage.
> Changing this to an array, like so:
>
> CREATE TABLE t1 (
>     c1 int[]
> );
>
> and then storing 3-4 tuples per 8k page (keeping each tuple under the 2k
> TOAST limit) led to being able to store something like 450 ints per
> tuple, with a subsequent result of 1800 ints per page and ~7.2k worth of
> actual data for 8k of storage, which was much more efficient for
> storage.
>
> Of course, the tuple header is actually useful data in many
> environments - if you go with this approach then you have to work out how
> to deal with the fact that a given tuple is either visible or not, and
> all the ints in the array for that tuple are all visible and that an
> update to that tuple locks the entire tuple and that set of ints, etc.
> If the data isn't changed after being loaded and you're able to load an
> entire tuple all at once then this could work.
>
> Note that arrays aren't more efficient than just using individual
> columns, and this technique is only going to be helpful if the tuple
> overhead in your situation is a large portion of the data and using this
> technique allows you to reduce the number of tuples stored.
>
> Thanks!
>
> Stephen

-- 
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Greetings, Lutz!

Please don't top-post on the PG mailing lists; our style is to reply in-line.

* Lutz Fischer (l.fischer@ed.ac.uk) wrote:
> I often need some data from [s] where I don't care about [sp]. So in
> how far does having these arrays a part of [s] would make these
> queries slower. Or would be better to store the array data in a
> separate table e.g. have [s] as it is now but turn [sp] into an
> array aggregated table.

If that's the case then you would probably be better off putting the arrays into an independent table, yes.

Thanks!

Stephen
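A minimal sketch of that restructuring, given the schema from the start of the thread (the table name sp_agg and the float4/float8 choices are assumptions, not from the thread):

```
-- Hypothetical array-aggregated side table; [s] stays unchanged.
-- Name (sp_agg) and float4/float8 types are assumptions:
-- float4 = "float"/real, float8 = "double"/double precision.
CREATE TABLE sp_agg (
    sid bigint PRIMARY KEY REFERENCES s (id),
    i   float4[],
    m   float8[]
);

-- Populate once from the existing [sp], since its data never changes.
-- Ordering by sp.id keeps i and m element positions aligned.
INSERT INTO sp_agg (sid, i, m)
SELECT sid,
       array_agg(i::float4 ORDER BY id),
       array_agg(m::float8 ORDER BY id)
FROM sp
GROUP BY sid;
```

With this layout, queries that touch only [s] are unaffected, and reading a whole block is a single-row lookup on sp_agg by sid rather than ~120 rows from [sp].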