Optimize external TOAST storage - Mailing list pgsql-hackers

From: davinder singh
Subject: Optimize external TOAST storage
Msg-id: CAHzhFSEQ8Vi8pEWeZERPajCbOfvvae2Si5imxftEGhXp6px5iQ@mail.gmail.com
List: pgsql-hackers
Hi,

For TOAST storage [1] in PostgreSQL, the attribute value is first compressed and then
divided into chunks. The problem with storing the compressed value is that, if compression
does not save enough space to reduce the number of chunks, we end up paying extra
decompression cost on every read.
Based on the discussion with Robert and Dilip, we tried to optimize this process. The idea
is to TOAST the compressed value only if it saves at least one chunk (2KB by default) of
disk storage; otherwise, store the uncompressed value. This way, we save the decompression
cost without losing much storage.
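
To make the rule concrete, here is a minimal SQL sketch of the chunk-count comparison.
It is for illustration only: the patch implements this inside the TOAST code path, and
keep_compressed as well as the 2000-byte default chunk size are assumptions standing in
for the real logic and TOAST_MAX_CHUNK_SIZE.

-- Illustration only: keep the compressed copy only if it needs at least one
-- fewer TOAST chunk than the uncompressed value would.
create or replace function keep_compressed(raw_size int, compressed_size int,
                                            chunk_size int default 2000)
returns boolean
language sql
as
$$
    select ceil(compressed_size::numeric / chunk_size)
         < ceil(raw_size::numeric / chunk_size);
$$;

-- 5000 bytes compressing to 2500: 2 chunks instead of 3, keep the compressed copy.
select keep_compressed(5000, 2500);   -- true
-- 5000 bytes compressing only to 4500: 3 chunks either way, store uncompressed.
select keep_compressed(5000, 4500);   -- false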

In our tests, we observed read performance improvements of up to 28% by giving up at most
TOAST_MAX_CHUNK_SIZE (2KB) bytes of storage on disk. The gain is larger for large
attributes (size > 4KB) because compression/decompression cost increases with size.
However, we found that this does not hold when the data compression ratio is high and the
value is barely large enough to be TOASTed. For example, in test 4.b below we got no
performance advantage, but the table grew by 42% because uncompressed values were stored.
Test Setup.

Create table t1_lz4 ( a text compression lz4, b text compression lz4);
-- Generate test data of approximately len_info bytes
create or replace function generate_att_data(len_info int)
returns text
language plpgsql
as
$$
declare
   value text;
begin
   select array_agg(md5(g::text))
   into value
   from generate_series(1, round(len_info/33)::int) g;
   return value;
end;
$$;
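
For example, the function returns the text form of an array of md5 strings, so the result
is roughly len_info bytes long:

select length(generate_att_data(1980));
-- 1980/33 = 60 md5 values of 32 characters each, plus separators and braces,
-- gives a string of about 1981 characters.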

--Test
Select b from t1_lz4;
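
One way to check how the values actually end up stored is with pg_column_size() and
pg_column_compression() (the latter is available in PostgreSQL 14 and later, which the
lz4 column compression above already requires):

-- pg_column_size() reflects the stored (possibly compressed) size of the datum;
-- pg_column_compression() returns the compression method used, or NULL if the
-- value is stored uncompressed.
select pg_column_size(b)        as stored_bytes,
       octet_length(b)          as uncompressed_bytes,
       pg_column_compression(b) as compression
from t1_lz4
limit 5;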

Test 1:
Data: rows 200000
insert into t1_lz4(a, b) select generate_att_data(364), repeat (generate_att_data(1980), 2);
Summary:
Attribute size: original: 7925 bytes, after compression: 7845 bytes
Time for select: head: 42  sec, patch: 37 sec, Performance Gain: 11%
table size: Head 1662 MB, Patch: 1662 MB

Test 2:
Data: rows 100000
insert into t1_lz4(a, b) select generate_att_data(364), generate_att_data(16505);
Summary:
Attribute size: original: 16505 bytes, after compression: 16050 bytes
Time for select: head: 35.6  sec, patch: 30 sec, Performance Gain: 14%
table size: Head 1636 MB, Patch: 1688 MB

Test 3:
Data: rows 50000
insert into t1_lz4(a, b) select generate_att_data(364), generate_att_data(31685);
Summary:
Attribute size: original: 31685 bytes, after compression: 30263 bytes
Time for select: head: 35.4  sec, patch: 25.5 sec, Performance Gain: 28%
table size: Head 1601 MB, Patch: 1601 MB

Test 4.a:
Data: rows 200000
insert into t1_lz4(a, b) select generate_att_data(11), repeat ('b', 250) || generate_att_data(3885);
Summary:
Attribute size: original: 3885 bytes, after compression: 3645 bytes
Time for select: head: 28 sec, patch: 26 sec, Performance Gain: 7%
table size: Head 872 MB, Patch: 872 MB

Test 4.b (High compression):
Data: rows 200000
insert into t1_lz4(a, b) select generate_att_data(364), repeat (generate_att_data(1980), 2);
Summary:
Attribute size: original: 3966 bytes, after compression: 2012 bytes
Time for select: head: 27  sec, patch: 26 sec, Performance Gain: 0%
table size: Head 612 MB, Patch: 872 MB

This is the worst case for this optimization, for two reasons.

First, the table size can grow by up to 50%, which happens when the compressed size is
about half of the original size and yet barely large enough for TOASTing. The table cannot
grow more than that, because if compression reduced the size any further it would also
reduce the number of chunks, and then the compressed value would be stored.

Second, there is not much gain in performance because of the small attribute size: almost
twice as many attributes fit on a page, so each page access can read twice the number of
rows, and small values also mean low compression/decompression costs.
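
As a rough illustration of why test 4.b cannot benefit (assuming a chunk size of about
2000 bytes standing in for the exact TOAST_MAX_CHUNK_SIZE):

-- Both forms of the test 4.b value need the same number of ~2KB chunks, so per
-- the rule above the value is stored uncompressed: the table grows while reads
-- gain almost nothing.
select ceil(3966 / 2000.0) as chunks_uncompressed,   -- 2
       ceil(2012 / 2000.0) as chunks_compressed;     -- 2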


We avoid such cases by applying the optimization only when the attribute size is greater than 4KB.


Regards,
Davinder
