Re: Batch insert in CTAS/MatView code - Mailing list pgsql-hackers

From: Paul Guo
Subject: Re: Batch insert in CTAS/MatView code
Msg-id: CAEET0ZEpXXsszcnMTHpxNY2RMCiUsYPJePWL7Fh1by3g3Ru-ww@mail.gmail.com
In response to: Re: Batch insert in CTAS/MatView code (Paul Guo <pguo@pivotal.io>)
List: pgsql-hackers


On Mon, Jun 17, 2019 at 8:53 PM Paul Guo <pguo@pivotal.io> wrote:
Hi all,

I had been working on other things until recently, when I restarted this work, profiling and refactoring the code.
It's been a long time since the last patch was proposed, so the new patch was first refactored to account for commit
4da597edf1bae0cf0453b5ed6fc4347b6334dfe1 (Make TupleTableSlots extensible, finish split of existing slot type).

Since intorel_receive() now takes a TupleTableSlot argument instead of a HeapTuple, we cannot get the
tuple length directly. The patch computes the tuple length exactly when all columns have fixed widths; otherwise
it estimates an average tuple length from the first MAX_MULTI_INSERT_SAMPLES (defined as 1000) tuples
and uses that to track the total length of the tuples in a batch.

I noticed that batch insert sometimes needs an additional memory copy compared with "single insert"
(which is probably why we previously saw slight regressions), so a good solution seems to be to fall back
to "single insert" when the tuple length exceeds a threshold. I set the threshold to 2000 after some quick testing.

To make the test stable and strict, I run a checkpoint before each CTAS. The test script looks like this:
 
checkpoint;
\timing
create table tt as select a,b,c from t11;
\timing
drop table tt;

Previously I tested only BufferHeapTupleTableSlot (i.e. create table tt as select * from t11);
this time I additionally test VirtualTupleTableSlot (i.e. create table tt as select a,b,c from t11),
since VirtualTupleTableSlot seems to be very common in real cases.

I tested four kinds of tables; see the SQL below.

-- tuples with small size.
create table t11 (a int, b int, c int, d int);
insert into t11 select s,s,s,s from generate_series(1, 10000000) s;
analyze t11;

-- tuples that are untoasted and tuple size is 1984 bytes.
create table t12 (a name, b name, c name, d name, e name, f name, g name, h name, i name, j name, k name, l name, m name, n name, o name, p name, q name, r name, s name, t name, u name, v name, w name, x name, y name, z name, a1 name, a2 name, a3 name, a4 name, a5 name);
insert into t12 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e' from generate_series(1, 500000);
analyze t12;

-- tuples that are untoasted and tuple size is 2112 bytes.
create table t13 (a name, b name, c name, d name, e name, f name, g name, h name, i name, j name, k name, l name, m name, n name, o name, p name, q name, r name, s name, t name, u name, v name, w name, x name, y name, z name, a1 name, a2 name, a3 name, a4 name, a5 name, a6 name, a7 name);
insert into t13 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g' from generate_series(1, 500000);
analyze t13;

-- tuples that are toastable and tuple compressed size is 1084.
create table t14 (a text, b text, c text, d text, e text, f text, g text, h text, i text, j text, k text, l text, m text, n text, o text, p text, q text, r text, s text, t text, u text, v text, w text, x text, y text, z text);
insert into t14 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i from (select repeat('123456789', 10000) from generate_series(1,5000)) i;
analyze t14;


I also tested two scenarios for each test.

One is to drop all kernel caches (page cache, dentries and inodes on Linux) using the command below and then run the test:
    sync; echo 3 > /proc/sys/vm/drop_caches
After all tests have run once, every relation file is in the kernel cache (my test system has enough memory to hold all of the relation files), and then I run the tests again. I test both because in a real scenario the result should fall somewhere between these two. Also, I rerun each test and report the averages as the experiment results. Below are some results:


scenario 1: all related kernel caches are cleaned up (the first two columns are times in seconds).

baseline  patch   diff%    SQL
10.1      5.57    44.85%   create table tt as select * from t11;
10.7      5.52    48.41%   create table tt as select a,b,c from t11;
9.57      10.2    -6.58%   create table tt as select * from t12;
9.64      8.63    10.48%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
14.2      14.46   -1.83%   create table tt as select * from t13;
11.88     12.05   -1.43%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
3.17      3.25    -2.52%   create table tt as select * from t14;
2.93      3.12    -6.48%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;
scenario 2: all related kernel caches are populated by the previous tests.

baseline  patch   diff%    SQL
9.6       4.97    48.23%   create table tt as select * from t11;
10.41     5.32    48.90%   create table tt as select a,b,c from t11;
9.12      9.52    -4.38%   create table tt as select * from t12;
9.66      8.6     10.97%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
13.56     13.6    -0.30%   create table tt as select * from t13;
11.36     11.7    -2.99%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
3.08      3.13    -1.62%   create table tt as select * from t14;
2.95      3.03    -2.71%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;

From the above we can draw some tentative conclusions:

1. t11: for tables with small tuples, batch insert improves performance a lot (40%+).

2. t12: for BufferHeapTupleTableSlot the patch slows things down by 4.x%-6.x%, but for VirtualTupleTableSlot it improves performance by 10.x%.
Looking at execTuples.c, this appears to be closely related to the additional memory copy. VirtualTupleTableSlot seems more common than
BufferHeapTupleTableSlot, so the current code should probably be fine for most real cases. Alternatively, we could decide whether to
multi-insert based on the input slot type (a rough sketch follows below), though that seems ugly in code; or we could lower the threshold further
so that "create table tt as select * from t12;" also improves, although that hurts the VirtualTupleTableSlot case.
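
For reference, such a slot-type check might look like the fragment below. TTS_IS_VIRTUAL() is the macro from executor/tuptable.h; use_multi_insert is a hypothetical field name for illustration only.

/*
 * Hypothetical alternative (not in the attached patch): choose the insert
 * path from the incoming slot's type. A buffer heap slot pays an extra
 * copy when its tuple is moved into the batch's slots, so prefer the
 * multi-insert path only for virtual slots.
 */
myState->use_multi_insert = TTS_IS_VIRTUAL(slot);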


To alleviate this, I tuned MAX_TUP_LEN_FOR_MULTI_INSERT a bit, lowering it from 2000 to 1600. With a table of 24 name-typed columns (total size 1536), I tried both:
case1: create table tt as select * from t12;
case2: create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w from t12;

The patch improves performance in both cases. Note that this MAX_TUP_LEN_FOR_MULTI_INSERT change should not, in theory, affect the earlier t11, t13 and t14 results, since their code paths are unchanged.

kernel caches cleaned up:

        baseline(s)  patch(s)  diff%
case1:  7.65         7.30       4.6%
case2:  7.75         6.80      12.2%

relation files are in kernel cache:

        baseline(s)  patch(s)  diff%
case1:  7.09         6.66       6.1%
case2:  7.49         6.83       8.8%

We do not need to search for a larger threshold that merely brings the case1 improvement down to zero: the ideal threshold will differ a bit across test environments, so it should be a rough value, and 1600 seems to benefit most cases.

I've attached the v3 patch, which contains just the MAX_TUP_LEN_FOR_MULTI_INSERT change.

Thanks.

 
3. For t13, the new code still uses single insert, so the difference should be small. I just wanted to see the regression even when we use "single insert".

4. For the toast case t14, the degradation is small; not a big deal.

By the way, have we tried, or thought about, allowing better prefetch on Linux for seqscans, i.e. passing POSIX_FADV_SEQUENTIAL to posix_fadvise() to enlarge the kernel readahead window? Presumably this would help when sequential tuple processing is faster than the default kernel readahead setting.
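
For what it's worth, the hint itself would be nearly a one-liner; a minimal sketch (not part of this patch; error handling omitted):

#include <fcntl.h>

static void
advise_sequential_read(int fd)
{
#ifdef POSIX_FADV_SEQUENTIAL
    /* offset 0 and len 0 apply the hint to the whole file */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif
}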


The v2 patch is attached.


On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 06/03/2019 22:06, Paul Guo wrote:
> The patch also modifies heap_multi_insert() a bit, as a further
> code-level optimization: it uses static memory instead of a memory
> context and dynamic allocation.

If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
That is still leaked to the current memory context.

Leaking into the current memory context is not a bad thing, because
resetting a memory context is faster than doing a lot of pfree() calls.
The callers just need to be prepared for that, and use a short-lived
memory context.
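
A minimal sketch of that pattern, assuming the receiver keeps a dedicated context in a hypothetical batchcontext field (the other fields shown are DR_intorel's existing ones):

#include "postgres.h"
#include "access/heapam.h"
#include "utils/memutils.h"

/* At receiver startup: create a context for per-batch allocations. */
myState->batchcontext = AllocSetContextCreate(CurrentMemoryContext,
                                              "CTAS batch context",
                                              ALLOCSET_DEFAULT_SIZES);

/* When flushing a buffered batch: */
oldcontext = MemoryContextSwitchTo(myState->batchcontext);
heap_multi_insert(myState->rel, slots, nslots,
                  myState->output_cid, myState->ti_options,
                  myState->bistate);
MemoryContextSwitchTo(oldcontext);

/*
 * A single reset releases everything leaked above, e.g. the tuples
 * palloc'd by heap_prepare_insert() for toasting; this is cheaper than
 * issuing many individual pfree() calls.
 */
MemoryContextReset(myState->batchcontext);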

> By the way, while looking at the code, I noticed that there are 9 local
> arrays with large lengths in toast_insert_or_update(), which seems to be a
> risk of stack overflow. Maybe we should make them static or global.

Hmm. We currently reserve 512 kB between the kernel's limit and the
limit we check in check_stack_depth(); see STACK_DEPTH_SLOP. Those
arrays add up to 52800 bytes on a 64-bit machine, if I did my math
right. So there's still a lot of headroom. I agree that it nevertheless
seems a bit excessive, though.

> With the patch,
>
> Time: 4728.142 ms (00:04.728)
> Time: 14203.983 ms (00:14.204)
> Time: 1008.669 ms (00:01.009)
>
> Baseline,
> Time: 11096.146 ms (00:11.096)
> Time: 13106.741 ms (00:13.107)
> Time: 1100.174 ms (00:01.100)

Nice speedup!

> While for toast and large column sizes there is a < 10% decrease, for
> small column sizes the improvement is very good. Actually, if I hardcode
> the batch count as 4, all test cases are better, but the improvement for
> small column sizes is smaller than with the current patch. The number 4
> is pretty case-specific, so I cannot hardcode it in the patch. Of course
> we could tune it further, but the current value seems to be a good
> trade-off?

Have you done any profiling, on why the multi-insert is slower with
large tuples? In principle, I don't see why it should be slower.

- Heikki
