Re: COPY speedup - Mailing list pgsql-hackers

From Pierre Frédéric Caillaud
Subject Re: COPY speedup
Date
Msg-id op.uyl227c3cke6l8@soyouz
Whole thread Raw
In response to Re: COPY speedup  (Alvaro Herrera <alvherre@commandprompt.com>)
Responses Re: COPY speedup
List pgsql-hackers
In the previous mails I made a mistake, writing "MTuples/s" instead of  
"MDatums/s", sorry about that. It is the number of rows x columns. The  
title was wrong, but the data was right.

I've been doing some tests on COPY FROM ... BINARY.

- inlines in various pg_get* etc
- a faster buffer handling for copy
- that's about it...

In the below tables, you have "p17" (ie test patch 17, the last one) and  
straight postgres compared.

COPY annonces_2 FROM 'annonces.bin' BINARY  :  Time | Speedup |  Table |  KRows | MDatums | Name   (s) |         |
MB/s|     /s |      /s |
 
-------|---------|--------|--------|---------|---------------------------------------------------- 8.417 |  1.40 x |
38.70|  49.13 |    1.92 | 8.4.0 / p17
 
11.821 |     --- |  27.56 |  34.98 |    1.36 | 8.4.0 / compiled from source


COPY archive_data_2 FROM 'archive_data.bin' BINARY  :  Time | Speedup | Table |  KRows | MDatums | Name   (s) |
| MB/s |     /s |      /s |
 
-------|---------|-------|--------|---------|----------------------------------------------------
15.314 |  1.93 x | 25.94 | 172.88 |    4.84 | 8.4.0 / p17 COPY FROM BINARY  
all
29.520 |     --- | 13.46 |  89.69 |    2.51 | 8.4.0 / compiled from source


COPY test_one_int_2 FROM 'test_one_int.bin' BINARY  :  Time | Speedup |  Table |   KRows | MDatums | Name   (s) |
 |   MB/s |      /s |      /s |
 
-------|---------|--------|---------|---------|----------------------------------------------------
10.003 |  1.39 x |  30.63 |  999.73 |    1.00 | 8.4.0 / p17 COPY FROM  
BINARY all
13.879 |     --- |  22.08 |  720.53 |    0.72 | 8.4.0 / compiled from  
source


COPY test_many_ints_2 FROM 'test_many_ints.bin' BINARY  :  Time | Speedup | Table |  KRows | MDatums | Name   (s) |
   |  MB/s |     /s |      /s |
 
-------|---------|-------|--------|---------|---------------------------------------------------- 6.009 |  2.08 x |
21.31| 166.42 |    4.33 | 8.4.0 / p17 COPY FROM BINARY  
 
all
12.516 |     --- | 10.23 |  79.90 |    2.08 | 8.4.0 / compiled from source


I thought it might be interesting to get split timings of the various  
steps in COPY FROM, so I simply commented out bits of code and ran tests.

The "delta" columns are differences between two lines, that is the time  
taken in the step mentioned on the right.


reading data only = reading all the data and parsing it into chunks, doing  
everything until the RecvFunc is called.
RecvFuncs = same, + RecvFunc is called
heap_form_tuple = same + heap_form_tuple is called
triggers = same + triggers are applied
insert = actual tuple insertion
p17 = total time (post insert triggers, constraint check, etc)
  Time | Delta | Row delta | Datum delta | Name   (s) |   (s) |      (us) |        (us) |
-------|-------|-----------|-------------|---------------------- 1.311 |   --- |       --- |         --- | reading data
only4.516 | 3.205 |     7.750 |       0.199 | RecvFuncs 4.534 | 0.018 |     0.043 |       0.001 | heap_form_tuple 5.323
|0.789 |     1.908 |       0.049 | triggers 8.182 | 2.858 |     6.912 |       0.177 | insert 8.417 | 0.236 |     0.570
|      0.015 | p17
 


COPY archive_data_2 FROM 'archive_data.bin' BINARY  :  Time |  Delta | Row delta | Datum delta | Name   (s) |    (s) |
   (us) |        (us) |
 
-------|--------|-----------|-------------|--------------------- 4.729 |    --- |       --- |         --- | reading
dataonly 8.508 |  3.778 |     1.427 |       0.051 | RecvFuncs 8.567 |  0.059 |     0.022 |       0.001 |
heap_form_tuple
10.804 |  2.237 |     0.845 |       0.030 | triggers
14.475 |  3.671 |     1.386 |       0.050 | insert
15.314 |  0.839 |     0.317 |       0.011 | p17


COPY test_one_int_2 FROM 'test_one_int.bin' BINARY  :  Time | Delta | Row delta | Datum delta | Name   (s) |   (s) |
 (us) |        (us) |
 
-------|-------|-----------|-------------|---------------------- 1.247 |   --- |       --- |         --- | reading data
only1.745 | 0.498 |     0.050 |       0.050 | RecvFuncs 1.750 | 0.004 |     0.000 |       0.000 | heap_form_tuple 3.114
|1.364 |     0.136 |       0.136 | triggers 9.984 | 6.870 |     0.687 |       0.687 | insert
 
10.003 | 0.019 |     0.002 |       0.002 | p17


COPY test_many_ints_2 FROM 'test_many_ints.bin' BINARY  :  Time | Delta | Row delta | Datum delta | Name   (s) |   (s)
|     (us) |        (us) |
 
-------|-------|-----------|-------------|---------------------- 1.701 |   --- |       --- |         --- | reading data
only3.122 | 1.421 |     1.421 |       0.055 | RecvFuncs 3.129 | 0.008 |     0.008 |       0.000 | heap_form_tuple 3.754
|0.624 |     0.624 |       0.024 | triggers 5.639 | 1.885 |     1.885 |       0.073 | insert 6.009 | 0.370 |     0.370
|      0.014 | p17
 

We can see that :

- reading and parsing the data is still slow (actually, everything is  
copied something like 3-4 times)
- RecvFuncs take quite long, too
- triggers use some time, although the table has no triggers ....? This is  
suspicious...
- the actual insertion (which is really what we are interested in when  
loading a table) is actually very fast

Ideally in COPY the insertion time in the table should take most of the  
time used in the operation...


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Hot standby and synchronous replication status
Next
From: Josh Berkus
Date:
Subject: Re: Hot standby and synchronous replication status