Thread: Excessive WAL generation and related performance issue

Excessive WAL generation and related performance issue

From
Joe Conway
Date:

I have run into a situation where bulk loading a table with fairly
narrow rows and two indexes causes WAL to be generated at about a 20:1
or higher ratio to the actual heap data (table plus indexes).

There are 560 million loaded rows which ultimately produce a table of
about 50 GB and indexes consuming about 75 GB for a grand total of 125
GB table+index heap size. This generates over 3 TB of WAL. The loading
progresses fairly fast at first and then drops off sharply so that the
load takes 4 plus days on production hardware. Near the end of the
load the heap grows very slowly -- measured in tens of MBs per hour.

I have reproduced the case on a test machine like so:

create table foo
( pid varchar(11) not null, sprid varchar(5) not null, cid varchar(11) not null,
  csdt date not null, pcind char(1) not null, cedt date,
  csnum int2 not null default 767, rcd char(2) not null default '99'::bpchar,
  rrind char(1) not null default 'K'::bpchar,
  its timestamp not null default now(), luts timestamp not null default now(),
  primary key (pid, sprid, cid, csdt, rcd)
);
create index fooidx on foo(cid,pid);

-- creates 77GB file for test load
copy
( select (id % 2604380)::text, (id % 30)::text, (id % 7906790)::text,
         now() - ((id % 8777)::text || ' days')::interval, (id % 3)::text,
         now() - ((id % 6351)::text || ' days')::interval, (id % 5)::int2,
         (id % 27)::text, (id % 2)::text,
         now() - ((id % 16922)::text || ' seconds')::interval,
         now() - ((id % 27124)::text || ' seconds')::interval
  from generate_series(10000000000, 10000000000 + 559245675) as g(id)
) to '/home/postgres/foo.dump';


grep -E "^($|#|\s.*#?)" -v $PGDATA/postgresql.conf
max_connections = 1200
superuser_reserved_connections = 30
shared_buffers = 12288MB
work_mem = 8MB
maintenance_work_mem = 128MB
wal_level = hot_standby
checkpoint_segments = 96
checkpoint_timeout = 10min
checkpoint_completion_target = 0.7
archive_mode = on
archive_command = 'cp -i %p /home/postgres/archive/%f </dev/null'
max_wal_senders = 5
wal_keep_segments = 256
log_line_prefix = '%m;'

psql --echo-all \
     --single-transaction \
     --set ON_ERROR_STOP=on \
     --dbname=test \
     --file=/home/postgres/foo.dump

I had to kill it at just over halfway done because I ran out of disk
space for WAL. This chart shows the growth of heap and WAL against the
number of rows copied over time (I added an elog LOG call to get
messages during XLogInsert).
 http://www.joeconway.com/presentations/test01.png

Also worth noting is that once this thing gets going, using "perf top"
I could see XLogInsert consuming 20-40% of all cycles, and the vast
majority of that was going into the CRC calculation.


I realize there are many things that can be done to improve my
specific scenario, e.g. drop indexes before loading, change various
configs, etc. My purpose for this post is to ask if it is really
expected to get over 20 times as much WAL as heap data?
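
For reference, the kind of workaround I mean is roughly the following,
an untested sketch that assumes the default primary key constraint name
foo_pkey:

alter table foo drop constraint foo_pkey;
drop index fooidx;
copy foo from '/home/postgres/foo.dump';
alter table foo add primary key (pid, sprid, cid, csdt, rcd);
create index fooidx on foo(cid, pid);

That sidesteps the per-row index WAL entirely, but it also means doing
without the indexes for the duration of the load.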

Oh, and this is all on postgres 9.2.latest. I have not yet tried to
reproduce on git master but intend to do so.

Thoughts?

Thanks,

Joe


-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support



Re: Excessive WAL generation and related performance issue

From
Andres Freund
Date:
Hi,

On 2014-04-14 14:33:03 -0700, Joe Conway wrote:
> checkpoint_segments = 96
> checkpoint_timeout = 10min

> I realize there are many things that can be done to improve my
> specific scenario, e.g. drop indexes before loading, change various
> configs, etc. My purpose for this post is to ask if it is really
> expected to get over 20 times as much WAL as heap data?

I'd bet a large percentage of this will be full page images of the
index. The values you index are essentially distributed over the whole
index, so you'll modify the same index pages repeatedly. But often
enough it won't be in the same checkpoint and thus will create full page
images.

I bet you'll see noticeably (while still not great) better performance
by setting checkpoint_timeout to an hour (with a corresponding increase
in checkpoint_segments).

Have you checked how often checkpoints are actually created? I'd bet
it's far more frequent than every 10min with that _segments setting and
such a load.
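
A quick way to check is the cumulative counters in pg_stat_bgwriter,
where checkpoints_req counts the ones forced by WAL volume (a rough
sketch; compare two readings taken some minutes apart during the load):

select checkpoints_timed, checkpoints_req, stats_reset
from pg_stat_bgwriter;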

Greetings,

Andres Freund



Re: Excessive WAL generation and related performance issue

From
Jim Nasby
Date:
On 4/14/14, 4:50 PM, Andres Freund wrote:
> Hi,
>
> On 2014-04-14 14:33:03 -0700, Joe Conway wrote:
>> checkpoint_segments = 96
>> checkpoint_timeout = 10min
>
>> I realize there are many things that can be done to improve my
>> specific scenario, e.g. drop indexes before loading, change various
>> configs, etc. My purpose for this post is to ask if it is really
>> expected to get over 20 times as much WAL as heap data?
>
> I'd bet a large percentage of this will be full page images of the
> index. The values you index are essentially distributed over the whole
> index, so you'll modify the same index pages repeatedly. But often
> enough it won't be in the same checkpoint and thus will create full page
> images.

My thought exactly...

ISTM that we should be able to push all the index inserts to the end of
the transaction. That should greatly reduce the amount of full page
writes. That would also open the door for doing all the index inserts
in parallel.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Excessive WAL generation and related performance issue

From
Joe Conway
Date:

On 04/14/2014 03:17 PM, Jim Nasby wrote:
> On 4/14/14, 4:50 PM, Andres Freund wrote:
>> On 2014-04-14 14:33:03 -0700, Joe Conway wrote:
>>> I realize there are many things that can be done to improve my 
>>> specific scenario, e.g. drop indexes before loading, change
>>> various configs, etc. My purpose for this post is to ask if it
>>> is really expected to get over 20 times as much WAL as heap
>>> data?
>> 
>> I'd bet a large percentage of this will be full page images of
>> the index. The values you index are essentially distributed over
>> the whole index, so you'll modify the same index pages
>> repeatedly. But often enough it won't be in the same checkpoint
>> and thus will create full page images.
> 
> My thought exactly...
> 
> ISTM that we should be able to push all the index inserts to the
> end of the transaction. That should greatly reduce the amount of
> full page writes. That would also open the door for doing all the
> index inserts in parallel.

That's the thing. I'm sure there is tuning and other things to improve
this particular case, but creating over 20 times as much WAL as real
data seems like pathological behavior to me.

Joe

-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support



Re: Excessive WAL generation and related performance issue

From
Jim Nasby
Date:
On 4/14/14, 5:51 PM, Joe Conway wrote:
>
> On 04/14/2014 03:17 PM, Jim Nasby wrote:
>> On 4/14/14, 4:50 PM, Andres Freund wrote:
>>> On 2014-04-14 14:33:03 -0700, Joe Conway wrote:
>>>> I realize there are many things that can be done to improve my
>>>> specific scenario, e.g. drop indexes before loading, change
>>>> various configs, etc. My purpose for this post is to ask if it
>>>> is really expected to get over 20 times as much WAL as heap
>>>> data?
>>>
>>> I'd bet a large percentage of this will be full page images of
>>> the index. The values you index are essentially distributed over
>>> the whole index, so you'll modify the same index pages
>>> repeatedly. But often enough it won't be in the same checkpoint
>>> and thus will create full page images.
>>
>> My thought exactly...
>>
>> ISTM that we should be able to push all the index inserts to the
>> end of the transaction. That should greatly reduce the amount of
>> full page writes. That would also open the door for doing all the
>> index inserts in parallel.
>
> That's the thing. I'm sure there is tuning and other things to improve
> this particular case, but creating over 20 times as much WAL as real
> data seems like pathological behavior to me.

Can you take a look at what's actually going into WAL when the wheels
fall off? I think it should be pretty easy to test the theory that it's
a ton of full page writes of index leaf pages...
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Excessive WAL generation and related performance issue

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-04-14 14:33:03 -0700, Joe Conway wrote:
>> checkpoint_segments = 96
>> checkpoint_timeout = 10min

> I bet you'll see noticeably - while still not great - better performance
> by setting checkpoint_timeout to an hour (with a corresponding increase
> in checkpoint_segments).
> Have you checked how often checkpoints are actually created? I'd bet
> it's far more frequent than every 10min with that _segments setting and
> such a load.

My thoughts exactly.  It's not hard to blow through WAL at multiple
megabytes per second with modern machines.  I'd turn on checkpoint logging
and then do whatever you need to do to get the actual intercheckpoint time
up to 10-15 minutes at least.
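
As a rough back-of-the-envelope, using the ~3 TB over ~4 days figures
from the first message and assuming WAL was written evenly over the
load, checkpoint_segments = 96 forces a checkpoint about every 1.5 GB
of WAL, i.e. every few minutes:

select pg_size_pretty(96 * 16 * 1024 * 1024::bigint) as wal_per_forced_checkpoint,
       round((3 * 1024 * 1024 / (4 * 86400.0))::numeric, 1) as approx_wal_mb_per_sec,
       round(((96 * 16) / (3 * 1024 * 1024 / (4 * 86400.0)))::numeric) as approx_secs_between_checkpoints;
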
        regards, tom lane



Re: Excessive WAL generation and related performance issue

From
Joe Conway
Date:

On 04/14/2014 04:17 PM, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2014-04-14 14:33:03 -0700, Joe Conway wrote:
>>> checkpoint_segments = 96
>>> checkpoint_timeout = 10min
> 
>> I bet you'll see noticeably - while still not great - better
>> performance by setting checkpoint_timeout to an hour (with a
>> corresponding increase in checkpoint_segments). Have you checked
>> how often checkpoints are actually created? I'd bet it's far more
>> frequent than every 10min with that _segments setting and such a
>> load.
> 
> My thoughts exactly.  It's not hard to blow through WAL at
> multiple megabytes per second with modern machines.  I'd turn on
> checkpoint logging and then do whatever you need to do to get the
> actual intercheckpoint time up to 10-15 minutes at least.

That'll help performance, but let's say I generally keep WAL files for
PITR and don't turn that off before starting -- shouldn't I be very
surprised to need over 3TB of archive storage when loading a 50GB
table with a couple of indexes?

Joe


-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support



Re: Excessive WAL generation and related performance issue

From
Andres Freund
Date:
On 2014-04-14 16:22:48 -0700, Joe Conway wrote:
> 
> On 04/14/2014 04:17 PM, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> >> On 2014-04-14 14:33:03 -0700, Joe Conway wrote:
> >>> checkpoint_segments = 96
> >>> checkpoint_timeout = 10min
> > 
> >> I bet you'll see noticeably - while still not great - better
> >> performance by setting checkpoint_timeout to an hour (with a
> >> corresponding increase in checkpoint_segments). Have you checked
> >> how often checkpoints are actually created? I'd bet it's far more
> >> frequent than every 10min with that _segments setting and such a
> >> load.
> > 
> > My thoughts exactly.  It's not hard to blow through WAL at
> > multiple megabytes per second with modern machines.  I'd turn on
> > checkpoint logging and then do whatever you need to do to get the
> > actual intercheckpoint time up to 10-15 minutes at least.
> 
> That'll help performance, but let's say I generally keep WAL files for
> PITR and don't turn that off before starting -- shouldn't I be very
> surprised to need over 3TB of archive storage when loading a 50GB
> table with a couple of indexes?

The point is that more frequent checkpoints will increase the WAL volume
*significantly* because more full page writes will have to be generated.
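
To put a number on that, here is a rough sketch built only from the
figures in the first message (50 GB of heap for 560 million rows): the
heap cost of one row is about 96 bytes, but if both of its index
insertions land on pages not yet dirtied since the last checkpoint,
that same row also drags two 8 kB full page images into WAL:

select round((50.0 * 1024 * 1024 * 1024) / 560000000) as heap_bytes_per_row,
       2 * 8192 as worst_case_fpi_bytes_per_row,
       round((2 * 8192) / ((50.0 * 1024 * 1024 * 1024) / 560000000)) as worst_case_wal_to_heap_ratio;

So with scattered index keys and frequent checkpoints, a 20:1 overall
ratio is well within what the worst case allows.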

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Excessive WAL generation and related performance issue

From
Joe Conway
Date:

On 04/14/2014 04:25 PM, Andres Freund wrote:
> On 2014-04-14 16:22:48 -0700, Joe Conway wrote:
>> That'll help performance, but let's say I generally keep WAL files
>> for PITR and don't turn that off before starting -- shouldn't I
>> be very surprised to need over 3TB of archive storage when
>> loading a 50GB table with a couple of indexes?
> 
> The point is that more frequent checkpoints will increase the WAL
> volume *significantly* because more full page writes will have to
> be generated.

OK, I'll see how much it can be brought down through checkpoint tuning
and report back.

Thanks!

Joe


-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support



Re: Excessive WAL generation and related performance issue

From
Stephen Frost
Date:
* Joe Conway (mail@joeconway.com) wrote:
> That's the thing. I'm sure there is tuning and other things to improve
> this particular case, but creating over 20 times as much WAL as real
> data seems like pathological behavior to me.

Setting things up such that you are updating a single value on each page
in an index during each checkpoint, which then requires basically
rewriting the entire index as full page writes at checkpoint time, is
definitely pathological behavior, but sadly, not behavior we are likely
to be able to fix.

This sounds like a great example of the unlogged table -> logged table
use-case and makes me wonder if we could provide an optimization similar
to the existing CREATE TABLE + COPY under wal_level = minimal case,
where we wouldn't WAL log anything for CREATE TABLE + COPY even when
wal_level is above minimal, until COMMIT, at which point we'll blast the
whole thing out in one shot.
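
For anyone following along, the existing optimization being referenced
looks roughly like this (a sketch, with a hypothetical table name); it
only applies when wal_level = minimal, i.e. no archiving or streaming,
and only because the table is created in the same transaction that
loads it:

begin;
create table foo_copy (like foo including defaults);
copy foo_copy from '/home/postgres/foo.dump';
commit;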

Another option that you might consider is ordering your input, if
possible, to improve the chances that the same page is changed multiple
times inside a given checkpoint, hopefully reducing the number of pages
changed.
Thanks,
    Stephen

Re: Excessive WAL generation and related performance issue

From
Joe Conway
Date:

On 04/14/2014 05:40 PM, Stephen Frost wrote:
> This sounds like a great example of the unlogged table -> logged
> table use-case and makes me wonder if we could provide an
> optimization similar to the existing CREATE TABLE + COPY under
> wal_level = minimal case, where we wouldn't WAL log anything for
> CREATE TABLE + COPY even when wal_level is above minimal, until
> COMMIT, at which point we'll blast the whole thing out in one
> shot.

I was definitely thinking that it would be nice if we could alter this
table to make it unlogged, load it, and then alter again to make it
logged. Obviously easier said than done though ;-)
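
Something along these lines is what I have in mind, as a purely
hypothetical sketch (foo_stage is a made-up staging table, and the
final ALTER is the piece that does not exist on 9.2). Presumably
flipping the table to logged would still have to write its contents to
WAL once, but only once, rather than page-by-page across thousands of
checkpoints:

create unlogged table foo_stage (like foo including defaults);
copy foo_stage from '/home/postgres/foo.dump';
alter table foo_stage add primary key (pid, sprid, cid, csdt, rcd);
create index foo_stage_idx on foo_stage (cid, pid);
alter table foo_stage set logged;  -- the missing piece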

> Another option that you might consider is ordering your input, if 
> possible, to improve the chances that the same page is changed
> multiple times inside a given checkpoint, hopefully reducing the
> number of pages changed.

Nice idea but doesn't work well when you have two indexes with
different first column correlations.

Joe

-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support



Re: Excessive WAL generation and related performance issue

From
Joe Conway
Date:

On 04/14/2014 04:34 PM, Joe Conway wrote:
> On 04/14/2014 04:25 PM, Andres Freund wrote:
>> On 2014-04-14 16:22:48 -0700, Joe Conway wrote:
>>> That'll help performance, but let's say I generally keep WAL
>>> files for PITR and don't turn that off before starting --
>>> shouldn't I be very surprised to need over 3TB of archive
>>> storage when loading a 50GB table with a couple of indexes?
> 
>> The point is that more frequent checkpoints will increase the
>> WAL volume *significantly* because more full page writes will
>> have to be generated.
> 
> OK, I'll see how much it can be brought down through checkpoint
> tuning and report back.

One more question before I get to that. I had applied the following
patch to XLogInsert

8<--------------------------
diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 2f71590..e39cd37 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -737,10 +737,12 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
        uint32          len,
                                write_len;
        unsigned        i;
+       unsigned        inorm;
        bool            updrqst;
        bool            doPageWrites;
        bool            isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
        uint8           info_orig = info;
+       uint32          xl_tot_len;

        /* cross-check on whether we should be here or not */
        if (!XLogInsertAllowed())
@@ -924,8 +926,23 @@ begin:;
         * header.
         */
        INIT_CRC32(rdata_crc);
+       i = 0;
+       inorm = 0;
        for (rdt = rdata; rdt != NULL; rdt = rdt->next)
+       {
                COMP_CRC32(rdata_crc, rdt->data, rdt->len);
+
+               if (rdt_lastnormal == rdt)
+               {
+                       inorm = i;
+                       i = 0;
+               }
+               else
+                       i++;
+       }
+       xl_tot_len = SizeOfXLogRecord + write_len;
+       if ((inorm + i) > 4 || xl_tot_len > 2000)
+               elog(LOG, "XLogInsert;tot_nml_blks;%d;tot_bkp_blks;%d;tot_Xlog_Len;%d", inorm, i, xl_tot_len);

        START_CRIT_SECTION();
8<--------------------------

The idea was to record number of normal and backup blocks, and total
size of the record. I have quite a few entries in the log from the
test run which are like:

8<--------------------------
2014-04-11 08:42:06.904 PDT;LOG:
XLogInsert;tot_nml_blks;4;tot_bkp_blks;5;tot_Xlog_Len;16168
2014-04-11 09:03:12.790 PDT;LOG:
XLogInsert;tot_nml_blks;4;tot_bkp_blks;5;tot_Xlog_Len;16172
2014-04-11 10:16:57.949 PDT;LOG:
XLogInsert;tot_nml_blks;3;tot_bkp_blks;5;tot_Xlog_Len;16150
8<--------------------------

and

8<--------------------------
2014-04-11 11:17:08.313 PDT;LOG:
XLogInsert;tot_nml_blks;4;tot_bkp_blks;6;tot_Xlog_Len;12332
2014-04-11 11:17:08.338 PDT;LOG:
XLogInsert;tot_nml_blks;4;tot_bkp_blks;6;tot_Xlog_Len;16020
2014-04-11 11:17:08.389 PDT;LOG:
XLogInsert;tot_nml_blks;4;tot_bkp_blks;6;tot_Xlog_Len;12332
8<--------------------------


In other words, based on my inserted logic, it appears that there are
5 and 6 backup blocks on a fairly regular basis.

However in xlog.h it says:
8<--------------------------
 * If we backed up any disk blocks with the XLOG record, we use flag
 * bits in xl_info to signal it.  We support backup of up to 4 disk
 * blocks per XLOG record.
8<--------------------------

So is my logic to record number of backup blocks wrong, or is the
comment wrong, or am I otherwise misunderstanding something?

Thanks,

Joe

-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support



Re: Excessive WAL generation and related performance issue

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> In other words, based on my inserted logic, it appears that there are
> 5 and 6 backup blocks on a fairly regular basis.

> However in xlog.h it says:
> 8<--------------------------
>  * If we backed up any disk blocks with the XLOG record, we use flag
>  * bits in xl_info to signal it.  We support backup of up to 4 disk
>  * blocks per XLOG record.
> 8<--------------------------

> So is my logic to record number of backup blocks wrong, or is the
> comment wrong, or am I otherwise misunderstanding something?

The comment is correct, so you did something wrong.  From memory,
there's a goto-label retry loop in that function; maybe you need
to zero your counters after the retry label?
        regards, tom lane



Re: Excessive WAL generation and related performance issue

From
Heikki Linnakangas
Date:
On 04/15/2014 11:53 PM, Joe Conway wrote:
> One more question before I get to that. I had applied the following
> patch to XLogInsert
>
> 8<--------------------------
> diff --git a/src/backend/access/transam/xlog.c
> b/src/backend/access/transam/xlog.c
> index 2f71590..e39cd37 100644
> - --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -737,10 +737,12 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData
> *rdata)
>          uint32          len,
>                                  write_len;
>          unsigned        i;
> +       unsigned        inorm;
>          bool            updrqst;
>          bool            doPageWrites;
>          bool            isLogSwitch = (rmid == RM_XLOG_ID && info ==
> XLOG_SWITCH);
>          uint8           info_orig = info;
> +       uint32          xl_tot_len;
>
>          /* cross-check on whether we should be here or not */
>          if (!XLogInsertAllowed())
> @@ -924,8 +926,23 @@ begin:;
>           * header.
>           */
>          INIT_CRC32(rdata_crc);
> +       i = 0;
> +       inorm = 0;
>          for (rdt = rdata; rdt != NULL; rdt = rdt->next)
> +       {
>                  COMP_CRC32(rdata_crc, rdt->data, rdt->len);
> +
> +               if (rdt_lastnormal == rdt)
> +               {
> +                       inorm = i;
> +                       i = 0;
> +               }
> +               else
> +                       i++;
> +       }
> +       xl_tot_len = SizeOfXLogRecord + write_len;
> +       if ((inorm + i) > 4 || xl_tot_len > 2000)
> +               elog(LOG,
> "XLogInsert;tot_nml_blks;%d;tot_bkp_blks;%d;tot_Xlog_Len;%d", inorm,
> i, xl_tot_len);
>
>          START_CRIT_SECTION();
> 8<--------------------------
>
> The idea was to record number of normal and backup blocks, and total
> size of the record. I have quite a few entries in the log from the
> test run which are like:
>
> 8<--------------------------
> 2014-04-11 08:42:06.904 PDT;LOG:
> XLogInsert;tot_nml_blks;4;tot_bkp_blks;5;tot_Xlog_Len;16168
> 2014-04-11 09:03:12.790 PDT;LOG:
> XLogInsert;tot_nml_blks;4;tot_bkp_blks;5;tot_Xlog_Len;16172
> 2014-04-11 10:16:57.949 PDT;LOG:
> XLogInsert;tot_nml_blks;3;tot_bkp_blks;5;tot_Xlog_Len;16150
> 8<--------------------------
>
> and
>
> 8<--------------------------
> 2014-04-11 11:17:08.313 PDT;LOG:
> XLogInsert;tot_nml_blks;4;tot_bkp_blks;6;tot_Xlog_Len;12332
> 2014-04-11 11:17:08.338 PDT;LOG:
> XLogInsert;tot_nml_blks;4;tot_bkp_blks;6;tot_Xlog_Len;16020
> 2014-04-11 11:17:08.389 PDT;LOG:
> XLogInsert;tot_nml_blks;4;tot_bkp_blks;6;tot_Xlog_Len;12332
> 8<--------------------------
>
>
> In other words, based on my inserted logic, it appears that there are
> 5 and 6 backup blocks on a fairly regular basis.
>
> However in xlog.h it says:
> 8<--------------------------
>   * If we backed up any disk blocks with the XLOG record, we use flag
>   * bits in xl_info to signal it.  We support backup of up to 4 disk
>   * blocks per XLOG record.
> 8<--------------------------
>
> So is my logic to record number of backup blocks wrong, or is the
> comment wrong, or am I otherwise misunderstanding something?

You're counting XLogRecData structs, not backup blocks. Each backup 
block typically consists of three XLogRecData structs, one to record a 
BkpBlock struct, one to record the data before the unused "hole" in the 
middle of the page, and one for after. Sometimes just two, if there is 
no hole to skip. See the loop just before the CRC calculation, that's 
where the XLogRecData entries for backup blocks are created.

- Heikki



Re: Excessive WAL generation and related performance issue

From
Joe Conway
Date:

On 04/15/2014 02:15 PM, Heikki Linnakangas wrote:
> You're counting XLogRecData structs, not backup blocks. Each
> backup block typically consists of three XLogRecData structs, one
> to record a BkpBlock struct, one to record the data before the
> unused "hole" in the middle of the page, and one for after.
> Sometimes just two, if there is no hole to skip. See the loop just
> before the CRC calculation, that's where the XLogRecData entries
> for backup blocks are created.
> 
> - Heikki

Ah, thanks for the explanation!

Joe

-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support