Re: Postgres, fsync, and OSs (specifically linux) - Mailing list pgsql-hackers

From Dmitry Dolgov
Subject Re: Postgres, fsync, and OSs (specifically linux)
Date
Msg-id CA+q6zcV6Ckt0r3AgCzeqt74MR78u3p0+Nr6FNv===NuD3XzCTA@mail.gmail.com
In response to Re: Postgres, fsync, and OSs (specifically linux)  (Andres Freund <andres@anarazel.de>)
Responses Re: Postgres, fsync, and OSs (specifically linux)  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
> On 22 May 2018 at 20:59, Andres Freund <andres@anarazel.de> wrote:
> On 2018-05-22 20:54:46 +0200, Dmitry Dolgov wrote:
>> > On 22 May 2018 at 18:47, Andres Freund <andres@anarazel.de> wrote:
>> > On 2018-05-22 08:57:18 -0700, Andres Freund wrote:
>> >> Hi,
>> >>
>> >>
>> >> On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote:
>> >> > Thanks for the patch. Out of curiosity I tried to play with it a bit.
>> >>
>> >> Thanks.
>> >>
>> >>
>> >> > `pgbench -i -s 100` actually hung on my machine, because the
>> >> > copy process ended up waiting after `pg_uds_send_with_fd`
>> >> > had
>> >>
>> >> Hm, that had worked at some point...
>> >>
>> >>
>> >> >     errno == EWOULDBLOCK || errno == EAGAIN
>> >> >
>> >> > as well as the checkpointer process.
>> >>
>> >> What do you mean by that last sentence?
>>
>> To investigate what's happening I attached gdb to two processes: the COPY
>> process from pgbench and the checkpointer (since I assumed it may be involved).
>> Both were waiting in WaitLatchOrSocket right after SendFsyncRequest.
>
> Huh? Checkpointer was in SendFsyncRequest()? Could you share the
> backtrace?

Well, that's what I've got from gdb:

    #0  0x00007fae03fae9f3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
    #1  0x000000000077a979 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffe37529ec0, cur_timeout=-1, set=0x23cddf8) at latch.c:1048
    #2  WaitEventSetWait (set=set@entry=0x23cddf8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffe37529ec0, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=0) at latch.c:1000
    #3  0x000000000077ad08 in WaitLatchOrSocket (latch=latch@entry=0x0, wakeEvents=wakeEvents@entry=4, sock=8, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=0) at latch.c:385
    #4  0x00000000007152cb in SendFsyncRequest (request=request@entry=0x7ffe37529f40, fd=fd@entry=-1) at checkpointer.c:1345
    #5  0x0000000000716223 in AbsorbAllFsyncRequests () at checkpointer.c:1207
    #6  0x000000000079a5f0 in mdsync () at md.c:1339
    #7  0x000000000079c672 in smgrsync () at smgr.c:766
    #8  0x000000000076dd53 in CheckPointBuffers (flags=flags@entry=64) at bufmgr.c:2581
    #9  0x000000000051c681 in CheckPointGuts (checkPointRedo=722254352, flags=flags@entry=64) at xlog.c:9079
    #10 0x0000000000523c4a in CreateCheckPoint (flags=flags@entry=64) at xlog.c:8863
    #11 0x0000000000715f41 in CheckpointerMain () at checkpointer.c:494
    #12 0x00000000005329f4 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffe3752a220) at bootstrap.c:451
    #13 0x0000000000720c28 in StartChildProcess (type=type@entry=CheckpointerProcess) at postmaster.c:5340
    #14 0x0000000000721c23 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2875
    #15 <signal handler called>
    #16 0x00007fae03fa45b3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
    #17 0x0000000000722968 in ServerLoop () at postmaster.c:1679
    #18 0x0000000000723cde in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x23a00e0) at postmaster.c:1388
    #19 0x000000000068979f in main (argc=3, argv=0x23a00e0) at main.c:228

>> >> > Looks like with the default
>> >> > configuration and `max_wal_size=1GB` it writes more than reads to a
>> >> > socket, and a buffer eventually becomes full.
>> >>
>> >> That's intended to then wake up the checkpointer immediately, so it can
>> >> absorb the requests.  So something isn't right yet.
>> >
>> > Doesn't hang here, but it's way too slow.
>>
>> Yep, in my case it was also getting slower, but eventually hung.
>>
>> > Reason for that is that I've wrongly resolved a merge conflict. Attached is a
>> > fixup patch - does that address the issue for you?
>>
>> Hm...is it the correct patch? I see the same change already committed in
>> 8c3debbbf61892dabd8b6f3f8d55e600a7901f2b, so I can't really apply it.
>
> Yea, sorry for that. Too many files in my patch directory... Right one
> attached.

Yes, this patch solves the problem, thanks.
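
For anyone following along, the condition behind the hang (a sender on a
non-blocking Unix domain socket writing faster than the other side reads,
until `send` fails with EWOULDBLOCK/EAGAIN) can be reproduced outside
Postgres. This is only a minimal sketch in Python, not code from the patch:
the helper names `fill_until_blocked`, `drain` and `is_writable` are made up
for illustration, and the socket pair merely stands in for the
backend-to-checkpointer channel.

```python
import errno
import select
import socket


def fill_until_blocked(sock, chunk=b"x" * 4096):
    """Send on a non-blocking socket until the kernel buffer is full.

    Returns (bytes_sent, error_number_of_the_failed_send).
    """
    sent = 0
    while True:
        try:
            sent += sock.send(chunk)
        except BlockingIOError as exc:
            # Python raises BlockingIOError for EAGAIN / EWOULDBLOCK.
            return sent, exc.errno


def drain(sock, limit):
    """Read up to `limit` bytes from a non-blocking socket; return bytes read."""
    received = 0
    while received < limit:
        try:
            data = sock.recv(min(65536, limit - received))
        except BlockingIOError:
            break  # nothing left to read right now
        if not data:
            break  # peer closed the connection
        received += len(data)
    return received


def is_writable(sock):
    """Poll (zero timeout) whether the socket would accept more data."""
    _, writable, _ = select.select([], [sock], [], 0)
    return bool(writable)


# A connected pair of Unix domain stream sockets (assumed stand-in for the
# real channel between a backend and the checkpointer).
writer, reader = socket.socketpair()
writer.setblocking(False)
reader.setblocking(False)

sent, err = fill_until_blocked(writer)   # nobody reads: buffer fills up
blocked = not is_writable(writer)        # sender would now have to wait
drained = drain(reader, sent)            # the reader absorbs the backlog
writable_again = is_writable(writer)     # sender can make progress again

writer.close()
reader.close()
```

The sender can only make progress again once the reader absorbs the backlog.
If the reader is itself stuck waiting to send, as the checkpointer is in the
backtrace above, neither side advances.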

