Re: Logical Replication WIP - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: Logical Replication WIP
Msg-id 2c72e1cc-9f56-5137-749b-47e1903280bb@2ndquadrant.com
In response to Re: Logical Replication WIP  (Stas Kelvich <s.kelvich@postgrespro.ru>)
Responses Re: Logical Replication WIP  (Petr Jelinek <petr@2ndquadrant.com>)
List pgsql-hackers
On 15/08/16 15:51, Stas Kelvich wrote:
>> On 11 Aug 2016, at 17:43, Petr Jelinek <petr@2ndquadrant.com> wrote:
>>
>>>
>>> * Also I wasn’t able actually to run replication itself =) While regression tests passes, TAP
>>> tests and manual run stuck. pg_subscription_rel.substate never becomes ‘r’. I’ll investigate
>>> that more and write again.
>>
>> Interesting, please keep me posted. It's possible for tables to stay
>> in 's' state for some time if there is nothing happening on the
>> server, but that should not mean anything is stuck.
>
> Slightly played around, it seems that apply worker waits forever for substate change.
>
> (lldb) bt
> * thread #1: tid = 0x183e00, 0x00007fff88c7f2a2 libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>     frame #0: 0x00007fff88c7f2a2 libsystem_kernel.dylib`poll + 10
>     frame #1: 0x00000001017ca8a3 postgres`WaitEventSetWaitBlock(set=0x00007fd2dc816b30, cur_timeout=10000, occurred_events=0x00007fff5e7f67d8, nevents=1) + 51 at latch.c:1108
>     frame #2: 0x00000001017ca438 postgres`WaitEventSetWait(set=0x00007fd2dc816b30, timeout=10000, occurred_events=0x00007fff5e7f67d8, nevents=1) + 248 at latch.c:941
>     frame #3: 0x00000001017c9fde postgres`WaitLatchOrSocket(latch=0x000000010ab208a4, wakeEvents=25, sock=-1, timeout=10000) + 254 at latch.c:347
>     frame #4: 0x00000001017c9eda postgres`WaitLatch(latch=0x000000010ab208a4, wakeEvents=25, timeout=10000) + 42 at latch.c:302
>   * frame #5: 0x0000000101793352 postgres`wait_for_sync_status_change(tstate=0x0000000101e409b0) + 178 at tablesync.c:228
>     frame #6: 0x0000000101792bbe postgres`process_syncing_tables_apply(slotname="subbi", end_lsn=140734778796592) + 430 at tablesync.c:436
>     frame #7: 0x00000001017928c1 postgres`process_syncing_tables(slotname="subbi", end_lsn=140734778796592) + 81 at tablesync.c:518
>     frame #8: 0x000000010177b620 postgres`LogicalRepApplyLoop(last_received=140734778796592) + 704 at apply.c:1122
>     frame #9: 0x000000010177bef4 postgres`ApplyWorkerMain(main_arg=0) + 1044 at apply.c:1353
>     frame #10: 0x000000010174cb5a postgres`StartBackgroundWorker + 826 at bgworker.c:729
>     frame #11: 0x0000000101762227 postgres`do_start_bgworker(rw=0x00007fd2db700000) + 343 at postmaster.c:5553
>     frame #12: 0x000000010175d42b postgres`maybe_start_bgworker + 427 at postmaster.c:5761
>     frame #13: 0x000000010175bccf postgres`sigusr1_handler(postgres_signal_arg=30) + 383 at postmaster.c:4979
>     frame #14: 0x00007fff9ab2352a libsystem_platform.dylib`_sigtramp + 26
>     frame #15: 0x00007fff88c7e07b libsystem_kernel.dylib`__select + 11
>     frame #16: 0x000000010175d5ac postgres`ServerLoop + 252 at postmaster.c:1665
>     frame #17: 0x000000010175b2e0 postgres`PostmasterMain(argc=3, argv=0x00007fd2db403840) + 5968 at postmaster.c:1309
>     frame #18: 0x000000010169507f postgres`main(argc=3, argv=0x00007fd2db403840) + 751 at main.c:228
>     frame #19: 0x00007fff8d45c5ad libdyld.dylib`start + 1
> (lldb) p state
> (char) $1 = 'c'
> (lldb) p tstate->state
> (char) $2 = 'c'
>

Hmm, not sure why that is; it might be related to the lsn being 
reported wrong. Could you check what the lsn is there (either in 
tstate or in pg_subscription_rel)? Especially in comparison with what 
the sent_location is.
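
To be concrete, something along these lines should show it (a rough 
sketch: substate is the column you mentioned above, while sublsn is 
just my guess at the name of the per-table lsn column in this WIP 
catalog, so substitute whatever the version you built actually has):

-- on the subscriber: per-table sync state and the lsn recorded for it
SELECT substate, sublsn FROM pg_subscription_rel;

-- on the publisher: what the walsender reports as sent
SELECT sent_location FROM pg_stat_replication;

Comparing the two should tell us whether the lsn recorded for the 
table is anywhere near what has actually been sent.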

> Also I’ve noted that some lsn position looks wrong on publisher:
>
> postgres=# select restart_lsn, confirmed_flush_lsn from pg_replication_slots;
>  restart_lsn | confirmed_flush_lsn
> -------------+---------------------
>  0/1530EF8   | 7FFF/5E7F6A30
> (1 row)
>
> postgres=# select sent_location, write_location, flush_location, replay_location from pg_stat_replication;
>  sent_location | write_location | flush_location | replay_location
> ---------------+----------------+----------------+-----------------
>  0/1530F30     | 7FFF/5E7F6A30  | 7FFF/5E7F6A30  | 7FFF/5E7F6A30
> (1 row)
>

That's most likely the result of the uninitialized origin_startpos 
warning. I am working on a new version of the patch where that part is 
fixed; if you want to check this before I send it in, the change looks 
like this:

diff --git a/src/backend/replication/logical/apply.c b/src/backend/replication/logical/apply.c
index 581299e..7a9e775 100644
--- a/src/backend/replication/logical/apply.c
+++ b/src/backend/replication/logical/apply.c
@@ -1353,6 +1353,7 @@ ApplyWorkerMain(Datum main_arg)
         originid = replorigin_by_name(myslotname, false);
         replorigin_session_setup(originid);
         replorigin_session_origin = originid;
+        origin_startpos = replorigin_session_get_progress(false);
         CommitTransactionCommand();
 
         wrcapi->connect(wrchandle, MySubscription->conninfo, true,
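If that is indeed the cause, then after applying it and letting the 
apply worker reconnect, rerunning the same two queries on the 
publisher:

SELECT restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;
SELECT sent_location, write_location, flush_location, replay_location FROM pg_stat_replication;

should no longer show the 7FFF/... values for confirmed_flush_lsn and 
the write/flush/replay locations; they should be back in the same 
range as restart_lsn and sent_location.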

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


