Thread: logical decoding - reading a user catalog table

logical decoding - reading a user catalog table

From
Steve Singer
Date:
My logical decoding plugin is occasionally getting this  error

"could not resolve cmin/cmax of catalog tuple"

I get this when my output plugin is trying to read one of the user 
defined catalog tables (user_catalog_table=true)


I am not sure if this is a bug in the time-travel support in the logical 
decoding support or if I'm just using it wrong (i.e. not getting a 
sufficient lock on the relation or something).



This is the interesting part of the stack trace

#4  0x000000000091bbc8 in HeapTupleSatisfiesHistoricMVCC (htup=0x7fffcf42a900,
    snapshot=0x7f786ffe92d8, buffer=10568) at tqual.c:1631
#5  0x00000000004aedf3 in heapgetpage (scan=0x28d7080, page=0) at heapam.c:399
#6  0x00000000004b0182 in heapgettup_pagemode (scan=0x28d7080,
    dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:747
#7  0x00000000004b1ba6 in heap_getnext (scan=0x28d7080,
    direction=ForwardScanDirection) at heapam.c:1475
#8  0x00007f787002dbfb in lookupSlonyInfo (tableOid=91754, ctx=0x2826118,
    origin_id=0x7fffcf42ab8c, table_id=0x7fffcf42ab88, set_id=0x7fffcf42ab84)
    at slony_logical.c:663
#9  0x00007f787002b7a3 in pg_decode_change (ctx=0x2826118, txn=0x28cbec0,
    relation=0x7f787a3446a8, change=0x7f786ffe3268) at slony_logical.c:237
#10 0x00000000007497d4 in change_cb_wrapper (cache=0x28cbda8, txn=0x28cbec0,
    relation=0x7f787a3446a8, change=0x7f786ffe3268) at logical.c:704



Here is what the code in lookupSlonyInfo is doing
------------------
  sltable_oid = get_relname_relid("sl_table",slony_namespace);

  sltable_rel = relation_open(sltable_oid,AccessShareLock);
  tupdesc=RelationGetDescr(sltable_rel);
  scandesc=heap_beginscan(sltable_rel,
GetCatalogSnapshot(sltable_oid),0,NULL);
  reloid_attnum = get_attnum(sltable_oid,"tab_reloid");

  if(reloid_attnum == InvalidAttrNumber)
      elog(ERROR,"sl_table does not have a tab_reloid column");
  set_attnum = get_attnum(sltable_oid,"tab_set");

  if(set_attnum == InvalidAttrNumber)
      elog(ERROR,"sl_table does not have a tab_set column");
  tableid_attnum = get_attnum(sltable_oid, "tab_id");

  if(tableid_attnum == InvalidAttrNumber)
      elog(ERROR,"sl_table does not have a tab_id column");

  while( (tuple = heap_getnext(scandesc,ForwardScanDirection) ))
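
Continuing from that while loop, the rest of the function is roughly the
following (a sketch, not the verbatim slony_logical.c source; it assumes
tab_reloid is an oid column and tab_id/tab_set are int4, and it leaves the
origin_id lookup out):

  while ((tuple = heap_getnext(scandesc, ForwardScanDirection)) != NULL)
  {
      bool    isnull;
      Oid     tab_reloid;

      /* does this sl_table row describe the relation being decoded? */
      tab_reloid = DatumGetObjectId(
          heap_getattr(tuple, reloid_attnum, tupdesc, &isnull));
      if (isnull || tab_reloid != tableOid)
          continue;

      /* found it: hand the ids back through the output parameters */
      *table_id = DatumGetInt32(
          heap_getattr(tuple, tableid_attnum, tupdesc, &isnull));
      *set_id = DatumGetInt32(
          heap_getattr(tuple, set_attnum, tupdesc, &isnull));
      break;
  }

  heap_endscan(scandesc);
  relation_close(sltable_rel, AccessShareLock);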






Re: logical decoding - reading a user catalog table

From
Andres Freund
Date:
On 2014-10-25 18:18:07 -0400, Steve Singer wrote:
> My logical decoding plugin is occasionally getting this  error
> 
> "could not resolve cmin/cmax of catalog tuple"
> 
> I get this when my output plugin is trying to read one of the user defined
> catalog tables (user_catalog_table=true)

Hm. That should obviously not happen.

Could you describe how that table is modified? Does that bug happen
initially, or only after a while?

Do you have a testcase that would allow me to easily reproduce the
problem?

> I am not sure if this is a bug in the time-travel support in the logical
> decoding support of if I'm just using it wrong (ie not getting a sufficient
> lock on the relation or something).

I don't know yet...

> This is the interesting part of the stack trace
> 
> #4  0x000000000091bbc8 in HeapTupleSatisfiesHistoricMVCC
> (htup=0x7fffcf42a900,
>     snapshot=0x7f786ffe92d8, buffer=10568) at tqual.c:1631
> #5  0x00000000004aedf3 in heapgetpage (scan=0x28d7080, page=0) at
> heapam.c:399
> #6  0x00000000004b0182 in heapgettup_pagemode (scan=0x28d7080,
>     dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:747
> #7  0x00000000004b1ba6 in heap_getnext (scan=0x28d7080,
>     direction=ForwardScanDirection) at heapam.c:1475
> #8  0x00007f787002dbfb in lookupSlonyInfo (tableOid=91754, ctx=0x2826118,
>     origin_id=0x7fffcf42ab8c, table_id=0x7fffcf42ab88,
> set_id=0x7fffcf42ab84)
>     at slony_logical.c:663
> #9  0x00007f787002b7a3 in pg_decode_change (ctx=0x2826118, txn=0x28cbec0,
>     relation=0x7f787a3446a8, change=0x7f786ffe3268) at slony_logical.c:237
> #10 0x00000000007497d4 in change_cb_wrapper (cache=0x28cbda8, txn=0x28cbec0,
>     relation=0x7f787a3446a8, change=0x7f786ffe3268) at logical.c:704
> 
> 
> 
> Here is what the code in lookupSlonyInfo is doing
> ------------------
> 
>   sltable_oid = get_relname_relid("sl_table",slony_namespace);
> 
>   sltable_rel = relation_open(sltable_oid,AccessShareLock);
>   tupdesc=RelationGetDescr(sltable_rel);
>   scandesc=heap_beginscan(sltable_rel,
> GetCatalogSnapshot(sltable_oid),0,NULL);
>   reloid_attnum = get_attnum(sltable_oid,"tab_reloid");
> 
>   if(reloid_attnum == InvalidAttrNumber)
>       elog(ERROR,"sl_table does not have a tab_reloid column");
>   set_attnum = get_attnum(sltable_oid,"tab_set");
> 
>   if(set_attnum == InvalidAttrNumber)
>       elog(ERROR,"sl_table does not have a tab_set column");
>   tableid_attnum = get_attnum(sltable_oid, "tab_id");
> 
>   if(tableid_attnum == InvalidAttrNumber)
>       elog(ERROR,"sl_table does not have a tab_id column");
> 
>   while( (tuple = heap_getnext(scandesc,ForwardScanDirection) ))

(Except missing spaces ;)) I don't see anything obviously wrong with
this.

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - reading a user catalog table

From
Steve Singer
Date:
On 10/28/2014 01:31 PM, Andres Freund wrote:
> On 2014-10-25 18:18:07 -0400, Steve Singer wrote:
>> My logical decoding plugin is occasionally getting this  error
>>
>> "could not resolve cmin/cmax of catalog tuple"
>>
>> I get this when my output plugin is trying to read one of the user defined
>> catalog tables (user_catalog_table=true)
> Hm. That should obviously not happen.
>
> Could you describe how that table is modified? Does that bug happen
> initially, or only after a while?

It doesn't happen right away; in this case it was maybe 4 minutes after 
creating the slot.
The error also doesn't always happen when I run this test workload, 
but it is reproducible with some trying.
I don't do anything special to that table, it gets created then I do 
inserts on it.  I don't do an alter table or anything fancy like that.  
I was running the slony failover test (all nodes under the same 
postmaster) which involves the occasional dropping and recreating of 
databases along with normal query load + replication.

I'll send you a tar of the data directory off list with things in this state.

> Do you have a testcase that would allow me to easily reproduce the
> problem?

I don't have an isolated test case that does this.  The test that I'm 
hitting this with does lots of stuff and doesn't even always hit this.

>> I am not sure if this is a bug in the time-travel support in the logical
>> decoding support of if I'm just using it wrong (ie not getting a sufficient
>> lock on the relation or something).
> I don't know yet...
>
>> This is the interesting part of the stack trace
>>
>> #4  0x000000000091bbc8 in HeapTupleSatisfiesHistoricMVCC
>> (htup=0x7fffcf42a900,
>>      snapshot=0x7f786ffe92d8, buffer=10568) at tqual.c:1631
>> #5  0x00000000004aedf3 in heapgetpage (scan=0x28d7080, page=0) at
>> heapam.c:399
>> #6  0x00000000004b0182 in heapgettup_pagemode (scan=0x28d7080,
>>      dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:747
>> #7  0x00000000004b1ba6 in heap_getnext (scan=0x28d7080,
>>      direction=ForwardScanDirection) at heapam.c:1475
>> #8  0x00007f787002dbfb in lookupSlonyInfo (tableOid=91754, ctx=0x2826118,
>>      origin_id=0x7fffcf42ab8c, table_id=0x7fffcf42ab88,
>> set_id=0x7fffcf42ab84)
>>      at slony_logical.c:663
>> #9  0x00007f787002b7a3 in pg_decode_change (ctx=0x2826118, txn=0x28cbec0,
>>      relation=0x7f787a3446a8, change=0x7f786ffe3268) at slony_logical.c:237
>> #10 0x00000000007497d4 in change_cb_wrapper (cache=0x28cbda8, txn=0x28cbec0,
>>      relation=0x7f787a3446a8, change=0x7f786ffe3268) at logical.c:704
>>
>>
>>
>> Here is what the code in lookupSlonyInfo is doing
>> ------------------
>>
>>    sltable_oid = get_relname_relid("sl_table",slony_namespace);
>>
>>    sltable_rel = relation_open(sltable_oid,AccessShareLock);
>>    tupdesc=RelationGetDescr(sltable_rel);
>>    scandesc=heap_beginscan(sltable_rel,
>> GetCatalogSnapshot(sltable_oid),0,NULL);
>>    reloid_attnum = get_attnum(sltable_oid,"tab_reloid");
>>
>>    if(reloid_attnum == InvalidAttrNumber)
>>       elog(ERROR,"sl_table does not have a tab_reloid column");
>>    set_attnum = get_attnum(sltable_oid,"tab_set");
>>
>>    if(set_attnum == InvalidAttrNumber)
>>       elog(ERROR,"sl_table does not have a tab_set column");
>>    tableid_attnum = get_attnum(sltable_oid, "tab_id");
>>
>>    if(tableid_attnum == InvalidAttrNumber)
>>       elog(ERROR,"sl_table does not have a tab_id column");
>>
>>    while( (tuple = heap_getnext(scandesc,ForwardScanDirection) ))
> (Except missing spaces ;)) I don't see anything obviously wrong with
> this.
>
> Greetings,
>
> Andres Freund
>




Re: logical decoding - reading a user catalog table

From
Andres Freund
Date:
Hi Steve,

On 2014-10-28 22:38:29 -0400, Steve Singer wrote:
> On 10/28/2014 01:31 PM, Andres Freund wrote:
> >On 2014-10-25 18:18:07 -0400, Steve Singer wrote:
> >>My logical decoding plugin is occasionally getting this  error
> >>
> >>"could not resolve cmin/cmax of catalog tuple"
> >>
> >>I get this when my output plugin is trying to read one of the user defined
> >>catalog tables (user_catalog_table=true)
> >Hm. That should obviously not happen.

I've pushed a fix for a bug that could possibly also cause
this. Although it'd be odd that it always hits the user catalog
table. Except if your tests mostly modify the slony tables, but do not
do much DDL otherwise?

> I'll send you tar of the data directory off list with things in this state.
> 
> >Do you have a testcase that would allow me to easily reproduce the
> >problem?
> 
> I don't have a isolated test case that does this.  The test that I'm hitting
> this with does lots of stuff and doesn't even always hit this.

If it still happens, could you send me instructions of how to reproduce
the problem after cloning the necessary source repositories? It's quite
hard to validate a possible fix otherwise.

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - reading a user catalog table

From
Steve Singer
Date:
On 11/13/2014 02:44 PM, Andres Freund wrote:
> Hi Steve,

> I've pushed a fix for a bug that could possibly also cause
> this. Although it'd be odd that it always hits the user catalog
> table. Except if your tests mostly modify the slony tables, but do not
> do much DDL otherwise?

The test I was running doesn't do DDL, so only the user catalog tables 
would have changes being tracked.

I still sometimes get the error. When I get some time on the weekend I'll 
work on some detailed instructions on how to grab and set up the various 
components to replicate this test.

Also since updating (to 2c267e47afa4f9a7c) I've seen an assertion failure 
in a normal client connection, not the walsender

#3  0x00000000006b4978 in GetSerializableTransactionSnapshotInt (
    snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
    sourcexid=sourcexid@entry=0) at predicate.c:1738
#4  0x00000000006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
    at predicate.c:1517
#5  GetSerializableTransactionSnapshot (
    snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
#6  0x00000000007d16dd in GetTransactionSnapshot () at snapmgr.c:200
#7  0x00000000006c0e35 in exec_simple_query (
    query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
    ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
    \"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
    coalesce(ev_provider_xid,\""...) at postgres.c:959
#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,


I have no idea if this has anything to do with your recent changes or 
some other change. I haven't so far been able to replicate that since 
the first time I saw it.



>> I'll send you tar of the data directory off list with things in this state.
>>
>>> Do you have a testcase that would allow me to easily reproduce the
>>> problem?
>> I don't have a isolated test case that does this.  The test that I'm hitting
>> this with does lots of stuff and doesn't even always hit this.
> If it still happens, could you send me instructions of how to reproduce
> the problem after cloning the necessary source repositories? It's quite
> hard to validate a possible fix otherwise.
>
> Greetings,
>
> Andres Freund
>




Re: logical decoding - reading a user catalog table

From
Steve Singer
Date:
On 11/13/2014 02:44 PM, Andres Freund wrote:
> Hi Steve,

> If it still happens, could you send me instructions of how to reproduce
> the problem after cloning the necessary source repositories? It's quite
> hard to validate a possible fix otherwise.

1. Install PG 9.4

2. Perform an initdb, then set in postgresql.conf:
   max_connections = 200
   wal_level = logical
   max_wal_senders = 50
   wal_keep_segments = 100
   wal_sender_timeout = 600s
   max_replication_slots = 120



Create a user slon with superuser
Create a user test
(set passwords for them; you'll use them later)

Configure pg_hba.conf to allow slon to connect as a replication user


3. Download slony from github (https://github.com/ssinger/slony1-engine.git)
   checkout the branch logical_remote_helper
   ./configure --with-pgconfigdir=/path_to_your_pgcbindir
   make
   make install

4. Download and compile clustertest from 
(https://github.com/clustertest/clustertest-framework).
5. In the slony source tree cd to the clustertest directory
6. cp conf/disorder.properties.sample to conf/disorder.properties
Edit the file to have the correct paths, ports, users, passwords.

7. cp conf/java.properties.sample to conf/java.properties and edit the file 
to point at the JAR you downloaded earlier.

I run with all 5 databases on the same PG cluster. Performance will be 
much better with 5 different clusters, but I recommend trying to emulate 
my configuration as much as possible to replicate this

To run the tests then do
./run_all_disorder_tests.sh


At the moment (commit df5acfd1a3) it is configured to run just the Failover 
node test cases where I am seeing this, not the entire suite.

It typically takes between 30 minutes and an hour to run, though.

I installed things following the above steps on a different system than 
my usual development laptop and I have been unable to reproduce the 
error so far (on that system).  But I am still able to reproduce it on 
occasion on my normal development laptop.



> Greetings,
>
> Andres Freund
>




Re: logical decoding - reading a user catalog table

From
Steve Singer
Date:
On 11/16/2014 04:49 PM, Steve Singer wrote:
>
>
> I installed things following the above steps on a different system 
> than my usual development laptop and I have been unable to reproduce 
> the error so for (on that system).  But I am still able to reproduce 
> it on occasion on my normal development laptop.
>
>
>

After continuously running the test I was eventually able to reproduce 
the error on the other system as well.


>> Greetings,
>>
>> Andres Freund
>>
>
>
>




Re: logical decoding - reading a user catalog table

From
Andres Freund
Date:
On 2014-11-17 10:33:52 -0500, Steve Singer wrote:
> On 11/16/2014 04:49 PM, Steve Singer wrote:
> >
> >
> >I installed things following the above steps on a different system than my
> >usual development laptop and I have been unable to reproduce the error so
> >for (on that system).  But I am still able to reproduce it on occasion on
> >my normal development laptop.

Thanks for the testcase! I'll try to reproduce/fix.

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - reading a user catalog table

From
Andres Freund
Date:
On 2014-11-13 22:23:02 -0500, Steve Singer wrote:
> On 11/13/2014 02:44 PM, Andres Freund wrote:
> >Hi Steve,
> 
> >I've pushed a fix for a bug that could possibly also cause
> >this. Although it'd be odd that it always hits the user catalog
> >table. Except if your tests mostly modify the slony tables, but do not
> >do much DDL otherwise?
> 
> The test I was running doesn't do DDL, so only the user catalog tables would
> have changes being tracked.
> 
> I still sometimes get the error. When I get sometime on the weekend I'll
> work on some detailed  instructions on how to grab and setup the various
> components to replicate this test.
> 
> Also since updating (to 2c267e47afa4f9a7c) I've seen a assertion failure in
> a normal client connection, not the walsender
> 
> #3  0x00000000006b4978 in GetSerializableTransactionSnapshotInt (
>     snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
>     sourcexid=sourcexid@entry=0) at predicate.c:1738
> #4  0x00000000006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
>     at predicate.c:1517
> #5  GetSerializableTransactionSnapshot (
>     snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
> #6  0x00000000007d16dd in GetTransactionSnapshot () at snapmgr.c:200
> #7  0x00000000006c0e35 in exec_simple_query (
>     query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
> ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
> \"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
> coalesce(ev_provider_xid,\""...)
>     at postgres.c:959
> #8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,
> 
> 
> I have no idea if this has anything to do with your recent changes or some
> other change. I haven't so far been able to replicate that since the first
> time I saw it.

That crash is decidedly odd. Any chance you still have the full
backtrace around?

This is in the SSI code... I'm not immediately seeing how this could be
related to logical decoding. It can't even be an imported snapshot,
because the exported snapshot is marked repeatable read.

Are you actually using serializable transactions? If so, how and why?

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - reading a user catalog table

From
Steve Singer
Date:
On 11/17/2014 10:37 AM, Andres Freund wrote:
> On 2014-11-13 22:23:02 -0500, Steve Singer wrote:
>
>
> Also since updating (to 2c267e47afa4f9a7c) I've seen a assertion failure in
> a normal client connection, not the walsender
>
> #3  0x00000000006b4978 in GetSerializableTransactionSnapshotInt (
>      snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
>      sourcexid=sourcexid@entry=0) at predicate.c:1738
> #4  0x00000000006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
>      at predicate.c:1517
> #5  GetSerializableTransactionSnapshot (
>      snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
> #6  0x00000000007d16dd in GetTransactionSnapshot () at snapmgr.c:200
> #7  0x00000000006c0e35 in exec_simple_query (
>      query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
> ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
> \"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
> coalesce(ev_provider_xid,\""...)
>      at postgres.c:959
> #8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,
>
>
> I have no idea if this has anything to do with your recent changes or some
> other change. I haven't so far been able to replicate that since the first
> time I saw it.
> That crash is decidedly odd. Any chance you still have the full
> backtrace around?

Yes I still have the core file

>
> This is in the SSI code... I'm not immediately seeing how this could be
> related to logical decoding. It can't even be a imported snapshot,
> because the exported snapshot is marked repeatable read.
>
> Are you actually using serializable transactions? If so, how and why?

Yes, the test client that performs the 'simulated workload' does some 
serializable transactions.  It connects as a normal client via JDBC and 
does a prepareStatement("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE") 
and then does some DML. Why? Because it seemed like a good thing to 
include in the test suite.

You're right, this might have nothing to do with logical decoding.  I 
haven't been able to reproduce it again either; I can't even say if this 
problem was introduced to 9.4 in the past month or if it has been around 
much longer and I just haven't hit it before.




>
> Greetings,
>
> Andres Freund
>




Re: logical decoding - reading a user catalog table

From
Andres Freund
Date:
Hi,

Kevin: CCed you, because it doesn't really look like a logical decoding
related issue.

On 2014-11-17 11:25:40 -0500, Steve Singer wrote:
> On 11/17/2014 10:37 AM, Andres Freund wrote:
> >On 2014-11-13 22:23:02 -0500, Steve Singer wrote:
> >
> >
> >Also since updating (to 2c267e47afa4f9a7c) I've seen a assertion failure in
> >a normal client connection, not the walsender
> >
> >#3  0x00000000006b4978 in GetSerializableTransactionSnapshotInt (
> >     snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
> >     sourcexid=sourcexid@entry=0) at predicate.c:1738
> >#4  0x00000000006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
> >     at predicate.c:1517
> >#5  GetSerializableTransactionSnapshot (
> >     snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
> >#6  0x00000000007d16dd in GetTransactionSnapshot () at snapmgr.c:200
> >#7  0x00000000006c0e35 in exec_simple_query (
> >     query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
> >ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
> >\"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
> >coalesce(ev_provider_xid,\""...)
> >     at postgres.c:959
> >#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,
> >
> >
> >I have no idea if this has anything to do with your recent changes or some
> >other change. I haven't so far been able to replicate that since the first
> >time I saw it.
> >That crash is decidedly odd. Any chance you still have the full
> >backtrace around?
> 
> Yes I still have the core file

Cool, could you show the full thing? Or did you just snip it because
it's just the Assert/ExceptionalCondition thing?

Could you print *snapshot in frame #3?

> >
> >This is in the SSI code... I'm not immediately seeing how this could be
> >related to logical decoding. It can't even be a imported snapshot,
> >because the exported snapshot is marked repeatable read.
> >
> >Are you actually using serializable transactions? If so, how and why?
> 
> Yes, the test client that performs the 'simulated workload' does some
> serializable transactions.  It connects as a normal client via JDBC and does
> a prepareStatement("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE") and then
> does some DML. Why? because it seemed like a good thing to include in the
> test suite.

Yes, it's a good reason as the above backtrace proves ;). I'm just
trying to understand under which circumstances it happens.

The above backtrace looks like it can only be triggered if you do a
BEGIN TRANSACTION SERIALIZABLE DEFERRABLE READ ONLY; Is that something
you do?

> Your right this might have nothing to do with logical decoding.   I haven't
> been able to reproduce again either, I can't even say
>  if this problem was introduced to 9.4 in the past month or if it has been
> around much longer and I just haven't hit it before.

It's not hard to imagine that the safe/deferred snapshot code isn't all
that well exercised. I don't think many people use it yet.

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding - reading a user catalog table

From
Steve Singer
Date:
On 11/17/2014 11:34 AM, Andres Freund wrote:
> Hi,
>
> Kevin: CCed you, because it doesn't really look like a logical decoding
> related issue.
>
> On 2014-11-17 11:25:40 -0500, Steve Singer wrote:
>> On 11/17/2014 10:37 AM, Andres Freund wrote:
>>> On 2014-11-13 22:23:02 -0500, Steve Singer wrote:
>>>
>>>
>>> Also since updating (to 2c267e47afa4f9a7c) I've seen a assertion failure in
>>> a normal client connection, not the walsender
>>>
>>> #3  0x00000000006b4978 in GetSerializableTransactionSnapshotInt (
>>>      snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
>>>      sourcexid=sourcexid@entry=0) at predicate.c:1738
>>> #4  0x00000000006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
>>>      at predicate.c:1517
>>> #5  GetSerializableTransactionSnapshot (
>>>      snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
>>> #6  0x00000000007d16dd in GetTransactionSnapshot () at snapmgr.c:200
>>> #7  0x00000000006c0e35 in exec_simple_query (
>>>      query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
>>> ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
>>> \"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
>>> coalesce(ev_provider_xid,\""...)
>>>      at postgres.c:959
>>> #8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,
>>>
>>>
>>> I have no idea if this has anything to do with your recent changes or some
>>> other change. I haven't so far been able to replicate that since the first
>>> time I saw it.
>>> That crash is decidedly odd. Any chance you still have the full
>>> backtrace around?
>> Yes I still have the core file
> Cool, could you show the full thing? Or did you just snip it because
> it's just the Assert/ExceptionalCondition thing?

I just snipped the exception stuff. Here is the full thing

(gdb) backtrace
#0  0x00007f6fb22e8295 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f6fb22eb438 in __GI_abort () at abort.c:90
#2  0x00000000007a08e7 in ExceptionalCondition (
    conditionName=conditionName@entry=0x918e88
    "!(TransactionIdFollows(snapshot->xmin, PredXact->SxactGlobalXmin))",
    errorType=errorType@entry=0x7da390 "FailedAssertion",
    fileName=fileName@entry=0x9182c3 "predicate.c",
    lineNumber=lineNumber@entry=1738) at assert.c:54
#3  0x00000000006b4978 in GetSerializableTransactionSnapshotInt (
    snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
    sourcexid=sourcexid@entry=0) at predicate.c:1738
#4  0x00000000006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
    at predicate.c:1517
#5  GetSerializableTransactionSnapshot (
    snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
#6  0x00000000007d16dd in GetTransactionSnapshot () at snapmgr.c:200
#7  0x00000000006c0e35 in exec_simple_query (
    query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
    ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
    \"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
    coalesce(ev_provider_xid,\""...) at postgres.c:959
#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,
    dbname=0x1f5ab30 "test2", username=<optimized out>) at postgres.c:4016
#9  0x000000000046a08e in BackendRun (port=0x1f83010) at postmaster.c:4123
#10 BackendStartup (port=0x1f83010) at postmaster.c:3797
#11 ServerLoop () at postmaster.c:1576
#12 0x0000000000665eae in PostmasterMain (argc=argc@entry=3,
    argv=argv@entry=0x1f59d00) at postmaster.c:1223
#13 0x000000000046ab58 in main (argc=3, argv=0x1f59d00) at main.c:227


>
> Could you print *snapshot in frame #3?

(gdb) p *snapshot
$1 = {satisfies = 0x7d0330 <HeapTupleSatisfiesMVCC>, xmin = 412635,
  xmax = 412637, xip = 0x1f86e40, xcnt = 1, subxcnt = 0, subxip = 0x1fd2190,
  suboverflowed = 0 '\000', takenDuringRecovery = 0 '\000', copied = 0 '\000',
  curcid = 0, active_count = 0, regd_count = 0}
(gdb)
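
So the assertion that fires, reconstructed from the condition string in 
frame #2 above, is:

    /* predicate.c:1738 in this build, per frame #2 of the backtrace */
    Assert(TransactionIdFollows(snapshot->xmin, PredXact->SxactGlobalXmin));

i.e. the failure means snapshot->xmin (412635 here) did not follow 
PredXact->SxactGlobalXmin at that point.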


>>> This is in the SSI code... I'm not immediately seeing how this could be
>>> related to logical decoding. It can't even be a imported snapshot,
>>> because the exported snapshot is marked repeatable read.
>>>
>>> Are you actually using serializable transactions? If so, how and why?
>> Yes, the test client that performs the 'simulated workload' does some
>> serializable transactions.  It connects as a normal client via JDBC and does
>> a prepareStatement("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE") and then
>> does some DML. Why? because it seemed like a good thing to include in the
>> test suite.
> Yes, it's a good reason as the above backtrace proves ;). I'm just
> trying to understand uner which circumstances it happens.
>
> The above backtrace looks like it can only be triggered if you do a
> BEGIN TRANSACTION SERIALIZABLE DEFERRABLE READ ONLY; Is that something
> you do?

The slony remote listener connection does this when it selects from 
sl_event.  We switched to this shortly after 9.1 came out.
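
For what it's worth, here is a minimal standalone sketch of what that 
connection effectively does (illustrative libpq code, not the actual slon 
source; the connection string is a placeholder and the sl_event query is 
abbreviated from the one in the backtrace, without schema qualification):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=test2");    /* placeholder */
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /*
     * The transaction mode that, per the backtrace, is what reaches
     * GetSafeSnapshot(): serializable, read only, deferrable.
     */
    res = PQexec(conn, "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE "
                 "READ ONLY DEFERRABLE;");
    PQclear(res);

    /* abbreviated form of the sl_event query from the backtrace */
    res = PQexec(conn, "select ev_origin, ev_seqno, ev_timestamp, "
                 "ev_snapshot from sl_event;");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("%d event rows\n", PQntuples(res));
    PQclear(res);

    PQclear(PQexec(conn, "COMMIT;"));
    PQfinish(conn);
    return 0;
}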