Thread: logical replication - negative bitmapset member not allowed
I'm getting this message every 5 seconds on a single-master, single-slave replication of PG10.7->PG10.7, both on CentOS. It's over the 'net but otherwise seems to perform excellently. Any ideas what's causing it and how to fix it?

--
Tim Clarke
IT Director
Direct: +44 (0)1376 504510 | Mobile: +44 (0)7887 563420
Main: +44 (0)1376 503500 | Fax: +44 (0)1376 503550
Web: https://www.manifest.co.uk/
Minerva Analytics Ltd
9 Freebournes Court | Newland Street | Witham | Essex | CM8 2BL | England

Copyright: This e-mail may contain confidential or legally privileged information. If you are not the named addressee you must not use or disclose such information, instead please report it to admin@minerva.info
Legal: Minerva Analytics is the trading name of: Minerva Analytics Ltd: Registered in England Number 11260966 & The Manifest Voting Agency Ltd: Registered in England Number 2920820. Registered Office at above address. Please see https://www.manifest.co.uk/legal/ for further information.
Tim Clarke <tim.clarke@minerva.info> writes: > I'm getting this message every 5 seconds on a single-master, > single-slave replication of PG10.7->PG10.7 both on Centos. Its over the > 'net but otherwise seems to perform excellently. Any ideas what's > causing it and how to fix? That'd certainly be a bug, but we'd need to reproduce it to fix it. What are you doing that's different from everybody else? Can you provide any other info to narrow down the problem? regards, tom lane
Dang. I just replicated ~380 tables. One was missing an index so I paused replication, added a unique key on publisher and subscriber, re-enabled replication and refreshed the subscription. The table has only 7 columns, I added a primary key with a default value from a new sequence.

Tim Clarke

On 01/04/2019 15:02, Tom Lane wrote:
> Tim Clarke <tim.clarke@minerva.info> writes:
>> I'm getting this message every 5 seconds on a single-master,
>> single-slave replication of PG10.7->PG10.7 both on Centos. Its over the
>> 'net but otherwise seems to perform excellently. Any ideas what's
>> causing it and how to fix?
> That'd certainly be a bug, but we'd need to reproduce it to fix it.
> What are you doing that's different from everybody else? Can you
> provide any other info to narrow down the problem?
>
> regards, tom lane
On 2019-Apr-01, Tom Lane wrote:
> Tim Clarke <tim.clarke@minerva.info> writes:
> > I'm getting this message every 5 seconds on a single-master,
> > single-slave replication of PG10.7->PG10.7 both on Centos. Its over the
> > 'net but otherwise seems to perform excellently. Any ideas what's
> > causing it and how to fix?
>
> That'd certainly be a bug, but we'd need to reproduce it to fix it.
> What are you doing that's different from everybody else? Can you
> provide any other info to narrow down the problem?

Maybe the replica identity of a table got set to a unique index on oid? Or something else involving system columns? (If replication is otherwise working, then I suppose there's a separate publication that's having the error; the first thing to isolate would be to see what tables are involved in that publication).

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tim Clarke <tim.clarke@minerva.info> writes: > Dang. I just replicated ~380 tables. One was missing an index so I > paused replication, added a unique key on publisher and subscriber, > re-enabled replication and refreshed the subscription. Well, that's not much help :-(. Can you provide any info to narrow down where this is happening? I mean, you haven't even told us whether it's the primary or the slave that is complaining. Does it seem to be associated with any particular command? (Turning on log_statement and/or log_replication_commands would likely help with that.) Does data seem to be getting transferred despite the complaint? If not, what's missing on the slave? regards, tom lane
On 02/04/2019 14:59, Tom Lane wrote:
> Well, that's not much help :-(. Can you provide any info to narrow
> down where this is happening? I mean, you haven't even told us whether
> it's the primary or the slave that is complaining. Does it seem to
> be associated with any particular command? (Turning on log_statement
> and/or log_replication_commands would likely help with that.) Does
> data seem to be getting transferred despite the complaint? If not,
> what's missing on the slave?
>
> regards, tom lane

I've been working to narrow it down; the error is being reported on the slave. The only schema changes have been the two primary keys added to two tables. The problem occurred during this cycle:

1) Replication proceeding fine for ~380 tables, all added individually, not "all tables".
2) Add primary key on master.
3) Add primary key on slave.
4) Refresh subscription on slave; error starts being reported.

I've cleared it by dropping the slave database, re-creating it from the live schema, then fully replicating. It's all running happily now.

Tim Clarke
Tim Clarke <tim.clarke@minerva.info> writes: > I've cleared it by dropping the slave database, re-creating from the > live schema then fully replicating. Its all running happily now. I'm glad you're out of the woods, but we still have a bug there waiting to bite the next person. I wonder if you'd be willing to spend some time trying to develop a reproduction sequence for this (obviously, working on a test setup not your live servers). Presumably there's something in the subscription-alteration logic that needs work, but I don't think we have enough detail here for somebody else to reproduce the error without a lot of guesswork. regards, tom lane
On 02/04/2019 15:46, Tom Lane wrote:
> I'm glad you're out of the woods, but we still have a bug there
> waiting to bite the next person. I wonder if you'd be willing to
> spend some time trying to develop a reproduction sequence for this
> (obviously, working on a test setup not your live servers).
> Presumably there's something in the subscription-alteration logic
> that needs work, but I don't think we have enough detail here for
> somebody else to reproduce the error without a lot of guesswork.
>
> regards, tom lane

I'll do what I can :)

Tim Clarke
On 2019-04-01 23:43, Alvaro Herrera wrote: > Maybe the replica identity of a table got set to a unique index on oid? > Or something else involving system columns? (If replication is > otherwise working, the I suppose there's a separate publication that's > having the error; the first thing to isolate would be to see what tables > are involved in that publication). Looking through the code, the bms_add_member() call in logicalrep_read_attrs() does not use the usual FirstLowInvalidHeapAttributeNumber offset, so that seems like a possible problem. However, I can't quite reproduce this. There are various other checks that prevent this scenario, but it's plausible that with a bit of whacking around you could hit this error message. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
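The offset convention Peter refers to can be illustrated with a small standalone sketch. It is not PostgreSQL code: `ToySet`, the `toy_*` helpers and the value `-8` for `TOY_FIRST_LOW_INVALID_ATTNUM` are all assumptions made for the example (the real `FirstLowInvalidHeapAttributeNumber` depends on the server version, and the real `Bitmapset` grows on demand). It only shows why a negative attribute number cannot be stored directly, while the same number shifted by the offset can.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumption: illustrative stand-in for PostgreSQL's
 * FirstLowInvalidHeapAttributeNumber. The real value depends on the
 * server version; -8 is used here only for the demonstration.
 */
#define TOY_FIRST_LOW_INVALID_ATTNUM (-8)

/* Hypothetical 64-slot set standing in for a Bitmapset. */
typedef struct ToySet
{
    uint64_t bits;
} ToySet;

/*
 * Add member x. Returns false where the real bms_add_member() would
 * raise "negative bitmapset member not allowed". The toy set also
 * rejects x >= 64, which the real, growable Bitmapset would accept.
 */
bool
toy_add_member(ToySet *set, int x)
{
    assert(set != NULL);
    if (x < 0 || x >= 64)
        return false;
    set->bits |= UINT64_C(1) << x;
    return true;
}

/* Test membership; out-of-range values are simply "not a member" here. */
bool
toy_is_member(const ToySet *set, int x)
{
    assert(set != NULL);
    if (x < 0 || x >= 64)
        return false;
    return ((set->bits >> x) & UINT64_C(1)) != 0;
}

/* Store an attribute number shifted so that negative (system) columns fit. */
bool
toy_add_attnum(ToySet *set, int attnum)
{
    return toy_add_member(set, attnum - TOY_FIRST_LOW_INVALID_ATTNUM);
}

/* Look up an attribute number using the same shift. */
bool
toy_has_attnum(const ToySet *set, int attnum)
{
    return toy_is_member(set, attnum - TOY_FIRST_LOW_INVALID_ATTNUM);
}
```

With these helpers, `toy_add_member(&set, -1)` is rejected, whereas `toy_add_attnum(&set, -1)` succeeds because the value actually stored is `-1 - (-8) = 7`. Whether the unshifted call in logicalrep_read_attrs() can really receive a negative number is exactly what Peter says he could not reproduce.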
On 04/04/2019 22:37, Peter Eisentraut wrote:
> On 2019-04-01 23:43, Alvaro Herrera wrote:
>> Maybe the replica identity of a table got set to a unique index on oid?
>> Or something else involving system columns? (If replication is
>> otherwise working, the I suppose there's a separate publication that's
>> having the error; the first thing to isolate would be to see what tables
>> are involved in that publication).
> Looking through the code, the bms_add_member() call in
> logicalrep_read_attrs() does not use the usual
> FirstLowInvalidHeapAttributeNumber offset, so that seems like a possible
> problem.
>
> However, I can't quite reproduce this. There are various other checks
> that prevent this scenario, but it's plausible that with a bit of
> whacking around you could hit this error message.

Promise I've not been whacking around......

Tim Clarke
Re: logical replication - negative bitmapset member not allowed
From: Jehan-Guillaume de Rorthais
Hello,

On Thu, 4 Apr 2019 23:37:04 +0200 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 2019-04-01 23:43, Alvaro Herrera wrote:
> > Maybe the replica identity of a table got set to a unique index on oid?
> > Or something else involving system columns? (If replication is
> > otherwise working, the I suppose there's a separate publication that's
> > having the error; the first thing to isolate would be to see what tables
> > are involved in that publication).
>
> Looking through the code, the bms_add_member() call in
> logicalrep_read_attrs() does not use the usual
> FirstLowInvalidHeapAttributeNumber offset, so that seems like a possible
> problem.
>
> However, I can't quite reproduce this. There are various other checks
> that prevent this scenario, but it's plausible that with a bit of
> whacking around you could hit this error message.

Here is a script to reproduce it under version 10, 11 and 12:

################################################
# env
PUB=/tmp/pub
SUB=/tmp/sub
unset PGPORT PGHOST PGDATABASE PGDATA
export PGUSER=postgres

# cleanup
kill %1
pg_ctl -w -s -D "$PUB" -m immediate stop; echo $?
pg_ctl -w -s -D "$SUB" -m immediate stop; echo $?
rm -r "$PUB" "$SUB"

# cluster
initdb -U postgres -N "$PUB" &>/dev/null; echo $?
initdb -U postgres -N "$SUB" &>/dev/null; echo $?
echo "wal_level=logical" >> "$PUB"/postgresql.conf
echo "port=5433" >> "$SUB"/postgresql.conf
pg_ctl -w -s -D $PUB -l "$PUB"-"$(date +%FT%T)".log start; echo $?
pg_ctl -w -s -D $SUB -l "$SUB"-"$(date +%FT%T)".log start; echo $?

pgbench -p 5432 -qi
pg_dump -p 5432 -s | psql -qXp 5433

# fake activity
pgbench -p 5432 -T 300 -c 2 &

# replication setup
psql -p 5432 -Xc "CREATE PUBLICATION prov FOR ALL TABLES"
psql -p 5433 -Xc "CREATE SUBSCRIPTION sub CONNECTION 'port=5432' PUBLICATION prov"

# wait for the streaming
unset V;
while [ "$V" != "streaming" ]; do
  sleep 1
  V=$(psql -AtXc "SELECT 'streaming' FROM pg_stat_replication WHERE state='streaming'")
done

# trigger the error message
psql -p 5433 -Xc "ALTER SUBSCRIPTION sub DISABLE"
psql -p 5433 -Xc "ALTER TABLE pgbench_history ADD id SERIAL PRIMARY KEY"
psql -p 5432 -Xc "ALTER TABLE pgbench_history ADD id SERIAL PRIMARY KEY"
psql -p 5433 -Xc "ALTER SUBSCRIPTION sub ENABLE"
################################################

Regards,
Re: logical replication - negative bitmapset member not allowed
From: Jehan-Guillaume de Rorthais
On Thu, 10 Oct 2019 15:15:46 +0200 Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
[...]
> Here is a script to reproduce it under version 10, 11 and 12:

I investigated this bug while coming back from pgconf.eu. Below is what I have found so far.

The message "negative bitmapset member not allowed" comes from logicalrep_rel_open(). Every field that is unknown, dropped or generated is mapped to remote attnum -1. See backend/replication/logical/relation.c:

    if (attr->attisdropped || attr->attgenerated)
    {
        entry->attrmap[i] = -1;
        continue;
    }

    attnum = logicalrep_rel_att_by_name(remoterel, NameStr(attr->attname));

Note that logicalrep_rel_att_by_name() returns -1 for unknown fields.

Later in the same function, we check whether the fields belonging to a PK or unique index appear in the remote keys as well:

    while ((i = bms_next_member(idkey, i)) >= 0)
    {
        [...]
        if (!bms_is_member(entry->attrmap[attnum], remoterel->attkeys))
        {
            entry->updatable = false;
            break;
        }
    }

However, before checking whether the local attribute belongs to the remote keys, it should check whether it is actually mapped to a remote one. In other words, I suppose we should check entry->attrmap[attnum] >= 0 before calling bms_is_member().

The trivial patch would be:

    -   if (!bms_is_member(entry->attrmap[attnum], remoterel->attkeys))
    +   if (entry->attrmap[attnum] < 0 ||
    +       !bms_is_member(entry->attrmap[attnum], remoterel->attkeys))
        {
            entry->updatable = false;
            break;
        }

I tested with the attached scenario and it seems to work correctly.

Note that while trying to fix this bug, I hit a segmentation fault in a build with asserts enabled. You might want to review/test without --enable-cassert. I will report it in another thread, as it does not seem related to this bug or fix.
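The effect of the proposed guard can be sketched outside PostgreSQL. The following is a minimal standalone model, not the server code: `ToyKeySet`, `ToyResult` and the `toy_*` functions are hypothetical names, the key set is limited to 64 columns, and an error is modelled as a return value instead of an ERROR being raised. It assumes the situation from the reproduction script, where a local key column has no remote counterpart and is therefore mapped to -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum ToyResult
{
    TOY_ERROR = -1,         /* models "negative bitmapset member not allowed" */
    TOY_NOT_UPDATABLE = 0,  /* models entry->updatable = false */
    TOY_UPDATABLE = 1
} ToyResult;

/* Hypothetical stand-in for remoterel->attkeys: bit n set = remote column n is a key. */
typedef struct ToyKeySet
{
    uint64_t bits;
} ToyKeySet;

/* Returns -1 where bms_is_member() would raise an error, else 0 or 1. */
int
toy_is_member(int x, const ToyKeySet *keys)
{
    assert(keys != NULL);
    if (x < 0)
        return -1;
    if (x >= 64)
        return 0;
    return (int) ((keys->bits >> x) & UINT64_C(1));
}

/*
 * Check without the guard. attrmap[local] is the remote column index or -1;
 * idkey lists the local columns that form the identity key.
 */
ToyResult
toy_check_unpatched(const int *attrmap, const int *idkey, size_t nkeys,
                    const ToyKeySet *remote_keys)
{
    assert(attrmap != NULL && idkey != NULL);
    for (size_t k = 0; k < nkeys; k++)
    {
        int member = toy_is_member(attrmap[idkey[k]], remote_keys);

        if (member < 0)
            return TOY_ERROR;
        if (!member)
            return TOY_NOT_UPDATABLE;
    }
    return TOY_UPDATABLE;
}

/* Same check with the guard: an unmapped key column means "not updatable". */
ToyResult
toy_check_patched(const int *attrmap, const int *idkey, size_t nkeys,
                  const ToyKeySet *remote_keys)
{
    assert(attrmap != NULL && idkey != NULL);
    for (size_t k = 0; k < nkeys; k++)
    {
        int remote = attrmap[idkey[k]];

        if (remote < 0 || toy_is_member(remote, remote_keys) != 1)
            return TOY_NOT_UPDATABLE;
    }
    return TOY_UPDATABLE;
}
```

In this model, a key column mapped to -1 makes `toy_check_unpatched()` report an error, while `toy_check_patched()` quietly marks the relation as not updatable, which is the behaviour the patch aims for.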
Attachment
On 2019-10-25 17:38, Jehan-Guillaume de Rorthais wrote: > On Thu, 10 Oct 2019 15:15:46 +0200 > Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > > [...] >> Here is a script to reproduce it under version 10, 11 and 12: > > I investigated on this bug while coming back from pgconf.eu. Bellow what I found > so far. I have simplified your reproduction steps from the previous message to a test case, and I can confirm that your proposed fix addresses the issue. A patch is attached. Maybe someone can look it over. I target next week's minor releases. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: logical replication - negative bitmapset member not allowed
From: Jehan-Guillaume de Rorthais
On Tue, 5 Nov 2019 16:02:51 +0100 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 2019-10-25 17:38, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 10 Oct 2019 15:15:46 +0200
> > Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> >
> > [...]
> >> Here is a script to reproduce it under version 10, 11 and 12:
> >
> > I investigated on this bug while coming back from pgconf.eu. Bellow what I
> > found so far.
>
> I have simplified your reproduction steps from the previous message to a
> test case, and I can confirm that your proposed fix addresses the issue.

Thanks for the feedback and the test case. I wonder whether ALTER SUBSCRIPTION DISABLE/ENABLE is useful in the test case. Is it something recommended during DDL on a logically replicated relation? If yes, I suppose we should update the first point of the restrictions chapter in the documentation: https://www.postgresql.org/docs/11/logical-replication-restrictions

Regards,
Hi,

On 2019-11-05 16:02:51 +0100, Peter Eisentraut wrote:
> $node_publisher->stop('fast');
> +
> +
> +# TODO: https://www.postgresql.org/message-id/flat/a9139c29-7ddd-973b-aa7f-71fed9c38d75%40minerva.info
> +
> +$node_publisher = get_new_node('publisher3');
> +$node_publisher->init(allows_streaming => 'logical');
> +$node_publisher->start;
> +
> +$node_subscriber = get_new_node('subscriber3');
> +$node_subscriber->init(allows_streaming => 'logical');
> +$node_subscriber->start;

Do we really have to create a new subscriber for this test? The creation of one isn't free. Nor is the amount of test code duplication negligible.

Greetings,

Andres Freund
On 2019-11-05 17:05, Jehan-Guillaume de Rorthais wrote: >> I have simplified your reproduction steps from the previous message to a >> test case, and I can confirm that your proposed fix addresses the issue. > > Thanks for the feedback and the test case. I wonder if ALTER SUBSCRIPTION > DISABLE/ENABLE is useful in the test case? Turns out it's not necessary. Attached is an updated patch that simplifies the test even further and moves it into the 008_diff_schema.pl file. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2019-11-05 17:18, Andres Freund wrote: > On 2019-11-05 16:02:51 +0100, Peter Eisentraut wrote: >> $node_publisher->stop('fast'); >> + >> + >> +# TODO: https://www.postgresql.org/message-id/flat/a9139c29-7ddd-973b-aa7f-71fed9c38d75%40minerva.info >> + >> +$node_publisher = get_new_node('publisher3'); >> +$node_publisher->init(allows_streaming => 'logical'); >> +$node_publisher->start; >> + >> +$node_subscriber = get_new_node('subscriber3'); >> +$node_subscriber->init(allows_streaming => 'logical'); >> +$node_subscriber->start; > > Do we really have to create a new subscriber for this test? The creation > of one isn't free. Nor is the amount of test code duplication > neglegible. I changed that in the v2 patch. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: logical replication - negative bitmapset member not allowed
From: Jehan-Guillaume de Rorthais
On Thu, 7 Nov 2019 16:02:21 +0100 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 2019-11-05 17:05, Jehan-Guillaume de Rorthais wrote: > >> I have simplified your reproduction steps from the previous message to a > >> test case, and I can confirm that your proposed fix addresses the issue. > > > > Thanks for the feedback and the test case. I wonder if ALTER SUBSCRIPTION > > DISABLE/ENABLE is useful in the test case? > > Turns out it's not necessary. Attached is an updated patch that > simplifies the test even further and moves it into the > 008_diff_schema.pl file. OK. No further comments on my side. Thanks,
On 2019-11-07 16:18, Jehan-Guillaume de Rorthais wrote: > On Thu, 7 Nov 2019 16:02:21 +0100 > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > >> On 2019-11-05 17:05, Jehan-Guillaume de Rorthais wrote: >>>> I have simplified your reproduction steps from the previous message to a >>>> test case, and I can confirm that your proposed fix addresses the issue. >>> >>> Thanks for the feedback and the test case. I wonder if ALTER SUBSCRIPTION >>> DISABLE/ENABLE is useful in the test case? >> >> Turns out it's not necessary. Attached is an updated patch that >> simplifies the test even further and moves it into the >> 008_diff_schema.pl file. > > OK. No further comments on my side. Committed and backpatched. Thanks! -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: logical replication - negative bitmapset member not allowed
From: Jehan-Guillaume de Rorthais
On Sat, 9 Nov 2019 09:18:21 +0100 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 2019-11-07 16:18, Jehan-Guillaume de Rorthais wrote: > > On Thu, 7 Nov 2019 16:02:21 +0100 > > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > > > >> On 2019-11-05 17:05, Jehan-Guillaume de Rorthais wrote: > >>>> I have simplified your reproduction steps from the previous message to a > >>>> test case, and I can confirm that your proposed fix addresses the > >>>> issue. > >>> > >>> Thanks for the feedback and the test case. I wonder if ALTER SUBSCRIPTION > >>> DISABLE/ENABLE is useful in the test case? > >> > >> Turns out it's not necessary. Attached is an updated patch that > >> simplifies the test even further and moves it into the > >> 008_diff_schema.pl file. > > > > OK. No further comments on my side. > > Committed and backpatched. Thanks! I'm glad to help! Thanks,