On Mon, May 8, 2023 at 7:55 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sun, May 07, 2023 at 10:30:52PM +1200, Thomas Munro wrote:
> > Bug-in-PostgreSQL explanations could include that we forgot it was
> > dirty, or some backend wrote it out to the wrong file; but if we were
> > forgetting something like permanent or dirty, would there be a more
> > systematic failure? Oh, it could require special rare timing if it is
> > similar to 8a8661828's confusion about permanence level or otherwise
> > somehow not setting BM_PERMANENT, but in the target blocks, so I think
> > that'd require a checkpoint AND a crash. It doesn't reproduce for me,
> > but perhaps more unlucky ingredients are needed.
> >
> > Bug-in-OS/FS explanations could include that a whole lot of writes
> > were mysteriously lost in some time window, so all those files still
> > contain the zeroes we write first in smgrextend(). I guess this
> > previously rare (previously limited to hash indexes?) use of sparse
> > file hole-punching could be a factor in an it's-all-ZFS's-fault
> > explanation:
>
> Yes, you would need a bit of all that.
>
> I can reproduce the same backtrace here. That's just my usual laptop
> with ext4, so this would be a Postgres bug. First, here are the four
> things running in parallel so as I can get a failure in loading a
> critical index when connecting:
> 1) Create and drop a database with WAL_LOG as strategy and the
> regression database as template:
> while true; do
> createdb --template=regression --strategy=wal_log testdb;
> dropdb testdb;
> done
> 2) Feeding more data to pg_class in the middle, while testing the
> connection to the database created:
> while true;
> do psql -c 'create table popo as select 1 as a;' regression > /dev/null 2>&1 ;
> psql testdb -c "select 1" > /dev/null 2>&1 ;
> psql -c 'drop table popo' regression > /dev/null 2>&1 ;
> psql testdb -c "select 1" > /dev/null 2>&1 ;
> done;
> 3) Force some checkpoints:
> while true; do psql -c 'checkpoint' > /dev/null 2>&1; sleep 4; done
> 4) Force a few crashes and recoveries:
> while true ; do pg_ctl stop -m immediate ; pg_ctl start ; sleep 4 ; done
>
I am able to reproduce this using the steps given above, I am also
trying to analyze this further. I will send the update once I get
some clue.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com