Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 - Mailing list pgsql-hackers

From Timur Magomedov
Subject Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
Date
Msg-id 36cedffdfcac437afd692442cf9c1d16d7f28b01.camel@postgrespro.ru
Whole thread Raw
In response to Re: [WIP]Vertical Clustered Index (columnar store extension) - take2  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
List pgsql-hackers
Hello Peter!
I've succeeded in making a reproducer for a infrequent bug I've seen
several times with ROS control daemon enabled.
Looks like WAL records produced by ROS control daemon while processing
"vci_rc_update_del_vec" command are not compatible with what
heap_xlog_prune_freeze() function expects to read from WAL. Those
records are produced in cleanUpWos(), specific line looks like "recptr
= XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_ON_ACCESS);".
The reproducing steps can look tricky, any ideas of improving them are
welcome. This would ideally be a TAP test. For now I just patch code so
that ROS daemon terminates rignt after "update delete vector" command,
kill all postgress processes and next time PostgreSQL is started it
catches assertion inside heap_xlog_prune_freeze() function.

This is the reproduction routine in four steps:

1. Patch VCI using vci_always_fail_update_delete_vector.patch and build
it.

2. Setup VCI in config file (ros_control_daemon enabled):
shared_preload_libraries = 'vci'
max_worker_processes = 20
vci.table_rows_threshold = 0
vci.cost_threshold = 0
vci.enable_ros_control_daemon = true

3. Run reproducer.sh script that runs pgbench on VCI-enabled table and
terminates all postgres processes immediately using "killall -s 9
postgres" after pgbench failed. "pg_ctl stop" can't terminate
PostgreSQL here. "update delete vector" command is usually executed in
less than ten minutes on my system but it needs to wait some time.

4. Here we are with some WAL records on storage that (at least on my
machine) PostgreSQL is unable to apply and fails the assertion:

$ pg_ctl start
...
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 0/017689D8
..TRAP: failed Assert("do_prune || nplans > 0 || vmflags &
VISIBILITYMAP_VALID_BITS"), File: "heapam_xlog.c", Line: 117, PID:
841207

Debugger shows data actually contains some offsets, in order, but the
format and flags combination are unexpected:
#6  0x000055fad895d1f7 in heap_xlog_prune_freeze
(record=0x55faf23f0ce0) at heapam_xlog.c:117
117            Assert(do_prune || nplans > 0 || vmflags &
VISIBILITYMAP_VALID_BITS);
(gdb) print dataptr
$1 = 0x7072b5e18248 "\001"
(gdb) print datalen
$2 = 370
(gdb) print frz_offsets
$3 = (OffsetNumber *) 0x7072b5e18248
(gdb) print *frz_offsets@185
$4 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
  37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
  72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88,
89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
105,
  106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133,
  134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147,
148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161,
  162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175,
176, 177, 178, 179, 180, 181, 182, 183, 184, 185}
(gdb) print frz_offsets==dataptr
$5 = 1

I also attached backtrace from GDB.

I don't understand yet how to fix this and the reproducing is clunky so
any ideas are welcome.
Does this reproduce on your system too? Is it some known problem?

--
Regards,
Timur Magomedov


Attachment

pgsql-hackers by date:

Previous
From: "Vitaly Davydov"
Date:
Subject: RE: Newly created replication slot may be invalidated by checkpoint
Next
From: Álvaro Herrera
Date:
Subject: Re: [PATCH] Add pg_get_tablespace_ddl() function to reconstruct CREATE TABLESPACE statement