Thread: B-tree crash recovery error in 8.3 beta 2

B-tree crash recovery error in 8.3 beta 2

From
Koichi Suzuki
Date:
Hi,

I've found that B-tree crash recovery in 8.3 beta2 could make some
tuples invisible through B-tree.  They're visible if we read using but
Seq-Scan.   This happens in 8.3 beta2, but not in 8.2.4.  Here's how it
happens.

1. Create b-tree for a text type column.
2. Make B-tree three-story, that is, root-intermediate-leaf.  Insert
tuples sufficient to construct such B-tree.
3. No checkpoint should occur during 2.
4. Kill postmaster.
5. Restart postmaster.   Crash recovery will be done.
6. Tuples with column values less than HIKEY becomes invisible through
Idx-scan, still visible through Seq-scan.

>From the dump of B-tree, it seems that HIKEY value is cleared (only
tuple header is left).  No problem was found in the case of integer or
numeric type columns.

Attached is the shell script, postgresql.conf (almost the default one)
to reproduce the problem, and the log of the problem reproduction.

--
Koichi Suzuki


DROP TABLE
CREATE TABLE
CREATE INDEX
INSERT 0 27249
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |    3 |     1 |        3 |         1
(1 row)

CHECKPOINT
INSERT 0 1
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |  231 |     2 |      231 |         2
(1 row)

 itemoffset |  ctid   | itemlen | nulls | vars |                                  data
 

------------+---------+---------+-------+------+-------------------------------------------------------------------------
          1 | (71,1)  |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 31 39 30 36 36 00
00
          2 | (1,1)   |       8 | f     | f    |
          3 | (229,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 31 30 36 00
00
          4 | (228,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 32 32 36 00
00
          5 | (227,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 33 34 36 00
00
          6 | (226,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 34 36 36 00
00
          7 | (225,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 35 38 36 00
00
          8 | (224,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 37 30 36 00
00
          9 | (223,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 38 32 36 00
00
(9 rows)

--  Visible tuples (name = HIKEY)
explain select * from test where name = 'PostgreSQL-0000000147';
                               QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=9.14..173.19 rows=114 width=32)
   Recheck Cond: (name = 'PostgreSQL-0000000147'::text)
   ->  Bitmap Index Scan on id_test  (cost=0.00..9.11 rows=114 width=0)
         Index Cond: (name = 'PostgreSQL-0000000147'::text)
(4 rows)

select * from test where name = 'PostgreSQL-0000000147';
         name
-----------------------
 PostgreSQL-0000000147
(1 row)

-- Still visible tuples (name = HIKEY)
explain select * from test where name = 'PostgreSQL-0000000146';
                               QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=9.14..173.19 rows=114 width=32)
   Recheck Cond: (name = 'PostgreSQL-0000000146'::text)
   ->  Bitmap Index Scan on id_test  (cost=0.00..9.11 rows=114 width=0)
         Index Cond: (name = 'PostgreSQL-0000000146'::text)
(4 rows)

select * from test where name = 'PostgreSQL-0000000146';
         name
-----------------------
 PostgreSQL-0000000146
(1 row)

-- They are vislble when no indexes are used
explain select * from test where name like '%PostgreSQL-0000000146';
                      QUERY PLAN
-------------------------------------------------------
 Seq Scan on test  (cost=0.00..458.93 rows=1 width=32)
   Filter: (name ~~ '%PostgreSQL-0000000146'::text)
(2 rows)

select * from test where name like '%PostgreSQL-0000000146';
         name
-----------------------
 PostgreSQL-0000000146
(1 row)

###########################################################
# Server crashed
###########################################################
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start... done
server started
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |  231 |     2 |      231 |         2
(1 row)

 itemoffset |  ctid   | itemlen | nulls | vars |                                  data
 

------------+---------+---------+-------+------+-------------------------------------------------------------------------
          1 | (71,1)  |       8 | f     | f    |
          2 | (1,1)   |       8 | f     | f    |
          3 | (229,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 31 30 36 00
00
          4 | (228,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 32 32 36 00
00
          5 | (227,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 33 34 36 00
00
          6 | (226,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 34 36 36 00
00
          7 | (225,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 35 38 36 00
00
          8 | (224,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 37 30 36 00
00
          9 | (223,1) |      32 | f     | t    | 2d 50 6f 73 74 67 72 65 53 51 4c 2d 30 30 30 30 30 30 30 38 32 36 00
00
(9 rows)

-- Visible tuples (name = HIKEY)
explain select * from test where name = 'PostgreSQL-0000000147';
                               QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=9.14..173.19 rows=114 width=32)
   Recheck Cond: (name = 'PostgreSQL-0000000147'::text)
   ->  Bitmap Index Scan on id_test  (cost=0.00..9.11 rows=114 width=0)
         Index Cond: (name = 'PostgreSQL-0000000147'::text)
(4 rows)

select * from test where name = 'PostgreSQL-0000000147';
 name
------
(0 rows)

-- They become invisible! (name < HIKEY)
explain select * from test where name = 'PostgreSQL-0000000146';
                               QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=9.14..173.19 rows=114 width=32)
   Recheck Cond: (name = 'PostgreSQL-0000000146'::text)
   ->  Bitmap Index Scan on id_test  (cost=0.00..9.11 rows=114 width=0)
         Index Cond: (name = 'PostgreSQL-0000000146'::text)
(4 rows)

select * from test where name = 'PostgreSQL-0000000146';
 name
------
(0 rows)

-- Without index, they're still visible.
explain select * from test where name like '%PostgreSQL-0000000146';
                      QUERY PLAN
-------------------------------------------------------
 Seq Scan on test  (cost=0.00..458.93 rows=1 width=32)
   Filter: (name ~~ '%PostgreSQL-0000000146'::text)
(2 rows)

select * from test where name like '%PostgreSQL-0000000146';
         name
-----------------------
 PostgreSQL-0000000146
(1 row)

# -----------------------------
# PostgreSQL configuration file
# -----------------------------
#
# This file consists of lines of the form:
#
#   name = value
#
# (The '=' is optional.)  White space may be used.  Comments are introduced
# with '#' anywhere on a line.  The complete list of option names and
# allowed values can be found in the PostgreSQL documentation.  The
# commented-out settings shown in this file represent the default values.
#
# Please note that re-commenting a setting is NOT sufficient to revert it
# to the default value, unless you restart the server.
#
# Any option can also be given as a command line switch to the server,
# e.g., 'postgres -c log_connections=on'.  Some options can be changed at
# run-time with the 'SET' SQL command.
#
# This file is read on server startup and when the server receives a
# SIGHUP.  If you edit the file on a running system, you have to SIGHUP the
# server for the changes to take effect, or use "pg_ctl reload". Some
# settings, which are marked below, require a server shutdown and restart
# to take effect.
#
# Memory units:  kB = kilobytes MB = megabytes GB = gigabytes
# Time units:    ms = milliseconds s = seconds min = minutes h = hours d = days


#---------------------------------------------------------------------------
# FILE LOCATIONS
#---------------------------------------------------------------------------

# The default values of these variables are driven from the -D command line
# switch or PGDATA environment variable, represented here as ConfigDir.

#data_directory = 'ConfigDir'        # use data in another directory
                    # (change requires restart)
#hba_file = 'ConfigDir/pg_hba.conf'    # host-based authentication file
                    # (change requires restart)
#ident_file = 'ConfigDir/pg_ident.conf'    # ident configuration file
                    # (change requires restart)

# If external_pid_file is not explicitly set, no extra PID file is written.
#external_pid_file = '(none)'        # write an extra PID file
                    # (change requires restart)


#---------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
#---------------------------------------------------------------------------

# - Connection Settings -

#listen_addresses = 'localhost'        # what IP address(es) to listen on;
                    # comma-separated list of addresses;
                    # defaults to 'localhost', '*' = all
                    # (change requires restart)
#port = 5432                # (change requires restart)
max_connections = 100            # (change requires restart)
# Note: increasing max_connections costs ~400 bytes of shared memory per
# connection slot, plus lock space (see max_locks_per_transaction).  You
# might also need to raise shared_buffers to support more connections.
#superuser_reserved_connections = 3    # (change requires restart)
#unix_socket_directory = ''        # (change requires restart)
#unix_socket_group = ''            # (change requires restart)
#unix_socket_permissions = 0777        # octal
                    # (change requires restart)
#bonjour_name = ''            # defaults to the computer name
                    # (change requires restart)

# - Security & Authentication -

#authentication_timeout = 1min        # 1s-600s
#ssl = off                # (change requires restart)
#ssl_ciphers = 'ALL:!ADH:!LOW:!EXP:!MD5:@STRENGTH'    # Allowed SSL ciphers
                    # (change requires restart)
#password_encryption = on
#db_user_namespace = off

# Kerberos and GSSAPI
#krb_server_keyfile = ''        # (change requires restart)
#krb_srvname = 'postgres'        # (change requires restart, kerberos only)
#krb_server_hostname = ''        # empty string matches any keytab entry
                    # (change requires restart, kerberos only)
#krb_caseins_users = off        # (change requires restart)

# - TCP Keepalives -
# see 'man 7 tcp' for details

#tcp_keepalives_idle = 0        # TCP_KEEPIDLE, in seconds;
                    # 0 selects the system default
#tcp_keepalives_interval = 0        # TCP_KEEPINTVL, in seconds;
                    # 0 selects the system default
#tcp_keepalives_count = 0        # TCP_KEEPCNT;
                    # 0 selects the system default


#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

# - Memory -

shared_buffers = 32MB            # min 128kB or max_connections*16kB
                    # (change requires restart)
#temp_buffers = 8MB            # min 800kB
#max_prepared_transactions = 5        # can be 0 or more
                    # (change requires restart)
# Note: increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
#work_mem = 1MB                # min 64kB
#maintenance_work_mem = 16MB        # min 1MB
#max_stack_depth = 2MB            # min 100kB

# - Free Space Map -

max_fsm_pages = 204800            # min max_fsm_relations*16, 6 bytes each
                    # (change requires restart)
#max_fsm_relations = 1000        # min 100, ~70 bytes each
                    # (change requires restart)

# - Kernel Resource Usage -

#max_files_per_process = 1000        # min 25
                    # (change requires restart)
#shared_preload_libraries = ''        # (change requires restart)

# - Cost-Based Vacuum Delay -

#vacuum_cost_delay = 0            # 0-1000 milliseconds
#vacuum_cost_page_hit = 1        # 0-10000 credits
#vacuum_cost_page_miss = 10        # 0-10000 credits
#vacuum_cost_page_dirty = 20        # 0-10000 credits
#vacuum_cost_limit = 200        # 0-10000 credits

# - Background writer -

#bgwriter_delay = 200ms            # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100        # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0        # 0-10.0 multipler on buffers scanned/round


#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

#fsync = on                # turns forced synchronization on or off
#synchronous_commit = on        # immediate fsync at commit
#wal_sync_method = fsync        # the default is the first option
                    # supported by the operating system:
                    #   open_datasync
                    #   fdatasync
                    #   fsync
                    #   fsync_writethrough
                    #   open_sync
#full_page_writes = on            # recover from partial page writes
full_page_writes = off            # recover from partial page writes
#wal_buffers = 64kB            # min 32kB
                    # (change requires restart)
#wal_writer_delay = 200ms        # range 1-10000, in milliseconds

#commit_delay = 0            # range 0-100000, in microseconds
#commit_siblings = 5            # range 1-1000

# - Checkpoints -

#checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min        # range 30s-1h
#checkpoint_completion_target = 0.5    # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s        # 0 is off

# - Archiving -

#archive_mode = off        # allows archiving to be done
                # (change requires restart)
archive_mode = on
#archive_command = ''        # command to use to archive a logfile segment
#archive_command = 'cp "%p" /home/lesslog/archlog/"%f"'
archive_command = 'cp "%p" /home/lesslog/lesslog/archlog/cp/"%f" && pg_compresslog "%p"
/home/lesslog/lesslog/archlog/cmp/"%f"'
#archive_timeout = 0        # force a logfile segment switch after this
                # many seconds; 0 is off


#---------------------------------------------------------------------------
# QUERY TUNING
#---------------------------------------------------------------------------

# - Planner Method Configuration -

#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on

# - Planner Cost Constants -

#seq_page_cost = 1.0            # measured on an arbitrary scale
#random_page_cost = 4.0            # same scale as above
#cpu_tuple_cost = 0.01            # same scale as above
#cpu_index_tuple_cost = 0.005        # same scale as above
#cpu_operator_cost = 0.0025        # same scale as above
#effective_cache_size = 128MB

# - Genetic Query Optimizer -

#geqo = on
#geqo_threshold = 12
#geqo_effort = 5            # range 1-10
#geqo_pool_size = 0            # selects default based on effort
#geqo_generations = 0            # selects default based on effort
#geqo_selection_bias = 2.0        # range 1.5-2.0

# - Other Planner Options -

#default_statistics_target = 10        # range 1-1000
#constraint_exclusion = off
#from_collapse_limit = 8
#join_collapse_limit = 8        # 1 disables collapsing of explicit
                    # JOINs


#---------------------------------------------------------------------------
# ERROR REPORTING AND LOGGING
#---------------------------------------------------------------------------

# - Where to Log -

#log_destination = 'stderr'        # Valid values are combinations of
                    # stderr, csvlog, syslog and eventlog,
                    # depending on platform.
                    # csvlog requires logging_collector to be on

# This is used when logging to stderr:
#logging_collector = off        # Enable capturing of stderr and csvlog
                    # into log files. Required to be on for csvlogs.
                    # (change requires restart)
logging_collector = on

# These are only used if logging_collector is on:
#log_directory = 'pg_log'        # Directory where log files are written
                    # Can be absolute or relative to PGDATA
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # Log file name pattern.
                    # Can include strftime() escapes
log_filename = 'postgresql-%Y-%m-%d.log'
#log_truncate_on_rotation = off # If on, any existing log file of the same
                    # name as the new log file will be
                    # truncated rather than appended to. But
                    # such truncation only occurs on
                    # time-driven rotation, not on restarts
                    # or size-driven rotation. Default is
                    # off, meaning append to existing files
                    # in all cases.
#log_rotation_age = 1d            # Automatic rotation of logfiles will
                    # happen after that time.  0 to
                    # disable.
#log_rotation_size = 10MB        # Automatic rotation of logfiles will
                    # happen after that much log
                    # output.  0 to disable.

# These are relevant when logging to syslog:
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'


# - When to Log -

#client_min_messages = notice        # Values in order of decreasing detail:
                    #   debug5
                    #   debug4
                    #   debug3
                    #   debug2
                    #   debug1
                    #   log
                    #   notice
                    #   warning
                    #   error

#log_min_messages = notice        # Values in order of decreasing detail:
log_min_messages = debug1
                    #   debug5
                    #   debug4
                    #   debug3
                    #   debug2
                    #   debug1
                    #   info
                    #   notice
                    #   warning
                    #   error
                    #   log
                    #   fatal
                    #   panic

#log_error_verbosity = default        # terse, default, or verbose messages

#log_min_error_statement = error    # Values in order of decreasing detail:
                     #   debug5
                    #   debug4
                    #   debug3
                    #   debug2
                    #   debug1
                     #   info
                    #   notice
                    #   warning
                    #   error
                    #   log
                    #   fatal
                    #   panic (effectively off)

#log_min_duration_statement = -1    # -1 is disabled, 0 logs all statements
                    # and their durations, > 0 logs only
                    # statements running at least N msec.


#silent_mode = off            # DO NOT USE without syslog or
                    # logging_collector
                    # (change requires restart)

# - What to Log -

#debug_print_parse = off
#debug_print_rewritten = off
#debug_print_plan = off
#debug_pretty_print = off
#log_checkpoints = off
#log_connections = off
#log_disconnections = off
#log_duration = off
#log_hostname = off
#log_line_prefix = ''            # Special values:
                    #   %u = user name
                    #   %d = database name
                    #   %r = remote host and port
                    #   %h = remote host
                    #   %p = PID
                    #   %t = timestamp (no milliseconds)
                    #   %m = timestamp with milliseconds
                    #   %i = command tag
                    #   %c = session id
                    #   %l = session line number
                    #   %s = session start timestamp
                    #   %v = virtual transaction id
                    #   %x = transaction id (0 if none)
                    #   %q = stop here in non-session
                    #        processes
                    #   %% = '%'
                    # e.g. '<%u%%%d> '
log_line_prefix = '%t '
#log_lock_waits = off            # Log lock waits >= deadlock_timeout
#log_statement = 'none'            # none, ddl, mod, all
#log_temp_files = -1            # Log temporary files equal or larger
                    # than specified number of kilobytes.
                    # -1 disables;  0 logs all temp files
#log_timezone = unknown            # actually, defaults to TZ
                    # environment setting

#---------------------------------------------------------------------------
# RUNTIME STATISTICS
#---------------------------------------------------------------------------

# - Query/Index Statistics Collector -

#track_activities = on
#track_counts = on
#update_process_title = on


# - Statistics Monitoring -

#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off


#---------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#---------------------------------------------------------------------------

#autovacuum = on            # enable autovacuum subprocess?
#autovacuum = off
                    # 'on' requires track_counts
                    # to also be on
#log_autovacuum_min_duration = -1    # -1 disables, 0 logs all actions
                    # and their durations, > 0 logs only
                    # actions running at least N msec.
#autovacuum_max_workers = 3        # max # of autovacuum subprocesses
#autovacuum_naptime = 1min        # time between autovacuum runs
#autovacuum_vacuum_threshold = 50    # min # of tuple updates before
                    # vacuum
#autovacuum_analyze_threshold = 50    # min # of tuple updates before
                    # analyze
#autovacuum_vacuum_scale_factor = 0.2    # fraction of rel size before
                    # vacuum
#autovacuum_analyze_scale_factor = 0.1    # fraction of rel size before
                    # analyze
#autovacuum_freeze_max_age = 200000000    # maximum XID age before forced vacuum
                    # (change requires restart)
#autovacuum_vacuum_cost_delay = 20    # default vacuum cost delay for
                    # autovacuum, -1 means use
                    # vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1    # default vacuum cost limit for
                    # autovacuum, -1 means use
                    # vacuum_cost_limit


#---------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#---------------------------------------------------------------------------

# - Statement Behavior -

#search_path = '"$user",public'        # schema names
#default_tablespace = ''        # a tablespace name, '' uses
                    # the default
#temp_tablespaces = ''            # a list of tablespace names,
                    # '' uses only default tablespace
#check_function_bodies = on
#default_transaction_isolation = 'read committed'
#default_transaction_read_only = off
#statement_timeout = 0            # 0 is disabled
#session_replication_role = "origin"
#vacuum_freeze_min_age = 100000000
#xmlbinary = 'base64'
#xmloption = 'content'

# - Locale and Formatting -

datestyle = 'iso, mdy'
#timezone = unknown            # actually, defaults to TZ
                    # environment setting
#timezone_abbreviations = 'Default'     # select the set of available timezone
                    # abbreviations. Currently, there are
                    #   Default
                    #   Australia
                    #   India
                    # However you can also create your own
                    # file in share/timezonesets/.
#extra_float_digits = 0            # min -15, max 2
#client_encoding = sql_ascii        # actually, defaults to database
                    # encoding

# These settings are initialized by initdb -- they might be changed
lc_messages = 'C'            # locale for system error message
                    # strings
lc_monetary = 'C'            # locale for monetary formatting
lc_numeric = 'C'            # locale for number formatting
lc_time = 'C'                # locale for time formatting

# default configuration for text search
default_text_search_config = 'pg_catalog.english'

# - Other Defaults -

#explain_pretty_print = on
#dynamic_library_path = '$libdir'
#local_preload_libraries = ''


#---------------------------------------------------------------------------
# LOCK MANAGEMENT
#---------------------------------------------------------------------------

#deadlock_timeout = 1s
#max_locks_per_transaction = 64        # min 10
                    # (change requires restart)
# Note: each lock table slot uses ~270 bytes of shared memory, and there are
# max_locks_per_transaction * (max_connections + max_prepared_transactions)
# lock table slots.


#---------------------------------------------------------------------------
# VERSION/PLATFORM COMPATIBILITY
#---------------------------------------------------------------------------

# - Previous Postgres Versions -

#add_missing_from = off
#array_nulls = on
#backslash_quote = safe_encoding    # on, off, or safe_encoding
#default_with_oids = off
#escape_string_warning = on
#standard_conforming_strings = off
#regex_flavor = advanced        # advanced, extended, or basic
#sql_inheritance = on

# - Other Platforms & Clients -

#transform_null_equals = off


#---------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#---------------------------------------------------------------------------

#custom_variable_classes = ''        # list of custom variable class names

Attachment

Re: B-tree crash recovery error in 8.3 beta 2

From
Heikki Linnakangas
Date:
Koichi Suzuki wrote:
> I've found that B-tree crash recovery in 8.3 beta2 could make some
> tuples invisible through B-tree.  They're visible if we read using but
> Seq-Scan.   This happens in 8.3 beta2, but not in 8.2.4.  Here's how it
> happens.
>
> 1. Create b-tree for a text type column.
> 2. Make B-tree three-story, that is, root-intermediate-leaf.  Insert
> tuples sufficient to construct such B-tree.
> 3. No checkpoint should occur during 2.
> 4. Kill postmaster.
> 5. Restart postmaster.   Crash recovery will be done.
> 6. Tuples with column values less than HIKEY becomes invisible through
> Idx-scan, still visible through Seq-scan.
>
> From the dump of B-tree, it seems that HIKEY value is cleared (only
> tuple header is left).  No problem was found in the case of integer or
> numeric type columns.
>
> Attached is the shell script, postgresql.conf (almost the default one)
> to reproduce the problem, and the log of the problem reproduction.

Thanks for the excellent reproducer script!

There seems to be a bug in the B-tree split WAL reduction patch from
Februrary. On split, we copy the HIKEY of the left page from the
leftmost item on the right page, but that doesn't work because the
leftmost key is not stored on intermediate levels.

Patch attached that stores the high key explicitly in the WAL record on
intermediate levels.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/nbtree/nbtinsert.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v
retrieving revision 1.161
diff -c -r1.161 nbtinsert.c
*** src/backend/access/nbtree/nbtinsert.c    15 Nov 2007 21:14:32 -0000    1.161
--- src/backend/access/nbtree/nbtinsert.c    16 Nov 2007 11:03:46 -0000
***************
*** 1004,1010 ****
          xl_btree_split xlrec;
          uint8        xlinfo;
          XLogRecPtr    recptr;
!         XLogRecData rdata[6];
          XLogRecData *lastrdata;

          xlrec.node = rel->rd_node;
--- 1004,1010 ----
          xl_btree_split xlrec;
          uint8        xlinfo;
          XLogRecPtr    recptr;
!         XLogRecData rdata[7];
          XLogRecData *lastrdata;

          xlrec.node = rel->rd_node;
***************
*** 1020,1034 ****

          lastrdata = &rdata[0];

-         /* Log downlink on non-leaf pages. */
          if (ropaque->btpo.level > 0)
          {
              lastrdata->next = lastrdata + 1;
              lastrdata++;

              lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
              lastrdata->len = sizeof(BlockIdData);
              lastrdata->buffer = InvalidBuffer;
          }

          /*
--- 1020,1044 ----

          lastrdata = &rdata[0];

          if (ropaque->btpo.level > 0)
          {
+             /* Log downlink */
              lastrdata->next = lastrdata + 1;
              lastrdata++;

              lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
              lastrdata->len = sizeof(BlockIdData);
              lastrdata->buffer = InvalidBuffer;
+
+             /* Log high key of left page */
+             lastrdata->next = lastrdata + 1;
+             lastrdata++;
+
+             itemid = PageGetItemId(origpage, P_HIKEY);
+             item = (IndexTuple) PageGetItem(origpage, itemid);
+             lastrdata->data = (char *) item;
+             lastrdata->len = IndexTupleSize(item);
+             lastrdata->buffer = InvalidBuffer;
          }

          /*
Index: src/backend/access/nbtree/nbtxlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtxlog.c,v
retrieving revision 1.47
diff -c -r1.47 nbtxlog.c
*** src/backend/access/nbtree/nbtxlog.c    15 Nov 2007 21:14:32 -0000    1.47
--- src/backend/access/nbtree/nbtxlog.c    16 Nov 2007 10:59:59 -0000
***************
*** 273,278 ****
--- 273,280 ----
      OffsetNumber newitemoff = 0;
      Item        newitem = NULL;
      Size        newitemsz = 0;
+     Item        left_hikey = NULL; /* only 16-bit aligned */
+     Size        left_hikeysz = 0;

      reln = XLogOpenRelation(xlrec->node);

***************
*** 289,294 ****
--- 291,303 ----
          datalen -= sizeof(BlockIdData);

          forget_matching_split(xlrec->node, downlink, false);
+
+         /* Extract left hikey and its size (still assuming 16-bit alignment) */
+         left_hikey = (Item) datapos;
+         left_hikeysz = IndexTupleSize(left_hikey);
+
+         datapos += left_hikeysz;
+         datalen -= left_hikeysz;
      }

      /* Extract newitem and newitemoff, if present */
***************
*** 333,338 ****
--- 342,357 ----

      _bt_restore_page(rpage, datapos, datalen);

+     /* On leaf level, the high key of the left page is equal to the
+      * first key on the right page.
+      */
+     if (xlrec->level == 0)
+     {
+         ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
+         left_hikey = PageGetItem(rpage, hiItemId);
+         left_hikeysz = ItemIdGetLength(hiItemId);
+     }
+
      PageSetLSN(rpage, lsn);
      PageSetTLI(rpage, ThisTimeLineID);
      MarkBufferDirty(rbuf);
***************
*** 360,367 ****
                  OffsetNumber maxoff = PageGetMaxOffsetNumber(lpage);
                  OffsetNumber deletable[MaxOffsetNumber];
                  int            ndeletable = 0;
-                 ItemId        hiItemId;
-                 Item        hiItem;

                  /*
                   * Remove the items from the left page that were copied to the
--- 379,384 ----
***************
*** 394,404 ****
                          elog(PANIC, "failed to add new item to left page after split");
                  }

!                 /* Set high key equal to the first key on the right page */
!                 hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
!                 hiItem = PageGetItem(rpage, hiItemId);
!
!                 if (PageAddItem(lpage, hiItem, ItemIdGetLength(hiItemId),
                                  P_HIKEY, false, false) == InvalidOffsetNumber)
                      elog(PANIC, "failed to add high key to left page after split");

--- 411,418 ----
                          elog(PANIC, "failed to add new item to left page after split");
                  }

!                 /* Set high key */
!                 if (PageAddItem(lpage, left_hikey, left_hikeysz,
                                  P_HIKEY, false, false) == InvalidOffsetNumber)
                      elog(PANIC, "failed to add high key to left page after split");

Index: src/include/access/nbtree.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v
retrieving revision 1.114
diff -c -r1.114 nbtree.h
*** src/include/access/nbtree.h    15 Nov 2007 21:14:42 -0000    1.114
--- src/include/access/nbtree.h    16 Nov 2007 10:59:59 -0000
***************
*** 289,294 ****
--- 289,298 ----
       * than BlockNumber for alignment reasons: SizeOfBtreeSplit is only 16-bit
       * aligned.)
       *
+      * If level > 0, IndexTuple representing the HIKEY of the left page
+      * follows. We don't need it on leaf pages, because it's the same
+      * as the leftmost key on the new right page.
+      *
       * In the _L variants, next are OffsetNumber newitemoff and the new item.
       * (In the _R variants, the new item is one of the right page's tuples.)
       *

Re: B-tree crash recovery error in 8.3 beta 2

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> There seems to be a bug in the B-tree split WAL reduction patch from
> Februrary. On split, we copy the HIKEY of the left page from the
> leftmost item on the right page, but that doesn't work because the
> leftmost key is not stored on intermediate levels.
> Patch attached that stores the high key explicitly in the WAL record on
> intermediate levels.

Applied with revisions --- mostly, being safe about alignment issues.
You wouldn't notice that on an Intel CPU, but some machines are
pickier ...

            regards, tom lane