Re: Wrong value in metapage of GIN INDEX. - Mailing list pgsql-hackers

From keisuke kuroda
Subject Re: Wrong value in metapage of GIN INDEX.
Msg-id CANDwggLfKsT28YhcoR6k3N-k1Aw887vT5-1cUXZ0XKPBvHw4Rw@mail.gmail.com
In response to Wrong value in metapage of GIN INDEX.  ("Moon, Insung" <tsukiwamoon.pgsql@gmail.com>)
List pgsql-hackers
Hi Moon-san.

Thank you for posting.

We are testing the GIN index on the JSONB type.
The default maintenance_work_mem (64MB) was fine in the usual cases.
However, this problem occurs when indexing very large JSONB data.

best regards,
Keisuke Kuroda

On Thu, Aug 29, 2019 at 17:20, Moon, Insung <tsukiwamoon.pgsql@gmail.com> wrote:
Dear Hackers.

Kuroda-san and I are interested in the GIN index and have been testing
various things.
While testing, we found a small bug.
In some cases, the value of nEntries in the metapage is set to the wrong value.

Here is how to reproduce the bug.
=# CREATE EXTENSION IF NOT EXISTS pageinspect;
=# SET maintenance_work_mem TO '1MB';
=# CREATE TABLE foo(i jsonb);
=# INSERT INTO foo(i) SELECT jsonb_build_object('foobar001', i) FROM
generate_series(1, 10000) AS i;

-- Insert the same values again.
=# INSERT INTO foo(i) SELECT jsonb_build_object('foobar001', i) FROM
generate_series(1, 10000) AS i;
-- Create the GIN index.
=# CREATE INDEX foo_idx ON foo USING gin (i jsonb_ops) WITH
(fastupdate=off);


=# SELECT * FROM gin_metapage_info(get_raw_page('foo_idx', 0));
-[ RECORD 1 ]----+-----------
pending_head     | 4294967295
pending_tail     | 4294967295
tail_free_size   | 0
n_pending_pages  | 0
n_pending_tuples | 0
n_total_pages    | 74
n_entry_pages    | 69
n_data_pages     | 4
n_entries        | 20004 <--★
version          | 2

In this example, the nEntries value should be 10001, because the GIN
index stores duplicate values in a single leaf entry (as a posting tree or
posting list).
However, if we look at the nEntries value in the metapage using pageinspect,
it is stored as 20004.
So, let's run VACUUM.


=# VACUUM foo;
=# SELECT * FROM gin_metapage_info(get_raw_page('foo_idx', 0));
-[ RECORD 1 ]----+-----------
pending_head     | 4294967295
pending_tail     | 4294967295
tail_free_size   | 0
n_pending_pages  | 0
n_pending_tuples | 0
n_total_pages    | 74
n_entry_pages    | 69
n_data_pages     | 4
n_entries        | 10001 <--★
version          | 2

After running VACUUM, nEntries changes to the correct value.

The problem is in the ginEntryInsert function. During the table scan
that is performed when creating the GIN index, the ginBuildCallback
function stores each new heap value inside the GinBuildState struct.
In the next step, once the memory used by the GinBuildState struct is
equal to or larger than maintenance_work_mem, the accumulated values
are inserted into the GIN index.
This is done by a function called ginEntryInsert.
ginEntryInsert is supposed to increase the value of nEntries when it
determines that a new entry is being added.
However, ginEntryInsert currently increases the value of nEntries
first, and only afterwards checks whether the same entry already
exists in the GIN index.
That causes the bug.

The patch is very simple.
It increases the value of nEntries only when a non-duplicate GIN
index leaf entry is added.
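
To illustrate the counting problem, here is a minimal toy model in C. It
is only a sketch and not the PostgreSQL source: ToyIndex,
toy_entry_insert and toy_flush are hypothetical names, and the entry
tree is reduced to a boolean array. It keeps two counters side by side,
one incremented before the duplicate check and one incremented only
when a new entry is created.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_KEYS 16

/* Hypothetical stand-in for the GIN entry tree:
 * present[k] is true once key k has a leaf entry. */
typedef struct {
    bool present[MAX_KEYS];
    long n_entries_buggy;   /* incremented before the duplicate check */
    long n_entries_fixed;   /* incremented only when a new entry is created */
} ToyIndex;

/* Toy counterpart of inserting one key during an index build. */
void toy_entry_insert(ToyIndex *idx, int key)
{
    assert(key >= 0 && key < MAX_KEYS);

    /* Counting here is the behaviour described as the bug. */
    idx->n_entries_buggy++;

    if (idx->present[key]) {
        /* Entry already exists: the new items would only be merged
         * into its posting list or posting tree. */
        return;
    }

    /* No match: a new leaf entry is created. */
    idx->present[key] = true;

    /* Counting here is the behaviour described as the fix. */
    idx->n_entries_fixed++;
}

/* Toy counterpart of flushing one memory-bounded batch of keys. */
void toy_flush(ToyIndex *idx, const int *keys, int nkeys)
{
    for (int i = 0; i < nkeys; i++)
        toy_entry_insert(idx, keys[i]);
}
```

If a zero-initialized ToyIndex receives the same five keys in two
separate flushes, n_entries_buggy ends up at 10 while n_entries_fixed
stays at 5. This is the same kind of overcount as in the metapage above
when the build is flushed more than once.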

I worked on finding this bug and on the fix together with Kuroda-san.

Best Regards.
Moon.
