Re: GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information) - Mailing list pgsql-hackers

From Amit Langote
Subject Re: GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information)
Msg-id CA+HiwqGO9RM5ak2kVMTjbYKNthf5oEE7TM3cM_zY1uVWmG8iYg@mail.gmail.com
In response to GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information)  (Amit Langote <amitlangote09@gmail.com>)
List pgsql-hackers
On Wed, Jan 22, 2014 at 9:12 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 01/22/2014 03:39 AM, Tomas Vondra wrote:
>>
>> What annoys me a bit is the huge size difference between the index
>> updated incrementally (by a sequence of INSERT commands), and the index
>> rebuilt from scratch using VACUUM FULL. It's a bit better with the patch
>> (2288 vs. 2035 MB), but is there a chance to improve this?
>
>
> Hmm. What seems to be happening is that pending item list pages that the
> fast update mechanism uses are not getting recycled. When enough list pages
> are filled up, they are flushed into the main index and the list pages are
> marked as deleted. But they are not recorded in the FSM, so they won't be
> recycled until the index is vacuumed. Almost all of the difference can be
> attributed to deleted pages left behind like that.
>
> So this isn't actually related to the packed postinglists patch at all. It
> just makes the bloat more obvious, because it makes the actual size of the
> index, excluding deleted pages, smaller. But it can be observed on git
> master as well:
>
> I created a simple test table and index like this:
>
> create table foo (intarr int[]);
> create index i_foo on foo using gin(intarr) with (fastupdate=on);
>
> I filled the table like this:
>
> insert into foo select array[-1] from generate_series(1, 10000000) g;
>
> postgres=# \d+i
>                    List of relations
>  Schema | Name | Type  | Owner  |  Size  | Description
> --------+------+-------+--------+--------+-------------
>  public | foo  | table | heikki | 575 MB |
> (1 row)
>
> postgres=# \di+
>                        List of relations
>  Schema | Name  | Type  | Owner  | Table |  Size  | Description
> --------+-------+-------+--------+-------+--------+-------------
>  public | i_foo | index | heikki | foo   | 251 MB |
> (1 row)
>
> I wrote a little utility that scans all pages in a gin index, and prints out
> the flags indicating what kind of a page it is. The distribution looks like
> this:
>
>      19 DATA
>    7420 DATA LEAF
>   24701 DELETED
>       1 LEAF
>       1 META
>
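For readers following along: the utility itself is not shown in this message, so the following is only a rough sketch of how such a per-page flag tally could be produced once the flags word of each page is available. The bit values are the GIN page flag bits as I recall them from ginblock.h and should be checked against the source tree; flag_names() and tally() are hypothetical helper names, and reading the pages from the index is left out entirely.

```python
from collections import Counter

# Assumed GIN page flag bits (believed to match ginblock.h; verify before use).
GIN_FLAGS = [
    (1 << 0, "DATA"),
    (1 << 1, "LEAF"),
    (1 << 2, "DELETED"),
    (1 << 3, "META"),
    (1 << 4, "LIST"),
    (1 << 5, "LIST_FULLROW"),
]


def flag_names(flags):
    """Turn a flags bitmask into a label such as 'DATA LEAF'."""
    names = [name for bit, name in GIN_FLAGS if flags & bit]
    return " ".join(names) if names else "(none)"


def tally(page_flags):
    """Count pages per flag combination, given one flags word per page."""
    return Counter(flag_names(f) for f in page_flags)


if __name__ == "__main__":
    # Made-up input, not the real index from the message.
    sample = [0b0011, 0b0011, 0b0100, 0b1000]
    for label, count in sorted(tally(sample).items()):
        print(f"{count:7d} {label}")
```

With real data, a large DELETED count relative to the other rows is the symptom described in the quoted message.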
> I think we need to add the deleted pages to the FSM more aggressively.
>
> I tried simply adding calls to RecordFreeIndexPage, after the list pages
> have been marked as deleted, but unfortunately that didn't help. The problem
> is that the FSM is organized into a three-level tree, and
> RecordFreeIndexPage only updates the bottom level. The upper levels are not
> updated until the FSM is vacuumed, so the pages are still not visible to
> GetFreeIndexPage calls until next vacuum. The simplest fix would be to add a
> call to IndexFreeSpaceMapVacuum after flushing the pending list, per
> attached patch. I'm slightly worried about the performance impact of the
> IndexFreeSpaceMapVacuum() call. It scans the whole FSM of the index, which
> isn't exactly free. So perhaps we should teach RecordFreeIndexPage to update
> the upper levels of the FSM in a retail-fashion instead.
>
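To make the FSM behaviour described above concrete, here is a toy model. It is an assumption-laden illustration, not PostgreSQL code: the real FSM has three levels and tracks free-space categories, while this sketch uses two levels and a free/not-free bit. The method names record_free, record_free_retail, vacuum and get_free are hypothetical stand-ins for the roles played by RecordFreeIndexPage, the proposed retail update, IndexFreeSpaceMapVacuum and GetFreeIndexPage.

```python
class ToyFSM:
    """Two-level toy free space map: leaf groups plus one root summary."""

    def __init__(self, n_groups, group_size):
        self.group_size = group_size
        self.leaves = [[False] * group_size for _ in range(n_groups)]
        # root[i] is True when group i is believed to contain a free page.
        self.root = [False] * n_groups

    def record_free(self, blkno):
        """Mark a page free in the bottom level only; the root is not touched."""
        g, s = divmod(blkno, self.group_size)
        self.leaves[g][s] = True

    def record_free_retail(self, blkno):
        """Mark a page free and also update the one root entry above it."""
        self.record_free(blkno)
        self.root[blkno // self.group_size] = True

    def vacuum(self):
        """Rebuild the root by scanning every leaf group (cost grows with FSM size)."""
        self.root = [any(group) for group in self.leaves]

    def get_free(self):
        """Search top-down from the root; return a free block number or None."""
        for g, has_free in enumerate(self.root):
            if not has_free:
                continue
            group = self.leaves[g]
            for s, free in enumerate(group):
                if free:
                    group[s] = False
                    self.root[g] = any(group)
                    return g * self.group_size + s
            self.root[g] = False  # summary was stale; correct it
        return None


if __name__ == "__main__":
    fsm = ToyFSM(n_groups=4, group_size=8)
    fsm.record_free(10)
    print(fsm.get_free())   # the search starts at the root and finds nothing
    fsm.vacuum()
    print(fsm.get_free())   # after the full rebuild the page is visible
    fsm.record_free_retail(17)
    print(fsm.get_free())   # retail update makes the page visible immediately
```

The trade-off in the sketch mirrors the one raised above: vacuum() touches every group, whereas record_free_retail() touches only the entry above the page being freed.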

I wonder if you pursued this further?

You recently added a number of TODO items related to GIN indexes; is it
worth adding this to the list?

--
Amit


