Re: GIN improvements part 1: additional information - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: GIN improvements part 1: additional information |
Date | |
Msg-id | 51CEA13C.8040103@vmware.com |
In response to | Re: GIN improvements part 1: additional information (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: GIN improvements part 1: additional information
Re: GIN improvements part 1: additional information |
List | pgsql-hackers |
On 25.06.2013 01:03, Alexander Korotkov wrote:
> New revision of patch is attached. Now it includes some docs.

Thanks! I'm looking into this in detail now.

First, this patch actually contains two major things:

1. Pack item pointers more tightly on posting data leaf pages.
2. Allow an opclass implementation to attach "additional information" to each item pointer.

These are two very distinct features, so this patch needs to be split into two. I extracted the 1st part into a separate patch, attached, and am going to focus on that now.

I made one significant change: I removed the 'freespace' field you added to GinPageOpaque. Instead, on data leaf pages the offset from the beginning of the page where the packed items end is stored in place of the 'maxoff' field. This allows for quick calculation of the free space, but there is no count of item pointers stored on the page anymore, so some code that looped through all the item pointers relying on 'maxoff' had to be changed to work with the end offset instead. I'm not 100% wedded to this, but I'd like to avoid adding the redundant freespace field on pages that don't need it, because it's confusing and you have to remember to keep them in sync.

The patch needs a lot of cleanup still, and I may well have broken some stuff, but I'm quite pleased with the performance.

I tested this with two tables; one is the titles from the DBLP dataset.
Another is integer arrays, created with this:

create function randomintarr() returns int[] as $$
  select array_agg((random() * 1000.0)::int4) from generate_series(1,10)
$$ language sql;

create table intarrtbl as
  select randomintarr() as ii from generate_series(1, 10000000);

The effect on the index sizes is quite dramatic:

postgres=# \di+
                          List of relations
 Schema |        Name        | Type  | Owner  |    Table    |  Size  |
--------+--------------------+-------+--------+-------------+--------+
 public | gin_intarr_master  | index | heikki | intarrtbl   | 585 MB |
 public | gin_intarr_patched | index | heikki | intarrtbl   | 211 MB |
 public | gin_title          | index | heikki | dblp_titles | 93 MB  |
 public | gin_title_master   | index | heikki | dblp_titles | 180 MB |
(4 rows)

Tomas Vondra tested the search performance of an earlier version of this patch: http://www.postgresql.org/message-id/50BFF89A.7080908@fuzzy.cz. He initially saw a huge slowdown, but could not reproduce it with a later version of the patch. I did not see much difference in a few quick queries I ran, so we're probably good on that front.

There are a few open questions:

1. How are we going to handle pg_upgrade? It would be nice to be able to read the old page format, or convert on-the-fly. OTOH, if it gets too complicated, it might not be worth it. The indexes are much smaller with the patch, so anyone using GIN probably wants to rebuild them anyway, sooner or later. Still, I'd like to give it a shot.

2. The patch introduces a small fixed 32-entry index into the packed items. Is that an optimal number?

3. I'd like to see some performance testing of insertions, deletions, and vacuum. I suspect that maintaining the 32-entry index might be fairly expensive, as it's rewritten on every update to a leaf page.

- Heikki
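
To make the first feature more concrete, here is a minimal Python sketch of one common way to pack a sorted list of item pointers tightly: store the difference from the previous pointer, encoded with a variable number of bytes. This is only an assumption for illustration; the message does not describe the on-page encoding the patch actually uses, and every name below (`pack`, `unpack`, `free_space`) is hypothetical. The sketch also shows the consequence mentioned above: with no item count stored, a reader decodes until it reaches the end offset, and free space follows directly from that offset.

```python
from typing import List, Tuple

# An item pointer is modelled as (block number, offset number within the block).
ItemPointer = Tuple[int, int]


def to_int(ip: ItemPointer) -> int:
    """Combine block and offset into one integer that sorts like the pair."""
    block, offset = ip
    return (block << 16) | offset


def from_int(value: int) -> ItemPointer:
    """Inverse of to_int."""
    return (value >> 16, value & 0xFFFF)


def encode_varbyte(value: int) -> bytes:
    """Encode a non-negative integer 7 bits at a time, low bits first.
    The high bit of each byte says whether another byte follows."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def pack(items: List[ItemPointer]) -> bytes:
    """Pack sorted item pointers as varbyte-encoded deltas."""
    out = bytearray()
    prev = 0
    for ip in items:
        cur = to_int(ip)
        if cur < prev:
            raise ValueError("item pointers must be sorted")
        out += encode_varbyte(cur - prev)
        prev = cur
    return bytes(out)


def unpack(data: bytes) -> List[ItemPointer]:
    """Decode until the end of the data. No item count is needed,
    only the end offset (here: len(data))."""
    items: List[ItemPointer] = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # more bytes belong to this delta
        else:
            prev += value       # delta complete: rebuild the pointer
            items.append(from_int(prev))
            value = 0
            shift = 0
    return items


def free_space(end_offset: int, page_size: int = 8192) -> int:
    """Simplified assumption: everything after the end offset is free.
    A real page also reserves space at its end, which is ignored here."""
    return page_size - end_offset


if __name__ == "__main__":
    pointers = [(0, 1), (0, 2), (0, 7), (3, 1), (1000, 5)]
    packed = pack(pointers)
    print(len(packed), "bytes packed vs", 6 * len(pointers), "bytes unpacked")
    print(unpack(packed) == pointers)
```

Small deltas between neighbouring pointers take a single byte instead of the six bytes of an uncompressed item pointer, which is the kind of saving that would shrink densely populated posting lists.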