Re: [HACKERS] pgsql 10: hash indexes testing - Mailing list pgsql-hackers

From: Amit Kapila
Subject: Re: [HACKERS] pgsql 10: hash indexes testing
Msg-id: CAA4eK1KKq80BYOc+mcmHcQzV0Mcs3AHGjEEf--TnLaJbkeTgmg@mail.gmail.com
In response to: Re: [HACKERS] pgsql 10: hash indexes testing (AP <ap@zip.com.au>)
List: pgsql-hackers
On Fri, Jul 7, 2017 at 8:22 AM, AP <ap@zip.com.au> wrote:
> On Thu, Jul 06, 2017 at 05:19:59PM +0530, Amit Kapila wrote:
>> I think if you are still in development, it is always advisable to
>> create indexes after the initial bulk load.  That way it will be faster
>> and will take less space, at least in the case of a hash index.
>
> This is a bit of a pickle, actually:
> * if I do have a hash index I'll wind up with a bloated one at some stage
>   that refuses to allow more inserts until the index is re-created
> * if I don't have an index then I'll wind up with a table where I cannot
>   create a hash index because it has too many rows for it to handle
>
> I'm at a bit of a loss as to how to deal with this.
>

I can understand your concerns.  To address the first concern we need to
work on one or more of the following items: (a) make vacuum triggerable
on insert-only workloads (and have it perform index vacuum as well);
(b) provide a separate utility statement/function to squeeze a hash
index; (c) have the database squeeze internally, say after each split,
so that the chances of such a problem are reduced, though that would
come at the cost of a performance reduction in other workloads, so I am
not sure it is advisable.  Among these, (b) is the simplest to do but
may not be convenient for the user.
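
If (b) were exposed as a SQL-callable function, usage could be as simple
as the sketch below (the function name is purely hypothetical; no such
function exists today):

  -- hypothetical utility: compact the overflow chains of a hash index
  SELECT hash_index_squeeze('idx_foo'::regclass);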

To address your second concern, we need to speed up the creation of hash
indexes, which is a relatively big project.  Having said that, I think
in your case this is a one-time operation, so spending the extra time
once might be okay.
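
For example, the load-first pattern is just the following (a minimal
sketch; the table, column, and file names are hypothetical):

  CREATE TABLE datavault (key bytea, link bytea);
  COPY datavault FROM '/path/to/bulk_data.csv' WITH (FORMAT csv);
  -- build the hash index only once the bulk load is done
  CREATE INDEX datavault_link_idx ON datavault USING hash (link);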

>
>> >> As mentioned above, REINDEX might be a better option.  I think for such
>> >> situations we should have some provision to expose the squeeze
>> >> functionality of hash indexes to the user; this would be less costly
>> >> than REINDEX and might serve the purpose for the user.  Hey, can you try some hack in
>> >
>> > Assuming it does help, would this be something one would need to guess
>> > at? "I did a whole bunch of concurrent INSERT-heavy transactions so I
>> > guess I should do a squeeze now"?
>> >
>> > Or could it be figured out programmatically?
>>
>> I think one can refer to free_percent and the number of overflow pages
>> to decide when to perform such a command.  It won't be 100% correct,
>> but we can make a guess.  We can even check the free space in overflow
>> pages with pageinspect to make it more accurate.
>
> Does this take much time? The main reason I'm asking is that this looks
> like something that the db ought to handle underneath (say, as part of
> an autovacuum run), so if there are stats that the index code can
> maintain which the autovacuum (or something) code can then use to
> trigger a cleanup, I think that would be of benefit.
>

Sure, I agree that the database should handle bloat automatically, but
as said above this will be a separate project and may not be
straightforward.
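
In the meantime, a rough way to look at those numbers by hand is below
(a minimal sketch assuming PostgreSQL 10 with the pgstattuple and
pageinspect extensions; the index name and block number are
hypothetical):

  CREATE EXTENSION IF NOT EXISTS pgstattuple;
  CREATE EXTENSION IF NOT EXISTS pageinspect;

  -- overall picture: overflow page count and free space percentage
  SELECT bucket_pages, overflow_pages, free_percent
  FROM pgstathashindex('idx_foo');

  -- drill down: free space on a single bucket/overflow page
  -- (block 0 is the metapage, so start at block 1)
  SELECT free_size
  FROM hash_page_stats(get_raw_page('idx_foo', 1));

If those numbers show the index is badly bloated, REINDEX INDEX remains
the available remedy today.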



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


