Re: Fast insertion indexes: why no developments - Mailing list pgsql-hackers

From Leonardo Francalanci
Subject Re: Fast insertion indexes: why no developments
Date
Msg-id 1384345889267-5778124.post@n5.nabble.com
Whole thread Raw
In response to Re: Fast insertion indexes: why no developments  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Fast insertion indexes: why no developments  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Simon Riggs wrote
> So in the use case you describe, the min max index would require a
> scan of only 25% of the table, not the 80% described earlier for
> random inserts. In my experience, people wish to keep data for much
> longer periods and so the percentage of scan required would drop lower
> than 25%, possibly to 5% or less for many applications.
> 
> The plan would use sequential I/O so could still work quickly; given
> the low read rate, longer query times could be acceptable.


"Quickly"??? Seq scan 25% of a table of TB is not "quick".



Simon Riggs wrote
> Minmax indexes are simply a way to make this use case happen
> automatically, without the need for manual partitioning of the table.

You logic assumes that we don't index anything but the call_timestamp.
That would lead to huge query response times.
Plus: btree doesn't have a big problem to keep up in sequential insertion
scenario (such as the call_timestamp). So I still don't see the gain
in using the minmax indexes: again, could you point me to some
performance tests of *any* use case?


Simon Riggs wrote
> They are not the answer to every prayer, but with respect they are
> better than you had claimed they would be. (25% not 80%, in your use
> case). I saw this was likely to be the case and this is why I
> challenged you to describe in more detail. Thank you.

I claimed they would scan the 80% of the table because I assumed I
had to use them in the random fields; not in the call_timestamp field.
I don't need a better index in the call_timestamp, because it's sequential,
I don't have problems with that. But it's useless: I don't want to seq scan
25% of a multi-TB table.



Simon Riggs wrote
> Performance tests are only possible with a clear use case. 

Well, so I can add my weird_index patch in postgresql source code,
and it would be committed right away??? I assumed you had
to prove somehow that the new index was better than what
is already available, at least for some cases.

Or, in other words: what are you going to write in the minmax index
documentation, "try and see if they work better for you"?


Simon Riggs wrote
> Please see that Alvaro and I have gone out of our way to provide a new
> facility to help you and others, and that it requires changing how we
> think about the solution. I accept it may not provide for every case
> but it requires careful analysis before deciding that is so.

If I came out too rough, I ask your pardon. I always appreciate people
taking their time to help someone else for free. 

Plus, I'm *very* interested in the minmax index, especially for
call_timestamp (some queries are date-range only, such as "give me
all the calls in this particular 2 secs range) or the id column I have.
But, at the same time, I don't see any evidence that they work
better than btrees (except for the size of the index). 
I would like to see some numbers.

I worked a little in the bitmap index implementation, and I stopped because
on the large tables these indexes are supposed to be used, the heap
lookup took so much time that the (slightly) faster index access didn't
really help, because it was a fraction of the query time... I'm afraid
it would be the same with minmax indexes, that's why I wanted to 
see some numbers...


Simon Riggs wrote
>> And, again, I think that random values insertion is the worst
>> use case for minmax indexes.
> 
> I agree, it is. What we have disagreed on is the extent to which that
> scenario exists for use cases on very large tables, which are
> typically "append-mostly" with most queries accessing a subset of the
> table, e.g. date range.

Mhh... maybe this is this point we don't understand each other?

I query the table by userid + date range.
The date range is *not* selective enough (it's, as you said, 25%
of the multi-TB table). The userid is selective *a lot*. 

I'm looking for a "better" index for the userid column(s).

The "new" indexes I mentioned in the OP claim they are better
in this scenario (but I don't blindly believe them....)

I don't see how indexing the call_timestamp only could be
of any interest, since it would require seq-scanning 25%
of a huge table for every query.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Fast-insertion-indexes-why-no-developments-tp5776227p5778124.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: GIN bugs in master branch
Next
From: Leonardo Francalanci
Date:
Subject: Re: Fast insertion indexes: why no developments