Re: Fast insertion indexes: why no developments - Mailing list pgsql-hackers
From: Simon Riggs
Subject: Re: Fast insertion indexes: why no developments
Msg-id: CA+U5nM+MCH0wQ=TZ_VKGW_MPWd-E93=0p8fE2CrdjsgvxZWF6Q@mail.gmail.com
In response to: Re: Fast insertion indexes: why no developments (Leonardo Francalanci <m_lists@yahoo.it>)
Responses: Re: Fast insertion indexes: why no developments
List: pgsql-hackers
On 13 November 2013 06:07, Leonardo Francalanci <m_lists@yahoo.it> wrote:

> The use case is pretty simple.
> Think of it as the NSA, as it would be much easier.
> Collect all calls made/received by any user on a mobile network.
> (In fact, it's something more than calls, so it is really 2-10 rows per call.)
> Keep the data for 20 days.
> That's the insert part.
>
> Query: search calls made/received by the user, using IMSI (caller id)
> or IMEI (phone id). The date range is usually days (the past 4 days,
> from 10 days ago to 5 days ago...).
>
> The result is just a very small percentage of the rows present in the
> table: a single user doesn't call that much! Searches are made by a
> human, so not that many requests per second.
>
> It's not a "write mostly" scenario, it's a 99% write, 1% read scenario.
>
> Problem? Having 4 btree indexes on random values (imsi + imei * 2,
> since we have calling and called) kills insertion performance after a
> while.
> Solution so far? Partition every 15 minutes, create the indexes in bulk.

So in the use case you describe, the minmax index would require a scan of only 25% of the table, not the 80% described earlier for random inserts. In my experience, people wish to keep data for much longer periods, so the percentage scanned would drop below 25%, possibly to 5% or less for many applications. The plan would use sequential I/O, so it could still work quickly; given the low read rate, longer query times could be acceptable.

Minmax indexes are simply a way to make this use case happen automatically, without the need for manual partitioning of the table. They are not the answer to every prayer, but with respect, they are better than you had claimed they would be (25%, not 80%, in your use case). I saw this was likely to be the case, and that is why I challenged you to describe your use case in more detail. Thank you.

> Simon Riggs wrote:
>> If we have a query to show the most recent calls by a particular caller
>>
>>   SELECT *
>>   FROM cdr
>>   WHERE callerid = X
>>   ORDER BY call_timestamp DESC
>>   LIMIT 100
>>
>> then this could potentially be optimised using a minmax index, by
>> traversing the data ranges in call_timestamp order. That is not part
>> of the code in this initial release, since the main use case is for
>> WHERE call_timestamp >= X, or WHERE primarykey = Y.
>
> I don't understand how an index on call_timestamp would help in the
> query above.

The minmax index would cover call_timestamp and would be used to optimise the query. That is not in the current version; it is a later optimisation that I think is possible, having considered your use case and similar ones. It is a similar optimisation to the Merge Append case for partitioned tables.

> Simon Riggs wrote:
>> I don't believe there is a credible business case for running that
>> same query but without the ORDER BY and LIMIT, since it could
>> potentially return gazillions of rows,
>
> Gazillions of rows??? We're talking about calls made/received by one
> user here. How many calls do you make in 10 days???
>
> Simon Riggs wrote:
>> so it isn't surprising at all that it would access a large % of the
>> table.
>
> In fact, the query I use returns a fraction of the table, and only a
> very small number of users get searched.
>
> Simon, you keep on talking about these minmax indexes, and I still
> don't see any reference to performance tests.

Performance tests are only possible with a clear use case. It would be helpful if you could benchmark your own use case, and I would ask others to do the same.
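[As a starting point for such a benchmark, here is a minimal sketch of the scenario discussed in this thread. All table and column names are hypothetical, and the minmax index is written as BRIN, the name under which this index type eventually shipped in PostgreSQL 9.5.]

    -- Hypothetical CDR table modelled on the use case above
    CREATE TABLE cdr (
        call_timestamp  timestamptz NOT NULL,
        caller_imsi     bigint      NOT NULL,
        caller_imei     bigint      NOT NULL,
        called_imsi     bigint      NOT NULL,
        called_imei     bigint      NOT NULL
    );

    -- Append-mostly load: timestamps arrive roughly in order,
    -- the four identifiers are effectively random
    INSERT INTO cdr
    SELECT now() - interval '20 days' + g * interval '1 second',
           (random() * 1e7)::bigint,
           (random() * 1e7)::bigint,
           (random() * 1e7)::bigint,
           (random() * 1e7)::bigint
    FROM generate_series(1, 1000000) AS g;

    -- The four btree indexes said to kill insert performance
    CREATE INDEX ON cdr (caller_imsi);
    CREATE INDEX ON cdr (caller_imei);
    CREATE INDEX ON cdr (called_imsi);
    CREATE INDEX ON cdr (called_imei);

    -- The minmax-style alternative on the naturally ordered column
    CREATE INDEX ON cdr USING brin (call_timestamp);

    -- The query pattern from the thread: one subscriber, a few days' range
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM cdr
    WHERE caller_imsi = 12345
      AND call_timestamp >= now() - interval '10 days'
      AND call_timestamp <  now() - interval '5 days';

[The interesting comparison is the sustained insert rate with the four btree indexes present versus with only the BRIN index, and how much of the table the date-range restriction lets the BRIN scan skip.]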
I have challenged, and will challenge, people who simply assert that this new type of index is not useful based on a generic argument, with no use case and no tests. That is nothing personal; it is simply that I do not wish misunderstandings to block the adoption of new features. Please see that Alvaro and I have gone out of our way to provide a new facility to help you and others, and that it requires changing how we think about the solution. I accept it may not provide for every case, but that requires careful analysis before deciding it is so.

> And, again, I think that random values insertion is the worst use case
> for minmax indexes.

I agree, it is. What we have disagreed on is the extent to which that scenario exists for use cases on very large tables, which are typically "append-mostly", with most queries accessing a subset of the table, e.g. a date range.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
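[For contrast, the manual workaround Leonardo describes (partition every 15 minutes, create the indexes in bulk) might look roughly like the following on the PostgreSQL of the time, using table inheritance. All names and paths are hypothetical.]

    -- One plain child table per 15-minute interval, created with no indexes
    CREATE TABLE cdr_20131113_0600 () INHERITS (cdr);

    -- Bulk load the interval's data; inserts pay no index maintenance cost
    COPY cdr_20131113_0600 FROM '/data/cdr/20131113_0600.csv' (FORMAT csv);

    -- Build the four identifier indexes in one pass after the load
    CREATE INDEX ON cdr_20131113_0600 (caller_imsi);
    CREATE INDEX ON cdr_20131113_0600 (caller_imei);
    CREATE INDEX ON cdr_20131113_0600 (called_imsi);
    CREATE INDEX ON cdr_20131113_0600 (called_imei);

    -- 20-day retention: dropping an old child discards its data cheaply
    DROP TABLE cdr_20131024_0600;

[The minmax approach aims to give the same effect, cheap inserts plus date-range pruning, without the machinery of creating and dropping nearly a hundred child tables per day.]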