Thread: Postgres cluster
Hi. I'm looking for a somewhat standardized means of configuring a cluster of postgres nodes for better performance and availability. I have a SAN available for data storage. Can this be done? Thanks, Morten
Hi, in case that one has a table R(att0 int4, att1 int4), an index on att1 and a query selecting only att1 whith a range condition over att1, then the optimal access path for a rdbms would be "INDEX ONLY SCAN", which means scan only the index, and not the relation (all the needed information exists in index, because the select clause has only the att1 attribute). This access path is selected in this case by Oracle. As far as i can see, postgresql select a simple "INDEX SCAN using index" on table. Is there any tuning that one has to do, in order to make postgresql's optimizer to select "INDEX ONLY SCAN" ? Or postgresql doesn't support this feature at all? thanks.
On Sun, Mar 06, 2005 at 10:38:22PM +0200, Ioannis Theoharis wrote: > in case that one has a table R(att0 int4, att1 int4), an index on att1 > and a query selecting only att1 whith a range condition over att1, then > the optimal access path for a rdbms would be "INDEX ONLY SCAN", which means > scan only the index, and not the relation (all the needed information > exists in index, because the select clause has only the att1 attribute). > This access path is selected in this case by Oracle. > > As far as i can see, postgresql select a simple "INDEX SCAN using index" > on table. That's because the index does not hold all the information. So INDEX ONLY is impossible. > Is there any tuning that one has to do, in order to make postgresql's > optimizer to select "INDEX ONLY SCAN" ? > Or postgresql doesn't support this feature at all? Nope. There's been discussion, but I wouldn't hold my breath. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
> > > > As far as i can see, postgresql select a simple "INDEX SCAN using index" > > on table. > > That's because the index does not hold all the information. So INDEX > ONLY is impossible. > Thank you for your quick answer, but there is something a don't understand: the index holds all attributes' values (one index entry/one table entry). Right? given a query with only this attribute on SELECT clause, which is the missing information, that do not exist in index?
On Sun, Mar 06, 2005 at 11:39:05PM +0200, Ioannis Theoharis wrote: > > > As far as i can see, postgresql select a simple "INDEX SCAN using index" > > > on table. > > > > That's because the index does not hold all the information. So INDEX > > ONLY is impossible. > > > > Thank you for your quick answer, but there is something a don't > understand: > > the index holds all attributes' values (one index entry/one table entry). > Right? > given a query with only this attribute on SELECT clause, which is the > missing information, that do not exist in index? Visibility information. An index contains references to all rows in all transactions. You need to go back to the table to work out if the row is visible in your transaction. Hope this helpsm -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
> > Visibility information. An index contains references to all rows in all > transactions. You need to go back to the table to work out if the row > is visible in your transaction. > > Hope this helpsm Yes at all. You have a lot of work for future postgresql' versions :-))
On Sun, Mar 06, 2005 at 11:54:52PM +0200, Ioannis Theoharis wrote: > > Visibility information. An index contains references to all rows in all > > transactions. You need to go back to the table to work out if the row > > is visible in your transaction. > > > > Hope this helpsm > > Yes at all. > You have a lot of work for future postgresql' versions :-)) It's a fairly corner case feature, only for the case where you're looking for the existance of an index key but don't want any other data. I don't personally have any queries that could use such a construct, though maybe I'm missing the point. If someone wants to step up and write the code, well... Bit hard, it means that everytime a row is updated the index needs to be updated too, gotta get all the corner cases, locking, etc. It's not high on anybodies list AFAIK... -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
Martijn van Oosterhout <kleptog@svana.org> writes: > If someone wants to step up and write the code, well... Bit hard, it > means that everytime a row is updated the index needs to be updated > too, gotta get all the corner cases, locking, etc. Even if someone did step up and write the code, the patch would probably be rejected. Copying row commit data into index entries would have some serious negative effects: * Index bloat. It would add probably 20 bytes to the size of an index entry, which is a lot (it would double the size of an index on a simple integer column, for instance). * Extra time and extra I/O to update these additional copies. Not only do we have to write the index changes themselves, but we'd have to make WAL log entries about each one (since the WAL replay environment is not nearly bright enough to find index entries for itself). That's expensive. The second point makes any such change a sure loser for an update-heavy database. In a nearly-read-only database, it might possibly be a winner, but in view of the index bloat problem it's not a sure thing even then. In short: this has been discussed and rejected before. Many times. regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes: > It's a fairly corner case feature, only for the case where you're > looking for the existance of an index key but don't want any other > data. I don't personally have any queries that could use such a > construct, though maybe I'm missing the point. You don't have any many-to-many relationships? Or any queries that use EXISTS with indexable where clauses? Both of those are typical use cases for this path. In any case I think this type of scenario is more likely to be attacked via some sort of "index organized table". But even that's not on the horizon right now. -- greg
> > It's a fairly corner case feature, only for the case where you're > looking for the existance of an index key but don't want any other > data. I don't personally have any queries that could use such a > construct, though maybe I'm missing the point. I understand the difficulty of implementation of such a new feature and the advantages and disadvantages coming with it. But let me give you one of the most important usage of index only scan: if one has two relations A(a1, a2, ..., an) and B(b1, b2, ..., bn) and let b1 be a foreign key of one of the ai of A, (e.g of aj). And let q be a query like: select A.ai from A, B where aj = b1 which is one of most common join queries, then there if one has an index on b1 on relation B, there is no reason to scann the whole table B, but only the b1 attribute. A good optimizer would select index only access path, because it's the most selective acess path. Eg. index nested loop algorithm for join, taking relation A as outer, and B as inner, has to scan the whole relation B (Vs the whole index on B) for each tuple(or block) of outer rel A. Also, there are other less frequent examples in which index only acess path is the optimal one.
Hi folks, Sorry to ask a newbie SQL question but I'm struggling... I have a website clickstream log: request_id session_id sequence_num url 100 xxx 1 /foo 101 xxx 2 /bar 102 xxx 3 /hoo 103 yyy 1 /foo 104 yyy 2 /bar I need to count the number of times each url was an exit page. An exit page is the highest sequence_num for each session_id - 102 & 104 in the example. The report would look like: Exit pages: /foo 4555 /bar 3204 /hoo 2337 etc... Any pointers would be much appreciated! ------------------ Geoff Caplan Vario Software Ltd (+44) 121-515 1154
Geoff Caplan <geoff@variosoft.com> writes: > Hi folks, > > Sorry to ask a newbie SQL question but I'm struggling... There's no efficient way to write this in standard SQL. However Postgres has an extension DISTINCT ON that would do it: select url,count(*) from (select distinct on (session_id) url from clickstream order by session_id,sequence_num desc ) group by url This isn't going to be a superfast query. It has to sort all the clickstream records by session and sequence, take just the last one, then probably sort those again. You could maybe make it faster by having an index on <session_id,sequence_num> and doing order by "session_id desc, sequence_num desc". And giving this session a larger than normal sort_mem would give it a better chance of being able to use hash_agg for the count. -- greg
Greg, GS> There's no efficient way to write this in standard SQL. GS> However Postgres has an extension DISTINCT ON that would GS> do it: Works as advertised - many thanks! I'd missed the DISTINCT ON extension... This really is a great list - you've saved me a couple of hours of agony, I suspect. ------------------ Geoff Caplan Vario Software Ltd (+44) 121-515 1154
gsstark@mit.edu (Greg Stark) writes: > Geoff Caplan <geoff@variosoft.com> writes: > > > Hi folks, > > > > Sorry to ask a newbie SQL question but I'm struggling... > > There's no efficient way to write this in standard SQL. However Postgres has > an extension DISTINCT ON that would do it: > > select url,count(*) > from (select distinct on (session_id) > url > from clickstream > order by session_id,sequence_num desc > ) > group by url > > This isn't going to be a superfast query. It has to sort all the clickstream > records by session and sequence, take just the last one, then probably sort > those again. > As an experiment I tried a more 'standard SQL' approach to this problem: SELECT url, count(1) FROM clickstream WHERE (session_id, sequence_num) IN (SELECT session_id, max(sequence_num) FROM clickstream GROUP BY session_id) GROUP BY url; On a table with about 100,000 rows this runs in about 720ms on my system , compared to the ON DISTICNT version which runs in about 1000ms. Adding an index on (session_id, sequence_num) reduced the run time to about 690ms, but made no difference to the DISTINCT ON version. With only about 10,000 rows, there's no appreciable difference. This surprised me, because I expected the DISTINCT ON to be better. -- Remove -42 for email