Thread: Postgres cluster

Postgres cluster

From

Morten

Date:

06 March 2005, 14:16:21

Hi.

I'm looking for a somewhat standardized means of configuring a cluster
of postgres nodes for better performance and availability. I have a SAN
available for data storage.

Can this be done?

Thanks,

Morten

About Access paths

From

Ioannis Theoharis

Date:

06 March 2005, 20:38:35


Hi,

in case that one has a table R(att0 int4, att1 int4), an index on att1
and a query selecting only att1 whith a range condition over att1, then
the optimal access path for a rdbms would be "INDEX ONLY SCAN", which means
scan only the index, and not the relation (all the needed information
exists in index, because the select clause has only the att1 attribute).
This access path is selected in this case by Oracle.

As far as i can see, postgresql select a simple "INDEX SCAN using index"
on table.

Is there any tuning that one has to do, in order to make postgresql's
optimizer to select "INDEX ONLY SCAN" ?
Or postgresql doesn't support this feature at all?

thanks.

Re: About Access paths

From

Martijn van Oosterhout

Date:

06 March 2005, 21:29:57

On Sun, Mar 06, 2005 at 10:38:22PM +0200, Ioannis Theoharis wrote:
> in case that one has a table R(att0 int4, att1 int4), an index on att1
> and a query selecting only att1 whith a range condition over att1, then
> the optimal access path for a rdbms would be "INDEX ONLY SCAN", which means
> scan only the index, and not the relation (all the needed information
> exists in index, because the select clause has only the att1 attribute).
> This access path is selected in this case by Oracle.
>
> As far as i can see, postgresql select a simple "INDEX SCAN using index"
> on table.

That's because the index does not hold all the information. So INDEX
ONLY is impossible.

> Is there any tuning that one has to do, in order to make postgresql's
> optimizer to select "INDEX ONLY SCAN" ?
> Or postgresql doesn't support this feature at all?

Nope. There's been discussion, but I wouldn't hold my breath.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-93941.dat

Re: About Access paths

From

Ioannis Theoharis

Date:

06 March 2005, 21:39:10


> >
> > As far as i can see, postgresql select a simple "INDEX SCAN using index"
> > on table.
>
> That's because the index does not hold all the information. So INDEX
> ONLY is impossible.
>

Thank you for your quick answer, but there is something a don't
understand:

the index holds all attributes' values (one index entry/one table entry).
Right?
given a query with only this attribute on SELECT clause, which is the
missing information, that do not exist in index?

Re: About Access paths

From

Martijn van Oosterhout

Date:

06 March 2005, 21:41:33

On Sun, Mar 06, 2005 at 11:39:05PM +0200, Ioannis Theoharis wrote:
> > > As far as i can see, postgresql select a simple "INDEX SCAN using index"
> > > on table.
> >
> > That's because the index does not hold all the information. So INDEX
> > ONLY is impossible.
> >
>
> Thank you for your quick answer, but there is something a don't
> understand:
>
> the index holds all attributes' values (one index entry/one table entry).
> Right?
> given a query with only this attribute on SELECT clause, which is the
> missing information, that do not exist in index?

Visibility information. An index contains references to all rows in all
transactions. You need to go back to the table to work out if the row
is visible in your transaction.

Hope this helpsm
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-93944.dat

Re: About Access paths

From

Ioannis Theoharis

Date:

06 March 2005, 21:54:55


>
> Visibility information. An index contains references to all rows in all
> transactions. You need to go back to the table to work out if the row
> is visible in your transaction.
>
> Hope this helpsm

Yes at all.
You have a lot of work for future postgresql' versions :-))

Re: About Access paths

From

Martijn van Oosterhout

Date:

06 March 2005, 22:14:53

On Sun, Mar 06, 2005 at 11:54:52PM +0200, Ioannis Theoharis wrote:
> > Visibility information. An index contains references to all rows in all
> > transactions. You need to go back to the table to work out if the row
> > is visible in your transaction.
> >
> > Hope this helpsm
>
> Yes at all.
> You have a lot of work for future postgresql' versions :-))

It's a fairly corner case feature, only for the case where you're
looking for the existance of an index key but don't want any other
data. I don't personally have any queries that could use such a
construct, though maybe I'm missing the point.

If someone wants to step up and write the code, well... Bit hard, it
means that everytime a row is updated the index needs to be updated
too, gotta get all the corner cases, locking, etc.

It's not high on anybodies list AFAIK...
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-93947.dat

Re: About Access paths

From

Tom Lane

Date:

07 March 2005, 00:03:56

Martijn van Oosterhout <kleptog@svana.org> writes:
> If someone wants to step up and write the code, well... Bit hard, it
> means that everytime a row is updated the index needs to be updated
> too, gotta get all the corner cases, locking, etc.

Even if someone did step up and write the code, the patch would probably
be rejected.  Copying row commit data into index entries would have some
serious negative effects:

* Index bloat.  It would add probably 20 bytes to the size of an index
entry, which is a lot (it would double the size of an index on a simple
integer column, for instance).

* Extra time and extra I/O to update these additional copies.  Not only
do we have to write the index changes themselves, but we'd have to make
WAL log entries about each one (since the WAL replay environment is not
nearly bright enough to find index entries for itself).  That's
expensive.

The second point makes any such change a sure loser for an update-heavy
database.  In a nearly-read-only database, it might possibly be a
winner, but in view of the index bloat problem it's not a sure thing
even then.

In short: this has been discussed and rejected before.  Many times.

            regards, tom lane

Re: About Access paths

From

Greg Stark

Date:

07 March 2005, 07:57:17

Martijn van Oosterhout <kleptog@svana.org> writes:

> It's a fairly corner case feature, only for the case where you're
> looking for the existance of an index key but don't want any other
> data. I don't personally have any queries that could use such a
> construct, though maybe I'm missing the point.

You don't have any many-to-many relationships? Or any queries that use EXISTS
with indexable where clauses? Both of those are typical use cases for this
path.

In any case I think this type of scenario is more likely to be attacked via
some sort of "index organized table". But even that's not on the horizon right
now.

--
greg

Re: About Access paths

From

Ioannis Theoharis

Date:

07 March 2005, 17:47:34


>
> It's a fairly corner case feature, only for the case where you're
> looking for the existance of an index key but don't want any other
> data. I don't personally have any queries that could use such a
> construct, though maybe I'm missing the point.

I understand the difficulty of implementation of such a new feature and
the advantages and disadvantages coming with it.

But let me give you one of the most important usage of index only scan:

if one has two relations A(a1, a2, ..., an) and B(b1, b2, ..., bn) and let
b1 be a foreign key of one of the ai of A, (e.g of aj).

And let q be a query like:

select     A.ai
from     A, B
where    aj = b1

which is one of most common join queries,

then there if one has an index on b1 on relation B, there is no reason to
scann the whole table B, but only the b1 attribute. A good optimizer would
select index only access path, because it's the most selective acess path.

Eg. index nested loop algorithm for join, taking relation A as outer, and
B as inner, has to scan the whole relation B (Vs the whole index on B) for
each tuple(or block) of outer rel A.

Also, there are other less frequent examples in which index only acess
path is the optimal one.

Stuck with a query...

From

Geoff Caplan

Date:

08 March 2005, 17:47:39

Hi folks,

Sorry to ask a newbie SQL question but I'm struggling...

I have a website clickstream log:

request_id   session_id  sequence_num  url
100          xxx         1             /foo
101          xxx         2             /bar
102          xxx         3             /hoo
103          yyy         1             /foo
104          yyy         2             /bar

I need to count the number of times each url was an exit
page. An exit page is the highest sequence_num for each
session_id - 102 & 104 in the example. The report would look
like:

Exit pages:

/foo 4555
/bar 3204
/hoo 2337

etc...

Any pointers would be much appreciated!

------------------
Geoff Caplan
Vario Software Ltd
(+44) 121-515 1154

Re: Stuck with a query...

From

Greg Stark

Date:

08 March 2005, 18:04:54

Geoff Caplan <geoff@variosoft.com> writes:

> Hi folks,
>
> Sorry to ask a newbie SQL question but I'm struggling...

There's no efficient way to write this in standard SQL. However Postgres has
an extension DISTINCT ON that would do it:

select url,count(*)
  from (select distinct on (session_id)
               url
          from clickstream
         order by session_id,sequence_num  desc
       )
 group by url

This isn't going to be a superfast query. It has to sort all the clickstream
records by session and sequence, take just the last one, then probably sort
those again.

You could maybe make it faster by having an index on <session_id,sequence_num>
and doing order by "session_id desc, sequence_num desc". And giving this
session a larger than normal sort_mem would give it a better chance of being
able to use hash_agg for the count.

--
greg

Re: Stuck with a query...

From

Geoff Caplan

Date:

08 March 2005, 18:54:12

Greg,

GS> There's no efficient way to write this in standard SQL.
GS> However Postgres has an extension DISTINCT ON that would
GS> do it:

Works as advertised - many thanks! I'd missed the DISTINCT
ON extension...

This really is a great list - you've saved me a couple of
hours of agony, I suspect.

------------------
Geoff Caplan
Vario Software Ltd
(+44) 121-515 1154

Re: Stuck with a query...

From

Edmund Bacon

Date:

14 March 2005, 02:27:37

gsstark@mit.edu (Greg Stark) writes:

> Geoff Caplan <geoff@variosoft.com> writes:
>
> > Hi folks,
> >
> > Sorry to ask a newbie SQL question but I'm struggling...
>
> There's no efficient way to write this in standard SQL. However Postgres has
> an extension DISTINCT ON that would do it:
>
> select url,count(*)
>   from (select distinct on (session_id)
>                url
>           from clickstream
>          order by session_id,sequence_num  desc
>        )
>  group by url
>
> This isn't going to be a superfast query. It has to sort all the clickstream
> records by session and sequence, take just the last one, then probably sort
> those again.
>

As an experiment I tried a more  'standard SQL' approach to this problem:

SELECT url, count(1)
  FROM clickstream
 WHERE (session_id, sequence_num) IN
     (SELECT session_id, max(sequence_num)
        FROM clickstream
        GROUP BY session_id)
GROUP BY url;

On a table with about 100,000 rows this runs in about 720ms on my
system , compared to the ON DISTICNT version which runs in about
1000ms.  Adding an index on (session_id, sequence_num) reduced the run
time to about 690ms, but made no difference to the DISTINCT ON
version.  With only about 10,000 rows, there's no appreciable
difference. This surprised me, because I expected the DISTINCT ON to
be better.

--
Remove -42 for email