Thread: Hash partitioning.

Hash partitioning.

From
"Yuri Levinsky"
Date:
<div class="WordSection1"><p class="MsoNormal"><span style="font-family:"Georgia","serif";color:black"> Hi,</span><p
class="MsoNormal"><spanstyle="font-family:"Georgia","serif";color:black">Do we have any plans to implement Hash
Partitioning,maybe I missing this feature? </span><p class="MsoNormal"><span
style="font-family:"Georgia","serif";color:black"> </span><pclass="MsoNormal"><i><span
style="font-size:10.0pt;font-family:"Arial","sans-serif";color:#111111">Sincerelyyours</span></i><span
style="color:black">,</span><pclass="MsoNormal"><span style="color:black"> </span><p class="MsoNormal"><span
style="color:black"><imgalt="Description: Celltick logo_highres" height="53" id="Picture_x0020_1"
src="cid:image002.jpg@01CE71BB.6A05EF80"width="143" /></span><p class="MsoNormal"><span
style="font-size:8.0pt;font-family:"Arial","sans-serif";color:black">YuriLevinsky, DBA</span><span
style="font-size:10.0pt;font-family:"Arial","sans-serif";color:black"></span><pclass="MsoNormal"><span
style="font-size:8.0pt;font-family:"Arial","sans-serif";color:black">CelltickTechnologies Ltd., 32 Maskit St., Herzliya
46733,Israel</span><span style="font-size:10.0pt;font-family:"Arial","sans-serif";color:black"></span><p
class="MsoNormal"><spanstyle="font-size:8.0pt;font-family:"Arial","sans-serif";color:black">Mobile: +972 54 6107703,
Office:+972 9 9710239; Fax: +972 9 9710222</span><span
style="font-size:10.0pt;font-family:"Arial","sans-serif";color:black"></span><pclass="MsoNormal"> </div> 

Re: Hash partitioning.

From
Bruce Momjian
Date:
On Tue, Jun 25, 2013 at 03:48:19PM +0300, Yuri Levinsky wrote:
> Hi,
> 
> Do we have any plans to implement Hash Partitioning, or am I missing
> this feature?

You can do it by writing your own constraint and trigger functions that
control the hashing.
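
For illustration, a minimal sketch of that trigger-based approach (the
table and partition names are invented, and the hash is reduced to a
plain modulo on an integer key; whether SELECTs can then be pruned is a
separate question, discussed below):

create table users (user_id int, name text);
create table users_p0 (check ((user_id % 4) = 0)) inherits (users);
create table users_p1 (check ((user_id % 4) = 1)) inherits (users);
create table users_p2 (check ((user_id % 4) = 2)) inherits (users);
create table users_p3 (check ((user_id % 4) = 3)) inherits (users);

create or replace function users_insert_trigger() returns trigger as $$
begin
    -- route each row to the partition matching its modulo bucket
    case new.user_id % 4
        when 0 then insert into users_p0 values (new.*);
        when 1 then insert into users_p1 values (new.*);
        when 2 then insert into users_p2 values (new.*);
        else        insert into users_p3 values (new.*);
    end case;
    return null;  -- suppress the insert into the parent table
end;
$$ language plpgsql;

create trigger users_insert before insert on users
    for each row execute procedure users_insert_trigger();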

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Hash partitioning.

From
"Yuri Levinsky"
Date:
Bruce,
Many thanks. According to the PostgreSQL documentation, only range and
list partitioning are supported. My question is: if I follow your
advice, will PostgreSQL do partition pruning on SELECT? My expectation
is:
I divide my table into 128 hash partitions according to, let's say,
user_id. When I do select * from users where user_id=?, I expect the
engine to select from one particular partition according to my function.
The issue is critical when you are working with big tables that you
can't naturally partition by range or list. The feature allows parallel
selects from such a table: each thread might select from its own
dedicated partition. The feature also (mainly) allows reducing the
b-tree depth of the index on the partition key column by dividing the
index into smaller parts.

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-----Original Message-----
From: Bruce Momjian [mailto:bruce@momjian.us]
Sent: Tuesday, June 25, 2013 4:21 PM
To: Yuri Levinsky
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Hash partitioning.

On Tue, Jun 25, 2013 at 03:48:19PM +0300, Yuri Levinsky wrote:
> Hi,
>
> Do we have any plans to implement Hash Partitioning, or am I missing
> this feature?

You can do it by writing your own constraint and trigger functions that
control the hashing.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +






Re: Hash partitioning.

From
Bruce Momjian
Date:
On Tue, Jun 25, 2013 at 05:19:47PM +0300, Yuri Levinsky wrote:
> Bruce,
> Many thanks. According to the PostgreSQL documentation, only range and
> list partitioning are supported. My question is: if I follow your
> advice, will PostgreSQL do partition pruning on SELECT? My expectation
> is:
> I divide my table into 128 hash partitions according to, let's say,
> user_id. When I do select * from users where user_id=?, I expect the
> engine to select from one particular partition according to my function.
> The issue is critical when you are working with big tables that you
> can't naturally partition by range or list. The feature allows parallel
> selects from such a table: each thread might select from its own
> dedicated partition. The feature also (mainly) allows reducing the
> b-tree depth of the index on the partition key column by dividing the
> index into smaller parts.

Uh, where do you see that we only support range and list?  You aren't
using an EnterpriseDB closed-source product, are you?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Hash partitioning.

From
Robert Haas
Date:
On Tue, Jun 25, 2013 at 9:21 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Jun 25, 2013 at 03:48:19PM +0300, Yuri Levinsky wrote:
>> Hi,
>>
>> Do we have any plans to implement Hash Partitioning, or am I missing
>> this feature?
>
> You can do it by writing your own constraint and trigger functions that
> control the hashing.

Not really.  Constraint exclusion won't kick in for a constraint like
CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form  a = 42.

Of course, since partitioning generally doesn't improve performance in
PostgreSQL anyway, it's not clear why you'd want to do this in the
first place.  But the fact that constraint exclusion won't work if you
do is kind of a knockout blow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Hash partitioning.

From
Bruce Momjian
Date:
On Tue, Jun 25, 2013 at 11:02:40AM -0400, Robert Haas wrote:
> On Tue, Jun 25, 2013 at 9:21 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Tue, Jun 25, 2013 at 03:48:19PM +0300, Yuri Levinsky wrote:
> >> Hi,
> >>
> >> Do we have any plans to implement Hash Partitioning, or am I missing
> >> this feature?
> >
> > You can do it by writing your own constraint and trigger functions that
> > control the hashing.
> 
> Not really.  Constraint exclusion won't kick in for a constraint like
> CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form  a = 42.

Uh, I thought we checked the constant against every CHECK constraint and
only scanned partitions that matched.  Why does this not work?

> Of course, since partitioning generally doesn't improve performance in
> PostgreSQL anyway, it's not clear why you'd want to do this in the

I think partitioning does improve performance by reducing index depth.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Hash partitioning.

From
Robert Haas
Date:
On Tue, Jun 25, 2013 at 11:06 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> Not really.  Constraint exclusion won't kick in for a constraint like
>> CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form  a = 42.
>
> Uh, I thought we checked the constant against every CHECK constraint and
> only scanned partitions that matched.  Why does this not work?

That's a pretty fuzzy description of what we do.  For this to work,
we'd have to be able to use the predicate a = 42 to prove that
hashme(a) % 16 = 3 is false.  But we can't actually substitute 42 in
for a and then evaluate hashme(42) % 16  = 3, because we don't know
that the a = 42 in the WHERE clause means exact equality for all
purposes, only that it means "has the numerically same value".  For
integers, equality under = is sufficient to prove equivalence.

But for numeric values, for example, it is not.  The values
'42'::numeric and '42.0'::numeric are equal according to =(numeric,
numeric), but they are not the same.  If the hashme() function did
something like length($1::text), it would get different answers for
those two values.  IOW, the theorem prover has no way of knowing that
the hash function provided has semantics that are compatible with the
opclass of the operator used in the query.
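
A quick psql check of that numeric example (expected results shown in
the comments):

select '42'::numeric = '42.0'::numeric as equal_under_eq,  -- t
       length('42'::numeric::text)     as len_42,          -- 2
       length('42.0'::numeric::text)   as len_42_0;        -- 4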

>> Of course, since partitioning generally doesn't improve performance in
>> PostgreSQL anyway, it's not clear why you'd want to do this in the
>
> I think partitioning does improve performance by reducing index depth.

Generally, I think traversing an extra level of the index is cheaper
than opening extra relations and going through the theorem-prover
machinery.  There are benefits to partitioning, but they have to do
with management - e.g. each partition can be vacuumed independently;
old partitions can be dropped more efficiently than you can
bulk-delete rows spread throughout a table - rather than performance.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Hash partitioning.

From
Bruce Momjian
Date:
On Tue, Jun 25, 2013 at 11:15:24AM -0400, Robert Haas wrote:
> On Tue, Jun 25, 2013 at 11:06 AM, Bruce Momjian <bruce@momjian.us> wrote:
> >> Not really.  Constraint exclusion won't kick in for a constraint like
> >> CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form  a = 42.
> >
> > Uh, I thought we checked the constant against every CHECK constraint and
> > only scanned partitions that matched.  Why does this not work?
> 
> That's a pretty fuzzy description of what we do.  For this to work,
> we'd have to be able to use the predicate a = 42 to prove that
> hashme(a) % 16 = 3 is false.  But we can't actually substitute 42 in
> for a and then evaluate hashme(42) % 16  = 3, because we don't know
> that the a = 42 in the WHERE clause means exact equality for all
> purposes, only that it means "has the numerically same value".  For
> integers, equality under = is sufficient to prove equivalence.
> 
> But for numeric values, for example, it is not.  The values
> '42'::numeric and '42.0'::numeric are equal according to =(numeric,
> numeric), but they are not the same.  If the hashme() function did
> something like length($1::text), it would get different answers for
> those two values.  IOW, the theorem prover has no way of knowing that
> the hash function provided has semantics that are compatible with the
> opclass of the operator used in the query.

I looked at predtest.c but I can't see how we accept >= and <= ranges,
but not CHECK (a % 16 == 3).  Is it the '%' operator?  I am not sure why
the hashme() function is there.  Wouldn't it work if hashme() were an
immutable function?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Hash partitioning.

From
Robert Haas
Date:
On Tue, Jun 25, 2013 at 11:45 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Jun 25, 2013 at 11:15:24AM -0400, Robert Haas wrote:
>> On Tue, Jun 25, 2013 at 11:06 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> >> Not really.  Constraint exclusion won't kick in for a constraint like
>> >> CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form  a = 42.
>> >
>> > Uh, I thought we checked the constant against every CHECK constraint and
>> > only scanned partitions that matched.  Why does this not work?
>>
>> That's a pretty fuzzy description of what we do.  For this to work,
>> we'd have to be able to use the predicate a = 42 to prove that
>> hashme(a) % 16 = 3 is false.  But we can't actually substitute 42 in
>> for a and then evaluate hashme(42) % 16  = 3, because we don't know
>> that the a = 42 in the WHERE clause means exact equality for all
>> purposes, only that it means "has the numerically same value".  For
>> integers, equality under = is sufficient to prove equivalence.
>>
>> But for numeric values, for example, it is not.  The values
>> '42'::numeric and '42.0'::numeric are equal according to =(numeric,
>> numeric), but they are not the same.  If the hashme() function did
>> something like length($1::text), it would get different answers for
>> those two values.  IOW, the theorem prover has no way of knowing that
>> the hash function provided has semantics that are compatible with the
>> opclass of the operator used in the query.
>
> I looked at predtest.c but I can't see how we accept >= and <= ranges,
> but not CHECK (a % 16 == 3).  Is it the '%' operator?  I am not sure why
> the hashme() function is there.  Wouldn't it work if hashme() were an
> immutable function?

Let me back up a minute.  You told the OP that he could make hash
partitioning by writing his own constraint and trigger functions.  I
think that won't work.  But I'm happy to be proven wrong.  Do you have
an example showing how to do it?

Here's why I think it WON'T work:

rhaas=# create table foo (a int, b text);
CREATE TABLE
rhaas=# create table foo0 (check ((a % 16) = 0)) inherits (foo);
CREATE TABLE
rhaas=# create table foo1 (check ((a % 16) = 1)) inherits (foo);
CREATE TABLE
rhaas=# create table foo2 (check ((a % 16) = 2)) inherits (foo);
CREATE TABLE
rhaas=# create table foo3 (check ((a % 16) = 3)) inherits (foo);
CREATE TABLE
rhaas=# explain select * from foo where a = 1;
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..101.50 rows=25 width=36)
   ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo0  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo1  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo2  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo3  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
(11 rows)

Notice we get a scan on every partition.  Now let's try it with no
modulo arithmetic, just a straightforward one-partition-per-value:

rhaas=# create table foo (a int, b text);
CREATE TABLE
rhaas=# create table foo0 (check (a = 0)) inherits (foo);
CREATE TABLE
rhaas=# create table foo1 (check (a = 1)) inherits (foo);
CREATE TABLE
rhaas=# create table foo2 (check (a = 2)) inherits (foo);
CREATE TABLE
rhaas=# create table foo3 (check (a = 3)) inherits (foo);
CREATE TABLE
rhaas=# explain select * from foo where a = 1;
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..25.38 rows=7 width=36)
   ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo1  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
(5 rows)

Voila, now constraint exclusion is working.

I confess that I'm not entirely clear about the details either, but
the above tests speak for themselves.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Hash partitioning.

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> I looked at predtest.c but I can't see how we accept >= and <= ranges,
> but not CHECK (a % 16 == 3).  Is it the '%' operator?  I am not sure why
> the hashme() function is there.  Wouldn't it work if hashme() were an
> immutable function?

No.  Robert's description is exactly correct: it's a question of whether
we can know that the semantics of function X have anything to do with
the behavior of operator Y.  In the case of something like CHECK (X >= 16)
combined with WHERE X = 10, if the given = and >= operators belong to
the same btree opclass family then we can assume that their semantics
are compatible and then apply reasoning to show that these two clauses
can't both be true for the same value of X.  We can *not* use "X = 10"
to reason about the behavior of anything that isn't in the = operator's
btree opclass, because we don't assume that "=" means "absolutely
identical for every purpose".  And in fact it does not mean that for
several pretty common datatypes (float being another example besides
numeric).
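
For instance, a quick check of the float case (expected results in the
comments):

select '0'::float8 = '-0'::float8 as equal_under_eq,  -- t
       '-0'::float8::text         as textual_form;    -- -0
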
        regards, tom lane



Re: Hash partitioning.

From
"Yuri Levinsky"
Date:
Guys,
I am sorry for taking your time. The reason for my question is:
As a former Oracle DBA and now a beginner PostgreSQL DBA, I would like
to say that the current partitioning mechanism could be improved. Sorry,
it seems to me far behind yesterday's requirements. Oracle might be
taken as a model for improvement. Unfortunately I don't write C code,
so I see my benefit to the PostgreSQL community only in raising this
issue. I'll be very happy to be helpful in something else, but...

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-----Original Message-----
From: Robert Haas [mailto:robertmhaas@gmail.com]
Sent: Tuesday, June 25, 2013 6:55 PM
To: Bruce Momjian
Cc: Yuri Levinsky; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Hash partitioning.

On Tue, Jun 25, 2013 at 11:45 AM, Bruce Momjian <bruce@momjian.us>
wrote:
> On Tue, Jun 25, 2013 at 11:15:24AM -0400, Robert Haas wrote:
>> On Tue, Jun 25, 2013 at 11:06 AM, Bruce Momjian <bruce@momjian.us>
wrote:
>> >> Not really.  Constraint exclusion won't kick in for a constraint
>> >> like CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form
>> >> a = 42.
>> >
>> > Uh, I thought we checked the constant against every CHECK
>> > constraint and only scanned partitions that matched.  Why does this
>> > not work?
>>
>> That's a pretty fuzzy description of what we do.  For this to work,
>> we'd have to be able to use the predicate a = 42 to prove that
>> hashme(a) % 16 = 3 is false.  But we can't actually substitute 42 in
>> for a and then evaluate hashme(42) % 16  = 3, because we don't know
>> that the a = 42 in the WHERE clause means exact equality for all
>> purposes, only that it means "has the numerically same value".  For
>> integers, equality under = is sufficient to prove equivalence.
>>
>> But for numeric values, for example, it is not.  The values
>> '42'::numeric and '42.0'::numeric are equal according to =(numeric,
>> numeric), but they are not the same.  If the hashme() function did
>> something like length($1::text), it would get different answers for
>> those two values.  IOW, the theorem prover has no way of knowing that
>> the hash function provided has semantics that are compatible with the
>> opclass of the operator used in the query.
>
> I looked at predtest.c but I can't see how we accept >= and <= ranges,
> but not CHECK (a % 16 == 3).  Is it the '%' operator?  I am not sure
> why the hashme() function is there.  Wouldn't it work if hashme() were
> an immutable function?

Let me back up a minute.  You told the OP that he could make hash
partitioning by writing his own constraint and trigger functions.  I
think that won't work.  But I'm happy to be proven wrong.  Do you have
an example showing how to do it?

Here's why I think it WON'T work:

rhaas=# create table foo (a int, b text);
CREATE TABLE
rhaas=# create table foo0 (check ((a % 16) = 0)) inherits (foo);
CREATE TABLE
rhaas=# create table foo1 (check ((a % 16) = 1)) inherits (foo);
CREATE TABLE
rhaas=# create table foo2 (check ((a % 16) = 2)) inherits (foo);
CREATE TABLE
rhaas=# create table foo3 (check ((a % 16) = 3)) inherits (foo);
CREATE TABLE
rhaas=# explain select * from foo where a = 1;
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..101.50 rows=25 width=36)
   ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo0  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo1  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo2  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo3  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
(11 rows)

Notice we get a scan on every partition.  Now let's try it with no
modulo arithmetic, just a straightforward one-partition-per-value:

rhaas=# create table foo (a int, b text);
CREATE TABLE
rhaas=# create table foo0 (check (a = 0)) inherits (foo);
CREATE TABLE
rhaas=# create table foo1 (check (a = 1)) inherits (foo);
CREATE TABLE
rhaas=# create table foo2 (check (a = 2)) inherits (foo);
CREATE TABLE
rhaas=# create table foo3 (check (a = 3)) inherits (foo);
CREATE TABLE
rhaas=# explain select * from foo where a = 1;
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..25.38 rows=7 width=36)
   ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo1  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
(5 rows)

Voila, now constraint exclusion is working.

I confess that I'm not entirely clear about the details either, but the
above tests speak for themselves.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Hash partitioning.

From
Bruce Momjian
Date:
On Tue, Jun 25, 2013 at 12:08:34PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > I looked at predtest.c but I can't see how we accept >= and <= ranges,
> > but not CHECK (a % 16 == 3).  Is it the '%' operator?  I am not sure why
> > the hashme() function is there.  Wouldn't it work if hashme() were an
> > immutable function?
> 
> No.  Robert's description is exactly correct: it's a question of whether
> we can know that the semantics of function X have anything to do with
> the behavior of operator Y.  In the case of something like CHECK (X >= 16)
> combined with WHERE X = 10, if the given = and >= operators belong to
> the same btree opclass family then we can assume that their semantics
> are compatible and then apply reasoning to show that these two clauses
> can't both be true for the same value of X.  We can *not* use "X = 10"
> to reason about the behavior of anything that isn't in the = operator's
> btree opclass, because we don't assume that "=" means "absolutely
> identical for every purpose".  And in fact it does not mean that for
> several pretty common datatypes (float being another example besides
> numeric).

OK, so it is really the index comparisons that we are using;  makes
sense.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Hash partitioning.

From
Alvaro Herrera
Date:
Yuri Levinsky wrote:

> As a former Oracle DBA and now a beginner PostgreSQL DBA, I would like
> to say that the current partitioning mechanism could be improved. Sorry,
> it seems to me far behind yesterday's requirements.

I don't think you'll find anybody that disagrees with this.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Hash partitioning.

From
Claudio Freire
Date:
On Tue, Jun 25, 2013 at 12:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Let me back up a minute.  You told the OP that he could make hash
> partitioning by writing his own constraint and trigger functions.  I
> think that won't work.  But I'm happy to be proven wrong.  Do you have
> an example showing how to do it?
>
> Here's why I think it WON'T work:
>
> rhaas=# create table foo (a int, b text);
> CREATE TABLE
> rhaas=# create table foo0 (check ((a % 16) = 0)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo1 (check ((a % 16) = 1)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo2 (check ((a % 16) = 2)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo3 (check ((a % 16) = 3)) inherits (foo);
> CREATE TABLE
> rhaas=# explain select * from foo where a = 1;
>                          QUERY PLAN
> ------------------------------------------------------------
>  Append  (cost=0.00..101.50 rows=25 width=36)
>    ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
>          Filter: (a = 1)
>    ->  Seq Scan on foo0  (cost=0.00..25.38 rows=6 width=36)
>          Filter: (a = 1)
>    ->  Seq Scan on foo1  (cost=0.00..25.38 rows=6 width=36)
>          Filter: (a = 1)
>    ->  Seq Scan on foo2  (cost=0.00..25.38 rows=6 width=36)
>          Filter: (a = 1)
>    ->  Seq Scan on foo3  (cost=0.00..25.38 rows=6 width=36)
>          Filter: (a = 1)
> (11 rows)
>
> Notice we get a scan on every partition.  Now let's try it with no
> modulo arithmetic, just a straightforward one-partition-per-value:
>
> rhaas=# create table foo (a int, b text);
> CREATE TABLE
> rhaas=# create table foo0 (check (a = 0)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo1 (check (a = 1)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo2 (check (a = 2)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo3 (check (a = 3)) inherits (foo);
> CREATE TABLE
> rhaas=# explain select * from foo where a = 1;


Did you try "select * from foo where (a % 16) = (1::int % 16)"?

For a few views I have that span multiple "partitions" (in quotes since
they're not exactly partitions, but close), I can make constraint
exclusion work if I match the expression EXACTLY, including types
(I've posted a few questions about this to pg-performance).



Re: Hash partitioning.

From
Christopher Browne
Date:
<div dir="ltr">On Tue, Jun 25, 2013 at 12:08 PM, Yuri Levinsky <span dir="ltr"><<a href="mailto:yuril@celltick.com"
target="_blank">yuril@celltick.com</a>></span>wrote:<br /><div class="gmail_extra"><div
class="gmail_quote"><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">Guys,<br/> I am sorry for taking your time. The reason for my question is:<br /> As former
OracleDBA and now simple beginner PostgreSQL DBA I would like<br /> to say: the current partitioning mechanism might be
improved.Sorry, it<br /> seems to me far behind yesterday requirements. As model for improvement<br /> the Oracle might
betaken as example. Unfortunately I am not writing an<br /> C code and see my benefit to PostgreSQL community in only
risingthis<br /> issue. I'll be very happy to be helpful in something else, but...<br /></blockquote></div><br
/></div><divclass="gmail_extra">Please don't flee over this...<br /><br /></div><div class="gmail_extra">As I think you
cansee, now, the partitioning problem is tougher than it<br />may at first seem to be.  It's quite useful to quickly
getto the point of<br /> understanding that.<br /><br /></div><div class="gmail_extra">There would indeed be merit in
improvingthe partitioning apparatus,<br /></div><div class="gmail_extra">and actually, I think it's been a couple of
yearssince there has been<br /> serious discussion of this.<br /><br /></div><div class="gmail_extra">The discussion
tendsto head into the rabbit hole of disputing about<br />whether one mechanism or another is ideal.  That's the wrong
starting<br/>point - we shouldn't start with what's easiest to make "ideal," we <br /> should start by determining what
isrequired/desirable, without too<br />much reference, at least initially, on to how to achieve it.<br /></div><div
class="gmail_extra">--<br />When confronted by a difficult problem, solve it by reducing it to the<br /> question, "How
wouldthe Lone Ranger handle this?"<br /></div></div> 

Re: Hash partitioning.

From
Tom Lane
Date:
Christopher Browne <cbbrowne@gmail.com> writes:
> There would indeed be merit in improving the partitioning apparatus,
> and actually, I think it's been a couple of years since there has been
> serious discussion of this.

We could certainly use a partitioning mechanism that's easier to use
than what we have now, which is basically "build it yourself, here's
the parts bin".  There would also be some performance benefits from
moving the partitioning logic into hard-wired code.

However, I find it hard to think that hash partitioning as such is very
high on the to-do list.  As was pointed out upthread, the main practical
advantage of partitioning is *not* performance of routine queries, but
improved bulk-data management such as the ability to do periodic
housecleaning by dropping a partition.  If your partitioning is on a
hash, you've thrown away any such advantage, because there's no
real-world meaning to the way the data's been split up.  So I find range
and list partitioning way more plausible.
        regards, tom lane



Re: Hash partitioning.

From
Claudio Freire
Date:
On Tue, Jun 25, 2013 at 4:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, I find it hard to think that hash partitioning as such is very
> high on the to-do list.  As was pointed out upthread, the main practical
> advantage of partitioning is *not* performance of routine queries, but
> improved bulk-data management such as the ability to do periodic
> housecleaning by dropping a partition.  If your partitioning is on a
> hash, you've thrown away any such advantage, because there's no
> real-world meaning to the way the data's been split up.  So I find range
> and list partitioning way more plausible.


It would be nice if range partitioning based on some user-defined
function was completely automatic, as in:

* You define a function that returns a partition name for a given input.
* You define a table to somehow be auto-partitioned on
your_function(some_column)
* The planner knows now it's some_column applied to your_function, so
it can do constraint exclusion checks (your_function would probably
need to be stable at least)
* If a returned partition is missing... what? (auto-create? that'd be nice)

It's pretty much what we have already, albeit easier to use. And,
perhaps constraint exclusion logic could be specialized for this case,
and made more robust.
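
For concreteness, a purely hypothetical sketch of that scheme; none of
this DDL exists, and my_part() is an invented name:

create function my_part(user_id int) returns text
    language sql stable
    as $$ select 'users_p' || (user_id % 16) $$;

-- imaginary syntax, for illustration only:
-- create table users (user_id int, name text)
--     partition by function my_part(user_id);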



Re: Hash partitioning.

From
Kevin Grittner
Date:
Claudio Freire <klaussfreire@gmail.com> wrote:

> Did you try "select * from foo where (a % 16) = (1::int % 16)"?

I did.  Using Robert's hashed partitioning table definitions:

test=# explain select * from foo where a = 1 and (a % 16) = (1 % 16);
                         QUERY PLAN                        
------------------------------------------------------------
 Append  (cost=0.00..31.53 rows=2 width=36)
   ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
         Filter: ((a = 1) AND ((a % 16) = 1))
   ->  Seq Scan on foo1  (cost=0.00..31.53 rows=1 width=36)
         Filter: ((a = 1) AND ((a % 16) = 1))
(5 rows)

So if you are generating your queries through something capable of
generating that last clause off of the first, this could work.  Not
all applications need to remain as flexible about the operators as
we want the database engine itself to be.

I agree though, that having an index implementation that can do the
first level split faster than any partitioning mechanism can do is
better, and that the main benefits of partitioning are in
administration, *not* searching.  At least until we have parallel
query execution.  At *that* point this all changes.

One other thing worth noting is that I have several times seen
cases where the planner cannot exclude partitions, but at execution
time it finds that it doesn't need to execute all of the plan
nodes.  I think it makes sense to not work quite so hard to
eliminate partitions at plan time if we can skip the unneeded ones
at run time, once we have more data values resolved.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Hash partitioning.

From
Claudio Freire
Date:
On Tue, Jun 25, 2013 at 6:52 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> I agree though, that having an index implementation that can do the
> first level split faster than any partitioning mechanism can do is
> better, and that the main benefits of partitioning are in
> administration, *not* searching.


Indeed, but the proposal for hash partitions isn't fundamentally
different from range partitions. It's "easy-to-use partitions over
user-defined functions", hash or not.



Re: Hash partitioning.

From
"Yuri Levinsky"
Date:
Tom,
I clearly understand your point. I actually came from the corporate
market: Verizon, Barclays... I remember very well that PostgreSQL is
open source, but let's forget that for a moment. The key issue for the
corporate market has always been partitioning (vertical and, lately,
horizontal). Because of that, Oracle has many types and combinations of
partitions, as do the other vendors. Easy partition maintenance
(automatic, simple syntax) is very important for everybody who lives in
the corporate RDBMS world and doesn't just use "DBs for free" in order
to create some virtual shop. The main purpose of partitioning in my
world is to store billions of rows and be able to search by date, hour
or even minute as fast as possible. When you are dealing with a company
which has ~350,000,000 users, and you don't want to use key/value data
stores, you need hash partitioned tables and hash partitioned table
clusters to perform fast searches and 4-6 table joins based on a user
phone number, for example. I believe that to increase PostgreSQL's
popularity in the corporate world, and to make real money from support,
the next features might be: better vertical and later horizontal
partitioning, column-oriented tables, DB freeze for NetApp/EMC
snapshots, and similar.

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Tuesday, June 25, 2013 10:33 PM
To: Christopher Browne
Cc: Yuri Levinsky; Robert Haas; Bruce Momjian; PostgreSQL Mailing Lists
Subject: Re: [HACKERS] Hash partitioning.

Christopher Browne <cbbrowne@gmail.com> writes:
> There would indeed be merit in improving the partitioning apparatus,
> and actually, I think it's been a couple of years since there has been

> serious discussion of this.

We could certainly use a partitioning mechanism that's easier to use
than what we have now, which is basically "build it yourself, here's the
parts bin".  There would also be some performance benefits from moving
the partitioning logic into hard-wired code.

However, I find it hard to think that hash partitioning as such is very
high on the to-do list.  As was pointed out upthread, the main practical
advantage of partitioning is *not* performance of routine queries, but
improved bulk-data management such as the ability to do periodic
housecleaning by dropping a partition.  If your partitioning is on a
hash, you've thrown away any such advantage, because there's no
real-world meaning to the way the data's been split up.  So I find range
and list partitioning way more plausible.
        regards, tom lane



Re: Hash partitioning.

From
Heikki Linnakangas
Date:
On 26.06.2013 11:17, Yuri Levinsky wrote:
> The main purpose of partitioning in my world
> is to store billions of rows and be able to search by date, hour or even
> minute as fast as possible.

Hash partitioning sounds like a bad fit for that use case. A regular 
b-tree, possibly with range partitioning, sounds optimal for that.

> When you are dealing with a company which has
> ~350,000,000 users, and you don't want to use key/value data stores,
> you need hash partitioned tables and hash partitioned table clusters
> to perform fast searches and 4-6 table joins based on a user phone
> number, for example.

B-trees are surprisingly fast for key-value lookups. There is no reason 
to believe that a hash partitioned table would be faster for that than a 
plain table.

- Heikki



Re: Hash partitioning.

From
Bruce Momjian
Date:
On Tue, Jun 25, 2013 at 02:52:33PM -0700, Kevin Grittner wrote:
> Claudio Freire <klaussfreire@gmail.com> wrote:
> 
> > Did you try "select * from foo where (a % 16) = (1::int % 16)"?
> 
> I did.  Using Robert's hashed partitioning table definitions:
> 
> test=# explain select * from foo where a = 1 and (a % 16) = (1 % 16);
>                          QUERY PLAN                        
> ------------------------------------------------------------
>  Append  (cost=0.00..31.53 rows=2 width=36)
>    ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
>          Filter: ((a = 1) AND ((a % 16) = 1))
>    ->  Seq Scan on foo1  (cost=0.00..31.53 rows=1 width=36)
>          Filter: ((a = 1) AND ((a % 16) = 1))
> (5 rows)
> 
> So if you are generating your queries through something capable of
> generating that last clause off of the first, this could work.  Not

OK, so what is it in our code that requires that?  Is it a type
mismatch?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Hash partitioning.

From
"Yuri Levinsky"
Date:
Heikki,
As far as I understand, the height of the b-tree will affect the number
of I/Os necessary. The height of the tree does not increase linearly
with the number of records. Maybe I am wrong in terminology, but when I
try to insert data into an empty table, the insertion time increases as
the number of records grows. In order to keep indexes as small as
possible, I usually split the table by hash if I don't have any better
alternative. On some systems a hash function + index might work faster
than an index alone for insert and search operations. This is
especially useful when you have a non-unique index with a small number
of possible values that you don't know in advance, or that change
between your customers. In that case a hash partition can be used
instead of an index.

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-----Original Message-----
From: Heikki Linnakangas [mailto:hlinnakangas@vmware.com]
Sent: Wednesday, June 26, 2013 2:23 PM
To: Yuri Levinsky
Cc: Tom Lane; Christopher Browne; Robert Haas; Bruce Momjian; PostgreSQL
Mailing Lists
Subject: Re: [HACKERS] Hash partitioning.

On 26.06.2013 11:17, Yuri Levinsky wrote:
> The main purpose of partitioning in my world is to store billions of
> rows and be able to search by date, hour or even minute as fast as
> possible.

Hash partitioning sounds like a bad fit for that use case. A regular
b-tree, possibly with range partitioning, sounds optimal for that.

> When you are dealing with a company which has
> ~350,000,000 users, and you don't want to use key/value data stores,
> you need hash partitioned tables and hash partitioned table clusters
> to perform fast searches and 4-6 table joins based on a user phone
> number, for example.

B-trees are surprisingly fast for key-value lookups. There is no reason
to believe that a hash partitioned table would be faster for that than a
plain table.

- Heikki






Re: Hash partitioning.

From
Markus Wanner
Date:
On 06/25/2013 11:52 PM, Kevin Grittner wrote:
> At least until we have parallel
> query execution.  At *that* point this all changes.

Can you elaborate on that, please? I currently have a hard time
imagining how partitions can help performance in that case, either. At
least compared to modern RAID and read-ahead capabilities.

After all, RAID can be thought of as hash partitioning with a very weird
hash function. Or maybe rather range partitioning on an internal key.

Put another way: ideally, the system should take care of optimally
distributing data across its physical storage itself. If you need to do
partitioning manually for performance reasons, that's actually a
deficiency of it, not a feature.

I certainly agree that manageability may be a perfectly valid reason to
partition your data. Maybe there even exist other good reasons. I don't
think performance optimization is one. (It's more like giving the system
a hint. And we all dislike hints, don't we? *ducks*)

Regards

Markus Wanner



Re: Hash partitioning.

From
"ktm@rice.edu"
Date:
On Wed, Jun 26, 2013 at 03:47:43PM +0200, Markus Wanner wrote:
> On 06/25/2013 11:52 PM, Kevin Grittner wrote:
> > At least until we have parallel
> > query execution.  At *that* point this all changes.
> 
> Can you elaborate on that, please? I currently have a hard time
> imagining how partitions can help performance in that case, either. At
> least compared to modern RAID and read-ahead capabilities.
> 
> After all, RAID can be thought of as hash partitioning with a very weird
> hash function. Or maybe rather range partitioning on an internal key.
> 
> Put another way: ideally, the system should take care of optimally
> distributing data across its physical storage itself. If you need to do
> partitioning manually for performance reasons, that's actually a
> deficiency of it, not a feature.
> 
> I certainly agree that manageability may be a perfectly valid reason to
> partition your data. Maybe there even exist other good reasons. I don't
> think performance optimization is one. (It's more like giving the system
> a hint. And we all dislike hints, don't we? *ducks*)
> 
> Regards
> 
> Markus Wanner
> 

Hi Markus,

I think he is referring to the fact that with parallel query execution,
multiple partitions can be processed simultaneously instead of serially
as they are now with the resulting speed increase.

Regards,
Ken



Re: Hash partitioning.

From
Heikki Linnakangas
Date:
On 26.06.2013 16:41, Yuri Levinsky wrote:
> Heikki,
> As far as I understand, the height of the b-tree will affect the number
> of I/Os necessary. The height of the tree does not increase linearly
> with the number of records.

The height of a b-tree is O(log n), where n is the number of records. 
Informally, if we assume that you have on average, say, 1000 keys on one 
b-tree page, a two-level b-tree can hold one million items, and a three 
level one billion items, and so on. The height of the tree affects the 
number of I/Os needed for searches, but keep in mind that the top levels 
of the tree are going to be very frequently accessed and in practice 
will stay permanently cached. You will only perform actual I/O on the 
1-2 bottom levels of the tree (or 0 if it all fits in cache)

Now let's compare that with a hash partitioned table, with 1000 
partitions and a b-tree index on every partition. When doing a search, 
you first hash the key to look up the right partition, then you search 
the index of that partition. This is almost equivalent to just having a 
b-tree that's one level taller - instead of looking up the right 
partition in the hash table, you look up the right child page at the 
root of the b-tree. From a very coarse theoretical point of view, the 
only difference is that you replaced the binary search on the b-tree 
root page with an equivalent hash lookup. A hash lookup can be somewhat 
cheaper than binary search, but in practice there is very little 
difference. There certainly isn't any difference in the number of actual 
I/O performed.
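
A back-of-the-envelope check of that arithmetic (assuming the ~1000
keys per page used above):

select ceil(ln(1e9) / ln(1000)) as levels_single_btree,   -- 3
       ceil(ln(1e6) / ln(1000)) as levels_per_partition;  -- 2
-- one hash lookup + two b-tree levels = three steps either way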

In practice, there might be a lot of quirks and inefficiencies and 
locking contention etc. involved in various DBMS's, that you might be 
able to work around with hash partitioning. But from a theoretical point 
of view, there is no reason to expect just partitioning a table on a 
hash to make key-value lookups any faster.

- Heikki



Re: Hash partitioning.

From
"Yuri Levinsky"
Date:
Markus,
There is no relation between partitions and RAID, despite both
distributing data somehow. At the end of the day, when you use RAID you
have one single device with some performance limitations. When you want
to improve your data access beyond that, and you don't want to work
with huge indexes that you are unable to maintain, or you don't want to
use an index at all, as in the case of a range partition by time or a
hash partition, you are welcome to use partitions. You typically don't
want to use a b-tree index when you select more than ~1-2% of your
data.

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-----Original Message-----
From: ktm@rice.edu [mailto:ktm@rice.edu]
Sent: Wednesday, June 26, 2013 5:01 PM
To: Markus Wanner
Cc: Kevin Grittner; Claudio Freire; Robert Haas; Bruce Momjian; Yuri
Levinsky; PostgreSQL-Dev
Subject: Re: [HACKERS] Hash partitioning.

On Wed, Jun 26, 2013 at 03:47:43PM +0200, Markus Wanner wrote:
> On 06/25/2013 11:52 PM, Kevin Grittner wrote:
> > At least until we have parallel
> > query execution.  At *that* point this all changes.
>
> Can you elaborate on that, please? I currently have a hard time
> imagining how partitions can help performance in that case, either. At

> least compared to modern RAID and read-ahead capabilities.
>
> After all, RAID can be thought of as hash partitioning with a very
> weird hash function. Or maybe rather range partitioning on an internal
key.
>
> Put another way: ideally, the system should take care of optimally
> distributing data across its physical storage itself. If you need to
> do partitioning manually for performance reasons, that's actually a
> deficiency of it, not a feature.
>
> I certainly agree that manageability may be a perfectly valid reason
> to partition your data. Maybe there even exist other good reasons. I
> don't think performance optimization is one. (It's more like giving
> the system a hint. And we all dislike hints, don't we? *ducks*)
>
> Regards
>
> Markus Wanner
>

Hi Markus,

I think he is referring to the fact that with parallel query execution,
multiple partitions can be processed simultaneously instead of serially
as they are now with the resulting speed increase.

Regards,
Ken






Re: Hash partitioning.

From
Bruce Momjian
Date:
On Wed, Jun 26, 2013 at 05:10:00PM +0300, Heikki Linnakangas wrote:
> In practice, there might be a lot of quirks and inefficiencies and
> locking contention etc. involved in various DBMS's, that you might
> be able to work around with hash partitioning. But from a
> theoretical point of view, there is no reason to expect just
> partitioning a table on a hash to make key-value lookups any faster.

Good analysis.  Has anyone benchmarked this to know our btree is
efficient in this area?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Hash partitioning.

From
"Yuri Levinsky"
Date:
Heikki,
This is the most professional explanation I have ever seen. Still, let
me disagree with the bottom line: it heavily depends on the amount of
memory and the actual index sizes. I did a benchmark ~6 years ago and I
won a glass of beer.  Anyway, I am talking about hash partitioning as a
feature, and my example comparing it with a unique b-tree index scan is
a little bit extreme. In case you have 2, 4, 8, ..., 1024 distinct
values (not known in advance), the index might be eliminated entirely.
That's the whole feature: no competition for the hash function.

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-----Original Message-----
From: Heikki Linnakangas [mailto:hlinnakangas@vmware.com]
Sent: Wednesday, June 26, 2013 5:10 PM
To: Yuri Levinsky
Cc: Tom Lane; Christopher Browne; Robert Haas; Bruce Momjian; PostgreSQL
Mailing Lists
Subject: Re: [HACKERS] Hash partitioning.

On 26.06.2013 16:41, Yuri Levinsky wrote:
> Heikki,
> As far as I understand, the height of the b-tree will affect the number
> of I/Os necessary. The height of the tree does not increase linearly
> with the number of records.

The height of a b-tree is O(log n), where n is the number of records.
Informally, if we assume that you have on average, say, 1000 keys on one
b-tree page, a two-level b-tree can hold one million items, and a three
level one billion items, and so on. The height of the tree affects the
number of I/Os needed for searches, but keep in mind that the top levels
of the tree are going to be very frequently accessed and in practice
will stay permanently cached. You will only perform actual I/O on the
1-2 bottom levels of the tree (or 0 if it all fits in cache)

Now let's compare that with a hash partitioned table, with 1000
partitions and a b-tree index on every partition. When doing a search,
you first hash the key to look up the right partition, then you search
the index of that partition. This is almost equivalent to just having a
b-tree that's one level taller - instead of looking up the right
partition in the hash table, you look up the right child page at the
root of the b-tree. From a very coarse theoretical point of view, the
only difference is that you replaced the binary search on the b-tree
root page with an equivalent hash lookup. A hash lookup can be somewhat
cheaper than binary search, but in practice there is very little
difference. There certainly isn't any difference in the number of actual
I/O performed.

In practice, there might be a lot of quirks and inefficiencies and
locking contention etc. involved in various DBMS's, that you might be
able to work around with hash partitioning. But from a theoretical point
of view, there is no reason to expect just partitioning a table on a
hash to make key-value lookups any faster.

- Heikki






Re: Hash partitioning.

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 26.06.2013 11:17, Yuri Levinsky wrote:
>> When you are dealing with a company which has
>> ~350,000,000 users, and you don't want to use key/value data stores,
>> you need hash partitioned tables and hash partitioned table clusters
>> to perform fast searches and 4-6 table joins based on a user phone
>> number, for example.

> B-trees are surprisingly fast for key-value lookups. There is no reason 
> to believe that a hash partitioned table would be faster for that than a 
> plain table.

Or in short: the quoted advice may very well be true for Oracle, but
applying it blindly to Postgres is not a good idea.  PG's performance
characteristics are a lot different, especially in the area of
partitioned tables.
        regards, tom lane



Re: Hash partitioning.

From
Markus Wanner
Date:
On 06/26/2013 04:10 PM, Yuri Levinsky wrote:
> You typically don't want to use a b-tree index when you select
> more than ~1-2% of your data.

Agreed. Indices on columns with very low selectivity don't perform well.
(Postgres knows that and uses a sequential scan based on selectivity
estimates. Being able to eliminate entire partitions from such a seq
scan would certainly be beneficial, yes.)

In the Postgres world, though, I think CLUSTERing might be the better
approach to solve that problem. (Note: this has nothing to do with
distributed systems in this case.) I'm not sure what the current status
of auto clustering or optimized scans on such a permanently clustered
table is, though.

The minmax indices proposed for 9.4 might be another feature worth
looking at.

Both of these approaches may eventually provide a more general and more
automatic way to speed up scans on large portions of a relation, IMO.
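
For reference, a minimal example of the CLUSTER approach (the table and
index names are invented):

create index events_ts_idx on events (ts);
cluster events using events_ts_idx;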

Regards

Markus Wanner



Re: Hash partitioning.

From
Markus Wanner
Date:
On 06/26/2013 04:01 PM, ktm@rice.edu wrote:
> I think he is referring to the fact that with parallel query execution,
> multiple partitions can be processed simultaneously instead of serially
> as they are now with the resulting speed increase.

Processing simultaneously is the purpose of parallel query execution,
yes. But I see no reason for that not to work equally well for
unpartitioned tables.

Disk I/O is already pretty well optimized and parallelized, I think.
Trying to parallelize a seq scan on the Postgres side is likely to yield
far inferior performance.

Regards

Markus Wanner



Re: Hash partitioning.

From
Bruce Momjian
Date:
On Wed, Jun 26, 2013 at 05:04:11PM +0200, Markus Wanner wrote:
> On 06/26/2013 04:01 PM, ktm@rice.edu wrote:
> > I think he is referring to the fact that with parallel query execution,
> > multiple partitions can be processed simultaneously instead of serially
> > as they are now with the resulting speed increase.
> 
> Processing simultaneously is the purpose of parallel query execution,
> yes. But I see no reason for that not to work equally well for
> unpartitioned tables.

Well, I think by definition you are going to be able to spread lookups
for a particular range across more partitions with 'hash' than with
range or list partitioning.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Hash partitioning.

From
Kevin Grittner
Date:
Markus Wanner <markus@bluegap.ch> wrote:
> On 06/25/2013 11:52 PM, Kevin Grittner wrote:
>> At least until we have parallel
>> query execution.  At *that* point this all changes.
>
> Can you elaborate on that, please? I currently have a hard time
> imagining how partitions can help performance in that case,
> either.

Well, partitioning will *still* be a net loss for overall
throughput on a machine with enough active connections to keep all
the resources busy.  Where it will help is when you have a machine
with a lot of cores and a few big "reporting" style queries.  Since
we currently can only use one core for a single query, we leave a
lot of CPU time (often the bottleneck for such queries) unused.  If
we allow a large query to search multiple partitions in parallel, a
big query can complete sooner.  It will consume more resources
overall in doing so, but if those resources would otherwise sit
idle, it could be a big win for some use cases.

> Put another way: ideally, the system should take care of
> optimally distributing data across its physical storage itself.

Agreed.  That's not where I see the win.  Most cases where I've
seen people attempt to micro-manage object placement have performed
worse than simply giving the OS a big RAID (we had 40 spindles in
some of ours at Wis. Courts) and letting the filesystem figure it
out.  There are exceptions for special cases like WAL.  I found out
by accident how much that can matter.

> If you need to do partitioning manually for performance reasons,
> that's actually a deficiency of it, not a feature.

+1

> I certainly agree that manageability may be a perfectly valid
> reason to partition your data.

It can definitely help there.

> Maybe there even exist other good reasons.

I'm arguing that better concurrency can be one in the future.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Hash partitioning.

From
Heikki Linnakangas
Date:
On 26.06.2013 18:34, Kevin Grittner wrote:
> Markus Wanner<markus@bluegap.ch>  wrote:
>> On 06/25/2013 11:52 PM, Kevin Grittner wrote:
>>> At least until we have parallel
>>> query execution.  At *that* point this all changes.
>>
>> Can you elaborate on that, please? I currently have a hard time
>> imagining how partitions can help performance in that case,
>> either.
>
> Well, partitioning will *still* be a net loss for overall
> throughput on a machine with enough active connections to keep all
> the resources busy.  Where it will help is when you have a machine
> with a lot of cores and a few big "reporting" style queries.  Since
> we currently can only use one core for a single query, we leave a
> lot of CPU time (often the bottleneck for such queries) unused.  If
> we allow a large query to search multiple partitions in parallel, a
> big query can complete sooner.

We could also allow a large query to search a single table in parallel. 
A seqscan would be easy to divide into N equally-sized parts that can be 
scanned in parallel. It's more difficult for index scans, but even then 
it might be possible at least in some limited cases.

- Heikki



Re: Hash partitioning.

From
Markus Wanner
Date:
On 06/26/2013 05:46 PM, Heikki Linnakangas wrote:
> We could also allow a large query to search a single table in parallel.
> A seqscan would be easy to divide into N equally-sized parts that can be
> scanned in parallel. It's more difficult for index scans, but even then
> it might be possible at least in some limited cases.

So far reading sequentially is still faster than hopping between
different locations. Purely from the I/O perspective, that is.

For queries where the single CPU core turns into a bottleneck and which
we want to parallelize, we should ideally still do a normal, fully
sequential scan and only fan out after the scan and distribute the
incoming pages (or even tuples) to the multiple cores to process.

Regards

Markus Wanner



Re: Hash partitioning.

From
Claudio Freire
Date:
On Wed, Jun 26, 2013 at 11:14 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Jun 26, 2013 at 05:10:00PM +0300, Heikki Linnakangas wrote:
>> In practice, there might be a lot of quirks and inefficiencies and
>> locking contention etc. involved in various DBMS's, that you might
>> be able to work around with hash partitioning. But from a
>> theoretical point of view, there is no reason to expect just
>> partitioning a table on a hash to make key-value lookups any faster.
>
> Good analysis.  Has anyone benchmarked this to know whether our btree is
> efficient in this area?


Yep. I had at one point, and came to the same conclusion. I ended up
building a few partial indices, and have been happy ever since.
Granted, my DB isn't that big, just around 200G.

No, I don't have the benchmark results. It's been a while. Back then,
it was 8.3, so I did the partitioning on the application. It still
wasn't worth it.
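
The database-side equivalent would be a hash-routing trigger over
inheritance children. A minimal sketch, with invented names, four
partitions, and format() from 9.1+:

CREATE TABLE users (user_id bigint NOT NULL, name text);

CREATE TABLE users_p0 (CHECK (abs(hashtext(user_id::text)) % 4 = 0)) INHERITS (users);
CREATE TABLE users_p1 (CHECK (abs(hashtext(user_id::text)) % 4 = 1)) INHERITS (users);
CREATE TABLE users_p2 (CHECK (abs(hashtext(user_id::text)) % 4 = 2)) INHERITS (users);
CREATE TABLE users_p3 (CHECK (abs(hashtext(user_id::text)) % 4 = 3)) INHERITS (users);

-- Route inserts on the parent into the matching child:
CREATE FUNCTION users_insert_trigger() RETURNS trigger AS $$
BEGIN
    EXECUTE format('INSERT INTO users_p%s SELECT ($1).*',
                   abs(hashtext(NEW.user_id::text)) % 4)
    USING NEW;
    RETURN NULL;  -- the row went into a child, not the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_partition BEFORE INSERT ON users
    FOR EACH ROW EXECUTE PROCEDURE users_insert_trigger();

Note that constraint exclusion cannot prune these children for a plain
WHERE user_id = ? predicate; the query has to repeat the hash expression,
which is much of why native support keeps coming up.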

Now I just have two indices. One that indexes only hot tuples, it's
very heavily queried and works blazingly fast, and one that indexes by
(hotness, key). I include the hotness value in the query, and it still
works quite fast enough. Luckily, things only become cold through an
explicit update that marks them cold, so I always know the hotness
value. I included hotness on the index to cluster updates on the hot
part of the index, but I could have just used a regular index and paid
a small price on the updates. Indeed, for a while it worked without the
hotness, and there was no significant trouble. I later found out that
WAL bandwidth decreased noticeably when I added that hotness column, so
I kept it; it helps a bit with replication. It has worked ever since.
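
In SQL terms the scheme looks roughly like this (names invented; say
hotness = 0 means hot):

CREATE TABLE items (key bigint, hotness int, payload text);

-- Small, heavily queried index covering hot tuples only:
CREATE INDEX items_hot_idx ON items (key) WHERE hotness = 0;

-- Full index, led by hotness so updates cluster in its hot part:
CREATE INDEX items_hotness_key_idx ON items (hotness, key);

-- Hot-path lookup; matching the predicate lets the partial index apply:
SELECT * FROM items WHERE hotness = 0 AND key = 42;

-- Lookups through the full index supply the hotness value as well:
SELECT * FROM items WHERE hotness = 3 AND key = 42;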

Today, I only use "partitioning" to split conceptually different
tables, so I can have different schemas for each table (and normalize
with a view). Now it's 9.2, so the view works quite nicely and
transparently. I have yet to find a use for hash partitioning.
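
A minimal sketch of that arrangement (schemas invented):

CREATE TABLE events_email (id bigint, ts timestamptz, address text);
CREATE TABLE events_sms   (id bigint, ts timestamptz, msisdn  text);

-- Normalized, transparent view over both:
CREATE VIEW events AS
    SELECT id, ts, 'email' AS kind, address AS target FROM events_email
    UNION ALL
    SELECT id, ts, 'sms'   AS kind, msisdn  AS target FROM events_sms;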



Re: Hash partitioning.

From
Jeff Janes
Date:
On Wed, Jun 26, 2013 at 7:01 AM, ktm@rice.edu <ktm@rice.edu> wrote:
> On Wed, Jun 26, 2013 at 03:47:43PM +0200, Markus Wanner wrote:
> > On 06/25/2013 11:52 PM, Kevin Grittner wrote:
> > > At least until we have parallel
> > > query execution.  At *that* point this all changes.
> >
> > Can you elaborate on that, please? I currently have a hard time
> > imagining how partitions can help performance in that case, either. At
> > least compared to modern RAID and read-ahead capabilities.
> >
> > After all, RAID can be thought of as hash partitioning with a very weird
> > hash function. Or maybe rather range partitioning on an internal key.
> >
> > Put another way: ideally, the system should take care of optimally
> > distributing data across its physical storage itself. If you need to do
> > partitioning manually for performance reasons, that's actually a
> > deficiency of it, not a feature.

+1, except I'm looking at it from a CPU perspective, not a disk perspective.

I would hope not to need to partition my data at all in order to enable parallel execution.  I certainly would hope not to redo that partitioning just because I got new hardware with a different number of CPUs.

> Hi Markus,
>
> I think he is referring to the fact that with parallel query execution,
> multiple partitions can be processed simultaneously instead of serially
> as they are now, with the resulting speed increase.

Hopefully parallel execution can divide the query into multiple "chunks" on its own, without me needing to micromanage it.

Cheers,

Jeff

Re: Hash partitioning.

From
"Yuri Levinsky"
Date:
Guys,
Single-core CPUs are dying out even for home users; my phone has 4 cores.
Today's standard is a minimum of 4 cores per CPU, and tomorrow, who knows?
Parallelization is sometimes the only solution for heavy nightly jobs.
On the other hand, parallelization is very tricky and unpredictable
when it gets into users' hands.  Anyway, when you have this option (the
same goes for hash partitioning) you are in a much better position than
when you don't have it. The question is: when may we hope to have it?

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-----Original Message-----
From: Markus Wanner [mailto:markus@bluegap.ch]
Sent: Wednesday, June 26, 2013 6:56 PM
To: Heikki Linnakangas
Cc: Kevin Grittner; Claudio Freire; Robert Haas; Bruce Momjian; Yuri
Levinsky; PostgreSQL-Dev
Subject: Re: [HACKERS] Hash partitioning.

On 06/26/2013 05:46 PM, Heikki Linnakangas wrote:
> We could also allow a large query to search a single table in
parallel.
> A seqscan would be easy to divide into N equally-sized parts that can
> be scanned in parallel. It's more difficult for index scans, but even
> then it might be possible at least in some limited cases.

So far reading sequentially is still faster than hopping between
different locations. Purely from the I/O perspective, that is.

For queries where the single CPU core turns into a bottle-neck and which
we want to parallelize, we should ideally still do a normal, fully
sequential scan and only fan out after the scan and distribute the
incoming pages (or even tuples) to the multiple cores to process.

Regards

Markus Wanner






Re: Hash partitioning.

From
Nicolas Barbier
Date:
2013/6/26 Heikki Linnakangas <hlinnakangas@vmware.com>:

> On 26.06.2013 16:41, Yuri Levinsky wrote:
>
>> Heikki,
>> As far as I understand the height of the btree will affect the number of
>> I/Os necessary. The height of the tree does not increase linearly with
>> the number of records.
>
> Now let's compare that with a hash partitioned table, with 1000 partitions
> and a b-tree index on every partition. [..] This is almost equivalent to
> just having a b-tree that's one level taller [..] There certainly isn't
> any difference in the number of actual I/O performed.

Imagine that there are a lot of indexes, e.g., 50. Although a lookup
(walking one index) is equally fast, an insertion must update all 50
indexes. When each index requires one extra I/O (because each index is
one level taller), that is 50 extra I/Os. In the partitioned case,
each index would require the normal smaller amount of I/Os. Choosing
which partition to use must only be done once: The result “counts” for
all indexes that are to be updated.
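
To put rough numbers on the “one level taller” premise (fanout invented
for illustration): with ~500 keys per internal page, a b-tree over 10^9
rows needs ceil(log_500(10^9)) = 4 levels, whereas each of 1000
partitions holding 10^6 rows needs only ceil(log_500(10^6)) = 3. With a
cold cache that is one extra page read per index, times 50 indexes.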

Additionally: Imagine that the data can be partitioned along some
column that makes sense for performance reasons (e.g., some “date”
where most accesses are concentrated on rows with more recent dates).
The other indexes will probably not have such a performance
distribution. Using those other indexes (both for look-ups and
updates) in the non-partitioned case, will therefore pull a huge
portion of each index into cache (because of the “random distribution”
of the non-date data). In the partitioned case, more cache can be
spent on the indexes that correspond to the “hot partitions.”
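
For concreteness, the classic inheritance-based version of such a date
partitioning (names invented):

CREATE TABLE measurements (id bigint, logdate date, payload text);

CREATE TABLE measurements_2013_06 (
    CHECK (logdate >= DATE '2013-06-01' AND logdate < DATE '2013-07-01')
) INHERITS (measurements);
CREATE INDEX ON measurements_2013_06 (logdate);

-- With constraint_exclusion = partition, a query such as
--   SELECT * FROM measurements WHERE logdate >= DATE '2013-06-15';
-- touches only the matching children, whose smaller indexes stay hot in cache.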

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?



Re: Hash partitioning.

From
Nicolas Barbier
Date:
2013/6/27 Nicolas Barbier <nicolas.barbier@gmail.com>:

> When each index requires one extra I/O (because each index is
> one level taller), that is 50 extra I/Os. In the partitioned case,
> each index would require the normal smaller amount of I/Os.

[..]

> Using those other indexes (both for look-ups and
> updates) in the non-partitioned case, will therefore pull a huge
> portion of each index into cache (because of the “random distribution”
> of the non-date data). In the partitioned case, more cache can be
> spent on the indexes that correspond to the “hot partitions.”

It seems that the system described by Claudio fixes this problem another way:

Claudio wrote:

> Now I just have two indices. One that indexes only hot tuples, it's
> very heavily queried and works blazingly fast, and one that indexes by
> (hotness, key).

Yuri, maybe that is something you should investigate instead of partitioning?

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?



Re: Hash partitioning.

From
Ants Aasma
Date:
On Jun 27, 2013 12:24 PM, "Nicolas Barbier" <nicolas.barbier@gmail.com> wrote:
>
> 2013/6/27 Nicolas Barbier <nicolas.barbier@gmail.com>:
>
> > When each index requires one extra I/O (because each index is
> > one level taller), that is 50 extra I/Os. In the partitioned case,
> > each index would require the normal smaller amount of I/Os.
>
> [..]
>
> > Using those other indexes (both for look-ups and
> > updates) in the non-partitioned case, will therefore pull a huge
> > portion of each index into cache (because of the “random distribution”
> > of the non-date data). In the partitioned case, more cache can be
> > spent on the indexes that correspond to the “hot partitions.”
>
> It seems that the system described by Claudio fixes this problem another way:
>
> Claudio wrote:
>
> > Now I just have two indices. One that indexes only hot tuples, it's
> > very heavily queried and works blazingly fast, and one that indexes by
> > (hotness, key).

This is not really related to hash partitioning, but you can also do
index partitioning while having the tables unpartitioned. If the hotness
field is a timestamp, like it often is, you can create a predicate index
on (key, tstamp) where tstamp > [some date in recent past], and replace
the index with a newer one every so often to keep the size small. This
way you can have a non-partitioned index for batch queries and a small
one for the OLTP workload. If we added the option to build indexes using
an index only scan, building the replacement index would be quite cheap.

Regards,
Ants Aasma
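
A sketch of that rolling index, with invented names (queries must repeat
the tstamp predicate for the partial index to be usable):

-- Partial index covering only the recent, OLTP-hot rows:
CREATE INDEX CONCURRENTLY orders_recent_idx ON orders (key, tstamp)
    WHERE tstamp > DATE '2013-06-01';

-- Every so often, build a replacement with a newer cutoff and drop the old:
CREATE INDEX CONCURRENTLY orders_recent_idx2 ON orders (key, tstamp)
    WHERE tstamp > DATE '2013-07-01';
DROP INDEX CONCURRENTLY orders_recent_idx;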

Re: Hash partitioning.

From
Markus Wanner
Date:
On 06/27/2013 11:12 AM, Nicolas Barbier wrote:
> Imagine that there are a lot of indexes, e.g., 50. Although a lookup
> (walking one index) is equally fast, an insertion must update all 50
> indexes. When each index requires one extra I/O (because each index is
> one level taller), that is 50 extra I/Os. In the partitioned case,
> each index would require the normal smaller amount of I/Os. Choosing
> which partition to use must only be done once: The result “counts” for
> all indexes that are to be updated.

I think you're underestimating the cost of partitioning. After all, the
lookup of which index to update for a given partition is a lookup in
pg_index via pg_index_indrelid_index - a btree index.

Additionally, the depth of an index doesn't directly translate to the
number of I/O writes per insert (or delete). I'd rather expect the avg.
number of I/O writes per insert into a b-tree to be reasonably close to
one - depending mostly on the number of keys per page, not depth.

> Additionally: Imagine that the data can be partitioned along some
> column that makes sense for performance reasons (e.g., some “date”
> where most accesses are concentrated on rows with more recent dates).
> The other indexes will probably not have such a performance
> distribution. Using those other indexes (both for look-ups and
> updates) in the non-partitioned case, will therefore pull a huge
> portion of each index into cache (because of the “random distribution”
> of the non-date data). In the partitioned case, more cache can be
> spent on the indexes that correspond to the “hot partitions.”

That's a valid point, yes. I'd call this index partitioning. And with
partial indices, Postgres already has something that gets pretty close,
I think. Though, I don't consider this to be related to how the tuples
of the relation are laid out on disk.

Regards

Markus Wanner



Re: Hash partitioning.

From
Nicolas Barbier
Date:
2013/6/27 Markus Wanner <markus@bluegap.ch>:

> On 06/27/2013 11:12 AM, Nicolas Barbier wrote:
>
>> Imagine that there are a lot of indexes, e.g., 50. Although a lookup
>> (walking one index) is equally fast, an insertion must update all 50
>> indexes. When each index requires one extra I/O (because each index is
>> one level taller), that is 50 extra I/Os. In the partitioned case,
>> each index would require the normal smaller amount of I/Os. Choosing
>> which partition to use must only be done once: The result “counts” for
>> all indexes that are to be updated.
>
> I think you're underestimating the cost of partitioning. After all, the
> lookup of which index to update for a given partition is a lookup in
> pg_index via pg_index_indrelid_index - a btree index.

I am assuming that this (comparatively very small and super-hot) index
is cached all the time, while for the other indexes (that are
supposedly super-huge) only the top part stays cached.

I am mostly just trying to find out where Yuri’s “partitioning is
needed for super-huge tables” experience might come from, and noting
that Heikki’s argument might not be 100% valid. I think that the
“PostgreSQL-answer” to this problem is to somehow cluster the data on
the “hotness column” (so that all hot rows are close to one another,
thereby improving the efficiency of the caching of relation blocks) +
partial indexes for the hot rows (as first mentioned by Claudio; to
improve the efficiency of the caching of index blocks).

> Additionally, the depth of an index doesn't directly translate to the
> number of I/O writes per insert (or delete). I'd rather expect the avg.
> number of I/O writes per insert into a b-tree to be reasonably close to
> one - depending mostly on the number of keys per page, not depth.

My reasoning was: To determine which index block to update (typically
one in both the partitioned and non-partitioned cases), one needs to
walk the index first, which supposedly causes one additional (read)
I/O in the non-partitioned case on average, because there is one extra
level and the lower part of the index is not cached (because of the
size of the index). I think that pokes a hole in Heikki’s argument of
“it really doesn’t matter, partitioning == using one big table with
big non-partial indexes.”

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?



Re: Hash partitioning.

From
Markus Wanner
Date:
On 06/27/2013 06:35 PM, Nicolas Barbier wrote:
> I am assuming that this (comparatively very small and super-hot) index
> is cached all the time, while for the other indexes (that are
> supposedly super-huge) only the top part stays cached.
> 
> I am mostly just trying to find out where Yuri’s “partitioning is
> needed for super-huge tables” experience might come from, and noting
> that Heikki’s argument might not be 100% valid.

I think the OP made that clear by stating that his index has relatively
low selectivity. That seems to be a case that Postgres doesn't handle
very well.

> I think that the
> “PostgreSQL-answer” to this problem is to somehow cluster the data on
> the “hotness column” (so that all hot rows are close to one another,
> thereby improving the efficiency of the caching of relation blocks) +
> partial indexes for the hot rows (as first mentioned by Claudio; to
> improve the efficiency of the caching of index blocks).

Agreed, sounds like a sane strategy.

> My reasoning was: To determine which index block to update (typically
> one in both the partitioned and non-partitioned cases), one needs to
> walk the index first, which supposedly causes one additional (read)
> I/O in the non-partitioned case on average, because there is one extra
> level and the lower part of the index is not cached (because of the
> size of the index). I think that pokes a hole in Heikki’s argument of
> “it really doesn’t matter, partitioning == using one big table with
> big non-partial indexes.”

Heikki's argument holds for the general case, where you cannot assume a
well defined hot partition. In that case, the lowest levels of all the
b-trees of the partitions don't fit in the cache, either. A single index
performs better in that case, because it has lower overhead.

I take your point that in case you *can* define a hot partition and
apply partitioning, the hot(test) index(es) are more likely to be cached
and thus require less disk I/O.

Regards

Markus Wanner



Re: Hash partitioning.

From
Jeff Janes
Date:
On Wed, Jun 26, 2013 at 8:55 AM, Markus Wanner <markus@bluegap.ch> wrote:
> On 06/26/2013 05:46 PM, Heikki Linnakangas wrote:
> > We could also allow a large query to search a single table in parallel.
> > A seqscan would be easy to divide into N equally-sized parts that can be
> > scanned in parallel. It's more difficult for index scans, but even then
> > it might be possible at least in some limited cases.
>
> So far reading sequentially is still faster than hopping between
> different locations. Purely from the I/O perspective, that is.

Wouldn't any I/O system used on a high-end machine be fairly good at making this work through interleaved read-ahead algorithms?  Also, hopefully the planner would be able to predict when parallelization has nothing to add and avoid using it, although surely that is easier said than done.

> For queries where the single CPU core turns into a bottle-neck and which
> we want to parallelize, we should ideally still do a normal, fully
> sequential scan and only fan out after the scan and distribute the
> incoming pages (or even tuples) to the multiple cores to process.

That sounds like it would be much more susceptible to lock contention, and harder to get bug-free, than dividing into bigger chunks, like whole 1 gig segments.

Fanning out line by line (according to line_number % number_processes) was my favorite parallelization method in Perl, but those files were read only and so had no concurrency issues.
 
Cheers,

Jeff

Re: Hash partitioning.

From
Jeff Janes
Date:
On Wed, Jun 26, 2013 at 11:14 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Now I just have two indices. One that indexes only hot tuples, it's
> very heavily queried and works blazingly fast, and one that indexes by
> (hotness, key). I include the hotness value in the query, and it still
> works quite fast enough. Luckily, things only become cold through an
> explicit update that marks them cold, so I always know the hotness
> value. I included hotness on the index to cluster updates on the hot
> part of the index, but I could have just used a regular index and paid
> a small price on the updates. Indeed, for a while it worked without the
> hotness, and there was no significant trouble. I later found out that
> WAL bandwidth decreased noticeably when I added that hotness column, so
> I kept it; it helps a bit with replication. It has worked ever since.

I'm surprised that clustering updates into the hot part of the index, without also clustering them into a hot part of the table heap, works well enough to make a difference.  Does clustering in the table just come naturally under your usage patterns?

 Cheers,

Jeff

Re: Hash partitioning.

From
Jeff Janes
Date:
On Thu, Jun 27, 2013 at 2:12 AM, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
> 2013/6/26 Heikki Linnakangas <hlinnakangas@vmware.com>:
>
> > On 26.06.2013 16:41, Yuri Levinsky wrote:
> >
> >> Heikki,
> >> As far as I understand the height of the btree will affect the number of
> >> I/Os necessary. The height of the tree does not increase linearly with
> >> the number of records.
> >
> > Now let's compare that with a hash partitioned table, with 1000 partitions
> > and a b-tree index on every partition. [..] This is almost equivalent to
> > just having a b-tree that's one level taller [..] There certainly isn't
> > any difference in the number of actual I/O performed.
>
> Imagine that there are a lot of indexes, e.g., 50. Although a lookup
> (walking one index) is equally fast, an insertion must update all 50
> indexes. When each index requires one extra I/O (because each index is
> one level taller), that is 50 extra I/Os.

Except for pathological conditions like indexing the longest values that can be indexed, a btree insertion rarely needs to split even the lowest internal page, much less all pages up to the root.

...

> Additionally: Imagine that the data can be partitioned along some
> column that makes sense for performance reasons (e.g., some “date”
> where most accesses are concentrated on rows with more recent dates).
> The other indexes will probably not have such a performance
> distribution. Using those other indexes (both for look-ups and
> updates) in the non-partitioned case, will therefore pull a huge
> portion of each index into cache (because of the “random distribution”
> of the non-date data). In the partitioned case, more cache can be
> spent on the indexes that correspond to the “hot partitions.”

This could well be true for range partitioning on date.  It is hard to see how this would work well for hash partitioning on the date, unless you carefully arrange for the number of hash partitions to be about the same as the number of distinct dates in the table.

Cheers,

Jeff

Re: Hash partitioning.

From
Jeff Janes
Date:
On Thu, Jun 27, 2013 at 9:35 AM, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
> My reasoning was: To determine which index block to update (typically
> one in both the partitioned and non-partitioned cases), one needs to
> walk the index first, which supposedly causes one additional (read)
> I/O in the non-partitioned case on average, because there is one extra
> level and the lower part of the index is not cached (because of the
> size of the index).

But the "extra level" is up at the top where it is well cached, not at the bottom where it is not.

Cheers,

Jeff

Re: Hash partitioning.

From
Markus Wanner
Date:
On 06/27/2013 11:13 PM, Jeff Janes wrote:
> Wouldn't any IO system being used on a high-end system be fairly good
> about making this work through interleaved read-ahead algorithms?

To some extent, certainly. It cannot possibly get better than a fully
sequential load, though.

> That sounds like it would be much more susceptible to lock contention,
> and harder to get bug-free, than dividing into bigger chunks, like whole
> 1 gig segments.  

Maybe, yes. Splitting a known amount of work into equal pieces sounds
like a pretty easy parallelization strategy. In case you don't know the
total amount of work or the size of each piece in advance, it gets a bit
harder. Choosing chunks that turn out to be too big certainly hurts.

Regards

Markus Wanner



Re: Hash partitioning.

From
Claudio Freire
Date:
On Thu, Jun 27, 2013 at 6:20 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Jun 26, 2013 at 11:14 AM, Claudio Freire <klaussfreire@gmail.com>
> wrote:
>>
>>
>> Now I just have two indices. One that indexes only hot tuples, it's
>> very heavily queried and works blazingly fast, and one that indexes by
>> (hotness, key). I include the hotness value in the query, and it still
>> works quite fast enough. Luckily, things only become cold through an
>> explicit update that marks them cold, so I always know the hotness
>> value. I included hotness on the index to cluster updates on the hot
>> part of the index, but I could have just used a regular index and paid
>> a small price on the updates.
>>
>> Indeed, for a while it worked without the hotness, and there was no
>> significant trouble. I later found out that WAL bandwidth decreased
>> noticeably when I added that hotness column, so I kept it; it helps a
>> bit with replication. It has worked ever since.
>
>
>
> I'm surprised that clustering updates into the hot part of the index,
> without also clustering them into a hot part of the table heap, works
> well enough to make a difference.  Does clustering in the table just come
> naturally under your usage patterns?

Yes, hotness is highly correlated with age, though not 100%. So most
updates hit the tail of the table, about a week or two's worth of it.