Thread: Hash partitioning.
<div class="WordSection1"><p class="MsoNormal"><span style="font-family:"Georgia","serif";color:black"> Hi,</span><p class="MsoNormal"><spanstyle="font-family:"Georgia","serif";color:black">Do we have any plans to implement Hash Partitioning,maybe I missing this feature? </span><p class="MsoNormal"><span style="font-family:"Georgia","serif";color:black"> </span><pclass="MsoNormal"><i><span style="font-size:10.0pt;font-family:"Arial","sans-serif";color:#111111">Sincerelyyours</span></i><span style="color:black">,</span><pclass="MsoNormal"><span style="color:black"> </span><p class="MsoNormal"><span style="color:black"><imgalt="Description: Celltick logo_highres" height="53" id="Picture_x0020_1" src="cid:image002.jpg@01CE71BB.6A05EF80"width="143" /></span><p class="MsoNormal"><span style="font-size:8.0pt;font-family:"Arial","sans-serif";color:black">YuriLevinsky, DBA</span><span style="font-size:10.0pt;font-family:"Arial","sans-serif";color:black"></span><pclass="MsoNormal"><span style="font-size:8.0pt;font-family:"Arial","sans-serif";color:black">CelltickTechnologies Ltd., 32 Maskit St., Herzliya 46733,Israel</span><span style="font-size:10.0pt;font-family:"Arial","sans-serif";color:black"></span><p class="MsoNormal"><spanstyle="font-size:8.0pt;font-family:"Arial","sans-serif";color:black">Mobile: +972 54 6107703, Office:+972 9 9710239; Fax: +972 9 9710222</span><span style="font-size:10.0pt;font-family:"Arial","sans-serif";color:black"></span><pclass="MsoNormal"> </div>
On Tue, Jun 25, 2013 at 03:48:19PM +0300, Yuri Levinsky wrote: > Hi, > > Do we have any plans to implement Hash Partitioning, maybe I missing this > feature? You can do it by writing your own constraint and trigger functions that control the hashing. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
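For concreteness, a minimal sketch of the constraint-plus-trigger scheme Bruce describes (table and column names here are hypothetical, and it assumes non-negative user_id values); as the rest of the thread shows, constraint exclusion will not actually prune these partitions for a plain "a = 42" style predicate:

    CREATE TABLE users (user_id int, name text);
    CREATE TABLE users_0 (CHECK ((user_id % 4) = 0)) INHERITS (users);
    CREATE TABLE users_1 (CHECK ((user_id % 4) = 1)) INHERITS (users);
    CREATE TABLE users_2 (CHECK ((user_id % 4) = 2)) INHERITS (users);
    CREATE TABLE users_3 (CHECK ((user_id % 4) = 3)) INHERITS (users);

    CREATE FUNCTION users_insert_trigger() RETURNS trigger AS $$
    BEGIN
        -- Route each row to the child table for its hash bucket
        -- (assumes user_id is non-negative).
        EXECUTE format('INSERT INTO users_%s SELECT ($1).*',
                       NEW.user_id % 4) USING NEW;
        RETURN NULL;  -- row is stored in the child, not the parent
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_partition BEFORE INSERT ON users
        FOR EACH ROW EXECUTE PROCEDURE users_insert_trigger();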
Bruce,
Many thanks. According to the PostgreSQL documentation, only range and list partitioning are supported. My question is: if I follow your advice, will PostgreSQL do partition pruning on SELECT? My expectation is this: I divide my table into 128 hash partitions according to, say, user_id. When I do select * from users where user_id=?, I expect the engine to select from the particular partition determined by my function. The issue is critical when you are working with big tables that you can't sensibly partition by range or list. The feature would allow parallel selects from such a table: each thread might select from its own dedicated partition. It would also (mainly) allow decreasing the b-tree depth of the index on the partition key column by dividing the index into smaller parts.

Sincerely yours, Yuri Levinsky, DBA Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222

-----Original Message----- From: Bruce Momjian [mailto:bruce@momjian.us] Sent: Tuesday, June 25, 2013 4:21 PM To: Yuri Levinsky Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Hash partitioning.

On Tue, Jun 25, 2013 at 03:48:19PM +0300, Yuri Levinsky wrote: > Hi, > > Do we have any plans to implement Hash Partitioning, maybe I missing > this feature? You can do it by writing your own constraint and trigger functions that control the hashing. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Tue, Jun 25, 2013 at 05:19:47PM +0300, Yuri Levinsky wrote: > Bruce, > Many thanks. According to the PostgreSQL documentation, only range and > list partitioning are supported. My question is: if I follow your advice, > will PostgreSQL do partition pruning on SELECT? My expectation is this: > I divide my table into 128 hash partitions according to, say, user_id. > When I do select * from users where user_id=?, I expect the engine to > select from the particular partition determined by my function. The issue > is critical when you are working with big tables that you can't sensibly > partition by range or list. The feature would allow parallel selects from > such a table: each thread might select from its own dedicated partition. > It would also (mainly) allow decreasing the b-tree depth of the index on > the partition key column by dividing the index into smaller parts. Uh, where do you see that we only support range and list? You aren't using an EnterpriseDB closed-source product, are you? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Tue, Jun 25, 2013 at 9:21 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Tue, Jun 25, 2013 at 03:48:19PM +0300, Yuri Levinsky wrote: >> Hi, >> >> Do we have any plans to implement Hash Partitioning, maybe I missing this >> feature? > > You can do it by writing your own constraint and trigger functions that > control the hashing. Not really. Constraint exclusion won't kick in for a constraint like CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form a = 42. Of course, since partitioning generally doesn't improve performance in PostgreSQL anyway, it's not clear why you'd want to do this in the first place. But the fact that constraint exclusion won't work if you do is kind of a knockout blow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 25, 2013 at 11:02:40AM -0400, Robert Haas wrote: > On Tue, Jun 25, 2013 at 9:21 AM, Bruce Momjian <bruce@momjian.us> wrote: > > On Tue, Jun 25, 2013 at 03:48:19PM +0300, Yuri Levinsky wrote: > >> Hi, > >> > >> Do we have any plans to implement Hash Partitioning, maybe I missing this > >> feature? > > > > You can do it by writing your own constraint and trigger functions that > > control the hashing. > > Not really. Constraint exclusion won't kick in for a constraint like > CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form a = 42. Uh, I thought we checked the constant against every CHECK constraint and only scanned partitions that matched. Why does this not work? > Of course, since partitioning generally doesn't improve performance in > PostgreSQL anyway, it's not clear why you'd want to do this in the I think partitioning does improve performance by reducing index depth. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Tue, Jun 25, 2013 at 11:06 AM, Bruce Momjian <bruce@momjian.us> wrote: >> Not really. Constraint exclusion won't kick in for a constraint like >> CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form a = 42. > > Uh, I thought we checked the constant against every CHECK constraint and > only scanned partitions that matched. Why does this not work? That's a pretty fuzzy description of what we do. For this to work, we'd have to be able to use the predicate a = 42 to prove that hashme(a) % 16 = 3 is false. But we can't actually substitute 42 in for a and then evaluate hashme(42) % 16 = 3, because we don't know that the a = 42 in the WHERE clause means exact equality for all purposes, only that it means "has the numerically same value". For integers, equality under = is sufficient to prove equivalence. But for numeric values, for example, it is not. The values '42'::numeric and '42.0'::numeric are equal according to =(numeric, numeric), but they are not the same. If the hashme() function did something like length($1::text), it would get different answers for those two values. IOW, the theorem prover has no way of knowing that the hash function provided has semantics that are compatible with the opclass of the operator used in the query. >> Of course, since partitioning generally doesn't improve performance in >> PostgreSQL anyway, it's not clear why you'd want to do this in the > > I think partitioning does improve performance by reducing index depth. Generally, I think traversing an extra level of the index is cheaper than opening extra relations and going through the theorem-prover machinery. There are benefits to partitioning, but they have to do with management - e.g. each partition can be vacuumed independently; old partitions can be dropped more efficiently than you can bulk-delete rows spread throughout a table - rather than performance. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 25, 2013 at 11:15:24AM -0400, Robert Haas wrote: > On Tue, Jun 25, 2013 at 11:06 AM, Bruce Momjian <bruce@momjian.us> wrote: > >> Not really. Constraint exclusion won't kick in for a constraint like > >> CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form a = 42. > > > > Uh, I thought we checked the constant against every CHECK constraint and > > only scanned partitions that matched. Why does this not work? > > That's a pretty fuzzy description of what we do. For this to work, > we'd have to be able to use the predicate a = 42 to prove that > hashme(a) % 16 = 3 is false. But we can't actually substitute 42 in > for a and then evaluate hashme(42) % 16 = 3, because we don't know > that the a = 42 in the WHERE clause means exact equality for all > purposes, only that it means "has the numerically same value". For > integers, equality under = is sufficient to prove equivalence. > > But for numeric values, for example, it is not. The values > '42'::numeric and '42.0'::numeric are equal according to =(numeric, > numeric), but they are not the same. If the hashme() function did > something like length($1::text), it would get different answers for > those two values. IOW, the theorem prover has no way of knowing that > the hash function provided has semantics that are compatible with the > opclass of the operator used in the query. I looked at predtest.c but I can't see how we accept >= and <= ranges, but not CHECK (a % 16 == 3). It is the '%' operator? I am not sure why the hashme() function is there. Wouldn't it work if hashme() was an immutable function? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Tue, Jun 25, 2013 at 11:45 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Tue, Jun 25, 2013 at 11:15:24AM -0400, Robert Haas wrote: >> On Tue, Jun 25, 2013 at 11:06 AM, Bruce Momjian <bruce@momjian.us> wrote: >> >> Not really. Constraint exclusion won't kick in for a constraint like >> >> CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form a = 42. >> > >> > Uh, I thought we checked the constant against every CHECK constraint and >> > only scanned partitions that matched. Why does this not work? >> >> That's a pretty fuzzy description of what we do. For this to work, >> we'd have to be able to use the predicate a = 42 to prove that >> hashme(a) % 16 = 3 is false. But we can't actually substitute 42 in >> for a and then evaluate hashme(42) % 16 = 3, because we don't know >> that the a = 42 in the WHERE clause means exact equality for all >> purposes, only that it means "has the numerically same value". For >> integers, equality under = is sufficient to prove equivalence. >> >> But for numeric values, for example, it is not. The values >> '42'::numeric and '42.0'::numeric are equal according to =(numeric, >> numeric), but they are not the same. If the hashme() function did >> something like length($1::text), it would get different answers for >> those two values. IOW, the theorem prover has no way of knowing that >> the hash function provided has semantics that are compatible with the >> opclass of the operator used in the query. > > I looked at predtest.c but I can't see how we accept >= and <= ranges, > but not CHECK (a % 16 == 3). It is the '%' operator? I am not sure why > the hashme() function is there. Wouldn't it work if hashme() was an > immutable function?

Let me back up a minute. You told the OP that he could make hash partitioning by writing his own constraint and trigger functions. I think that won't work. But I'm happy to be proven wrong. Do you have an example showing how to do it?

Here's why I think it WON'T work:

rhaas=# create table foo (a int, b text);
CREATE TABLE
rhaas=# create table foo0 (check ((a % 16) = 0)) inherits (foo);
CREATE TABLE
rhaas=# create table foo1 (check ((a % 16) = 1)) inherits (foo);
CREATE TABLE
rhaas=# create table foo2 (check ((a % 16) = 2)) inherits (foo);
CREATE TABLE
rhaas=# create table foo3 (check ((a % 16) = 3)) inherits (foo);
CREATE TABLE
rhaas=# explain select * from foo where a = 1;
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..101.50 rows=25 width=36)
   ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo0  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo1  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo2  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo3  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
(11 rows)

Notice we get a scan on every partition.
Now let's try it with no modulo arithmetic, just a straightforward one-partition-per-value:

rhaas=# create table foo (a int, b text);
CREATE TABLE
rhaas=# create table foo0 (check (a = 0)) inherits (foo);
CREATE TABLE
rhaas=# create table foo1 (check (a = 1)) inherits (foo);
CREATE TABLE
rhaas=# create table foo2 (check (a = 2)) inherits (foo);
CREATE TABLE
rhaas=# create table foo3 (check (a = 3)) inherits (foo);
CREATE TABLE
rhaas=# explain select * from foo where a = 1;
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..25.38 rows=7 width=36)
   ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo1  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
(5 rows)

Voila, now constraint exclusion is working. I confess that I'm not entirely clear about the details either, but the above tests speak for themselves. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Bruce Momjian <bruce@momjian.us> writes: > I looked at predtest.c but I can't see how we accept >= and <= ranges, > but not CHECK (a % 16 == 3). It is the '%' operator? I am not sure why > the hashme() function is there. Wouldn't it work if hashme() was an > immutable function? No. Robert's description is exactly correct: it's a question of whether we can know that the semantics of function X have anything to do with the behavior of operator Y. In the case of something like CHECK (X >= 16) combined with WHERE X = 10, if the given = and >= operators belong to the same btree opclass family then we can assume that their semantics are compatible and then apply reasoning to show that these two clauses can't both be true for the same value of X. We can *not* use "X = 10" to reason about the behavior of anything that isn't in the = operator's btree opclass, because we don't assume that "=" means "absolutely identical for every purpose". And in fact it does not mean that for several pretty common datatypes (float being another example besides numeric). regards, tom lane
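Tom's CHECK (X >= 16) / WHERE X = 10 case can be seen directly (table names here are hypothetical). Because the integer = and >= operators belong to the same btree operator family, the planner can prove the two clauses contradictory and exclude the child:

    CREATE TABLE t (x int);
    CREATE TABLE t_high (CHECK (x >= 16)) INHERITS (t);
    EXPLAIN SELECT * FROM t WHERE x = 10;
    -- t_high does not appear in the plan: x >= 16 and x = 10
    -- cannot both be true, and the planner can prove it because
    -- both operators are members of the integer btree opfamily.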
Guys,
I am sorry for taking your time. The reason for my question is: as a former Oracle DBA and now a beginner PostgreSQL DBA, I would like to say that the current partitioning mechanism might be improved. Sorry, but it seems to me far behind today's requirements. Oracle might be taken as a model for improvement. Unfortunately I don't write C code, and I see my benefit to the PostgreSQL community only in raising this issue. I'll be very happy to be helpful with something else, but...

Sincerely yours, Yuri Levinsky, DBA Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222

-----Original Message----- From: Robert Haas [mailto:robertmhaas@gmail.com] Sent: Tuesday, June 25, 2013 6:55 PM To: Bruce Momjian Cc: Yuri Levinsky; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Hash partitioning.

On Tue, Jun 25, 2013 at 11:45 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Tue, Jun 25, 2013 at 11:15:24AM -0400, Robert Haas wrote: >> On Tue, Jun 25, 2013 at 11:06 AM, Bruce Momjian <bruce@momjian.us> wrote: >> >> Not really. Constraint exclusion won't kick in for a constraint >> >> like CHECK (hashme(a) % 16 == 3) and a WHERE clause of the form a = 42. >> > >> > Uh, I thought we checked the constant against every CHECK >> > constraint and only scanned partitions that matched. Why does this not work? >> >> That's a pretty fuzzy description of what we do. For this to work, >> we'd have to be able to use the predicate a = 42 to prove that >> hashme(a) % 16 = 3 is false. But we can't actually substitute 42 in >> for a and then evaluate hashme(42) % 16 = 3, because we don't know >> that the a = 42 in the WHERE clause means exact equality for all >> purposes, only that it means "has the numerically same value". For >> integers, equality under = is sufficient to prove equivalence. >> >> But for numeric values, for example, it is not. The values >> '42'::numeric and '42.0'::numeric are equal according to =(numeric, >> numeric), but they are not the same. If the hashme() function did >> something like length($1::text), it would get different answers for >> those two values. IOW, the theorem prover has no way of knowing that >> the hash function provided has semantics that are compatible with the >> opclass of the operator used in the query. > > I looked at predtest.c but I can't see how we accept >= and <= ranges, > but not CHECK (a % 16 == 3). It is the '%' operator? I am not sure > why the hashme() function is there. Wouldn't it work if hashme() was > an immutable function? Let me back up a minute. You told the OP that he could make hash partitioning by writing his own constraint and trigger functions. I think that won't work. But I'm happy to be proven wrong. Do you have an example showing how to do it?
Here's why I think it WON'T work:

rhaas=# create table foo (a int, b text);
CREATE TABLE
rhaas=# create table foo0 (check ((a % 16) = 0)) inherits (foo);
CREATE TABLE
rhaas=# create table foo1 (check ((a % 16) = 1)) inherits (foo);
CREATE TABLE
rhaas=# create table foo2 (check ((a % 16) = 2)) inherits (foo);
CREATE TABLE
rhaas=# create table foo3 (check ((a % 16) = 3)) inherits (foo);
CREATE TABLE
rhaas=# explain select * from foo where a = 1;
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..101.50 rows=25 width=36)
   ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo0  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo1  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo2  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo3  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
(11 rows)

Notice we get a scan on every partition. Now let's try it with no modulo arithmetic, just a straightforward one-partition-per-value:

rhaas=# create table foo (a int, b text);
CREATE TABLE
rhaas=# create table foo0 (check (a = 0)) inherits (foo);
CREATE TABLE
rhaas=# create table foo1 (check (a = 1)) inherits (foo);
CREATE TABLE
rhaas=# create table foo2 (check (a = 2)) inherits (foo);
CREATE TABLE
rhaas=# create table foo3 (check (a = 3)) inherits (foo);
CREATE TABLE
rhaas=# explain select * from foo where a = 1;
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..25.38 rows=7 width=36)
   ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
         Filter: (a = 1)
   ->  Seq Scan on foo1  (cost=0.00..25.38 rows=6 width=36)
         Filter: (a = 1)
(5 rows)

Voila, now constraint exclusion is working. I confess that I'm not entirely clear about the details either, but the above tests speak for themselves. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 25, 2013 at 12:08:34PM -0400, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > I looked at predtest.c but I can't see how we accept >= and <= ranges, > > but not CHECK (a % 16 == 3). It is the '%' operator? I am not sure why > > the hashme() function is there. Wouldn't it work if hashme() was an > > immutable function? > > No. Robert's description is exactly correct: it's a question of whether > we can know that the semantics of function X have anything to do with > the behavior of operator Y. In the case of something like CHECK (X >= 16) > combined with WHERE X = 10, if the given = and >= operators belong to > the same btree opclass family then we can assume that their semantics > are compatible and then apply reasoning to show that these two clauses > can't both be true for the same value of X. We can *not* use "X = 10" > to reason about the behavior of anything that isn't in the = operator's > btree opclass, because we don't assume that "=" means "absolutely > identical for every purpose". And in fact it does not mean that for > several pretty common datatypes (float being another example besides > numeric). OK, so it is really the index comparisons that we are using; makes sense. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Yuri Levinsky escribió: > As a former Oracle DBA and now a beginner PostgreSQL DBA, I would like > to say that the current partitioning mechanism might be improved. > Sorry, but it seems to me far behind today's requirements. I don't think you'll find anybody that disagrees with this. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 25, 2013 at 12:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Let me back up a minute. You told the OP that he could make hash
> partitioning by writing his own constraint and trigger functions. I
> think that won't work. But I'm happy to be proven wrong. Do you have
> an example showing how to do it?
>
> Here's why I think it WON'T work:
>
> rhaas=# create table foo (a int, b text);
> CREATE TABLE
> rhaas=# create table foo0 (check ((a % 16) = 0)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo1 (check ((a % 16) = 1)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo2 (check ((a % 16) = 2)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo3 (check ((a % 16) = 3)) inherits (foo);
> CREATE TABLE
> rhaas=# explain select * from foo where a = 1;
>                          QUERY PLAN
> ------------------------------------------------------------
>  Append  (cost=0.00..101.50 rows=25 width=36)
>    ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
>          Filter: (a = 1)
>    ->  Seq Scan on foo0  (cost=0.00..25.38 rows=6 width=36)
>          Filter: (a = 1)
>    ->  Seq Scan on foo1  (cost=0.00..25.38 rows=6 width=36)
>          Filter: (a = 1)
>    ->  Seq Scan on foo2  (cost=0.00..25.38 rows=6 width=36)
>          Filter: (a = 1)
>    ->  Seq Scan on foo3  (cost=0.00..25.38 rows=6 width=36)
>          Filter: (a = 1)
> (11 rows)
>
> Notice we get a scan on every partition. Now let's try it with no
> modulo arithmetic, just a straightforward one-partition-per-value:
>
> rhaas=# create table foo (a int, b text);
> CREATE TABLE
> rhaas=# create table foo0 (check (a = 0)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo1 (check (a = 1)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo2 (check (a = 2)) inherits (foo);
> CREATE TABLE
> rhaas=# create table foo3 (check (a = 3)) inherits (foo);
> CREATE TABLE
> rhaas=# explain select * from foo where a = 1;

Did you try "select * from foo where (a % 16) = (1::int % 16)"? For a few views I have that span multiple "partitions" (in quotes since they're not exactly partitions, but close), I can make constraint exclusion work if I match the expression EXACTLY, including types (I've posted a few questions about this to pg-performance).
<div dir="ltr">On Tue, Jun 25, 2013 at 12:08 PM, Yuri Levinsky <span dir="ltr"><<a href="mailto:yuril@celltick.com" target="_blank">yuril@celltick.com</a>></span>wrote:<br /><div class="gmail_extra"><div class="gmail_quote"><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Guys,<br/> I am sorry for taking your time. The reason for my question is:<br /> As former OracleDBA and now simple beginner PostgreSQL DBA I would like<br /> to say: the current partitioning mechanism might be improved.Sorry, it<br /> seems to me far behind yesterday requirements. As model for improvement<br /> the Oracle might betaken as example. Unfortunately I am not writing an<br /> C code and see my benefit to PostgreSQL community in only risingthis<br /> issue. I'll be very happy to be helpful in something else, but...<br /></blockquote></div><br /></div><divclass="gmail_extra">Please don't flee over this...<br /><br /></div><div class="gmail_extra">As I think you cansee, now, the partitioning problem is tougher than it<br />may at first seem to be. It's quite useful to quickly getto the point of<br /> understanding that.<br /><br /></div><div class="gmail_extra">There would indeed be merit in improvingthe partitioning apparatus,<br /></div><div class="gmail_extra">and actually, I think it's been a couple of yearssince there has been<br /> serious discussion of this.<br /><br /></div><div class="gmail_extra">The discussion tendsto head into the rabbit hole of disputing about<br />whether one mechanism or another is ideal. That's the wrong starting<br/>point - we shouldn't start with what's easiest to make "ideal," we <br /> should start by determining what isrequired/desirable, without too<br />much reference, at least initially, on to how to achieve it.<br /></div><div class="gmail_extra">--<br />When confronted by a difficult problem, solve it by reducing it to the<br /> question, "How wouldthe Lone Ranger handle this?"<br /></div></div>
Christopher Browne <cbbrowne@gmail.com> writes: > There would indeed be merit in improving the partitioning apparatus, > and actually, I think it's been a couple of years since there has been > serious discussion of this. We could certainly use a partitioning mechanism that's easier to use than what we have now, which is basically "build it yourself, here's the parts bin". There would also be some performance benefits from moving the partitioning logic into hard-wired code. However, I find it hard to think that hash partitioning as such is very high on the to-do list. As was pointed out upthread, the main practical advantage of partitioning is *not* performance of routine queries, but improved bulk-data management such as the ability to do periodic housecleaning by dropping a partition. If your partitioning is on a hash, you've thrown away any such advantage, because there's no real-world meaning to the way the data's been split up. So I find range and list partitioning way more plausible. regards, tom lane
On Tue, Jun 25, 2013 at 4:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > However, I find it hard to think that hash partitioning as such is very > high on the to-do list. As was pointed out upthread, the main practical > advantage of partitioning is *not* performance of routine queries, but > improved bulk-data management such as the ability to do periodic > housecleaning by dropping a partition. If your partitioning is on a > hash, you've thrown away any such advantage, because there's no > real-world meaning to the way the data's been split up. So I find range > and list partitioning way more plausible. It would be nice if range partitioning based on some user-defined function was completely automatic, as in: * You define a function that returns a partition name for a given input. * You define a table to somehow be auto-partitioned on your_function(some_column) * The planner knows now it's some_column applied to your_function, so it can do constraint exclusion checks (your_function would probably need to be stable at least) * If a returned partition is missing... what? (auto-create? that'd be nice) It's pretty much what we have already, albeit easier to use. And, perhaps constraint exclusion logic could be specialized for this case, and made more robust.
Claudio Freire <klaussfreire@gmail.com> wrote: > Did you try "select * from foo where (a % 16) = (1::int % 16)"? I did. Using Robert's hashed partitioning table definitions:

test=# explain select * from foo where a = 1 and (a % 16) = (1 % 16);
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..31.53 rows=2 width=36)
   ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
         Filter: ((a = 1) AND ((a % 16) = 1))
   ->  Seq Scan on foo1  (cost=0.00..31.53 rows=1 width=36)
         Filter: ((a = 1) AND ((a % 16) = 1))
(5 rows)

So if you are generating your queries through something capable of generating that last clause off of the first, this could work. Not all applications need to remain as flexible about the operators as we want the database engine itself to be. I agree though, that having an index implementation that can do the first level split faster than any partitioning mechanism can do is better, and that the main benefits of partitioning are in administration, *not* searching. At least until we have parallel query execution. At *that* point this all changes. One other thing worth noting is that I have several times seen cases where the planner cannot exclude partitions, but at execution time it finds that it doesn't need to execute all of the plan nodes. I think it makes sense to not work quite so hard to eliminate partitions at plan time if we can skip the unneeded ones at run time, once we have more data values resolved. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 25, 2013 at 6:52 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > I agree though, that having an index implementation that can do the > first level split faster than any partitioning mechanism can do is > better, and that the main benefits of partitioning are in > administration, *not* searching. Indeed, but the proposal for hash partitions isn't fundamentally different from range partitions. It's "easy-to-use partitions over user-defined functions", hash or not.
Tom,
I clearly understand your point. I actually come from the corporate market: Verizon, Barclays... I remember very well that PostgreSQL is open source, but let's forget that for a moment. The key issue for the corporate market has always been partitioning (vertical, and lately horizontal). Because of that, Oracle has many types and combinations of partitions, and the other vendors do as well. Easy partition maintenance (automatic, simple syntax) is very important for everybody who lives in the corporate RDBMS world, not only for those who use "DBs for free" to create some virtual shop. The main purpose of partitioning in my world is to store billions of rows and be able to search by date, hour or even minute as fast as possible. When you are dealing with a company that has ~350,000,000 users and you don't want to use key/value data stores, you need hash-partitioned tables and hash-partitioned table clusters to perform fast searches and 4-6 table joins based on, for example, user phone number. I believe that to increase PostgreSQL's popularity in the corporate world, and to make real money from support, the next features might be: better vertical and later horizontal partitioning, column-oriented tables, DB freeze for NetApp/EMC snapshots, and similar.

Sincerely yours, Yuri Levinsky, DBA Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222

-----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Tuesday, June 25, 2013 10:33 PM To: Christopher Browne Cc: Yuri Levinsky; Robert Haas; Bruce Momjian; PostgreSQL Mailing Lists Subject: Re: [HACKERS] Hash partitioning.

Christopher Browne <cbbrowne@gmail.com> writes: > There would indeed be merit in improving the partitioning apparatus, > and actually, I think it's been a couple of years since there has been > serious discussion of this. We could certainly use a partitioning mechanism that's easier to use than what we have now, which is basically "build it yourself, here's the parts bin". There would also be some performance benefits from moving the partitioning logic into hard-wired code. However, I find it hard to think that hash partitioning as such is very high on the to-do list. As was pointed out upthread, the main practical advantage of partitioning is *not* performance of routine queries, but improved bulk-data management such as the ability to do periodic housecleaning by dropping a partition. If your partitioning is on a hash, you've thrown away any such advantage, because there's no real-world meaning to the way the data's been split up. So I find range and list partitioning way more plausible. regards, tom lane
On 26.06.2013 11:17, Yuri Levinsky wrote: > The main purpose of partitioning in my world is to store billions of > rows and be able to search by date, hour or even minute as fast as > possible. Hash partitioning sounds like a bad fit for that use case. A regular b-tree, possibly with range partitioning, sounds optimal for that. > When you are dealing with a company that has ~350,000,000 users and you > don't want to use key/value data stores, you need hash-partitioned > tables and hash-partitioned table clusters to perform fast searches and > 4-6 table joins based on, for example, user phone number. B-trees are surprisingly fast for key-value lookups. There is no reason to believe that a hash partitioned table would be faster for that than a plain table. - Heikki
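For reference, the kind of range partitioning Heikki suggests looks like this in the 9.x inheritance machinery (names and date ranges here are hypothetical); note that old partitions can be removed with a cheap DROP TABLE instead of a bulk DELETE:

    CREATE TABLE events (logdate date NOT NULL, user_id int, payload text);
    CREATE TABLE events_2013_06
        (CHECK (logdate >= DATE '2013-06-01' AND logdate < DATE '2013-07-01'))
        INHERITS (events);
    CREATE TABLE events_2013_07
        (CHECK (logdate >= DATE '2013-07-01' AND logdate < DATE '2013-08-01'))
        INHERITS (events);
    CREATE INDEX ON events_2013_06 (logdate);
    CREATE INDEX ON events_2013_07 (logdate);
    -- A query with WHERE logdate >= '2013-07-02' scans only
    -- events_2013_07; expired months are simply dropped:
    --   DROP TABLE events_2013_06;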
On Tue, Jun 25, 2013 at 02:52:33PM -0700, Kevin Grittner wrote:
> Claudio Freire <klaussfreire@gmail.com> wrote:
>
> > Did you try "select * from foo where (a % 16) = (1::int % 16)"?
>
> I did. Using Robert's hashed partitioning table definitions:
>
> test=# explain select * from foo where a = 1 and (a % 16) = (1 % 16);
>                          QUERY PLAN
> ------------------------------------------------------------
>  Append  (cost=0.00..31.53 rows=2 width=36)
>    ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=36)
>          Filter: ((a = 1) AND ((a % 16) = 1))
>    ->  Seq Scan on foo1  (cost=0.00..31.53 rows=1 width=36)
>          Filter: ((a = 1) AND ((a % 16) = 1))
> (5 rows)
>
> So if you are generating your queries through something capable of
> generating that last clause off of the first, this could work. Not

OK, so what is it in our code that requires that? Is it a type mismatch? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Heikki,
As far as I understand the height of the btree will affect the number of I/Os necessary. The height of the tree does not increase linearly with the number of records. I may be wrong in terminology, but when I try to insert data into an empty table, the insertion time increases as the number of records grows. In order to keep indexes as small as possible I usually split the table by hash if I don't have a better alternative. On some systems a hash function plus an index might work faster than an index alone for insert and search operations. This is especially useful when you have a non-unique index with a small number of possible values that you don't know in advance, or that differ between your customers. In that case a hash partition can be used instead of the index.

Sincerely yours, Yuri Levinsky, DBA Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222

-----Original Message----- From: Heikki Linnakangas [mailto:hlinnakangas@vmware.com] Sent: Wednesday, June 26, 2013 2:23 PM To: Yuri Levinsky Cc: Tom Lane; Christopher Browne; Robert Haas; Bruce Momjian; PostgreSQL Mailing Lists Subject: Re: [HACKERS] Hash partitioning.

On 26.06.2013 11:17, Yuri Levinsky wrote: > The main purpose of partitioning in my world is to store billions of > rows and be able to search by date, hour or even minute as fast as > possible. Hash partitioning sounds like a bad fit for that use case. A regular b-tree, possibly with range partitioning, sounds optimal for that. > When you are dealing with a company that has ~350,000,000 users and you > don't want to use key/value data stores, you need hash-partitioned > tables and hash-partitioned table clusters to perform fast searches and > 4-6 table joins based on, for example, user phone number. B-trees are surprisingly fast for key-value lookups. There is no reason to believe that a hash partitioned table would be faster for that than a plain table. - Heikki
On 06/25/2013 11:52 PM, Kevin Grittner wrote: > At least until we have parallel > query execution. At *that* point this all changes. Can you elaborate on that, please? I currently have a hard time imagining how partitions can help performance in that case, either. At least compared to modern RAID and read-ahead capabilities. After all, RAID can be thought of as hash partitioning with a very weird hash function. Or maybe rather range partitioning on an internal key. Put another way: ideally, the system should take care of optimally distributing data across its physical storage itself. If you need to do partitioning manually for performance reasons, that's actually a deficiency of it, not a feature. I certainly agree that manageability may be a perfectly valid reason to partition your data. Maybe there even exist other good reasons. I don't think performance optimization is one. (It's more like giving the system a hint. And we all dislike hints, don't we? *ducks*) Regards Markus Wanner
On Wed, Jun 26, 2013 at 03:47:43PM +0200, Markus Wanner wrote: > On 06/25/2013 11:52 PM, Kevin Grittner wrote: > > At least until we have parallel > > query execution. At *that* point this all changes. > > Can you elaborate on that, please? I currently have a hard time > imagining how partitions can help performance in that case, either. At > least compared to modern RAID and read-ahead capabilities. > > After all, RAID can be thought of as hash partitioning with a very weird > hash function. Or maybe rather range partitioning on an internal key. > > Put another way: ideally, the system should take care of optimally > distributing data across its physical storage itself. If you need to do > partitioning manually for performance reasons, that's actually a > deficiency of it, not a feature. > > I certainly agree that manageability may be a perfectly valid reason to > partition your data. Maybe there even exist other good reasons. I don't > think performance optimization is one. (It's more like giving the system > a hint. And we all dislike hints, don't we? *ducks*) > > Regards > > Markus Wanner > Hi Markus, I think he is referring to the fact that with parallel query execution, multiple partitions can be processed simultaneously instead of serially as they are now with the resulting speed increase. Regards, Ken
On 26.06.2013 16:41, Yuri Levinsky wrote: > Heikki, > As far as I understand the height of the btree will affect the number of > I/Os necessary. The height of the tree does not increase linearly with > the number of records. The height of a b-tree is O(log n), where n is the number of records. Informally, if we assume that you have on average, say, 1000 keys on one b-tree page, a two-level b-tree can hold one million items, and a three level one billion items, and so on. The height of the tree affects the number of I/Os needed for searches, but keep in mind that the top levels of the tree are going to be very frequently accessed and in practice will stay permanently cached. You will only perform actual I/O on the 1-2 bottom levels of the tree (or 0 if it all fits in cache) Now let's compare that with a hash partitioned table, with 1000 partitions and a b-tree index on every partition. When doing a search, you first hash the key to look up the right partition, then you search the index of that partition. This is almost equivalent to just having a b-tree that's one level taller - instead of looking up the right partition in the hash table, you look up the right child page at the root of the b-tree. From a very coarse theoretical point of view, the only difference is that you replaced the binary search on the b-tree root page with an equivalent hash lookup. A hash lookup can be somewhat cheaper than binary search, but in practice there is very little difference. There certainly isn't any difference in the number of actual I/O performed. In practice, there might be a lot of quirks and inefficiencies and locking contention etc. involved in various DBMS's, that you might be able to work around with hash partitioning. But from a theoretical point of view, there is no reason to expect just partitioning a table on a hash to make key-value lookups any faster. - Heikki
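Heikki's arithmetic can be checked directly; the fanout of roughly 1000 keys per page is an assumption, but the logarithmic growth is the point:

    SELECT n, ceil(ln(n) / ln(1000)) AS btree_levels
    FROM (VALUES (1e6), (350e6), (1e9)) AS t(n);
    -- 1e6 rows -> 2 levels, 350e6 rows -> 3, 1e9 rows -> 3:
    -- adding three orders of magnitude adds only one level.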
Markus,
There is no relation between partitions and RAID, despite the fact that both distribute data somehow. At the end of the day, when you use RAID you have one single device with certain performance limitations. When you want to improve data access beyond that, and don't want to work with huge indexes that you are unable to maintain, or don't want to use an index at all (as in the case of range partitioning by time, or hash partitioning), you are welcome to use partitions. You typically don't want to use a b-tree index when you select more than ~1-2% of your data.

Sincerely yours, Yuri Levinsky, DBA Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222

-----Original Message----- From: ktm@rice.edu [mailto:ktm@rice.edu] Sent: Wednesday, June 26, 2013 5:01 PM To: Markus Wanner Cc: Kevin Grittner; Claudio Freire; Robert Haas; Bruce Momjian; Yuri Levinsky; PostgreSQL-Dev Subject: Re: [HACKERS] Hash partitioning.

On Wed, Jun 26, 2013 at 03:47:43PM +0200, Markus Wanner wrote: > On 06/25/2013 11:52 PM, Kevin Grittner wrote: > > At least until we have parallel > > query execution. At *that* point this all changes. > > Can you elaborate on that, please? I currently have a hard time > imagining how partitions can help performance in that case, either. At > least compared to modern RAID and read-ahead capabilities. > > After all, RAID can be thought of as hash partitioning with a very > weird hash function. Or maybe rather range partitioning on an internal key. > > Put another way: ideally, the system should take care of optimally > distributing data across its physical storage itself. If you need to > do partitioning manually for performance reasons, that's actually a > deficiency of it, not a feature. > > I certainly agree that manageability may be a perfectly valid reason > to partition your data. Maybe there even exist other good reasons. I > don't think performance optimization is one. (It's more like giving > the system a hint. And we all dislike hints, don't we? *ducks*) > > Regards > > Markus Wanner > Hi Markus, I think he is referring to the fact that with parallel query execution, multiple partitions can be processed simultaneously instead of serially as they are now with the resulting speed increase. Regards, Ken
On Wed, Jun 26, 2013 at 05:10:00PM +0300, Heikki Linnakangas wrote: > In practice, there might be a lot of quirks and inefficiencies and > locking contention etc. involved in various DBMS's, that you might > be able to work around with hash partitioning. But from a > theoretical point of view, there is no reason to expect just > partitioning a table on a hash to make key-value lookups any faster. Good analysis. Has anyone benchmarked this to know our btree is efficient in this area? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Heikki,
This is the most professional explanation I have ever seen. Let me nevertheless disagree with the bottom line: it depends heavily on the amount of memory and the actual index sizes. I did a benchmark ~6 years ago, and I won a glass of beer. Anyway, I am talking about hash partitioning as a feature, and my comparison with a unique b-tree index scan is a little bit extreme. In case you have 2, 4, 8, ... 1024 distinct values (not known in advance), the index might be eliminated entirely. That's the whole feature: nothing competes with the hash function.

Sincerely yours, Yuri Levinsky, DBA Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222

-----Original Message----- From: Heikki Linnakangas [mailto:hlinnakangas@vmware.com] Sent: Wednesday, June 26, 2013 5:10 PM To: Yuri Levinsky Cc: Tom Lane; Christopher Browne; Robert Haas; Bruce Momjian; PostgreSQL Mailing Lists Subject: Re: [HACKERS] Hash partitioning.

On 26.06.2013 16:41, Yuri Levinsky wrote: > Heikki, > As far as I understand the height of the btree will affect the number > of I/Os necessary. The height of the tree does not increase linearly > with the number of records. The height of a b-tree is O(log n), where n is the number of records. Informally, if we assume that you have on average, say, 1000 keys on one b-tree page, a two-level b-tree can hold one million items, and a three level one billion items, and so on. The height of the tree affects the number of I/Os needed for searches, but keep in mind that the top levels of the tree are going to be very frequently accessed and in practice will stay permanently cached. You will only perform actual I/O on the 1-2 bottom levels of the tree (or 0 if it all fits in cache) Now let's compare that with a hash partitioned table, with 1000 partitions and a b-tree index on every partition. When doing a search, you first hash the key to look up the right partition, then you search the index of that partition. This is almost equivalent to just having a b-tree that's one level taller - instead of looking up the right partition in the hash table, you look up the right child page at the root of the b-tree. From a very coarse theoretical point of view, the only difference is that you replaced the binary search on the b-tree root page with an equivalent hash lookup. A hash lookup can be somewhat cheaper than binary search, but in practice there is very little difference. There certainly isn't any difference in the number of actual I/O performed. In practice, there might be a lot of quirks and inefficiencies and locking contention etc. involved in various DBMS's, that you might be able to work around with hash partitioning. But from a theoretical point of view, there is no reason to expect just partitioning a table on a hash to make key-value lookups any faster. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 26.06.2013 11:17, Yuri Levinsky wrote: >> When you are dealing with a company that has ~350,000,000 users and >> you don't want to use key/value data stores, you need hash-partitioned >> tables and hash-partitioned table clusters to perform fast searches >> and 4-6 table joins based on, for example, user phone number. > B-trees are surprisingly fast for key-value lookups. There is no reason > to believe that a hash partitioned table would be faster for that than a > plain table. Or in short: the quoted advice may very well be true for Oracle, but applying it blindly to Postgres is not a good idea. PG's performance characteristics are a lot different, especially in the area of partitioned tables. regards, tom lane
On 06/26/2013 04:10 PM, Yuri Levinsky wrote: > You typically don't want to use a b-tree index when you select more > than ~1-2% of your data. Agreed. Indices on columns with very low selectivity don't perform well. (Postgres knows that and uses a sequential scan based on selectivity estimates. Being able to eliminate entire partitions from such a seq scan would certainly be beneficial, yes.) In the Postgres world, though, I think CLUSTERing might be the better approach to solve that problem. (Note: this has nothing to do with distributed systems in this case.) I'm not sure what the current status of auto clustering or optimized scans on such a permanently clustered table is, though. The minmax indices proposed for 9.4 might be another feature worth looking at. Both of these approaches may eventually provide a more general and more automatic way to speed up scans on large portions of a relation, IMO. Regards Markus Wanner
On 06/26/2013 04:01 PM, ktm@rice.edu wrote: > I think he is referring to the fact that with parallel query execution, > multiple partitions can be processed simultaneously instead of serially > as they are now with the resulting speed increase. Processing simultaneously is the purpose of parallel query execution, yes. But I see no reason for that not to work equally well for unpartitioned tables. Disk I/O is already pretty well optimized and parallelized, I think. Trying to parallelize a seq scan on the Postgres side is likely to yield far inferior performance. Regards Markus Wanner
On Wed, Jun 26, 2013 at 05:04:11PM +0200, Markus Wanner wrote: > On 06/26/2013 04:01 PM, ktm@rice.edu wrote: > > I think he is referring to the fact that with parallel query execution, > > multiple partitions can be processed simultaneously instead of serially > > as they are now with the resulting speed increase. > > Processing simultaneously is the purpose of parallel query execution, > yes. But I see no reason for that not to work equally well for > unpartitioned tables. Well, I think by definition you are going to be able to spread lookups for a particular range across more partitions with 'hash' than with range or list partitioning. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Markus Wanner <markus@bluegap.ch> wrote: > On 06/25/2013 11:52 PM, Kevin Grittner wrote: >> At least until we have parallel >> query execution. At *that* point this all changes. > > Can you elaborate on that, please? I currently have a hard time > imagining how partitions can help performance in that case, > either. Well, partitioning will *still* be a net loss for overall throughput on a machine with enough active connections to keep all the resources busy. Where it will help is when you have a machine with a lot of cores and a few big "reporting" style queries. Since we currently can only use one core for a single query, we leave a lot of CPU time (often the bottleneck for such queries) unused. If we allow a large query to search multiple partitions in parallel, a big query can complete sooner. It will consume more resources overall in doing so, but if those resources would otherwise sit idle, it could be a big win for some use cases. > Put another way: ideally, the system should take care of > optimally distributing data across its physical storage itself. Agreed. That's not where I see the win. Most cases where I've seen people attempt to micro-manage object placement have performed worse than simply giving the OS a big RAID (we had 40 spindles in some of ours at Wis. Courts) and letting the filesystem figure it out. There are exceptions for special cases like WAL. I found out by accident how much that can matter. > If you need to do partitioning manually for performance reasons, > that's actually a deficiency of it, not a feature. +1 > I certainly agree that manageability may be a perfectly valid > reason to partition your data. It can definitely help there. > Maybe there even exist other good reasons. I'm arguing that better concurrency can be one in the future. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 26.06.2013 18:34, Kevin Grittner wrote: > Markus Wanner<markus@bluegap.ch> wrote: >> On 06/25/2013 11:52 PM, Kevin Grittner wrote: >>> At least until we have parallel >>> query execution. At *that* point this all changes. >> >> Can you elaborate on that, please? I currently have a hard time >> imagining how partitions can help performance in that case, >> either. > > Well, partitioning will *still* be a net loss for overall > throughput on a machine with enough active connections to keep all > the resources busy. Where it will help is when you have a machine > with a lot of cores and a few big "reporting" style queries. Since > we currently can only use one core for a single query, we leave a > lot of CPU time (often the bottleneck for such queries) unused. If > we allow a large query to search multiple partitions in parallel, a > big query can complete sooner. We could also allow a large query to search a single table in parallel. A seqscan would be easy to divide into N equally-sized parts that can be scanned in parallel. It's more difficult for index scans, but even then it might be possible at least in some limited cases. - Heikki
On 06/26/2013 05:46 PM, Heikki Linnakangas wrote: > We could also allow a large query to search a single table in parallel. > A seqscan would be easy to divide into N equally-sized parts that can be > scanned in parallel. It's more difficult for index scans, but even then > it might be possible at least in some limited cases. So far reading sequentially is still faster than hopping between different locations. Purely from the I/O perspective, that is. For queries where the single CPU core turns into a bottle-neck and which we want to parallelize, we should ideally still do a normal, fully sequential scan and only fan out after the scan and distribute the incoming pages (or even tuples) to the multiple cores to process. Regards Markus Wanner
On Wed, Jun 26, 2013 at 11:14 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Wed, Jun 26, 2013 at 05:10:00PM +0300, Heikki Linnakangas wrote: >> In practice, there might be a lot of quirks and inefficiencies and >> locking contention etc. involved in various DBMS's, that you might >> be able to work around with hash partitioning. But from a >> theoretical point of view, there is no reason to expect just >> partitioning a table on a hash to make key-value lookups any faster. > > Good analysis. Has anyone benchmarked this to know our btree is > efficient in this area? Yep. I had at one point, and came to the same conclusion. I ended up building a few partial indices, and have been happy ever since. Granted, my DB isn't that big, just around 200G. No, I don't have the benchmark results. It's been a while. Back then, it was 8.3, so I did the partitioning on the application. It still wasn't worth it. Now I just have two indices. One that indexes only hot tuples, it's very heavily queried and works blazingly fast, and one that indexes by (hotness, key). I include the hotness value on the query, and still works quite fast enough. Luckily, I know things become cold after an update to mark them cold, so I can do that. I included hotness on the index to cluster updates on the hot part of the index, but I could have just used a regular index and paid a small price on the updates. Indeed, for a while it worked without the hotness, and there was no significant trouble. I later found out that WAL bandwidth was noticeably decreased when I added that hotness column, so I did, helps a bit with replication. Has worked ever since. Today, I only use "partitioning" to split conceptually different tables, so I can have different schemas for each table (and normalize with a view). Now it's 9.2, so the view works quite nicely and transparently. I have yet to find a use for hash partitioning.
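A sketch of the scheme Claudio describes, with hypothetical names: one small partial index covering only the hot rows, plus a full index led by the hotness flag so that updates cluster in the hot part of the index:

    CREATE TABLE items (key bigint, hot boolean NOT NULL, data text);
    CREATE INDEX items_hot_idx ON items (key) WHERE hot;  -- small, heavily queried
    CREATE INDEX items_all_idx ON items (hot, key);       -- covers everything
    -- Queries repeat the predicate so the planner can use the partial index:
    --   SELECT * FROM items WHERE hot AND key = $1;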
On Wed, Jun 26, 2013 at 7:01 AM, ktm@rice.edu <ktm@rice.edu> wrote:
> On Wed, Jun 26, 2013 at 03:47:43PM +0200, Markus Wanner wrote:
> > On 06/25/2013 11:52 PM, Kevin Grittner wrote:
> > > At least until we have parallel
> > > query execution. At *that* point this all changes.
> >
> > Can you elaborate on that, please? I currently have a hard time
> > imagining how partitions can help performance in that case, either. At
> > least compared to modern RAID and read-ahead capabilities.
> >
> > After all, RAID can be thought of as hash partitioning with a very weird
> > hash function. Or maybe rather range partitioning on an internal key.
> >
> > Put another way: ideally, the system should take care of optimally
> > distributing data across its physical storage itself. If you need to do
> > partitioning manually for performance reasons, that's actually a
> > deficiency of it, not a feature.
+1, except I'm looking at it from a CPU perspective not a disk perspective.
I would hope not to need to partition my data at all in order to enable parallel execution. I certainly would hope not to redo that partitioning just because I got new hardware with a different number of CPUs.
> Hi Markus,
> I think he is referring to the fact that with parallel query execution,
> multiple partitions can be processed simultaneously instead of serially
> as they are now with the resulting speed increase.
Hopefully parallel execution can divide the query into multiple "chunks" on its own, without me needing to micromanage it.
Cheers,
Jeff
Guys,

Single-core CPUs are dying out even for home users; my cell phone has 4
cores. Today's standard is a minimum of 4 cores per CPU, and tomorrow,
who knows? Parallelization is sometimes the only solution for heavy
nightly jobs. On the other hand, parallelization is very tricky and
unpredictable once it is in users' hands. Anyway, when you have this
option (same for hash partitioning), you are in a much better position
than when you don't have it. The question is: when may we hope to have
it?

Sincerely yours,
Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222

-----Original Message-----
From: Markus Wanner [mailto:markus@bluegap.ch]
Sent: Wednesday, June 26, 2013 6:56 PM
To: Heikki Linnakangas
Cc: Kevin Grittner; Claudio Freire; Robert Haas; Bruce Momjian; Yuri Levinsky; PostgreSQL-Dev
Subject: Re: [HACKERS] Hash partitioning.

On 06/26/2013 05:46 PM, Heikki Linnakangas wrote:
> We could also allow a large query to search a single table in parallel.
> A seqscan would be easy to divide into N equally-sized parts that can
> be scanned in parallel. It's more difficult for index scans, but even
> then it might be possible at least in some limited cases.

So far reading sequentially is still faster than hopping between
different locations. Purely from the I/O perspective, that is.

For queries where the single CPU core turns into a bottle-neck and which
we want to parallelize, we should ideally still do a normal, fully
sequential scan and only fan out after the scan and distribute the
incoming pages (or even tuples) to the multiple cores to process.

Regards

Markus Wanner
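In the meantime, hash partitioning can be emulated on the 9.x releases
of the day with inheritance children, CHECK constraints on a hash
expression, and a routing trigger. The following is a minimal sketch
under assumed names (users, user_id), with four buckets standing in for
the 128 discussed earlier; note that hashtext() is an undocumented
internal function, so a user-defined immutable hash would be safer.

    CREATE TABLE users (user_id int NOT NULL, payload text);

    CREATE TABLE users_p0 (CHECK (abs(hashtext(user_id::text)) % 4 = 0)) INHERITS (users);
    CREATE TABLE users_p1 (CHECK (abs(hashtext(user_id::text)) % 4 = 1)) INHERITS (users);
    CREATE TABLE users_p2 (CHECK (abs(hashtext(user_id::text)) % 4 = 2)) INHERITS (users);
    CREATE TABLE users_p3 (CHECK (abs(hashtext(user_id::text)) % 4 = 3)) INHERITS (users);

    -- Route each insert to its bucket; RETURN NULL suppresses the insert
    -- into the parent table.
    CREATE FUNCTION users_insert_trigger() RETURNS trigger AS $$
    BEGIN
      EXECUTE format('INSERT INTO users_p%s SELECT ($1).*',
                     abs(hashtext(NEW.user_id::text)) % 4)
        USING NEW;
      RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_insert BEFORE INSERT ON users
      FOR EACH ROW EXECUTE PROCEDURE users_insert_trigger();

On the pruning question raised earlier: constraint exclusion cannot
derive the bucket from WHERE user_id = 42 alone; the query has to repeat
the hash expression so the right-hand side folds to a plan-time
constant, e.g.

    SELECT * FROM users
     WHERE user_id = 42
       AND abs(hashtext(user_id::text)) % 4 = abs(hashtext('42')) % 4;

which should let the planner exclude the other children, but is exactly
the kind of micromanagement a native feature would remove.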
2013/6/26 Heikki Linnakangas <hlinnakangas@vmware.com>:
> On 26.06.2013 16:41, Yuri Levinsky wrote:
>
>> Heikki,
>> As far as I understand the height of the btree will affect the number of
>> I/Os necessary. The height of the tree does not increase linearly with
>> the number of records.
>
> Now let's compare that with a hash partitioned table, with 1000 partitions
> and a b-tree index on every partition. [..] This is almost equivalent to
> just having a b-tree that's one level taller [..] There certainly isn't
> any difference in the number of actual I/O performed.

Imagine that there are a lot of indexes, e.g., 50. Although a lookup
(walking one index) is equally fast, an insertion must update all 50
indexes. When each index requires one extra I/O (because each index is
one level taller), that is 50 extra I/Os. In the partitioned case, each
index would require the normal smaller amount of I/Os. Choosing which
partition to use must only be done once: The result “counts” for all
indexes that are to be updated.

Additionally: Imagine that the data can be partitioned along some column
that makes sense for performance reasons (e.g., some “date” where most
accesses are concentrated on rows with more recent dates). The other
indexes will probably not have such a performance distribution. Using
those other indexes (both for look-ups and updates) in the
non-partitioned case, will therefore pull a huge portion of each index
into cache (because of the “random distribution” of the non-date data).
In the partitioned case, more cache can be spent on the indexes that
correspond to the “hot partitions.”

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
2013/6/27 Nicolas Barbier <nicolas.barbier@gmail.com>:
> When each index requires one extra I/O (because each index is
> one level taller), that is 50 extra I/Os. In the partitioned case,
> each index would require the normal smaller amount of I/Os.

[..]

> Using those other indexes (both for look-ups and
> updates) in the non-partitioned case, will therefore pull a huge
> portion of each index into cache (because of the “random distribution”
> of the non-date data). In the partitioned case, more cache can be
> spent on the indexes that correspond to the “hot partitions.”

It seems that the system described by Claudio fixes this problem another way:

Claudio wrote:

> Now I just have two indices. One that indexes only hot tuples, it's
> very heavily queried and works blazingly fast, and one that indexes by
> (hotness, key).

Yuri, maybe that is something you should investigate instead of partitioning?

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
On Jun 27, 2013 12:24 PM, "Nicolas Barbier" <nicolas.barbier@gmail.com> wrote:
>
> 2013/6/27 Nicolas Barbier <nicolas.barbier@gmail.com>:
>
> > When each index requires one extra I/O (because each index is
> > one level taller), that is 50 extra I/Os. In the partitioned case,
> > each index would require the normal smaller amount of I/Os.
>
> [..]
>
> > Using those other indexes (both for look-ups and
> > updates) in the non-partitioned case, will therefore pull a huge
> > portion of each index into cache (because of the “random distribution”
> > of the non-date data). In the partitioned case, more cache can be
> > spent on the indexes that correspond to the “hot partitions.”
>
> It seems that the system described by Claudio fixes this problem another way:
>
> Claudio wrote:
>
> > Now I just have two indices. One that indexes only hot tuples, it's
> > very heavily queried and works blazingly fast, and one that indexes by
> > (hotness, key).

This is not really related to hash partitioning, but you can also do
index partitioning while keeping the tables unpartitioned. If the
hotness field is a timestamp, as it often is, you can create a predicate
index on (key, tstamp) where tstamp > [some date in the recent past],
and replace the index with a newer one every so often to keep the size
small. This way you can have a non-partitioned index for batch queries
and a small one for the OLTP workload. If we added the option to build
indexes using an index-only scan, building the replacement index would
be quite cheap.

Regards,
Ants Aasma
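A sketch of that rotating predicate index, under assumed names (events,
key, tstamp) and illustrative cutoff dates; DROP INDEX CONCURRENTLY
requires 9.2 or later.

    -- Small OLTP index covering only recent rows.
    CREATE INDEX CONCURRENTLY events_recent_idx
        ON events (key, tstamp)
        WHERE tstamp > '2013-06-01';

    -- Every so often, build a replacement with a newer cutoff and drop
    -- the old one.
    CREATE INDEX CONCURRENTLY events_recent2_idx
        ON events (key, tstamp)
        WHERE tstamp > '2013-06-24';
    DROP INDEX CONCURRENTLY events_recent_idx;

The planner will only use the partial index for queries whose WHERE
clause implies the index predicate, so the OLTP queries have to carry a
matching tstamp condition.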
On 06/27/2013 11:12 AM, Nicolas Barbier wrote:
> Imagine that there are a lot of indexes, e.g., 50. Although a lookup
> (walking one index) is equally fast, an insertion must update all 50
> indexes. When each index requires one extra I/O (because each index is
> one level taller), that is 50 extra I/Os. In the partitioned case,
> each index would require the normal smaller amount of I/Os. Choosing
> which partition to use must only be done once: The result “counts” for
> all indexes that are to be updated.

I think you're underestimating the cost of partitioning. After all, the
lookup of what index to update for a given partition is a lookup in
pg_index via pg_index_indrelid_index - a btree index.

Additionally, the depth of an index doesn't directly translate to the
number of I/O writes per insert (or delete). I'd rather expect the avg.
number of I/O writes per insert into a b-tree to be reasonably close to
one - depending mostly on the number of keys per page, not depth.

> Additionally: Imagine that the data can be partitioned along some
> column that makes sense for performance reasons (e.g., some “date”
> where most accesses are concentrated on rows with more recent dates).
> The other indexes will probably not have such a performance
> distribution. Using those other indexes (both for look-ups and
> updates) in the non-partitioned case, will therefore pull a huge
> portion of each index into cache (because of the “random distribution”
> of the non-date data). In the partitioned case, more cache can be
> spent on the indexes that correspond to the “hot partitions.”

That's a valid point, yes. I'd call this index partitioning. And with
partial indices, Postgres already has something that gets pretty close,
I think. Though, I don't consider this to be related to how the tuples
of the relation are laid out on disk.

Regards

Markus Wanner
2013/6/27 Markus Wanner <markus@bluegap.ch>:
> On 06/27/2013 11:12 AM, Nicolas Barbier wrote:
>
>> Imagine that there are a lot of indexes, e.g., 50. Although a lookup
>> (walking one index) is equally fast, an insertion must update all 50
>> indexes. When each index requires one extra I/O (because each index is
>> one level taller), that is 50 extra I/Os. In the partitioned case,
>> each index would require the normal smaller amount of I/Os. Choosing
>> which partition to use must only be done once: The result “counts” for
>> all indexes that are to be updated.
>
> I think you're underestimating the cost of partitioning. After all, the
> lookup of what index to update for a given partition is a lookup in
> pg_index via pg_index_indrelid_index - a btree index.

I am assuming that this (comparatively very small and super-hot) index
is cached all the time, while for the other indexes (that are supposedly
super-huge) only the top part stays cached.

I am mostly just trying to find out where Yuri’s “partitioning is needed
for super-huge tables” experience might come from, and noting that
Heikki’s argument might not be 100% valid.

I think that the “PostgreSQL-answer” to this problem is to somehow
cluster the data on the “hotness column” (so that all hot rows are close
to one another, thereby improving the efficiency of the caching of
relation blocks) + partial indexes for the hot rows (as first mentioned
by Claudio; to improve the efficiency of the caching of index blocks).

> Additionally, the depth of an index doesn't directly translate to the
> number of I/O writes per insert (or delete). I'd rather expect the avg.
> number of I/O writes per insert into a b-tree to be reasonably close to
> one - depending mostly on the number of keys per page, not depth.

My reasoning was: To determine which index block to update (typically
one in both the partitioned and non-partitioned cases), one needs to
walk the index first, which supposedly causes one additional (read) I/O
in the non-partitioned case on average, because there is one extra level
and the lower part of the index is not cached (because of the size of
the index). I think that pokes a hole in Heikki’s argument of “it really
doesn’t matter, partitioning == using one big table with big non-partial
indexes.”

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
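That “cluster on the hotness column + partial index” strategy might look
roughly like this, reusing the same illustrative names as before; note
that CLUSTER is a one-time physical reorder (and takes an exclusive
lock), so it would have to be repeated as the hot set drifts.

    -- Physically group rows by hotness so hot rows share few heap blocks.
    CREATE INDEX items_hotness_key_idx ON items (hot, key);
    CLUSTER items USING items_hotness_key_idx;

    -- Small partial index so look-ups on hot rows walk a shallow,
    -- well-cached b-tree.
    CREATE INDEX items_hot_key_idx ON items (key) WHERE hot;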
On 06/27/2013 06:35 PM, Nicolas Barbier wrote:
> I am assuming that this (comparatively very small and super-hot) index
> is cached all the time, while for the other indexes (that are
> supposedly super-huge) only the top part stays cached.
>
> I am mostly just trying to find out where Yuri’s “partitioning is
> needed for super-huge tables” experience might come from, and noting
> that Heikki’s argument might not be 100% valid.

I think the OP made that clear by stating that his index has relatively
low selectivity. That seems to be a case that Postgres doesn't handle
very well.

> I think that the
> “PostgreSQL-answer” to this problem is to somehow cluster the data on
> the “hotness column” (so that all hot rows are close to one another,
> thereby improving the efficiency of the caching of relation blocks) +
> partial indexes for the hot rows (as first mentioned by Claudio; to
> improve the efficiency of the caching of index blocks).

Agreed, sounds like a sane strategy.

> My reasoning was: To determine which index block to update (typically
> one in both the partitioned and non-partitioned cases), one needs to
> walk the index first, which supposedly causes one additional (read)
> I/O in the non-partitioned case on average, because there is one extra
> level and the lower part of the index is not cached (because of the
> size of the index). I think that pokes a hole in Heikki’s argument of
> “it really doesn’t matter, partitioning == using one big table with
> big non-partial indexes.”

Heikki's argument holds for the general case, where you cannot assume a
well defined hot partition. In that case, the lowest levels of all the
b-trees of the partitions don't fit in the cache, either. A single index
performs better in that case, because it has lower overhead.

I take your point that in case you *can* define a hot partition and
apply partitioning, the hot(test) index(es) are more likely to be cached
and thus require less disk I/O.

Regards

Markus Wanner
On Wed, Jun 26, 2013 at 8:55 AM, Markus Wanner <markus@bluegap.ch> wrote:
> On 06/26/2013 05:46 PM, Heikki Linnakangas wrote:
> > We could also allow a large query to search a single table in parallel.
> > A seqscan would be easy to divide into N equally-sized parts that can be
> > scanned in parallel. It's more difficult for index scans, but even then
> > it might be possible at least in some limited cases.
>
> So far reading sequentially is still faster than hopping between
> different locations. Purely from the I/O perspective, that is.
Wouldn't any IO system being used on a high-end system be fairly good about making this work through interleaved read-ahead algorithms? Also, hopefully the planner would be able to predict when parallelization has nothing to add and avoid using it, although surely that is easier said than done.
> For queries where the single CPU core turns into a bottle-neck and which
> we want to parallelize, we should ideally still do a normal, fully
> sequential scan and only fan out after the scan and distribute the
> incoming pages (or even tuples) to the multiple cores to process.
That sounds like it would be much more susceptible to lock contention, and harder to get bug-free, than dividing into bigger chunks, like whole 1 gig segments.
Fanning out line by line (according to line_number % number_processes) was my favorite parallelization method in Perl, but those files were read only and so had no concurrency issues.
Cheers,
Jeff
On Wed, Jun 26, 2013 at 11:14 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Now I just have two indices. One that indexes only hot tuples, it's
> very heavily queried and works blazingly fast, and one that indexes by
> (hotness, key). I include the hotness value in the query, and it still
> works quite fast enough. Luckily, I know things become cold after an
> update that marks them cold, so I can do that. I included hotness in
> the index to cluster updates on the hot part of the index, but I could
> have just used a regular index and paid a small price on the updates.
>
> Indeed, for a while it worked without the hotness, and there was no
> significant trouble. I later found out that WAL bandwidth noticeably
> decreased when I added that hotness column, so I kept it; it helps a
> bit with replication. It has worked ever since.
I'm surprised that clustering updates into the hot part of the index,
without also clustering them into a hot part of the table heap, works
well enough to make a difference. Does clustering in the table just come
naturally under your usage patterns?
Cheers,
Jeff
On Thu, Jun 27, 2013 at 2:12 AM, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
> 2013/6/26 Heikki Linnakangas <hlinnakangas@vmware.com>:
>> On 26.06.2013 16:41, Yuri Levinsky wrote:
>>
>>> Heikki,
>>> As far as I understand the height of the btree will affect the number of
>>> I/Os necessary. The height of the tree does not increase linearly with
>>> the number of records.
>>
>> Now let's compare that with a hash partitioned table, with 1000 partitions
>> and a b-tree index on every partition. [..] This is almost equivalent to
>> just having a b-tree that's one level taller [..] There certainly isn't
>> any difference in the number of actual I/O performed.
>
> Imagine that there are a lot of indexes, e.g., 50. Although a lookup
> (walking one index) is equally fast, an insertion must update all 50
> indexes. When each index requires one extra I/O (because each index is
> one level taller), that is 50 extra I/Os.
Except for pathological conditions like indexing the longest values that can be indexed, a btree insertion rarely needs to split even the lowest internal page, much less all pages up to the root.
...
> Additionally: Imagine that the data can be partitioned along some
> column that makes sense for performance reasons (e.g., some “date”
> where most accesses are concentrated on rows with more recent dates).
> The other indexes will probably not have such a performance
> distribution. Using those other indexes (both for look-ups and
> updates) in the non-partitioned case, will therefore pull a huge
> portion of each index into cache (because of the “random distribution”
> of the non-date data). In the partitioned case, more cache can be
> spent on the indexes that correspond to the “hot partitions.”
This could well be true for range partitioning on date. It is hard to see how this would work well for hash partitioning on the date, unless you carefully arrange for the number of hash partitions to be about the same as the number of distinct dates in the table.
Cheers,
Jeff
On Thu, Jun 27, 2013 at 9:35 AM, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
> My reasoning was: To determine which index block to update (typically
> one in both the partitioned and non-partitioned cases), one needs to
> walk the index first, which supposedly causes one additional (read)
> I/O in the non-partitioned case on average, because there is one extra
> level and the lower part of the index is not cached (because of the
> size of the index).
But the "extra level" is up at the top where it is well cached, not at the bottom where it is not.
Cheers,
Jeff
On 06/27/2013 11:13 PM, Jeff Janes wrote:
> Wouldn't any IO system being used on a high-end system be fairly good
> about making this work through interleaved read-ahead algorithms?

To some extent, certainly. It cannot possibly get better than a fully
sequential load, though.

> That sounds like it would be much more susceptible to lock contention,
> and harder to get bug-free, than dividing into bigger chunks, like
> whole 1 gig segments.

Maybe, yes. Splitting a known amount of work into equal pieces sounds
like a pretty easy parallelization strategy. In case you don't know the
total amount of work or the size of each piece in advance, it gets a bit
harder. Choosing chunks that turn out to be too big certainly hurts.

Regards

Markus Wanner
On Thu, Jun 27, 2013 at 6:20 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Jun 26, 2013 at 11:14 AM, Claudio Freire <klaussfreire@gmail.com>
> wrote:
>>
>> Now I just have two indices. One that indexes only hot tuples, it's
>> very heavily queried and works blazingly fast, and one that indexes by
>> (hotness, key). I include the hotness value in the query, and it still
>> works quite fast enough. Luckily, I know things become cold after an
>> update that marks them cold, so I can do that. I included hotness in
>> the index to cluster updates on the hot part of the index, but I could
>> have just used a regular index and paid a small price on the updates.
>>
>> Indeed, for a while it worked without the hotness, and there was no
>> significant trouble. I later found out that WAL bandwidth noticeably
>> decreased when I added that hotness column, so I kept it; it helps a
>> bit with replication. It has worked ever since.
>
> I'm surprised that clustering updates into the hot part of the index,
> without also clustering them into a hot part of the table heap, works
> well enough to make a difference. Does clustering in the table just
> come naturally under your usage patterns?

Yes, hotness is highly correlated with age, though not 100%. So most
updates hit the tail of the table, about a week or two's worth of it.