Thread: Re: [PERFORM] CLUSTER command
On Thu, 12 Dec 2002, johnnnnnn wrote: > On Thu, Dec 12, 2002 at 05:39:44PM -0500, Jean-Luc Lachance wrote: > > Let's create a new command: > > > > PARTITION <table> ON <attribute> > <snip> > > Because the table file is already extended (2G limit) using > > different files extension (.N) > > how complicated (modifying the code) would it be to have the table > > files split according to the cluster key? > > I think the code changes would be complicated. Just at a 30-second > consideration, this would need to touch: > - all sql (selects, inserts, updates, deletes) > - vacuuming > - indexing > - statistics gathering > - existing clustering I think his idea was to treat it similarly to the way that the system treats tables >2G with .N files. The only thing is that I believe the code that deals with that wouldn't be particularly easy to change to do it though, but I've only taken a cursory look at what I think is the place that does that(storage/smgr/md.c). Some sort of good partitioning system would be nice though. > create table u1 (...); > create table u2 (...); > create table u3 (...); > > create view uv as (select "A" as partition_key, ... from u1 > union all > select "B" as partition_key, ... from u2 > union all > select "C" as partition_key, ... from u3); > > That keeps the tables in different files on-disk while still allowing > you to query against all of them. You need to index them separately > and logic is necessary when changing data. Unfortunately, I think that the optimizer isn't going to do what you'd hope here and scan only the appropriate table if you were to say partition_key='A' and foo='bar'. I'd love to be shown that I'm wrong, but the best I could see hoping for would be that if partition_key was part of u1-u3 and there was an index on partition_key,foo that it could use that and do minimal work on the other tables. In addition, doing something like the above is a nightmare if you don't know beforehand what the partitions should be (for example if you know there aren't alot of distinct values, but you don't know what they are) or for that matter even with 10-15 partitions, writing the rules and such would probably be really error prone.
On Thu, Dec 12, 2002 at 04:03:47PM -0800, Stephan Szabo wrote: > On Thu, 12 Dec 2002, johnnnnnn wrote: > > > I think the code changes would be complicated. Just at a 30-second > > consideration, this would need to touch: > > - all sql (selects, inserts, updates, deletes) > > - vacuuming > > - indexing > > - statistics gathering > > - existing clustering > > I think his idea was to treat it similarly to the way that the > system treats tables >2G with .N files. The only thing is that > I believe the code that deals with that wouldn't be particularly > easy to change to do it though, but I've only taken a cursory look at > what I think is the place that does that(storage/smgr/md.c). Some sort of > good partitioning system would be nice though. I don't think this is doable without a huge amount of work. The storage manager doesn't know anything about what is in a page, let alone a tuple. And it shouldn't, IMHO. Upper levels don't know how are pages organized in disk; they don't know about .1 segments and so on, and they shouldn't. I think this kind of partition doesn't buy too much. I would really like to have some kind of auto-clustering, but it should be implemented in some upper level; e.g., by leaving some empty space in pages for future tuples, and arranging the whole heap again when it runs out of free space somewhere. Note that this is very far from the storage manager. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "La realidad se compone de muchos sueños, todos ellos diferentes, pero en cierto aspecto, parecidos..." (Yo, hablando de sueños eróticos)
I think Oracle does something like this with its clustering. You set a %fill and Oracle uses this when doing inserts into a segment and when to add a new one. There is also some control over the grouping of data within a page. I don't have an Oracle manual present, but I think the clustering works on a specific index. I agree that adding auto-clustering would be a very good thing and that we can learn about functionality by studying what other applications have already done and if/how those strategies were successful. Charlie Alvaro Herrera wrote: >On Thu, Dec 12, 2002 at 04:03:47PM -0800, Stephan Szabo wrote: > > >>On Thu, 12 Dec 2002, johnnnnnn wrote: >> >> >> >>>I think the code changes would be complicated. Just at a 30-second >>>consideration, this would need to touch: >>>- all sql (selects, inserts, updates, deletes) >>>- vacuuming >>>- indexing >>>- statistics gathering >>>- existing clustering >>> >>> >>I think his idea was to treat it similarly to the way that the >>system treats tables >2G with .N files. The only thing is that >>I believe the code that deals with that wouldn't be particularly >>easy to change to do it though, but I've only taken a cursory look at >>what I think is the place that does that(storage/smgr/md.c). Some sort of >>good partitioning system would be nice though. >> >> > >I don't think this is doable without a huge amount of work. The storage >manager doesn't know anything about what is in a page, let alone a >tuple. And it shouldn't, IMHO. Upper levels don't know how are pages >organized in disk; they don't know about .1 segments and so on, and they >shouldn't. > >I think this kind of partition doesn't buy too much. I would really >like to have some kind of auto-clustering, but it should be implemented >in some upper level; e.g., by leaving some empty space in pages for >future tuples, and arranging the whole heap again when it runs out of >free space somewhere. Note that this is very far from the storage >manager. > > > -- Charles H. Woloszynski ClearMetrix, Inc. 115 Research Drive Bethlehem, PA 18015 tel: 610-419-2210 x400 fax: 240-371-3256 web: www.clearmetrix.com
On Thu, 12 Dec 2002, Alvaro Herrera wrote: > On Thu, Dec 12, 2002 at 04:03:47PM -0800, Stephan Szabo wrote: > > On Thu, 12 Dec 2002, johnnnnnn wrote: > > > > > I think the code changes would be complicated. Just at a 30-second > > > consideration, this would need to touch: > > > - all sql (selects, inserts, updates, deletes) > > > - vacuuming > > > - indexing > > > - statistics gathering > > > - existing clustering > > > > I think his idea was to treat it similarly to the way that the > > system treats tables >2G with .N files. The only thing is that > > I believe the code that deals with that wouldn't be particularly > > easy to change to do it though, but I've only taken a cursory look at > > what I think is the place that does that(storage/smgr/md.c). Some sort of > > good partitioning system would be nice though. > > I don't think this is doable without a huge amount of work. The storage > manager doesn't know anything about what is in a page, let alone a > tuple. And it shouldn't, IMHO. Upper levels don't know how are pages > organized in disk; they don't know about .1 segments and so on, and they > shouldn't. Which is part of why I said it wouldn't be easy to change to do that, there's no good way to communicate that information. Like I said, I didn't look deeply, but I had to look though, because you can never tell with bits of old university code to do mostly what you want that haven't been exercised in years floating around. > I think this kind of partition doesn't buy too much. I would really > like to have some kind of auto-clustering, but it should be implemented > in some upper level; e.g., by leaving some empty space in pages for > future tuples, and arranging the whole heap again when it runs out of > free space somewhere. Note that this is very far from the storage > manager. Auto clustering would be nice. I think Jean-Luc's suggested partitioning mechanism has certain usage patterns that it's a win for and most others that it's not. Since the usage pattern I can think of (very large table with a small number of breakdowns where your conditions are primarily on those breakdowns) aren't even remotely in the domain of things I've worked with, I can't say whether it'd end up really being a win to avoid the index reads for the table.
Stephan, Someone commented earlier about the separation/abstraction of the storage manager. I agree that it should not be done at the storage level. Maybe a better idea, would be to create a new pg_partition table that would have the functionality of an index on the key field and also be used to point to a file/table ID. That would be alot more work to code on thet planner though. If a newly inherited table could also inherite the constraints and indecies of its parent maybe things would be easier. JLL Stephan Szabo wrote: > > On Thu, 12 Dec 2002, johnnnnnn wrote: > > > On Thu, Dec 12, 2002 at 05:39:44PM -0500, Jean-Luc Lachance wrote: > > > Let's create a new command: > > > > > > PARTITION <table> ON <attribute> > > <snip> > > > Because the table file is already extended (2G limit) using > > > different files extension (.N) > > > how complicated (modifying the code) would it be to have the table > > > files split according to the cluster key? > > > > > I think the code changes would be complicated. Just at a 30-second > > consideration, this would need to touch: > > - all sql (selects, inserts, updates, deletes) > > - vacuuming > > - indexing > > - statistics gathering > > - existing clustering > > I think his idea was to treat it similarly to the way that the > system treats tables >2G with .N files. The only thing is that > I believe the code that deals with that wouldn't be particularly > easy to change to do it though, but I've only taken a cursory look at > what I think is the place that does that(storage/smgr/md.c). Some sort of > good partitioning system would be nice though. > > > create table u1 (...); > > create table u2 (...); > > create table u3 (...); > > > > create view uv as (select "A" as partition_key, ... from u1 > > union all > > select "B" as partition_key, ... from u2 > > union all > > select "C" as partition_key, ... from u3); > > > > That keeps the tables in different files on-disk while still allowing > > you to query against all of them. You need to index them separately > > and logic is necessary when changing data. > > Unfortunately, I think that the optimizer isn't going to do what you'd > hope here and scan only the appropriate table if you were to say > partition_key='A' and foo='bar'. I'd love to be shown that I'm wrong, but > the best I could see hoping for would be that if partition_key was part of > u1-u3 and there was an index on partition_key,foo that it could use that > and do minimal work on the other tables. > > In addition, doing something like the above is a nightmare if you don't > know beforehand what the partitions should be (for example if you know > there aren't alot of distinct values, but you don't know what they are) or > for that matter even with 10-15 partitions, writing the rules and such > would probably be really error prone. > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly