Thread: How to best use 32 15k.7 300GB drives?
Hi,

I use PostgreSQL basically as a data warehouse to store all the genetic data that our lab generates. The only person that accesses the database is myself, and therefore I've had it housed on my workstation in my office up till now. However, it's getting to be time to move it to bigger hardware. I currently have a server that is basically only storing backup images of all our other workstations, so I'm going to move my database onto it. The server looks like this: Windows Server Enterprise 2008 R2 64-bit, AMD 2350 quad-core x2, 32GB RAM. For my purposes the CPUs and RAM are fine.

I currently have an Adaptec 52445+BBU controller that has the OS (4-drive RAID5), FTP (4-drive RAID5) and two backup arrays (8 drives each, RAID0). The backup arrays are in a 16-drive external enclosure through an expander, so I actually have 16 ports free on the 52445 card. I plan to remove 3 of the drives from my backup arrays to make room for 3 - 73GB 15k.5 drives (re-purposed from my workstation). Two 16-drive enclosures with SAS2 expanders just arrived, as well as 36 Seagate 15k.7 300GB drives (ST3300657SS). I also intend on getting an Adaptec 6445 controller with the flash module when it becomes available in about a month or two. I already have several Adaptec cards, so I'd prefer to stick with them.

Here's the way I was planning on using the new hardware:

xlog & wal: 3 - 73G 15k.5, RAID1 + hot spare, enclosure A, on 52445 controller
data: 22 - 300G 15k.7, RAID10, enclosures B & C, on 6445 controller
indexes: 8 - 300G 15k.7, RAID10, enclosure C, on 6445 controller
2 - 300G 15k.7 as hot spares, enclosure C
4 spare 15k.7 for on the shelf

With this configuration I figure I'll have ~3TB for my main data tables and 1TB for indexes. Right now my database is 500GB total. The 3:1 split reflects my current table structure and what I foresee coming down the road in terms of new data.
So my questions are: 1) am I crazy for doing this, 2) would you change anything, and 3) is it acceptable to put the xlog & wal (and perhaps tmp filespace) on a different controller than everything else? Please keep in mind I'm a geneticist who happens to know a little bit about bioinformatics, and not the reverse. :-)

Thanks!
Bob

--
*************************************************
Robert Schnabel
Research Assistant Professor
University of Missouri-Columbia
Animal Sciences Unit, Rm.162
920 East Campus Drive
Columbia, MO 65211-5300
Phone: 573-884-4106
Fax: 573-882-6827
http://animalgenomics.missouri.edu

"...Socialist governments traditionally do make a financial mess. They always run out of other people's money." Margaret Thatcher, 5 February 1976
*************************************************
On Thu, 27 Jan 2011, Robert Schnabel wrote:
> [original message quoted in full; snipped]
> So my questions are 1) am I crazy for doing this, 2) would you change
> anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp
> filespace) on a different controller than everything else? Please keep in
> mind I'm a geneticist who happens to know a little bit about bioinformatics
> and not the reverse. :-)

a number of questions spring to mind:

how much of the time are you expecting to spend inserting data into this system vs querying data from the system?

is data arriving continuously, or is it a matter of receiving a bunch of data, inserting it, then querying it?

which do you need to optimize for, insert speed or query speed?

do you expect your queries to be searching for a subset of the data scattered randomly throughout the input data, or do you expect it to be a 'grab this (relatively) sequential chunk of input data and manipulate it to generate a report' type of thing?

what is your connectivity to the raid enclosures? (does putting 22 drives on one cable mean that you will be limited by the bandwidth of that cable rather than the performance of the drives?)

can you do other forms of raid on these drives, or only raid 10?

how critical is the data in this database? if it were to die, would it just be a matter of recreating it and reloading the data, or would you lose irreplaceable data?

David Lang
On January 27, 2011, Robert Schnabel <schnabelr@missouri.edu> wrote:
> So my questions are 1) am I'm crazy for doing this, 2) would you change
> anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp
> filespace) on a different controller than everything else? Please keep
> in mind I'm a geneticist who happens to know a little bit about
> bioinformatics and not the reverse. :-)
>
Putting the WAL on a second controller does help, if you're write-heavy.
I tried separating indexes and data once on one server and didn't really notice that it helped much. Managing the space was problematic. I would suggest putting those together on a single RAID-10 of all the 300GB drives (minus a spare). It will probably outperform separate arrays most of the time, and be much easier to manage.
--
A hybrid Escalade is missing the point much in the same way that having a diet soda with your extra large pepperoni pizza is missing the point.
On 1/27/2011 5:19 PM, david@lang.hm wrote:
> On Thu, 27 Jan 2011, Robert Schnabel wrote:
>> [original message quoted in full; snipped]
> a number of questions spring to mind
>
> how much of the time are you expecting to spend inserting data into this
> system vs querying data from the system?
>
> is data arriving continuously, or is it a matter of receiving a bunch of
> data, inserting it, then querying it?
>
> which do you need to optimize for, insert speed or query speed?

Bulk loads of GBs of data via COPY from csv files once every couple of weeks. I basically only have a couple of different table "types" based on the data going into them. Each type is set up as inherited tables, so there is a new child table for each "sample" that is added. Once the bulk data is inserted into the tables I generally do some updates on columns to set values which characterize the data. These columns then get indexed. Basically, once the initial manipulation is done the table is static, and what I'm looking for is query speed.

> do you expect your queries to be searching for a subset of the data
> scattered randomly throughout the input data, or do you expect it to be
> 'grab this (relatively) sequential chunk of input data and manipulate it to
> generate a report' type of thing

Generally it is "grab a big sequential chunk of data and either dump it to a csv or insert it into another table." I use external scripts to format data. My two big table structures look like this:

CREATE TABLE genotypes
(
  snp_number integer NOT NULL,
  sample_id integer NOT NULL,
  genotype smallint NOT NULL
)

There are ~58k unique snp_number. Other tables will have upwards of 600-700k snp_number.
The child tables have a constraint based on sample_id, such as:

CONSTRAINT check100 CHECK (sample_id > 100000000 AND sample_id < 101000000)

The data is sorted by snp_number, sample_id. So if I want the data for a given sample_id, it would be a block of ~58k rows. The size of the table depends on how many sample_ids there are. My largest has ~30k sample_ids by 58k snp_numbers per sample. The other big table (with children) is "mutations" and is set up similarly, so that I can access individual tables (samples) based on constraints. Each of these children has between 5-60M records.

> what is your connectivity to the raid enclosures? (does
> putting 22 drives on one cable mean that you will be limited due to the
> bandwidth of this cable rather than the performance of the drives)
>
> can you do other forms of raid on these drives or only raid 10?

This is all direct-attach storage via SAS2, so I'm guessing it's probably limited to the single-port link between the controller and the expander. Again, geneticist here, not computer scientist. ;-) The enclosures have Areca ARC-8026-16 expanders. I can basically do whatever RAID level I want.

> how critical is the data in this database? if it were to die would it just
> be a matter of recreating it and reloading the data? or would you lose
> irreplaceable data?
>
> David Lang

All of the data could be reloaded. Basically, once I get the data into the database and I'm done manipulating it, I create a backup copy/dump which then gets stored at a couple of different locations. Like I said, I really only do big loads/updates periodically, so if it tanked, all I'd be out is whatever I did since the last backup/dump, plus some time. My goal is to 1) have a fairly robust system so that I don't have to spend my time rebuilding things and 2) be able to query the data quickly. Most of what I do are ad hoc queries. I have an idea... "how many X have Y in this set of Z samples"... and write the query to get the answer.
I can wait a couple minutes to get an answer but waiting an hour is becoming tiresome. Bob
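The inheritance scheme Bob describes might look something like the following sketch. The parent table and the check100 constraint are from the thread; the child table name and the example query are illustrative assumptions, not from the original messages:

```sql
-- Parent table (as posted), plus one hypothetical child per sample batch.
CREATE TABLE genotypes (
    snp_number integer  NOT NULL,
    sample_id  integer  NOT NULL,
    genotype   smallint NOT NULL
);

-- Child table carrying the CHECK constraint from the thread; the planner
-- can use it to skip this child entirely for non-matching sample_ids.
CREATE TABLE genotypes_batch100 (
    CONSTRAINT check100 CHECK (sample_id > 100000000 AND sample_id < 101000000)
) INHERITS (genotypes);

-- With constraint exclusion (the default setting 'partition' covers
-- inheritance queries in 9.0), a query against the parent only scans
-- children whose CHECK constraints can match:
SELECT snp_number, genotype
FROM genotypes
WHERE sample_id = 100000123;
```

Since the data within each child is stored sorted by (snp_number, sample_id), a per-sample fetch like this reads a mostly sequential block of ~58k rows.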
sorry for not replying properly to your response, I managed to delete the mail.

as I understand your data access pattern, it's the following:

for the main table space: bulk loads every couple of weeks. if the data is lost you can just reload it. searches tend to be extracting large sequential chunks of data, either to external files or into different table spaces. for this table space, you are basically only inserting every couple of weeks, and it sounds as if you do not really care how long it takes to load the data.

first the disclaimer: I'm not a postgres expert, but I do have good experience with large amounts of data on linux systems (and especially with running into the limitations when doing it on the cheap ;-)

with this data pattern your WAL is nearly meaningless (as it's only relevant for inserts), and you may as well use raid6 as raid10 (both allow you to utilize all drives for reads, but raid6 gives you 2 drives' worth of redundancy, while losing the wrong two drives in a raid10 could kill the entire array). You may even want to disable fsync on imports. It will save you a lot of time, and if the system crashes during the load you can just reinitialize and reload the data.

however, since you are going to be doing large sequential data transfers, you want to be utilizing multiple SAS links, preferably as evenly as possible. so rather than putting all your data drives on one port, you may want to spread them between ports so that your aggregate bandwidth to the drives is higher (with this many high-speed drives, this is a significant limitation).

the usual reason for keeping the index drives separate is to avoid having writes interact with index reads. Since you are not going to be doing both at the same time, I don't know if it helps to separate your indexes.
now, if you pull the data from this main table into a smaller table for analysis, you may want to do more interesting things with the drives that you use for this smaller table, as you are going to be loading data into them more frequently.

David Lang
Robert,

* Robert Schnabel (schnabelr@missouri.edu) wrote:
> Once the bulk data is inserted into the tables I generally
> do some updates on columns to set values which characterize the
> data.

Please tell me you're not running actual full-table UPDATE statements... You would be *much* better off either:

a) munging the data on the way in (if possible/reasonable)

b) loading the data into temp tables first, and then using INSERT statements to move the data into the 'final' tables WITH the new columns/info you want

c) considering if you can normalize the data into multiple tables and/or cut down the columns to only what you need as you go through the above, too

A full-table UPDATE means you're basically making the table twice as big as it needs to be, and it'll never get smaller...

> These columns then get indexed. Basically once the initial
> manipulation is done the table is then static and what I'm looking
> for is query speed.

Sadly, this is the same type of DW need that I've got (though with telecomm data and phone calls, not genetic stuff ;), and PG ends up being limited by the fact that it can only use one core/thread to go through the data with.

You might consider investing some time trying to figure out how to parallelize your queries. My approach to this has been to partition the data (probably something you're doing already) into multiple tables and then have shell/perl scripts which will run a given query against all of the tables, dumping the results of that aggregation/analysis into other tables, and then having a final 'merge' query.

> The data is sorted by snp_number, sample_id. So if I want the data
> for a given sample_id it would be a block of ~58k rows. The size of
> the table depends on how many sample_id's there are. My largest has
> ~30k sample_id by 58k snp_number per sample. The other big table
> (with children) is "mutations" and is set up similarly so that I can
> access individual tables (samples) based on constraints. Each of
> these children have between 5-60M records.

Understand that indexes are typically only going to be used/useful if the number of records being returned is small relative to the size of the table (e.g. 5%).

> This is all direct attach storage via SAS2 so I'm guessing it's
> probably limited to the single port link between the controller and
> the expander. Again, geneticist here not computer scientist. ;-)

That link certainly isn't going to help things... You might consider how or if you can improve that.

> All of the data could be reloaded. Basically, once I get the data
> into the database and I'm done manipulating it I create a backup
> copy/dump which then gets stored at a couple different locations.

You might consider turning fsync off while you're doing these massive data loads... and make sure that you issue your 'CREATE TABLE' and your 'COPY' statements in the same transaction. And again, I suggest loading into temporary (CREATE TEMPORARY TABLE) tables first, then doing the CREATE TABLE/INSERT statement for the 'real' table. Make sure that you create *both* your constraints *and* your indexes *after* the table is populated. If you turn fsync off, make sure you turn it back on. :)

> My goal is to 1) have a fairly robust system so that I don't have to
> spend my time rebuilding things and 2) be able to query the data
> quickly. Most of what I do are ad hoc queries. I have an idea...
> "how many X have Y in this set of Z samples" and write the query to
> get the answer. I can wait a couple minutes to get an answer but
> waiting an hour is becoming tiresome.

Have you done any analysis to see what the bottleneck actually is? When you run top, is your PG process constantly in 'D' state, or is it in 'R' state, or what? Might help figure some of that out.

Note that parallelizing the query will help regardless of whether it's disk-bound or CPU-bound, when you're running on the kind of hardware you're talking about (lots of spindles, multiple CPUs, etc).
Thanks, Stephen
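Stephen's load recipe, sketched in SQL. The staging table, file path, and index name are illustrative assumptions (fsync is a server-wide setting changed in postgresql.conf, not shown here; remember to turn it back on after the load):

```sql
BEGIN;

-- Stage the raw csv into a temporary table first
CREATE TEMPORARY TABLE genotypes_stage (
    snp_number integer,
    sample_id  integer,
    genotype   smallint
);
COPY genotypes_stage FROM 'C:/loads/batch100.csv' WITH (FORMAT csv);

-- CREATE TABLE and load in the same transaction, doing any
-- munging/sorting on the way into the 'real' table
CREATE TABLE genotypes_batch100 AS
SELECT snp_number, sample_id, genotype
FROM genotypes_stage
ORDER BY snp_number, sample_id;

COMMIT;

-- Constraints and indexes only *after* the table is populated
ALTER TABLE genotypes_batch100
    ADD CONSTRAINT check100 CHECK (sample_id > 100000000 AND sample_id < 101000000);
CREATE INDEX genotypes_batch100_sample_idx
    ON genotypes_batch100 (sample_id);
ANALYZE genotypes_batch100;
```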
> Putting the WAL on a second controller does help, if you're write-heavy.
>
> I tried separating indexes and data once on one server and didn't
> really notice that it helped much. Managing the space was problematic.
> I would suggest putting those together on a single RAID-10 of all the
> 300GB drives (minus a spare). It will probably outperform separate
> arrays most of the time, and be much easier to manage.

I like to use RAID 1, and let LVM do the striping. That way I can add more drives later too.
<br /> On 1/28/2011 7:14 AM, Stephen Frost wrote: <blockquote cite="mid:20110128131410.GC30352@tamriel.snowman.net" type="cite"><prewrap="">Robert, * Robert Schnabel (<a class="moz-txt-link-abbreviated" href="mailto:schnabelr@missouri.edu">schnabelr@missouri.edu</a>) wrote: </pre><blockquote type="cite"><pre wrap="">Once the bulk data is inserted into the tables I generally do some updates on columns to set values which characterize the data. </pre></blockquote><pre wrap=""> Please tell me you're not running actual full-table UPDATE statements... You would be *much* better off either: a) munging the data on the way in (if possible/reasonable) b) loading the data into temp tables first, and then using INSERT statements to move the data into the 'final' tables WITHthe new columns/info you want c) considering if you can normalize the data into multiple tables and/or to cut down the columns to only what you need asyou go through the above, too A full-table UPDATE means you're basically making the table twice as big as it needs to be, and it'll never get smaller.. </pre></blockquote> Depends on what you mean by that. The tables that I'm concerned with look something like bigint x2,char var x13, int x24, real x8, smallint x4 by about 65M rows, each. I only do the updates on one table at a time. Thereal columns are actually null in the input csv file. I run an update which basically uses some of the integer columnsand calculates frequencies which go into the real columns. Ditto with some of the other columns. I don't do thisbefore I upload the data because 1) it's easier this way and 2) I can't because some of the updates involve joins toother tables to grab info that I can't do outside the database. So yes, once the upload is done I run queries that updateevery row for certain columns, not every column. After I'm done with a table I run a VACUUM ANALYZE. I'm really notworried about what my table looks like on disk. 
I actually take other steps also to avoid what you're talking about.<br/><br /><blockquote cite="mid:20110128131410.GC30352@tamriel.snowman.net" type="cite"><pre wrap=""></pre><blockquotetype="cite"><pre wrap="">These columns then get indexed. Basically once the initial manipulation is done the table is then static and what I'm looking for is query speed. </pre></blockquote><pre wrap=""> Sadly, this is the same type of DW needs that I've got (though with telecomm data and phone calls, not genetic stuffs ;), and PG ends up being limited by the fact that it can only use one core/thread to go through the data with. You might consider investing some time trying to figure out how to parallelize your queries. My approach to this has been to partition the data (probably something you're doing already) into multiple tables and then have shell/perl scripts which will run a given query against all of the tables, dumping the results of that aggregation/analysis into other tables, and then having a final 'merge' query. </pre></blockquote> Thanks for the advise but parallelizing/automating doesn't really do anything for me. The data is alreadypartitioned. Think of it this way, you just got 65M new records with about 30 data points per record on an individualsample. You put it in a new table of it's own and now you want to characterize those 65M data points. The firstupdate flags about 60M of the rows as uninteresting so you move them to their own *uninteresting* table and basicallynever really touch them again (but you cant get rid of them). Now you're working with 5M that you're going to characterizeinto about 20 categories based on what is in those 30 columns of data. Do all the querying/updating then indexand you're done. Too long to describe but I cannot automate this. 
I only update one partition at a time and only aboutevery couple weeks or so.<br /><br /><br /><blockquote cite="mid:20110128131410.GC30352@tamriel.snowman.net" type="cite"><prewrap=""></pre><blockquote type="cite"><pre wrap="">The data is sorted by snp_number, sample_id. So if Iwant the data for a given sample_id it would be a block of ~58k rows. The size of the table depends on how many sample_id's there are. My largest has ~30k sample_id by 58k snp_number per sample. The other big table (with children) is "mutations" and is set up similarly so that I can access individual tables (samples) based on constraints. Each of these children have between 5-60M records. </pre></blockquote><pre wrap=""> Understand that indexes are only going to be used/useful, typically, if the amount of records being returned is small relative to the size of the table (eg: 5%). </pre></blockquote> Yep, I understand that. Even though they occupy a lot of space, I keep them around because there aretimes when I need them.<br /><br /><br /><blockquote cite="mid:20110128131410.GC30352@tamriel.snowman.net" type="cite"><prewrap=""> </pre><blockquote type="cite"><pre wrap="">This is all direct attach storage via SAS2 so I'm guessing it's probably limited to the single port link between the controller and the expander. Again, geneticist here not computer scientist. ;-) </pre></blockquote><pre wrap=""> That link certainly isn't going to help things.. You might consider how or if you can improve that. </pre></blockquote> Suggestions??? It was previously suggested to split the drives on each array across the two controllerports rather than have all the data drives on one port which makes sense. Maybe I'm getting my terminology wronghere but I'm talking about a single SFF-8088 link to each 16 drive enclosure. What about two controllers, one for eachenclosure? 
Don't know if I have enough empty slots though.

>> All of the data could be reloaded. Basically, once I get the data into the database and I'm done manipulating it I create a backup copy/dump which then gets stored at a couple different locations.
>
> You might consider turning fsync off while you're doing these massive data loads.. and make sure that you issue your 'CREATE TABLE' and your 'COPY' statements in the same transaction, and again, I suggest loading into temporary (CREATE TEMPORARY TABLE) tables first, then doing the CREATE TABLE/INSERT statement for the 'real' table. Make sure that you create *both* your constraints *and* your indexes *after* the table is populated. If you turn fsync off, make sure you turn it back on. :)

I haven't messed with fsync but maybe I'll try. In general, I create my indexes and constraints after I'm done doing all the updating I need to do. I made the mistake *once* of copying millions of rows into a table that already had indexes.

>> My goal is to 1) have a fairly robust system so that I don't have to spend my time rebuilding things and 2) be able to query the data quickly. Most of what I do are ad hoc queries. I have an idea... "how many X have Y in this set of Z samples" and write the query to get the answer. I can wait a couple minutes to get an answer but waiting an hour is becoming tiresome.
>
> Have you done any analysis to see what the bottleneck actually is? When you run top, is your PG process constantly in 'D' state, or is it in 'R' state, or what? Might help figure some of that out.
>
> Note that parallelizing the query will help regardless of if it's disk bound or CPU bound, when you're running on the kind of hardware you're talking about (lots of spindles, multiple CPUs, etc).
>
> Thanks, Stephen

It got lost from the original post but my database (9.0.0) is currently on my Windows XP 64-bit workstation in my office on a 16 drive Seagate 15k.5 RAID5, no comments needed, I know, I'm moving it :-). I'm moving it to my server which is Windows Ent Server 2008 R2 64-bit 8 AMD cores & 32G ram and these new drives/controller. So no top or lvm although I do keep an eye on things with Process Explorer. Also, I don't have any single query that is a problem. I have my canned queries which I run manually to update/manipulate/move data around every couple weeks when I get a new chunk of data. Other than that my queries are all ad hoc. I'm just trying to get opinions on the best way to set up these drives/controllers/enclosures for basically large sequential reads that quite often use indexes.

So far I'd summarize the consensus as:

1) Putting WAL on a separate array is worthless since I do very little writes. What about if I put my temp tablespace on the same array with WAL & xlog? I've noticed a lot of the ad hoc queries I run create tmp files, sometimes tens of GB. I appreciate the fact that managing multiple tablespaces is not as easy as managing one but if it helps...

2) Indexes on a separate array may not be all that useful since I'm not doing simultaneous reads/writes.

3) Since I can very easily recreate the database in case of crash/corruption, RAID10 may not be the best option. However, if I do go with RAID10, split the drives between the two enclosures (this assumes data & index arrays). I've thought about RAID0 but quite frankly I really don't like having to rebuild things. At some point my time becomes valuable. RAID6 was suggested but rebuilding a 9TB RAID6 seems scary slow to me.

I appreciate the comments thus far.
Bob
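Stephen's load recipe above can be sketched roughly as follows. The table, column, and file names here are invented for illustration, and fsync must be re-enabled once the load finishes:

```sql
-- Hypothetical sketch of the bulk-load recipe: stage into a temporary
-- table, create and fill the real table in the same transaction, and
-- build indexes/constraints only at the end.  (Set fsync = off in
-- postgresql.conf for the duration of the load, then turn it back on.)
BEGIN;

CREATE TEMPORARY TABLE snp_stage (
    sample_id   bigint,
    chromosome  varchar(13),
    pos         integer,
    ref_count   integer,
    total_count integer
) ON COMMIT DROP;

COPY snp_stage FROM 'C:/loads/new_chunk.csv' WITH CSV;

-- The real table is created in the same transaction that fills it.
CREATE TABLE snp_chunk AS
SELECT sample_id, chromosome, pos, ref_count, total_count
FROM   snp_stage;

COMMIT;

-- Indexes and constraints only after the data is in place.
CREATE INDEX snp_chunk_pos_idx ON snp_chunk (chromosome, pos);
ALTER TABLE snp_chunk ADD PRIMARY KEY (sample_id, chromosome, pos);
```

The point of creating and filling the real table inside one transaction is that PostgreSQL can skip most WAL for a table that did not exist before the transaction began.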
On Fri, Jan 28, 2011 at 9:39 AM, Robert Schnabel <schnabelr@missouri.edu> wrote:
> I can't do outside the database. So yes, once the upload is done I run
> queries that update every row for certain columns, not every column. After
> I'm done with a table I run a VACUUM ANALYZE. I'm really not worried about
> what my table looks like on disk. I actually take other steps also to avoid
> what you're talking about.

It will still get bloated. If you update one column in one row in pg, you now have two copies of that row in the database. If you update 1 column in 1M rows, you now have 2M rows in the database (1M "dead" rows, 1M "live" rows). Vacuum analyze will not get rid of them, but will free them up to be used in future updates / inserts. Vacuum full or cluster will free up the space, but will lock the table while it does so.

There's nothing wrong with whole table updates as part of an import process, you just have to know to "clean up" after you're done, and regular vacuum can't fix this issue, only vacuum full or reindex or cluster.
On 1/28/2011 11:00 AM, Scott Marlowe wrote:
> On Fri, Jan 28, 2011 at 9:39 AM, Robert Schnabel <schnabelr@missouri.edu> wrote:
>> I can't do outside the database. So yes, once the upload is done I run
>> queries that update every row for certain columns, not every column. After
>> I'm done with a table I run a VACUUM ANALYZE. I'm really not worried about
>> what my table looks like on disk. I actually take other steps also to avoid
>> what you're talking about.
>
> It will still get bloated. If you update one column in one row in pg,
> you now have two copies of that row in the database. If you update 1
> column in 1M rows, you now have 2M rows in the database (1M "dead"
> rows, 1M "live" rows). Vacuum analyze will not get rid of them, but
> will free them up to be used in future updates / inserts. Vacuum full
> or cluster will free up the space, but will lock the table while it
> does so.
>
> There's nothing wrong with whole table updates as part of an import
> process, you just have to know to "clean up" after you're done, and
> regular vacuum can't fix this issue, only vacuum full or reindex or
> cluster.

Those are exactly what I was referring to with my "other steps". I just don't always do them as soon as I'm done updating because sometimes I want to query the table right away to find out something. Yep, I found out the hard way that regular VACUUM didn't help.
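The cleanup Scott describes could look like this in practice (table and index names are hypothetical):

```sql
-- A full-table UPDATE leaves one dead copy of every row behind; plain
-- VACUUM only marks that space reusable within the table.
UPDATE genotypes
SET    freq = ref_count::real / NULLIF(total_count, 0);

-- To actually shrink the table on disk, rewrite it.  Either:
VACUUM FULL genotypes;       -- rewrites the table; locks it while running

-- ...or rewrite it in the physical order of the most-used index,
-- which also discards the dead rows:
CLUSTER genotypes USING genotypes_pos_idx;

-- Refresh planner statistics afterwards.
ANALYZE genotypes;
```

Both rewrites take an exclusive lock for their duration, so they belong at the end of the load process rather than between ad hoc queries.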
Robert, * Robert Schnabel (schnabelr@missouri.edu) wrote: > Depends on what you mean by that. The tables that I'm concerned with look > something like bigint x2, char var x13, int x24, real x8, smallint x4 by > about 65M rows, each. I only do the updates on one table at a time. The > real columns are actually null in the input csv file. I run an update > which basically uses some of the integer columns and calculates > frequencies which go into the real columns. Erm, I'm pretty sure you're still increasing the size of the resulting tables by quite a bit by doing this process- which will slow down later queries. > Ditto with some of the other > columns. I don't do this before I upload the data because 1) it's easier > this way and 2) I can't because some of the updates involve joins to other > tables to grab info that I can't do outside the database. That's fine- just first load the data into temporary tables and then do INSERT INTO new_table SELECT <your query>; instead. > So yes, once > the upload is done I run queries that update every row for certain > columns, not every column. After I'm done with a table I run a VACUUM > ANALYZE. I'm really not worried about what my table looks like on disk. I thought you wanted it fast..? If not, I'm not sure why you're bothering to post to this list. What it looks like on disk certainly impacts how fast it is... > I actually take other steps also to avoid what you're talking about. If you really don't feel like changing your process, you could just run 'CLUSTER' on the table, on whatever index you use most frequently, and PG will rewrite the entire table for you, dropping all the dead rows, etc. You should then run VACUUM FREEZE on it. > These columns then get indexed. Basically once the initial > manipulation is done the table is then static and what I'm looking > for is query speed. Yes, I gathered that, making the table smaller on disk will improve query speed. 
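Stephen's INSERT-instead-of-UPDATE suggestion might be sketched like this; the staging table, join table, and column names are all assumptions for illustration:

```sql
-- Build the final table with the derived columns computed on the way in,
-- so no full-table UPDATE (and no dead-row bloat) is needed afterwards.
CREATE TABLE snp_final AS
SELECT s.sample_id,
       s.chromosome,
       s.pos,
       s.ref_count,
       s.total_count,
       s.ref_count::real / NULLIF(s.total_count, 0) AS freq,
       a.breed
FROM   snp_stage s
JOIN   animals a ON a.sample_id = s.sample_id;

-- Then index, cluster on the most frequently used index, and freeze:
CREATE INDEX snp_final_pos_idx ON snp_final (chromosome, pos);
CLUSTER snp_final USING snp_final_pos_idx;
VACUUM FREEZE snp_final;
```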
> Thanks for the advice but parallelizing/automating doesn't really do
> anything for me. The data is already partitioned. Think of it this way,
> you just got 65M new records with about 30 data points per record on an
> individual sample. You put it in a new table of its own and now you want
> to characterize those 65M data points. The first update flags about 60M
> of the rows as uninteresting so you move them to their own *uninteresting*
> table and basically never really touch them again (but you can't get rid of
> them). Now you're working with 5M that you're going to characterize into
> about 20 categories based on what is in those 30 columns of data. Do all
> the querying/updating then index and you're done. Too long to describe
> but I cannot automate this. I only update one partition at a time and
> only about every couple weeks or so.

I was referring to parallelizing queries *after* the data is all loaded, etc. I wasn't talking about the queries that you use during the load. I presume that after the load you run some queries. You can probably parallelize those queries (most DW queries can be, in my experience...).

> That link certainly isn't going to help things.. You might consider how
> or if you can improve that.
>
> Suggestions??? It was previously suggested to split the drives on each
> array across the two controller ports rather than have all the data drives
> on one port which makes sense. Maybe I'm getting my terminology wrong
> here but I'm talking about a single SFF-8088 link to each 16 drive
> enclosure. What about two controllers, one for each enclosure? Don't
> know if I have enough empty slots though.

I don't know that you'd need a second controller (though it probably wouldn't hurt if you could). If there's only one way to attach the enclosure, then so be it. The issue is if the enclosures end up multi-plexing the individual drives into fewer channels than there are actual drives, hence creating a bottle-neck.
You would need different enclosures to deal with that, if that's the case. > I haven't messed with fsync but maybe I'll try. In general, I create my > indexes and constraints after I'm done doing all the updating I need to > do. I made the mistake *once* of copying millions of rows into a table > that already had indexes. Yeah, I bet that took a while. As I said above, if you don't want to change your process (which, tbh, I think would be faster if you were doing INSERTs into a new table than full-table UPDATEs...), then you should do a CLUSTER after you've created whatever is the most popular INDEX, and then create your other indexes after that. > It got lost from the original post but my database (9.0.0) is currently on > my Windows XP 64-bit workstation in my office on a 16 drive Seagate 15k.5 > RAID5, no comments needed, I know, I'm moving it :-). I'm moving it to my > server which is Windows Ent Server 2008 R2 64-bit 8 AMD cores & 32G ram > and these new drives/controller. Ughh... No chance to get a Unix-based system (Linux, BSD, whatever) on there instead? I really don't think Windows Server is going to help your situation one bit.. :( > 1) putting WAL on a separate array is worthless since I do very little > writes. What about if I put my temp tablespace on the same array with WAL > & xlog? I've noticed a lot of the ad hoc queries I run create tmp files, > sometimes tens of GB. I appreciate the fact that managing multiple > tablespaces is not as easy as managing one but if it helps... That's not a bad idea but I'm not sure it'd make as much difference as you think it would.. What would be better would be to *avoid*, at all cost, letting it spill out to on-disk for queries. The way to do that is to make sure your work_mem is as high as PG will actually use (1GB), and then to *parallelize* those queries using multiple PG connections, so that each one will be able to use up that much memory. 
For example, say you need to summarize the values for each of your strands (or whatever) across 5 different "loads". Your query might look like: select load,strand,sum(value) from parent_table group by load,strand; Ideally, PG will use a hash table, key'd on load+strand, to store the resulting summations in. If it doesn't think the hash table will fit in work_mem, it's going to SORT ALL OF YOUR DATA ON DISK first instead, and then WALK THROUGH IT, sum'ing each section, then spitting out the result to the client, and moving on. This is *not* a fast process. If doing the same query on an individual child will use a hash table, then it'd be hugely faster to query each load first, storing the results into temporary tables. What would be even *faster* would be the run all 5 of those queries against the child tables in parallel (given that you have over 5 CPUs and enough memory that you don't start swapping). If it's still too big on the per-child basis, you might be able to use conditionals to do the first 100 strands, then the next hundred, etc. > I appreciate the comments thus far. Let's hope you'll always appreciate them. :) Thanks, Stephen
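A hedged sketch of that per-child approach (the child table and column names are assumptions): each of five parallel sessions aggregates one load into a small result table, and one final query combines them.

```sql
-- In each of five parallel sessions (one per load), with enough work_mem
-- that the per-load hash aggregate stays in memory:
SET work_mem = '1GB';

-- Session 1 (sessions 2-5 do the same for load2_child ... load5_child).
-- Ordinary tables are used because temporary tables would be invisible
-- to the session that combines the results.
CREATE TABLE load1_sum AS
SELECT load, strand, sum(value) AS value
FROM   load1_child
GROUP  BY load, strand;

-- Once all five finish, combining the small per-load summaries is cheap:
SELECT load, strand, sum(value)
FROM   (SELECT * FROM load1_sum UNION ALL
        SELECT * FROM load2_sum UNION ALL
        SELECT * FROM load3_sum UNION ALL
        SELECT * FROM load4_sum UNION ALL
        SELECT * FROM load5_sum) s
GROUP  BY load, strand;
```

Each connection gets its own work_mem allowance, which is what makes the parallel sessions able to keep their hash tables in memory where a single big query would spill to disk.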
* Scott Marlowe (scott.marlowe@gmail.com) wrote: > There's nothing wrong with whole table updates as part of an import > process, you just have to know to "clean up" after you're done, and > regular vacuum can't fix this issue, only vacuum full or reindex or > cluster. Just to share my experiences- I've found that creating a new table and inserting into it is actually faster than doing full-table updates, if that's an option for you. Thanks, Stephen
On January 27, 2011, Robert Schnabel <schnabelr@missouri.edu> wrote:
> So my questions are 1) am I'm crazy for doing this, 2) would you change
> anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp
> filespace) on a different controller than everything else? Please keep
> in mind I'm a geneticist who happens to know a little bit about
> bioinformatics and not the reverse. :-)
>
Putting the WAL on a second controller does help, if you're write-heavy.
I tried separating indexes and data once on one server and didn't really notice that it helped much. Managing the space was problematic. I would suggest putting those together on a single RAID-10 of all the 300GB drives (minus a spare). It will probably outperform separate arrays most of the time, and be much easier to manage.
--
A hybrid Escalade is missing the point much in the same way that having a diet soda with your extra large pepperoni pizza is missing the point.
On 1/28/11 9:00 AM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
> On Fri, Jan 28, 2011 at 9:39 AM, Robert Schnabel <schnabelr@missouri.edu> wrote:
>> I can't do outside the database. So yes, once the upload is done I run
>> queries that update every row for certain columns, not every column. After
>> I'm done with a table I run a VACUUM ANALYZE. I'm really not worried about
>> what my table looks like on disk. I actually take other steps also to avoid
>> what you're talking about.
>
> It will still get bloated. If you update one column in one row in pg,
> you now have two copies of that row in the database. If you update 1
> column in 1M rows, you now have 2M rows in the database (1M "dead"
> rows, 1M "live" rows). Vacuum analyze will not get rid of them, but
> will free them up to be used in future updates / inserts. Vacuum full
> or cluster will free up the space, but will lock the table while it
> does so.
>
> There's nothing wrong with whole table updates as part of an import
> process, you just have to know to "clean up" after you're done, and
> regular vacuum can't fix this issue, only vacuum full or reindex or
> cluster.

Also note that HOT will come into play if you have FILLFACTOR set appropriately, so you won't get two copies of the row. This is true if the column being updated is small enough and not indexed. It wastes some space, but a lot less than the factor of two.
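For reference, the FILLFACTOR setting Scott mentions is declared when the table is created (the table here is hypothetical):

```sql
-- Leaving free space in every heap page lets an UPDATE place the new row
-- version on the same page (a HOT update), as long as the updated column
-- is not indexed.
CREATE TABLE genotypes (
    sample_id   bigint,
    pos         integer,
    ref_count   integer,
    total_count integer,
    freq        real        -- filled in later; deliberately not indexed
) WITH (fillfactor = 70);   -- keep 30% of each page free for HOT updates
```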
On 1/28/11 9:28 AM, "Stephen Frost" <sfrost@snowman.net> wrote: >* Scott Marlowe (scott.marlowe@gmail.com) wrote: >> There's nothing wrong with whole table updates as part of an import >> process, you just have to know to "clean up" after you're done, and >> regular vacuum can't fix this issue, only vacuum full or reindex or >> cluster. > >Just to share my experiences- I've found that creating a new table and >inserting into it is actually faster than doing full-table updates, if >that's an option for you. I wonder if postgres could automatically optimize that, if it thought that it was going to update more than X% of a table, and HOT was not going to help, then just create a new table file for XID's = or higher than the one making the change, and leave the old one for old XIDs, then regular VACUUM could toss out the old one if no more transactions could see it. > > Thanks, > > Stephen
On Fri, Jan 28, 2011 at 10:44 AM, Scott Carey <scott@richrelevance.com> wrote:
> If you go this route, I suggest two equally sized RAID 10's on different
> controllers for index + data, with software raid-0 on top of that. RAID 10
> will max out a controller after 6 to 10 drives, usually. Using the OS RAID
> 0 to aggregate the throughput of two controllers works great.

I often go one step further and just create a bunch of RAID-1 pairs and use OS level RAID-0 on top of that. On the LSI8888 cards that was by far the fastest setup I tested.
On 1/28/2011 11:14 AM, Stephen Frost wrote:
>> It got lost from the original post but my database (9.0.0) is currently on
>> my Windows XP 64-bit workstation in my office on a 16 drive Seagate 15k.5
>> RAID5, no comments needed, I know, I'm moving it :-). I'm moving it to my
>> server which is Windows Ent Server 2008 R2 64-bit 8 AMD cores & 32G ram
>> and these new drives/controller.
> Ughh... No chance to get a Unix-based system (Linux, BSD, whatever) on
> there instead? I really don't think Windows Server is going to help
> your situation one bit.. :(

Almost zero chance. I basically admin the server myself so I can do whatever I want but all permissions are controlled through campus active directory and our departmental IT person doesn't do *nix. So let's just assume I'm stuck with Windows. The main purpose of the server at the moment is to house our backup images. I have two 9 TB arrays which I use robocopy to mirror images once a day between our other server and my workstation. There's really not much of anything else ever eating up CPUs on the server which is why I'm moving my database onto it.

>> I appreciate the comments thus far.
> Let's hope you'll always appreciate them. :)
>
> Thanks,
>
> Stephen

Umm, that didn't quite read the way I meant it to when I wrote it. All comments are appreciated. :-) Seriously though, there have been points made that have made me rethink how I go about processing data which I'm sure will help. I'm in a fairly fortunate position in that I can put these new drives on the server and play around with different configurations while I maintain my current setup on my workstation. I guess I just need to experiment and see what works.

Thanks again,
Bob
On 1/28/2011 11:44 AM, Scott Carey wrote:
> On 1/27/11 4:11 PM, "Alan Hodgson" <ahodgson@simkin.ca> wrote:
>> On January 27, 2011, Robert Schnabel <schnabelr@missouri.edu> wrote:
>>> So my questions are 1) am I'm crazy for doing this, 2) would you change
>>> anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp
>>> filespace) on a different controller than everything else? Please keep
>>> in mind I'm a geneticist who happens to know a little bit about
>>> bioinformatics and not the reverse. :-)
>>
>> Putting the WAL on a second controller does help, if you're write-heavy.
>> I tried separating indexes and data once on one server and didn't really notice that it helped much. Managing the space was problematic. I would suggest putting those together on a single RAID-10 of all the 300GB drives (minus a spare). It will probably outperform separate arrays most of the time, and be much easier to manage.
>
> If you go this route, I suggest two equally sized RAID 10's on different controllers for index + data, with software raid-0 on top of that. RAID 10 will max out a controller after 6 to 10 drives, usually. Using the OS RAID 0 to aggregate the throughput of two controllers works great.
>
> WAL only has to be a little bit faster than your network in most cases. I've never seen it be a bottleneck on large bulk loads if it is on its own controller with 120MB/sec write throughput. I suppose a bulk load from COPY might stress it a bit more, but CPU ends up the bottleneck in postgres once you have I/O hardware this capable.

Do you mean 14 drives in one box as RAID10's on one controller, then 14 drives in the other box on a second controller, then software RAID0 each of the two RAID10's together essentially as a single 4 TB array? Would you still recommend doing this with Windows?
Bob
2011/1/30 Віталій Тимчишин <tivv00@gmail.com>:
> I was thinking if a table file could be deleted if it has no single live
> row. And if this could be done by vacuum. In this case vacuum on table that
> was fully updated recently could be almost as good as cluster - any scan
> would skip such non-existing files really fast. Also almost no disk space
> would be wasted.

VACUUM actually already does something along these lines. If there are 1 or any larger number of entirely-free pages at the end of a table, VACUUM will truncate them away. In the degenerate case where ALL pages are entirely-free, this results in zeroing out the file.

The problem with this is that it rarely does much. Consider a table with 1,000,000 pages, 50% of which contain live rows. On average, how many pages will this algorithm truncate away? Answer: if the pages containing live rows are randomly distributed, approximately one. (Proof: There is a 50% chance that the last page will contain live rows. If so, we can't truncate anything. If not, we can truncate one page, and maybe more. Now the chances of the next page being free are 499,999 in 999,999, or roughly one-half. So we have an almost-25% chance of being able to truncate at least two pages. And so on. So you get roughly 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + ... = 1 page.)

Your idea of having a set of heaps rather than a single heap is an interesting one, but it's pretty much catering to the very specific case of a full-table update. I think the code changes needed would be far too invasive to seriously contemplate doing it just for that one case - although it is an important case that I would like to see us improve. Tom Lane previously objected to the idea of on-line table compaction on the grounds that people's apps might break if CTIDs changed under them, but I think a brawl between all the people who want on-line table compaction and all the people who want to avoid unexpected CTID changes would be pretty short.
A bigger problem - or at least another problem - is that moving tuples this way is cumbersome and expensive. You basically have to move some tuples (inserting new index entries for them), vacuum away the old index entries (requiring a full scan of every index), and then repeat as many times as necessary to shrink the table. This is not exactly a smooth maintenance procedure, or one that can be done without significant disruption, but AFAIK nobody's come up with a better idea yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
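The tail-truncation behaviour Robert describes can be seen directly with a small experiment (assuming a freshly created table, so rows land on pages in insertion order):

```sql
-- Build a table whose rows are laid out in id order.
CREATE TABLE trunc_demo AS
SELECT g AS id FROM generate_series(1, 1000000) g;

SELECT pg_relation_size('trunc_demo');      -- initial size

-- Free the *last* half of the table: those entirely-free pages sit at
-- the end of the file, so VACUUM can truncate them away...
DELETE FROM trunc_demo WHERE id > 500000;
VACUUM trunc_demo;
SELECT pg_relation_size('trunc_demo');      -- roughly half the size

-- ...whereas deleting every other row (WHERE id % 2 = 0) would leave a
-- live row on every page, and VACUUM could not shrink the file at all.
```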
The Oracle equivalent of "too many dead rows" is "too many chained rows" and that's where I've seen it used.
Cheers
Dave
On 03.02.11 20:42, Robert Haas wrote:
> 2011/1/30 Віталій Тимчишин <tivv00@gmail.com>:
>> I was thinking if a table file could be deleted if it has no single live
>> row. And if this could be done by vacuum. In this case vacuum on table that
>> was fully updated recently could be almost as good as cluster - any scan
>> would skip such non-existing files really fast. Also almost no disk space
>> would be wasted.
>
> VACUUM actually already does something along these lines. If there
> are 1 or any larger number of entirely-free pages at the end of a
> table, VACUUM will truncate them away. In the degenerate case where
> ALL pages are entirely-free, this results in zeroing out the file.
>
> The problem with this is that it rarely does much. Consider a table
> with 1,000,000 pages, 50% of which contain live rows. On average, how
> many pages will this algorithm truncate away? Answer: if the pages
> containing live rows are randomly distributed, approximately one.

Yes, but take into account operations on tables that are clustered (for different reasons), like removing archived data (yes I know, this is best done with partitioning, but one must still get to the point of deciding to use partitioning :) ).

> Your idea of having a set of heaps rather than a single heap is an
> interesting one, but it's pretty much catering to the very specific
> case of a full-table update. I think the code changes needed would be
> far too invasive to seriously contemplate doing it just for that one
> case - although it is an important case that I would like to see us
> improve.

Why do you expect such invasive code changes? I know little about the postgresql code layering, but what I propose (with changing delete to truncate) is:
1) Leave tuple addressing as it is now
2) Allow truncated files, treating the non-existing part as if it contained unused tuples
3) Make vacuum truncate a file if it has unused tuples at the end.
The only (relatively) tricky thing I can see is synchronizing truncation with parallel ongoing scan. Best regards, Vitalii Tymchyshyn
On Fri, Feb 4, 2011 at 4:19 AM, Vitalii Tymchyshyn <tivv00@gmail.com> wrote:
> Why do you expect such invasive code changes? I know little about
> postgresql code layering, but what I propose (with changing delete to
> truncate) is:
> 1) Leave tuple addressing as it is now

i.e. a block number and a slot position within the block? Seems like you'd need <file,block,slot>.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company