Thread: Importing Large Amounts of Data
Appended is a message that I sent to the pgsql-general list, and for which I received no useful reply. (Well, anyway, no reply that has helped me to speed up my imports.) If you've read it already, feel free to ignore it, but if you haven't, I'd appreciate any advice on how to make this work as fast as possible.

There are some things that could be done to help optimize situations like this, though I don't think any can be done in the short term. Here are some of my thoughts, which may or may not be useful:

1. Postgres appears to have a fairly high row overhead (40 bytes or so according to the FAQ), which grieves me slightly, as that's actually larger than the size of the data in my tuples. It would seem that in my case some of the items in that header (the OID and the NULL bitfield) are not used; would it be possible to avoid allocating these in relations that don't use them? Also, is there a good description of the on-disk tuple format somewhere? include/access/htup.h seems to require a fair bit of knowledge about how other parts of the system work to be understood.

2. Does it make sense to be able to do some operations without logging? For the COPY, if the system crashes and I lose some or all of the tuples I'd imported so far, I don't care that much; I can just restart the COPY at an appropriate point. As mentioned below, that would save half a gig of disk writes when importing 5M tuples.

3. How about having a way to take a table off-line to work on it, and bring it back on-line again when done? This would get rid of the logging overhead, locking overhead, and that sort of thing, and in theory might get you something approaching disk-speed data imports.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC

---------- Forwarded message ----------
Date: Thu, 11 Apr 2002 17:28:13 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: pgsql-general@postgresql.org
Subject: [GENERAL] Importing Large Amounts of Data

I've been asked by a client to do some testing of Postgres for what appears to be OLAP on a fairly large data set (about half a billion tuples). I'm probably going to want to try partitioning this in various ways, but the application, not Postgres, will deal with that.

I'm using PostgreSQL 7.2.1, and the schema I'm testing with is as follows:

    CREATE TABLE bigone (
        rec_no   INT PRIMARY KEY,
        day      DATE NOT NULL,
        user_id  CHAR(5) NOT NULL,
        value    VARCHAR(20) NOT NULL
    ) WITHOUT OIDS;

    DROP INDEX bigone_pkey;

    [COPY is done here....]

    CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
    CREATE INDEX bigone_day ON bigone (day);
    CREATE INDEX bigone_user_id ON bigone (user_id);

Unfortunately, the first problem I've run into is that importing is rather slow. With all indexes (including bigone_pkey) dropped, importing five million tuples into the above table, starting from empty, takes about 921 seconds. The second 5M tuples take about 1009 seconds. If I switch to using the -F option, the first 5M take 714 seconds and the second 5M take 742 seconds. At the end, I have about 742 MB of data under the data/base directory. (This is using a fresh database cluster.)

For comparison, MySQL does each import in about 221 and 304 seconds, and the data in the end take up about 427 MB.

Part of the problem here may be that Postgres appears to be logging the COPY operation; I get 27-33 "recycled transaction log file" messages for every 5M tuple COPY that I do.
If there were a way to do an unlogged copy, that might save close to half a gig of writes to the disk.

The other part of the problem may just be the size of the data; why does Postgres take up 75% more space (about 78 bytes per tuple, vs. 45 bytes per tuple) for this table? As well, index builds seem to take about 20% longer (using -F), and they seem to be about 10% larger as well.

Does anybody have any suggestions as to how I can improve performance here, and reduce disk space requirements? If not, I'll probably have to suggest to the client that he move to MySQL for this particular application, unless he needs any of the features that Postgres provides and MySQL doesn't.
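(For concreteness, the "[COPY is done here....]" step above is a single bulk COPY per 5M-row batch; the file path and the default tab-delimited format shown here are placeholders, not the actual ones:)

    -- server-side copy, run as a database superuser
    COPY bigone FROM '/data/import/bigone_batch1.dat';

    -- or, from psql as an ordinary user, streaming the file from the client
    \copy bigone from '/data/import/bigone_batch1.dat'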
> 1. Postgres appears to have a fairly high row overhead (40 bytes or
> so according to the FAQ), which grieves me slightly, as that's
> actually larger than the size of the data in my tuples. It would
> seem that in my case some of the items in that header (the OID and
> the NULL bitfield) are not used; would it be possible to avoid
> allocating these in relations that don't use them?

CREATE TABLE WITHOUT OIDS ...

> As well, index builds seem to take about 20% longer (using -F), and
> they seem to be about 10% larger as well.
>
> Does anybody have any suggestions as to how I can improve performance
> here, and reduce disk space requirements? If not, I'll probably have
> to suggest to the client that he move to MySQL for this particular
> application, unless he needs any of the features that Postgres provides
> and MySQL doesn't.

This conclusion seems to me to be remarkably shortsighted. Does the initial data load into the database occur just once or quite often? If just once, then the initial loading time doesn't matter.

It's a bit hard to say "just turn off all the things that ensure your data integrity so it runs a bit faster" if you actually need data integrity.

Anyway, from what I understand an OLTP application is all about selects and memoising certain aggregate results. Since Postgres has far more advanced indexing and trigger support than MySQL, surely you need to take this kind of difference into account? The fact that you can load stuff quicker in MySQL and it takes up less disk space seems totally irrelevant. Just wait until your MySQL server crashes and your client finds that half his data is corrupted...

Chris
On Mon, 15 Apr 2002, Christopher Kings-Lynne wrote:

> > ...the OID and the NULL bitfield) are not used; would it be possible
> > to avoid allocating these in relations that don't use them?
>
> CREATE TABLE WITHOUT OIDS ...

As you can see from the schema I gave later in my message, that's exactly what I did. But does this actually avoid allocating the space in the on-disk tuples? What part of the code deals with this? It looks to me like the four bytes for the OID are still allocated in the tuple, but not used.

> This conclusion seems to me to be remarkably shortsighted. Does the
> initial data load into the database occur just once or quite often?

Well, I'm going to be doing the initial load (.5 billion tuples) quite a few times, in order to test some different partitioning arrangements. So I'll save quite a lot of time initially if I get a faster import. But from the looks of it, the production system will be doing daily imports of fresh data ranging in size from a couple of million rows to a couple of tens of millions of rows.

> It's a bit hard to say "just turn off all the things that ensure your
> data integrity so it runs a bit faster" if you actually need data
> integrity.

I'm not looking for "runs a bit faster"; five percent either way makes little difference to me. I'm looking for a five-fold performance increase.

> Anyway, from what I understand an OLTP application is all about selects
> and memoising certain aggregate results.

I guess that was a typo, and you meant OLAP? Anyway, from the looks of it, this is going to be fairly simple stuff. (Unfortunately, I don't have details of the real application the client has in mind, though I sure wish I did.) What I'm trying to indicate when I say "OLAP" is that it's basically selecting across broad swaths of a large data set, and doing little or nothing in the way of updates. (Except for the daily batches of data, of course.)

> The fact that you can load stuff quicker in MySQL and it takes up less
> disk space seems totally irrelevant.

Yeah, everybody's telling me this. Let me try once again here:

1. Every day, I must import millions, possibly tens of millions, of rows of data. Thus, speed of import is indeed fairly important to me.

2. It looks, at least at this point, as if the application will be doing only fairly simple selects out of the current half-billion rows of data and whatever gets added in the future. Thus, I don't think that using MySQL would be a problem. (If I did, I wouldn't be proposing it.)

I don't want to start a flamewar here, because personally I don't even like MySQL and would prefer always to use PostgreSQL. But it makes it a lot harder to do so when people keep insisting that import speed is not important. Rather than say that, why don't we just admit that PostgreSQL is a fairly crap performer in this regard at the moment (at least the way I'm doing it), and work out ways to fix this?

> Just wait until your MySQL server crashes and your client finds that
> half his data is corrupted...

If there are no updates, why would anything be corrupted? At any rate, I can always restore from backup, since little or nothing would be lost.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
> As you can see from the schema I gave later in my message, that's
> exactly what I did. But does this actually avoid allocating the
> space in the on-disk tuples? What part of the code deals with this?
> It looks to me like the four bytes for the OID are still allocated
> in the tuple, but not used.

OK, well I guess in that case they are -- I'm no expert on the file format.

> But from the looks of it, the production system will be doing daily
> imports of fresh data ranging in size from a couple of million rows
> to a couple of tens of millions of rows.

Well, that definitely makes a difference then...

> > It's a bit hard to say "just turn off all the things that ensure your
> > data integrity so it runs a bit faster" if you actually need data
> > integrity.
>
> I'm not looking for "runs a bit faster"; five percent either way
> makes little difference to me. I'm looking for a five-fold performance
> increase.

> Anyway, from the looks of it, this is going to be fairly simple
> stuff. (Unfortunately, I don't have details of the real application
> the client has in mind, though I sure wish I did.) What I'm trying
> to indicate when I say "OLAP" is that it's basically selecting
> across broad swaths of a large data set, and doing little or nothing
> in the way of updates. (Except for the daily batches of data, of
> course.)

OK, well now it depends on what kind of selects you're doing. Do you regularly select over a certain subset of the data? In that case, partial indices might give you a significant speedup. Do you select functions of columns? If so, then you'll need functional indices. MySQL doesn't have either of these. However, if you're always doing full table scans, then MySQL will probably do these faster.

Now, here's another scenario. Suppose you're often querying aggregate data over particular subsets of the data. Instead of requerying all the time, you can set up triggers to maintain your aggregates for you on the fly. This gives O(1) performance on select, compared to O(n). MySQL's new query cache might help you with this, however.

> I don't want to start a flamewar here, because personally I don't
> even like MySQL and would prefer always to use PostgreSQL. But it
> makes it a lot harder to do so when people keep insisting that
> import speed is not important. Rather than say that, why don't we
> just admit that PostgreSQL is a fairly crap performer in this regard
> at the moment (at least the way I'm doing it), and work out ways
> to fix this?

It depends on your definition. You have to accept a certain overhead if you're to have data integrity and MVCC. If you can't handle that overhead, then you can't have data integrity, and vice versa.

BTW, instead of:

CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);

do:

ALTER TABLE bigone ADD PRIMARY KEY (rec_no);

And remember to run "VACUUM ANALYZE bigone;" or just "ANALYZE bigone;" after the COPY and before trying to use the table. I'm not sure if it's better to analyze before or after the indexes are added, but it's definitely better to vacuum before the indexes are added.

Chris
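(To illustrate the partial and functional indices mentioned above, here is a rough sketch against the bigone schema from earlier; the date cutoff and the choice of lower() are just examples, not recommendations for this data set:)

    -- partial index: only index the rows a common query subset actually touches
    CREATE INDEX bigone_day_recent ON bigone (day)
        WHERE day >= '2002-01-01';

    -- functional index: useful when queries filter on an expression of a column
    CREATE INDEX bigone_value_lower ON bigone (lower(value));

    -- and the primary-key / statistics sequence suggested above, after the COPY:
    ALTER TABLE bigone ADD PRIMARY KEY (rec_no);
    VACUUM ANALYZE bigone;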
On Mon, 15 Apr 2002, Christopher Kings-Lynne wrote:

> OK, well now it depends on what kind of selects you're doing. Do you
> regularly select over a certain subset of the data? In that case,
> partial indices might give you a significant speedup.

I believe from the information I've been given that we will indeed be regularly selecting over certain subsets, based on day. (One of the test queries I've been asked to use selects based on user_id and a date range.) But I was intending to partition the tables based on date range (to keep the index rebuild time from getting completely out of hand), so that will handily take care of that requirement anyway.

> Do you select functions of columns?

No.

> It depends on your definition. You have to accept a certain overhead if
> you're to have data integrity and MVCC. If you can't handle that
> overhead, then you can't have data integrity, and vice versa.

Well, a few points:

a) I am not convinced that data integrity should cost a five-fold decrease in performance.

b) In fact, at times I don't need that data integrity. I'm perfectly happy to risk the loss of a table during import, if it lets me do the import more quickly, especially if I'm taking the database off line to do the import anyway. MS SQL Server in fact allows me to specify relaxed integrity (with attendant risks) when doing a BULK INSERT; it would be cool if Postgres allowed that too.

> BTW, instead of:
>
> CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
>
> do:
>
> ALTER TABLE bigone ADD PRIMARY KEY (rec_no);
>
> And remember to run "VACUUM ANALYZE bigone;" or just "ANALYZE bigone;"
> after the COPY and before trying to use the table. I'm not sure if it's
> better to analyze before or after the indexes are added, but it's
> definitely better to vacuum before the indexes are added.

Thanks. This is the kind of useful information I'm looking for. I was doing a vacuum after, rather than before, generating the indices.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
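(The date-range partitioning described here is handled by the application, since 7.2 has no native partitioning support; one plausible shape for it, with made-up table names, would be one table per date range:)

    -- one table per month, created by the application; names are hypothetical
    CREATE TABLE bigone_200203 (
        rec_no   INT PRIMARY KEY,
        day      DATE NOT NULL,
        user_id  CHAR(5) NOT NULL,
        value    VARCHAR(20) NOT NULL
    ) WITHOUT OIDS;
    -- ...and likewise bigone_200204, bigone_200205, etc., with the application
    -- routing each query to the table(s) covering the requested date range, so
    -- each index rebuild only touches one partition.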
> b) In fact, at times I don't need that data integrity. I'm perfectly
> happy to risk the loss of a table during import, if it lets me do the
> import more quickly, especially if I'm taking the database off line
> to do the import anyway. MS SQL Server in fact allows me to specify
> relaxed integrity (with attendant risks) when doing a BULK INSERT; it
> would be cool if Postgres allowed that too.

Well, I guess a TODO item would be to allow COPY to use relaxed constraints. I don't know how this would go over with the core developers, though.

> Thanks. This is the kind of useful information I'm looking for. I
> was doing a vacuum after, rather than before, generating the indices.

That's because the indexes themselves are cleaned out with vacuum, as well as the tables.

Chris
On Monday 15 April 2002 03:53, Christopher Kings-Lynne wrote:

> BTW, instead of:
>
> CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
>
> do:
>
> ALTER TABLE bigone ADD PRIMARY KEY (rec_no);

I am sorry, could you please elaborate more on the difference?

--
Denis
> On Monday 15 April 2002 03:53, Christopher Kings-Lynne wrote:
> > BTW, instead of:
> >
> > CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
> >
> > do:
> >
> > ALTER TABLE bigone ADD PRIMARY KEY (rec_no);
>
> I am sorry, could you please elaborate more on the difference?

They have the same _effect_; it's just that the first syntax does not mark the index as the _primary_ index on the relation.

Chris
On Monday 15 April 2002 05:15, Christopher Kings-Lynne wrote:

> > On Monday 15 April 2002 03:53, Christopher Kings-Lynne wrote:
> > > BTW, instead of:
> > >
> > > CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
> > >
> > > do:
> > >
> > > ALTER TABLE bigone ADD PRIMARY KEY (rec_no);
> >
> > I am sorry, could you please elaborate more on the difference?
>
> They have the same _effect_; it's just that the first syntax does not
> mark the index as the _primary_ index on the relation.

Yes, I know. I mean, how does this affect performance? How can this change the planner's decisions? Does it have any effect other than a cosmetic one?

--
Denis
> Yes, I know. I mean, how does this affect performance? How can this
> change the planner's decisions? Does it have any effect other than a
> cosmetic one?

Only cosmetic. In the example he gave, he wanted a primary key, so I showed him how to make one properly.

Chris
Curt Sampson <cjs@cynic.net> writes:
> On Mon, 15 Apr 2002, Christopher Kings-Lynne wrote:
>> CREATE TABLE WITHOUT OIDS ...

> As you can see from the schema I gave later in my message, that's
> exactly what I did. But does this actually avoid allocating the
> space in the on-disk tuples? What part of the code deals with this?
> It looks to me like the four bytes for the OID are still allocated
> in the tuple, but not used.

Curt is correct: WITHOUT OIDS does not save any storage. Having two different formats for the on-disk tuple header seemed more pain than the feature was worth. Also, because of alignment considerations it would save no storage on machines where MAXALIGN is 8. (Possibly my thinking is colored somewhat by the fact that that's so on all my favorite platforms ;-).)

However, as for the NULL values bitmap: that's already compacted out when not used, and always has been AFAIK.

>> It's a bit hard to say "just turn off all the things that ensure your
>> data integrity so it runs a bit faster" if you actually need data
>> integrity.

> I'm not looking for "runs a bit faster"; five percent either way
> makes little difference to me. I'm looking for a five-fold performance
> increase.

You are not going to get it from this; where in the world did you get the notion that data integrity costs that much? When the WAL stuff was added in 7.1, we certainly did not see any five-fold slowdown. If anything, testing seemed to indicate that WAL sped things up. A lot would depend on your particular scenario of course.

Have you tried all the usual speedup hacks? Turn off fsync, if you really think you do not care about crash integrity; use COPY FROM STDIN to bulk-load data, not retail INSERTs; possibly drop and recreate indexes rather than updating them piecemeal; etc. You should also consider not declaring foreign keys, as the runtime checks for reference validity are pretty expensive.

        regards, tom lane
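(Concretely, the fsync knob mentioned here is, in 7.2, a postgresql.conf setting, and is equivalent to the -F option timed in the original message; the data directory path and the pg_ctl form below are from memory, so treat this as a sketch:)

    # postgresql.conf on the test cluster -- same effect as running the backends with -F
    fsync = false

    # alternatively, pass -F through to the backends when starting the postmaster, e.g.:
    #   pg_ctl -D /usr/local/pgsql/data -o '-o -F' start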
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes: >> Yes, I know. I mean how does this affect performance? How this can change >> planner decision? Does it have any effect except cosmetical one? > Only cosmetic. In the example he gave, he wanted a primary key, so I showed > him how to make one properly. The ALTER form will complain if any of the columns are not marked NOT NULL, so the difference isn't completely cosmetic. regards, tom lane
On Mon, 15 Apr 2002, Tom Lane wrote:

> > I'm not looking for "runs a bit faster"; five percent either way
> > makes little difference to me. I'm looking for a five-fold performance
> > increase.
>
> You are not going to get it from this; where in the world did you get
> the notion that data integrity costs that much?

Um...the fact that MySQL imports the same data five times as fast? :-)

Note that this is *only* related to bulk-importing huge amounts of data. Postgres seems a little bit slower than MySQL at building the indices afterwards, but this would be expected since (probably due to higher tuple overhead) the size of the data once in Postgres is about 75% larger than in MySQL: 742 MB vs. 420 MB. I've not done any serious testing of query speed, but the bit of toying I've done with it shows no major difference.

> Have you tried all the usual speedup hacks? Turn off fsync, if you
> really think you do not care about crash integrity; use COPY FROM STDIN
> to bulk-load data, not retail INSERTs; possibly drop and recreate
> indexes rather than updating them piecemeal; etc. You should also
> consider not declaring foreign keys, as the runtime checks for reference
> validity are pretty expensive.

Yes, I did all of the above. (This was all mentioned in my initial message, except for turning off foreign key constraints--but the table has no foreign keys.)

What I'm thinking would be really cool would be to have an "offline" way of creating tables using a stand-alone program that would write the files at, one hopes, near disk speed. Maybe it could work by creating the tables in a detached tablespace, and then you'd attach the tablespace when you're done. It might even be extended to be able to do foreign key checks, create indices, and so on. (Foreign key checks would be useful; I'm not sure that creating indices would be any faster than just doing it after the tablespace is attached.)

This would be particularly useful for fast restores of backups. Downtime while doing a restore is always a huge pain for large databases.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson wrote:
> On Mon, 15 Apr 2002, Tom Lane wrote:
>
> > > I'm not looking for "runs a bit faster"; five percent either way
> > > makes little difference to me. I'm looking for a five-fold performance
> > > increase.
> >
> > You are not going to get it from this; where in the world did you get
> > the notion that data integrity costs that much?
>
> Um...the fact that MySQL imports the same data five times as fast? :-)
>
> Note that this is *only* related to bulk-importing huge amounts of
> data. Postgres seems a little bit slower than MySQL at building
> the indices afterwards, but this would be expected since (probably
> due to higher tuple overhead) the size of the data once in Postgres
> is about 75% larger than in MySQL: 742 MB vs. 420 MB. I've not done
> any serious testing of query speed, but the bit of toying I've done
> with it shows no major difference.

Can you check your load and see if there is a PRIMARY key on the table at the time it is being loaded. In the old days, we created indexes only after the data was loaded, but when we added PRIMARY key, pg_dump was creating the table with PRIMARY key then loading it, meaning the table was being loaded while it had an existing index. I know we fixed this recently but I am not sure if it was in 7.2 or not.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
On Mon, 15 Apr 2002 21:44:26 -0400 (EDT)
"Bruce Momjian" <pgman@candle.pha.pa.us> wrote:

> In the old days, we created indexes only after the data was loaded,
> but when we added PRIMARY key, pg_dump was creating the table with
> PRIMARY key then loading it, meaning the table was being loaded while
> it had an existing index. I know we fixed this recently but I am not
> sure if it was in 7.2 or not.

It's not in 7.2 -- but it's fixed in CVS.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
On Mon, 15 Apr 2002, Bruce Momjian wrote:

> Can you check your load and see if there is a PRIMARY key on the table
> at the time it is being loaded.

There is not. I create the table with a PRIMARY KEY declaration, but I drop that index before doing the import, and do an ALTER TABLE to re-add the primary key afterwards.

At one point I tried doing a load with all indices enabled, but after about eight or nine hours I gave up. (Typically the load takes about 30 minutes. This is using about 2% of the sample data.)

> In the old days, we created indexes only after the data was loaded,
> but when we added PRIMARY key, pg_dump was creating the table with
> PRIMARY key then loading it, meaning the table was being loaded while
> it had an existing index. I know we fixed this recently but I am not
> sure if it was in 7.2 or not.

Ah, I saw that fix. But I'm doing the load by hand, not using pg_restore.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
On Tue, 16 Apr 2002, Curt Sampson wrote:

[snip]

> What I'm thinking would be really cool would be to have an "offline"
> way of creating tables using a stand-alone program that would write
> the files at, one hopes, near disk speed.

Personally, I think there is some merit in this. Postgres can be used for large-scale data mining, an application which does not (usually) need multi-versioning and concurrency, but which can benefit from Postgres's implementation of SQL, as well as backend extensibility.

I don't see any straightforward way of modifying the code to allow a fast path directly to the relations on disk. However, it should be possible to bypass locking, RI, MVCC etc. with the use of a bootstrap-like tool. Such a tool would only be able to be used when the database was offline. It would read data from files passed to it in some format, perhaps that generated by COPY.

Given the very low parsing and 'planning' overhead, the real cost would be WAL (the bootstrapper could fail and render the database unusable) and the subsequent updating of on-disk relations.

Comments?

Gavin
On Tue, 16 Apr 2002, Gavin Sherry wrote:

> I don't see any straightforward way of modifying the code to allow a
> fast path directly to the relations on disk. However, it should be
> possible to bypass locking, RI, MVCC etc. with the use of a
> bootstrap-like tool.

That was my thought. I'm not asking for disk-speed writes through the database server itself; I can make do with taking the server off-line, doing the imports, and bringing it back again, as you suggest.

> Given the very low parsing and 'planning' overhead, the real cost would
> be WAL (the bootstrapper could fail and render the database unusable)
> and the subsequent updating of on-disk relations.

MS SQL Server, when doing a BULK INSERT or BCP, can do it as a fully or "minimally" logged operation. When minimally logged, there's no ability to roll forward or recover inserted data, just the ability to go back to the state at the beginning of the operation. This technique can work even on an on-line database. A bit more information is available at

    http://msdn.microsoft.com/library/en-us/adminsql/ad_impt_bcp_9esz.asp

(You may want to browse this with lynx; the javascript in it is going to force your screen into a configuration with frames.) You can follow some of the links in that page for further information.

Another option, for off-line databases, might just be not to log at all. If you take a backup first, it may be faster to restore the backup and start again than to try to roll back the operation, or roll it forward to partial completion and then figure out where to restart your import. This seems especially likely if you can restore only the files relating to the table that was actually damaged.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
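(For readers unfamiliar with the MS SQL side of this, the operation being described looks roughly like the following; the database name and file path are placeholders, and the option names are from memory of SQL Server 2000, so treat this as a sketch rather than a reference:)

    -- allow minimally-logged bulk operations in this database
    EXEC sp_dboption 'testdb', 'select into/bulkcopy', 'true';

    -- table-locked bulk load, eligible for minimal logging
    BULK INSERT bigone FROM 'c:\data\bigone.dat' WITH (TABLOCK);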
On Tue, 16 Apr 2002, Curt Sampson wrote:

> > Given the very low parsing and 'planning' overhead, the real cost would
> > be WAL (the bootstrapper could fail and render the database unusable)
> > and the subsequent updating of on-disk relations.
>
> MS SQL Server, when doing a BULK INSERT or BCP, can do it as a fully or
> "minimally" logged operation. When minimally logged, there's no ability
> to roll forward or recover inserted data, just the ability to go back
> to the state at the beginning of the operation. This technique can work
> even on an on-line database. A bit more information is available at

The other reason I say that this bootstrap tool would still use WAL is that bypassing WAL would require writing a fairly significant amount of code (unless the pre-WAL heap_insert() code could be used, with relevant modification). On the other hand, I would imagine it to be very difficult to implement an 'interactive' roll back facility with the kind of tool I am describing.

Gavin