Thread: autocommit (true/false) for more than 1 million records
Hello,

Trying to insert 1 million records into one table through Java JDBC into psql 8.3. May I know whether (1) or (2) is better, please?

(1) set autocommit(true)
(2) set autocommit(false)
    commit every n records (e.g., 100, 500, 1000, etc.)

Thanks a lot!
Emi
Emi Lu-2 wrote:
> Hello,
>
> Trying to insert into one table with 1 million records through java JDBC
> into psql8.3. May I know (1) or (2) is better please?
>
> (1) set autocommit(true)
> (2) set autocommit(false)
>     commit every n records (e.g., 100, 500, 1000, etc)
>
> Thanks a lot!
> Emi

Typically, the larger the n the better. Locking and the risk of data loss on a failure are the tradeoffs to consider. Other factors, like memory, make choosing too large an n bad, so 500,000 is probably wrong, but 500 is probably overly conservative. Better advice depends on context and hardware.

You should also consider upgrading to a newer, supported version of PostgreSQL.

David J.
* Emi Lu (emilu@encs.concordia.ca) wrote:
> Hello,
>
> Trying to insert into one table with 1 million records through java
> JDBC into psql8.3. May I know (1) or (2) is better please?
>
> (1) set autocommit(true)
> (2) set autocommit(false)
>     commit every n records (e.g., 100, 500, 1000, etc)

It depends on what you need.

Data will be available to concurrent processes earlier with (1), while (2) will go faster.

Thanks,

Stephen
>> Trying to insert into one table with 1 million records through java
>> JDBC into psql8.3. May I know (1) or (2) is better please?
>>
>> (1) set autocommit(true)
>> (2) set autocommit(false)
>>     commit every n records (e.g., 100, 500, 1000, etc)
>
> It depends on what you need.
>
> Data will be available to concurrent processes earlier with (1), while
> (2) will go faster.

No need to worry about the lock/losing records, because after data loading we will do a check. For now, I'd like the fastest way. Would you suggest committing every 1000 or 3000 records?

Thanks a lot!
Emi
* Emi Lu (emilu@encs.concordia.ca) wrote:
> >> Trying to insert into one table with 1 million records through java
> >> JDBC into psql8.3. May I know (1) or (2) is better please?
> >>
> >> (1) set autocommit(true)
> >> (2) set autocommit(false)
> >>     commit every n records (e.g., 100, 500, 1000, etc)
> >
> > It depends on what you need.
> >
> > Data will be available to concurrent processes earlier with (1), while
> > (2) will go faster.
>
> No need to worry about the lock/losing records, because after data
> loading we will do a check. For now, I'd like the fastest way. Would
> you suggest commit every 1000 or 3000 records?

The improvement drops off pretty quickly in my experience, but it depends on the size of the records and other things.

Try it and see..? It's almost certainly going to depend on your specific environment.

Thanks,

Stephen
Good morning,

>>>> Trying to insert into one table with 1 million records through java
>>>> JDBC into psql8.3. May I know (1) or (2) is better please?
>>>>
>>>> (1) set autocommit(true)
>>>> (2) set autocommit(false)
>>>>     commit every n records (e.g., 100, 500, 1000, etc)
>>>
>>> It depends on what you need.
>>>
>>> Data will be available to concurrent processes earlier with (1), while
>>> (2) will go faster.
>>
>> No need to worry about the lock/losing records, because after data
>> loading we will do a check. For now, I'd like the fastest way. Would
>> you suggest commit every 1000 or 3000 records?
>
> The improvement drops off pretty quickly in my experience, but it
> depends on the size of the records and other things.

The table is huge, with almost 170 columns.

> Try it and see..? It's almost certainly going to depend on your
> specific environment.

Can you let me know what the "specific environment" factors are, please? Such as: ......

By the way, could someone let me know why set autocommit(false) is for sure faster than true, please? Or point me to some online docs that talk about this.

Thanks a lot!
Emi
On Mon, Aug 25, 2014 at 9:40 AM, Emi Lu <emilu@encs.concordia.ca> wrote:
By the way, could someone let me know why set autocommit(false) is for sure faster than true please? Or, some online docs talk about this.
Not sure about the docs specifically but:
Commit is expensive because, as soon as it is issued, all of the data has to be guaranteed to be written. While ultimately the same amount of data is guaranteed either way, doing commits in batches gives an opportunity to achieve economies of scale.
(I think...)
When you commit you flush data to disk - until then you can make use of RAM. Once you exhaust RAM you might as well commit and free up that RAM for the next batch.
David J.
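For illustration only, a minimal JDBC sketch of option (2) as discussed above; the table, columns, batch size, and row source are placeholders, not anything from the original post:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public final class BatchedLoader {
    /** Option (2): autocommit off, one COMMIT per batchSize rows. */
    public static void load(Connection conn, Iterable<String[]> rows, int batchSize)
            throws SQLException {
        conn.setAutoCommit(false);                      // turn off per-statement commits
        // Placeholder statement; the poster's real table has ~170 columns.
        String sql = "INSERT INTO target_table (col1, col2) VALUES (?, ?)";
        int count = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++count % batchSize == 0) {
                    ps.executeBatch();                  // ship the batch to the server
                    conn.commit();                      // one commit per n rows, not per row
                }
            }
            ps.executeBatch();                          // flush any remainder
            conn.commit();
        }
    }
}
```

The point is simply that executeBatch()/commit() run once per n rows instead of once per row, which is where the economy of scale described above comes from.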
Hi Emi,
Databases that comply with the ACID standard (http://en.wikipedia.org/wiki/ACID) ensure that there is no data loss by first writing the data changes to the database log, as opposed to updating the actual data on the filesystem first (in the datafiles).
Each database has its own way of doing it, but it basically consists of writing the data to the logfile at each COMMIT and writing the data to the datafile only when it's necessary.
So the COMMIT command is a way of telling the database to write the data changes to the logfile.
Both logfiles and datafiles reside on the filesystem, so why is writing to the logfile faster?
It is because the logfile is written sequentially, while the datafile is totally dispersed and may even be fragmented.
In short: autocommit false is faster because you avoid going to the hard disk to write the changes into the logfile for every single row; you keep them in RAM until you decide to flush them to the logfile (every 10K rows, for instance).
Be aware that, eventually, you will need to write the data to the logfile, so you can't avoid that. But performance is usually better if you write X rows at a time to the logfile, rather than writing each and every row one by one (because of the hard disk write overhead).
The number of rows you need to write to get better performance will depend on your environment and is pretty much found by trial and error. For millions of rows, I usually commit every 10K or 50K rows.
Regards,
Felipe
On Fri, Aug 22, 2014 at 1:49 PM, Emi Lu <emilu@encs.concordia.ca> wrote:
Hello,
Trying to insert into one table with 1 million records through java JDBC into psql8.3. May I know (1) or (2) is better please?
(1) set autocommit(true)
(2) set autocommit(false)
commit every n records (e.g., 100, 500, 1000, etc)
In general it is better to use COPY (in whatever way the JDBC driver for 8.3 exposes it), as that is designed specifically for bulk loading.
Then it doesn't matter whether autocommit is on or off, because the COPY is a single statement.
Cheers,
Jeff
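For reference, a hedged sketch of that suggestion from Java, assuming the PostgreSQL JDBC driver's CopyManager API (org.postgresql.copy.CopyManager); the URL, credentials, table name, and file path are placeholders:

```java
import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public final class CopyLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/mydb", "user", "password");
             Reader in = new FileReader("/path/to/data.csv")) {
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            // One COPY statement is one transaction, so the autocommit
            // setting makes no practical difference for the load itself.
            long rows = copy.copyIn("COPY target_table FROM STDIN WITH CSV", in);
            System.out.println("Loaded " + rows + " rows");
        }
    }
}
```

Older drivers may require a cast to PGConnection instead of unwrap(), so treat this as a sketch rather than a drop-in.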
On the COPY's atomicity -- looking for a definitive answer from a core
developer, not a user's guess, please.
Suppose I COPY a huge amount of data, e.g. 100 records.
My 99 records are fine for the target, and the 100-th is not -- it
comes with a wrong record format or a target constraint violation.
The whole thing is aborted then, and the good 99 records are not
making it into the target table.
My question is: Where are these 99 records have been living, on the
database server, while the 100-th one hasn't come yet, and the need to
throw the previous data accumulation away has not come yet?
There have to be some limits to the space and/or counts taken by the
new, uncommitted, data, while the COPY operation is still in progress.
What are they?
Say, I am COPYing 100 TB of data and the bad records are close to the
end of the feed -- how will this all error out?
Thanks,
-- Alex
Alex Goncharov <alex.goncharov.usa@gmail.com> wrote:
> Suppose I COPY a huge amount of data, e.g. 100 records.
>
> My 99 records are fine for the target, and the 100-th is not --
> it comes with a wrong record format or a target constraint
> violation.
>
> The whole thing is aborted then, and the good 99 records are not
> making it into the target table.

Right. This is one reason people often batch such copies or check the data very closely before copying in.

> My question is: Where are these 99 records have been living, on
> the database server, while the 100-th one hasn't come yet, and
> the need to throw the previous data accumulation away has not
> come yet?

They will have been written into the table. They do not become visible to any other transaction until and unless the inserting transaction successfully commits. These slides may help:

http://momjian.us/main/writings/pgsql/mvcc.pdf

> There have to be some limits to the space and/or counts taken by
> the new, uncommitted, data, while the COPY operation is still in
> progress. What are they?

Primarily disk space for the table. If you are not taking advantage of the "unlogged load" optimization, you will have written Write Ahead Log (WAL) records, too -- which (depending on your configuration) you may be archiving. In that case, you may need to be concerned about the archive space required. If you have foreign keys defined for the table, you may get into trouble on the RAM used to track pending checks for those constraints. I would recommend adding any FKs after you are done with the big bulk load.

PostgreSQL does *not* have a "rollback log" which will impose a limit.

> Say, I am COPYing 100 TB of data and the bad records are close
> to the end of the feed -- how will this all error out?

The rows will all be in the table, but not visible to any other transaction. Autovacuum will clean them out in the background, but if you want to restart your load against an empty table it might be a good idea to TRUNCATE that table; it will be a lot faster.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
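The advice about foreign keys and TRUNCATE maps to a few plain statements around the bulk load; a minimal JDBC sketch with hypothetical constraint, table, and column names (not from the thread):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public final class BulkLoadHousekeeping {
    /** Drop the FK before the big load; re-add it afterwards. */
    public static void dropFk(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("ALTER TABLE target_table DROP CONSTRAINT target_parent_fk");
        }
    }

    public static void addFk(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("ALTER TABLE target_table ADD CONSTRAINT target_parent_fk "
                     + "FOREIGN KEY (parent_id) REFERENCES parent_table (id)");
        }
    }

    /** After a failed load into an otherwise empty table, TRUNCATE is much
     *  cheaper than waiting for autovacuum to clean out the dead rows. */
    public static void resetAfterFailedLoad(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("TRUNCATE TABLE target_table");
        }
    }
}
```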
Thank you, Kevin -- this is helpful.
But it still leaves questions for me.
Kevin Grittner <kgrittn@ymail.com> wrote:
> Alex Goncharov <alex.goncharov.usa@gmail.com> wrote:
> > The whole thing is aborted then, and the good 99 records are not
> > making it into the target table.
>
> Right. This is one reason people often batch such copies or check
> the data very closely before copying in.
How do I decide, before starting a COPY data load, whether such a load
protection ("complexity") makes sense ("is necessary")?
Clearly not needed for 1 MB of data in a realistic environment.
Clearly is needed for loading 1 TB in a realistic environment.
To put it differently: If I COPY 1 TB of data, what criteria should I
use for choosing the size of the chunks to split the data into?
For INSERT-loading, for the database client interfaces offering the
array mode, the performance difference between loading 100 or 1000
rows at a time is usually negligible if any. Therefore 100- and
1000-row's array sizes are both reasonable choices.
But what is a reasonable size for a COPY chunk? It can't even be
measured in rows.
Note, that if you have a 1 TB record-formatted file to load, you can't
just split it in 1 MB chunks and feed them to COPY -- the file has to
be split on the record boundaries.
So, splitting the data for COPY is not a trivial operation, and if
such splitting can be avoided, a reasonable operator will avoid it.
But then again: when can it be avoided?
> > My question is: Where are these 99 records have been living, on
> > the database server, while the 100-th one hasn't come yet, and
> > the need to throw the previous data accumulation away has not
> > come yet?
>
> They will have been written into the table. They do not become
> visible to any other transaction until and unless the inserting
> transaction successfully commits. These slides may help:
>
> http://momjian.us/main/writings/pgsql/mvcc.pdf
Yeah, I know about the MVCC model... The question is about the huge
data storage to be reserved without a commitment while the load is not
completed, about the size constrains in effect here.
> > There have to be some limits to the space and/or counts taken by
> > the new, uncommitted, data, while the COPY operation is still in
> > progress. What are they?
>
> Primarily disk space for the table.
How can that be found? Is "df /mount/point" the deciding factor? Or
some 2^32 or 2^64 number?
> If you are not taking advantage of the "unlogged load" optimization,
> you will have written Write Ahead Log (WAL) records, too -- which
> (depending on your configuration) you may be archiving. In that
> case, you may need to be concerned about the archive space required.
"... may need to be concerned ..." if what? Loading 1 MB? 1 GB? 1 TB?
If I am always concerned, and check something before a COPY, what
should I be checking? What are the "OK-to-proceed" criteria?
> If you have foreign keys defined for the table, you may get into
> trouble on the RAM used to track pending checks for those
> constraints. I would recommend adding any FKs after you are done
> with the big bulk load.
I am curious about the simplest case where only the data storage is to
be worried about. (As an aside: the CHECK and NOT NULL constrains are
not a storage factor, right?)
> PostgreSQL does *not* have a "rollback log" which will impose a
> limit.
Something will though, right? What would that be? The available disk
space on a file system? (I would be surprised.)
> > Say, I am COPYing 100 TB of data and the bad records are close
> > to the end of the feed -- how will this all error out?
>
> The rows will all be in the table, but not visible to any other
> transaction.
I see. How much data can I fit there while doing COPY? Not 1 TB?
-- Alex
> Thank you, Kevin -- this is helpful.
> But it still leaves questions for me.
>
> >>> The whole thing is aborted then, and the good 99 records are not
> >>> making it into the target table.
> >>
> >> Right. This is one reason people often batch such copies or check
> >> the data very closely before copying in.
>
> How do I decide, before starting a COPY data load, whether such a load
> protection ("complexity") makes sense ("is necessary")?

You should probably consider something like: http://pgloader.io/
(I know there are others, this one apparently has the best marketing team...)
Normal case, with normal COPY, you load a bad file into an empty table, it fails, you truncate and get better data for the next attempt.
How long that will take is system (IOPS/CPU) and data dependent.
The probability of failure is source dependent - and prior experience plays a large role here as well.
If you plan to load directly into a live table the wasted space from a bad load could kill you so smaller partial loads are better - if you can afford the implicit system inconsistency such a partial load would cause.
If you understand how the system works, you should be able to evaluate the different pieces and come to a conclusion as to how best to proceed in a specific situation. No one else on this list has the relevant information to make that judgement call. If this is just asking about rules of thumb, I'd say figure out how many records 100 MB consumes and COMMIT after that many records. 10,000 records is also a nice round number to pick - regardless of the number of MB consumed. Start there and tweak based upon experience.
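As one way to apply that rule of thumb, the commit interval can be derived from the average row size of a sample of the data file; the path, sample size, and fallback value in this sketch are assumptions for illustration only:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public final class BatchSizeEstimator {
    /** Roughly how many rows fit into targetBytes (e.g. 100 MB), judged from
     *  the first sampleRows lines of the data file. */
    public static long estimateRowsPerCommit(String path, long targetBytes, int sampleRows)
            throws IOException {
        long sampledBytes = 0;
        int lines = 0;
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            String line;
            while (lines < sampleRows && (line = r.readLine()) != null) {
                sampledBytes += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for newline
                lines++;
            }
        }
        if (lines == 0 || sampledBytes == 0) {
            return 10_000;                               // fall back to the round-number guess
        }
        long avgRowBytes = Math.max(1, sampledBytes / lines);
        return Math.max(1, targetBytes / avgRowBytes);
    }
}
```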
> >> If you are not taking advantage of the "unlogged load" optimization,
> >> you will have written Write Ahead Log (WAL) records, too -- which
> >> (depending on your configuration) you may be archiving. In that
> >> case, you may need to be concerned about the archive space required.
>
> "... may need to be concerned ..." if what? Loading 1 MB? 1 GB? 1 TB?
>
> If I am always concerned, and check something before a COPY, what
> should I be checking? What are the "OK-to-proceed" criteria?
If you only have 500k free in your archive directory that 1MB file will pose a problem...though if you have 4TB of archive available the 1TB would fit easily. Do you compress your WAL files before shipping them off to the archive? How compressible is your data?
I'm sure people have decent rules-of-thumb here but in the end your specific environment and data, especially at the TB scale, is going to be important; and is something that you will only discover through testing.
> >> If you have foreign keys defined for the table, you may get into
> >> trouble on the RAM used to track pending checks for those
> >> constraints. I would recommend adding any FKs after you are done
> >> with the big bulk load.
>
> I am curious about the simplest case where only the data storage is to
> be worried about. (As an aside: the CHECK and NOT NULL constraints are
> not a storage factor, right?)
Correct
> >> PostgreSQL does *not* have a "rollback log" which will impose a
> >> limit.
>
> Something will though, right? What would that be? The available disk
> space on a file system? (I would be surprised.)
>
> >>> Say, I am COPYing 100 TB of data and the bad records are close
> >>> to the end of the feed -- how will this all error out?
> >>
> >> The rows will all be in the table, but not visible to any other
> >> transaction.
>
> I see. How much data can I fit there while doing COPY? Not 1 TB?
You need the same amount of space that you would require if the file imported to completion.
PostgreSQL is optimistic in this regard - it assumes you will commit and so up until failure there is no difference between a good and bad import. The magic is described in Slide 24 of the MVCC link above (http://momjian.us/main/writings/pgsql/mvcc.pdf) - if the transaction is aborted then as far as the system is concerned the written data has been deleted and can be cleaned up just like if the following sequence of commands occurred:
BEGIN;
COPY tbl FROM ....;
COMMIT; ---success
DELETE FROM tbl ....;
Hence the comment to "TRUNCATE" after a failed load if at all possible - to avoid the unnecessary VACUUM on tbl...
QUESTION: would the vacuum reclaim the disk space in this situation? (I presume yes.) Because if not, and another import were to be attempted, ideally the allocated space could be reused.
I'm not sure what a reasonable formula would be, especially at the TB scale, but roughly 2x the size of the imported (uncompressed) file would be a good starting point (table + WAL). You likely would want many multiples of this unless you are dealing with a one-off event. Indexes and dead tuples in particular are likely to be involved. You get some leeway depending on compression but that is data specific and thus something you have to test yourself if you are operating at the margin of your system's resources.
David J.
> Thank you, Kevin -- this is helpful.
Thank you David, too.
> But it still leaves questions for me.
Still...
Alex Goncharov <alex.goncharov.usa@gmail.com> wrote:
>>> How do I decide, before starting a COPY data load, whether such a load
>>> protection ("complexity") makes sense ("is necessary")?
This is *the* practical question.
David G Johnston <david.g.johnston@gmail.com> wrote:
> You should probably consider something like: http://pgloader.io/
This is not my question; I want to see if anybody can offer a
meaningful situation evaluation strategy for a potential using or not
using COPY for loading the "big data".
If nobody can, fine: it'll give me the reason to claim "Nobody knows".
> Normal case, with normal COPY,
This is the case I am asking about: the COPY operation limitations for
the "big data": until what point a plain COPY can be used.
> you load a bad file into an empty table, it fails, you truncate and
> get better data for the next attempt.
This is not how many businesses operate.
> How long that will take is system (IOPS/CPU) and data dependent.
"How long", was not the question: my question was originally about the
behavior for a bad record at the end of a large data set submitted to
COPY; when it was stated that the data "in process" becomes an
invisible (until committed) part of the target table, it became
obvious to me that the fundamental question has to be asked: "How much
can fit there, in the temporary operational space (whatever it's
called in PostgreSQL.)?" "df /mount -> free" or "2^32"?
> The probability of failure is source dependent - and prior
> experience plays a large role here as well.
Not the question.
> If you plan to load directly into a live table the wasted space from
> a bad load could kill you so smaller partial loads are better - if
> you can afford the implicit system inconsistency such a partial load
> would cause.
Not the question.
> If you understand how the system works
I don't, to the necessary extent, so I asked for an expert opinion :)
> you should be able to evaluate the different pieces and come to a
> conclusion as how best to proceed in a specific situation. No one
> else on this list has the relevant information to make that
> judgement call.
We'll see; too early to tell yet :)
> If this is just asking about rules-of-thumb
Yes.
> I'd say figure out how many records 100MB consumes and COMMIT after that
> many records.
Pardon me: I am running COPY and know how many records are processed
so far?.. (Hmm... can't be.)
> 10,000 records is also a nice round number to pick - regardless of
> the amount of MB consumed. Start there and tweak based upon
> experience.
You are clearly suggesting to split the large data file into many
small ones. To split very intelligently, on the record boundaries.
And since this is very hard and would involve quite another, external
processing machinery, I am trying to understand until what point this
is safe not to do (subject to what factors.)
> If you are not taking advantage of the "unlogged load" optimization,
I don't see any way to control this for COPY only. Are you talking
about the 'postgresql.conf' settings?
> If you only have 500k free in your archive directory that 1MB file
> will pose a problem...though if you have 4TB of archive available
> the 1TB would fit easily.
So the answer to the "How much data can fit in the COPY storage
areas?" question is solely a "df /mount/point" thing?
I.e. before initiating the COPY, I should:
ls -l DATA-FILE
df -m /server/db-cluster/pg_data-or-something
compare the two values and be assured that my COPY will reach the end
of my DATA-FILE (whether it stumbles at the end or not) if the former
value is meaningfully smaller than the latter?
I would take this for the answer. (Let's see if there are other
evaluation suggestions.)
> Do you compress your WAL files before shipping them off to the
> archive? How compressible is your data?
Try to give me the upper limit evaluation strategy, when all the
compression and archive factors are working in my favor.
> I'm sure people have decent rules-of-thumb here
I would love to hear about them.
> but in the end your specific environment and data, especially at the
> TB scale, is going to be important; and is something that you will
> only discover through testing.
"Don't malloc 2 GB on a system with 100 MB RAM" is a meaningful rule
of thumb, not requiring any testing. I am looking for similar simple
guiding principles for COPY.
>> > > Say, I am COPYing 100 TB of data and the bad records are close
>> > > to the end of the feed -- how will this all error out?
>> >
>> > The rows will all be in the table, but not visible to any other
>> > transaction.
>>
>> I see. How much data can I fit there while doing COPY? Not 1 TB?
> You need the same amount of space that you would require if the file
> imported to completion.
> PostgreSQL is optimistic in this regard - it assumes you will commit
> and so up until failure there is no difference between a good and
> bad import.
I can see it now, thanks.
> I'm not sure what a reasonable formula would be, especially at the TB
> scale,
Make it 1 GB then :)
Can I load 1 GB (uncompressed) via one COPY?
When not -- when "df" says that there is less than 10 GB of free disk
space in the relevant file systems? Would that be all I need to know?
> but roughly 2x the size of the imported (uncompressed) file would be
> a good starting point (table + WAL). You likely would want many
> multiples of this unless you are dealing with a one-off event.
> Indexes and dead tuples in particular are likely to be involved.
> You get some leeway depending on compression but that is data
> specific and thus something you have to test yourself if you are
> operating at the margin of your system's resources.
I am willing to accept any factor -- 2x, 10x. I want to be certain the
factor of what over what, though. So far, only the "df-free-space" to
"data-file-size" consideration has come up.
Thanks,
-- Alex
On Wed, Aug 27, 2014 at 1:02 AM, Alex Goncharov <alex.goncharov.usa@gmail.com> wrote:
> Thank you, Kevin -- this is helpful.
Thank you David, too.

> But it still leaves questions for me.
Still...

Alex Goncharov <alex.goncharov.usa@gmail.com> wrote:
>>> How do I decide, before starting a COPY data load, whether such a load
>>> protection ("complexity") makes sense ("is necessary")?
This is *the* practical question.

David G Johnston <david.g.johnston@gmail.com> wrote:
> You should probably consider something like: http://pgloader.io/
This is not my question; I want to see if anybody can offer a
meaningful situation evaluation strategy for a potential using or not
using COPY for loading the "big data".
OK. Though I presume that, given COPY's limitations - of which the whole "all-or-nothing" behavior is one - pointing out more user-friendly APIs would be worthwhile.
If nobody can, fine: it'll give me the reason to claim "Nobody knows".
> Normal case, with normal COPY,
This is the case I am asking about: the COPY operation limitations for
the "big data": until what point a plain COPY can be used.
> you load a bad file into an empty table, it fails, you truncate and
> get better data for the next attempt.
This is not how many businesses operate.
Yet this is basically what you are asking about....
> How long that will take is system (IOPS/CPU) and data dependent.
"How long", was not the question: my question was originally about the
behavior for a bad record at the end of a large data set submitted to
COPY; when it was stated that the data "in process" becomes an
invisible (until committed) part of the target table, it became
obvious to me that the fundamental question has to be asked: "How much
can fit there, in the temporary operational space (whatever it's
called in PostgreSQL.)?" "df /mount -> free" or "2^32"?
> The probability of failure is source dependent - and prior
> experience plays a large role here as well.
Not the question.
> If you plan to load directly into a live table the wasted space from
> a bad load could kill you so smaller partial loads are better - if
> you can afford the implicit system inconsistency such a partial load
> would cause.
Not the question.
These were things to consider when deciding on whether it is worthwhile to split the large file into chunks.
> If you understand how the system works
I don't, to the necessary extent, so I asked for an expert opinion :)
> you should be able to evaluate the different pieces and come to a
> conclusion as how best to proceed in a specific situation. No one
> else on this list has the relevant information to make that
> judgement call.
We'll see; too early to tell yet :)
> If this is just asking about rules-of-thumb
Yes.
> I'd say figure out how many records 100MB consumes and COMMIT after that
> many records.
Pardon me: I am running COPY and know how many records are processed
so far?.. (Hmm... can't be.)
Take your 1 TB file, extract the first 100 MB, count the number of record separators, and commit after that many records.
> 10,000 records is also a nice round number to pick - regardless of
> the amount of MB consumed. Start there and tweak based upon
> experience.
You are clearly suggesting to split the large data file into many
small ones. To split very intelligently, on the record boundaries.
And since this is very hard and would involve quite another, external
processing machinery, I am trying to understand until what point this
is safe not to do (subject to what factors.)
See thoughts to consider from previous e-mail.
> If you are not taking advantage of the "unlogged load" optimization,

I don't see any way to control this for COPY only. Are you talking
about the 'postgresql.conf' settings?
I am not sure if this is the same thing but I am pretty sure he is referring to creating an unlogged table as the copy target - thus avoiding WAL.
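A sketch of that idea using an unlogged staging table; note that UNLOGGED tables exist only in PostgreSQL 9.1 and later (not in the 8.3 the original poster runs), and every object name here is hypothetical:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public final class UnloggedStaging {
    /** Create an UNLOGGED copy of the target's structure; COPY into it first. */
    public static void prepareStaging(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE UNLOGGED TABLE staging_table (LIKE target_table)");
        }
    }

    /** Once the staged data has passed its checks, move it into the real table. */
    public static void promote(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            st.execute("INSERT INTO target_table SELECT * FROM staging_table");
            st.execute("DROP TABLE staging_table");
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```

The COPY into the unlogged table writes no WAL for its rows; the final INSERT ... SELECT is logged as usual, and a failed load can simply be dropped along with the staging table.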
> If you only have 500k free in your archive directory that 1MB file
> will pose a problem...though if you have 4TB of archive available
> the 1TB would fit easily.
So the answer to the "How much data can fit in the COPY storage
areas?" question is solely a "df /mount/point" thing?
I.e. before initiating the COPY, I should:
ls -l DATA-FILE
df -m /server/db-cluster/pg_data-or-something
compare the two values and be assured that my COPY will reach the end
of my DATA-FILE (whether is stumbles in the end or not) if the former
value is meaningfully smaller than the latter?
I would take this for the answer. (Let's see if there are other
evaluation suggestions.)
That should get the copy to succeed though whether you blow up your archives or slaves would not be addressed.
> Do you compress your WAL files before shipping them off to the
> archive? How compressible is your data?
Try to give me the upper limit evaluation strategy, when all the
compression and archive factors are working in my favor.
Assume worst-case unless you know, from experimentation, what an appropriate compression factor would be. Keep in mind that I presume you expect other simultaneous activity on the same server. If you then fall into a marginal situation, you can see whether reducing your estimates to hit your goal is worth the risk. Though you can incorporate that into your overall planned buffer as well.
> I'm sure people have decent rules-of-thumb here
I would love to hear about them.
> but in the end your specific environment and data, especially at the
> TB scale, is going to be important; and is something that you will
> only discover through testing.
"Don't malloc 2 GB on a system with 100 MB RAM" is a meaningful rule
of thumb, not requiring any testing. I am looking for similar simple
guiding principles for COPY.
>> > > Say, I am COPYing 100 TB of data and the bad records are close
>> > > to the end of the feed -- how will this all error out?
>> >
>> > The rows will all be in the table, but not visible to any other
>> > transaction.
>>
>> I see. How much data can I fit there while doing COPY? Not 1 TB?
> You need the same amount of space that you would require if the file
> imported to completion.
> PostgreSQL is optimistic in this regard - it assumes you will commit
> and so up until failure there is no difference between a good and
> bad import.
I can see it now, thanks.
> I'm not sure what a reasonable formula would be, especially at the TB
> scale,
Make it 1 GB then :)
Can I load 1 GB (uncompressed) via one COPY?
You cannot load compressed data via COPY...
While I have never worked at that scale myself, my conclusion thus far is that with enough hard drive space and, at least depending on the FK situation noted, enough RAM, you should be able to load a file of any size with a single COPY without getting any system errors and/or crashing the server (postgres or OS).
In the simple case the question to split depends on the probability of a data error and how much data (and time) you wish to lose should one occur.
When not -- when "df" says that there is less than 10 GB of free disk
space in the relevant file systems? Would that be all I need to know?
> but roughly 2x the size of the imported (uncompressed) file would be
> a good starting point (table + WAL). You likely would want many
> multiples of this unless you are dealing with a one-off event.
> Indexes and dead tuples in particular are likely to be involved.
> You get some leeway depending on compression but that is data
> specific and thus something you have to test yourself if you are
> operating at the margin of your system's resources.
I am willing to accept any factor -- 2x, 10x. I want to be certain the
factor of what over what, though. So far, only the "df-free-space" to
"data-file-size" consideration has come up.
RAM did as well and that was not enumerated - other than it being optional in the case that no FKs are defined.
I'm not sure what kind of overhead there is on WAL and data pages but there is going to be some. At scale hopefully compression would wash out the overhead so figuring 2x for any stored data seems like reasonable disk free space required for the most basic scenario. Indexes count as data - you'd likely want to consider at least one that operates as the primary key.
Kevin is a better source for this than I - mostly I'm drawing conclusions from what I read in his post.
David J.
Alex Goncharov wrote:
> Thank you, Kevin -- this is helpful.
>
> But it still leaves questions for me.
>
>> Alex Goncharov <alex.goncharov.usa@gmail.com> wrote:
>>> The whole thing is aborted then, and the good 99 records are not
>>> making it into the target table.
>>
>> Right. This is one reason people often batch such copies or check
>> the data very closely before copying in.
>
> How do I decide, before starting a COPY data load, whether such a load
> protection ("complexity") makes sense ("is necessary")?
>
> Clearly not needed for 1 MB of data in a realistic environment.
>
> Clearly is needed for loading 1 TB in a realistic environment.
>
> To put it differently: If I COPY 1 TB of data, what criteria should I
> use for choosing the size of the chunks to split the data into?
>
> For INSERT-loading, for the database client interfaces offering the
> array mode, the performance difference between loading 100 or 1000
> rows at a time is usually negligible if any. Therefore 100- and
> 1000-row's array sizes are both reasonable choices.
>
> But what is a reasonable size for a COPY chunk? It can't even be
> measured in rows.
>
> Note, that if you have a 1 TB record-formatted file to load, you can't
> just split it in 1 MB chunks and feed them to COPY -- the file has to
> be split on the record boundaries.
>
> So, splitting the data for COPY is not a trivial operation, and if
> such splitting can be avoided, a reasonable operator will avoid it.
>
> But then again: when can it be avoided?

You don't need to split the data at all if you make sure that they are correct.
If you cannot be certain, and you want to avoid having to restart a huge load with corrected data, the batch size is pretty much a matter of taste: How much overhead does it generate to split the data in N parts? How much time are you ready to wait for (re)loading a single part? You'll probably have to experiment to find a solution that fits you.

>>> My question is: Where are these 99 records have been living, on
>>> the database server, while the 100-th one hasn't come yet, and
>>> the need to throw the previous data accumulation away has not
>>> come yet?
>>
>> They will have been written into the table. They do not become
>> visible to any other transaction until and unless the inserting
>> transaction successfully commits. These slides may help:
>>
>> http://momjian.us/main/writings/pgsql/mvcc.pdf
>
> Yeah, I know about the MVCC model... The question is about the huge
> data storage to be reserved without a commitment while the load is not
> completed, about the size constrains in effect here.

I don't understand that question. You need the space anyway to complete the load. If the load fails, you simply reclaim the space (VACUUM) and reuse it. There is no extra storage needed.

>>> There have to be some limits to the space and/or counts taken by
>>> the new, uncommitted, data, while the COPY operation is still in
>>> progress. What are they?
>>
>> Primarily disk space for the table.
>
> How can that be found? Is "df /mount/point" the deciding factor? Or
> some 2^32 or 2^64 number?

Disk space can be measured with "df".

>> If you are not taking advantage of the "unlogged load" optimization,
>> you will have written Write Ahead Log (WAL) records, too -- which
>> (depending on your configuration) you may be archiving. In that
>> case, you may need to be concerned about the archive space required.
>
> "... may need to be concerned ..." if what? Loading 1 MB? 1 GB? 1 TB?
>
> If I am always concerned, and check something before a COPY, what
> should I be checking? What are the "OK-to-proceed" criteria?

That means "you should consider", not "you should be worried". Unless you are loading into a table created in the same transaction, "redo" information will be generated and stored in "WAL files", which end up in your WAL archive. This needs extra storage, proportional to the storage necessary for the data itself.

>> If you have foreign keys defined for the table, you may get into
>> trouble on the RAM used to track pending checks for those
>> constraints. I would recommend adding any FKs after you are done
>> with the big bulk load.
>
> I am curious about the simplest case where only the data storage is to
> be worried about. (As an aside: the CHECK and NOT NULL constrains are
> not a storage factor, right?)

Right.

>> PostgreSQL does *not* have a "rollback log" which will impose a
>> limit.
>
> Something will though, right? What would that be? The available disk
> space on a file system? (I would be surprised.)

You can find something on the limitations here:
http://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_a_table.2C_and_a_database.3F

>>> Say, I am COPYing 100 TB of data and the bad records are close
>>> to the end of the feed -- how will this all error out?
>>
>> The rows will all be in the table, but not visible to any other
>> transaction.
>
> I see. How much data can I fit there while doing COPY? Not 1 TB?

Sure, why not?

Yours,
Laurenz Albe
This might also help:
http://www.postgresql.org/docs/9.1/static/populate.html
Bulk loading tables from text files is "log free" in almost all RDBMSs (Postgres' COPY is one of them).
The reason is that the database doesn't need to waste resources by writing the log because there's no risk of data loss. If the COPY operation fails, your data will still live in the text files you're trying to bulk load from.
[about loading large amounts of data]

Felipe Santos wrote:
> This might also help:
> http://www.postgresql.org/docs/9.1/static/populate.html
>
> Bulk loading tables from text files is "log free" in almost all RDBMSs
> (Postgres' COPY is one of them).
>
> The reason is that the database doesn't need to waste resources by
> writing the log because there's no risk of data loss.  If the COPY
> operation fails, your data will still live in the text files you're
> trying to bulk load from.

That is only true if the table was created in the same transaction as
the COPY statement.
Otherwise it could be that recovery starts after CREATE TABLE but
before COPY, and it would have to recover the loaded data.

Yours,
Laurenz Albe
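To make that condition concrete, here is a minimal JDBC sketch of "table
created in the same transaction as the COPY".  staging_table and the
connection settings are invented placeholders, and the WAL-skipping
optimization this enables only applies when the server is not archiving WAL
or feeding a replica:

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class SameTransactionLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/mydb", "user", "password")) {

            conn.setAutoCommit(false);   // CREATE TABLE and COPY share one transaction

            try (Statement st = conn.createStatement()) {
                // Hypothetical staging table copying the target table's structure.
                st.execute("CREATE TABLE staging_table (LIKE table_name INCLUDING DEFAULTS)");
            }

            CopyManager cm = new CopyManager((BaseConnection) conn);
            try (FileReader fileReader = new FileReader("f1.csv")) {
                cm.copyIn("COPY staging_table FROM STDIN WITH DELIMITER '|'", fileReader);
            }

            conn.commit();               // only now does the loaded data become visible
        }
    }
}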
Alex Goncharov <alex.goncharov.usa@gmail.com> wrote:
> Kevin Grittner <kgrittn@ymail.com> wrote:
>> The rows will all be in the table, but not visible to any other
>> transaction.
>
> How much data can I fit there while doing COPY?  Not 1 TB?

As has already been said, why not?  This is not some special section of
the table -- the data is written to the table.  Period.  Commit or
rollback just tells new transactions whether data flagged with that
transaction number is visible.

Nobody can tell you how much space that will take -- it depends on many
factors, including how many columns of what kind of data, how
compressible it is, and how it is indexed.  But the point is, we are
not talking about any separate space from what is needed to store the
data in the database.

FWIW, I think the largest single COPY statement I ever ran was
generated by pg_dump and piped directly to psql for a major release
upgrade (before pg_upgrade was available), and it was somewhere in the
2TB to 3TB range.  It took a long time, but it "just worked".  That
should be true for 10TB or 100TB, as long as you have sized the machine
correctly and are loading clean data.  Whether you have that covered,
and how you want to "hedge your bets" based on your degree of
confidence in those things is a judgment call.  When I'm in the
position of needing to make such a call, I like to do some tests.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
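In the spirit of "do some tests", a hedged sketch of a sample-based
measurement: COPY a representative slice of the data, time it, and ask the
server how much disk the table uses afterwards.  sample_rows.csv and
table_name are placeholders; scaling the numbers up gives only a rough
estimate (index maintenance and checkpointing can make the full load slower
than linear), but it is usually enough to size disk and WAL space:

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class LoadSampleTest {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/mydb", "user", "password")) {

            CopyManager cm = new CopyManager((BaseConnection) conn);

            long start = System.nanoTime();
            try (FileReader sample = new FileReader("sample_rows.csv")) {
                cm.copyIn("COPY table_name FROM STDIN WITH DELIMITER '|'", sample);
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            // Ask the server how big the table is now (heap + indexes + TOAST).
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT count(*), pg_total_relation_size('table_name') FROM table_name")) {
                rs.next();
                System.out.printf("%d rows loaded in %d ms, table now uses %d bytes%n",
                                  rs.getLong(1), elapsedMs, rs.getLong(2));
            }
        }
    }
}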
Hello All,

I learned a lot from all of your input.  To share one more thing about
Java JDBC bypassing autocommit that I tried:

(1) Read/save the source data into f1.csv, f2.csv, ......

(2) Copy/load into the destination psql DB:

    import java.io.FileReader;
    import org.postgresql.copy.CopyManager;
    import org.postgresql.core.BaseConnection;

    // conn_psql is an open java.sql.Connection to the target database
    CopyManager cm = new CopyManager((BaseConnection) conn_psql);
    FileReader fileReader = new FileReader("f1.csv");
    cm.copyIn("COPY table_name FROM STDIN WITH DELIMITER '|'", fileReader);
    fileReader.close();

Emi
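For comparison with the CopyManager approach, the array-mode INSERT loading
mentioned earlier in the thread (a few hundred to a thousand rows per round
trip, committing after each batch) corresponds to JDBC batching, roughly as
sketched below.  The connection settings, table name and two-column layout
are invented for the example; a real table needs the full column list:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedInsertLoad {
    public static void main(String[] args) throws Exception {
        final int BATCH_SIZE = 1000;   // rows per batch/commit; tune experimentally

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/mydb", "user", "password");
             BufferedReader in = new BufferedReader(new FileReader("f1.csv"));
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO table_name (col1, col2) VALUES (?, ?)")) {

            conn.setAutoCommit(false);
            int rows = 0;
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\\|", -1);   // '|'-delimited, no quoting assumed
                ps.setString(1, f[0]);
                ps.setString(2, f[1]);
                ps.addBatch();
                if (++rows % BATCH_SIZE == 0) {
                    ps.executeBatch();
                    conn.commit();                    // commit after each batch
                }
            }
            ps.executeBatch();                        // flush the last partial batch
            conn.commit();
        }
    }
}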