Thread: Updating a large table
Hello! We have a large table 11GB ( about 37 million records ) and we need to alter a table - add a new column with defaultvalues is false. Also 'NOT NULL' is required. So, first I've done: ALTER TABLE clusters ALTER COLUMN "is_paid"; after that: UPDATE clusters SET is_paid = DEFAULT where ctime <= now() - interval '720h' AND is_paid != FALSE; Everything went ok. Then I tried to run it again for an interval of 1 years. And I got that no one can't see - the was noavailable space on a disk. The reason was WAL-files ate everything. Master-server couldn't send some WAL-file to their replicas. Bandwidth wasn't enough. Well, I'm searching for a better idea to update the table. Solutions I found. 1. Separate my UPDATE by chunks. 2. Alter a table using a new temporary table, but it's not convenient for me because there is a lot of foreign keys and indexes. 3. Hot-update. This is the most interesting case for me. Speaking of HOT-update https://www.dbrnd.com/2016/03/postgresql-the-awesome-table-fillfactor-to-speedup-update-and-select-statement/ The article says: it might be useful for tables that change often and moreover It would be the best way to increase the speedof UPDATE. So, my questions are will it work for all tuples? It says that - no https://www.dbrnd.com/2016/03/postgresql-alter-table-to-change-fillfactor-value/,but I could not find a confirmation in officialpostresql's documentation. Why do I need to launch vacuum after updating? How should I reduce the better fillfactor? What will be with WAL-files it this case? Thank you! PostgreSQL 9.6 -- Timokhin 'maf' Maxim
Hello,
Does the tale have foreign keys, if not you could create another table may be unlogged and then do the changes you want via INSERT ... SELECT; and finally convert the unlogged table to logged table.
In addition to that there are several ways to increase data writing performance for example the following configuration settings have a impact on write performance: synchronous_commit, commit_delay, max_wal_size, wal_buffers and maintenance_work_mem.
Regards
On Friday, December 22, 2017, 6:59:43 PM GMT+1, Timokhin Maxim <ncx2@yandex.com> wrote:
Hello! We have a large table 11GB ( about 37 million records ) and we need to alter a table - add a new column with default values is false. Also 'NOT NULL' is required.
So, first I've done:
ALTER TABLE clusters ALTER COLUMN "is_paid";
after that:
UPDATE clusters SET is_paid = DEFAULT where ctime <= now() - interval '720h' AND is_paid != FALSE;
Everything went ok. Then I tried to run it again for an interval of 1 years. And I got that no one can't see - the was no available space on a disk. The reason was WAL-files ate everything.
Master-server couldn't send some WAL-file to their replicas. Bandwidth wasn't enough.
Well, I'm searching for a better idea to update the table.
Solutions I found.
1. Separate my UPDATE by chunks.
2. Alter a table using a new temporary table, but it's not convenient for me because there is a lot of foreign keys and indexes.
3. Hot-update. This is the most interesting case for me.
Speaking of HOT-update https://www.dbrnd.com/2016/03/postgresql-the-awesome-table-fillfactor-to-speedup-update-and-select-statement/
The article says: it might be useful for tables that change often and moreover It would be the best way to increase the speed of UPDATE.
So, my questions are will it work for all tuples? It says that - no https://www.dbrnd.com/2016/03/postgresql-alter-table-to-change-fillfactor-value/, but I could not find a confirmation in official postresql's documentation.
Why do I need to launch vacuum after updating?
How should I reduce the better fillfactor?
What will be with WAL-files it this case?
Thank you!
PostgreSQL 9.6
--
Timokhin 'maf' Maxim
On 12/22/2017 05:46 PM, Timokhin Maxim wrote: > > Hello! We have a large table 11GB ( about 37 million records ) and we > need to alter a table - add a new column with default values is > false. Also 'NOT NULL' is required. > > So, first I've done: > > ALTER TABLE clusters ALTER COLUMN "is_paid"; > That seems somewhat incomplete ... what exactly did the ALTER do? > after that: > > UPDATE clusters SET is_paid = DEFAULT where ctime <= now() - interval '720h' AND is_paid != FALSE; > > Everything went ok. Then I tried to run it again for an interval of 1 > years. And I got that no one can't see - the was no available space > on a disk. The reason was WAL-files ate everything. > Master-server couldn't send some WAL-file to their replicas. Bandwidth wasn't enough. > Well, then perhaps the best solution is to add more disk space and/or make sure the network bandwidth is sufficient? In any case, don't forget this may also need to update all indexes on the table, because the new row versions will end up on different pages. So while the table has 11GB, this update may need much more WAL space than that. > Well, I'm searching for a better idea to update the table. > Solutions I found. > 1. Separate my UPDATE by chunks. If this is a one-time change, this is probably the best option. > 2. Alter a table using a new temporary table, but it's not convenient > for me because there is a lot of foreign keys and indexes. Right. > 3. Hot-update. This is the most interesting case for me. > Speaking of HOT-update https://www.dbrnd.com/2016/03/postgresql-the-awesome-table-fillfactor-to-speedup-update-and-select-statement/ > The article says: it might be useful for tables that change often and moreover It would be the best way to increase thespeed of UPDATE. First of all, to make HOT possible there would have to be enough free space on the pages. As you need to update the whole table, that means each table would have to be only 50% full. That's unlikely to be true, and you can't fix that at this point. > So, my questions are will it work for all tuples? It says that - no > https://www.dbrnd.com/2016/03/postgresql-alter-table-to-change- > fillfactor-value/, but I could not find a confirmation in official > postresql's documentation. Not sure I understand your question, but HOT can only happen when two conditions are met: 1) the update does not change any indexed column This is likely met, assuming you don't have an index on is_paid. 2) there's enough space on the same page for the new row version This is unlikely to be true, because the default fillfactor for tables is 90%. You may change fillfactor using ALTER TABLE, but that only applies to new data. Moreover, as the article says - this is useful for tables that change often. Which is not quite what one-time table rewrite does. So HOT is not the solution you're looking for. > Why do I need to launch vacuum after updating? You don't need to launch vacuum - autovacuum will take care of that eventually. But you may do that, to do the cleanup when it's convenient for you. > How should I reduce the better fillfactor? For example to change fillfactor to 75% (i.e. 25% free space): ALTER TABLE t SET (fillfactor = 75); But as I said, it's not a solution for you. > What will be with WAL-files it this case? Not sure what you mean. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Tomas! Thank you for the useful answer! 23.12.2017, 23:58, "Tomas Vondra" <tomas.vondra@2ndquadrant.com>: > On 12/22/2017 05:46 PM, Timokhin Maxim wrote: >> Hello! We have a large table 11GB ( about 37 million records ) and we >> need to alter a table - add a new column with default values is >> false. Also 'NOT NULL' is required. >> >> So, first I've done: >> >> ALTER TABLE clusters ALTER COLUMN "is_paid"; > > That seems somewhat incomplete ... what exactly did the ALTER do? I'll try to explain what exactly I meant. ALTER TABLE clusters ADD COLUMN "is_paid" BOOLEAN NOT NULL DEFAULT FALSE; What exactly I need. But that query would lock the whole table for about 40 minutes. I decided to separate it like: 1. ALTER TABLE clusters ADD COLUMN "is_paid" BOOLEAN DEFAULT FALSE; 2. UPDATE clusters SET is_paid = DEFAULT where ctime <= now() - interval '720h' AND is_paid != FALSE; ( This was neededas soon as possible ) 3. UPDATE another part by chunks 4. set NOT NULL for the table. I was thinking about how to optimize the 3th step. Well, my solution was to write a script which runs two threads. The first one UPDATE "is_paid" by chunks, another one checksmy metrics. If something is becoming wrong first thread stops until metrics are good. Thank you, Tomas. > >> after that: >> >> UPDATE clusters SET is_paid = DEFAULT where ctime <= now() - interval '720h' AND is_paid != FALSE; >> >> Everything went ok. Then I tried to run it again for an interval of 1 >> years. And I got that no one can't see - the was no available space >> on a disk. The reason was WAL-files ate everything. >> Master-server couldn't send some WAL-file to their replicas. Bandwidth wasn't enough. > > Well, then perhaps the best solution is to add more disk space and/or > make sure the network bandwidth is sufficient? > > In any case, don't forget this may also need to update all indexes on > the table, because the new row versions will end up on different pages. > So while the table has 11GB, this update may need much more WAL space > than that. > Got it, thank you! >> Well, I'm searching for a better idea to update the table. >> Solutions I found. >> 1. Separate my UPDATE by chunks. > > If this is a one-time change, this is probably the best option. > Exactly, thank you! >> 2. Alter a table using a new temporary table, but it's not convenient >> for me because there is a lot of foreign keys and indexes. > > Right. > >> 3. Hot-update. This is the most interesting case for me. >> Speaking of HOT-update https://www.dbrnd.com/2016/03/postgresql-the-awesome-table-fillfactor-to-speedup-update-and-select-statement/ >> The article says: it might be useful for tables that change often and moreover It would be the best way to increase thespeed of UPDATE. > > First of all, to make HOT possible there would have to be enough free > space on the pages. As you need to update the whole table, that means > each table would have to be only 50% full. That's unlikely to be true, > and you can't fix that at this point. > >> So, my questions are will it work for all tuples? It says that - no >> https://www.dbrnd.com/2016/03/postgresql-alter-table-to-change- >> fillfactor-value/, but I could not find a confirmation in official >> postresql's documentation. > > Not sure I understand your question, but HOT can only happen when two > conditions are met: > > 1) the update does not change any indexed column > > This is likely met, assuming you don't have an index on is_paid. > > 2) there's enough space on the same page for the new row version > > This is unlikely to be true, because the default fillfactor for tables > is 90%. You may change fillfactor using ALTER TABLE, but that only > applies to new data. > > Moreover, as the article says - this is useful for tables that change > often. Which is not quite what one-time table rewrite does. > > So HOT is not the solution you're looking for. > >> Why do I need to launch vacuum after updating? > > You don't need to launch vacuum - autovacuum will take care of that > eventually. But you may do that, to do the cleanup when it's convenient > for you. > >> How should I reduce the better fillfactor? > > For example to change fillfactor to 75% (i.e. 25% free space): > > ALTER TABLE t SET (fillfactor = 75); > > But as I said, it's not a solution for you. > >> What will be with WAL-files it this case? > > Not sure what you mean. > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello > 1. ALTER TABLE clusters ADD COLUMN "is_paid" BOOLEAN DEFAULT FALSE; this is wrong. To avoid large table lock you need DEFAULT NULL: ALTER TABLE clusters ADD COLUMN "is_paid" BOOLEAN DEFAULT NULL; Default null changes only system catalog, default with any non-null value will rewrite all rows. After adding column youcan set default value - it applied only for future inserts: ALTER TABLE clusters ALTER COLUMN "is_paid" SET DEFAULT FALSE; And then you can update all old rows in table by small chunks. Finally, when here is no NULL values you can set not null: ALTER TABLE clusters ALTER COLUMN "is_paid" SET NOT NULL; But unfortunately this locks table for some time - smaller what rewrite time, but time of full seqscan. I hope my patch [1]will be merged and not null can be set in future by temporary adding check constraint (not valid, then validate) - whichnot require large table lock [1] https://www.postgresql.org/message-id/flat/81911511895540@web58j.yandex.ru#81911511895540@web58j.yandex.ru Regards, Sergei
Hello, Sergey! 09.01.2018, 15:53, "Sergei Kornilov" <sk@zsrv.org>: > Hello > >> 1. ALTER TABLE clusters ADD COLUMN "is_paid" BOOLEAN DEFAULT FALSE; > > this is wrong. To avoid large table lock you need DEFAULT NULL: > ALTER TABLE clusters ADD COLUMN "is_paid" BOOLEAN DEFAULT NULL; > Default null changes only system catalog, default with any non-null value will rewrite all rows. After adding column youcan set default value - it applied only for future inserts: > ALTER TABLE clusters ALTER COLUMN "is_paid" SET DEFAULT FALSE; > > And then you can update all old rows in table by small chunks. Finally, when here is no NULL values you can set not null: What you wrote are exactly I'm doing. Moreover, I'm checking current metrics to avoid previously problems. > ALTER TABLE clusters ALTER COLUMN "is_paid" SET NOT NULL; > But unfortunately this locks table for some time - smaller what rewrite time, but time of full seqscan. I hope my patch[1] will be merged and not null can be set in future by temporary adding check constraint (not valid, then validate)- which not require large table lock Hope your commit will be merged. It will be realy useful. > > [1] https://www.postgresql.org/message-id/flat/81911511895540@web58j.yandex.ru#81911511895540@web58j.yandex.ru > > Regards, Sergei -- Timokhin 'maf' Maxim