Thread: a heavy duty operation on an "unused" table kills my server
Hi list, I'm having a problem when dealing with operations that ask too much CPU from the server.
The scenario is this:
I have a multithreaded server, each thread with its own connection to the database. Everything works fine, outstandingly even, in normal operation.
I have a table named "a" with 1.8 million records, and growing, but I'm OK with that, at least for the moment. Maybe in the near future we will trim it down, back up the old data, and free it up. But this is not the issue; as I said, everything is working great. I have a couple of indexes to help some queries, and that's it.
Now my problem started when I tried to do some model refactoring on this production table.
First I tried a dumb approach.
I connected from pgAdmin and opened a new session.
I tried an ALTER TABLE on this table just to turn a char(255) column into char(250), and it locked up my system.
No surprise, since I had many threads waiting for the ALTER TABLE to finish. What I did not foresee was that it would take so much time. OK, my fault for not estimating how long the ALTER TABLE would take to complete.
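To be explicit, the statement was essentially something like this (the column name here is just made up):

    ALTER TABLE a ALTER COLUMN some_text_column TYPE char(250);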
Now, with this experience, I tried a simple workaround.
I created an empty copy of "a" named "a_empty", identical in every sense.
Then I renamed "a" to "a_full" and "a_empty" to "a". This procedure cost me essentially zero downtime, and everything kept working smoothly. A couple of operations might have failed if they tried to write in the very second during which there was no table named "a", but since each operation is transactional, the worst case was that the client application would just report the error and ask the user to retry. No big deal.
Now, this renamed table is totally detached from the system in every way (nothing references it; it's like a dumpster for old records) and is not being accessed by any other thread, so an ALTER TABLE on it, turning a char(255) into a char(250), should have no effect on the rest of the system.
So, with this in mind, I tried the ALTER TABLE again, this time on the (totally unrelated) "a_full" table.
The system went non-responsive again, and this time it had nothing to do with threads waiting for the ALTER TABLE to complete. The pgAdmin GUI went non-responsive, as did the application server's GUI, whose threads kept working in the background but started taking more and more time for every client request (up to 25 seconds, which is just ridiculous and completely unacceptable under normal conditions).
This caused my client applications to start disconnecting after their operations timed out, and the system basically went down again from the users' point of view.
This time, since I saw no relation between an operation on a totally unrelated table and the server's big slowdown, I blamed the server's memory.
After some tests, I came to the conclusion that any heavy-duty operation on any thread (an ALTER TABLE on a 1.8-million-record table, updates on that table, or an infinite loop, just to make my point) would affect the whole server.
Bottom line is, I can't seem to do any heavy processing on the database (or any operation that pushes the server into high CPU usage) and still expect it to behave normally. Any heavy-duty operation, DDL or DML, on any table (related or unrelated), on any thread, would tear down my server's stability.
My question then is: is there a way to limit the CPU assigned to a specific connection?
I mean, I don't care if my ALTER TABLE takes 4 days instead of 4 hours.
Something like:
pg_set_max_cpu_usage(2/100);
and rest assured that no matter what that connection asks the database to do, it just won't affect the other running threads. Obviously, assuming the operation itself does not take locks that block the other threads.
Is something like that possible?
Thanks in advance,
Eduardo.
Eduardo Piombino wrote:
> Hi list, I'm having a problem when dealing with operations that asks too
> much CPU from the server.
> The scenario is this:

A nice description below, but ... you give no information about your system: number of CPUs, disk types and configuration, how much memory, what have you changed in your Postgres configuration? And what operating system, what version of Postgres, etc., etc. The more information you give, the better the answer.

If you're operating on a single disk with a tiny amount of memory, and an old, misconfigured Postgres on a laptop computer, that's a whole different problem than if you're on a big system with 16 CPUs and a huge RAID 1+0 with battery-backed cache.

Craig
On 13/01/2010 12:59 PM, Eduardo Piombino wrote:
> My question then is: is there a way to limit the CPU assigned to a
> specific connection?
> I mean, I don't care if my ALTER TABLE takes 4 days instead of 4 hours.
>
> Something like:
> pg_set_max_cpu_usage(2/100);

You're assuming the issue is CPU. I think that unlikely.

In general, a single thread/process that wants as much CPU as it can get won't bring any machine with a half-decent OS to its knees. Any UNIX system should barely notice - everything else will slow down somewhat, depending on its scheduler, but in any sane setup shouldn't slow down by more than 1/2. Modern Windows tends to be fairly well behaved here too.

What's much more likely is that you're working with a crappy disk setup - such as a RAID 5 array without battery-backed cache, or a single slow disk. You probably also have quite deep write queuing in the RAID controller / disk / OS. This means that your disk-intensive ALTER TABLE makes your disk subsystem so busy that it takes ages before any other process gets a look-in. It's not unlikely that I/O requests are being queued so deeply that it (often) takes several seconds for the controller to get around to executing a newly submitted read or write request.

If your other queries need to do more than a few steps where they read some data, think about it, and read other data depending on the first read, then they're going to take forever, because they're going to have to endure a long delay before disk access each time.

Of course, that's just a guess, since you've provided no information on your hardware. Try collecting up some of the information shown here:

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems#Things_you_need_to_mention

In any case, if it *is* I/O related, what to do about it depends on exactly what sort of I/O issue it is. Extremely deep queuing? Looks good for throughput benchmarks, but is stupid if you care about latency and have some I/O that's higher priority than others, so reduce your queue depth. Very slow writes hammering reads? Don't use RAID 5. Etc.

--
Craig Ringer
I'm sorry.
The server is a production HP ProLiant server; I don't remember the exact model, but the key features were:
4 cores at over 2 GHz each (I'm sorry, I don't remember the actual specs); I think it had 16 GB of RAM (if that is possible?).
It has two 320 GB disks in RAID (mirrored).
I don't even have the emails with the specs here, but I can give you the exact configuration by tomorrow.
Operating system: Windows 2003 server, with latest patches.
Postgres version: 8.2.4, with all defaults, except DateStyle and TimeZone.
At any given time the server sits at 0% CPU load, with peaks of 1-2% at most, in normal operation.
I've been digging a little in the archives, and one thing that helped me realize is that I don't actually remember seeing high CPU usage (which surprised me), but I do remember seeing high IO activity. I'm sorry, it's pretty late here.
I know this single statement is enough to almost change everything I've just asked.
Please reinterpret my original mail considering that where I said "high CPU usage" it might very well be "high IO usage".
The final effect was that the server went non-responsive for all practical purposes; not even the Task Manager would come up when I hit CTRL-ALT-DEL, and of course every client suffered horrific delays (+20 seconds) for the simplest operations like SELECT NOW();
I've just made a little modification to my original question, extending it to the possibility of an IO usage issue instead of just CPU:
Bottom line is, I can't seem to do any heavy processing on the database (or any operation that pushes the server into high CPU or IO usage) and still expect it to behave normally. Any heavy-duty operation, DDL or DML, on any table (related or unrelated), on any thread, would tear down my server's stability.
My question then is: is there a way to limit the CPU or IO usage assigned to a specific connection?
I mean, I don't care if my ALTER TABLE takes 4 days instead of 4 hours.
Something like:
pg_set_max_cpu_or_io_usage(2/100);
On 13/01/2010 1:47 PM, Eduardo Piombino wrote:
> I'm sorry.
>
> The server is a production server HP Proliant, I don't remember the
> exact model, but the key features were:
> 4 cores, over 2GHz each (I'm sorry I don't remember the actual specs), I
> think it had 16G of RAM (if that is possible?)
> It has two 320G disks in RAID (mirrored).

Plain 'ol SATA disks in RAID-1?

Hardware RAID (and if so, controller model)? With battery backup? Write cache on or off?

Or software RAID? If so, Windows built-in sw raid, or some vendor's fakeraid (Highpoint, Promise, Adaptec, etc)?

Anyway, with two disks in RAID-1 I'm not surprised you're seeing some performance issues with heavy writes, especially since it seems unlikely that you have a BBU hardware RAID controller. In RAID-1 a write must hit both disks, so a 1Mb write effectively costs twice as much as a 1Mb read. Since many controllers try for high throughput (because it looks good in benchmarks) at the expense of latency, they also tend to batch writes into long blocks, which keeps the disks busy in extended bursts. That slaughters read latencies.

I had this sort of issue with a 3Ware 8500-8, and landed up modifying and recompiling the driver to reduce its built-in queue depth. I also increased readahead. It was still pretty awful as I was working with RAID 5 on SATA disks, but it made a big difference and more importantly meant that my Linux server was able to honour `ionice' priorities and feed more important requests to the controller first.

On Windows, I really don't know what to do about it beyond getting a better I/O subsystem. Google may help - look into I/O priorities, queue depths, reducing read latencies, etc.

> I don't even have the emails with the specs here, but I can give you the
> exact configuration by tomorrow.
>
> Operating system: Windows 2003 server, with latest patches.
> Postgres version: 8.2.4, with all defaults, except DateStyle and TimeZone.

Urk. 8.2? Pg on Windows improves a lot with each release, and that's an old, buggy version of 8.2 at that. Looking into an upgrade would be a really, REALLY good idea.

> Please try interpreting again my original mail, considering that when I
> said "high CPU usage" It might very well be "high IO usage".
>
> The final effect was that the server went non-responsive, for all
> matters, not even the TaskManager would come up when i hit CTRL-ALT-DEL,
> and of course, every client would suffer horrific (+20 secs) for the
> simplest operations like SELECT NOW();

That sounds a LOT like horrible read latencies caused by total I/O overload. It could also be running out of memory and swapping heavily, so do keep an eye out for that, but I wouldn't expect to see that with an ALTER TABLE - especially on a 16GB server.

> My question then is: is there a way to limit the CPU or IO
> USAGE assigned to a specific connection?

In win32 you can set CPU priorities manually in Task Manager, but only once you already know the process ID of the Pg backend that's going to be hammering the machine. Not helpful.

I don't know of any way to do per-process I/O priorities in Win32, but I only use win32 reluctantly and don't use it for anything I care about (like a production Pg server) so I'm far from a definitive source.

--
Craig Ringer
Excellent, lots of useful information in your message.
I will follow your advice and keep you posted on any progress. I still have to confirm some technical details of my setup, but I'm pretty sure you hit the nail on the head in any case.
One last question about this IO issue I'm facing: do you think it is just a matter of RAID configuration speed, or a matter of queue gluttony (not leaving room for other processes to get into the IO queue in a reasonable time)?
Because if it were just a matter of speed: OK, with my current RAID configuration let's say it takes 10 minutes to process the ALTER TABLE (leaving no room for other IOs until the ALTER TABLE is done); suppose I then put in the fastest possible RAID setup, or even drop RAID altogether for the sake of speed, and it completes in, say, 10 seconds (an unrealistic assumption). If my table then grows 60 times, I would be facing the very same problem again, even with the best RAID configuration.
The problem would seem to be in the way the OS (or hardware, or someone else, or all of them) is/are inserting the IO requests into the queue.
What can I do to control the order in which these IO requests are finally entered into the queue?
I mean... what I would like to obtain is this:
Consider the ALTER TABLE as a sequence of 100,000 READ/WRITE operations.
Consider the SELECT * FROM xxx as a sequence of 100 READ operations (on totally unrelated parts of the disk).
First I run the ALTER TABLE on one thread...
Let's say that by the time it has generated 1,000 READ/WRITE operations, another thread starts the SELECT * FROM xxx ...
I would expect the IO system to give those 100 READ operations a chance to execute immediately (with no need to wait for the remaining 99,000 READ/WRITE operations to finish), that is, to enter the queue at *almost* the very same moment the IO requests were issued.
If I cannot guarantee that, I'm kind of doomed, because the larger the number of IO operations requested by a "heavy duty operation", the longer it will take any other thread to start doing anything.
What cards do I have to play to manipulate the order in which these IO requests enter the "queue"?
Can I disable this queue?
Should I turn the disks' IO operation caches off?
Avoid some specific disk/RAID vendor, for instance?
I think I have some serious reading to do on this matter; Google will help of course, but as always, any advice, however small it may seem, will be very much appreciated.
Nonetheless, thanks a lot for all the light you already brought me on this matter.
I really appreciate it.
Eduardo.
Eduardo Piombino wrote:
> Postgres version: 8.2.4, with all defaults, except DateStyle and TimeZone.

Ugh... there are several features in PostgreSQL 8.3 and later specifically to address the sort of issue you're running into. If you want to get good write performance out of this system, you may need to upgrade to at least that version. It's impossible to resolve several of the common problems in write operations being too intense using any 8.2 version.

> The final effect was that the server went non-responsive, for all
> matters, not even the TaskManager would come up when i hit
> CTRL-ALT-DEL, and of course, every client would suffer horrific (+20
> secs) for the simplest operations like SELECT NOW();

The thing that you have to realize is that altering a table is basically making a new copy of that table, which is a really heavy amount of writing. It's quite easy for an I/O heavy operation like that to fill up a lot of RAM with data to be written out, and when the database periodically needs to force all that data out to disk the whole system grinds to a halt when it happens. There's no way I'm aware of to throttle that writing down to a reasonable amount under Windows either, to achieve your goal of just making the ALTER run using less resources.

Some reading:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server goes over basic tuning of the database server. If you haven't already increased the checkpoint_segments parameter of your system, that's the first thing to try--increase it *a lot* (32 or more, default is 3) because it can really help with this problem. A moderate increase to shared_buffers is in order too; since you're on Windows, increasing it to 256MB is a reasonable change. The rest of the changes in there aren't likely to help out with this specific problem.

http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm : covers the most likely cause of the issue you're running into. Unfortunately, most of the solutions you'll see there are things changed in 8.3.

--
Greg Smith  2ndQuadrant  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
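P.S. In postgresql.conf terms, that means something like this (starting-point values only, not tuned for your particular hardware; shared_buffers needs a server restart to take effect):

    checkpoint_segments = 32      # default is 3; spreads checkpoint I/O over more WAL
    shared_buffers = 256MB        # moderate value that behaves well on Windows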
Yes, one of the things I will do asap is to migrate to the latest version.
On another occasion I went through the checkpoint parameters you mentioned, but left them untouched since the defaults seemed reasonable.
I'm a little reluctant to change the checkpoint configuration just to let me do a once-in-a-lifetime ALTER.
The checkpoints would then end up too far apart in time (or in traffic).
And the thought of touching and retouching it every time I need to do something different bugs me a little. But if there is no other option I will definitely give it a try.
Are you sure, for instance, that the ALTER command (and the internal data it has to handle, say 1.8 million records * 1024 bytes/record, roughly 1.8 GB) goes to RAM, then to disk, and gets logged in the WAL during the whole process? Maybe it does not get logged at all until the ALTER is completed, since the original table can be left untouched until the new copy of the table is built... Just guessing here.
On 13/01/2010 3:03 PM, Eduardo Piombino wrote:
> One last question, this IO issue I'm facing, do you think it is just a
> matter of RAID configuration speed, or a matter of queue gluttony (and
> not leaving time for other processes to get into the IO queue in a
> reasonable time)?

Hard to say with the data provided. It's not *just* a matter of a slow array, but that might contribute. Specifically, though, by "slow array" in this case I'm looking at latency rather than throughput, particularly read latency under heavy write load. Simple write throughput isn't really the issue, though bad write throughput can make it fall apart under a lighter load than it would otherwise.

High read latencies may not be caused by deep queuing, though that's one possible cause. A controller that prioritizes batching sequential writes efficiently over serving random reads would cause it too - though reducing its queue depth so it can't see as many writes to batch would help.

Let me stress, again, that if you have a decent RAID controller with a battery backed cache unit you can enable write caching and most of these issues just go away. Using an array format with better read/write concurrency, like RAID 10, may help as well.

Honestly, though, at this point you need to collect data on what the system is actually doing, what's slowing it down and where. *Then* look into how to address it. I can't advise you much on that as you're using Windows, but there must be lots of info on optimising Windows I/O latencies and throughput on the 'net...

> Because if it was just a matter of speed, ok, with my actual RAID
> configuration lets say it takes 10 minutes to process the ALTER TABLE
> (leaving no space to other IOs until the ALTER TABLE is done), lets say
> then i put the fastest possible RAID setup, or even remove RAID for the
> sake of speed, and it completes in lets say again, 10 seconds (an unreal
> assumption). But if my table now grows 60 times, I would be facing the
> very same problem again, even with the best RAID configuration.

Only if the issue is one of pure write throughput. I don't think it is. You don't care how long the ALTER takes, only how much it impacts other users. Reducing the impact on other users so your ALTER can complete in its own time without stamping all over other work is the idea.

> The problem would seem to be in the way the OS (or hardware, or someone
> else, or all of them) is/are inserting the IO requests into the queue.

It *might* be. There's just not enough information to tell that yet. You'll need to do quite a bit more monitoring. I don't have the expertise to advise you on what to do and how to do it under Windows.

> What can I do to control the order in which these IO requests are
> finally entered into the queue?

No idea. You probably need to look into I/O priorities on Windows. Ideally you shouldn't have to, though. If you can keep read latencies at sane levels under high write load on your array, you don't *need* to mess with this.

Note that I'm still guessing about the issue being high read latencies under write load. It fits what you describe, but there isn't enough data to be sure, and I don't know how to collect it on Windows.

> What cards do I have to manipulate the order the IO requests are entered
> into the "queue"?
> Can I disable this queue?
> Should I turn disk's IO operation caches off?
> Not use some specific disk/RAID vendor, for instance?

Don't know. Contact your RAID card tech support, Google, search MSDN, etc.

--
Craig Ringer
Eduardo Piombino wrote:
> Maybe it does not get logged at all until the ALTER is completed?

This feature [1] was implemented a few months ago and it will be available only in the next PostgreSQL version (8.5).

[1] http://archives.postgresql.org/pgsql-committers/2009-11/msg00018.php

--
Euler Taveira de Oliveira
http://www.timbira.com/
With that said, I assume my current version of pgsql DOES make all this heavy work go through WAL logging.
The curious thing is that I remember (of course) reviewing the logs from the crash times, and I didn't see anything strange, not even the famous warning along the lines of "you are making checkpoints too often; maybe you should consider extending the checkpoint_segments parameter".
I will check it again.
Besides, I will try to gather as much information on the system itself (RAID controllers, disk vendors, etc).
Thank you, will keep you posted.
On Wed, Jan 13, 2010 at 2:03 AM, Eduardo Piombino <drakorg@gmail.com> wrote:
...
> If I can not guarantee that, I'm kinda doomed, because the largest the
> amount of IO operations requested by a "heavy duty operation", the longest
> it will take any other thread to start doing anything.

One thing you can do - although it's a darn lot of work compared to just running a DDL command - is create a new empty table with the schema you want and then write a script that copies, say, 1000 records from the old table to the new table. If your table has a primary key with a natural sort ordering, it's not too hard to keep track of where you left off the last time and continue on from there. Then you can incrementally get all of your data over without swamping the system. I realize that's a pain in the neck, of course.

I'm kind of surprised that there are disk I/O subsystems that are so bad that a single thread doing non-stop I/O can take down the whole server. Is that normal? Does it happen on non-Windows operating systems? What kind of hardware should I not buy to make sure this doesn't happen to me?

...Robert
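P.S. Purely as a sketch - it assumes the new table ("a_new" here) has already been created with the schema you want, and that there's an integer primary key named "id", which may not match your situation:

    -- copy one batch; :last_id stands for the highest id copied so far
    INSERT INTO a_new
    SELECT *
    FROM a_full
    WHERE id > :last_id
    ORDER BY id
    LIMIT 1000;
    -- note the new max(id), pause briefly, and repeat until no rows are inserted

Throttling it this way trades total elapsed time for a bounded impact on everything else.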
On Tue, Jan 12, 2010 at 9:59 PM, Eduardo Piombino <drakorg@gmail.com> wrote:
...
> So, with this in mind, I tried the ALTER TABLE this time on the "a_full"
> (totally unrelated) table.
> The system went non-responsive again, and this time it had nothing to do
> with threads waiting for the alter table to complete. The pgAdmin GUI went
> non-responsive, as well as the application's server GUI, whose threads kept
> working on the background, but starting to take more and more time for every
> clients request (up to 25 seconds, which are just ridiculous and completely
> unacceptable in normal conditions).

OK, I'm not entirely sure this table is not still locking something else. If you make a copy by doing something like:

select * into test_table from a;

and then alter test_table, do you still get the same problems? If so, then it is an IO issue, most likely. If not, then there is some client connection still referencing this table or something and that could cause this type of behaviour as well.
> OK, I'm not entirely sure this table is not still locking something
> else. If you make a copy by doing something like:
>
> select * into test_table from a;
>
> and then alter test_table do you still get the same problems? If so,
> then it is an IO issue, most likely. If not, then there is some
> client connection still referencing this table or something and that
> could cause this type of behaviour as well.
I can guarantee you that the table is not being referenced by any other thread, table or process, and that it is totally unrelated to everything else in the system.
It's just a plain table, with 1.8 million records, that no thread even knows exists. It has no foreign keys that would suggest a possible "lock" on a parent table, nor is it referenced by any other table in the model. It has no triggers associated with it, and no indexes. It could very well be on another database on the same physical server and still do the same damage. I did not try this, but I'm pretty sure of the outcome. I wouldn't like to bring the server down just to prove it, but I will do it if I find it necessary.
The only things this table has in common with the other tables in the system, as I see it, are:
RAM, IO, and CPU, at a very low level. One of these is being stressed by the backend executing the ALTER, and the other threads (not just pgsql application threads, but system processes in general) suffer from the lack of that resource. All the previous discussion suggests that the resource in question is IO.
The fact that the Task Manager does not even come up would also not be explained by a lock in a client thread.
Besides all that, all the client queries use NOWAIT, so any lock conflict returns immediately, and no retry is attempted until the response gets back to the user and the user confirms it. In that case, the error presented to end users would be "The element is being processed somewhere else", which is my default handler for pgsql error code 55P03, instead of the horrible "Operation timed out" that end users actually got during the huge slowdown/downtime.
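Roughly, the pattern the client threads use is something like this (just a sketch; the actual tables and columns are different):

    BEGIN;
    -- fails immediately with SQLSTATE 55P03 (lock_not_available) if the row is
    -- already locked, instead of blocking behind another transaction
    SELECT * FROM some_table WHERE id = 123 FOR UPDATE NOWAIT;
    -- ... perform the update and commit, or report "element is being processed
    -- somewhere else" to the user when 55P03 is raised ...
    COMMIT;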
On Wed, Jan 13, 2010 at 10:54 AM, Eduardo Piombino <drakorg@gmail.com> wrote:
>> OK, I'm not entirely sure this table is not still locking something
>> else. If you make a copy by doing something like:
>>
>> select * into test_table from a;
>>
>> and then alter test_table do you still get the same problems? If so,
>> then it is an IO issue, most likely. If not, then there is some
>> client connection still referencing this table or something and that
>> could cause this type of behaviour as well.
>
> I can guarantee you that the table is not being referenced by any other
> thread, table or process, and that it is totally unrelated to everything
> else in the system.

If you rename a table that WAS being referenced by other threads, then it might still be being accessed or waited on etc by those threads, as their transaction would have started earlier. The only way you can guarantee it's not being referenced in some way is to create it fresh and new as I suggested and test on that. Until then, your guarantee is based on a belief, not verifiable fact.

I too tend to believe this is an IO problem btw, but claiming that it can't be a problem with some locks without looking at pg_locks at least, is a bit premature.
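For instance, something along these lines would show any lock requests that are not being granted while the ALTER runs (just a sketch, using the 8.2-era pg_locks columns):

    SELECT l.pid, l.locktype, l.mode, l.granted, c.relname
    FROM pg_locks l
    LEFT JOIN pg_class c ON c.oid = l.relation
    WHERE NOT l.granted;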
Robert Haas wrote:
> I'm kind of surprised that there are disk I/O subsystems that are so
> bad that a single thread doing non-stop I/O can take down the whole
> server. Is that normal? Does it happen on non-Windows operating
> systems? What kind of hardware should I not buy to make sure this
> doesn't happen to me?

You can kill any hardware on any OS with the right abusive client. Create a wide table and insert a few million records into it with generate_series one day and watch what it does to queries trying to run in parallel with that.

I think the missing step here to nail down exactly what's happening on Eduardo's system is that he should open up some of the Windows system monitoring tools, look at both disk I/O and CPU usage, and then watch what changes when the troublesome ALTER TABLE shows up.

--
Greg Smith  2ndQuadrant  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
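P.S. To give a concrete idea of the kind of abusive client meant above, something like this will do it (sizes picked arbitrarily; obviously not something to run on a production box):

    -- one long write-heavy statement: a few million wide, padded rows
    CREATE TABLE abuse_me AS
    SELECT i,
           repeat('x', 500) AS pad1,
           repeat('y', 500) AS pad2
    FROM generate_series(1, 3000000) AS s(i);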
Greg, I will post more detailed data as soon as I'm able to gather it.
I was checking whether cancelling the ALTER command worked OK; I might give the ALTER another try and see how much CPU, RAM and IO usage gets involved. I will be doing this monitoring with Process Explorer from Sysinternals, but I don't know how to make it log the results. Do you know of any tool you have used that can help me generate this evidence? I will google a little as soon as possible.
Robert Haas wrote: > I'm kind of surprised that there are disk I/O subsystems that are so > bad that a single thread doing non-stop I/O can take down the whole > server. Is that normal? No. > Does it happen on non-Windows operating > systems? Yes. My 3ware 8500-8 on a Debian Sarge box was so awful that launching a terminal would go from a 1/4 second operation to a 5 minute operation under heavy write load by one writer. I landed up having to modify the driver to partially mitigate the issue, but a single user on the terminal server performing any sort of heavy writing would still absolutely nuke performance. I landed up having dramatically better results by disabling the controller's RAID features, instead exposing each disk to the OS separately and using Linux's software RAID. > What kind of hardware should I not buy to make sure this > doesn't happen to me? 3ware's older cards. Apparently their new ones are a lot better, but I haven't verified this personally. Anything in RAID-5 without a BBU. Anything at all without a BBU, preferably. -- Craig Ringer
On 1/13/2010 11:36 PM, Craig Ringer wrote: > Robert Haas wrote: > >> I'm kind of surprised that there are disk I/O subsystems that are so >> bad that a single thread doing non-stop I/O can take down the whole >> server. Is that normal? > > No. > >> Does it happen on non-Windows operating >> systems? > > Yes. My 3ware 8500-8 on a Debian Sarge box was so awful that launching a > terminal would go from a 1/4 second operation to a 5 minute operation > under heavy write load by one writer. I landed up having to modify the > driver to partially mitigate the issue, but a single user on the > terminal server performing any sort of heavy writing would still > absolutely nuke performance. On a side note, on linux, would using the deadline scheduler resolve that? -Andy
> "high CPU usage" It might very well be "high IO usage". Try this : Copy (using explorer, the shell, whatever) a huge file. This will create load similar to ALTER TABLE. Measure throughput, how much is it ? If your server blows up just like it did on ALTER TABLE, you got a IO system problem. If everything is smooth, you can look into other things. How's your fragmentation ? Did the disk ever get full ? What does the task manager say (swap in/out, disk queue lengthn etc) PS : try a separate tablespace on another disk.
Andy Colson wrote: > On 1/13/2010 11:36 PM, Craig Ringer wrote: >> Yes. My 3ware 8500-8 on a Debian Sarge box was so awful that launching a >> terminal would go from a 1/4 second operation to a 5 minute operation >> under heavy write load by one writer. I landed up having to modify the >> driver to partially mitigate the issue, but a single user on the >> terminal server performing any sort of heavy writing would still >> absolutely nuke performance. > > On a side note, on linux, would using the deadline scheduler resolve > that? I've never seen the deadline scheduler resolve anything. If you're out of I/O capacity and that's blocking other work, performance is dominated by the policies of the underlying controller/device caches. Think about it a minute: disks nowadays can easily have 32MB of buffer in them, right? And random read/write operations are lucky to clear 2MB/s on cheap drives. So once the drive is filled with requests, you can easily sit there for ten seconds before the scheduler even has any input on resolving the situation. That's even more true if you've got a larger controller cache in the mix. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On 1/14/2010 12:07 PM, Greg Smith wrote: > Andy Colson wrote: >> On 1/13/2010 11:36 PM, Craig Ringer wrote: >>> Yes. My 3ware 8500-8 on a Debian Sarge box was so awful that launching a >>> terminal would go from a 1/4 second operation to a 5 minute operation >>> under heavy write load by one writer. I landed up having to modify the >>> driver to partially mitigate the issue, but a single user on the >>> terminal server performing any sort of heavy writing would still >>> absolutely nuke performance. >> >> On a side note, on linux, would using the deadline scheduler resolve >> that? > > I've never seen the deadline scheduler resolve anything. If you're out > of I/O capacity and that's blocking other work, performance is dominated > by the policies of the underlying controller/device caches. Think about > it a minute: disks nowadays can easily have 32MB of buffer in them, > right? And random read/write operations are lucky to clear 2MB/s on > cheap drivers. So once the drive is filled with requests, you can easily > sit there for ten seconds before the scheduler even has any input on > resolving the situation. That's even more true if you've got a larger > controller cache in the mix. > That makes sense. So if there is very little io, or if there is way way too much, then the scheduler really doesn't matter. So there is a slim middle ground where the io is within a small percent of the HD capacity where the scheduler might make a difference? -Andy
Andy Colson wrote: > So if there is very little io, or if there is way way too much, then > the scheduler really doesn't matter. So there is a slim middle ground > where the io is within a small percent of the HD capacity where the > scheduler might make a difference? That's basically how I see it. There seem to be people who run into workloads in the middle ground where the scheduler makes a world of difference. I've never seen one myself, and suspect that some of the reports of deadline being a big improvement just relate to some buginess in the default CFQ implementation that I just haven't encountered. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Regarding the hardware the system is running on:
It's an HP Proliant DL-180 G5 server.
Here are the specs... our actual configuration only has one CPU, and 16G of RAM.
The model of the 2 disks I will post later today, when I get to the server.
Sorry for the delay, I was busy with many things.
http://h18000.www1.hp.com/products/quickspecs/12903_na/12903_na.HTML
http://h18004.www1.hp.com/products/quickspecs/DS_00126/DS_00126.pdf
At A Glance
The HP ProLiant DL180 G5 is a low cost high capacity storage optimized 2-way server that delivers on a history of design excellence and 2U density for a variety of rack deployments and applications.
- Processors:
- Supports up to two Quad-Core Intel® Xeon® processors: 5400 sequence with 12MB Level 2 cache
- Intel® 5100 Chipset
- Memory:
- Up to 32 GB of memory supported by six (6) PC2-5300 (667 MHz) DDR2 memory slots
- Internal Drive Support:
- Supports up to twelve via CTO with controller or up to eight via BTO with the addition of a controller:
- Hot Plug Serial ATA (SATA) 3.5"hard drives; or
- Hot Plug Serial Attached SCSI (SAS) 3.5"hard drives
NOTE: 4 hard drives are supported standard via BTO. 8 hard drive support requires the addition of a Smart Array or HBA controller. Hot Plug and SAS functionality require the addition of a Smart Array or HBA controller. 12 hard drive support available via CTO only and requires a SAS controller that supports expanders.
- Internal storage capacity:
- SATA Models: Up to 12.0TB (12 x 1TB Hot Plug 3.5" hard drives)
- SAS Model: Up to 12.0TB (12 x 1TB Hot Plug 3.5" hard drives)
- Network Controller:
- One integrated NC105i PCI-e Gigabit NIC (embedded) (Wake on LAN and PXE capable)
- Storage Controllers:
- HP Embedded SATA RAID Controller (up to 4 hard drive support on standard BTO models)
NOTE: Transfer rate 1.5 Gb/s SATA
- Expansion Slots:
- One available Low Profile x8 PCI-Express slot using a Low profile Riser.
- Two Full Height/ Full Length Riser options
- Option 1: 2 full-length/full-height PCI-Express x8 connector slots (x4 electrical - Standard)
- Option 2: full-length/full-height riser with 2 PCI-X slots (Optional)
- Infrastructure Management:
- Optional HP Lights Out 100c Remote Management card with Virtual KVM and Virtual Media support (includes IPMI2.0 and SMASH support)
- USB Ports:
- Seven USB ports (2) front, (4) rear, (1) internal
- Optical Drive:
- Support for one:
- Optional Multi-bay DVD
- Optional Floppy (USB only, USB key)
- Power Supply:
- 750W Power Supply (Optional Redundancy Hot Plug, Autoswitching) CSCI 2007/8
- 1200W High Efficiency Power Supply (Optional Redundancy Hot Plug, Autoswitching) (Optional) CSCI 2007/8
- NOTE: Climate Savers Computing Initiative, 2007-2008 Compliant
- Form Factor:
- 2U rack models
Regarding the SATA RAID controller, the other spec pages say that the 8-disk model (ours) comes with a Smart Array E200. I will try to check whether we are using the original embedded controller, since I recall hearing that our disks are SAS (Serial Attached SCSI), and I don't know if it is possible to connect those disks to the embedded Smart Array E200 controller. Would that be possible?
Regarding the EA-200 card, here are the specs.
It seems it has support for SAS disks, so most probably we are using the embedded/default controller.
http://h18000.www1.hp.com/products/quickspecs/12460_div/12460_div.html
http://h18000.www1.hp.com/products/quickspecs/12460_div/12460_div.pdf
Key Features
- Seamless upgrades from past generations and upgrades to next generation HP high performance and high capacity Serial Attached SCSI Smart Array controllers.
- 3G SAS technology delivers high performance and data bandwidth up to 300 MB/s per physical link and contains full compatibility with 1.5G SATA technology.
- x4 2.5G PCI Express host interface technology delivers high performance and data bandwidth up to 2 GB/s maximum bandwidth.
- Addition of the battery backed cache upgrade enables BBWC, RAID 5, Capacity Expansion, RAID migration, and Stripe Size Migration.
- Mix-and-match SAS and SATA hard drives, lets you deploy drive technology as needed to fit your computing environment.
- Support for up to 2 TB in a single logical drive.
- Software consistency among all Smart Array family products: Array Configuration Utility (ACU), Option ROM Configuration for Arrays (ORCA), Systems Insight Manager, Array Diagnostic Utility (ADU) and SmartStart. Some of these features are not available with ProLiant 100 series platforms.
- The SA-E200 controller supports up to 8 drives. The SA-E200i supports 2-8 drives depending on the server implementation.
Performance
HP's High Performance Architecture sets new boundaries of industry performance expectations!
- 3Gb/s SAS (300MB/s bandwidth per physical link)
- x8 3Gb/s SAS physical links (compatible with 1.5G SATA)
- 64 MB or 128 MB DDR1-266 battery-backed cache provides up to 4.2 GB/s maximum bandwidth.
- x4 2.5G PCI Express host interface provides 2 GB/s maximum bandwidth.
- MIPS 32-bit Processor
- Read ahead caching
- Write-back caching (with battery-backed write cache upgrade)
Capacity
Given the increasing need for high performance and rapid capacity expansion, the SA-E200 offers:
- Up to 6TB of total storage with 6 x 1TB SATA MDL hard drives (3.5")
NOTE: Support for greater than 2TB in a single logical drive.
- Up to 2.4TB of total storage with 8 x 300GB SFF SAS hard drives
On Thu, 14 Jan 2010, Greg Smith wrote: > Andy Colson wrote: >> So if there is very little io, or if there is way way too much, then the >> scheduler really doesn't matter. So there is a slim middle ground where >> the io is within a small percent of the HD capacity where the scheduler >> might make a difference? > > That's basically how I see it. There seem to be people who run into > workloads in the middle ground where the scheduler makes a world of > difference. I've never seen one myself, and suspect that some of the reports > of deadline being a big improvement just relate to some buginess in the > default CFQ implementation that I just haven't encountered. That's the perception I get. CFQ is the default scheduler, but in most systems I have seen, it performs worse than the other three schedulers, all of which seem to have identical performance. I would avoid anticipatory on a RAID array though. It seems to me that CFQ is simply bandwidth limited by the extra processing it has to perform. Matthew -- Experience is what allows you to recognise a mistake the second time you make it.
Matthew Wakeling wrote: > On Thu, 14 Jan 2010, Greg Smith wrote: >> Andy Colson wrote: >>> So if there is very little io, or if there is way way too much, then >>> the scheduler really doesn't matter. So there is a slim middle >>> ground where the io is within a small percent of the HD capacity >>> where the scheduler might make a difference? >> >> That's basically how I see it. There seem to be people who run into >> workloads in the middle ground where the scheduler makes a world of >> difference. I've never seen one myself, and suspect that some of the >> reports of deadline being a big improvement just relate to some >> buginess in the default CFQ implementation that I just haven't >> encountered. > > That's the perception I get. CFQ is the default scheduler, but in most > systems I have seen, it performs worse than the other three schedulers, > all of which seem to have identical performance. I would avoid > anticipatory on a RAID array though. I thought the best strategy for a good RAID controller was NOOP. Anything the OS does just makes it harder for the RAID controller to do its job. With a direct-attached disk, the OS knows where the heads are, but with a battery-backed RAID controller, the OS has no idea what's actually happening. Craig
On Fri, 15 Jan 2010, Craig James wrote: >> That's the perception I get. CFQ is the default scheduler, but in most >> systems I have seen, it performs worse than the other three schedulers, all >> of which seem to have identical performance. I would avoid anticipatory on >> a RAID array though. > > I thought the best strategy for a good RAID controller was NOOP. Agreed. That's what we use here. My observation is though that noop is identical in performance to anticipatory and deadline. Theoretically, it should be faster. Matthew -- "Take care that thou useth the proper method when thou taketh the measure of high-voltage circuits so that thou doth not incinerate both thee and the meter; for verily, though thou has no account number and can be easily replaced, the meter doth have one, and as a consequence, bringeth much woe upon the Supply Department." -- The Ten Commandments of Electronics
Matthew Wakeling wrote: > CFQ is the default scheduler, but in most systems I have seen, it > performs worse than the other three schedulers, all of which seem to > have identical performance. I would avoid anticipatory on a RAID array > though. > > It seems to me that CFQ is simply bandwidth limited by the extra > processing it has to perform. I'm curious what you are doing when you see this. I've got several hundred hours worth of pgbench data on CFQ vs. deadline from a couple of systems collected over the last three years, and I've never seen either a clear deadline win or a major CFQ failing. Most results are an even tie, with the occasional mild preference for CFQ under really brutal loads. My theory has been that the "extra processing it has to perform" you describe just doesn't matter in the context of a fast system where physical I/O is always the bottleneck. I'd really like to have a compelling reason to prefer deadline, because the concept seems better, but I just haven't seen the data to back that up. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Eduardo Piombino wrote: > Going to the disk properties (in windows), I just realized it does not > have the Write Cache enabled, and it doesn't also allow me to set it > up. I've read in google that the lack of ability to turn it on (that > is, that the checkbox remains checked after you apply the changes), > has to do with the lack of battery backup in the controller (which is > default bundle option for embedded EA-200, which is our case). > > Regarding actual disk performance, I did some silly tests: > Copied a 496 Mbytes file from a folder to another folder in C: and it > took almost 90 secs. > That would be 496MB/90 sec = 5.51MB/sec > I'd suggest http://www.hdtune.com/ as a better way to test transfer speed here across the drive(s). I think you'll find that your server continues to underperform expectations until you get the battery installed that allows turning the write cache on. A quick look at HP's literature suggests they believe you only need the battery to enable the write-cache if you're using RAID5. That's completely wrong for database use, where you will greatly benefit from it regardless of underlying RAID setup. If you've got an EA-200 but don't have a battery for it to unlock all the features, you're unlikely to find a more cost-effective way to improve your system than to buy one. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
I will give it a try, thanks.
However, besides all the analysis and tests and stats that I've been collecting, I think the point of discussion turned into whether my hardware is good enough, and whether it can keep up with normal, or even the heaviest, user load. If that is the question, the answer is yes, it does. The whole system performs outstandingly well under the maximum stress users can ever generate.
Of course it could be better, and that would of course be fantastic, but I have the feeling that in this setup, buying more hardware, replacing parts, etc., would be just a temporary fix (maybe temporary = forever in this context). I'm not saying that I won't buy that battery for the card, because it will greatly boost my performance for this kind of administrative background task, but my point is that the current hardware seems more than sufficient for current user needs.
What I'm trying to say is that:
I think pg is wasting resources, it could be very well taking advantage of, if you guys just tell me get better hardware. I mean ... the IO subsystem is obviously the bottleneck of my system. But most of the time it is on very very light load, actually ALL of the time, unless I do some heavy background processing like the original ALTER, or the procedure that updates 800.000 rows. What I would consider to be a great feature, would be able to tell pgsql, that a certain operation, is not time critical, so that it does not try to use all the available IO subsystem at once, but rather rationalize its use. Maybe that's the operating system / disk controller / other system component responsibility, and in this case it's others module fault, that it turns out that a single process has full control over a shared resource.
The "whole system" failed to rationalize the IO subsystem use, and I agree that it is not pgsql fault, at all.
But already knowing that the base system (i.e. components out of pg's control, like OS, hardware, etc) may be "buggy" or that it can fail in rationalizing the IO, maybe it would be nice to tell to whoever is responsible for making use of the IO subsystem (pg_bg_writer?), to use it in a moderately manner. That is ... This operation is not critical, please do not trash my system because it is not necessary. Use all the delays you would like, go slowly, please, I don't really care if you take a month. Or at least, be aware of current status of the IO system. If it is being busy, slow down, if it is free, speed up. Of course I won't care if it takes less time to complete.
Today, one can rationalize use of CPU, with a simple pg_sleep() call.
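For illustration, the same trick can be stretched to cover I/O by chunking a bulk update and sleeping between chunks. A rough sketch only: the table, flag column and batch size are made up, and a PL/pgSQL function still runs as a single transaction, so a very long job is better driven batch-by-batch from the client:

CREATE OR REPLACE FUNCTION slow_backfill() RETURNS void AS $$
DECLARE
    max_id integer;
    cur_id integer := 0;
BEGIN
    SELECT max(id) INTO max_id FROM a_full;  -- assumes an integer key "id"
    WHILE cur_id < max_id LOOP
        UPDATE a_full
           SET some_flag = true              -- hypothetical column
         WHERE id > cur_id AND id <= cur_id + 1000;
        cur_id := cur_id + 1000;
        PERFORM pg_sleep(1);                 -- give other I/O a chance
    END LOOP;
END;
$$ LANGUAGE plpgsql;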
It would be nice to have maybe an ALTER table option (for ALTERs) or an option in the BEGIN transaction command, that would let me say:
BEGIN SLOW TRANSACTION;
or BEGIN TRANSACTION RATIONALIZE_IO;
indicating that all the IO operations that are going to be performed in this transaction, are not time critical, and thus, there is no need to put the system in risk of a IO storm, just for a silly set of updates, that no one is waiting for.
So if that feature were available, there would be no need for me (or maybe thousands of pg users) to upgrade hardware just to be able to perform a single, unrelated operation. I mean, the hardware is there, and it is working pretty well. If I could just tell pg that I don't care if an operation takes all the time in the world, I think that would be awesome, and it would be making the MOST of every possible hardware configuration.
I truly love pg. I just feel that something is not quite right the moment I am required to upgrade my hardware, knowing that at any given time I have 90% of the IO subsystem idle, which could very well be put to better use and would otherwise be completely wasted.
Well thank you, just some thoughts. And if the idea of a RATIONALIZED transaction picks up, I would be more than glad to help implement it or to help in any other way I can.
Best regards,
Eduardo.
Eduardo Piombino wrote: > But already knowing that the base system (i.e. components out of pg's > control, like OS, hardware, etc) may be "buggy" or that it can fail in > rationalizing the IO, maybe it would be nice to tell to whoever is > responsible for making use of the IO subsystem (pg_bg_writer?), to use > it in a moderately manner. That is ... This operation is not critical, > please do not trash my system because it is not necessary. Use all the > delays you would like, go slowly, please, I don't really care if you > take a month. Or at least, be aware of current status of the IO > system. If it is being busy, slow down, if it is free, speed up. Of > course I won't care if it takes less time to complete. There are three problems here: 1) The background writer does not have a central role in the I/O of the system, and even if it did that would turn into a scalability issue. Clients initiate a lot of work on their own, and it's not so easy to actually figure out where to put a limiter at given that. 2) PostgreSQL aims to be cross-platform, and writing something that adjusts operations based on what the OS is doing requires a lot of OS-specific code. You end up needing to write a whole new library for every platform you want to support. 3) Everyone who is spending money/time improving PostgreSQL has things they think are more important to work on than resource limiters, so there's just not anybody working on this. Your request is completely reasonable and there are plenty of uses for it. It's just harder than it might seem to build. One day we may find someone with money to spend who can justify sponsoring development in this area because it's a must-have for their PostgreSQL deployment. I assure you that any number of people reading this list would be happy to quote out that job. But right now, there is no such sponsor I'm aware of. That means the best we can do is try and help people work around the issues they do run into in the most effective way possible, which in your case has wandered into this investigation of your underlying disk subsystem. It's not that we don't see that an alternate approach would make the problem go away, the code needed just isn't available, and other project development work (like the major replication advance that was just committed today) are seen as more important. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Eduardo Piombino wrote: > I think pg is wasting resources, it could be very well taking advantage > of, if you guys just tell me get better hardware. I mean ... the IO > subsystem is obviously the bottleneck of my system. But most of the time > it is on very very light load, actually ALL of the time, unless I do > some heavy background processing like the original ALTER, or the > procedure that updates 800.000 rows. What I would consider to be a great > feature, would be able to tell pgsql, that a certain operation, is not > time critical, so that it does not try to use all the available IO > subsystem at once, but rather rationalize its use. Rate-limiting (or preferably prioritizing) I/O from Pg would be nice. It's already possible to prioritize I/O from Pg, though, albeit somewhat clumsily: http://wiki.postgresql.org/wiki/Priorities ... as the OS provides I/O priority features. Pg shouldn't have to re-implement those, only provide more convenient access to them. ( On Windows? Who knows. If you find out how to set I/O priorities on Windows please extend that article! ) The trouble is that if you have a crappy RAID setup, the OS's I/O priorities may be ineffective. The OS will do its best to prioritize anything else over your don't-care-how-long-it-takes backend's query, but if the RAID controller is queuing requests seconds-deep nothing the OS does will make any real difference. To my eternal frustration, there don't seem to be any RAID controllers that have any concept of I/O priorities. I'd love Linux to be able to submit requests to different queues within the controller depending on priority, so low priority requests only got serviced if the higher-priority queue was empty. AFAIK there isn't really anything like that out there, though - all the RAID controllers seem to be built for overall throughput at the expense of request latency to one extent or another. So ... your can prioritize I/O in the OS as much as you like, but your RAID controller may merrily undermine all your work. Doing it within Pg would suffer from many of the same issues. Pg has no way to know how deeply the controller is queuing requests and when it's actually finished a request, so it it's very hard for Pg to rate-limit it's I/O effectively for low-priority work. It doesn't know how to strike a balance between sending requests too fast (ruining latency for higher priority work) and sending far too few (so taking forever for the low priority work). What's insanely conservative on some hardware is insanely too much to ask from other hardware. To be sure the controller is done with a set of writes and ready for another, you'd have to fsync() and that'd be murderous on performance, completely ruining any benefits gained from pacing the work. It's also complicated by the fact that Pg's architecture is very poorly suited to prioritizing I/O based on query or process. (AFAIK) basically all writes go through shared_buffers and the bgwriter - neither Pg nor in fact the OS know what query or what backend created a given set of block writes. To be able to effectively prioritize I/O you'd really have to be able to bypass the bgwriter, instead doing the writes direct from the low priority backend after ionice()ing or otherwise setting up low OS-level I/O priorities. Even then, RAID-controller level queuing and buffering might land up giving most of the I/O bandwidth to the low priority work anyway. 
I guess some kind of dynamic rate-limiting could theoretically also allow Pg to write at (say) 50% of the device's write capacity at any given time, but the multiple layers of buffering and the dynamic load changes in the system would make it incredibly hard to effectively evaluate what the system's write capacity actually was. You'd probably have to run a dedicated Pg benchmark to generate some parameters to calibrate low priority write rates... but they'd still change depending on the random vs seq I/O mix of other processes and Pg backends on the system, the amount of fsync() activity, etc etc etc. It's a more complicated (!) version of the problem of rate-limiting TCP/IP data sending. ( Actually, implementing something akin to TCP/IP connection rate limiting for allocating I/O write bandwidth in low-priority connections would be ... fascinating. I'm sure the people who write OS write schedulers and priority systems like ionice have looked into it and found reasons why it's not suitable. ) The point of all that rambling: it's not as easy as just adding query priorities to Pg! > responsible for making use of the IO subsystem (pg_bg_writer?), to use > it in a moderately manner. That is ... This operation is not critical, > please do not trash my system because it is not necessary. Use all the > delays you would like, go slowly, please, I don't really care if you > take a month. Trouble is, that's a rather rare case. Usually you *do* care if it takes a month vs a week, because you're worried about lock times. > Or at least, be aware of current status of the IO system. > If it is being busy, slow down, if it is free, speed up. Of course I > won't care if it takes less time to complete. There just isn't the visibility into the OS and hardware level to know that. Alas. At best you can measure how long it takes for the OS to return from an I/O request or fsync() ... but all the caching and buffering and queuing means that bears little relationship to the capacity of the system. > Today, one can rationalize use of CPU, with a simple pg_sleep() call. > It would be nice to have maybe an ALTER table option (for ALTERs) or an > option in the BEGIN transaction command, that would let me say: > BEGIN SLOW TRANSACTION; > or BEGIN TRANSACTION RATIONALIZE_IO; > indicating that all the IO operations that are going to be performed in > this transaction, are not time critical, and thus, there is no need to > put the system in risk of a IO storm, just for a silly set of updates, > that no one is waiting for. I'd love that myself - if it could be made to work fairly simply. I'm not sure it can. In reality it'd probably have to look more like: BEGIN SLOW TRANSACTION WITH io_max_ops_persec = 5 io_max_bytes_written_persec = 10000; where those params would pretty much be "make it up and see what works" stuff with a bit of benchmark guidance. Maybe that'd still be useful. If so, you'd need to answer how to separate such low-priority I/O out so the bgwriter could rate-limit it separately, or how to bypass the bgwriter for such I/O. -- Craig Ringer
Craig Ringer wrote: > It's also complicated by the fact that Pg's architecture is very poorly > suited to prioritizing I/O based on query or process. (AFAIK) basically > all writes go through shared_buffers and the bgwriter - neither Pg nor > in fact the OS know what query or what backend created a given set of > block writes. You're correct that all writes go through shared_buffers, and all information about the query that dirties the page in the first place is gone by the time it's written out. In 8.3 and later, buffers get written three ways: (1) A backend needs to allocate a buffer to do some work. The buffer it is allocated is dirty. In this case, the backend itself ends up writing the page out. (2) The background writer monitors how many allocations are going on, and it tries to keep ahead of the backends by writing pages likely to be re-used in the near future out before (1) happens. (This is the part that was different in earlier versions--the background writer just roamed the whole buffer cache looking for work to do before, unrelated to the amount of activity on the system). (3) Checkpoints (which are also executed by the background writer) have to write out every dirty buffer in order to reconcile everything between memory and disk. One reason you can't just ionice the backend and make all the problems go away is (3); you can't let a sluggish backend stop checkpoints from happening. You might note that only one of these sources--a backend allocating a buffer--is connected to the process you want to limit. If you think of the problem from that side, it actually becomes possible to do something useful here. The most practical way to throttle something down without a complete database redesign is to attack the problem via allocation. If you limited the rate of how many buffers a backend was allowed to allocate and dirty in the first place, that would be extremely effective in limiting its potential damage to I/O too, albeit indirectly. Trying to limit the damage on the write and OS side instead is a dead end, you'll never make that work without a major development job--one that I would bet against ever being committed even if someone did it for a specific platform, because they're all going to be so different and the code so hackish. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
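Those three write paths are visible in the pg_stat_bgwriter view (8.3-era columns), which makes it easy to see where a given system's writes actually come from:

SELECT buffers_backend,     -- (1) written by backends allocating buffers
       buffers_clean,       -- (2) written by the background writer
       buffers_checkpoint   -- (3) written during checkpoints
  FROM pg_stat_bgwriter;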
Greg Smith <greg@2ndquadrant.com> writes: > You might note that only one of these sources--a backend allocating a > buffer--is connected to the process you want to limit. If you think of > the problem from that side, it actually becomes possible to do something > useful here. The most practical way to throttle something down without > a complete database redesign is to attack the problem via allocation. > If you limited the rate of how many buffers a backend was allowed to > allocate and dirty in the first place, that would be extremely effective > in limiting its potential damage to I/O too, albeit indirectly. This is in fact exactly what the vacuum_cost_delay logic does. It might be interesting to investigate generalizing that logic so that it could throttle all of a backend's I/O not just vacuum. In principle I think it ought to work all right for any I/O-bound query. But, as noted upthread, this is not high on the priority list of any of the major developers. regards, tom lane
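For reference, the knobs behind that logic are per-session settings, though they currently apply only to VACUUM (and ANALYZE). The values below are purely illustrative, with vacuum_cost_delay raised from its default of 0 to switch the throttling on:

SET vacuum_cost_delay = 20;       -- ms to sleep each time the cost limit is reached
SET vacuum_cost_limit = 200;      -- accumulated cost that triggers a sleep
SET vacuum_cost_page_hit = 1;     -- cost of a page found in shared buffers
SET vacuum_cost_page_miss = 10;   -- cost of a page read from disk
SET vacuum_cost_page_dirty = 20;  -- cost of dirtying a page
VACUUM a_full;                    -- this VACUUM now paces its own I/O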
Tom Lane wrote: > This is in fact exactly what the vacuum_cost_delay logic does. > It might be interesting to investigate generalizing that logic > so that it could throttle all of a backend's I/O not just vacuum. > In principle I think it ought to work all right for any I/O-bound > query. > So much for inventing a new idea; never considered that parallel before. The logic is perfectly reusable, not so sure how much of the implementation would be though. I think the main difference is that there's one shared VacuumCostBalance to worry about, whereas each backend that might be limited would need its own clear scratchpad to accumulate costs into. That part seems similar to how the new EXPLAIN BUFFERS capability instruments things though, which was the angle I was thinking of approaching this from. Make that instrumenting more global, periodically compute a total cost from that instrument snapshot, and nap whenever the delta between the cost at the last nap and the current cost exceeds your threshold. Bet I could find some more consumers in user land who'd love to watch that instrumented data too, if it were expanded to be available for operations beyond just plan execution. I know it would make a lot of jobs easier if you could measure "that <x> statement cost you <y>" for more than just queries--for example, tracking whether any given UPDATE goes outside of the buffer cache or not would be fascinating tuning fodder. Ditto if you could get a roll-up of everything a particular connection did. The part specific to the rate limiting that I don't have any good idea about yet is where to put the napping logic at, such that it would work across everything an I/O limited backend might do. The only common point here seems to be the calls into the buffer manager code, but since that's happening with locks and pins you can't sleep in there. Not enthusiastic about sprinkling every type of backend operation with a call to some nap check routine. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
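For anyone not familiar with the instrumentation being referenced, the 9.0-era syntax is simply (the query itself is only an example):

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM a_full;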
On Sat, Jan 16, 2010 at 4:09 AM, Greg Smith <greg@2ndquadrant.com> wrote: > Tom Lane wrote: >> >> This is in fact exactly what the vacuum_cost_delay logic does. >> It might be interesting to investigate generalizing that logic >> so that it could throttle all of a backend's I/O not just vacuum. >> In principle I think it ought to work all right for any I/O-bound >> query. >> > > So much for inventing a new idea; never considered that parallel before. > The logic is perfectly reusable, not so sure how much of the implementation > would be though. > > I think the main difference is that there's one shared VacuumCostBalance to > worry about, whereas each backend that might be limited would need its own > clear scratchpad to accumulate costs into. That part seems similar to how > the new EXPLAIN BUFFERS capability instruments things though, which was the > angle I was thinking of approaching this from. Make that instrumenting more > global, periodically compute a total cost from that instrument snapshot, and > nap whenever the delta between the cost at the last nap and the current cost > exceeds your threshold. Seems like you'd also need to think about priority inversion, if the "low-priority" backend is holding any locks. ...Robert
Robert Haas wrote: > Seems like you'd also need to think about priority inversion, if the > "low-priority" backend is holding any locks. > Right, that's what I was alluding to in the last part: the non-obvious piece here is not how to decide when the backend should nap because it's done too much I/O, it's how to figure out when it's safe for it to do so without causing trouble for others. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
> Seems like you'd also need to think about priority inversion, if the
> "low-priority" backend is holding any locks.
I'm not sure that priority inversion would be right in this scenario, because in that case the IO storm would still be able to exist, in the cases where the slow jobs collide with the need-to-remain-fast (aka real-time) operations on some lock. I'm using pg in a real-time environment communicating with many different pieces of hardware, which all produce a light load but all require real-time response times, and allowing a priority inversion would indirectly allow IO storms in those cases, going back to where everything started.
However, if such a mechanism were to be implemented, maybe it (the inversion of priorities) could be left as an option in the configuration that could be turned on or off. In my case I would just leave it off, but maybe some applications would find it useful, knowing that IO storms may still appear given a series of conditions.
In the case where priority inversion is not to be used, I would however still greatly benefit from the slow jobs/fast jobs mechanism, just being extra-careful that the slow jobs, obviously, did not acquire any locks that a fast job would ever require. This alone would be, still, a *huge* feature if it was ever to be introduced, reinforcing the real-time awareness/requirements, that many applications look for today.
Eduardo Piombino wrote: > In the case where priority inversion is not to be used, I would > however still greatly benefit from the slow jobs/fast jobs mechanism, > just being extra-careful that the slow jobs, obviously, did not > acquire any locks that a fast job would ever require. This alone would > be, still, a *huge* feature if it was ever to be introduced, > reinforcing the real-time awareness/requirements, that many > applications look for today. In this context, "priority inversion" is not a generic term related to running things with lower priorities. It means something very specific: that you're allowing low-priority jobs to acquire locks on resources needed by high-priority ones, and therefore blocking the high-priority ones from running effectively. Unfortunately, much like deadlock, it's impossible to avoid the problem in a generic way just by being careful. It's one of the harder issues that needs to be considered in order to make progress on implementing this feature one day. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith <greg@2ndquadrant.com> writes: > In this context, "priority inversion" is not a generic term related to > running things with lower priorities. It means something very > specific: that you're allowing low-priority jobs to acquire locks on > resources needed by high-priority ones, and therefore blocking the > high-priority ones from running effectively. Unfortunately, much like > deadlock, it's impossible to avoid the problem in a generic way just by > being careful. It's one of the harder issues that needs to be > considered in order to make progress on implementing this feature one day. It might be worth remarking on how the vacuum_cost_delay logic deals with the issue. Basically, the additional sleeps are carefully inserted only at places where we are not holding any low-level locks (such as buffer content locks). We do not do anything about the table-level lock that vacuum has got, but vacuum's table lock is weak enough that it won't block most ordinary queries. So in typical circumstances it's not a problem if vacuum runs for a very long time. But you can definitely get burnt if you have a competing session trying to acquire an exclusive lock on the table being vacuumed, or if you enable vacuum_cost_delay on a VACUUM FULL. Autovacuum takes a couple of extra precautions: it never does VACUUM FULL at all, and it is set up so that a request for a conflicting exclusive lock causes the autovacuum operation to get canceled. The upshot is that you can enable autovacuum_cost_delay without much fear of creating priority-inversion delays for competing tasks. But it's not at all clear how we'd generalize this design to allow slowdown of other operations without creating significant inversion hazards. BTW, it was suggested upthread that the "cost balance" stuff represented an additional problem that'd have to be surmounted to get to a general solution. I don't think this is necessarily the case. The point of the cost balance code is to ensure that multiple autovacuum workers don't eat a disproportionate amount of resources. It's not clear that someone would even want such a feature for user-level background queries, and even if desirable it's certainly not a must-have thing. regards, tom lane
On Fri, 15 Jan 2010, Greg Smith wrote: >> It seems to me that CFQ is simply bandwidth limited by the extra processing >> it has to perform. > > I'm curious what you are doing when you see this. 16 disc 15kRPM RAID0, when using fadvise with more than 100 simultaneous 8kB random requests. I sent an email to the mailing list on 29 Jan 2008, but it got thrown away by the mailing list spam filter because it had an image in it (the graph showing interesting information). Gregory Stark replied to it in http://archives.postgresql.org/pgsql-performance/2008-01/msg00285.php I was using his synthetic test case program. > My theory has been that the "extra processing it has to perform" you describe > just doesn't matter in the context of a fast system where physical I/O is > always the bottleneck. Basically, to an extent, that's right. However, when you get 16 drives or more into a system, then it starts being an issue. Matthew -- For every complex problem, there is a solution that is simple, neat, and wrong. -- H. L. Mencken
Matthew Wakeling wrote: > On Fri, 15 Jan 2010, Greg Smith wrote: >> My theory has been that the "extra processing it has to perform" you >> describe just doesn't matter in the context of a fast system where >> physical I/O is always the bottleneck. > > Basically, to an extent, that's right. However, when you get 16 drives > or more into a system, then it starts being an issue. I guess if I test a system with *only* 16 drives in it one day, maybe I'll find out. Seriously though, there is some difference between a completely synthetic test like you noted issues with here, and anything you can see when running the database. I was commenting more on the state of things from the perspective of a database app, where I just haven't seen any of the CFQ issues I hear reports of in other contexts. I'm sure there are plenty of low-level tests where the differences between the schedulers is completely obvious and it doesn't look as good anymore, and I'll take a look at whether I can replicate the test case you saw a specific concern with here. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Wed, 20 Jan 2010, Greg Smith wrote: >> Basically, to an extent, that's right. However, when you get 16 drives or >> more into a system, then it starts being an issue. > > I guess if I test a system with *only* 16 drives in it one day, maybe I'll > find out. *Curious* What sorts of systems have you tried so far? As the graph I just sent shows, the four schedulers are pretty-much identical in performance, until you start saturating it with simultaneous requests. CFQ levels out at a performance a little lower than the other three. > Seriously though, there is some difference between a completely synthetic > test like you noted issues with here, and anything you can see when running > the database. Granted, this test is rather synthetic. It is testing the rather unusual case of lots of simultaneous random small requests - more simultaneous requests than we advise people to run backends on a server. You'd probably need to get a RAID array a whole lot bigger than 16 drives to have a "normal workload" capable of demonstrating the performance difference, and even that isn't particularly major. Would be interesting research if anyone has a 200-spindle RAID array hanging around somewhere. Matthew -- A good programmer is one who looks both ways before crossing a one-way street. Considering the quality and quantity of one-way streets in Cambridge, it should be no surprise that there are so many good programmers there.