Thread: Performance Woes
Hi,
I have several databases. They are each about 35 GB in size and have about 10.5K relations (count from pg_stat_all_tables) in them. pg_class is about 26k rows and the data directory contains about 70k files. These are busy machines; they run about 50 transactions per second (approximately 500 rows inserted/updated/deleted per second).
We started getting errors about the number of open file descriptors:
2007-05-09 03:07:50.083 GMT 1146975740: LOG: 53000: out of file descriptors: Too many open files; release and retry
2007-05-09 03:07:50.083 GMT 1146975740: CONTEXT: SQL statement "insert ….. "
PL/pgSQL function "trigfunc_whatever" line 50 at execute statement
2007-05-09 03:07:50.083 GMT 1146975740: LOCATION: BasicOpenFile, fd.c:471
2007-05-09 03:07:50.091 GMT 0: LOG: 00000: duration: 12.362 ms
2007-05-09 03:07:50.091 GMT 0: LOCATION: exec_simple_query, postgres.c:1090
So we decreased max_files_per_process to 800. This took care of the error *BUT* roughly quadrupled the IO wait happening on the machine: it went from a peak of about 50% to peaks of over 200% (4-processor machines, 4 GB RAM, RAID). The load on the machine remained constant.
I am really trying to get an understanding of exactly what this setting is, what exactly is out of file descriptors, and how I can fix that. I need to bring that IO back down.
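A quick way to see how many descriptors each backend actually holds is to count the entries in /proc/<pid>/fd (Linux only). This helper is a sketch; the process-name match (`postgres`/`postmaster`) is an assumption and may need adjusting for your install:

```python
import os

def open_fd_count(pid):
    """Count open file descriptors for a process via /proc (Linux)."""
    try:
        return len(os.listdir('/proc/%d/fd' % pid))
    except OSError:
        return None  # process vanished or permission denied

def postgres_backends():
    """Yield PIDs whose command name looks like a PostgreSQL process.
    The name test here is illustrative, not authoritative."""
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        pid = int(entry)
        try:
            with open('/proc/%d/comm' % pid) as f:
                if f.read().strip() in ('postgres', 'postmaster'):
                    yield pid
        except IOError:
            continue

if __name__ == '__main__':
    for pid in postgres_backends():
        print(pid, open_fd_count(pid))
```

If the per-backend counts hover near max_files_per_process, the working set really is that large and lowering the setting will force extra open/close churn.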
Thanks for any help.
Ralph
--
Internal Virus Database is out-of-date.
Checked by AVG Free Edition.
Version: 7.5.432 / Virus Database: 268.15.9/573 - Release Date: 05/12/2006 16:07
> I have several databases. They are each about 35 GB in size and have about 10.5K relations (count from pg_stat_all_tables) in them. pg_class is about 26k rows and the data directory contains about 70k files. These are busy machines; they run about 50 transactions per second (approximately 500 rows inserted/updated/deleted per second).
>
> We started getting errors about the number of open file descriptors:
> 2007-05-09 03:07:50.083 GMT 1146975740: LOG: 53000: out of file descriptors: Too many open files; release and retry
> 2007-05-09 03:07:50.083 GMT 1146975740: CONTEXT: SQL statement "insert ….. "
> PL/pgSQL function "trigfunc_whatever" line 50 at execute statement
> 2007-05-09 03:07:50.083 GMT 1146975740: LOCATION: BasicOpenFile, fd.c:471
> 2007-05-09 03:07:50.091 GMT 0: LOG: 00000: duration: 12.362 ms
> 2007-05-09 03:07:50.091 GMT 0: LOCATION: exec_simple_query, postgres.c:1090
>
> So we decreased max_files_per_process to 800. This took care of the error *BUT* roughly quadrupled the IO wait happening on the machine: it went from a peak of about 50% to peaks of over 200% (4-processor machines, 4 GB RAM, RAID). The load on the machine remained constant.
What version of Pg/OS? What is your hardware config?
I had seen these errors with earlier versions of Pg 7.4.x; according to the changelogs, this was fixed in later releases.
> 2007-05-09 03:07:50.083 GMT 1146975740: LOCATION: BasicOpenFile, fd.c:471
>
> 2007-05-09 03:07:50.091 GMT 0: LOG: 00000: duration: 12.362 ms
>
> 2007-05-09 03:07:50.091 GMT 0: LOCATION: exec_simple_query, postgres.c:1090
>
> So we decreased max_files_per_process to 800. This took care of the error *BUT* roughly quadrupled the IO wait happening on the machine. It went from a peak of about 50% to peaks of over 200% (4-processor machines, 4 GB RAM, RAID). The load on the machine remained constant.

Sounds to me like you just need to up the total amount of open files allowed by the operating system.

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of CAJ CAJ
Sent: 10 May 2007 12:26
To: Ralph Mason
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance Woes
> I have several databases. They are each about 35 GB in size and have about 10.5K relations (count from pg_stat_all_tables) in them. pg_class is about 26k rows and the data directory contains about 70k files. These are busy machines; they run about 50 transactions per second (approximately 500 rows inserted/updated/deleted per second).
>
> We started getting errors about the number of open file descriptors:
> 2007-05-09 03:07:50.083 GMT 1146975740: LOG: 53000: out of file descriptors: Too many open files; release and retry
> 2007-05-09 03:07:50.083 GMT 1146975740: CONTEXT: SQL statement "insert ….. "
> PL/pgSQL function "trigfunc_whatever" line 50 at execute statement
> 2007-05-09 03:07:50.083 GMT 1146975740: LOCATION: BasicOpenFile, fd.c:471
> 2007-05-09 03:07:50.091 GMT 0: LOG: 00000: duration: 12.362 ms
> 2007-05-09 03:07:50.091 GMT 0: LOCATION: exec_simple_query, postgres.c:1090
>
> So we decreased max_files_per_process to 800. This took care of the error *BUT* roughly quadrupled the IO wait happening on the machine: it went from a peak of about 50% to peaks of over 200% (4-processor machines, 4 GB RAM, RAID). The load on the machine remained constant.
> What version of Pg/OS? What is your hardware config?
> I had seen these errors with earlier versions of Pg 7.4.x which was fixed in later releases according to the changelogs
"PostgreSQL 8.1.4 on x86_64-redhat-linux-gnu, compiled by GCC x86_64-redhat-linux-gcc (GCC) 4.1.0 20060304 (Red Hat 4.1.0-3)"
su postgres -c 'ulimit -a'
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
max nice (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 49152
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1000000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
max rt priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 49152
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Seems like I should be able to use lots and lots of open files.
Machines are quad-processor Opterons with 4 GB RAM, with data on RAID 5 and logging to a RAID 0.
Ralph
--
Hello,

You likely need to increase your file-max parameter using sysctl.conf.

Sincerely,

Joshua D. Drake
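For reference, the kind of change Joshua suggests would look something like the following; the value shown is purely illustrative, not a recommendation for this workload:

```conf
# /etc/sysctl.conf -- raise the system-wide open-file limit (illustrative value)
fs.file-max = 500000
```

Apply with `sysctl -p` and verify the new ceiling in the third field of `/proc/sys/fs/file-nr`.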
On Wed, 2007-05-09 at 17:29 -0700, Joshua D. Drake wrote:
> > 2007-05-09 03:07:50.083 GMT 1146975740: LOCATION: BasicOpenFile, fd.c:471
> >
> > 2007-05-09 03:07:50.091 GMT 0: LOG: 00000: duration: 12.362 ms
> >
> > 2007-05-09 03:07:50.091 GMT 0: LOCATION: exec_simple_query, postgres.c:1090
> >
> > So we decreased max_files_per_process to 800. This took care of the error *BUT* roughly quadrupled the IO wait happening on the machine. It went from a peak of about 50% to peaks of over 200% (4-processor machines, 4 GB RAM, RAID). The load on the machine remained constant.
>
> Sounds to me like you just need to up the total amount of open files allowed by the operating system.

It looks more like the opposite; here are the docs for max_files_per_process:

"Sets the maximum number of simultaneously open files allowed to each server subprocess. The default is one thousand files. If the kernel is enforcing a safe per-process limit, you don't need to worry about this setting. But on some platforms (notably, most BSD systems), the kernel will allow individual processes to open many more files than the system can really support when a large number of processes all try to open that many files. If you find yourself seeing "Too many open files" failures, try reducing this setting. This parameter can only be set at server start."

To me, that means that his machine is allowing the new FD to be created, but then can't really support that many, so it gives an error.

Ralph, how many connections do you have open at once? It seems like the machine perhaps just can't handle that many FDs in all of those processes at once.

That is a lot of tables. Maybe a different OS will handle it better? Maybe there's some way that you can use fewer connections and then the OS could still handle it?

Regards,
Jeff Davis
> To me, that means that his machine is allowing the new FD to be created,
> but then can't really support that many so it gives an error.

file-max is 297834; ulimit is 1000000 (which doesn't make sense, but there you go). What I don't really understand is that with max_files_per_process at 800 we don't get the problem, but with 1000 we do.

$ lsof | wc -l
14944

$ cat /proc/sys/fs/file-nr
12240   0       297834

> Ralph, how many connections do you have open at once? It seems like the
> machine perhaps just can't handle that many FDs in all of those
> processes at once.

There are only 30 connections, of which probably only 10 are really active. It doesn't seem like we should be stressing this machine.

> That is a lot of tables. Maybe a different OS will handle it better?
> Maybe there's some way that you can use fewer connections and then the
> OS could still handle it?

It would be less, but then you can't maintain the db because of the constant vacuuming needed :-(

I think the Linux folks would get up in arms if you told them they couldn't handle that many open files ;-)

Thanks,
Ralph
We have the kernel file-max setting set to 297834 (256 per 4 MB of RAM).
/proc/sys/fs/file-nr tells us that we have roughly 13000 allocated handles, of which none are ever free.
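For reference, the three fields in /proc/sys/fs/file-nr are the allocated handles, the allocated-but-unused handles, and the system-wide maximum (fs.file-max). A small sketch to decode them (Linux only; the function name is illustrative):

```python
def read_file_nr(path='/proc/sys/fs/file-nr'):
    """Parse the three whitespace-separated fields of file-nr (Linux)."""
    with open(path) as f:
        allocated, free, maximum = (int(x) for x in f.read().split())
    return {'allocated': allocated, 'free': free, 'max': maximum}
```

On the numbers above, roughly 13k allocated against a 297834 ceiling suggests the system-wide limit is not what is being hit.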
--
Scott Mohekey
Systems Administrator
Telogis
Intelligent Location Technologies
Ralph Mason wrote:
> I have several databases. They are each about 35 GB in size and have about
> 10.5K relations (count from pg_stat_all_tables) in them. pg_class is about
> 26k rows and the data directory contains about 70k files. These are busy
> machines; they run about 50 transactions per second (approximately 500 rows
> inserted/updated/deleted per second).

Is the problematic one always the same trigger? Is it just PL/pgSQL, or do you have something else? Something that may be trying to open additional files, for example? Something that may be trying to open files behind your back? PL/Perl with funky operators or temp files?

Also, what PG version is this?

> So we decreased max_files_per_process to 800. This took care of the
> error *BUT* roughly quadrupled the IO wait happening on the machine.
> It went from a peak of about 50% to peaks of over 200% (4-processor
> machines, 4 GB RAM, RAID). The load on the machine remained constant.

The max_files_per_process setting controls how many actual file descriptors each process is allowed to have. Postgres internally uses a "virtual file descriptor", which normally has one kernel file descriptor open. However, if your transactions need to access lots of files, the VFDs will close kernel FDs to allow other VFDs to open theirs.

So it sounds like your transaction has more than 800 files open. The extra IO wait could be caused by the additional system calls to open and close those files as needed. I would actually expect it to cause extra "system" load (as opposed to "user") rather than IO, but I'm not sure.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
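The virtual-file-descriptor behavior Alvaro describes can be sketched as an LRU cache of real kernel FDs. This is a toy illustration, not PostgreSQL's actual fd.c code; all names here are made up:

```python
from collections import OrderedDict

class FDPool:
    """Toy LRU pool: hand out files freely, but keep at most `limit`
    real kernel FDs open, evicting the least recently used one."""
    def __init__(self, limit):
        self.limit = limit               # analogous to max_files_per_process
        self.open_files = OrderedDict()  # path -> file object, LRU order
        self.reopens = 0                 # open() calls, including re-opens

    def get(self, path):
        if path in self.open_files:
            self.open_files.move_to_end(path)  # mark recently used
            return self.open_files[path]
        if len(self.open_files) >= self.limit:
            # Evict the LRU kernel FD; the "virtual" file still exists
            # and will be reopened on its next access.
            _, lru = self.open_files.popitem(last=False)
            lru.close()
        f = open(path, 'rb')
        self.reopens += 1
        self.open_files[path] = f
        return f
```

When the working set of files exceeds the limit, nearly every access pays a close+open pair, which is consistent with the extra load Ralph saw after dropping the setting from 1000 to 800.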
Jeff Davis <pgsql@j-davis.com> writes:
> On Wed, 2007-05-09 at 17:29 -0700, Joshua D. Drake wrote:
>> Sounds to me like you just need to up the total amount of open files
>> allowed by the operating system.

> It looks more like the opposite, here's the docs for
> max_files_per_process:

I think Josh has got the right advice. The manual is just saying that you can reduce max_files_per_process to avoid the failure, but it's not making any promises about the performance penalty for doing that. Apparently Ralph's app needs a working set of between 800 and 1000 open files to have reasonable performance.

> That is a lot of tables. Maybe a different OS will handle it better?
> Maybe there's some way that you can use fewer connections and then the
> OS could still handle it?

Also, it might be worth rethinking the database structure to reduce the number of tables. But for a quick fix, increasing the kernel limit seems like the easiest answer.

			regards, tom lane