Thread: Is my database now too big?
I'm starting to wonder if my database has finally grown too big for my
computer.

But, rather than perform badly, I see a number of "bad things" happen:

* pg_dump fails to run with an "out of memory" error (I don't understand
  why this happens, but it does);

* restarting postgres across a reboot results in tables disappearing
  without any error messages being given out (this is *really
  disturbing*.)

postgres itself, however, starts up fine.

"wc -l" of the output from the last successful pg_dumpall is around
8 million lines, spread across half a dozen or so tables.

I'm afraid that if I restructure the db to have smaller tables, I'd just
be putting off "hitting" the ceiling.

What can I safely do to get postgres running normally?

I'm currently using 8.1.3; will I fare any better with a more recent
version?

My end goal is to have maybe 50 million records. Will postgresql handle
this (with only 1.5GB of RAM) or do I need to look elsewhere for a db of
this size?

Thoughts?

Cheers,
Darren
Darren Reed wrote:
> I'm starting to wonder if my database has finally grown too big for my
> computer.
>
> But, rather than perform badly, I see a number of "bad things" happen:
>
> * pg_dump fails to run with an "out of memory" error (I don't understand
>   why this happens, but it does);
>
> * restarting postgres across a reboot results in tables disappearing
>   without any error messages being given out (this is *really
>   disturbing*.)

Are you vacuuming?

> postgres itself, however, starts up fine.
>
> "wc -l" of the output from the last successful pg_dumpall is around
> 8 million lines, spread across half a dozen or so tables.
>
> I'm afraid that if I restructure the db to have smaller tables, I'd just
> be putting off "hitting" the ceiling.
>
> What can I safely do to get postgres running normally?
>
> I'm currently using 8.1.3; will I fare any better with a more recent
> version?
>
> My end goal is to have maybe 50 million records. Will postgresql handle
> this (with only 1.5GB of RAM) or do I need to look elsewhere for a db of
> this size?
>
> Thoughts?
>
> Cheers,
> Darren
On Sat, 06 Oct 2007 23:05:23 -0700, "Joshua D. Drake" <jd@commandprompt.com> said:
> Darren Reed wrote:
> > I'm starting to wonder if my database has finally grown too big for my
> > computer.
> >
> > But, rather than perform badly, I see a number of "bad things" happen:
> >
> > * pg_dump fails to run with an "out of memory" error (I don't understand
> >   why this happens, but it does);
> >
> > * restarting postgres across a reboot results in tables disappearing
> >   without any error messages being given out (this is *really
> >   disturbing*.)
>
> Are you vacuuming?

No, I'm just using the standard config. Are you implying I should be?

The problems I've come across have become evident with as little work as this:
- run "initdb"
- start postgres
- run "psql" to import the dumped data
- run for a short period of time (less than an hour)
- shutdown postgres
- reboot
- postgres starts
- the large tables are now missing

With the above, I wouldn't have thought there is enough time to do vacuums,
which seem to take quite a while on large tables. The activity is a mixture
of inserts and updates (from triggers and plsql functions.)

Darren
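For reference, a minimal manual maintenance pass on an 8.1-era cluster is
only a couple of commands; the table name "ifl" below is just an example:

    -- inside psql, for one table:
    VACUUM ANALYZE ifl;

    -- or for every database in the cluster, from the shell:
    vacuumdb --all --analyze

On 8.1 the autovacuum daemon is off unless it has been enabled in
postgresql.conf, so with a heavy insert/update/delete workload something
like the above has to be run by hand (or from cron).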
On 10/7/07, Darren Reed <darrenr+postgres@fastmail.net> wrote:
> On Sat, 06 Oct 2007 23:05:23 -0700, "Joshua D. Drake"
> <jd@commandprompt.com> said:
> > Darren Reed wrote:
> > > I'm starting to wonder if my database has finally grown too big for my
> > > computer.
> > >
> > > But, rather than perform badly, I see a number of "bad things" happen:
> > >
> > > * pg_dump fails to run with an "out of memory" error (I don't
> > >   understand why this happens, but it does);
> > >
> > > * restarting postgres across a reboot results in tables disappearing
> > >   without any error messages being given out (this is *really
> > >   disturbing*.)
> >
> > Are you vacuuming?
>
> No, I'm just using the standard config. Are you implying I should be?
>
> The problems I've come across have become evident with as little work as
> this:
> - run "initdb"
> - start postgres
> - run "psql" to import the dumped data
> - run for a short period of time (less than an hour)
> - shutdown postgres
> - reboot
> - postgres starts
> - the large tables are now missing
>
> With the above, I wouldn't have thought there is enough time to do
> vacuums, which seem to take quite a while on large tables. The activity
> is a mixture of inserts and updates (from triggers and plsql functions.)

Any reasonably modern version of pgsql should simply stop accepting
requests rather than suffering loss due to txid wraparound. So, I can
think of two possibilities here: bad hardware or operator error.

Assuming you've checked out your machine thoroughly for bad hardware,
I can see a scenario where one does something like:

begin;
create table xyz;
load 10,000,000 rows
manipulate rows
shutdown db without committing
start database
voila, table xyz is gone, and rightly so.

Got more detailed info on what you're doing?

There's no problem with a few million rows. That's actually a pretty
small pgsql db. I have a reporting db that grows by about 1-2 million
rows a month and takes up 40 gigs, and that's pretty small in the pgsql
universe.
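In concrete SQL, the scenario Scott sketches above would look roughly like
this (the table definition and the data file path are placeholders):

    BEGIN;
    CREATE TABLE xyz (id integer PRIMARY KEY, payload text);
    COPY xyz FROM '/tmp/rows.dat';              -- load the 10,000,000 rows
    UPDATE xyz SET payload = upper(payload);    -- "manipulate rows"
    -- server stopped here, with no COMMIT ever issued;
    -- after restart, crash recovery rolls the transaction back and xyz
    -- (and everything loaded into it) is gone, exactly as designed.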
On 10/7/07, Darren Reed <darrenr@fastmail.net> wrote:
> Scott Marlowe wrote:
> > ...
> >
> > Any reasonably modern version of pgsql should simply stop accepting
> > requests rather than suffering loss due to txid wraparound. So, I can
> > think of two possibilities here: bad hardware or operator error.
> >
> > Assuming you've checked out your machine thoroughly for bad hardware,
> > I can see a scenario where one does something like:
> >
> > begin;
> > create table xyz;
> > load 10,000,000 rows
> > manipulate rows
> > shutdown db without committing
> > start database
> > voila, table xyz is gone, and rightly so.
> >
> > Got more detailed info on what you're doing?
>
> That does describe what was happening (I haven't used BEGIN/COMMIT.)

Then it isn't the same thing. If you did a begin, then did everything
else without commit, the table would rightly disappear.

> Nothing very special, I thought...
>
> But, doing "SELECT * FROM ifl LIMIT 1;" causes postgres to grow its
> process to 2GB and then die because the OS ran out of swap!

I doubt that exact query is causing the db to run out of memory, unless
ifl is a complex view or something.

Can you be more specific on what exact query causes the problem to show up?

> Actually, this is a table that sees a lot of INSERT/DELETE (it's a place
> to store work to be done and bits get removed when completed) and I
> haven't been using BEGIN/COMMIT. This is how postgres currently handles
> it:
>
> LOG:  database system was not properly shut down; automatic recovery in progress
> LOG:  record with zero length at 0/891157C8
> LOG:  redo is not required
> LOG:  database system is ready
> LOG:  transaction ID wrap limit is 2147484146, limited by database "postgres"
> LOG:  unexpected EOF on client connection
> LOG:  server process (PID 7212) was terminated by signal 9
> LOG:  terminating any other active server processes
> WARNING:  terminating connection because of crash of another server process

Looks like some query is running the server out of memory. Normally,
postgresql will spill to disk if it needs more memory, unless it's
misconfigured.

> I'm modifying the work to use BEGIN/COMMIT, but the ifl table worries me...
> I can't seem to do anything with it that doesn't cause postgres to crap
> out ;(

begin/commit ain't the problem here. Looks like you've either got pgsql
set to use too much memory or it's choosing a bad plan where it thinks
something will fit in memory but it won't. Have you been analyzing your
data before you start working on it?

Can we see your postgresql.conf file?
"Darren Reed" <darrenr+postgres@fastmail.net> writes: > * pg_dump fails to run, causing an "out of memory" error (I don't > understand > why this happens, but it does); Let's see the exact command to pg_dump and the exact failure message. It might be useful to add -v so we can get some sense of where in the process it fails. > * restarting postgres across a reboot results in tables disappearing > without any error messages being given out (this is *really > disturbing*.) Hm, system reboot you mean? To be honest I think this is likely pilot error, along the lines of the postmaster you are running after the reboot is using a different data directory than the one you were talking to before. Comparing the output of "SHOW data_directory;" before and after might be illuminating. I remember a vaguely similar incident awhile back in which it turned out that the DBA had put the data directory on a soft-mounted NFS server, which sometimes hadn't come up by the time Postgres started, and so sometimes he got the NFS server's version of the database and sometimes he got a version that was on the local disk underneath the NFS mount point. *That* was a mess :-(, and that was one reason why the PGDG and Red Hat init scripts for PG no longer do an automatic initdb if they don't find a valid data directory where it's supposed to be. Exactly how are you starting the postmaster, anyway, and what is the underlying platform here? Did you roll-your-own Postgres build or is this a prepackaged distribution? If the latter, whose? While I'm asking questions, is it just the tables that disappear or is it all your database objects (including functions etc)? Do *all* your tables disappear or just some of them? What do you mean by "disappear" exactly --- do you get 'relation "foo" does not exist', or some weirder error, or are the tables present but empty? > "wc -l" of the output from the last successuful dump_all is around > 8million lines, spread across half a dozen or so tables. As already pointed out, this database is tiny. I'm not sure what your problem is, but I'm quite sure it's not "database is too big". > I'm currently using 8.1.3, will I fare any better with a more recent > version? You should certainly be on 8.1.10 just on general principles, though I don't immediately recall any bugs that seem like they could be related. There have been plenty of bugs fixed since 8.1.3 though. regards, tom lane
Scott Marlowe wrote:
> On 10/7/07, Darren Reed <darrenr@fastmail.net> wrote:
> > Scott Marlowe wrote:
> > > ...
> > >
> > > Any reasonably modern version of pgsql should simply stop accepting
> > > requests rather than suffering loss due to txid wraparound. So, I can
> > > think of two possibilities here: bad hardware or operator error.
> > >
> > > Assuming you've checked out your machine thoroughly for bad hardware,
> > > I can see a scenario where one does something like:
> > >
> > > begin;
> > > create table xyz;
> > > load 10,000,000 rows
> > > manipulate rows
> > > shutdown db without committing
> > > start database
> > > voila, table xyz is gone, and rightly so.
> > >
> > > Got more detailed info on what you're doing?
> >
> > That does describe what was happening (I haven't used BEGIN/COMMIT.)
>
> Then it isn't the same thing. If you did a begin, then did everything
> else without commit, the table would rightly disappear.

Right, I'm with you on that.

A few days ago I did:
pg_dumpall > foo

What I was doing yesterday was:
rm -rf /data/db/*
initdb -D /data/db
start
psql < foo
run for some period
stop
reboot
start
...tables have gone but disk space is still in use.

I don't know if it was during the period of running that the database
got corrupted (interrupted insert/update/query?) or what happened.

> > Nothing very special, I thought...
> >
> > But, doing "SELECT * FROM ifl LIMIT 1;" causes postgres to grow its
> > process to 2GB and then die because the OS ran out of swap!
>
> I doubt that exact query is causing the db to run out of memory,
> unless ifl is a complex view or something.
>
> Can you be more specific on what exact query causes the problem to show up?

It turned out that _any_ query on that table caused the problem to show up.
I couldn't even do "DROP TABLE ifl;" without postgres growing until it ran
out of memory.

So in the end, I wiped it clean and reloaded the data - this time bounding
all of the work with BEGIN/COMMIT. So far things are looking better. All of
the data I've been building the tables with is elsewhere, so I can
reconstruct it. Maybe adding BEGIN/COMMIT makes no difference compared to
not using them before, but I'm curious to see if it does. Ideally I'd like
to get to a place where I don't need to use vacuum at all.

> > Actually, this is a table that sees a lot of INSERT/DELETE (it's a place
> > to store work to be done and bits get removed when completed) and I
> > haven't been using BEGIN/COMMIT. This is how postgres currently handles
> > it:
> >
> > LOG:  database system was not properly shut down; automatic recovery in progress
> > LOG:  record with zero length at 0/891157C8
> > LOG:  redo is not required
> > LOG:  database system is ready
> > LOG:  transaction ID wrap limit is 2147484146, limited by database "postgres"
> > LOG:  unexpected EOF on client connection
> > LOG:  server process (PID 7212) was terminated by signal 9
> > LOG:  terminating any other active server processes
> > WARNING:  terminating connection because of crash of another server process
>
> Looks like some query is running the server out of memory. Normally,
> postgresql will spill to disk if it needs more memory, unless it's
> misconfigured.

Yes. I tried increasing the swap space but that just meant it grew larger...

From limit:
datasize        3145728 kbytes

This is from NetBSD 4.99. I ended up running with 3.5GB of SWAP and 1.5GB
of RAM.

> > I'm modifying the work to use BEGIN/COMMIT, but the ifl table worries me...
> > I can't seem to do anything with it that doesn't cause postgres to crap
> > out ;(
>
> begin/commit ain't the problem here. Looks like you've either got
> pgsql set to use too much memory or it's choosing a bad plan where it
> thinks something will fit in memory but it won't.

I have no other problems with any of the other tables and it is only a
small table (at the time it should have had less than 5000 rows.)

> Have you been analyzing your data before you start working on it?

No.

> Can we see your postgresql.conf file?

Sure, I've attached it.
I've also run with the "default" .conf file without tuning it (down.)

Darren

# -----------------------------
# PostgreSQL configuration file
# -----------------------------
#
# This file consists of lines of the form:
#
#   name = value
#
# (The '=' is optional.) White space may be used. Comments are introduced
# with '#' anywhere on a line. The complete list of option names and
# allowed values can be found in the PostgreSQL documentation. The
# commented-out settings shown in this file represent the default values.
#
# Please note that re-commenting a setting is NOT sufficient to revert it
# to the default value, unless you restart the postmaster.
#
# Any option can also be given as a command line switch to the
# postmaster, e.g. 'postmaster -c log_connections=on'. Some options
# can be changed at run-time with the 'SET' SQL command.
#
# This file is read on postmaster startup and when the postmaster
# receives a SIGHUP. If you edit the file on a running system, you have
# to SIGHUP the postmaster for the changes to take effect, or use
# "pg_ctl reload". Some settings, such as listen_addresses, require
# a postmaster shutdown and restart to take effect.


#---------------------------------------------------------------------------
# FILE LOCATIONS
#---------------------------------------------------------------------------

# The default values of these variables are driven from the -D command line
# switch or PGDATA environment variable, represented here as ConfigDir.

#data_directory = 'ConfigDir'           # use data in another directory
#hba_file = 'ConfigDir/pg_hba.conf'     # host-based authentication file
#ident_file = 'ConfigDir/pg_ident.conf' # IDENT configuration file

# If external_pid_file is not explicitly set, no extra pid file is written.
#external_pid_file = '(none)'           # write an extra pid file


#---------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
#---------------------------------------------------------------------------

# - Connection Settings -

#listen_addresses = 'localhost'         # what IP address(es) to listen on;
                                        # comma-separated list of addresses;
                                        # defaults to 'localhost', '*' = all
#port = 5432
max_connections = 15
# note: increasing max_connections costs ~400 bytes of shared memory per
# connection slot, plus lock space (see max_locks_per_transaction). You
# might also need to raise shared_buffers to support more connections.
superuser_reserved_connections = 2
#unix_socket_directory = ''
#unix_socket_group = ''
#unix_socket_permissions = 0777         # octal
#bonjour_name = ''                      # defaults to the computer name

# - Security & Authentication -

#authentication_timeout = 60            # 1-600, in seconds
#ssl = off
#password_encryption = on
#db_user_namespace = off

# Kerberos
#krb_server_keyfile = ''
#krb_srvname = 'postgres'
#krb_server_hostname = ''               # empty string matches any keytab entry
#krb_caseins_users = off

# - TCP Keepalives -
# see 'man 7 tcp' for details

#tcp_keepalives_idle = 0                # TCP_KEEPIDLE, in seconds;
                                        # 0 selects the system default
#tcp_keepalives_interval = 0            # TCP_KEEPINTVL, in seconds;
                                        # 0 selects the system default
#tcp_keepalives_count = 0               # TCP_KEEPCNT;
                                        # 0 selects the system default


#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

# - Memory -

shared_buffers = 2000                   # min 16 or max_connections*2, 8KB each
temp_buffers = 200                      # min 100, 8KB each
max_prepared_transactions = 5           # can be 0 or more
# note: increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
work_mem = 4096                         # min 64, size in KB
maintenance_work_mem = 8192             # min 1024, size in KB
max_stack_depth = 400                   # min 100, size in KB

# - Free Space Map -

max_fsm_pages = 20000                   # min max_fsm_relations*16, 6 bytes each
max_fsm_relations = 200                 # min 100, ~70 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 25             # min 25
#preload_libraries = ''

# - Cost-Based Vacuum Delay -

#vacuum_cost_delay = 0                  # 0-1000 milliseconds
#vacuum_cost_page_hit = 1               # 0-10000 credits
#vacuum_cost_page_miss = 10             # 0-10000 credits
#vacuum_cost_page_dirty = 20            # 0-10000 credits
#vacuum_cost_limit = 200                # 0-10000 credits

# - Background writer -

#bgwriter_delay = 200                   # 10-10000 milliseconds between rounds
#bgwriter_lru_percent = 1.0             # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5              # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333           # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5              # 0-1000 buffers max written/round


#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

#fsync = on                             # turns forced synchronization on or off
#wal_sync_method = fsync                # the default is the first option
                                        # supported by the operating system:
                                        #   open_datasync
                                        #   fdatasync
                                        #   fsync
                                        #   fsync_writethrough
                                        #   open_sync
#full_page_writes = on                  # recover from partial page writes
#wal_buffers = 8                        # min 4, 8KB each
#commit_delay = 0                       # range 0-100000, in microseconds
#commit_siblings = 5                    # range 1-1000

# - Checkpoints -

#checkpoint_segments = 3                # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300               # range 30-3600, in seconds
#checkpoint_warning = 30                # in seconds, 0 is off

# - Archiving -

#archive_command = ''                   # command to use to archive a logfile segment


#---------------------------------------------------------------------------
# QUERY TUNING
#---------------------------------------------------------------------------

# - Planner Method Configuration -

#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on

# - Planner Cost Constants -

effective_cache_size = 1000             # typically 8KB each
#random_page_cost = 4                   # units are one sequential page fetch cost
#cpu_tuple_cost = 0.01                  # (same)
#cpu_index_tuple_cost = 0.001           # (same)
#cpu_operator_cost = 0.0025             # (same)

# - Genetic Query Optimizer -

#geqo = on
#geqo_threshold = 12
#geqo_effort = 5                        # range 1-10
#geqo_pool_size = 0                     # selects default based on effort
#geqo_generations = 0                   # selects default based on effort
#geqo_selection_bias = 2.0              # range 1.5-2.0

# - Other Planner Options -

#default_statistics_target = 10         # range 1-1000
#constraint_exclusion = off
#from_collapse_limit = 8
#join_collapse_limit = 8                # 1 disables collapsing of explicit JOINs


#---------------------------------------------------------------------------
# ERROR REPORTING AND LOGGING
#---------------------------------------------------------------------------

# - Where to Log -

#log_destination = 'stderr'             # Valid values are combinations of
                                        # stderr, syslog and eventlog,
                                        # depending on platform.

# This is used when logging to stderr:
#redirect_stderr = off                  # Enable capturing of stderr into log files

# These are only used if redirect_stderr is on:
#log_directory = 'pg_log'               # Directory where log files are written
                                        # Can be absolute or relative to PGDATA
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
                                        # Log file name pattern.
                                        # Can include strftime() escapes
#log_truncate_on_rotation = off         # If on, any existing log file of the
                                        # same name as the new log file will be
                                        # truncated rather than appended to. But
                                        # such truncation only occurs on
                                        # time-driven rotation, not on restarts
                                        # or size-driven rotation. Default is
                                        # off, meaning append to existing files
                                        # in all cases.
#log_rotation_age = 1440                # Automatic rotation of logfiles will
                                        # happen after so many minutes. 0 to
                                        # disable.
#log_rotation_size = 10240              # Automatic rotation of logfiles will
                                        # happen after so many kilobytes of log
                                        # output. 0 to disable.

# These are relevant when logging to syslog:
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'

# - When to Log -

#client_min_messages = notice           # Values, in order of decreasing detail:
                                        # debug5, debug4, debug3, debug2, debug1,
                                        # log, notice, warning, error
#log_min_messages = notice              # Values, in order of decreasing detail:
                                        # debug5, debug4, debug3, debug2, debug1,
                                        # info, notice, warning, error, log,
                                        # fatal, panic
#log_error_verbosity = default          # terse, default, or verbose messages
#log_min_error_statement = panic        # Values in order of increasing severity:
                                        # debug5, debug4, debug3, debug2, debug1,
                                        # info, notice, warning, error, panic(off)
#log_min_duration_statement = -1        # -1 is disabled, 0 logs all statements
                                        # and their durations, in milliseconds.
#silent_mode = off                      # DO NOT USE without syslog or
                                        # redirect_stderr

# - What to Log -

#debug_print_parse = off
#debug_print_rewritten = off
#debug_print_plan = off
#debug_pretty_print = off
#log_connections = off
#log_disconnections = off
#log_duration = off
#log_line_prefix = ''                   # Special values:
                                        #   %u = user name
                                        #   %d = database name
                                        #   %r = remote host and port
                                        #   %h = remote host
                                        #   %p = PID
                                        #   %t = timestamp (no milliseconds)
                                        #   %m = timestamp with milliseconds
                                        #   %i = command tag
                                        #   %c = session id
                                        #   %l = session line number
                                        #   %s = session start timestamp
                                        #   %x = transaction id
                                        #   %q = stop here in non-session processes
                                        #   %% = '%'
                                        # e.g. '<%u%%%d> '
#log_statement = 'none'                 # none, mod, ddl, all
#log_hostname = off


#---------------------------------------------------------------------------
# RUNTIME STATISTICS
#---------------------------------------------------------------------------

# - Statistics Monitoring -

#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off

# - Query/Index Statistics Collector -

#stats_start_collector = on
#stats_command_string = off
#stats_block_level = off
#stats_row_level = off
#stats_reset_on_server_start = off


#---------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#---------------------------------------------------------------------------

#autovacuum = off                       # enable autovacuum subprocess?
#autovacuum_naptime = 60                # time between autovacuum runs, in secs
#autovacuum_vacuum_threshold = 1000     # min # of tuple updates before vacuum
#autovacuum_analyze_threshold = 500     # min # of tuple updates before analyze
#autovacuum_vacuum_scale_factor = 0.4   # fraction of rel size before vacuum
#autovacuum_analyze_scale_factor = 0.2  # fraction of rel size before analyze
#autovacuum_vacuum_cost_delay = -1      # default vacuum cost delay for autovac,
                                        # -1 means use vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1      # default vacuum cost limit for autovac,
                                        # -1 means use vacuum_cost_limit


#---------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#---------------------------------------------------------------------------

# - Statement Behavior -

#search_path = '$user,public'           # schema names
#default_tablespace = ''                # a tablespace name, '' uses the default
#check_function_bodies = on
#default_transaction_isolation = 'read committed'
#default_transaction_read_only = off
#statement_timeout = 0                  # 0 is disabled, in milliseconds

# - Locale and Formatting -

#datestyle = 'iso, mdy'
#timezone = unknown                     # actually, defaults to TZ environment setting
#australian_timezones = off
#extra_float_digits = 0                 # min -15, max 2
#client_encoding = sql_ascii            # actually, defaults to database encoding

# These settings are initialized by initdb -- they might be changed
#lc_messages = 'C'                      # locale for system error message strings
#lc_monetary = 'C'                      # locale for monetary formatting
#lc_numeric = 'C'                       # locale for number formatting
#lc_time = 'C'                          # locale for time formatting

# - Other Defaults -

#explain_pretty_print = on
#dynamic_library_path = '$libdir'


#---------------------------------------------------------------------------
# LOCK MANAGEMENT
#---------------------------------------------------------------------------

#deadlock_timeout = 1000                # in milliseconds
#max_locks_per_transaction = 64         # min 10
# note: each lock table slot uses ~220 bytes of shared memory, and there are
# max_locks_per_transaction * (max_connections + max_prepared_transactions)
# lock table slots.


#---------------------------------------------------------------------------
# VERSION/PLATFORM COMPATIBILITY
#---------------------------------------------------------------------------

# - Previous Postgres Versions -

#add_missing_from = off
#regex_flavor = advanced                # advanced, extended, or basic
#sql_inheritance = on
#default_with_oids = off
#escape_string_warning = off

# - Other Platforms & Clients -

#transform_null_equals = off


#---------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#---------------------------------------------------------------------------

#custom_variable_classes = ''           # list of custom variable class names
On 10/7/07, Darren Reed <darrenr+postgres@fastmail.net> wrote:
> Scott Marlowe wrote:
> > On 10/7/07, Darren Reed <darrenr@fastmail.net> wrote:
> > > Scott Marlowe wrote:
>
> A few days ago I did:
> pg_dumpall > foo
>
> What I was doing yesterday was:
> rm -rf /data/db/*
> initdb -D /data/db
> start
> psql < foo
> run for some period
> stop
> reboot
> start
> ...tables have gone but disk space is still in use.
>
> I don't know if it was during the period of running that the
> database got corrupted (interrupted insert/update/query?)
> or what happened.

Are you sure postgresql was starting up in the /data/db directory
after reboot and not somewhere else like /var/lib/pgsql/data???

If you're definitely hitting the right directory, then is the database
shutting down cleanly on reboot? It might be that it's getting killed
during a write and you've got some kind of problem with fsync on your
machine, so the db is getting corrupted.

> > Can you be more specific on what exact query causes the problem to show up?
>
> It turned out that _any_ query on that table caused the problem to show up.
>
> I couldn't even do "DROP TABLE ifl;" without postgres growing until it
> ran out of memory.

Definitely sounds like some kind of issue other than just the size of the
table, like some kind of corruption.

> So in the end, I wiped it clean and reloaded the data - this time
> bounding all of the work with BEGIN/COMMIT. So far things are looking
> better. All of the data I've been building the tables with is elsewhere,
> so I can reconstruct it. Maybe adding BEGIN/COMMIT makes no difference
> compared to not using them before, but I'm curious to see if it does.
> Ideally I'd like to get to a place where I don't need to use vacuum
> at all.

If nothing else, wrapping your load and building in begin;commit; should
make it faster.

> > Have you been analyzing your data before you start working on it?
>
> No.

You should definitely run analyze after updating the table. It helps the
query planner make the optimal choice for query plans.

> > Can we see your postgresql.conf file?
>
> Sure, I've attached it.
> I've also run with the "default" .conf file without tuning it (down.)
>
> Darren

Nothing odd here:

> shared_buffers = 2000                  # min 16 or max_connections*2, 8KB each
> temp_buffers = 200                     # min 100, 8KB each
> max_prepared_transactions = 5          # can be 0 or more
> # note: increasing max_prepared_transactions costs ~600 bytes of shared memory
> # per transaction slot, plus lock space (see max_locks_per_transaction).
> work_mem = 4096                        # min 64, size in KB
> maintenance_work_mem = 8192            # min 1024, size in KB
> max_stack_depth = 400                  # min 100, size in KB
>
> # - Free Space Map -
>
> max_fsm_pages = 20000                  # min max_fsm_relations*16, 6 bytes each
> max_fsm_relations = 200                # min 100, ~70 bytes each

This:

> effective_cache_size = 1000            # typically 8KB each

is a little low, but that's not a huge deal. effective_cache_size doesn't
allocate anything, it just tells the planner about how much memory the OS
is using to cache your database.

And I don't see anything else in your postgresql.conf that looks
suspicious. I'm leaning towards possible pilot error in shutting down
or starting up the db.
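A minimal sketch of the load-and-analyze pattern suggested above; the data
file path and table name are placeholders:

    BEGIN;
    COPY ifl FROM '/tmp/ifl.dat';        -- bulk load inside a single transaction
    -- ... any follow-up UPDATEs/DELETEs on the freshly loaded rows ...
    COMMIT;

    ANALYZE ifl;                         -- refresh planner statistics afterwards

Doing the whole load in one transaction avoids a per-row commit, and the
ANALYZE afterwards gives the planner realistic row counts to work with.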
Scott Marlowe wrote:
> ...
>
> Any reasonably modern version of pgsql should simply stop accepting
> requests rather than suffering loss due to txid wraparound. So, I can
> think of two possibilities here: bad hardware or operator error.
>
> Assuming you've checked out your machine thoroughly for bad hardware,
> I can see a scenario where one does something like:
>
> begin;
> create table xyz;
> load 10,000,000 rows
> manipulate rows
> shutdown db without committing
> start database
> voila, table xyz is gone, and rightly so.
>
> Got more detailed info on what you're doing?

That does describe what was happening (I haven't used BEGIN/COMMIT.)

Nothing very special, I thought...

But, doing "SELECT * FROM ifl LIMIT 1;" causes postgres to grow its
process to 2GB and then die because the OS ran out of swap!

Actually, this is a table that sees a lot of INSERT/DELETE (it's a place to
store work to be done and bits get removed when completed) and I haven't
been using BEGIN/COMMIT. This is how postgres currently handles it:

LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  record with zero length at 0/891157C8
LOG:  redo is not required
LOG:  database system is ready
LOG:  transaction ID wrap limit is 2147484146, limited by database "postgres"
LOG:  unexpected EOF on client connection
LOG:  server process (PID 7212) was terminated by signal 9
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
         current transaction and exit, because another server process exited
         abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
       repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
         current transaction and exit, because another server process exited
         abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
       repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2007-10-07 02:22:05 PDT
LOG:  checkpoint record is at 0/891157C8
LOG:  redo record is at 0/891157C8; undo record is at 0/0; shutdown TRUE
LOG:  next transaction ID: 2715; next OID: 24576
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  record with zero length at 0/8911580C
LOG:  redo is not required
LOG:  database system is ready
LOG:  transaction ID wrap limit is 2147484146, limited by database "postgres"

I'm modifying the work to use BEGIN/COMMIT, but the ifl table worries me...
I can't seem to do anything with it that doesn't cause postgres to crap
out ;(

Darren
Scott Marlowe wrote:
> On 10/7/07, Darren Reed <darrenr+postgres@fastmail.net> wrote:
> > Scott Marlowe wrote:
> > > On 10/7/07, Darren Reed <darrenr@fastmail.net> wrote:
> > > > Scott Marlowe wrote:
>
> > A few days ago I did:
> > pg_dumpall > foo
> > What I was doing yesterday was:
> > rm -rf /data/db/*
> > initdb -D /data/db
> > start
> > psql < foo
> > run for some period
> > stop
> > reboot
> > start
> > ...tables have gone but disk space is still in use.
> > I don't know if it was during the period of running that the
> > database got corrupted (interrupted insert/update/query?)
> > or what happened.
>
> Are you sure postgresql was starting up in the /data/db directory
> after reboot and not somewhere else like /var/lib/pgsql/data???
>
> If you're definitely hitting the right directory, then is the database
> shutting down cleanly on reboot? It might be that it's getting killed
> during a write and you've got some kind of problem with fsync on your
> machine, so the db is getting corrupted.
>
> > > Can you be more specific on what exact query causes the problem to show up?
> >
> > It turned out that _any_ query on that table caused the problem to show up.
> >
> > I couldn't even do "DROP TABLE ifl;" without postgres growing until it
> > ran out of memory.
>
> Definitely sounds like some kind of issue other than just the size of the
> table, like some kind of corruption.
>
> ...
> And I don't see anything else in your postgresql.conf that looks
> suspicious. I'm leaning towards possible pilot error in shutting down
> or starting up the db.

Ok, I've had another recurrence of this problem.

The sequence of events was something like this:

CREATE TABLESPACE foo LOCATION '/data/index/ext';
<wait>
<machine hangs>
<reboot>

Of course postgresql didn't shut down cleanly because it was naughty
earlier and ate all my RAM, causing the box to hang. Now I'm back to the
prior problem: entire tables are missing when postgresql starts back up
again. Obviously there is some sort of corruption (caused by postgresql)
and it isn't able to recover properly.

So I'm moving quite squarely away from postgres being able to gobble up
as much RAM as it desires.

Darren
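For reference, creating a tablespace and relocating an existing index into
it on 8.1 would look roughly like this; the tablespace and index names are
invented for the example, and the directory must already exist and be
owned by the postgres user:

    CREATE TABLESPACE idxspace LOCATION '/data/index/ext';
    ALTER INDEX ifl_pkey SET TABLESPACE idxspace;   -- rewrites the index file
                                                    -- under the new directory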
Darren Reed wrote:
> Scott Marlowe wrote:
>> On 10/7/07, Darren Reed <darrenr+postgres@fastmail.net> wrote:
>> > Scott Marlowe wrote:
>> > > On 10/7/07, Darren Reed <darrenr@fastmail.net> wrote:
>> > > > Scott Marlowe wrote:
>>
>> > A few days ago I did:
>> > pg_dumpall > foo
>> > What I was doing yesterday was:
>> > rm -rf /data/db/*
>> > initdb -D /data/db
>> > start
>> > psql < foo
>> > run for some period
>> > stop
>> > reboot
>> > start
>> > ...tables have gone but disk space is still in use.
>> > I don't know if it was during the period of running that the
>> > database got corrupted (interrupted insert/update/query?)
>> > or what happened.
>>
>> Are you sure postgresql was starting up in the /data/db directory
>> after reboot and not somewhere else like /var/lib/pgsql/data???
>>
>> If you're definitely hitting the right directory, then is the database
>> shutting down cleanly on reboot? It might be that it's getting killed
>> during a write and you've got some kind of problem with fsync on your
>> machine, so the db is getting corrupted.
>>
>> > > Can you be more specific on what exact query causes the problem
>> > > to show up?
>> >
>> > It turned out that _any_ query on that table caused the problem to
>> > show up.
>> >
>> > I couldn't even do "DROP TABLE ifl;" without postgres growing until it
>> > ran out of memory.
>>
>> Definitely sounds like some kind of issue other than just the size of
>> the table, like some kind of corruption.
>>
>> ...
>> And I don't see anything else in your postgresql.conf that looks
>> suspicious. I'm leaning towards possible pilot error in shutting down
>> or starting up the db.
>
> Ok, I've had another recurrence of this problem.
>
> The sequence of events was something like this:
>
> CREATE TABLESPACE foo LOCATION '/data/index/ext';
> <wait>
> <machine hangs>
> <reboot>
>
> Of course postgresql didn't shut down cleanly because it was
> naughty earlier and ate all my RAM, causing the box to hang.
> Now I'm back to the prior problem: entire tables are missing
> when postgresql starts back up again. Obviously there is some
> sort of corruption (caused by postgresql) and it isn't able to
> recover properly.

I suppose the obvious question here is how do I restore the missing
tables? The data is all there; is there some sort of transaction log
that can be unwound to restore visibility of the missing tables?

Where do I start for trying to do some forensics?

Or am I SOL?

Later on...

So having upgraded to 8.1.10, I thought I'd try out a few things...

# /usr/pkg/bin/pg_dumpall
--
-- PostgreSQL database cluster dump
--

\connect postgres

--
-- Roles
--

CREATE ROLE postgres;
ALTER ROLE postgres WITH SUPERUSER INHERIT CREATEROLE CREATEDB LOGIN;
CREATE ROLE root;
ALTER ROLE root WITH SUPERUSER INHERIT CREATEROLE CREATEDB LOGIN;

--
-- Database creation
--

REVOKE ALL ON DATABASE template1 FROM PUBLIC;
REVOKE ALL ON DATABASE template1 FROM postgres;
GRANT ALL ON DATABASE template1 TO postgres;

\connect postgres

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  out of memory
DETAIL:  Failed on request of size 20.
pg_dump: The command was: LOCK TABLE public.foo IN ACCESS SHARE MODE
pg_dumpall: pg_dump failed on database "postgres", exiting

# SELECT * FROM foo LIMIT 1;
ERROR:  out of memory
DETAIL:  Failed on request of size 20.

I don't know if this is of any significance:

# \l
        List of databases
   Name    |  Owner   | Encoding
-----------+----------+-----------
 postgres  | postgres | SQL_ASCII
 template0 | postgres | SQL_ASCII
 template1 | postgres | SQL_ASCII
(3 rows)

# /usr/pkg/bin/psql -U postgres template1
psql: FATAL:  out of memory
DETAIL:  Failed on request of size 20.

What puzzles me is why the transaction log hasn't resulted in postgresql
being able to restore itself to a known clean state.

Darren
Darren Reed <darrenr@fastmail.net> writes:
> # /usr/pkg/bin/psql -U postgres template1
> psql: FATAL:  out of memory
> DETAIL:  Failed on request of size 20.

I'm starting to think there is something very broken about your
machine :-(.  Have you run any hardware diagnostics on it lately?
The level of flakiness you're seeing starts to suggest bad RAM to me.

Anyway, the above error should have also produced a map of per-context
memory usage in the postmaster log (ie, postmaster stderr).  If you
could show us that, it might be revealing.

			regards, tom lane
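If the postmaster's stderr isn't currently being kept anywhere, one way to
capture it (the paths here are only examples) is to start the server
through pg_ctl with a log file:

    pg_ctl -D /data/db -l /data/db/postmaster.log start
    # ... reproduce the failure, then look at the tail of the log:
    tail -100 /data/db/postmaster.log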
Tom Lane wrote:
> Darren Reed <darrenr@fastmail.net> writes:
> > # /usr/pkg/bin/psql -U postgres template1
> > psql: FATAL:  out of memory
> > DETAIL:  Failed on request of size 20.
>
> I'm starting to think there is something very broken about your
> machine :-(.  Have you run any hardware diagnostics on it lately?
> The level of flakiness you're seeing starts to suggest bad RAM to me.

No, I haven't run any diagnostics.

But I'm not convinced the hardware is a problem, because the flakiness has
only really been a problem when I started doing more than just inserts and
updates. The table that has shown the most problems (ifl) is a table of
work to do, so I'm inserting records, doing a select of random items out
of it and also deleting records (once the work is complete.) Multiple
processes can be trying to do all of these simultaneously, which should
not be anything out of the ordinary. Or maybe this workload is just making
the database stress the hardware more?

> Anyway, the above error should have also produced a map of per-context
> memory usage in the postmaster log (ie, postmaster stderr).  If you
> could show us that, it might be revealing.

I think what you're referring to is this:

TopMemoryContext: 32768 total in 3 blocks; 7776 free (8 chunks); 24992 used
Operator class cache: 8192 total in 1 blocks; 4936 free (0 chunks); 3256 used
TopTransactionContext: 8192 total in 1 blocks; 7856 free (0 chunks); 336 used
MessageContext: 267378688 total in 43 blocks; 1768 free (45 chunks); 267376920 used
smgr relation table: 8192 total in 1 blocks; 3904 free (0 chunks); 4288 used
Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
PortalMemory: 8192 total in 1 blocks; 8176 free (1 chunks); 16 used
Relcache by OID: 8192 total in 1 blocks; 3896 free (0 chunks); 4296 used
CacheMemoryContext: 253952 total in 5 blocks; 11880 free (1 chunks); 242072 used
pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_type_typname_nsp_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_type_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_statistic_relid_att_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_auth_members_member_role_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_auth_members_role_member_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 256 free (0 chunks); 768 used
pg_proc_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
pg_operator_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_opclass_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 256 free (0 chunks); 768 used
pg_namespace_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_namespace_nspname_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_language_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_language_name_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_inherits_relid_seqno_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_index_indexrelid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_authid_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_authid_rolname_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_database_datname_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_conversion_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_conversion_name_nsp_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_conversion_default_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
pg_class_relname_nsp_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_class_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
pg_cast_source_target_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_amproc_opc_proc_index: 1024 total in 1 blocks; 256 free (0 chunks); 768 used
pg_amop_opr_opc_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
pg_amop_opc_strat_index: 1024 total in 1 blocks; 256 free (0 chunks); 768 used
pg_amop_opc_strat_index: 1024 total in 1 blocks; 256 free (0 chunks); 768 used
pg_aggregate_fnoid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
MdSmgr: 8192 total in 1 blocks; 8152 free (0 chunks); 40 used
LockTable (locallock hash): 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
Timezones: 47592 total in 2 blocks; 5968 free (0 chunks); 41624 used
ErrorContext: 8192 total in 1 blocks; 8176 free (0 chunks); 16 used
ERROR:  out of memory
DETAIL:  Failed on request of size 20.
TopMemoryContext: 32768 total in 3 blocks; 7776 free (8 chunks); 24992 used

The standout problem is the "MessageContext" count.

Darren
On 10/15/07, Darren Reed <darrenr@fastmail.net> wrote:
> Tom Lane wrote:
> > Darren Reed <darrenr@fastmail.net> writes:
> > > # /usr/pkg/bin/psql -U postgres template1
> > > psql: FATAL:  out of memory
> > > DETAIL:  Failed on request of size 20.
> >
> > I'm starting to think there is something very broken about your
> > machine :-(.  Have you run any hardware diagnostics on it lately?
> > The level of flakiness you're seeing starts to suggest bad RAM to me.
>
> No, I haven't run any diagnostics.
>
> But I'm not convinced the hardware is a problem, because the flakiness
> has only really been a problem when I started doing more than just
> inserts and updates. The table that has shown the most problems (ifl)
> is a table of work to do, so I'm inserting records, doing a select of
> random items out of it and also deleting records (once the work is
> complete.) Multiple processes can be trying to do all of these
> simultaneously, which should not be anything out of the ordinary. Or
> maybe this workload is just making the database stress the hardware
> more?

So, I'm guessing you're doing something like:

select * from ifl order by random()

in several different threads? That means the table has to be materialized
twice in memory, and then most of the result thrown away.

Generally, the preferred way to do a random select from a large table is
to assign a random number to each row and then select a range based on
that number. So, if you have 100,000 rows, you assign the numbers 1
through 100,000 to the rows at random, then you select them using
something like a sequence to make sure the processes aren't bumping into
each other. If you're going to process 10 records at a time, you create a
sequence with an increment of 10 and use a select from it to get your
"random" rows to operate on.

Again, I'm kinda shooting in the dark here as you reveal more and more of
what you are doing a little at a time. A test case that can invoke this
failure would be most useful.
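A rough sketch of that scheme as it might look on 8.1 (which has no window
functions); every table, column and sequence name here is invented for the
example, and it assumes a plain scan over the sorted subquery keeps its row
order when the positions are handed out:

    -- one-off: copy the queue into a shuffled, numbered table
    CREATE SEQUENCE ifl_pos_seq;
    CREATE TABLE ifl_shuffled AS
        SELECT nextval('ifl_pos_seq') AS pos, s.*
          FROM (SELECT * FROM ifl ORDER BY random()) AS s;
    CREATE INDEX ifl_shuffled_pos_idx ON ifl_shuffled (pos);

    -- shared cursor: each worker claims the next block of 10 positions
    CREATE SEQUENCE ifl_cursor INCREMENT BY 10;

    -- run by each worker to fetch its slice of work
    SELECT *
      FROM ifl_shuffled
     WHERE pos >= (SELECT nextval('ifl_cursor'))
     ORDER BY pos
     LIMIT 10;

The ORDER BY random() is still a full sort, but it happens once per load
rather than on every fetch, and the per-worker query becomes a cheap index
range scan.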
Darren Reed <darrenr@fastmail.net> writes:
> Tom Lane wrote:
>> Anyway, the above error should have also produced a map of per-context
>> memory usage in the postmaster log (ie, postmaster stderr).  If you
>> could show us that, it might be revealing.

> MessageContext: 267378688 total in 43 blocks; 1768 free (45 chunks);
> 267376920 used

> The standout problem is the "MessageContext" count.

Indeed.  And there shouldn't even be anything in MessageContext until
the first client command has been received.  Maybe you have something
in ~/.psqlrc that you haven't told us about?

			regards, tom lane
Tom Lane wrote:
> Darren Reed <darrenr@fastmail.net> writes:
> >> Anyway, the above error should have also produced a map of per-context
> >> memory usage in the postmaster log (ie, postmaster stderr).  If you
> >> could show us that, it might be revealing.
>
> > MessageContext: 267378688 total in 43 blocks; 1768 free (45 chunks);
> > 267376920 used
>
> > The standout problem is the "MessageContext" count.
>
> Indeed.  And there shouldn't even be anything in MessageContext until
> the first client command has been received.  Maybe you have something
> in ~/.psqlrc that you haven't told us about?
>

That's easy - I don't even have one of these files!

Darren
Darren Reed <darrenr+postgres@fastmail.net> writes:
> Tom Lane wrote:
>> Indeed.  And there shouldn't even be anything in MessageContext until
>> the first client command has been received.  Maybe you have something
>> in ~/.psqlrc that you haven't told us about?

> That's easy - I don't even have one of these files!

Then the behavior you showed is impossible ;-)

There is *something* that is completely broken about your machine, and
the rest of us really don't have enough context to tell what.  You
haven't told us anything about the hardware or operating system, or how
you built or obtained the Postgres executables.

I don't think you should dismiss the possibility of a hardware problem,
especially since the failures aren't 100% reproducible (AFAICT from your
previous remarks).  We've seen more than one case where Postgres
stressed a system more than anything else that was being run, and
thereby exposed hardware problems that didn't manifest otherwise.  For
instance, a bit of bad RAM up near the end of physical memory might not
get used at all until Postgres starts eating up memory.

Another line of thought is that you built Postgres with a buggy compiler
and thereby got buggy executables.  Have you tried running the PG
regression tests?

			regards, tom lane
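One quick, low-effort way to answer part of the "how was it built" question is
to ask the server itself; version() reports the server version together with
the compiler and platform string recorded at build time:

-- Shows the running server version and the compiler/platform it was built with.
SELECT version();

This doesn't replace running the regression tests against the pkgsrc build,
but it at least pins down exactly which binaries are in use.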
Scott Marlowe wrote:
> ...
>
> Again, I'm kinda shooting in the dark here as you reveal more and more
> what you are doing a little at a time.  A test case that can invoke
> this failure would be most useful.
>

After seeing this today:

ERROR: duplicate key violates unique constraint "ers_pkey"
ERROR: duplicate key violates unique constraint "foo_pkey"
ERROR: duplicate key violates unique constraint "foo_pkey"
ERROR: duplicate key violates unique constraint "foo_pkey"
ERROR: duplicate key violates unique constraint "foo_pkey"
ERROR: duplicate key violates unique constraint "foo_pkey"
ERROR: could not open segment 1 of relation 1663/10793/2659 (target block 858862642): No such file or directory
ERROR: could not open segment 1 of relation 1663/10793/2659 (target block 858862642): No such file or directory
ERROR: could not open segment 1 of relation 1663/10793/2659 (target block 858862642): No such file or directory
ERROR: could not open segment 1 of relation 1663/10793/2659 (target block 858862642): No such file or directory
...

...there was little or no activity during this time, apart from some
inserts, maybe some selects, etc. Nothing that should have caused this
kind of upset.

There is a file that matches this:
-rw-------  1 postgres  wheel  57344 Oct 14 22:57 /data/db/second/base/10793/2659
but it isn't in the directory where I moved most of the indexes to:
ls /data/index/ext/10793/
16390  16397  16399  16406  16407  16410  16414  16425  16434  16435
I don't know if the file numbers have any meaning?

But in addition, the list of tables (\dp) is now fubar'd.

I'm starting to wonder if it is a combination of:
- the operating system (NetBSD 4.99.20)
- the hardware (small HP box, not meant for hard work like this but
  shouldn't be impossible for it)
- the way pkgsrc compiles postgresql for NetBSD

I'm shying away from the hardware (or at least RAM/CPU) because I'd expect
some other kinds of faults to show up, ultimately leading to a panic due to
random corruption of some kernel data structure. As it is, everything else
seems to be functioning ok.

Darren
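As for whether the file numbers mean anything: in "relation 1663/10793/2659"
they are the tablespace OID, the database OID and the relfilenode. A generic
catalog lookup along these lines, run inside the affected database, would show
which relation relfilenode 2659 belongs to; this query is illustrative and not
taken from the thread:

-- Map the numbers in "relation 1663/10793/2659" back to catalog entries.
-- 1663 is the tablespace OID, 10793 the database OID, 2659 the relfilenode.
SELECT c.oid, c.relname, c.relkind, t.spcname
FROM pg_class c
LEFT JOIN pg_tablespace t ON t.oid = c.reltablespace
WHERE c.relfilenode = 2659;

A reltablespace of 0 (NULL spcname here) simply means the relation lives in
the database's default tablespace, which would explain why the file is under
/data/db/second/base rather than the ext tablespace directory.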
On 10/15/07, Darren Reed <darrenr+postgres@fastmail.net> wrote:
> Scott Marlowe wrote:
> > ...
> >
> > Again, I'm kinda shooting in the dark here as you reveal more and more
> > what you are doing a little at a time.  A test case that can invoke
> > this failure would be most useful.
> >
> After seeing this today:
> ERROR: duplicate key violates unique constraint "ers_pkey"
> ERROR: duplicate key violates unique constraint "foo_pkey"
> ERROR: duplicate key violates unique constraint "foo_pkey"
> ERROR: duplicate key violates unique constraint "foo_pkey"
> ERROR: duplicate key violates unique constraint "foo_pkey"
> ERROR: duplicate key violates unique constraint "foo_pkey"
> ERROR: could not open segment 1 of relation 1663/10793/2659 (target
> block 858862642): No such file or directory
> ERROR: could not open segment 1 of relation 1663/10793/2659 (target
> block 858862642): No such file or directory
> ERROR: could not open segment 1 of relation 1663/10793/2659 (target
> block 858862642): No such file or directory
> ERROR: could not open segment 1 of relation 1663/10793/2659 (target
> block 858862642): No such file or directory
> ...
>
> ...there was little or no activity during this time, apart from
> some inserts, maybe some selects, etc. Nothing that should
> have caused this kind of upset.
>
> There is a file that matches this:
> -rw------- 1 postgres wheel 57344 Oct 14 22:57
> /data/db/second/base/10793/2659
> but it isn't in the directory where I moved most of the indexes to:
> ls /data/index/ext/10793/
> 16390 16397 16399 16406 16407 16410 16414 16425 16434 16435

How, exactly, did you move those indexes?
Darren Reed <darrenr+postgres@fastmail.net> writes:
> I'm starting to wonder if it is a combination of:
> - the operating system (NetBSD 4.99.20)

Um ... what was your motivation for choosing that?  According to
http://www.netbsd.org/releases/release-map.html
a .99 release number signifies "an alpha quality distribution.  It isn't
even guaranteed to compile."

It looks like NetBSD 4 is currently up to an RC2 release, which is
probably not what you've got there ... but even if you were running the
RC2 code I'd question the sanity of insisting on a back-rev Postgres
release on top of a bleeding-edge operating system.

			regards, tom lane
Scott Marlowe wrote:
> On 10/15/07, Darren Reed <darrenr+postgres@fastmail.net> wrote:
> > Scott Marlowe wrote:
> > > ...
> > >
> > > Again, I'm kinda shooting in the dark here as you reveal more and more
> > > what you are doing a little at a time.  A test case that can invoke
> > > this failure would be most useful.
> > >
> > After seeing this today:
> > ERROR: duplicate key violates unique constraint "ers_pkey"
> > ERROR: duplicate key violates unique constraint "foo_pkey"
> > ERROR: duplicate key violates unique constraint "foo_pkey"
> > ERROR: duplicate key violates unique constraint "foo_pkey"
> > ERROR: duplicate key violates unique constraint "foo_pkey"
> > ERROR: duplicate key violates unique constraint "foo_pkey"
> > ERROR: could not open segment 1 of relation 1663/10793/2659 (target
> > block 858862642): No such file or directory
> > ERROR: could not open segment 1 of relation 1663/10793/2659 (target
> > block 858862642): No such file or directory
> > ERROR: could not open segment 1 of relation 1663/10793/2659 (target
> > block 858862642): No such file or directory
> > ERROR: could not open segment 1 of relation 1663/10793/2659 (target
> > block 858862642): No such file or directory
> > ...
> >
> > ...there was little or no activity during this time, apart from
> > some inserts, maybe some selects, etc. Nothing that should
> > have caused this kind of upset.
> >
> > There is a file that matches this:
> > -rw------- 1 postgres wheel 57344 Oct 14 22:57
> > /data/db/second/base/10793/2659
> > but it isn't in the directory where I moved most of the indexes to:
> > ls /data/index/ext/10793/
> > 16390 16397 16399 16406 16407 16410 16414 16425 16434 16435
>
> How, exactly, did you move those indexes?
>

With "ALTER TABLE".

Since then I recreated the database, and after merging a chunk of data
I see this:

(various errors about duplicate keys and values too wide for fields...)
ERROR: duplicate key violates unique constraint "t_a_pkey"
LOG: unexpected EOF on client connection
ERROR: value too long for type character(12)
LOG: unexpected EOF on client connection
LOG: unexpected EOF on client connection
LOG: could not send data to client: Broken pipe
LOG: unexpected EOF on client connection
ERROR: relation "t_a" does not exist

I've attached the commands I've used to create the schema. It's called
sanitised because I've culled all of the fields that aren't used. This
script was used to create the database that I then imported records into
before seeing the above.
Darren

CREATE TABLESPACE ext LOCATION '/data/index/ext';

CREATE ROLE root;

CREATE LANGUAGE 'plpgsql' HANDLER plpgsql_call_handler LANCOMPILER 'PL/pgSQL';

CREATE TABLE t_a (
	int1	INTEGER NOT NULL PRIMARY KEY,
	str1	CHARACTER(20),
	bool1	boolean
);

CREATE VIEW a_v1 AS
	SELECT DISTINCT(str1),bool1 FROM t_a WHERE bool1 IS TRUE ORDER BY str1;

CREATE VIEW a_v2 AS
	SELECT distinct(str1),count(*) FROM t_a GROUP BY t_a.str1;

CREATE TABLE t_b (
	int1	INTEGER NOT NULL,
	str2	CHARACTER VARYING(20) NOT NULL,
	bool1	BOOLEAN
);

CREATE TABLE t_c (
	str1	CHAR(20) NOT NULL,
	str2	VARCHAR(20) NOT NULL PRIMARY KEY
);

CREATE TABLE ifl (
	recno	SERIAL PRIMARY KEY,
	int1	INTEGER NOT NULL
);

DROP FUNCTION add_str1tot_a();
CREATE OR REPLACE FUNCTION add_str1tot_a() RETURNS TRIGGER AS $ptot_a$
DECLARE
	temp VARCHAR(20);
BEGIN
	SELECT p.str1 INTO temp FROM t_c p WHERE p.str2=NEW.str2;
	UPDATE t_a SET str1=temp WHERE str1 IS NULL AND int1=NEW.int1;
	RETURN NULL;
END;
$ptot_a$ LANGUAGE plpgsql;

DROP FUNCTION sett_astr1bool1();
CREATE OR REPLACE FUNCTION sett_astr1bool1() RETURNS TRIGGER as $sepi$
DECLARE
	ig BOOLEAN;
BEGIN
	IF NEW.str1 IS NOT NULL AND NEW.bool1 IS NULL THEN
		SELECT b.bool1 INTO ig FROM a_v1 b WHERE b.str1=NEW.str1;
		IF ig IS NOT NULL THEN
			UPDATE t_a SET bool1=ig WHERE int1=NEW.int1;
		END IF;
	END IF;
	RETURN NULL;
END;
$sepi$ LANGUAGE plpgsql;

DROP FUNCTION sett_abool1();
CREATE OR REPLACE FUNCTION sett_abool1() RETURNS TRIGGER as $sei$
DECLARE
	temp BOOLEAN;
	temp2 CHAR(20);
BEGIN
	SELECT b.bool1 INTO temp FROM badt_b b
		WHERE (b.str2=NEW.str2) AND (bool1 IS NOT NULL);
	IF temp IS NOT NULL THEN
		UPDATE t_b SET bool1=temp WHERE str2=NEW.str2;
	END IF;
	SELECT t.str1 INTO temp2 FROM t_a t WHERE t.int1=NEW.int1;
	IF temp2 IS NULL THEN
		SELECT u.str1 INTO temp2 FROM t_c u WHERE u.str2=NEW.str2;
		IF temp2 IS NOT NULL THEN
			IF temp IS NOT NULL THEN
				UPDATE t_a SET str1=temp2,bool1=temp WHERE int1=NEW.int1;
			ELSE
				UPDATE t_a SET str1=temp2 WHERE int1=NEW.int1;
			END IF;
		ELSE
			IF temp IS NOT NULL THEN
				UPDATE t_a SET bool1=temp WHERE int1=NEW.int1;
			END IF;
		END IF;
	ELSE
		IF temp IS NOT NULL THEN
			UPDATE t_a SET bool1=temp WHERE int1=NEW.int1;
		END IF;
	END IF;
	RETURN NULL;
END;
$sei$ LANGUAGE plpgsql;

CREATE INDEX t_a_str1 ON t_a USING btree (str1) TABLESPACE ext;
CREATE INDEX str2_index ON t_b(str2);
CREATE INDEX t_b_int1_index ON t_b(int1);
CREATE INDEX t_c_str1_idx ON t_c(str1) TABLESPACE ext;
CREATE INDEX t_c_str2_idx ON t_c(str2) TABLESPACE ext;
ALTER INDEX t_c_pkey SET TABLESPACE ext;
ALTER INDEX ifl_pkey SET TABLESPACE ext;
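The sanitised script defines the trigger functions but the CREATE TRIGGER
statements that attach them were evidently culled. A hedged sketch of how they
might be wired up, inferring each target table from the NEW.* columns the
function reads; the trigger names and the AFTER INSERT/UPDATE timing are
assumptions, not part of Darren's schema:

-- Hypothetical trigger definitions; these are NOT in the sanitised script.
-- Target tables are inferred from the columns each function references
-- (add_str1tot_a and sett_abool1 read NEW.str2, which only t_b has).
CREATE TRIGGER t_b_add_str1
	AFTER INSERT ON t_b
	FOR EACH ROW EXECUTE PROCEDURE add_str1tot_a();

CREATE TRIGGER t_a_set_bool1
	AFTER INSERT OR UPDATE ON t_a
	FOR EACH ROW EXECUTE PROCEDURE sett_astr1bool1();

CREATE TRIGGER t_b_set_bool1
	AFTER INSERT ON t_b
	FOR EACH ROW EXECUTE PROCEDURE sett_abool1();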
For better or worse, it seems to be behaving itself again for a while...

There was, however, one change to my procedure and that was to drop the
triggers/functions when restoring the data (using copy into).

Cheers,
Darren
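Dropping the triggers avoids firing them once for every row during the bulk
COPY; 8.1 can also disable them in place, which saves recreating them
afterwards. A sketch of that alternative, using a table name from the
sanitised schema and an assumed dump-file path:

-- Alternative to dropping triggers for a bulk load: disable them around the
-- COPY.  The file path here is an assumption for illustration only.
ALTER TABLE t_b DISABLE TRIGGER USER;
COPY t_b (int1, str2, bool1) FROM '/data/dump/t_b.copy';
ALTER TABLE t_b ENABLE TRIGGER USER;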
Darren Reed wrote:
> For better or worse, it seems to be behaving itself again for a while...
>
> There was, however, one change to my procedure and that was to
> drop the triggers/functions when restoring the data (using copy into).

Perhaps I spoke too soon...

# /usr/pkg/bin/pg_dumpall -O -U postgres -a > /data/dumpall
pg_dump: [archiver (db)] connection to database "template1" failed:
FATAL: out of memory
DETAIL: Failed on request of size 20.
pg_dumpall: pg_dump failed on database "template1", exiting

...for better or worse, I'm interfacing to postgresql using perl's DBI
interface. Are there likely to be any issues from that? (Although I can't
see how a client should be able to cause the server to become corrupt
unless there's a bug in the server.)

Darren
On 10/17/07, Darren Reed <darrenr@fastmail.net> wrote:
> Darren Reed wrote:
> > For better or worse, it seems to be behaving itself again for a while...
> >
> > There was, however, one change to my procedure and that was to
> > drop the triggers/functions when restoring the data (using copy into).
>
> Perhaps I spoke too soon...
>
> # /usr/pkg/bin/pg_dumpall -O -U postgres -a > /data/dumpall
> pg_dump: [archiver (db)] connection to database "template1" failed:
> FATAL: out of memory
> DETAIL: Failed on request of size 20.
> pg_dumpall: pg_dump failed on database "template1", exiting
>
> ...for better or worse, I'm interfacing to postgresql using perl's DBI
> interface.
> Are there likely to be any issues from that? (Although I can't see how
> a client should be able to cause the server to become corrupt unless
> there's a bug in the server.)

Perl and DBI are fine.

You really need to stop using an alpha release of an OS though. There's
no way to know if the problems you're experiencing are coming from that
or something else. Once you're on a stable OS release, start testing
again and let us know if anything breaks.

But like Tom likes to say, if postgresql breaks running on an alpha OS
release, you get to keep both pieces.