Thread: Help - corruption issue?
While doing a pg_dump, I seem to have a problem:

ERROR: invalid memory alloc request size 4294967293

Upon googling, this seems to be a data corruption issue!

(It came about while doing performance tuning, as discussed on the
PG-PERFORMANCE list:
http://postgresql.1045698.n5.nabble.com/REINDEX-takes-half-a-day-and-still-not-complete-td4005943.html )

One of the older messages suggests that I do a "file level backup and
restore the data":
http://archives.postgresql.org/pgsql-admin/2008-05/msg00191.php

How does one do this -- should I copy the data folder? What are the
specific steps?

I'm on PG 8.2.9, CentOS 5, with 8GB of RAM. The disks are four SATAII
disks in RAID 1.

Thanks!
2011/4/18 Phoenix Kiula <phoenix.kiula@gmail.com>:
> While doing a pg_dump, I seem to have a problem:
> ERROR: invalid memory alloc request size 4294967293
> [...]

Phoenix,

how large (in total) is this database?

Can you copy (cp -a) the data directory somewhere? I would do this
just in case :-)

Regarding the manual recovery process:

1. You'll have to isolate the corrupted table. You can do this by
dumping all tables one by one (pg_dump -t TABLE) until you get the
error.

2. Find the record which is corrupted... an approach like this might
work: select count(*) from the_corrupted_table where PK_column <=
some_value.

3. You should try to dump the table in chunks, skipping the corrupted
row(s) if possible (see the sketch below).

4. If the above method does not work, you can try manually hex-editing
(zeroing) some bytes (with postgres shut down) to make the dump work
again.

PS. Obligatory note: 8.2.9 release date: 2008-06-12; 8.2.21 release
date: 2011-04-18. It seems you have been running almost three years
without bugfixes. Aside from fixing your current problem, I would first
do the upgrade to avoid more corruption.
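PPS. For step 3, a minimal sketch of the chunked dump. It assumes a
bigint primary key named "id" on a table named "links" -- adjust both
to your schema. 8.2 supports COPY (SELECT ...) TO an absolute path when
run as superuser:

#!/bin/bash
# Dump the table in 1M-row id ranges. A range containing a corrupted
# row fails on its own, so you can narrow it down or skip it later.
STEP=1000000
MAX=130000000           # something a bit above max(id)
for ((lo = 0; lo < MAX; lo += STEP)); do
  hi=$((lo + STEP))
  psql -d MYDB -c "COPY (SELECT * FROM links WHERE id >= $lo AND id < $hi) TO '/backup/links_$lo.copy'" \
    || echo "range $lo..$hi failed" >> bad_ranges.log
done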
Thanks Filip.

I know which table it is. It's my largest table, with over 125 million
rows. All the others are less than 100,000 rows; most are in fact less
than 25,000.

Now, which specific part of the table is corrupted? If it is row data,
can I dump specific parts of that table? How? pg_dumpall does not seem
to have an option for a "WHERE" clause.

If the lead index is corrupt, then issuing a reindex should work. So I
disconnected all other users. The DB was doing nothing. Then I started
a psql session and issued the command "reindex database MYDB". After 3
hours, I see this error:

[QUOTE]
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: WARNING:
terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
Failed.
!>
[/UNQUOTE]

What am I to do now? Even reindex is not working. I can try to drop
the indexes and create them again. Will that help?

2011/4/18 Filip Rembiałkowski <plk.zuber@gmail.com>:
> regarding the manual recovery process:
> [...]
2011/4/18 Phoenix Kiula <phoenix.kiula@gmail.com>:
> What am I to do now? Even reindex is not working. I can try to drop
> the indexes and create them again. Will that help?

It might help, but as someone already pointed out, you're running a
version that's 3 years old. So do a cold file backup (stop the db and
copy the data directory to another machine), check the hardware
(especially the RAID controller and RAM), upgrade to the latest 8.2.x
version and then try again.

I'll post a bit more info in the other thread, as it's related to the
reindex performance and not to this issue.

regards
Tomas
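PS: a minimal sketch of that cold file-level backup. The paths assume a
stock CentOS install (PGDATA in /var/lib/pgsql/data) -- adjust to your
layout:

# as root, with no clients connected
service postgresql stop
cp -a /var/lib/pgsql/data /backup/pgdata-$(date +%Y%m%d)
# or, to copy it to another machine:
rsync -a /var/lib/pgsql/data/ otherhost:/backup/pgdata/
service postgresql start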
2011/4/18 Phoenix Kiula <phoenix.kiula@gmail.com>:
> What am I to do now? Even reindex is not working. I can try to drop
> indexes and create them again. Will that help?

It might. Take a full file system backup first and drop the indexes.
Before recreating them, take a regular dump (with pg_dump); if it goes
through, you're golden: rebuild the indexes, *update the postmaster to
the latest 8.2*, and you can go back online. (Sketched below.)

merlin
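In outline, that sequence might look like the following -- a sketch,
not a script to run blindly; the database and index names here are
placeholders:

# 1. cold file backup first (see earlier post), then start postgres
# 2. drop the indexes on the suspect table, one by one:
psql -d MYDB -c "DROP INDEX some_index_name;"
# 3. try a regular dump; success means the heap data is readable:
pg_dump -Fc MYDB > /backup/mydb.dump
# 4. if it went through, recreate the indexes:
psql -d MYDB -c "CREATE INDEX some_index_name ON links (user_id);"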
On Mon, Apr 18, 2011 at 11:02 PM, <tv@fuzzy.cz> wrote:
> It might help, but as someone already pointed out, you're running a
> version that's 3 years old. So do a cold file backup [...] upgrade to
> the latest 8.2.x version and then try again.

Thanks. For CentOS (RedHat?) the latest is 8.2.19, right? Not the
8.2.20 that's mentioned on the front page of PG.org.
http://www.pgrpms.org/8.2/redhat/rhel-4-i386/repoview/

Question: will upgrading from 8.2.9 to 8.2.19 have repercussions in
terms of huge changes or problems? I know 9.x had some new additions
including "casting" etc (or is that irrelevant to me?), but if 8.2.19
is safe in terms of not requiring anything new from my side, then I
can do the upgrade quickly.

Welcome any advice. Thanks!
Dne 18.4.2011 20:27, Phoenix Kiula napsal(a):
> Thanks. For CentOS (RedHat?) the latest is 8.2.19, right? Not the
> 8.2.20 that's mentioned on the front page of PG.org.

CentOS packages probably lag a bit behind the upstream release. If you
want to stick with the binary packages, go with 8.2.19.

> http://www.pgrpms.org/8.2/redhat/rhel-4-i386/repoview/
>
> Question: will upgrading from 8.2.9 to 8.2.19 have repercussions in
> terms of huge changes or problems?

Those minor versions are mostly bugfixes and small improvements. So no,
I wouldn't expect huge problems.

> I know 9.x had some new additions including "casting" etc (or is that
> irrelevant to me?), but if 8.2.19 is safe in terms of not requiring
> anything new from my side, then I can do the upgrade quickly.

Don't do that right now. For a 'minor' upgrade you don't need to
dump/restore the database - you can just replace the binaries and it
should work, as the file format does not change between minor versions
(and 8.2.9 -> 8.2.19 is a minor upgrade).

Still, do the file backup as described in the previous posts. You could
even do an online backup using pg_start_backup()/pg_stop_backup() etc.

To upgrade from 8.2 to 9.0 you'd need to do a pg_dump backup and then
restore the database. Which is out of scope right now, I guess.
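For reference, the online variant looks roughly like this. Note it is
only restorable if WAL archiving (archive_command) is already set up,
and on 8.2 the functions are pg_start_backup()/pg_stop_backup() -- a
sketch:

psql -d MYDB -c "SELECT pg_start_backup('pre-repair');"
rsync -a /var/lib/pgsql/data/ /backup/pgdata/   # server keeps running
psql -d MYDB -c "SELECT pg_stop_backup();"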
On Mon, Apr 18, 2011 at 5:44 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > > Still, do the file backup as described in the previous posts. You could > even do an online backup using pg_backup_start/pg_backup_stop etc. As soon as you have a working file system backup, get the tw_cli utility for the 3ware cards downloaded and LOOK at what it has to say about your RAID controller, drives, and array health.
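For example (the controller numbering varies, so check "tw_cli show"
first; treat these as a sketch and see your card's CLI guide):

tw_cli show              # list controllers, e.g. c0
tw_cli /c0 show          # unit, port and BBU status at a glance
tw_cli /c0 show all      # verbose: rebuild/verify state, error counts
# per-disk SMART data, if smartmontools is installed (the device name,
# /dev/twe0 vs /dev/twa0, depends on the driver):
smartctl -a -d 3ware,0 /dev/twe0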
On Tue, Apr 19, 2011 at 8:35 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> As soon as you have a working file system backup, get the tw_cli
> utility for the 3ware cards downloaded and LOOK at what it has to say
> about your RAID controller, drives, and array health.

I am with SoftLayer. They're a very professional bunch. They even
changed my BBU last night. The RAID card is working. The memory and
the hardware are also tested.

I have now upgraded to 8.2.19.

Then I restarted the server and dropped the indexes. When I recreate
the first index, the same thing happens:

------
# CREATE INDEX idx_links_userid ON links (user_id);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
------

There is nothing going on in the server other than this command. All
other users are blocked!

Logging is enabled but does not have anything!

I am now worried. What is this problem?
On Mon, Apr 18, 2011 at 8:52 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> I am with SoftLayer. They're a very professional bunch. They even
> changed my BBU last night. The RAID card is working. The memory and
> the hardware are also tested.

So, RAID is good for sure? As in, someone logged into the machine, went
to the tw_cli utility and asked it about the status of the physical
drives and virtual RAID array, and the card said yes, they're good? No
bad sectors being remapped?

Hmmm. One of my old tests when things were acting up was to see if the
server could compile the Linux kernel or pgsql, back when that took 1.5
hours. If you keep getting sig 11s on production kernel compiles,
something's wrong with the system, software or hardware.

> I have now upgraded to 8.2.19.
>
> Then I restarted the server and dropped the indexes. When I recreate
> the first index, the same thing happens:
>
> ------
> # CREATE INDEX idx_links_userid ON links (user_id);
> server closed the connection unexpectedly
> ------

What do the Postgresql logs say at this time? oh wait...

> There is nothing going on in the server other than this command. All
> other users are blocked!
>
> Logging is enabled but does not have anything!

System logs maybe? Something about a process getting killed? Have
you tried turning up the verbosity of the pg logs?

> I am now worried. What is this problem?

We gotta check one thing at a time really.

If you copy the dir off to another machine and run pgsql 8.2.latest or
thereabouts, can you then create the index?
> System logs maybe? Something about a process getting killed? Have
> you tried turning up the verbosity of the pg logs?

Syslog has to be compiled with PG? How do I enable it? Where should I
look for it?

The documentation, whenever it mentions "syslog", just assumes the
condition "if syslog is enabled". Well, where do I enable it?
http://www.postgresql.org/docs/8.2/static/runtime-config-logging.html

Would appreciate some guidance on this.

> We gotta check one thing at a time really.
>
> If you copy the dir off to another machine and run pgsql 8.2.latest or
> thereabouts, can you then create the index?

I will try this. Transferring 106GB of data, even zipped, is a huge
ask, and just managing it will take over a day or so. I was hoping we
could do without this.
On Mon, Apr 18, 2011 at 9:23 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
>> System logs maybe? Something about a process getting killed? Have
>> you tried turning up the verbosity of the pg logs?
>
> Syslog has to be compiled with PG? How do I enable it? Where should I
> look for it?

No, I meant the system logs, the ones in /var/log/yadayada. Like
/var/log/messages, things like that. See if any of them have anything
interesting happening when things go badly.

syslog is the logging facility that puts logs from various processes
into the /var/log dir, like /var/log/pgsql. Assuming you have a stock
RHEL install, I'd expect the pgsql logs to be in /var/log/pgsql or
thereabouts.

>> If you copy the dir off to another machine and run pgsql 8.2.latest or
>> thereabouts, can you then create the index?
>
> I will try this. Transferring 106GB of data, even zipped, is a huge
> ask and just the management will take over a day or so. I was hoping
> we could do without this.

On a fast network it should only take a few minutes. Now, rsyncing
live 2.4 TB databases, that takes time. :) Your raptors, if they're
working properly, should be able to transfer at around 80 to 100
megabytes a second: 10 to 15 seconds a gig, 30 minutes or so via gig
ethernet. I'd run iostat and see how well my drive array was
performing during a large, largely sequential copy.
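If the Postgres log really is empty, it may also be worth raising the
logging settings in postgresql.conf. A sketch using the 8.2-era
parameter names (reload or restart after editing):

# postgresql.conf
log_destination = 'stderr'     # or 'syslog' to go through /var/log
redirect_stderr = on           # capture stderr into log files...
log_directory = 'pg_log'       # ...under $PGDATA/pg_log
log_min_messages = debug1      # temporarily crank up verbosity
log_error_verbosity = verbose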
> On a fast network it should only take a few minutes. [...] I'd run
> iostat and see how well my drive array was performing during a large,
> largely sequential copy.

OK. An update.

We have changed all the hardware except the disks. REINDEX still gave
this problem:

--
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
--

So I rebooted and logged back in, in single-user mode. All services
stopped. All networking stopped. Only postgresql started. I tried the
REINDEX again.

Same problem :(

This means the problem is likely with the data?

I do have a "pg_dumpall" dump from 1 day before. I will lose some
data, but should have most of it.

Is it worth it for me to try and restore from there? What's the best
thing to do right now?
Dne 20.4.2011 12:56, Phoenix Kiula napsal(a):
> OK. An update.
>
> We have changed all the hardware except the disks.

OK, so the card is working and the drives are fine. Have you run the
tw_cli tool to check the drives? Because it's probably the last thing
that might be faulty and was not replaced.

> REINDEX still gave this problem:
>
> --
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> --

Hm, have you checked if there's something else in the logs? More
details about the crash or something like that.

I'd probably try to run strace on the backend, to get more details
about where it crashes. Just find out the PID of the backend dedicated
to your psql session, do

$ strace -p PID > crash.log 2>&1

and then run the REINDEX. Once it crashes, you can look at the last few
lines of the logfile.

> So I rebooted and logged back in, in single-user mode. All services
> stopped. All networking stopped. Only postgresql started. I tried the
> REINDEX again.
>
> Same problem :(
>
> This means the problem is likely with the data?

Well, maybe. It might be a problem with the data, it might be a bug in
postgres ...

> I do have a "pg_dumpall" dump from 1 day before. I will lose some
> data, but should have most of it.
>
> Is it worth it for me to try and restore from there? What's the best
> thing to do right now?

So have you done the file backup? That's the first thing I'd do.
Anyway, what's best depends on how important the missing piece of data
is. We still don't know how to fix the problem, but it sure seems like
corrupted data.

I think you already know which table is corrupted, right? In that case
you may actually try to find the bad block and erase it (and maybe keep
a copy, so that we can see what's wrong with it and how it might have
happened). There's a very nice guide on how to do that:

http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html

It sure seems like the problem you have (invalid alloc request etc.).
The really annoying part is locating the block, as you have to scan
through the table (which sucks with such a big table).

And yes, if there's corruption, there might be more corrupted blocks.

regards
Tomas
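PS: the core of that guide, in sketch form (table/database names are
placeholders). Once a bad block number N is known, you can zero exactly
that 8 kB page -- only with postgres stopped, and only after the file
backup:

-- while the server is still up, find the file that holds the table:
SELECT relfilenode FROM pg_class WHERE relname = 'links';
SELECT oid FROM pg_database WHERE datname = 'MYDB';

# then, with postgres STOPPED, zero block N of that file:
dd if=/dev/zero conv=notrunc bs=8192 count=1 seek=N \
   of=/var/lib/pgsql/data/base/<db_oid>/<relfilenode>

Note that tables over 1 GB are split into 1 GB segment files
(relfilenode, relfilenode.1, relfilenode.2, ...), so with a table this
big you have to seek within the right segment.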
Dne 20.4.2011 22:11, Tomas Vondra napsal(a):
> There's a very nice guide on how to do that:
>
> http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html
>
> And yes, if there's corruption, there might be more corrupted blocks.

BTW, there's a setting 'zero_damaged_pages' that might help with this:

http://www.postgresql.org/docs/8.2/static/runtime-config-developer.html

See this talk for more details on how to use it:

http://www.casitconf.org/casitconf11/Tech_track_2_files/cascadia_postgres_rbernier.pdf

Anyway, don't play with this without the file backup, as it will zero
the blocks.

Tomas
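PS: usage is roughly the following (superuser only). A damaged page is
zeroed in memory as it is read, so the rows on it are simply gone from
the query's point of view -- hence the backup-first warning. A sketch:

SET zero_damaged_pages = on;
-- reads that hit a damaged page now warn and treat it as empty,
-- instead of erroring out:
SELECT count(*) FROM links;
SET zero_damaged_pages = off;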
On Thu, Apr 21, 2011 at 7:27 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> BTW, there's a setting 'zero_damaged_pages' that might help with this
> [...]
> Anyway, don't play with this without the file backup, as it will zero
> the blocks.

Thanks Tomas. Very handy info.

FIRST: is there anyone on this list who offers PG admin support?
Please write to me directly.

Second, for the strace, which process should I use?

ps auxwww | grep ^postgres
postgres  4320  0.0  0.1 440192 10824 ?      Ss  08:49  0:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres  4355  0.0  0.0  11724   964 ?      Ss  08:49  0:00 postgres: logger process
postgres  4365  0.0  0.0 440396  3268 ?      Ss  08:49  0:00 postgres: writer process
postgres  4366  0.0  0.0  11860  1132 ?      Ss  08:49  0:00 postgres: stats collector process
postgres 15795  0.0  0.0   7136  1440 pts/0  S   22:44  0:00 -bash
postgres 15900  0.0  0.0   7860  1956 pts/0  S+  22:44  0:00 psql -h localhost MYDOMAIN -E MYDOMAIN_MYDOMAIN
postgres 15901  0.0  0.0 441124  3072 ?      Ss  22:44  0:00 postgres: MYDOMAIN_MYDOMAIN MYDOMAIN 127.0.0.1(34346) idle

Third, I have the backup in two ways:

1. I took a backup of the entire "/pgsql/data" folder. PG was shut down
at the time.
2. I have a pg_dumpall file, but it is missing one day's data (still
useful as a last resort).

Will #1 have corrupt data in it?
On Thu, Apr 21, 2011 at 11:49 AM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> Second, for the strace, which process should I use?
> [...]

Tomas,

I did a crash log with strace on the PID of the index command, as you
suggested.

Here's the output:
http://www.heypasteit.com/clip/WNR

Also including it below, but because this will wrap etc., you can look
at the link above.

Thanks for any ideas or pointers!
Process 15900 attached - interrupt to quit
read(0, "r", 1)                         = 1
write(1, "r", 1)                        = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
read(0, "e", 1)                         = 1
write(1, "e", 1)                        = 1
[... hundreds of similar one-byte read/write pairs: psql echoing
"reindex lin" being typed, then backspaced, then "\d links" ...]
write(1, "********* QUERY **********\n", 27) = 27
write(1, "SELECT c.oid,\n  n.nspname,\n  c.r"..., 207) = 207
send(3, "Q\0\0\0\342SELECT c.oid,\n  n.nspname,\n"..., 227, 0) = 227
[... psql runs the \d catalog queries and prints the definition of the
"links" table: its columns, the links2_pkey / links2_alias_key /
new_idx_userknown indexes, and the check constraints ...]
write(1, "MYDOMAIN=# ", 10)             = 10
[... more one-byte echoes while
"CREATE INDEX idx_links_userid ON links (user_id);" is typed ...]
send(3, "Q\0\0\0006CREATE INDEX idx_links_user"..., 55, 0) = 55
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "", 16384, 0)                   = 0
write(2, "server closed the connection une"..., 137) = 137
write(2, "The connection to the server was"..., 57) = 57
[... psql tries to reconnect to 127.0.0.1:5432 ...]
recv(3, "E\0\0\0aSFATAL\0C57P03\0Mthe database"..., 16384, 0) = 98
write(2, "Failed.\n", 8)                = 8
write(1, "!> ", 3)                      = 3
read(0, <unfinished ...>
Process 15900 detached
Dne 21.4.2011 07:16, Phoenix Kiula napsal(a):
> I did a crash log with strace on the PID of the index command, as you
> suggested.
> [...]
> Process 15900 attached - interrupt to quit

Nope, that's the "psql" process - you need to attach to the backend
process that's created to handle the connection. Whenever you create a
connection (from psql), a new backend process is forked to handle that
single connection - this is the process you need to strace.

You can either see it in 'ps ax' (the PID is usually +1 with respect
to the psql process), or you can do

SELECT pg_backend_pid();

as that will give you the PID of the backend for the current
connection.

regards
Tomas
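PS: concretely, with two terminals it looks something like this (a
sketch; the PID is whatever your own session reports):

-- terminal 1, in psql:
SELECT pg_backend_pid();    -- say it returns 17161

# terminal 2, as root:
strace -p 17161 -o /root/crash.log

-- back in terminal 1, trigger the crash:
CREATE INDEX idx_links_userid ON links (user_id);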
On Fri, Apr 22, 2011 at 12:51 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Nope, that's the "psql" process - you need to attach to the backend
> process that's created to handle the connection.
> [...]
> SELECT pg_backend_pid();
>
> as that will give you the PID of the backend for the current
> connection.

Thanks. Did that.

The crash.log is a large-ish file, about 24KB. Here are the last 10
lines though. Does this help?

~ > tail -10 /root/crash.log
read(58, "`\1\0\0\230\337\0\343\1\0\0\0P\0T\r\0 \3 \374\236\2\2T\215\312\1\354\235\32\2"..., 8192) = 8192
write(97, "213.156.60\0\0 \0\0\0\37\0\364P\3\0\34@\22\0\0\000210."..., 8192) = 8192
read(58, "`\1\0\0\274\362\0\343\1\0\0\0T\0\210\r\0 \3 0\217\352\1\240\236\272\0024\235\322\2"..., 8192) = 8192
read(58, "[\1\0\0\354)c*\1\0\0\0T\0\214\r\0 \3 \254\236\242\2\340\220\342\2\\\235\232\2"..., 8192) = 8192
read(58, "\\\1\0\0\200\245\207\32\1\0\0\0\\\0\340\r\0 \3 \237\272\1\304\235\262\2\340\215\322\1"..., 8192) = 8192
read(58, "\350\0\0\0\274\311x\323\1\0\0\0\\\0000\r\0 \3 \200\236\372\2(\235\252\2\34\234\22\2"..., 8192) = 8192
read(58, ";\1\0\0|#\265\30\1\0\0\0`\0h\r\0 \3 \324\236R\2\314\235\n\2h\215\362\1"..., 8192) = 8192
read(58, "c\1\0\0000\24%u\1\0\0\0\230\0\210\r\0 \3 \240\226\32\16\260\235\252\1p\222Z\10"..., 8192) = 8192
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 17161 detached

The full crash.log file is here if needed:
https://www.yousendit.com/download/VnBxcmxjNDJlM1JjR0E9PQ

Btw, this happens when I try to create an index on one of the columns
in my table. Just before this, I had created another index on
modify_date (a timestamp column) and it went fine. Does that mean
anything?

Thanks
On Fri, Apr 22, 2011 at 12:06 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> ~ > tail -10 /root/crash.log
> [...]
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> Process 17161 detached

Probably a dumb and ignorant question, but should I be resetting the
xlog?
http://postgresql.1045698.n5.nabble.com/SIGSEGV-when-trying-to-start-in-single-user-mode-td1924418.html
> On Fri, Apr 22, 2011 at 12:06 PM, Phoenix Kiula <phoenix.kiula@gmail.com>
> wrote:
>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>> Process 17161 detached
>
> Probably a dumb and ignorant question, but should I be resetting the
> xlog?
> http://postgresql.1045698.n5.nabble.com/SIGSEGV-when-trying-to-start-in-single-user-mode-td1924418.html

Nope, that's a different problem, I guess - you don't have problems
with starting up the database (when the logs are replayed), so this
would not help (and it might cause other issues).

Anyway, I haven't found anything useful in the strace output - it seems
it works fine, reads about 500MB of data (each of the 'read' calls
corresponds to 8kB) and then suddenly ends. It's a bit strange that the
last line is not complete ...

Anyway, this is where my current knowledge of how processes in
PostgreSQL work ends. If I were sitting at the terminal, I'd probably
continue by trial and error to find out more details about the
segfault, but that's not very applicable over e-mail.
So let's hope some of the pg gurus who read this list will enlighten us with a bit more knowledge. regards Tomas
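(For the record, the attach step described earlier comes down to a single
command - the PID is whatever SELECT pg_backend_pid() printed in the
session that is about to run the crashing statement; 17161 below is just
the backend from the log above:)

$ strace -p 17161 -o /root/crash.log   # run in a second terminal, then start the CREATE INDEX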
On Fri, Apr 22, 2011 at 7:07 PM, <tv@fuzzy.cz> wrote:
> Anyway, I haven't found anything useful in the strace output [...]
> So let's hope some of the pg gurus who read this list will enlighten us
> with a bit more knowledge.

In the pg_dumpall backup process, I get this error. Does this help?

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  invalid memory alloc request size 4294967293
pg_dump: The command was: COPY public.links (id, link_id, alias,
aliasentered, url, user_known, user_id, url_encrypted, title, private,
private_key, status, create_date, modify_date, disable_in_statistics,
user_running_id, url_host_long) TO stdout;
pg_dumpall: pg_dump failed on database "snipurl", exiting

Thanks!
> On Fri, Apr 22, 2011 at 7:07 PM, <tv@fuzzy.cz> wrote:
> In the pg_dumpall backup process, I get this error. Does this help?

Well, not really - it's just another incarnation of the problem we've
already seen. PostgreSQL reads the data, and at some point it finds out it
needs to allocate 4294967293 bytes of memory. Which is strange, because
interpreted as a signed 32-bit number that's actually negative (-3 AFAIK).

It's probably caused by data corruption (an incorrect length stored for a
field).

There are ways to find out more about the cause, e.g. here:

http://archives.postgresql.org/pgsql-hackers/2005-10/msg01198.php

but you need a pg compiled with debug support. I guess the packaged
version does not support that, but maybe you can get the sources and
compile them on your own.

If it really is data corruption, you might try to locate the corrupted
blocks like this:

-- get number of blocks
SELECT relpages FROM pg_class WHERE relname = 'table_name';

-- get items for each block (read the problematic column)
FOR block IN 1..relpages LOOP
  SELECT AVG(length(colname)) FROM table_name
   WHERE ctid >= '(block,0)'::ctid AND ctid < '(block+1,0)'::ctid;

and once it fails, remember the block ID (and restart - there might be
more).

regards
Tomas
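(As an aside, the arithmetic behind the -3 guess: 4294967293 is 2^32 - 3,
i.e. the unsigned reading of a 32-bit length field that holds -3. A quick
shell check:)

$ echo $(( 4294967293 - 2**32 ))
-3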
On Fri, Apr 22, 2011 at 8:20 PM, <tv@fuzzy.cz> wrote:
> If it really is data corruption, you might try to locate the corrupted
> blocks like this:
>
> -- get number of blocks
> SELECT relpages FROM pg_class WHERE relname = 'table_name';
>
> -- get items for each block (read the problematic column)
> FOR block IN 1..relpages LOOP
>   SELECT AVG(length(colname)) FROM table_name
>    WHERE ctid >= '(block,0)'::ctid AND ctid < '(block+1,0)'::ctid;

Thanks for this. Very useful. What is this -- a function? How should I
execute this query?

Thanks!
> Thanks for this. Very useful. What is this -- a function? How should I
> execute this query?

It's pseudocode - you need to implement it in whatever language you like.
You could do that in PL/pgSQL, but don't forget it's probably going to
crash when you hit the problematic block, so I'd implement it outside the
DB (with logic to continue the loop once the connection dies).

And 'ctid' is a pseudocolumn that means '(block#, row#)', i.e. it's
something like the physical location of the row.

regards
Tomas
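(A minimal bash sketch of that idea - scanning by primary-key range rather
than by ctid, since that is easier to drive from the shell; the names here
(database snipurl, table links, PK id, suspect column url) are assumptions.
Each psql call is a fresh connection, so the scan survives backend crashes:)

#!/bin/bash
# Walk the table in fixed-size PK ranges; length() forces every url value
# to be read, so a corrupted row makes that one query fail while the loop
# carries on with the next range on a new connection.
DB=snipurl TABLE=links PK=id COL=url STEP=100000
MAX=$(psql -At -d "$DB" -c "SELECT max($PK) FROM $TABLE")
for ((lo = 0; lo <= MAX; lo += STEP)); do
  psql -At -d "$DB" \
       -c "SELECT sum(length($COL)) FROM $TABLE WHERE $PK >= $lo AND $PK < $((lo + STEP))" \
       > /dev/null 2>&1 \
    || echo "suspect range: $PK in [$lo, $((lo + STEP)))"
done

(Once a range is flagged, halve STEP and rescan just that range to narrow
the error down to individual rows.)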
On Fri, Apr 22, 2011 at 8:35 PM, <tv@fuzzy.cz> wrote:
> It's pseudocode - you need to implement it in whatever language you like.
> [...]

A question.

Is the data dumped with the "COPY TO" command any use?

It has taken me days, but I have managed to COPY my large table in chunks.

If I subsequently COPY FROM these files, would this be a workable solution?

My fear, based on my ignorance, is that any data corruption that exists
will also get COPY-ied, and therefore transferred into the fresh database.

Is this fear justified, or is COPY a viable alternative?

Thanks!
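(For reference, a chunked dump of the sort described can be scripted;
COPY (SELECT ...) TO is available from 8.2 on. The year range, table and
column names below are assumptions:)

#!/bin/bash
# Dump one year per file, so a corrupted row aborts only that year's chunk.
DB=snipurl
for y in $(seq 2000 2010); do
  psql -d "$DB" -c "COPY (SELECT * FROM links
                          WHERE create_date >= '$y-01-01'
                            AND create_date <  '$((y + 1))-01-01')
                    TO STDOUT" > links_$y.txt
done

(As for the fear itself: COPY writes plain row values, so page-level
damage does not carry over as such - a structurally broken row normally
makes the COPY fail outright, while a row whose contents were silently
scrambled will simply be copied scrambled.)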
On Mon, Apr 25, 2011 at 9:19 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> Is this fear justified, or is COPY a viable alternative?

Sorry, spoke too soon.

I can COPY individual chunks to files. Did that by year, and at least the
dumping worked.

Now I need to pull the data in at the destination server.

If I COPY each individual file back into the table, it works. Slowly, but
it seems to work. I tried to combine all the files into one, truncate the
table, and pull it all in in one go (130 million rows or so), but this
time it gave the same error. However, it pointed out a specific row where
the problem was:

COPY links, line 15272357: "16426447 9s2q7 9s2q7 N
http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&i..."
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Is this any use at all? Would appreciate any pointers!
On 25 Apr 2011, at 18:16, Phoenix Kiula wrote:
> If I COPY each individual file back into the table, it works. Slowly,
> but it seems to work. I tried to combine all the files into one,
> truncate the table, and pull it all in in one go (130 million rows or
> so), but this time it gave the same error.
> [...]
> Is this any use at all? Would appreciate any pointers!

I didn't follow the entire thread, so maybe someone mentioned this
already, but... Usually, if we see error messages like those, it turns out
the OS is killing the postgres process with its equivalent of a
low-on-memory killer. I know Linux's got such a beast, and that you can
turn it off.

It's a frequently recurring issue on this list; there's bound to be some
pointers in the archives ;)

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
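(The "beast" in question is the Linux OOM killer; the usual mitigation,
per the PostgreSQL docs, is to disable memory overcommit so the kernel
fails allocations instead of killing processes:)

# As root:
sysctl -w vm.overcommit_memory=2
# Persist across reboots:
echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf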
Dne 25.4.2011 19:31, Alban Hertroys napsal(a):
> Usually, if we see error messages like those, it turns out the OS is
> killing the postgres process with its equivalent of a low-on-memory
> killer. I know Linux's got such a beast, and that you can turn it off.

Not sure if this COPY failure is caused by the same issue as before, but
the original issue was caused by this:

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  invalid memory alloc request size 4294967293
pg_dump: The command was: COPY public.links (id, link_id, alias,
aliasentered, url, user_known, user_id, url_encrypted, title, private,
private_key, status, create_date, modify_date, disable_in_statistics,
user_running_id, url_host_long) TO stdout;
pg_dumpall: pg_dump failed on database "snipurl", exiting

i.e. a bad memory alloc request (with a negative size). That does not
seem like an OOM killer killing the backend.

regards
Tomas
On Tue, Apr 26, 2011 at 1:56 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> i.e. a bad memory alloc request (with a negative size). That does not
> seem like an OOM killer killing the backend.

Most likely you're right.

I did a COPY FROM and populated the entire table. On my hard disk, the
space consumption went up by 64GB.

Yet, when I do a "SELECT * FROM mytable LIMIT 1", the entire DB crashes.
There is no visible record.

What's this?
Dne 25.4.2011 20:40, Phoenix Kiula napsal(a):
> I did a COPY FROM and populated the entire table. On my hard disk, the
> space consumption went up by 64GB.

So you have dumped the table piece by piece, it worked, and now you have a
complete copy of the table? All the rows?

> Yet, when I do a "SELECT * FROM mytable LIMIT 1", the entire DB crashes.
> There is no visible record.
>
> What's this?

Hmmmm, that's strange ... you're saying that's a freshly populated DB?

Tomas
Phoenix Kiula <phoenix.kiula@gmail.com> writes:
> I did a COPY FROM and populated the entire table. On my hard disk, the
> space consumption went up by 64GB.
> Yet, when I do a "SELECT * FROM mytable LIMIT 1", the entire DB crashes.
> There is no visible record.

There should certainly be a "visible record" somewhere, ie, the postmaster
log. It might also be productive to look in the kernel log for events
around the same time --- OOM kills would be recorded there, and if the
true story here is hardware problems there might also be kernel log
messages about that.

			regards, tom lane
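(Concretely, something along these lines - the paths are assumptions; on
CentOS 5 the default data directory is /var/lib/pgsql/data, and where the
server log ends up depends on how logging was configured:)

# PostgreSQL server log:
tail -n 100 /var/lib/pgsql/data/pg_log/postgresql-*.log
# Kernel messages - OOM kills and hardware errors both show up here:
dmesg | tail -n 50
grep -iE 'oom|out of memory|segfault' /var/log/messages | tail -n 20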
Dne 25.4.2011 18:16, Phoenix Kiula napsal(a):
> If I COPY each individual file back into the table, it works. Slowly,
> but it seems to work. I tried to combine all the files into one,
> truncate the table, and pull it all in in one go (130 million rows or
> so), but this time it gave the same error. However, it pointed out a
> specific row where the problem was:
>
> COPY links, line 15272357: "16426447 9s2q7 9s2q7 N
> http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&i..."
>
> Is this any use at all? Would appreciate any pointers!

So the dump worked fine, and it fails when loading it back into the DB?
Have you checked the output file (just look at the tail)? Can you post the
part that causes issues - the problematic line (15272357, the row with id
16426447) and a few lines around it?

regards
Tomas
On Tuesday, April 26, 2011, Tomas Vondra <tv@fuzzy.cz> wrote:
> So the dump worked fine, and it fails when loading it back into the DB?
> Have you checked the output file (just look at the tail)? Can you post
> the part that causes issues - the problematic line and a few lines
> around it?

Ok, let me explain.

Pg_dumpall did not work. It kept on crashing. So I did COPY, with
conditional commands, copying one year at a time. This process took me a
day and a half, but I now have files with COPY dumps for the last 11
years.

On the fresh server, instead of running 'COPY FROM' with 11 files, I
concatenated the files into one. Then, in a transaction, I imported this
file into the new database:

  BEGIN
  TRUNCATE table
  COPY FROM into table
  COMMIT

This worked. I confirmed by checking for new disk usage in the ~/data
folder. It has gone up by 64 GB.

Yet that SQL gives me no rows.

--
Shashank Tripathi
+1 646 755 9860
+65 932 55 600
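(Spelled out for psql - the file and table names are assumptions.
ON_ERROR_STOP matters here: without it, psql keeps going after an error
inside the transaction, and the final COMMIT quietly turns into a rollback
with no obvious complaint:)

psql -d snipurl <<'SQL'
\set ON_ERROR_STOP on
BEGIN;
TRUNCATE TABLE links;
\copy links FROM 'links_all.txt'
COMMIT;
SQL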
> On Tuesday, April 26, 2011, Tomas Vondra <tv@fuzzy.cz> wrote:
>> So the dump worked fine, and it fails when loading it back into the DB?

From the old server:
Yearly COPY files worked. Pg_dumpall was giving problems.

On the new server:
COPY FROM worked. All files appear to have been copied. Then I created the
primary key index, and another index. Many records are there, but many are
not! There's no error, just that some records/rows didn't make it.

I did the COPY FROM in a transaction block. If there had been an error,
then "COMMIT" would have rolled back, right? It didn't. It committed. No
errors. Just that some data has not come in.

How can I get more info on why?
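(A first sanity check - in COPY's text format, newlines inside values are
escaped, so the file should hold exactly one line per row; file and table
names are assumptions:)

# Rows in the concatenated dump file:
wc -l < links_all.txt
# Rows that actually made it into the table:
psql -At -d snipurl -c "SELECT count(*) FROM links"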
Tomas, the line where it crashed, here are the 10 or so lines around it:

> head -15272350 /backup/links/links_all.txt | tail -20

16426422 9s2pi 9s2pi N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=Cannibal+Corpse+-+Split+Wide+Open&linkCode=ur2&tag=dmp3-20 0 121.214.194.133 7a69d5842739e20b56c0103d1a6ec172e58f9e07 \N Y 2009-01-10 20:59:31.135881 2009-01-10 20:59:31.135881 \N \N
16426423 9s2pj 9s2pj N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=Juana+Fe+-+la+murga+final&linkCode=ur2&tag=dmp3-20 0 201.215.6.104 5e2ae1f363c7854c13a101a60b32a9a1ade26767 \N Y 2009-01-10 20:59:31.593474 2009-01-10 20:59:31.593474 Y \N \N
15897862 9gqva 9gqva N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=Boyz+II+Men+-+Ill+Make+Love+To+You&linkCode=ur2&tag=dmp3-20 0 76.10.185.87 3c840fa5428c0464556dccb7d1013a6ec53d1743 N Y 2009-01-04 19:40:50.734967 2009-01-10 20:59:32.286937 N \N \N
15130149 90ahx 90ahx N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=The+Killers+-+All+The+Pretty+Faces&linkCode=ur2&tag=dmp3-20 0 65.25.74.141 5eb2a1bb48d4926d8eaf946fb544ce11c50a9e5b N Y 2008-12-22 14:54:20.813923 2009-01-10 20:59:33.896232 N \N \N
16426425 9s2pl 9s2pl N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=Freddy+Quinn+-+Junge%2C+Komm+Bald+Wieder&linkCode=ur2&tag=dmp3-20 0 123.100.137.226 fb7af64a4b886f074a6443b8d43f571c3083f51c \N Y 2009-01-10 20:59:33.986764 2009-01-10 20:59:33.986764 Y \N \N
16391756 9rbyk 9rbyk N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=Closure+In+Moscow+-+Ofelia...+Ofelia&linkCode=ur2&tag=dmp3-20 0 71.233.18.39 a4f95f246b89523785b736530fb4b3a335195c4b N Y 2009-01-10 13:20:54.86346 2009-01-10 20:59:34.641193 N \N \N
16229928 9nv3c 9nv3c N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=Ministry+of+Sound+-+Freestylers+%2F+Push+Up&linkCode=ur2&tag=dmp3-20 0 24.60.222.70 b455933eb976b39313f5da56afcd9db29d3f7bde N Y 2009-01-08 19:35:19.842463 2009-01-10 20:59:35.343552 N \N \N
16426427 9s2pn 9s2pn N http://www.annehelmond.nl/2007/11/26/celebrating-two-thousand-delicious-bookmarks/ 195.190.28.97 22a06537e25985273297471dbeb3fb6ae217cb90 \N Y 2009-01-10 20:59:36.125122 2009-01-10 20:59:36.125122 Y \N \N
16426428 9s2po 9s2po N http://twinkle.tapulous.com/index.php?hash=9c01cb7b216a7f8b66056d20dd218f67f52f433e 66.135.60.238 d60e7f2801c05422b4ef17a1ca63df13772c4692 \N Y 2009-01-10 20:59:36.249249 2009-01-10 20:59:36.249249 Y \N \N
16426426 9s2pm 9s2pm N http://www.bikinibeat.org/bikini-barista-alisha-erickson-of-java-girls/11322/ 0 67.205.21.208 40970475a84e9879a2659aedf821156e2aac7323 N Y 2009-01-10 20:59:34.190555 2009-01-10 20:59:36.538822 N \N \N
16426429 9s2pp 9s2pp N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=Chico+Trujillo+-+Cabildo&linkCode=ur2&tag=dmp3-20 0 201.215.6.104 820aa985ca7c1e98b9763914155b9f0cd583fc60 \N Y 2009-01-10 20:59:36.556744 2009-01-10 20:59:36.556744 Y \N \N
16426237 9s2kd 9s2kd N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=%E0%B8%81%E0%B8%A5%E0%B9%89%E0%B8%A7%E0%B8%A2+%E0%B9%81%E0%B8%AA%E0%B8%95%E0%B8%A1%E0%B8%9B%E0%B9%8C+-+%E0%B8%A2%E0%B8%B1%E0%B8%87%E0%B8%A3%E0%B8%B1%E0%B8%81%E0%B8%81%E0%B8%B1%E0%B8%99%E0%B8%AD%E0%B8%A2%E0%B8%B9%E0%B9%88%E0%B9%84%E0%B8%AB%E0%B8%A1&linkCode=ur2&tag=dmp3-20 0 125.26.153.157 dfd14418cb8ad8afc5843e7873ee271dcd05289b 2009-01-10 20:56:36.271531 2009-01-10 20:59:37.163608 N \N \N
16426431 9s2pr 9s2pr N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=+-+Amplify+SD&linkCode=ur2&tag=dmp3-20 0 41.235.241.185 9a7f63d3cc8455d8a89cf8b707e38eef10245a66 \N Y 2009-01-10 20:59:37.498966 2009-01-10 20:59:37.498966 Y \N \N
16426432 9s2ps 9s2ps N http://www.zoliblog.com/2008/08/06/what-are-a-million-users-worth-zoho-thinks-a-lot/ 207.58.136.202 aa7bfcc1bf1b2ca19c14b262f3bd7272eed09e87 \N Y 2009-01-10 20:59:37.779863 2009-01-10 20:59:37.779863 Y \N \N
16306150 9phwm 9phwm N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=Takeharu+Ishimoto+-+Holding+My+Thoughts+In+My+Heart&linkCode=ur2&tag=dmp3-20 0 118.137.44.94 445e020999b8ddfaf72cb16bded949c9cab0fc8f N Y 2009-01-09 15:26:04.80344 2009-01-10 20:59:41.717183 N \N \N
16426435 9s2pv 9s2pv N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=chico+trujillo+-+como+quisiera&linkCode=ur2&tag=dmp3-20 0 201.215.6.104 1e1d275525cd2f5215e19db22af08f4edbf3bae5 \N Y 2009-01-10 20:59:41.844667 2009-01-10 20:59:41.844667 \N \N
16426436 9s2pw 9s2pw N http://twinkle.tapulous.com/index.php?hash=e4a3bee3941130cae759dd51659d58848644ea07 66.135.60.241 334722bd7db9c30762f9d8d0c19bccbf55e16249 \N Y 2009-01-10 20:59:42.86758 2009-01-10 20:59:42.86758 Y \N \N
16426437 9s2px 9s2px N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=David+Friedman%2FPeabo+Bryson%2FRegina+Belle+-+The+Battle&linkCode=ur2&tag=dmp3-20 0 124.171.4.232 37985292fd5c6a46de49bea712f780e54b0c747c \N Y 2009-01-10 20:59:43.617785 2009-01-10 20:59:43.617785 Y \N \N
16426438 9s2py 9s2py N http://www.manuscrypts.com/?p=132 0 74.220.219.59 64246d90b7e3dd259f8b315211eeb44dcf6f661c \N Y 2009-01-10 20:59:43.92993 2009-01-10 20:59:43.92993 Y \N \N
16426439 9s2pz 9s2pz N http://www.amazon.com/gp/search?camp=1789&creative=9325&ie=UTF8&index=digital-music&keywords=New+Riders+of+the+Purple+Sage+-+Panama+Red&linkCode=ur2&tag=dmp3-20 0 76.20.192.237 5bfee6de3bc012098df107e6967201eb7338949c \N Y 2009-01-10 20:59:44.341971 2009-01-10 20:59:44.341971 Y \N \N
On Mon, Apr 25, 2011 at 8:50 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> On the new server:
> COPY FROM worked. All files appear to have been copied. Then I created
> the primary key index, and another index. Many records are there, but
> many are not! There's no error, just that some records/rows didn't
> make it.

Are you sure you're getting all the data out of the source (broken)
database you think you are?  Are you sure those rows are in the dump?
On Tue, Apr 26, 2011 at 3:24 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Are you sure you're getting all the data out of the source (broken)
> database you think you are?  Are you sure those rows are in the dump?

Actually, I am not. Some rows are missing.

Will a COUNT(*) on the two databases -- old and new -- be sufficient and
reliable information about the number of rows that went AWOL?
Dne 26.4.2011 04:50, Phoenix Kiula napsal(a):
> Tomas, the line where it crashed, here are the 10 or so lines around it:
>
>> head -15272350 /backup/links/links_all.txt | tail -20

No, those lines are before the one that causes problems - the failing line
number is 15272357, and you've printed just the first 15272350 lines using
head. Do this:

$ head -15272367 /backup/links/links_all.txt | tail -20

That should give us the failing line with roughly ten lines before and
after it.

Tomas.
Dne 26.4.2011 14:41, Phoenix Kiula napsal(a):
> On Tue, Apr 26, 2011 at 3:24 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> Are you sure you're getting all the data out of the source (broken)
>> database you think you are?  Are you sure those rows are in the dump?
>
> Actually, I am not. Some rows are missing.
>
> Will a COUNT(*) on the two databases -- old and new -- be sufficient and
> reliable information about the number of rows that went AWOL?

That should give us at least some idea of whether the copy worked. Have
you checked the postmaster log (and the kernel log in /var/log/messages)
to see why the new DB crashed when you do "SELECT * FROM mytable LIMIT 1"
(as Tom Lane recommended yesterday)?

Tomas
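(For example - the host names are placeholders, and the grep assumes the
default tab-delimited COPY text format, using the id from the failing
line:)

# Row counts on the old and new servers:
psql -At -h old-server -d snipurl -c "SELECT count(*) FROM links"
psql -At -h new-server -d snipurl -c "SELECT count(*) FROM links"
# Did the suspect row make it into the concatenated dump file at all?
grep -c $'^16426447\t' links_all.txt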