Thread: Re: [HACKERS] tables > 1 gig

Re: [HACKERS] tables > 1 gig

From: Bruce Momjian
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >> Then we'd better fix the underlying problem.  We can't change
> >> RELSEG_SIZE for a minor release, unless you want to give up the
> >> principle of not forcing initdb at minor releases.
> 
> > Why can't we increase it?
> 
> Consider a 1.5-gig table.  6.5 will store it as one gig in file "table",
> one-half gig in file "table.1".  Now recompile with larger RELSEG_SIZE.
> The file manager will now expect to find all blocks of the relation in
> file "table", and will never go to "table.1" at all.  Presto, you lost
> a bunch of data.
> 
> Bottom line is just as it says in the config.h comments: you can't
> change either BLCKSZ or RELSEG_SIZE without doing initdb.
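For concreteness, the mapping Tom describes is just integer arithmetic on the
block number.  A rough sketch with assumed values (BLCKSZ of 8K, the stock
1 GB segment size, and an illustrative where_is() helper), not the actual
md.c code:

#include <stdio.h>

#define BLCKSZ      8192                /* assumed block size */
#define OLD_RELSEG  131072              /* 1 GB worth of 8K blocks */
#define NEW_RELSEG  262144              /* hypothetical doubled RELSEG_SIZE */

/* Show which segment file a block lands in (segment 0 is the bare "table"
 * file, segment 1 is "table.1", and so on). */
static void
where_is(const char *tag, unsigned blkno, unsigned relseg)
{
    printf("%s layout: block %u -> segment %u, offset %lu\n",
           tag, blkno, blkno / relseg,
           (unsigned long) (blkno % relseg) * BLCKSZ);
}

int
main(void)
{
    /* a block that lives in "table.1" under the 6.5 layout */
    unsigned blkno = 150000;

    where_is("old", blkno, OLD_RELSEG);     /* segment 1, i.e. "table.1" */
    where_is("new", blkno, NEW_RELSEG);     /* segment 0 -- "table.1" is never consulted */
    return 0;
}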

OK, I will back it out.  I never thought that far ahead.  I'm not sure how
we can fix this easily, nor do I understand why more people aren't
complaining about not being able to vacuum 1.5-gig tables that they used to
be able to vacuum.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] tables > 1 gig

From: Tom Lane
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> ... nor do I understand why more people aren't
> complaining about not being able to vacuum tables that are 1.5 gigs that
> they used to be able to vacuum.

Most likely, not very many people with tables that big have adopted 6.5
yet ... if I were running a big site, I'd probably wait for 6.5.1 on
general principles ;-)

I think what we ought to do is finish working out how to make mdtruncate
safe for concurrent backends, and then do it.  That's the right
long-term answer anyway.
        regards, tom lane


Re: [HACKERS] tables > 1 gig

From: Bruce Momjian
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> > ... nor do I understand why more people aren't
> > complaining about not being able to vacuum tables that are 1.5 gigs that
> > they used to be able to vacuum.
> 
> Most likely, not very many people with tables that big have adopted 6.5
> yet ... if I were running a big site, I'd probably wait for 6.5.1 on
> general principles ;-)
> 
> I think what we ought to do is finish working out how to make mdtruncate
> safe for concurrent backends, and then do it.  That's the right
> long-term answer anyway.

The problem is, no one knows how right now.  I liked unlinking every
segment, but Hiroshi told me that it causes a problem with concurrent
access and vacuum, because the old backends still think it is there.



Re: [HACKERS] tables > 1 gig

From: Tom Lane
Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> I think what we ought to do is finish working out how to make mdtruncate
>> safe for concurrent backends, and then do it.  That's the right
>> long-term answer anyway.

> Problem is, no one knows how right now.  I liked unlinking every
> segment, but was told by Hiroshi that causes a problem with concurrent
> access and vacuum because the old backends still think it is there.

I haven't been paying much attention, but I imagine that what's really
going on here is that once vacuum has collected all the still-good
tuples at the front of the relation, it doesn't bother to go through
the remaining blocks of the relation and mark everything dead therein?
It just truncates the file after the last block that it put tuples into,
right?

If this procedure works correctly for vacuuming a simple one-segment
table, then it would seem that truncation of all the later segments to
zero length should work correctly.

You could truncate to zero length *and* then unlink the files if you
had a mind to do that, but I can see why unlink without truncate would
not work reliably.
        regards, tom lane
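A minimal sketch of that truncate-to-zero approach, using hypothetical names
(seg_name, truncate_rel) and plain truncate(2) instead of the fd-based
FileTruncate() that md.c actually uses; this is an illustration of the idea,
not the real mdtruncate():

#include <stdio.h>
#include <unistd.h>

#define BLCKSZ      8192                /* assumed block size */
#define RELSEG_SIZE 131072              /* blocks per 1 GB segment */

/* hypothetical helper: segment 0 is "table", later ones "table.1", ... */
static void
seg_name(char *buf, size_t len, const char *rel, int segno)
{
    if (segno == 0)
        snprintf(buf, len, "%s", rel);
    else
        snprintf(buf, len, "%s.%d", rel, segno);
}

/* Truncate a relation of nsegments segments down to newblocks blocks,
 * zeroing the trailing segments but leaving the files in place. */
static int
truncate_rel(const char *rel, long newblocks, int nsegments)
{
    char path[1024];
    int  segno;

    for (segno = nsegments - 1; segno >= 0; segno--)
    {
        long segblocks = newblocks - (long) segno * RELSEG_SIZE;

        seg_name(path, sizeof(path), rel, segno);
        if (segblocks <= 0)
        {
            /* segment lies entirely past the new end: keep it, zero it */
            if (truncate(path, 0) < 0)
                return -1;
        }
        else
        {
            /* last surviving segment: cut it to the remaining blocks */
            if (truncate(path, (off_t) segblocks * BLCKSZ) < 0)
                return -1;
            break;
        }
    }
    return 0;
}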


Re: [HACKERS] tables > 1 gig

From: Bruce Momjian
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >> I think what we ought to do is finish working out how to make mdtruncate
> >> safe for concurrent backends, and then do it.  That's the right
> >> long-term answer anyway.
> 
> > Problem is, no one knows how right now.  I liked unlinking every
> > segment, but was told by Hiroshi that causes a problem with concurrent
> > access and vacuum because the old backends still think it is there.
> 
> I haven't been paying much attention, but I imagine that what's really
> going on here is that once vacuum has collected all the still-good
> tuples at the front of the relation, it doesn't bother to go through
> the remaining blocks of the relation and mark everything dead therein?
> It just truncates the file after the last block that it put tuples into,
> right?
> 
> If this procedure works correctly for vacuuming a simple one-segment
> table, then it would seem that truncation of all the later segments to
> zero length should work correctly.
> 
> You could truncate to zero length *and* then unlink the files if you
> had a mind to do that, but I can see why unlink without truncate would
> not work reliably.

That seems like the issue.  The more complex problem is that when the
relation loses a segment via vacuum, things go strange on the other
backends.  Hiroshi seems to have a good testbed for this, and I thought it
was fixed, so I didn't notice.

Unlinking allows other backends to keep their open segments of the table,
but it causes problems when backends try to open segments they think still
exist and find they can't be opened.

Truncating segments causes problems because backends are still accessing
their own copies of the table, and truncation modifies what is seen through
their open file descriptors.

We basically have two methods, and both have problems under certain
circumstances.  I wonder if we could unlink the files, but then create
zero-length segments in place of the ones we unlink.  If people think that
may fix the problems, it is easy to do, and we can do it atomically using
the rename() system call: create the zero-length file under a temp name,
then rename it to the segment file name.  That may do the trick of allowing
existing file descriptors to stay active, while keeping segments in place
for those that need to see them.

Comments?


Re: [HACKERS] tables > 1 gig

From: Bruce Momjian
> I haven't been paying much attention, but I imagine that what's really
> going on here is that once vacuum has collected all the still-good
> tuples at the front of the relation, it doesn't bother to go through
> the remaining blocks of the relation and mark everything dead therein?
> It just truncates the file after the last block that it put tuples into,
> right?
> 
> If this procedure works correctly for vacuuming a simple one-segment
> table, then it would seem that truncation of all the later segments to
> zero length should work correctly.

Not sure about that.  When we truncate a single-segment file, the table is
being destroyed, so we invalidate it in the catalog cache and tell other
backends.  Also, we have a problem with DROP TABLE in a transaction while
others are using the table, as described in a bug report a few days ago, so
I don't think we have that case 100% right either.

> You could truncate to zero length *and* then unlink the files if you
> had a mind to do that, but I can see why unlink without truncate would
> not work reliably.

That is interesting.  I never thought of that.  Hiroshi, can you test that
idea?  If it is the non-existence of the file that other backends are
checking for, my earlier idea of rename() with a truncated file kept in
place may be better.

Also, I see why we are not getting more bug reports.  People only hit this
when the table loses a segment, so it is OK to vacuum large tables as long
as the table doesn't lose a segment during the vacuum.



Re: [HACKERS] tables > 1 gig

From: Bruce Momjian
> > I haven't been paying much attention, but I imagine that what's really
> > going on here is that once vacuum has collected all the still-good
> > tuples at the front of the relation, it doesn't bother to go through
> > the remaining blocks of the relation and mark everything dead therein?
> > It just truncates the file after the last block that it put tuples into,
> > right?
> > 
> > If this procedure works correctly for vacuuming a simple one-segment
> > table, then it would seem that truncation of all the later segments to
> > zero length should work correctly.
> 
> Not sure about that.  When we truncate single segment file, the table is
> being destroyed, so we invalidate it in the catalog cache and tell other
> backends.  Also, we have a problem with DROP TABLE in a transaction
> while others are using it as described by a bug report a few days ago,
> so I don't think we have that 100% either.
> 
> That is interesting.  I never thought of that.  Hiroshi, can you test
> that idea?  If it is the non-existance of the file that other backends
> are checking for, my earlier idea of rename() with truncated file kept
> in place may be better.
> 
> Also, I see why we are not getting more bug reports.  They only get this
> when the table looses a segment, so it is OK to vacuum large tables as
> long as the table doesn't loose a segment during the vacuum.

OK, this is 100% wrong.  We truncate from vacuum any time the table size
changes, and vacuum of large tables will fail even when it isn't removing a
segment.  I forgot vacuum does this to reduce the table's size on disk.

I wonder if truncating a file to reduce its size will cause other table
readers to have problems.  I thought vacuum held an exclusive lock on the
table during the vacuum, and if so, why are other backends having trouble?



RE: [HACKERS] tables > 1 gig

From: "Hiroshi Inoue"

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Friday, June 18, 1999 12:54 AM
> To: Bruce Momjian
> Cc: PostgreSQL-development; Inoue@tpf.co.jp
> Subject: Re: [HACKERS] tables > 1 gig 
> 
> 
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >> I think what we ought to do is finish working out how to make 
> mdtruncate
> >> safe for concurrent backends, and then do it.  That's the right
> >> long-term answer anyway.
> 
> > Problem is, no one knows how right now.  I liked unlinking every
> > segment, but was told by Hiroshi that causes a problem with concurrent
> > access and vacuum because the old backends still think it is there.
> 
> I haven't been paying much attention, but I imagine that what's really
> going on here is that once vacuum has collected all the still-good
> tuples at the front of the relation, it doesn't bother to go through
> the remaining blocks of the relation and mark everything dead therein?
> It just truncates the file after the last block that it put tuples into,
> right?
> 
> If this procedure works correctly for vacuuming a simple one-segment
> table, then it would seem that truncation of all the later segments to
> zero length should work correctly.
> 
> You could truncate to zero length *and* then unlink the files if you
> had a mind to do that, but I can see why unlink without truncate would
> not work reliably.
>

Unlinking unused segments after truncating to zero length may cause
results such as:

     Existing backends write to the truncated file to extend the relation,
     while new backends create a new segment file to extend the relation.
 

Comments ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


RE: [HACKERS] tables > 1 gig

From: "Hiroshi Inoue"
>
> > > I haven't been paying much attention, but I imagine that what's really
> > > going on here is that once vacuum has collected all the still-good
> > > tuples at the front of the relation, it doesn't bother to go through
> > > the remaining blocks of the relation and mark everything dead therein?
> > > It just truncates the file after the last block that it put
> tuples into,
> > > right?
> > >
> > > If this procedure works correctly for vacuuming a simple one-segment
> > > table, then it would seem that truncation of all the later segments to
> > > zero length should work correctly.
> >
> > Not sure about that.  When we truncate single segment file, the table is
> > being destroyed, so we invalidate it in the catalog cache and tell other
> > backends.  Also, we have a problem with DROP TABLE in a transaction
> > while others are using it as described by a bug report a few days ago,
> > so I don't think we have that 100% either.
> >

The problem is that (virtual) file descriptors, relcache entries, etc. are
local to each process.  I don't know of a reliable way to tell the other
processes, just in time, that those resources should be invalidated.

> > That is interesting.  I never thought of that.  Hiroshi, can you test
> > that idea?  If it is the non-existance of the file that other backends
> > are checking for, my earlier idea of rename() with truncated file kept
> > in place may be better.
> >
> > Also, I see why we are not getting more bug reports.  They only get this
> > when the table looses a segment, so it is OK to vacuum large tables as
> > long as the table doesn't loose a segment during the vacuum.
>
> OK, this is 100% wrong.  We truncate from vacuum any time the table size
> changes, and vacuum of large tables will fail even if not removing a
> segment.  I forgot vacuum does this to reduce disk table size.
>
> I wonder if truncating a file to reduce its size will cause other table
> readers to have problems.

The current implementation has a hidden bug: once the size of a segment has
reached RELSEG_SIZE, mdnblocks() never checks the real size of that segment
again.

I'm not sure other bugs like this don't still exist.  It's one of the
reasons I don't recommend applying my trial patch to mdtruncate().
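For reference, the mdnblocks() check in question, as it appears in the
pre-patch code quoted in the patch later in this thread:

    if (v->mdfd_lstbcnt == RELSEG_SIZE
        || (nblocks = _mdnblocks(v->mdfd_vfd, BLCKSZ)) == RELSEG_SIZE)

Once mdfd_lstbcnt has been set to RELSEG_SIZE, the first test short-circuits
and _mdnblocks() is never called for that segment again, so a segment that
has since been truncated still looks full.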

> I though vacuum had an exlusive lock on the
> table during vacuum, and if so, why are other backends having troubles?
>

We could not see any errors from unlinking segmented relations when commands
are executed sequentially.  Vacuum calls RelationInvalidateHeapTuple() for
the pg_class tuple, and other backends can recognize that the relcache entry
must be invalidated while executing StartTransaction() or
CommandCounterIncrement().

Even though the target relation is locked exclusively by vacuum, other
backends can still StartTransaction(), CommandCounterIncrement(), parse,
analyze, rewrite, optimize, start the executor stage, and open relations.
We cannot rely on the exclusive lock all that much.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: [HACKERS] tables > 1 gig

From: Bruce Momjian
> Unlinking unused segments after truncating to zero length may cause 
> the result such as 
> 
>      Existent backends write to the truncated file to extend the relation
>      while new backends create a new segment file to extend the relation. 

How about my idea of creating a truncated file, then renaming it to the
table file?  That keeps the table open for other backends' file
descriptors, but puts a zero-length file in place in an atomic manner.

The fact is that the current code is really bad, so I ask that you do your
best, and let's get it in there for people to review and improve if
necessary.




RE: [HACKERS] tables > 1 gig

From: "Hiroshi Inoue"
> -----Original Message-----
> From: Bruce Momjian [mailto:maillist@candle.pha.pa.us]
> Sent: Friday, June 18, 1999 11:31 AM
> To: Hiroshi Inoue
> Cc: PostgreSQL-development
> Subject: Re: [HACKERS] tables > 1 gig
> 
> 
> > Unlinking unused segments after truncating to zero length may cause 
> > the result such as 
> > 
> >      Existent backends write to the truncated file to extend 
> the relation
> >      while new backends create a new segment file to extend the 
> relation. 
> 
> How about my idea of creating a truncated file, the renaming it to the
> table file.  That keeps the table open for other open file descriptors,
> but put a zero-length file in place in an atomic manner.
>

Sorry, I couldn't understand what you mean.
How is that different from truncating the existing files to zero length?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Re: [HACKERS] tables > 1 gig

From: Bruce Momjian
> > > Unlinking unused segments after truncating to zero length may cause 
> > > the result such as 
> > > 
> > >      Existent backends write to the truncated file to extend 
> > the relation
> > >      while new backends create a new segment file to extend the 
> > relation. 
> > 
> > How about my idea of creating a truncated file, the renaming it to the
> > table file.  That keeps the table open for other open file descriptors,
> > but put a zero-length file in place in an atomic manner.
> >
> 
> Sorry,I couldn't understand what you mean.
> What is differenct from truncating existent files to zero length ?

Glad to explain.  Here is the pseudocode:

    create temp file, make it zero length, call it 'zz'
    rename(zz, tablename)

What this does is create a zero-length file; the rename unlinks the
tablename file and puts the zero-length file in its place.  rename() is
atomic, so there is no time at which the table file does not exist.

It allows backends that have the table open via a descriptor to keep the
table unchanged, while new backends see a zero-length file.

Does this help?
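A sketch of that create-then-rename idea in C (illustrative only; the
temp-name convention and error handling are made up, and this is not
proposed as the real md.c change):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace a segment file with a zero-length file, atomically.  The ".zz"
 * temp-name suffix and the error handling here are purely illustrative. */
static int
replace_with_empty(const char *tablename)
{
    char tmpname[1024];
    int  fd;

    /* create a zero-length temp file alongside the segment */
    snprintf(tmpname, sizeof(tmpname), "%s.zz", tablename);
    fd = open(tmpname, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    /* rename(2) atomically replaces the target, so there is no moment
     * at which the segment file does not exist */
    if (rename(tmpname, tablename) < 0)
    {
        unlink(tmpname);
        return -1;
    }
    return 0;
}

The catch, as discussed below, is that rename() leaves the old inode alive
for any backend that already has the file open, so old and new backends
would see different data.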




RE: [HACKERS] tables > 1 gig

From: "Hiroshi Inoue"
> > >
> > > How about my idea of creating a truncated file, the renaming it to the
> > > table file.  That keeps the table open for other open file
> descriptors,
> > > but put a zero-length file in place in an atomic manner.
> > >
> >
> > Sorry,I couldn't understand what you mean.
> > What is differenct from truncating existent files to zero length ?
>
> Glad to explain.  Here is the pseudocode:
>
>     create temp file, make it zero length, call it 'zz'
>     rename(zz,tablename)
>
> What this does is to create a zero length file, and the rename unlinks
> the tablename file, and puts the zero-length file in it's place.
> rename() is atomic, so there is no time that the table file does not
> exist.
>

Let
    i1 be the inode of zz
    i2 be the inode of tablename
before rename().

Does this mean

    New backends read/write the i1 inode and
    backends that have the table open read/write the i2 inode?

If so, it seems wrong.
All backends should see the same data.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: [HACKERS] tables > 1 gig

From: Bruce Momjian
> > What this does is to create a zero length file, and the rename unlinks
> > the tablename file, and puts the zero-length file in it's place.
> > rename() is atomic, so there is no time that the table file does not
> > exist.
> >
> 
> Let
>     i1 be the inode of zz
>     i2 be the inode of tablename
> before rename().
> 
> Does this mean
> 
>     New backends read/write i1 inode and
>     backends that have the table open read/write i2 inode ?
> 
> If so,it seems wrong.
> All backends should see same data.

Yes, I can see your point.  It would show them different views of the
table.

So, as you were saying, we have no way of invalidating file descriptors
of other backends for secondary segments.  Why does truncating the file
not work?  Any ideas?



RE: [HACKERS] tables > 1 gig

From: "Hiroshi Inoue"
>
> > > What this does is to create a zero length file, and the rename unlinks
> > > the tablename file, and puts the zero-length file in it's place.
> > > rename() is atomic, so there is no time that the table file does not
> > > exist.
> > >
> >
> > Let
> >     i1 be the inode of zz
> >     i2 be the inode of tablename
> > before rename().
> >
> > Does this mean
> >
> >     New backends read/write i1 inode and
> >     backends that have the table open read/write i2 inode ?
> >
> > If so,it seems wrong.
> > All backends should see same data.
>
> Yes, I can see your point.  It would show them different views of the
> table.
>
> So, as you were saying, we have no way of invalidating file descriptors
> of other backends for secondary segments.

It seems DROP TABLE has a similar problem.
Has it already been solved?

> Why does truncating the file
> not work?  Any ideas?
>

I have gotten no bug reports for my trial implementation.
AFAIK, only Ole Gjerde has tested my patch.
Is that sufficient?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: [HACKERS] tables > 1 gig

From: Bruce Momjian
> > Yes, I can see your point.  It would show them different views of the
> > table.
> >
> > So, as you were saying, we have no way of invalidating file descriptors
> > of other backends for secondary segments.
> 
> It seems DROP TABLE has a similar problem.
> It has been already solved ?

Not solved.  Someone reported it recently.

> 
> > Why does truncating the file
> > not work?  Any ideas?
> >
> 
> I have gotten no bug reports for my trial implementation.
> AFAIK,only Ole Gjerde has tested my patch.
> Is it sufficient ?

Yes.  We need something, and maybe after we add it, people can do
testing and find any problems.  It is better to apply it than to leave
it as it currently exists, no?



RE: [HACKERS] tables > 1 gig

From: "Hiroshi Inoue"
> 
> > > Yes, I can see your point.  It would show them different views of the
> > > table.
> > >
> > > So, as you were saying, we have no way of invalidating file 
> descriptors
> > > of other backends for secondary segments.
> > 
> > > Why does truncating the file
> > > not work?  Any ideas?
> > >
> > 
> > I have gotten no bug reports for my trial implementation.
> > AFAIK,only Ole Gjerde has tested my patch.
> > Is it sufficient ?
> 
> Yes.  We need something, and maybe after we add it, people can do
> testing and find any problems.  It is better to apply it than to leave
> it as it currently exists, no?
>

OK, here is my patch for the PostgreSQL 6.5 release.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

*** storage/smgr/md.c.orig    Fri Jun 11 12:20:06 1999
--- storage/smgr/md.c    Fri Jun 18 15:10:54 1999
***************
*** 674,684 ****
      segno = 0;
      for (;;)
      {
!         if (v->mdfd_lstbcnt == RELSEG_SIZE
!             || (nblocks = _mdnblocks(v->mdfd_vfd, BLCKSZ)) == RELSEG_SIZE)
          {
- 
-             v->mdfd_lstbcnt = RELSEG_SIZE;
              segno++;
  
              if (v->mdfd_chain == (MdfdVec *) NULL)
--- 674,685 ----
      segno = 0;
      for (;;)
      {
!         nblocks = _mdnblocks(v->mdfd_vfd, BLCKSZ);
!         if (nblocks > RELSEG_SIZE)
!             elog(FATAL, "segment too big in mdnblocks!");
!         v->mdfd_lstbcnt = nblocks;
!         if (nblocks == RELSEG_SIZE)
          {
              segno++;
  
              if (v->mdfd_chain == (MdfdVec *) NULL)
***************
*** 711,732 ****
      MdfdVec    *v;
  
  #ifndef LET_OS_MANAGE_FILESIZE
!     int            curnblk;
  
      curnblk = mdnblocks(reln);
!     if (curnblk / RELSEG_SIZE > 0)
!     {
!         elog(NOTICE, "Can't truncate multi-segments relation %s",
!             reln->rd_rel->relname.data);
!         return curnblk;
!     }
  #endif
  
      fd = RelationGetFile(reln);
      v = &Md_fdvec[fd];
  
      if (FileTruncate(v->mdfd_vfd, nblocks * BLCKSZ) < 0)
          return -1;
  
      return nblocks;
  
--- 712,766 ----
      MdfdVec    *v;
  
  #ifndef LET_OS_MANAGE_FILESIZE
!     int            curnblk,
!                 i,
!                 oldsegno,
!                 newsegno,
!                 lastsegblocks;
!     MdfdVec            **varray;
  
      curnblk = mdnblocks(reln);
!     if (nblocks > curnblk)
!         return -1;
!     oldsegno = curnblk / RELSEG_SIZE;
!     newsegno = nblocks / RELSEG_SIZE;
! 
  #endif
  
      fd = RelationGetFile(reln);
      v = &Md_fdvec[fd];
  
+ #ifndef LET_OS_MANAGE_FILESIZE
+     varray = (MdfdVec **)palloc((oldsegno + 1) * sizeof(MdfdVec *));
+     for (i = 0; i <= oldsegno; i++)
+     {
+         if (!v)
+             elog(ERROR,"segment isn't open in mdtruncate!");
+         varray[i] = v;
+         v = v->mdfd_chain;
+     }
+     for (i = oldsegno; i > newsegno; i--)
+     {
+         v = varray[i];
+         if (FileTruncate(v->mdfd_vfd, 0) < 0)
+         {
+             pfree(varray);
+             return -1;
+         }
+         v->mdfd_lstbcnt = 0;
+     }
+     /* Calculate the # of blocks in the last segment */
+     lastsegblocks = nblocks - (newsegno * RELSEG_SIZE);
+     v = varray[i];
+     pfree(varray);
+     if (FileTruncate(v->mdfd_vfd, lastsegblocks * BLCKSZ) < 0)
+         return -1;
+     v->mdfd_lstbcnt = lastsegblocks;
+ #else
      if (FileTruncate(v->mdfd_vfd, nblocks * BLCKSZ) < 0)
          return -1;
+     v->mdfd_lstbcnt = nblocks;
+ #endif
  
      return nblocks;


Re: [HACKERS] tables > 1 gig

From: Tom Lane
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> Unlinking unused segments after truncating to zero length may cause 
> the result such as 
>      Existent backends write to the truncated file to extend the relation
>      while new backends create a new segment file to extend the relation. 

Ooh, good point.  So, unless we want to invent some way for the process
that's running vacuum to force other backends to close their FDs for
segment files, the *only* correct solution is to truncate to zero length
but leave the files in place.

I still don't quite see why there is such a big problem, however, unless
you're asserting that vacuum is broken for single-segment tables too.
Surely vacuum acquires a lock over the whole table, not just a segment?
        regards, tom lane


Re: [HACKERS] tables > 1 gig

From: Bruce Momjian
Thank you.  Applied.

> > 
> > > > Yes, I can see your point.  It would show them different views of the
> > > > table.
> > > >
> > > > So, as you were saying, we have no way of invalidating file 
> > descriptors
> > > > of other backends for secondary segments.
> > > 
> > > > Why does truncating the file
> > > > not work?  Any ideas?
> > > >
> > > 
> > > I have gotten no bug reports for my trial implementation.
> > > AFAIK,only Ole Gjerde has tested my patch.
> > > Is it sufficient ?
> > 
> > Yes.  We need something, and maybe after we add it, people can do
> > testing and find any problems.  It is better to apply it than to leave
> > it as it currently exists, no?
> >
> 
> OK,here is my patch for PostgreSQL6.5-release.
> 
> Regards.
> 
> Hiroshi Inoue
> Inoue@tpf.co.jp
> 
> *** storage/smgr/md.c.orig    Fri Jun 11 12:20:06 1999
> --- storage/smgr/md.c    Fri Jun 18 15:10:54 1999
> ***************
> *** 674,684 ****
>       segno = 0;
>       for (;;)
>       {
> !         if (v->mdfd_lstbcnt == RELSEG_SIZE
> !             || (nblocks = _mdnblocks(v->mdfd_vfd, BLCKSZ)) == RELSEG_SIZE)
>           {
> - 
> -             v->mdfd_lstbcnt = RELSEG_SIZE;
>               segno++;
>   
>               if (v->mdfd_chain == (MdfdVec *) NULL)
> --- 674,685 ----
>       segno = 0;
>       for (;;)
>       {
> !         nblocks = _mdnblocks(v->mdfd_vfd, BLCKSZ);
> !         if (nblocks > RELSEG_SIZE)
> !             elog(FATAL, "segment too big in mdnblocks!");
> !         v->mdfd_lstbcnt = nblocks;
> !         if (nblocks == RELSEG_SIZE)
>           {
>               segno++;
>   
>               if (v->mdfd_chain == (MdfdVec *) NULL)
> ***************
> *** 711,732 ****
>       MdfdVec    *v;
>   
>   #ifndef LET_OS_MANAGE_FILESIZE
> !     int            curnblk;
>   
>       curnblk = mdnblocks(reln);
> !     if (curnblk / RELSEG_SIZE > 0)
> !     {
> !         elog(NOTICE, "Can't truncate multi-segments relation %s",
> !             reln->rd_rel->relname.data);
> !         return curnblk;
> !     }
>   #endif
>   
>       fd = RelationGetFile(reln);
>       v = &Md_fdvec[fd];
>   
>       if (FileTruncate(v->mdfd_vfd, nblocks * BLCKSZ) < 0)
>           return -1;
>   
>       return nblocks;
>   
> --- 712,766 ----
>       MdfdVec    *v;
>   
>   #ifndef LET_OS_MANAGE_FILESIZE
> !     int            curnblk,
> !                 i,
> !                 oldsegno,
> !                 newsegno,
> !                 lastsegblocks;
> !     MdfdVec            **varray;
>   
>       curnblk = mdnblocks(reln);
> !     if (nblocks > curnblk)
> !         return -1;
> !     oldsegno = curnblk / RELSEG_SIZE;
> !     newsegno = nblocks / RELSEG_SIZE;
> ! 
>   #endif
>   
>       fd = RelationGetFile(reln);
>       v = &Md_fdvec[fd];
>   
> + #ifndef LET_OS_MANAGE_FILESIZE
> +     varray = (MdfdVec **)palloc((oldsegno + 1) * sizeof(MdfdVec *));
> +     for (i = 0; i <= oldsegno; i++)
> +     {
> +         if (!v)
> +             elog(ERROR,"segment isn't open in mdtruncate!");
> +         varray[i] = v;
> +         v = v->mdfd_chain;
> +     }
> +     for (i = oldsegno; i > newsegno; i--)
> +     {
> +         v = varray[i];
> +         if (FileTruncate(v->mdfd_vfd, 0) < 0)
> +         {
> +             pfree(varray);
> +             return -1;
> +         }
> +         v->mdfd_lstbcnt = 0;
> +     }
> +     /* Calculate the # of blocks in the last segment */
> +     lastsegblocks = nblocks - (newsegno * RELSEG_SIZE);
> +     v = varray[i];
> +     pfree(varray);
> +     if (FileTruncate(v->mdfd_vfd, lastsegblocks * BLCKSZ) < 0)
> +         return -1;
> +     v->mdfd_lstbcnt = lastsegblocks;
> + #else
>       if (FileTruncate(v->mdfd_vfd, nblocks * BLCKSZ) < 0)
>           return -1;
> +     v->mdfd_lstbcnt = nblocks;
> + #endif
>   
>       return nblocks;
> 




Re: [HACKERS] tables > 1 gig

From: Ole Gjerde
On Fri, 18 Jun 1999, Bruce Momjian wrote:
[snip - mdtruncate patch]

While talking about this whole issue, there is one piece missing.
Currently there is no way to dump a database/table over 2 GB.
When it hits the 2GB OS limit, it just silently stops and gives no
indication that it didn't finish.

It's not a problem for me yet, but I'm getting very close.  I have one
database with 3 tables over 2 GB (in postgres space), but they still come
out under 2 GB after a dump.  I can't do a pg_dump on the whole database,
however, which would be very nice.

I suppose it wouldn't be overly hard to have pg_dump/pg_dumpall do
something similar to what postgres does with segments.  I haven't looked
at it yet however, so I can't say for sure.

Comments?

Ole Gjerde



Re: [HACKERS] tables > 1 gig

From: Hannu Krosing
Ole Gjerde wrote:
> 
> On Fri, 18 Jun 1999, Bruce Momjian wrote:
> [snip - mdtruncate patch]
> 
> While talking about this whole issue, there is one piece missing.
> Currently there is no way to dump a database/table over 2 GB.
> When it hits the 2GB OS limit, it just silently stops and gives no
> indication that it didn't finish.
> 
> It's not a problem for me yet, but I'm getting very close.  I have one
> database with 3 tables over 2GB(in postgres space), but they still come
> out under 2GB after a dump.  I can't do a pg_dump on the whole database
> however, which would be very nice.
> 
> I suppose it wouldn't be overly hard to have pg_dump/pg_dumpall do
> something similar to what postgres does with segments.  I haven't looked
> at it yet however, so I can't say for sure.
> 
> Comments?

As pg_dump writes to stdout, you can just use standard *nix tools:

1. use compressed dumps

pg_dump really_big_db | gzip > really_big_db.dump.gz

reload with

gunzip -c really_big_db.dump.gz | psql newdb
or
cat really_big_db.dump.gz | gunzip | psql newdb

2. use split

pg_dump really_big_db | split -b 1m - really_big_db.dump.

reload with

cat really_big_db.dump.* | psql newdb

-----------------------
Hannu