Thread: Turning off transactions completely.

Turning off transactions completely.

From
"Arsalan Zaidi"
Date:
Hi.

I'm writing a very specialised app which has absolutely NO need for
transactions. However, I have to milk every drop of performance out of it
cause I'm performing some fairly complex queries on tens of millions of
rows.

Is there any way I can completely avoid the overhead of the implicit
transaction per SQL statement? Is there some switch or some (fairly easy)
editing that can be done to the code? My transaction log files often reach
800 MB in size and I'd like to avoid the overhead if at all possible...

Thanks in advance.

--Arsalan.



-------------------------------------------------------------------
People often hate those things which they do not know, or cannot understand.
--Ali Ibn Abi Talib (AS)


Re: Turning off transactions completely.

From
Martijn van Oosterhout
Date:
On Sat, Jan 05, 2002 at 02:21:44PM +0530, Arsalan Zaidi wrote:
> Hi.
>
> I'm writing a very specialised app which has absolutely NO need for
> transactions. However, I have to milk every drop of performance out of it
> cause I'm performing some fairly complex queries on tens of millions of
> rows.
>
> Is there any way I can completely avoid the overhead of the implicit
> transaction per SQL statement? Is there some switch or some (fairly easy)
> editing that can be done to the code? My transaction log files often reach
> 800 MB in size and I'd like to avoid the overhead if at all possible...

Well, every statement must be within a transaction, so to reduce the number
of transactions start one explicitly and commit when you're done. Then you
only have one transaction for the whole thing.
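
For example, a minimal sketch (table name made up):

  BEGIN;
  INSERT INTO mytable VALUES (1, 'one');
  INSERT INTO mytable VALUES (2, 'two');
  -- ...thousands more statements, all inside the same transaction...
  COMMIT;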

HTH,

--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> Terrorists can only take my life. Only my government can take my freedom.

Re: Turning off transactions completely.

From
Tom Lane
Date:
On Sat, Jan 05, 2002 at 02:21:44PM +0530, Arsalan Zaidi wrote:
>> My transaction log files often reach
>> 800 MB in size and I'd like to avoid the overhead if at all possible...

They shouldn't get that big.  What PG version are you running?
(Any answer less than 7.1.3 is the wrong answer...)

            regards, tom lane

Re: Turning off transactions completely.

From
Martijn van Oosterhout
Date:
On Mon, Jan 07, 2002 at 11:56:12AM +0530, Arsalan Zaidi wrote:
> > Well, every statement must be within a transaction, so to reduce the number
> > of transactions start one explicitly and commit when you're done. Then you
> > only have one transaction for the whole thing.
> >
>
> I already do this. However, I don't want *any* transactions at all. I don't
> care if I lose data in the middle of, say, an update. The nature of the app
> is such that the data can be reconstructed.

Well, no transactions at all is not possible; the whole data storage system
is pretty much built around them. Besides, as long as everything is in one
transaction there is *no* per-statement overhead, IIRC.

> However, as it currently stands, the app takes around 30 hrs to finish its
> run. I wish to reduce that to 24hr or less.

Wow. I can insert hundreds of rows per second within a transaction and my
hardware is not even particularly good. That would be 21 million rows in
that time. How big is your data set? Are you using COPY or INSERT to insert
the data?

> I found a comment from 1999 where someone asked a similar Q and Momjian
> responded that that was not possible in pg. Is that still true? Can it be
> easily changed in the code?

I think you're working under the assumption that transactions == overhead
whereas I don't believe that's true. Work out where the bottleneck is.
--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> Terrorists can only take my life. Only my government can take my freedom.

Re: Turning off transactions completely.

From
"Arsalan Zaidi"
Date:
> Well, no transactions at all is not possible; the whole data storage system
> is pretty much built around them. Besides, as long as everything is in one
> transaction there is *no* per-statement overhead, IIRC.
>

How can that be? You're doing *some* record keeping after all, so there has
to be some overhead!

> > However, as it currently stands, the app takes around 30 hrs to finish its
> > run. I wish to reduce that to 24hr or less.
>
> Wow. I can insert hundreds of rows per second within a transaction and my
> hardware is not even particularly good. That would be 21 million rows in
> that time. How big is your data set? Are you using COPY or INSERT to insert
> the data?
>

Using COPY for the initial load and INSERT for subsequent handling. Can't
use COPY there because those statements need WHERE clauses.

I start off with 3.3GB of data and things just grow from that with a fresh
3GB file every day! Plenty of dup data in it though. Lots of complex queries
get run on this data, which generates even more info... At the end of the
initial run, I get ~50GB of data in the pg data dir (course that's got a
lot more info in it than just a plain text file...) with ~10GB per
subsequent run.

Haven't checked yet, but I think I get ~30 million rows in the init run.
BTW, I've got a dual proc machine with a RAID-0 array and 1 GB of RAM, but
pg only uses one CPU at a time. Would have been great if it had been
multi-threaded or something.


> > I found a comment from 1999 where someone asked a similar Q and Momjian
> > responded that that was not possible in pg. Is that still true? Can it be
> > easily changed in the code?
>
> I think you're working under the assumption that transactions == overhead
> whereas I don't believe that's true. Work out where the bottleneck is.

I've got too much data? :-)

--Arsalan.



Re: Turning off transactions completely.

From
"Arsalan Zaidi"
Date:
> Well, every statement must be within a transaction, so to reduce the number
> of transactions start one explicitly and commit when you're done. Then you
> only have one transaction for the whole thing.
>

I already do this. However, I don't want *any* transactions at all. I don't
care if I lose data in the middle of, say, an update. The nature of the app
is such that the data can be reconstructed.

However, as it currently stands, the app takes around 30 hrs to finish its
run. I wish to reduce that to 24hr or less.

I found a comment from 1999 where someone asked a similar Q and Momjian
responded that that was not possible in pg. Is that still true? Can it be
easily changed in the code?

--Arsalan.


Re: Turning off transactions completely.

From
Alvar Freude
Date:
Hi,

> Using COPY for the initial load and INSERT for subsequent handling. Can't
> use COPY there because those statements need WHERE clauses.
>
> I start off with 3.3GB of data and things just grow from that with a fresh
> 3GB file every day! Plenty of dup data in it though. Lots of complex
> queries get run on this data, which generates even more info... At the
> end of the initial run, I get ~50GB of data in the pg data dir (course
> that's got a lot more info in it than just a plain text file...) with
> ~10GB per subsequent run.

I would suggest optimizing your application!

Try to:

  - cache previously inserted IDs and other things
    (e.g. use Perl hashes)

  - create simpler WHERE clauses

  - look at your indexes; perhaps you can create an index on two columns?
    (see the example below)

And:
  use VACUUM ANALYZE after the database is freshly built and filled
  with the first ~100,000 rows.
  Later, VACUUM ANALYZE every 1,000,000 or 10,000,000 rows...
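
For example, something like (table and column names made up):

  CREATE INDEX mytable_col_a_col_b_idx ON mytable (col_a, col_b);
  VACUUM ANALYZE mytable;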

> BTW, I've got a dual proc machine with a RAID-0 array and 1 GB of RAM, but
> pg only uses one CPU at a time. Would have been great if it had been
> multi-threaded or something.

if you use two inserting processes, Postgres should also use two. AFAIK! :)


Ciao
  Alvar


--
http://www.teletrust.info/
http://www.odem.org/ || http://www.odem.org/insert_coin/imkp2001.html


--
AGI :: Hohnerstrasse 23, 70469 Stuttgart
Fon +49 (0)711.490 320-0, Fax +49 (0)711.490 320-150
AGI auf Platz 3 im neuen Multimedia-Kreativranking
http://www.agi.de/tagebuch/


Re: Turning off transactions completely.

From
Bruce Momjian
Date:
> I found a comment from 1999 where someone asked a similar Q and Momjian
> responded that that was not possible in pg. Is that still true? Can it be
> easily changed in the code?

It is possible to turn off transactions.  The questions are whether the
system would be any faster, and whether a single INSERT failure due to a
constraint check would leave the system unusable.  You can turn off fsync in
postgresql.conf, and maybe there is a way to prevent pre-change image
logging to WAL, but other than that, I can't think of any more speedups
that keep the system usable.
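
For reference, that switch is a one-liner in postgresql.conf (you lose at
most a crash's worth of recent commits):

  fsync = false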

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Turning off transactions completely.

From
Tom Lane
Date:
"Arsalan Zaidi" <azaidi@directi.com> writes:
> I already do this. However, I don't want *any* transactions at all.

Perhaps you want MySQL ;-)

Seriously, transactions per se are not your problem, and you can't turn
them off anyway, so there's no point in continuing to bark up that tree.

What I think you are really unhappy about is the WAL/checkpoint
activity, and that you can control to some extent.  For starters, have
you turned off fsync?  Another thing to consider is increasing the
CHECKPOINT_SEGMENTS parameter (and possibly WAL_FILES too), which will
increase the amount of disk space used for the WAL log, but should
reduce the amount of datafile disk traffic.  If the postmaster log shows
a lot of snippy little notices about "consider increasing WAL_FILES",
then you should do that.  If you have multiple disk drives, see if you
can put the WAL files (pg_xlog subdirectory) on a different disk spindle
than the data files.
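
For instance, in postgresql.conf (exact values are guesses; tune to taste):

  checkpoint_segments = 16    # default is 3; higher means fewer checkpoints
  wal_files = 8               # preallocates extra WAL files (7.1-era setting)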

There are also the usual tuning questions about whether you've set the
number of shared buffers to something reasonable, etc. etc.

            regards, tom lane

Re: Turning off transactions completely.

From
"Arsalan Zaidi"
Date:
> I would suggest optimizing your application!

First thing I did! :-)

>   - cache previously inserted IDs and other things
>     (eg. use Perl hashes)

I'm using Java/JDBC. Also, there's very little data I can cache and reuse.

>
>   - create simpler WHERE clauses
>

Sorry, I need to be specific! Besides, I've used EXPLAIN extensively and
these are the fastest possible queries (at least I hope so) which do what I
want them to do.

>   - look at your indexes; perhaps you can create an index on two columns?
>

Got them up the wazoo. Two column ones as well...

Just want to know, is an index on (foo,bar) different from (bar,foo)? Does
the order in which they appear in the index creation statement and in
subsequent queries make a difference?


> And:
>   use VACUUM ANALYZE after the database is freshly built and filled
>   with the first ~100,000 rows.
>   Later, VACUUM ANALYZE every 1,000,000 or 10,000,000 rows...
>

I'm vacuuming like crazy (after the initial data COPY and then once the run
ends (to prepare for the next run)), but there are two issues...

1. VACUUM is dead slow. The CPU monitor shows long stretches when there's
barely any activity at all while the vacuum is going on.

2. I do vacuum analyzes to help the database figure out how to best run my
queries. But it *still* doesn't use some of the indexes...

> > BTW, I've got a dual proc machine with a RAID-0 array and 1 GB of RAM, but
> > pg only uses one CPU at a time. Would have been great if it had been
> > multi-threaded or something.
>
> if you use two inserting processes, Postgres should also use two. AFAIK! :)
>

Ah, interesting point. I spent the last two (working) days converting my
single-process app into a multi-threaded one, to allow me to fire off
multiple queries at the same time. I was hoping this would lead to a more
optimum use of system resources... Didn't work. Best case, the
multi-threaded app is just as fast as the single-threaded one, and in the
worst case it's much slower.

I'm guessing that the various complex queries being fired at the same time
are forcing the use of swap (yes, it's on another HDD entirely), which is
slowing down the machine.

--Arsalan.


Re: Turning off transactions completely.

From
"Arsalan Zaidi"
Date:
> Perhaps you want MySQL ;-)

I plan to benchmark it today. Pg performs fine for normal business use and
we're using it for our other needs, but this app is rather unusual...

>
> Seriously, transactions per se are not your problem, and you can't turn
> them off anyway, so there's no point in continuing to bark up that tree.
>

'Kay; it was just a thought.

<info about checkpoint tuning>

Am already doing this and have increased the checkpoint segments, WAL files
and buffers. I no longer get any advice in the logs about increasing these...

>
> There are also the usual tuning questions about whether you've set the
> number of shared buffers to something reasonable, etc. etc.
>

Already done...



Re: Turning off transactions completely.

From
"Arguile"
Date:
Arsalan writes:
> >   - look at your indexes; perhaps you can create an index on two columns?
> >
>
> Got them up the wazoo. Two column ones as well...
>

Make sure you don't have any you don't absolutely need. Each index you have
adds overhead to any DML statement as it needs to be adjusted.

> Just want to know, is an index on (foo,bar) different from (bar,foo)? Does
> the order in which they appear in the index creation statement and in
> subsequent queries make a difference?

Yes, very much so. An index on (foo, bar, qux) will only be used for queries
in which the WHERE clause contains one, two, or three of the fields starting
from the first (left). So,

  "foo = ?", or "bar = ? AND foo = ?", or "foo = ? AND qux = ? AND bar = ?"

will use the index. The textual order in the query is irrelevant as long as
the fields themselves are there. These,

  "bar = ?", "qux = ?", "qux = ? AND bar = ?"

will not use the index, and this

  "foo = ? AND qux = ?"

will only partially use the index (for the foo lookup). Unfortunately I
can't explain it better, hence the long example. Also remember the optimiser
will only choose a single index for use in a query and discard any others
you have. So plan wisely, as you're balancing INSERT/UPDATE overhead against
SELECT speed.
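
In SQL terms, with a hypothetical table t:

  CREATE INDEX t_foo_bar_qux_idx ON t (foo, bar, qux);

  SELECT * FROM t WHERE foo = 1;              -- uses the index
  SELECT * FROM t WHERE bar = 2 AND foo = 1;  -- uses it; textual order is irrelevant
  SELECT * FROM t WHERE bar = 2;              -- cannot use it (no foo)
  SELECT * FROM t WHERE foo = 1 AND qux = 3;  -- only the foo part of the index helps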



Re: Turning off transactions completely.

From
"Arsalan Zaidi"
Date:
> Arsalan writes:
> > >   - look at your indexes; perhaps you can create an index on two columns?
> > >
> >
> > Got them up the wazoo. Two column ones as well...
> >
>
> Make sure you don't have any you don't absolutely need. Each index you have
> adds overhead to any DML statement as it needs to be adjusted.
>

DML?

If you're talking about slower inserts because of indexes: I drop them
on the really heavy tables before I do a COPY/INSERT.

> > Just want to know, is an index on (foo,bar) different from (bar,foo)? Does
> > the order in which they appear in the index creation statement and in
> > subsequent queries make a difference?
>

<snipped nice explanation>

Cool. My indexes are in order, so that's OK. I just wanted to be sure
about this.



Re: Turning off transactions completely.

From
Maarten.Boekhold@reuters.com
Date:

On 01/08/2002 10:02:39 AM Arsalan Zaidi wrote:
>
> >   - look at your indexes; perhaps you can create an index on two columns?
>
> Got them up the wazoo. Two column ones as well...

You know that too many indices hurt insert/update/delete performance,
right? For each of these actions, all related indices would need to be
updated. So if you have any indices that are not used or that you could
do without, remove them.
 
> Just want to know, is an index on (foo,bar) different from (bar,foo)? Does
> the order in which they appear in the index creation statement and in
> subsequent queries make a difference?

Yes, the order does make a difference. For one, if you have an index on
(foo,bar) and you have a WHERE clause that only restricts on bar (but not
foo), this index won't be used.

> Ah, interesting point. I spent the last two (working) days converting my
> single process app into a multi threaded one, to allow me to fire off
> multiple queries at the same time. I was hoping this would lead to a more
> optimum use of system resources... Didn't work. Best case, the
> multi-threaded app is just as fast as the single -threaded one and in the
> worst case; it's much slower.

Are you using the same database connection for each thread in your
multi-threaded approach? Postgresql will only benefit from multiple
processors if there are multiple postgres processes running. In
application terms that means that you need to have multiple database
connections open (i.e. one backend process per connection).

Maarten

----

Maarten Boekhold, maarten.boekhold@reuters.com

Reuters Consulting / TIBCO Finance Technology Inc.
Dubai Media City
Building 1, 5th Floor
PO Box 1426
Dubai, United Arab Emirates
tel:+971(0)4 3918300 ext 249
fax:+971(0)4 3918333
mob:+971(0)505526539



Re: Turning off transactions completely.

From
Martijn van Oosterhout
Date:
On Tue, Jan 08, 2002 at 11:32:39AM +0530, Arsalan Zaidi wrote:
> I'm vacuuming like crazy (after the initial data COPY and then once the run
> ends (to prepare for the next run)), but there are two issues...

Quick question: you are creating the indices *after* you insert the data,
right?

> 1. VACUUM is dead slow. The CPU monitor shows long stretches when there's
> barely any activity at all while the vacuum is going on.
>
> 2. I do vacuum analyzes to help the database figure out how to best run my
> queries. But it *still* doesn't use some of the indexes...

A vacuum analyze does a vacuum as well, so you only need to do the former.
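
That is (mytable being whatever table you just loaded):

  VACUUM ANALYZE mytable;  -- vacuums *and* gathers stats in one pass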

> Ah, interesting point. I spent the last two (working) days converting my
> single-process app into a multi-threaded one, to allow me to fire off
> multiple queries at the same time. I was hoping this would lead to a more
> optimum use of system resources... Didn't work. Best case, the
> multi-threaded app is just as fast as the single-threaded one, and in the
> worst case it's much slower.

Strange, that would indicate a serious bottleneck. Are you saturating the
disks? You can use vmstat to work out the amount of disk activity.

> I'm guessing that the various complex queries being fired at the same time
> are forcing the use of swap (yes, it's on another HDD entirely), which is
> slowing down the machine.

You say you had 1 GB of memory? Could you give an example of a query that
> uses this amount of memory?

HTH,
--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> Terrorists can only take my life. Only my government can take my freedom.

Re: Turning off transactions completely.

From
"Arsalan Zaidi"
Date:
> You know that too many indices hurt insert/update/delete performance,
> right? For each of these actions, all related indices would need to be
> updated. So if you have any indices that are not used or that you could
> do without, remove them.

As mentioned, tables which see a lot of data modification have their indexes
removed and then re-created. I don't bother doing this with tables which
aren't so heavily modified.

> Are you using the same database connection for each thread in your
> multi-threaded approach? Postgresql will only benefit from multiple
> processors if there are multiple postgres processes running. In
> application terms that means that you need to have multiple database
> connections open (i.e. one backend process per connection).

Each thread opens up a new connection. Using (g)top I have already verified
that there are several postgres instances running. The machine is entirely
dedicated to this task and there were no other users.

Just to re-iterate.

1. I've done quite a bit of tweaking with the WAL*, shared buffers and sort
mem over several weeks. They're about as optimum as I can get them. Turning
off fsync was one of the first things I did BTW.
2. The executable was compiled on the machine with as many optimisations as
gcc could reasonably support.
3. The queries are just about as good as I can get them and have been
thoroughly EXPLAIN'ed with live/large amounts of data in the tables. No IN's
used, only EXISTS where required (see the sketch after this list). This one
bit me right at the start. I had to kill one query after it ran for ~36hrs!
BTW, EXISTS seems to be far more efficient than a JOIN. Is this always true?
4. Temp tables are used to simplify complex queries (and speed them up I
hope).
5. RAID-0 (SCSI/HW) + Dual Proc + 1GB RAM. Linux 2.4.17(smp) (pure Linus. No
other patches). Swap is on a separate IDE drive.
6. COPY is used wherever it can be. Indexes are dropped before heavy
modification and then recreated. I don't use CLUSTER.
7. Driver app was multi-threaded. It made things worse. BTW, the app's job
consists largely of firing off SQL queries in the correct sequence, so you'd
better not go blaming my code!
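
To illustrate points 3 and 6, roughly (table and index names invented):

  -- point 3: the IN form that bit me, and the EXISTS rewrite I use now
  SELECT * FROM big b WHERE b.id IN (SELECT id FROM other);
  SELECT * FROM big b
   WHERE EXISTS (SELECT 1 FROM other o WHERE o.id = b.id);

  -- point 6: drop indexes before heavy modification, recreate after
  DROP INDEX big_id_idx;
  COPY big FROM '/path/to/data';
  CREATE INDEX big_id_idx ON big (id);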

Is there anything I've missed out?

--Arsalan.





Re: Turning off transactions completely.

From
"Arsalan Zaidi"
Date:
> Quick question: you are creating the indices *after* you insert the data,
> right?

That is correct.


> > 1. VACUUM is dead slow. The CPU monitor shows long stretches when there's
> > barely any activity at all while the vacuum is going on.
> >
> > 2. I do vacuum analyzes to help the database figure out how to best run my
> > queries. But it *still* doesn't use some of the indexes...
>
> A vacuum analyze does a vacuum as well, so you only need to do the former.

Am aware of that. I only do 'analyzes'. Sorry if I didn't make that clear.
:-)


> Strange, that would indicate a serious bottleneck. Are you saturating the
> disks? You can use vmstat to work out the amount of disk activity.

Right. I guess I'll do that.

>
> > I'm guessing that the various complex queries being fired at the same time
> > are forcing the use of swap (yes, it's on another HDD entirely), which is
> > slowing down the machine.
>
> You say you had 1 GB of memory? Could you give an example of a query that
> uses this amount of memory?
>

Okay...

When I create a double index on a table with ~3.3 GB of data (the size of
the initial plain text file). Course, on the second run it'll have ~6.6 GB
of data... I haven't reached that far yet. I ran out of space on my array on
the first run itself. BTW, that's production data. For testing I use a 300
MB file. The sort mem is set like so:

-S 350000

So that could be the reason. I don't mind the mem being used in the
single-threaded app (there's no one else there), but with multi-threads, it
could be a problem...
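
(If I read the docs right, -S is in kilobytes, so that works out to:

  350000 KB ~= 342 MB of sort memory per sort, per backend
  3-4 backends sorting at once ~= 1 GB+, i.e. all my RAM

which would explain the swapping once I went multi-threaded.)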

--Arsalan



Re: Turning off transactions completely.

From
Alvar Freude
Date:
Hi,

>> I would suggest optimizing your application!
>
> First thing I did! :-)

OK, then sorry :)

I have an application where the time consumption and amount of data are
similar, so this was my first idea.


> 1. VACUUM is dead slow. The CPU monitor shows long stretches when there's
> barely any activity at all while the vacuum is going on.

yes :-(

I have hopes for 7.2; it seems that VACUUM is much faster there (but I
haven't tested it yet).


> 2. I do vacuum analyzes to help the database figure out how to best run
> my queries. But it *still* doesn't use some of the indexes...

I had the same problem; a simple (but dirty) solution was to disable
sequential scans in postgresql.conf:

enable_seqscan = false
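
I believe you can also flip it per session instead of globally:

  SET enable_seqscan TO off;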

Ciao
 Alvar


--
AGI :: Hohnerstrasse 23, 70469 Stuttgart
Fon +49 (0)711.490 320-0, Fax +49 (0)711.490 320-150
AGI auf Platz 3 im neuen Multimedia-Kreativranking
http://www.agi.de/tagebuch/


Re: Turning off transactions completely.

From
Francisco Reyes
Date:
On Tue, 8 Jan 2002, Arsalan Zaidi wrote:
> Just to re-iterate.
> 1. I've done quite a bit of tweaking with the WAL*, shared buffers and sort
> mem over several weeks. They're about as optimum as I can get them.

Could you expand on that? What makes you say they are "optimum"?


> 3. The queries are just about as good as I can get them and have been
> thoroughly EXPLAIN'ed with live/large amounts of data in the tables. No IN's
> used, only EXISTS where required.

I am far from an SQL expert, but it may not hurt to share with the list
what these queries are. Perhaps even post them.

> 4. Temp tables are used to simplify complex queries (and speed them up I
> hope).

Could you expand on that?

> 5. RAID-0 (SCSI/HW) + Dual Proc + 1GB RAM. Linux 2.4.17(smp) (pure Linus. No
> other patches). Swap is on a separate IDE drive.

How about more memory? This would certainly help substantially.
What is the speed of the HDs? 10K RPM? 15K RPM?
Is the swap ever hit? Hitting swap always hurts, but given that you put
swap on an IDE drive it would slow you down even more.
What is the speed of the CPUs?
What type of memory: PC100/PC133/Rambus/DDR?
How many disks in your RAID 0?

Although there is much you could perhaps do through software configuration,
once you get to the levels you are describing there is no substitute for
fast hardware. Probably the cheapest upgrade you can do is adding more
memory. This will likely help.

You also need to pay close attention to your schema. You want to have your
most heavily used data in a small table. To give you an example, let me
make up a scenario similar to something I did.

Let's say I have a people table, roughly:

  CREATE TABLE people (
      id        serial,
      name      varchar(20),
      last      varchar(20),
      address   varchar(30),
      address2  varchar(30),
      state     char(2),
      zip       int,
      comment   varchar(30),
      education char(1)
  );

Now let's say that I use this table extremely heavily when doing joins
with other tables (i.e. OLAP-type analyses against other tables).
What I did was break it into two tables, keeping only the fields
I used 90%+ of the time in the first one. Something like:

  CREATE TABLE people_core (
      id   serial,
      name varchar(20),
      zip  int
  );

Then did another table with the rest, linked by id. I didn't do any
particular timing analysis, but I got a substantial improvement.
In my actual case I had about 50 fields which totalled something in the
neighborhood of 300 bytes per row. After breaking it up, the abbreviated
version of the table was less than 50 bytes.


> 7. Driver app was multi-threaded. It made things worse.

Expand on this. This is where the number of buffers vs. the amount of
memory comes into play. If by firing multiple copies you ended up hitting
swap, this would have made things much worse. Remember postgresql fires up
processes. It doesn't use threads.

> BTW, the app's job consists largely of firing off SQL queries
> in the correct sequence;


It may help if you tell us more about these queries. In particular the one
you mentioned that takes 36 hours.