Thread: stored proc and inserting hundreds of thousands of rows
I have a stored proc that inserts hundreds of thousands, potentially millions, of rows (below). This stored proc is part of the sequence of creating an ad campaign and links an ad to the documents it should be displayed with. A few of these stored procs can run concurrently as users create ad campaigns.

We have 2 million documents now and linking an ad to all of them takes 5 minutes on my top-of-the-line SSD MacBook Pro.

Last but not least, the system has to quickly serve ads while documents are being linked, which is a problem at the moment. What can I do to make linking documents to ads faster, or at least have it put less load on the system? I would like the system to remain responsive while serving ads; the linking itself is allowed to take a few minutes.

One thing I'm concerned with, for example, is the whole multi-million row insert running within the stored proc transaction. I think inserting rows one by one or in small batches may be an improvement. I don't know how to accomplish this, though.

Thanks, Joel

---

CREATE DOMAIN doc_id AS varchar(64);
CREATE DOMAIN id AS int;

CREATE TABLE doc_ads
(
  doc_id   id NOT NULL REFERENCES docs,
  ad_id    id NOT NULL REFERENCES ads,
  distance float NOT NULL
);

CREATE INDEX doc_ads_idx ON doc_ads(doc_id);

CREATE OR REPLACE FUNCTION link_doc_to_ads(doc id, threshold float)
RETURNS void AS $$
BEGIN
  INSERT INTO doc_ads (doc_id, ad_id, distance)
  SELECT doc, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, threshold) AS t
        FROM docs
        WHERE id = doc) AS x;
  ANALYZE doc_ads;
END;
$$ LANGUAGE plpgsql;

--------------------------------------------------------------------------
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
---------------------+------------+---------------------------------------
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
---------------------+------------+---------------------------------------
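One way to break the single large insert into smaller transactions, as Joel wonders about above, is to materialize the candidate rows once and then copy them into doc_ads in client-driven batches. This is only a sketch: the staging table, the psql variables (:doc, :threshold, :last_ad_id) and the batch size are illustrative assumptions, not part of the posted schema.

-- Sketch only: stage the matches once, then move them into doc_ads in
-- smaller client-driven transactions so ad serving is not blocked behind
-- one multi-million row insert.

-- 1. Compute the matches once, outside the doc_ads transaction.
CREATE TABLE doc_ads_staging AS
SELECT :doc AS doc_id, (t).ad_id, (t).distance
FROM (SELECT ads_within_distance(topics, :threshold) AS t
      FROM docs WHERE id = :doc) AS x;

-- 2. Copy them into doc_ads in batches, committing between runs.
--    Repeat from the client, advancing :last_ad_id each time,
--    until no more rows are inserted.
INSERT INTO doc_ads (doc_id, ad_id, distance)
SELECT doc_id, ad_id, distance
FROM doc_ads_staging
WHERE ad_id > :last_ad_id
ORDER BY ad_id
LIMIT 10000;

-- 3. Clean up and refresh statistics once all batches are in.
DROP TABLE doc_ads_staging;
ANALYZE doc_ads;

Each batch is its own short transaction, so concurrent ad-serving lookups never wait behind one five-minute insert.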
Joel Reymont <joelr1@gmail.com> wrote:

> We have 2 million documents now and linking an ad to all of them
> takes 5 minutes on my top-of-the-line SSD MacBook Pro.

How long does it take to run just the SELECT part of the INSERT by
itself?

-Kevin
Calculating distance involves giving an array of 150 float8 to a pgsql function, then calling a C function 2 million times (at the moment), giving it two arrays of 150 float8.

Just calculating distance for 2 million rows and extracting the distance takes less than a second. I think that includes sorting by distance and sending 100 rows to the client.

Are you suggesting eliminating the physical linking and calculating matching documents on the fly?

Is there a way to speed up my C function by giving it all the float arrays, calling it once and having it return a set of matches? Would this be faster than calling it from a select, once for each array?

Sent from my comfortable recliner

On 30/04/2011, at 18:28, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

> Joel Reymont <joelr1@gmail.com> wrote:
>
>> We have 2 million documents now and linking an ad to all of them
>> takes 5 minutes on my top-of-the-line SSD MacBook Pro.
>
> How long does it take to run just the SELECT part of the INSERT by
> itself?
>
> -Kevin
If you want to search by geographical coordinates, you could use a GiST index, which can optimize that sort of thing (like retrieving all rows which fit in a box).
I'm calculating distance between probability vectors, e.g. the topics that a document belongs to and the topics of an ad.

The distance function is already a C function. Topics are float8[150].

Distance is calculated against all documents in the database, so it's a table scan.

Sent from my comfortable recliner

On 30/04/2011, at 19:04, Pierre C <lists@peufeu.com> wrote:

> If you want to search by geographical coordinates, you could use a GiST index, which can optimize that sort of thing (like retrieving all rows which fit in a box).
[rearranging to correct for top-posting]

Joel Reymont <joelr1@gmail.com> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>> Joel Reymont <joelr1@gmail.com> wrote:
>>
>>> We have 2 million documents now and linking an ad to all of them
>>> takes 5 minutes on my top-of-the-line SSD MacBook Pro.
>>
>> How long does it take to run just the SELECT part of the INSERT
>> by itself?

> Are you suggesting eliminating the physical linking and
> calculating matching documents on the fly?

I'm not suggesting anything other than it being a good idea to
determine where the time is being spent before trying to make it
faster. You showed this as the apparent source of the five minute
delay:

INSERT INTO doc_ads (doc_id, ad_id, distance)
  SELECT doc, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, threshold) AS t
        FROM docs
        WHERE id = doc) AS x;

What we don't know is how much of that time is due to writing to the
doc_ads table, and how much is due to reading the other tables. We
can find that out by running this:

SELECT doc, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, threshold) AS t
        FROM docs
        WHERE id = doc) AS x;

If this is where most of the time is, the next thing is to run it
with EXPLAIN ANALYZE, and post the output. It's a whole different
set of things to try to tune if that part is fast and the INSERT
itself is slow.

Of course, be aware of caching effects when you time this.

-Kevin
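One way to act on Kevin's suggestion is sketched below, with literal values standing in for the doc and threshold parameters and a throwaway temp table name of my own choosing. Timing the bare SELECT, then the same SELECT materialized into a scratch table, and then the real INSERT roughly separates the read cost from the cost of writing tuples and from the doc_ads index maintenance and foreign-key checks.

\timing on

-- 1. Read side only (1234 and 50.0 stand in for the real doc id and threshold):
SELECT 1234, (t).ad_id, (t).distance
FROM (SELECT ads_within_distance(topics, 50.0) AS t
      FROM docs WHERE id = 1234) AS x;

-- 2. Same rows written to a scratch table: adds tuple-writing cost, but not
--    doc_ads' index maintenance or foreign-key checks.
CREATE TEMP TABLE timing_scratch AS
SELECT 1234 AS doc_id, (t).ad_id, (t).distance
FROM (SELECT ads_within_distance(topics, 50.0) AS t
      FROM docs WHERE id = 1234) AS x;

-- 3. The real INSERT into doc_ads adds the index and FK overhead on top;
--    comparing the three timings shows where the five minutes go.

As Kevin notes, repeat each step so that caching effects do not skew the comparison.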
Joel Reymont <joelr1@gmail.com> wrote:

> I'm calculating distance between probability vectors, e.g. topics
> that a document belongs to and the topics of an ad.
>
> The distance function is already a C function. Topics are
> float8[150].
>
> Distance is calculated against all documents in the database

There's probably a way to index that so that you don't need to do a
full calculation against all documents in the database each time.

It may even be amenable to knnGiST indexing (a new feature coming in
9.1), which would let you do your select with an ORDER BY on the
distance.

PostgreSQL has a lot of very cool features you just don't have in any
other product! :-)

-Kevin
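For reference, the 9.1 feature Kevin mentions lets a GiST index satisfy an ORDER BY on a distance operator directly. The built-in example uses the point type; applying it to 150-element topic vectors would require a custom operator class for the divergence function, which is an assumption well beyond what this thread establishes. A minimal sketch of the query shape with the built-in point operator, using a hypothetical doc_points table:

-- Sketch of the 9.1 KNN-GiST query shape with the built-in point type;
-- doc_points is illustrative and not part of Joel's schema.
CREATE TABLE doc_points (id int PRIMARY KEY, loc point);
CREATE INDEX doc_points_loc_idx ON doc_points USING gist (loc);

-- The index returns rows in distance order, so the nearest 100 can be
-- fetched without computing the distance to every row:
SELECT id
FROM doc_points
ORDER BY loc <-> point '(0.5, 0.5)'
LIMIT 100;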
On Apr 30, 2011, at 7:24 PM, Kevin Grittner wrote: > If this is where most of the time is, the next thing is to run it > with EXPLAIN ANALYZE, and post the output. I was absolutely wrong about the calculation taking < 1s, it actually takes about 30s for 2 million rows. Still, the difference between 5 minutes and 30s must be the insert. SELECT (t).doc_id, (t).distance FROM (SELECT docs_within_distance('{ 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.586099770475, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.167233562858, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,0.00166666666667 }', 50.0) as t) AS x; This takes 27.44 seconds EXPLAIN ANALYZE VERBOSE Subquery Scan on x (cost=0.00..0.27 rows=1 width=32) (actual time=22422.418..23835.468 rows=2000002 loops=1) Output: (t).doc_id, (t).distance -> Result (cost=0.00..0.26 rows=1 width=0) (actual time=22422.410..23184.086 rows=2000002 loops=1) Output: docs_within_distance(('{<array above goes here>}'::double precision[])::topics, 50::double precision) Total runtime: 23948.563 ms Topics is defined thusly: 
CREATE DOMAIN topics AS float[150];

Digging deeper into the distance function:

EXPLAIN ANALYZE VERBOSE
SELECT *
FROM (SELECT id, divergence(<array above>, topics) AS distance FROM docs) AS tab
WHERE tab.distance <= 50.0;

Subquery Scan on tab  (cost=0.00..383333.00 rows=666653 width=12) (actual time=0.027..20429.299 rows=2000002 loops=1)
  Output: tab.id, tab.distance
  Filter: (tab.distance <= 50::double precision)
  ->  Seq Scan on public.docs  (cost=0.00..358333.50 rows=1999960 width=36) (actual time=0.025..19908.200 rows=2000002 loops=1)
        Output: docs.id, divergence((<array above>::double precision[])::topics, docs.topics)
Total runtime: 20550.019 ms

I can't dig any deeper because divergence is a C function.

Thanks, Joel
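Since the filter in the plan above removes nothing (2,000,002 rows in, 2,000,002 rows out), a quick aggregate shows how many rows a given cutoff would actually keep before any insert is attempted. This is only a sketch: the cutoff values are arbitrary examples, and :'topic_vector' is a psql variable assumed to hold the 150-element array literal used in the query above.

-- \set topic_vector '{ 0.00166666666667, ... }'   (the vector used above)
SELECT count(*) AS candidates,
       sum(CASE WHEN distance <= 50.0 THEN 1 ELSE 0 END) AS under_50,
       sum(CASE WHEN distance <= 0.5  THEN 1 ELSE 0 END) AS under_0_5
FROM (SELECT divergence(:'topic_vector'::topics, topics) AS distance
      FROM docs) AS d;

If a realistic threshold keeps only a small fraction of the 2 million rows, the insert shrinks by the same fraction.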
On Apr 30, 2011, at 7:36 PM, Kevin Grittner wrote:

> It may even be amenable to knnGiST indexing (a new feature coming in
> 9.1), which would let you do your select with an ORDER BY on the
> distance.

I don't think I can wait for 9.1, need to go live in a month, with
PostgreSQL or without.

> PostgreSQL has a lot of very cool features you just don't have in any other product! :-)

There's a strong NoSQL lobby here and I'm trying my best to show that
PG can handle the job!
On Sat, Apr 30, 2011 at 2:15 PM, Joel Reymont <joelr1@gmail.com> wrote:
>
> On Apr 30, 2011, at 7:24 PM, Kevin Grittner wrote:
>
>> If this is where most of the time is, the next thing is to run it
>> with EXPLAIN ANALYZE, and post the output.
>
> I was absolutely wrong about the calculation taking < 1s, it actually takes about 30s for 2 million rows.
>
> Still, the difference between 5 minutes and 30s must be the insert.

But what exactly are you inserting?  The queries you reported below
are not the same as the ones you originally described.

In particular, they do not seem to use the "threshold" parameter that
the original ones did, whose job is presumably to cut the 2 million
down to a much smaller number that meet the threshold.  But how much
smaller is that number?  This will have a large effect on how long
the insert takes.

...

> Digging deeper into the distance function:
>
> EXPLAIN ANALYZE VERBOSE
> SELECT *
> FROM (SELECT id, divergence(<array above>, topics) AS distance FROM docs) AS tab
> WHERE tab.distance <= 50.0;
>
> Subquery Scan on tab  (cost=0.00..383333.00 rows=666653 width=12) (actual time=0.027..20429.299 rows=2000002 loops=1)
>   Output: tab.id, tab.distance
>   Filter: (tab.distance <= 50::double precision)
>   ->  Seq Scan on public.docs  (cost=0.00..358333.50 rows=1999960 width=36) (actual time=0.025..19908.200 rows=2000002 loops=1)
>         Output: docs.id, divergence((<array above>::double precision[])::topics, docs.topics)

It looks like "WHERE tab.distance <= 50.0" is not accomplishing
anything.  Are you sure the parameter shouldn't be <= 0.50 instead?

Also, you previously said you didn't mind if this process took a
couple of minutes, as long as it didn't interfere with other things
going on in the database.  So you probably need to describe what
those other things going on in the database are.

Also, you might have a data correctness problem.  If the plan is to
scan new ads against all docs, and new docs against all ads; then if
new rows are added to each table during overlapping transactions, the
new ads against new docs comparison will not actually happen.  You
will probably need to add manual locking to get around this problem.

Cheers,

Jeff
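A concrete version of the test Jeff describes might look like the sketch below. The table name foo, the batch size, the sleep length and the sampling fraction are all illustrative; the concurrent read load would come from something like pgbench -S running against a separate table at the same time.

-- Sketch of Jeff's suggestion: generate sustained insert load from a client
-- loop, optionally throttled with pg_sleep on a random subset of rows.
CREATE TABLE foo (id int);

-- Run repeatedly from the client while a separate read workload
-- (e.g. pgbench -S) hits another table:
INSERT INTO foo (id)
SELECT g
FROM (SELECT g,
             -- sleep on roughly 1 row in 1000 so each batch spreads its I/O out
             CASE WHEN random() < 0.001 THEN pg_sleep(0.01) END AS pause
      FROM generate_series(1, 100000) AS g
      OFFSET 0) AS t;

Measuring the read workload's latency with and without this insert loop running gives the "lookups per second when inserts are also going on, versus when they are not" number Jeff asks about.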
On Apr 30, 2011, at 11:11 PM, Jeff Janes wrote:

> But what exactly are you inserting?  The queries you reported below
> are not the same as the ones you originally described.

I posted the wrong query initially. The only difference is in the table that holds the probability array.

I'm inserting document id and ad id pairs to show that this ad is not linked to this document. The mapping table has a primary key on the serial document id.

> In particular, they do not seem to use the "threshold" parameter that
> the original ones did, whose job is presumably to cut the 2 million
> down to a much smaller number that meet the threshold.  But how much
> smaller is that number?

The 5 minutes is with a threshold large enough to be irrelevant. I would like to optimize the process before I apply the threshold to cut down the number of rows.

> It looks like "WHERE tab.distance <= 50.0" is not accomplishing
> anything.  Are you sure the parameter shouldn't be <= 0.50 instead?

No, ignore the threshold for now.

> Also, you previously said you didn't mind if this process took a
> couple of minutes, as long as it didn't interfere with other things
> going on in the database.  So you probably need to describe what
> those other things going on in the database are.

Those other things are ad serving, which boils down to a lookup of the ad ids linked to the document.

This is a lookup from the mapping table using the primary key, and it goes on at the same time as a large number of <doc,ad> mappings are being inserted into the same table.

Documents are uploaded into the system at a rate of 10k per day, once every couple of seconds. I wish I could get rid of storing the <doc,ad> mapping, as that table is going to grow absolutely huge when each new ad matches tens or hundreds of thousands of documents.

I don't think I can do the matching when serving an ad, though, as I will still need to scan millions of probability vectors (one per doc) to calculate the distance between the current document and existing ads.

Then again, the number of ads in the system will always be a fraction of the number of documents, so perhaps the matching of documents to ads can be done at runtime.

> Also, you might have a data correctness problem.  If the plan is to
> scan new ads against all docs, and new docs against all ads;

That's basically it.

As new ads are entered, they need to be matched with existing documents. As new documents are entered, they need to be matched with existing ads.

Both ads and docs are represented by probability vectors of 150 floats, so it's the same distance calculation.

> then if new rows are added to each table during overlapping transactions, the
> new ads against new docs comparison will not actually happen.  You
> will probably need to add manual locking to get around this problem.

I'll ponder this, thanks for pointing it out!
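For context, the serving-side lookup Joel describes against the posted doc_ads schema presumably looks something like the sketch below. The ORDER BY, the LIMIT and the literal doc id are assumptions about how ads are picked, and the composite index is a suggestion rather than part of the posted DDL.

-- Approximate shape of the ad-serving lookup that has to stay fast while
-- linking runs (the ORDER BY/LIMIT are assumptions, not from the thread):
SELECT ad_id, distance
FROM doc_ads
WHERE doc_id = 1234          -- the document being served
ORDER BY distance
LIMIT 10;

-- doc_ads_idx on (doc_id) covers the lookup; an index on (doc_id, distance)
-- would additionally let the top-N be read in order without a sort.
CREATE INDEX doc_ads_doc_distance_idx ON doc_ads (doc_id, distance);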
On Sat, Apr 30, 2011 at 3:29 PM, Joel Reymont <joelr1@gmail.com> wrote:
>
> On Apr 30, 2011, at 11:11 PM, Jeff Janes wrote:
>
>> But what exactly are you inserting?  The queries you reported below
>> are not the same as the ones you originally described.
>
> I posted the wrong query initially. The only difference is in the table that holds the probability array.
>
> I'm inserting document id and ad id pairs to show that this ad is not linked to this document. The mapping table has a primary key on the serial document id.

Having the (doc_id, ad_id) pair be missing from the table is a far
more efficient way to show that the ad is not linked to the document
(i.e. that it is below the threshold).  That is, provided you are
careful that there are no other reasons the pair could be missing;
but if you are not careful about that, then I don't see how storing
the full matrix will save you anyway.

>> In particular, they do not seem to use the "threshold" parameter that
>> the original ones did, whose job is presumably to cut the 2 million
>> down to a much smaller number that meet the threshold.  But how much
>> smaller is that number?
>
> The 5 minutes is with a threshold large enough to be irrelevant. I would like to optimize the process before I apply the threshold to cut down the number of rows.
>
>> It looks like "WHERE tab.distance <= 50.0" is not accomplishing
>> anything.  Are you sure the parameter shouldn't be <= 0.50 instead?
>
> No, ignore the threshold for now.

OK, but it seems to me that you are starting out by ruling out the
one optimization that is most likely to work.

>> Also, you previously said you didn't mind if this process took a
>> couple of minutes, as long as it didn't interfere with other things
>> going on in the database.  So you probably need to describe what
>> those other things going on in the database are.
>
> Those other things are ad serving, which boils down to a lookup of the ad ids linked to the document.
>
> This is a lookup from the mapping table using the primary key, and it goes on at the same time as a large number of <doc,ad> mappings are being inserted into the same table.

What numbers do you get for lookups per second when inserts are also
going on, versus when they are not going on?

The way I would approach this is by making two independent tasks: one
that inserts records at your anticipated rate ("insert into foo
select generate_series from generate_series(1,100000);" in a loop),
and another that generates select load against a separate table (like
pgbench -S), and then see how the two interact with each other by
competing for CPU and IO.

You could throttle the insert process by adding pg_sleep(<some
fraction of a second>) as a column in one of your selects, so it
pauses at every row.  But due to the granularity of pg_sleep, you
might have to put it in a CASE expression so it is invoked on only a
random subset of the rows rather than on each row.  But once
throttled, will it be able to keep up with the flow of new docs and
ads?

> I don't think I can do the matching when serving an ad, though, as I will still need to scan millions of probability vectors (one per doc) to calculate the distance between the current document and existing ads.

GiST indices are designed to make this type of thing fast, by using
techniques to rule out most of those comparisons without actually
performing them.  I don't know enough about the guts of either your
distance function or the GiST indexes to know if you can do it this
way, but if you can, it would certainly be the way to go.

Cheers,

Jeff
On Sat, Apr 30, 2011 at 5:12 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> GiST indices are designed to make this type of thing fast, by using
> techniques to rule out most of those comparisons without actually
> performing them.  I don't know enough about the guts of either your
> distance function or the GiST indexes to know if you can do it this
> way, but if you can, it would certainly be the way to go.
It is definitely a good idea to consider a GiST index for eliminating most of a large dataset, if applicable. Do a little reading on the topic and, hopefully, its applicability (or not) will become apparent.
However, as someone who has built a number of ad servers over the years, for several of the larger ad networks, the first thing I'd do is separate your ad serving from real-time interaction with your database, no matter what the underlying technology. If your ad serving is dependent on your database, it means that hiccups in the database will have an impact on ad serving, which is rarely tolerable. And god forbid you should need to take the db down for a period of maintenance. The reliability and performance required of most ad servers is often greater than what should reasonably be expected of a relational database, particularly if there are other processes accessing the database, as is the case with your system. The first rule of ad serving is that no outage of backend systems should ever be able to prevent or impact front-end ad serving.

Some kind of in-memory cache of doc/ad mappings which the ad server interacts with will serve you in good stead, and it will be much easier to scale horizontally than most relational db architectures lend themselves to. If you have an ever-increasing set of documents and ads, you'll inevitably wind up 'sharding' your dataset across multiple db hosts in order to maintain performance - which creates a number of maintenance complexities. It is easier to maintain a single database and do analytics over a single data source, but insulate it from the real-time performance requirements of your ad serving. Even something as simple as a process that pushes the most recent doc/ad mappings into a memcache instance could be sufficient - and you can scale your memcache across as many hosts as is necessary to deliver the lookup latencies that you require, no matter how large the dataset.

Similarly, if you are updating the database from the ad server with each ad served, in order to record an impression or click, you'll be far better off logging those and then processing the logs in bulk on a periodic basis. If subsequent impressions are dependent upon what has already been served historically, then use your memcache instance (or whatever structure you eventually choose to utilize) to handle those lookups. This gives you the convenience and flexibility of a relational system with SQL for access, but without the constraints of the capabilities of a single host limiting the real-time performance of the system as a whole.
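A minimal sketch of the periodic push Samuel describes, assuming a created_at timestamp column on doc_ads (the posted DDL does not have one) and an external job that writes each returned row into the cache keyed by doc_id; every name here is illustrative.

-- Sketch only: requires a created_at column that the posted doc_ads DDL
-- does not have (note that adding it with a default rewrites the table).
ALTER TABLE doc_ads ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();

-- Run every minute or so; the rows returned are pushed into memcached
-- keyed by doc_id, so the ad server never touches the database directly.
SELECT doc_id, ad_id, distance
FROM doc_ads
WHERE created_at > now() - interval '5 minutes'
ORDER BY doc_id, distance;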
On 04/30/2011 09:00 PM, Samuel Gendler wrote:

> Some kind of in-memory cache of doc/ad mappings which the ad server
> interacts with will serve you in good stead and will be much easier to
> scale horizontally than most relational db architectures lend
> themselves to... Even something as simple as a process that pushes the
> most recent doc/ad mappings into a memcache instance could be
> sufficient - and you can scale your memcache across as many hosts as
> is necessary to deliver the lookup latencies that you require no
> matter how large the dataset.

Many of the things I see people switching over to NoSQL key/value store
solutions would be served equally well on the performance side by a
memcache layer between the application and the database. If you can map
the problem into key/value pairs for NoSQL, you can almost certainly do
that using a layer above PostgreSQL instead.

The main downside of that, and what people seem to object to, is that it
makes for two pieces of software that need to be maintained; the NoSQL
solutions can do it with just one. If you have more complicated queries
to run, too, the benefit of using a more complicated database should
outweigh that extra complexity, though.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books