Thread: number of loaded/unloaded COPY rows
I prepared a patch for "Have COPY return the number of rows loaded/unloaded?" TODO. (Sorry for disturbing list with such a simple topic, but per warning from Bruce Momjian, I send my proposal to -hackers first.) I used the "appending related information to commandTag" method which is used for INSERT/UPDATE/DELETE/FETCH commands too. Furthermore, I edited libpq to make PQcmdTuples() interpret affected rows from cmdStatus value for COPY command. (Changes don't cause any compatibility problems for API and seems like work with triggers too.) One of the problems related with the used concept is trying to encapsulate processed number of rows within an uint32 variable. This causes an internal limit for counting COPY when we think it can process billions of rows. I couldn't find a solution for this. (Maybe, two uint32 can be used to store row count.) But other processed row counters (like INSERT/UPDATE) uses uint32 too. What's your suggestions and comments? Regards.
Attachment
Volkan YAZICI wrote: > I prepared a patch for "Have COPY return the number of rows > loaded/unloaded?" TODO. (Sorry for disturbing list with such a simple > topic, but per warning from Bruce Momjian, I send my proposal to -hackers > first.) > > I used the "appending related information to commandTag" method which is > used for INSERT/UPDATE/DELETE/FETCH commands too. Furthermore, I edited > libpq to make PQcmdTuples() interpret affected rows from cmdStatus value > for COPY command. (Changes don't cause any compatibility problems for API > and seems like work with triggers too.) > > One of the problems related with the used concept is trying to encapsulate > processed number of rows within an uint32 variable. This causes an internal > limit for counting COPY when we think it can process billions of rows. I > couldn't find a solution for this. (Maybe, two uint32 can be used to store > row count.) But other processed row counters (like INSERT/UPDATE) uses > uint32 too. > > What's your suggestions and comments? Good question. The libpq interface returns the count as a character string using PQcmdTuples(), meaning libpq doesn't have any size limitation. The problem is only the counter you are using, and I think an int64 is the proper solution. If int64 isn't really 64-bits, the code should still work though. In fact we have this TODO, which is related: * Change LIMIT/OFFSET and FETCH/MOVE to use int8 This requires the same type of change. I have added this TODO: * Allow the count returned by SELECT, etc to be to represent as an int64 to allow a higher range of values This requires a change to es_processed, I think. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Dec 16 08:47, Bruce Momjian wrote: > I think an int64 is the proper solution. If int64 isn't really > 64-bits, the code should still work though. (I'd prefer uint64 instead of an int64.) In src/include/c.h, in this or that way it'll assign a type for uint64, so there won't be any problem for both 64-bit and non-64-bit architectures. I've attached the updated patch. This one uses uint64 and UINT64_FORMAT while printing uint64 value inside string. I used char[20+1] as buffer size to store uint64 value's string representation. (Because, AFAIK, maximum decimal digit length of an [u]int64 equals to 2^64 - 1 = 20.) In this context, when I looked at the example usages (for instance as in backend/commands/sequence.c) it's stored in a char[100] buffer. Maybe we should define a constant in pg_config.h like INT64_PRINT_LEN. This will define a standard behaviour with INT64_FORMAT for using integers inside strings. For instance: char buf[INT64_PRINT_LEN+1]; snprintf(buf, sizeof(buf), INT64_FORMAT, var); > In fact we have this TODO, which is related: > > * Change LIMIT/OFFSET and FETCH/MOVE to use int8 > > This requires the same type of change. > > I have added this TODO: > > * Allow the count returned by SELECT, etc to be to represent > as an int64 to allow a higher range of values > > This requires a change to es_processed, I think. I think so. es_processed is defined as uint32. It should be uint64 too. I tried to prepare a patch for es_processed issue. But when I look further in the code, found that there're lots of mixed usages of "uint32" and "long" for row count related trackings. (Moreover, as you can see from the patch, there's a problem with ULLONG_MAX usage in there.) I'm aware of the patch's out-of-usability, but I just tried to underline some (IMHO) problems. Last minute edit: Proposal: Maybe we can define a (./configure controlled) type like pg_int (with bounds like PG_INT_MAX) to use in counter related processes. - * - AFAIK, there're two ways to implement a counter: 1. Using integer types supplied by the compiler, like uint64 as we discussed above. Pros: Whole mathematical operations are handled by the compiler. Cons: Implementation is bounded by the system architecture. 2. Using arrays to hold numeric values, like we did in the implementation of SQL numeric types. Pros: Value lengths bounded by available memory. Cons: Mathematical operations have to be handled by software. Therefore, this will cause a small overhead in performance aspect compared to previous implementation. I'm not sure if we can use the second implementation (in the performance point of view) for the COPY command's counter. But IMHO it can be agreeable for SELECT/INSERT/UPDATE/DELETE operations' counters. OTOH, by using this way, we'll form a proper method for counting without any (logical) bounds. What's your opinion? If you aggree, I'll try to use the second implementation for counters - except COPY. Regards. -- "We are the middle children of history, raised by television to believe that someday we'll be millionaires and movie stars and rock stars, but we won't. And we're just learning this fact," Tyler said. "So don't fuck with us."
Attachment
Volkan YAZICI wrote: > On Dec 16 08:47, Bruce Momjian wrote: > > I think an int64 is the proper solution. If int64 isn't really > > 64-bits, the code should still work though. > > (I'd prefer uint64 instead of an int64.) In src/include/c.h, in > this or that way it'll assign a type for uint64, so there won't > be any problem for both 64-bit and non-64-bit architectures. Right. > I've attached the updated patch. This one uses uint64 and You should use the patches email list instead for patches. What I usually do is to post the discussion to hackers, and just send a second email to patches with the patch, or put the patch at a URL and reference the URL in the email message. > UINT64_FORMAT while printing uint64 value inside string. Sounds good. > I used char[20+1] as buffer size to store uint64 value's string > representation. (Because, AFAIK, maximum decimal digit length of > an [u]int64 equals to 2^64 - 1 = 20.) In this context, when I > looked at the example usages (for instance as in > backend/commands/sequence.c) it's stored in a char[100] buffer. > Maybe we should define a constant in pg_config.h like > INT64_PRINT_LEN. This will define a standard behaviour with > INT64_FORMAT for using integers inside strings. We could do that, but it might be overkill. I don't think we use a #define for int32 to string either, but if you want to clean it up just define the int32 and int64 defines in c.h and patch the code to use them consistently. That might be nice. > For instance: > char buf[INT64_PRINT_LEN+1]; > snprintf(buf, sizeof(buf), INT64_FORMAT, var); Yep, that is nice. I am thinking you should submit your COPY patch using a hard-coded constant, then submit a followup patch that fixes this for int32 and int64 in all places. > > In fact we have this TODO, which is related: > > > > * Change LIMIT/OFFSET and FETCH/MOVE to use int8 > > > > This requires the same type of change. > > > > I have added this TODO: > > > > * Allow the count returned by SELECT, etc to be to represent > > as an int64 to allow a higher range of values > > > > This requires a change to es_processed, I think. > > I think so. es_processed is defined as uint32. It should be > uint64 too. Yep. > I tried to prepare a patch for es_processed issue. But when I look > further in the code, found that there're lots of mixed usages of > "uint32" and "long" for row count related trackings. (Moreover, > as you can see from the patch, there's a problem with ULLONG_MAX > usage in there.) Agreed, it needs cleanup. We have tried to fix it as we worked on other things, but we need a full review of what is happening. However, let's do one thing in each patch so we can easily evaluate things. > I'm aware of the patch's out-of-usability, but I just tried to > underline some (IMHO) problems. > > Last minute edit: Proposal: Maybe we can define a (./configure > controlled) type like pg_int (with bounds like PG_INT_MAX) to use > in counter related processes. Seems we are best just defining a constant in c.h that is larger than anything we might need on any platform, rather than run configure tests for this. > AFAIK, there're two ways to implement a counter: > > 1. Using integer types supplied by the compiler, like uint64 as we > discussed above. > Pros: Whole mathematical operations are handled by the compiler. > Cons: Implementation is bounded by the system architecture. I think we are at a point that people running on systems with no int64 support should not expect to get valid return values for >2 billion row COPY operations. There is no way on their platform for them even to work with those values, so why be concerned about platforms that do not support it. Certainly converting everything to int64 from int32 isn't going to make things any _worse_ for them than what they have now, and it will fix the majority of our platforms. We are not getting lots of complaints about our current behavior so I think we are OK just improving the platforms that support int64. > 2. Using arrays to hold numeric values, like we did in the > implementation of SQL numeric types. > Pros: Value lengths bounded by available memory. > Cons: Mathematical operations have to be handled by software. > Therefore, this will cause a small overhead in performance > aspect compared to previous implementation. > > I'm not sure if we can use the second implementation (in the > performance point of view) for the COPY command's counter. But IMHO > it can be agreeable for SELECT/INSERT/UPDATE/DELETE operations' > counters. OTOH, by using this way, we'll form a proper method for > counting without any (logical) bounds. > > What's your opinion? If you aggree, I'll try to use the second > implementation for counters - except COPY. No, please use int64 at this point. I think the number of platforms without int64 support is small, and dwindling, _and_ very few of those non-int64-supporting platforms are doing operations of the size that would hit this limit. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Also, can I get a context diff for the patch? See the developers FAQ for information on how our patch process works. Thanks. --------------------------------------------------------------------------- Volkan YAZICI wrote: > On Dec 16 08:47, Bruce Momjian wrote: > > I think an int64 is the proper solution. If int64 isn't really > > 64-bits, the code should still work though. > > (I'd prefer uint64 instead of an int64.) In src/include/c.h, in > this or that way it'll assign a type for uint64, so there won't > be any problem for both 64-bit and non-64-bit architectures. > > I've attached the updated patch. This one uses uint64 and > UINT64_FORMAT while printing uint64 value inside string. > > I used char[20+1] as buffer size to store uint64 value's string > representation. (Because, AFAIK, maximum decimal digit length of > an [u]int64 equals to 2^64 - 1 = 20.) In this context, when I > looked at the example usages (for instance as in > backend/commands/sequence.c) it's stored in a char[100] buffer. > Maybe we should define a constant in pg_config.h like > INT64_PRINT_LEN. This will define a standard behaviour with > INT64_FORMAT for using integers inside strings. > > For instance: > char buf[INT64_PRINT_LEN+1]; > snprintf(buf, sizeof(buf), INT64_FORMAT, var); > > > In fact we have this TODO, which is related: > > > > * Change LIMIT/OFFSET and FETCH/MOVE to use int8 > > > > This requires the same type of change. > > > > I have added this TODO: > > > > * Allow the count returned by SELECT, etc to be to represent > > as an int64 to allow a higher range of values > > > > This requires a change to es_processed, I think. > > I think so. es_processed is defined as uint32. It should be > uint64 too. > > I tried to prepare a patch for es_processed issue. But when I look > further in the code, found that there're lots of mixed usages of > "uint32" and "long" for row count related trackings. (Moreover, > as you can see from the patch, there's a problem with ULLONG_MAX > usage in there.) > > I'm aware of the patch's out-of-usability, but I just tried to > underline some (IMHO) problems. > > Last minute edit: Proposal: Maybe we can define a (./configure > controlled) type like pg_int (with bounds like PG_INT_MAX) to use > in counter related processes. > > - * - > > AFAIK, there're two ways to implement a counter: > > 1. Using integer types supplied by the compiler, like uint64 as we > discussed above. > Pros: Whole mathematical operations are handled by the compiler. > Cons: Implementation is bounded by the system architecture. > > 2. Using arrays to hold numeric values, like we did in the > implementation of SQL numeric types. > Pros: Value lengths bounded by available memory. > Cons: Mathematical operations have to be handled by software. > Therefore, this will cause a small overhead in performance > aspect compared to previous implementation. > > I'm not sure if we can use the second implementation (in the > performance point of view) for the COPY command's counter. But IMHO > it can be agreeable for SELECT/INSERT/UPDATE/DELETE operations' > counters. OTOH, by using this way, we'll form a proper method for > counting without any (logical) bounds. > > What's your opinion? If you aggree, I'll try to use the second > implementation for counters - except COPY. > > > Regards. > > -- > "We are the middle children of history, raised by television to believe > that someday we'll be millionaires and movie stars and rock stars, but > we won't. And we're just learning this fact," Tyler said. "So don't > fuck with us." [ Attachment, skipping... ] [ Attachment, skipping... ] -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I think we are at a point that people running on systems with no int64 > support should not expect to get valid return values for >2 billion row > COPY operations. I agree, there's no need to work harder on this than changing the datatype to uint64. There are some places (vacuum and index build IIRC) that use "double" counters instead of either int32 or int64. That was a good compromise a few years ago, but I'm not sure we still need it. I think we can reasonably say that our goals for backward-compatibility to systems with no int64 datatype do not include working nicely with tables exceeding 4G rows. I didn't like that the patch introduced new buffer variables that were not there before --- that is just adding complexity to no point. This patch should not need to do more than change some variables' datatypes and adjust printf formats and string buffer lengths to match. regards, tom lane