Seg fault when processing large SPI cursor (PG 9.1.3) - Mailing list pgsql-hackers
| From | Fields, Zachary J. (MU-Student) |
|---|---|
| Subject | Seg fault when processing large SPI cursor (PG 9.1.3) |
| Date | |
| Msg-id | CF524B464816BB469FEBFD9AD660E92E3EF88ACF@CH1PRD0102MB112.prod.exchangelabs.com |
| Responses | Re: Seg fault when processing large SPI cursor (PG 9.1.3) |
| List | pgsql-hackers |
I'm working on PostgreSQL 9.1.3 (waiting for an admin to push upgrades next week). In the meantime, I'm curious whether there are any known bugs regarding large cursor fetches, or whether I am to blame.
My cursor has 400 million records, and I'm fetching in blocks of 2^17 (approx. 131K). When I fetch the next block after processing the 48,889,856th record, the backend segfaults. It should be noted that I have processed tables of 23 million+ records several times and everything appears to work fine.
Watching top, system memory usage climbs to 97.6% (from approx. 30 million records onward, then sways up and down), but the crash ultimately comes when I try to get past the 48,889,856th record. I have tried various block sizes: anything greater than 2^17 crashes at the fetch that would surpass 48,889,856 records, 2^16 hits the same sweet spot, and anything less than 2^16 actually crashes slightly earlier (noted in the comments in the code below).
To me it looks like a memory leak; the question is who caused it. I would normally assume I am to blame (and I may be), but the code is so simple (shown below) that I can't see how it could be me - unless I am misusing the SPI interface (which is entirely possible).
Here is the code segment that is crashing...
<code>
// Cursor variables
const char *cursor_name = NULL; // Postgres will self-assign a name
const int arg_count = 0; // No arguments will be passed
Oid *arg_types = NULL; // n/a
Datum *arg_values = NULL; // n/a
const char *null_args = NULL; // n/a
bool read_only = true; // read_only allows for optimization
const int cursor_opts = CURSOR_OPT_NO_SCROLL; // default cursor options
bool forward = true;
//const long fetch_count = FETCH_ALL;
//const long fetch_count = 1048576; // 2^20 - last processed = 48,234,496
//const long fetch_count = 524288; // 2^19 - last processed = 48,758,784
//const long fetch_count = 262144; // 2^18 - last processed = 48,758,784
const long fetch_count = 131072; // 2^17 - last processed = 48,889,856
//const long fetch_count = 65536; // 2^16 - last processed = 48,889,856
//const long fetch_count = 32768; // 2^15 - last processed = 48,857,088
//const long fetch_count = 16384; // 2^14 - last processed = 48,791,552
//const long fetch_count = 8192; // 2^13 - last processed = 48,660,480
//const long fetch_count = 4096; // 2^12 - last processed = 48,398,336
//const long fetch_count = 2048; // 2^11
//const long fetch_count = 1024; // 2^10
//const long fetch_count = 512; // 2^9
//const long fetch_count = 256; // 2^8
//const long fetch_count = 128; // 2^7
//const long fetch_count = 64; // 2^6
//const long fetch_count = 32; // 2^5
//const long fetch_count = 16; // 2^4
//const long fetch_count = 8; // 2^3
//const long fetch_count = 4; // 2^2
//const long fetch_count = 2; // 2^1
//const long fetch_count = 1; // 2^0
unsigned int i, j, end, stored;
unsigned int result_counter = 0;
float8 l1_norm;
bool is_null = true;
bool nulls[4];
Datum result_tuple_datum[4];
HeapTuple new_tuple;
MemoryContext function_context;
ResultCandidate *candidate, **candidates, *top, *free_candidate = NULL;
KSieve<ResultCandidate *> sieve(result_cnt_);
/*********************/
/** Init SPI_cursor **/
/*********************/
// Connect to SPI
if ( SPI_connect() != SPI_OK_CONNECT ) { return; }
// Prepare and open SPI cursor
Portal signature_cursor = SPI_cursor_open_with_args(cursor_name, sql_stmt_, arg_count, arg_types, arg_values, null_args, read_only, cursor_opts);
do {
// Fetch rows for processing (Populates SPI_processed and SPI_tuptable)
SPI_cursor_fetch(signature_cursor, forward, fetch_count);
/************************/
/** Process SPI_cursor **/
/************************/
// Iterate cursor and perform calculations
for (i = 0 ; i < SPI_processed ; ++i) {
// Transfer columns to work array
for ( j = 1 ; j < 4 ; ++j ) {
result_tuple_datum[j-1] = SPI_getbinval(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, j, &is_null);
nulls[j-1] = is_null;
}
// Special Handling for final column
Datum raw_double_array = SPI_getbinval(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, 4, &is_null);
nulls[3] = is_null;
if ( is_null ) {
l1_norm = FLT_MAX;
result_tuple_datum[3] = PointerGetDatum(NULL);
} else {
// Transform binary into double array
ArrayType *pg_double_array = DatumGetArrayTypeP(raw_double_array);
l1_norm = meanAbsoluteError(signature_, (double *)ARR_DATA_PTR(pg_double_array), (ARR_DIMS(pg_double_array))[0], 0);
result_tuple_datum[3] = Float8GetDatum(l1_norm);
}
// Create and test candidate
if ( free_candidate ) {
candidate = free_candidate;
free_candidate = NULL;
} else {
candidate = (ResultCandidate *)palloc(sizeof(ResultCandidate));
}
candidate->lat = DatumGetFloat8(result_tuple_datum[0]);
candidate->null_lat = nulls[0];
candidate->lon = DatumGetFloat8(result_tuple_datum[1]);
candidate->null_lon = nulls[1];
candidate->orientation = DatumGetFloat8(result_tuple_datum[2]);
candidate->null_orientation = nulls[2];
candidate->rank = l1_norm;
candidate->null_rank = nulls[3];
// Run candidate through sieve
top = sieve.top();
if ( !sieve.siftItem(candidate) ) {
// Free non-filtered candidates
free_candidate = candidate;
} else if ( sieve.size() == result_cnt_ ) {
// Free non-filtered candidates
free_candidate = top;
}
}
result_counter += i;
} while ( SPI_processed );
SPI_finish();
</code>
Is there an obvious error I'm overlooking, or is there a known bug in 9.1.3 with large fetch sizes?
Thanks,
Zak
P.S. KSieve is a POD type encapsulating an array that has been allocated with palloc().