Re: Double insertion from scala spark job - Mailing list pgsql-jdbc

From Antoine DUBOIS
Subject Re: Double insertion from scala spark job
Date
Msg-id 1856232242.3494395.1612947743989.JavaMail.zimbra@cc.in2p3.fr
Whole thread Raw
In response to Re: Double insertion from scala spark job  (Dave Cramer <davecramer@postgres.rocks>)
List pgsql-jdbc
Hi Dave,
I agree that without more code or data it's hard;
I just wanted to share my issue with you.
It is an edge case even for me, and it only happens with some of my data samples.
I just wanted to let you know that I encountered such an edge case, in case I'm not the only one.
Also, it might be a bug in Spark rather than in the JDBC driver.
Thank you for your answer, and have the best of days.


From: "Dave Cramer" <davecramer@postgres.rocks>
To: "antoine dubois" <antoine.dubois@cc.in2p3.fr>
Cc: "List" <pgsql-jdbc@postgresql.org>
Sent: Wednesday, 10 February 2021 01:44:44
Subject: Re: Double insertion from scala spark job



On Tue, 9 Feb 2021 at 06:48, Antoine DUBOIS <antoine.dubois@cc.in2p3.fr> wrote:
Hello

I'm working with Spark and PostgreSQL to compute statistics.
I came across a strange behaviour in my job: when writing to PostgreSQL, I sometimes get a double insertion into my table (violating a unique constraint):
Detail: Key (xxx, xxx, xxx, xxx, xxx, xxxx, xxxx)=(2021-02-05 00:00:00, data, moredate, evenmoredata, somuchmoredata, dataagain, somuchofit) already exists.  Call getNextException to see other errors in the batch.
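As the driver message suggests, a JDBC batch failure chains its individual errors, and `getNextException` walks that chain. A minimal sketch of unwrapping it (the `batchErrors` helper name is mine, not from the driver):

```scala
import java.sql.{BatchUpdateException, SQLException}

// Collect every message in a JDBC batch failure's exception chain.
// Error messages like "Call getNextException to see other errors in
// the batch" refer to exactly this chain.
def batchErrors(e: SQLException): List[String] =
  Iterator.iterate(e)(_.getNextException) // follow the chain
    .takeWhile(_ != null)                 // stop at the end
    .map(_.getMessage)
    .toList
```

Calling this in the job's error handler would show which row of the batch actually collided, instead of only the first message.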
The data come out duplicated only here; if I write the same data into MySQL or into a Parquet file, with the same input and the same processing, I don't observe this behaviour.
Dev spec:
Scala 2.12
Spark Version 3.0.1
JDK 8
jdbc  "org.postgresql" % "postgresql" % "42.2.18"

PostgreSQL 12.5

My code is pretty simple: it applies a SQL query to a Parquet file and writes the result like this:

outputDF.write
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://<HOST>:<PORT>/<SCHEMA>?user=<USERNAME>&password=<PASSWORD>")
  .option("dbtable", "mytable")
  .mode("append")
  .save()
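One workaround worth trying, assuming the duplicates originate upstream in the Spark computation (e.g. a recomputed partition), is to deduplicate on the key columns just before the JDBC write. A minimal sketch; the `writeDeduplicated` helper and the key-column names are mine, not part of the original job:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: keep exactly one row per key before handing the
// DataFrame to the JDBC writer, so a Spark-side recomputation cannot
// feed the same key into the batch twice.
def writeDeduplicated(outputDF: DataFrame, keyCols: Seq[String]): Unit =
  outputDF
    .dropDuplicates(keyCols) // one row per unique-constraint key
    .write
    .format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", "jdbc:postgresql://<HOST>:<PORT>/<SCHEMA>")
    .option("user", "<USERNAME>")
    .option("password", "<PASSWORD>")
    .option("dbtable", "mytable")
    .mode(SaveMode.Append)
    .save()
```

If the duplicates disappear with `dropDuplicates` in place, that points at the Spark side rather than the driver.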

What leads me to think it's a PostgreSQL JDBC bug rather than anything else is that the same command, writing to MySQL or to a Parquet file, produces no duplicates in this particular edge case, which I hit with only some of my input files.
If any of you have an idea of what could cause such a behaviour (a special character in the input, something misconfigured, or maybe an option I don't know about that could help solve this issue), I'd be glad to hear it.

I've come to a point where I'm not sure of anything any longer.
I hope someone will have some thoughts about it.

You are the first person to report such a problem.

Without additional information, such as your code, there's little we can do.


Dave Cramer
www.postgres.rocks

