Hello
I'm working with Spark and PostgreSQL to compute statistics.
I've run into a strange behaviour in my job: when writing to PostgreSQL, I sometimes get a duplicate insertion into my table (violating a unique constraint).
Detail: Key (xxx, xxx, xxx, xxx, xxx, xxxx, xxxx)=(2021-02-05 00:00:00, data, moredate, evenmoredata, somuchmoredata, dataagain, somuchofit) already exists. Call getNextException to see other errors in the batch.
The data themselves don't seem to be generated with duplicates: if I write the same data into MySQL or into a Parquet file, with the same input and the same processing, I don't observe this behaviour.
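To rule out duplicates in the data itself, a check along these lines can be run on the DataFrame before the write (a sketch; the column names are placeholders for the actual key columns from the error):

import org.apache.spark.sql.functions.col

// Placeholder key columns; substitute the real constraint columns.
val keyCols = Seq("col1", "col2", "col3")

// Group on the key and keep any key that occurs more than once.
val dupes = outputDF
  .groupBy(keyCols.map(col): _*)
  .count()
  .filter(col("count") > 1)

// Empty output here means the DataFrame carries no duplicate keys.
dupes.show(20, truncate = false)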
Dev spec:
Scala 2.12
Spark Version 3.0.1
JDK 8
jdbc "org.postgresql" % "postgresql" % "42.2.18"
PostgreSQL 12.5
My code is pretty simple: it applies a SQL query to a Parquet file and writes the result like this:
import org.apache.spark.sql.SaveMode

outputDF.write
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://<HOST>:<PORT>/<SCHEMA>?user=<USERNAME>&password=<PASSWORD>")
  .option("dbtable", "mytable")
  .mode(SaveMode.Append)
  .save()
What leads me to think it's a PostgreSQL JDBC issue more than anything else is that this same command, writing to MySQL or to a Parquet file, produces no duplicates in this particular edge case, which I hit with only some of my input files.
If any of you have an idea what could cause such a behaviour (a special character in the input, something I've misconfigured, maybe an option I don't know about that could help solve this issue), I'd be glad to hear it.
I've reached a point where I'm not sure of anything any longer.
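If it does turn out to be a re-executed insert rather than a driver bug, the only workaround I can think of is making the insert idempotent on the PostgreSQL side with ON CONFLICT DO NOTHING. A rough sketch, bypassing Spark's JDBC writer entirely (table and column names are placeholders):

import java.sql.DriverManager
import org.apache.spark.sql.Row

// Each executor opens its own connection and inserts its partition.
// ON CONFLICT DO NOTHING makes a re-run of the same rows a no-op.
outputDF.rdd.foreachPartition { rows: Iterator[Row] =>
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://<HOST>:<PORT>/<SCHEMA>?user=<USERNAME>&password=<PASSWORD>")
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO mytable (col1, col2, col3) VALUES (?, ?, ?) ON CONFLICT DO NOTHING")
    rows.foreach { row =>
      stmt.setObject(1, row.get(0).asInstanceOf[AnyRef])
      stmt.setObject(2, row.get(1).asInstanceOf[AnyRef])
      stmt.setObject(3, row.get(2).asInstanceOf[AnyRef])
      stmt.addBatch()
    }
    stmt.executeBatch()
  } finally {
    conn.close()
  }
}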
I hope someone will have some thoughts about it.