Double insertion from scala spark job - Mailing list pgsql-jdbc

From Antoine DUBOIS
Subject Double insertion from scala spark job
Msg-id 1094480618.3358882.1612867353140.JavaMail.zimbra@cc.in2p3.fr
List pgsql-jdbc
Hello

I'm working with Spark and PostgreSQL to compute statistics.
I've run into strange behaviour in my job: when writing output to PostgreSQL, I sometimes get a double insertion into my table (violating a unique constraint):
Detail: Key (xxx, xxx, xxx, xxx, xxx, xxxx, xxxx)=(2021-02-05 00:00:00, data, moredate, evenmoredata, somuchmoredata, dataagain, somuchofit) already exists.  Call getNextException to see other errors in the batch.
The data come out duplicated only with PostgreSQL: if I write the same data to MySQL or to a Parquet file, with the same input and the same processing, I don't observe this behaviour.
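
Side note: the "Call getNextException" hint from the message can be followed by walking the cause chain of the exception Spark rethrows on the driver. A minimal sketch of how that could look (logBatchErrors is a hypothetical helper of mine, and it assumes the SQLException actually surfaces in that chain):

import java.sql.SQLException

// Hypothetical helper: unwrap Spark's exception until a JDBC exception is
// found, then follow its getNextException chain as the message suggests.
def logBatchErrors(e: Throwable): Unit = {
  var cause: Throwable = e
  while (cause != null) {
    cause match {
      case sql: SQLException =>
        var next = sql.getNextException
        while (next != null) {
          println(next.getMessage)
          next = next.getNextException
        }
      case _ =>
    }
    cause = cause.getCause
  }
}

// Usage around the write shown further down:
// try { outputDF.write...save() } catch { case e: Throwable => logBatchErrors(e); throw e }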
Dev spec:
Scala 2.12
Spark Version 3.0.1
JDK 8
jdbc  "org.postgresql" % "postgresql" % "42.2.18"

PostgreSQL 12.5

My code is pretty simple: it applies a SQL query to a Parquet file and writes the result like this:

import org.apache.spark.sql.SaveMode

outputDF.write
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://<HOST>:<PORT>/<SCHEMA>?user=<USERNAME>&password=<PASSWORD>")
  .option("dbtable", "mytable")
  .mode(SaveMode.Append)
  .save()
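
To rule out genuine duplicates in outputDF itself, one thing I could try is dropping duplicates on the constraint columns just before the write. A quick sketch (the column names are placeholders for the real key):

// "day" and "source" stand in for the actual unique-constraint columns.
val dedupedDF = outputDF.dropDuplicates("day", "source")
// ... then write dedupedDF exactly as above ...

But even if that masked the symptom, it wouldn't explain where the duplicates come from.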

What leads me to think this is a PostgreSQL JDBC bug rather than anything else is that the same command, writing to MySQL or to a Parquet file, produces no duplicates in this particular edge case, which only occurs with some of my input files.
If any of you has an idea of what could cause such behaviour (a special character in the input, something misconfigured on my side, maybe an option I don't know about that could help solve this issue), I'd be glad to hear it.
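
One workaround I'm considering, though I'd rather understand the root cause first, is to bypass the DataFrameWriter and make the inserts idempotent with ON CONFLICT DO NOTHING, so a retried Spark task can't violate the constraint a second time. A rough, untested sketch (table and column names are placeholders; the driver jar must be on the executors' classpath):

import java.sql.DriverManager

outputDF.rdd.foreachPartition { rows =>
  // One connection per partition; this closure runs on the executors.
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://<HOST>:<PORT>/<SCHEMA>", "<USERNAME>", "<PASSWORD>")
  try {
    // DO NOTHING silently skips rows that already exist instead of failing the batch.
    val ps = conn.prepareStatement(
      "INSERT INTO mytable (day, source) VALUES (?, ?) ON CONFLICT DO NOTHING")
    rows.foreach { row =>
      ps.setTimestamp(1, row.getTimestamp(0))
      ps.setString(2, row.getString(1))
      ps.addBatch()
    }
    ps.executeBatch()
  } finally {
    conn.close()
  }
}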

I've come to a point where I'm not sure of anything any longer.
I hope someone will have some thoughts about it.

Hope you're doing fine,

Antoine
