How to use pandas.to_sql but only add row if row doesn't exist yet

Question

I have some experience with Python but am very new to SQL. I am trying to use pandas.to_sql to add table data to my database, and I want it to check whether the data already exists before appending.

These are my two DataFrames:

>>> df0.to_markdown()
|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |     00 |
|  1 |      1 |     11 |

>>> df1.to_markdown()
|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |     00 |
|  1 |      1 |     11 |
|  2 |      2 |     22 |

So here I use the pandas to_sql

>>> df0.to_sql(con=con, name='test_db', if_exists='append', index=False)
>>> df1.to_sql(con=con, name='test_db', if_exists='append', index=False)

Here I check my data inside the database file

>>> df_out = pd.read_sql("""SELECT * FROM test_db""", con)
>>> df_out.to_markdown()
|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |      0 |
|  1 |      1 |     11 |
|  2 |      0 |      0 | # Duplicate
|  3 |      1 |     11 | # Duplicate
|  4 |      2 |     22 |

But I want my database to look like this, so I don't want to add the duplicate data to my database

|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |      0 |
|  1 |      1 |     11 |
|  3 |      2 |     22 |

Is there any option I can set, or some line of code I can add, to make this happen?

Thank you!

Edit: There is SQL code that pulls only unique data, but what I want is to not add the duplicate data to the database in the first place.

There is no option to do it with pandas.to_sql. You should use INSERT ... ON DUPLICATE KEY ... instead.
– nacho, Commented May 11, 2020 at 8:35
Have you checked this link: Postgres: INSERT if does not exist already
– Cavin Dsouza, Commented May 11, 2020 at 8:57
Reply to comment #2: Yes, but that includes SQL code; I'm looking for vanilla pandas code :))
– Anonymous Anonymous, Commented May 11, 2020 at 12:35
@AnonymousAnonymous you do know that you are writing a SQL query string to the pd.to_sql
– Cavin Dsouza, Commented May 11, 2020 at 20:08
@tidakdiinginkan No I don't, but I'm trying to make a function on top of pd.to_sql so I can blindly throw in a bunch of data without needing to check it, and let the function check for me instead :)
– Anonymous Anonymous, Commented May 12, 2020 at 6:41

Sanchit Singh · Accepted Answer · 2021-04-06 13:58:40Z

8

Don't use to_sql; a simple query can work:

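from sqlalchemy import text

# Render every row of df0 as a SQL tuple literal and join them into one VALUES list
values = ','.join(str(row) for row in df0.to_records(index=False))
query = text(f"INSERT INTO test_db VALUES {values} ON CONFLICT ON CONSTRAINT test_db_pkey DO NOTHING")

# begin() opens a transaction and commits it when the block exits
with self.engine.begin() as conn:
    conn.execute(query)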

Repeat this for each DataFrame, replacing df0 with df1.

Follow these link for a better understanding

answered Apr 6, 2021 at 13:58


1 Comment

Alex Lamson Over a year ago

This method doesn't work with <NA> (eg. null boolean) or nan (eg. null string) values
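One possible way around the null-value problem raised in this comment is to pass the rows as bound parameters instead of formatting them into the SQL string. The sketch below is not taken from the answer. It uses the standard-library sqlite3 module and SQLite's INSERT OR IGNORE, which plays a role similar to ON CONFLICT ... DO NOTHING. The helper name insert_new_rows and the column Col3 are hypothetical and exist only for illustration, and the table is assumed to declare a UNIQUE constraint on the columns that define a duplicate. With pandas, the rows could come from df.itertuples(index=False, name=None).

```python
import sqlite3


def insert_new_rows(con, table, columns, rows):
    """Insert rows via bound parameters; rows that violate a unique key are skipped."""
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    sql = f'INSERT OR IGNORE INTO "{table}" ({col_list}) VALUES ({placeholders})'
    with con:  # commits on success, rolls back on error
        con.executemany(sql, rows)


con = sqlite3.connect(":memory:")
# Assumption: duplicates are defined by (Col1, Col2); Col3 is a made-up nullable column.
con.execute(
    "CREATE TABLE test_db (Col1 INTEGER, Col2 INTEGER, Col3 TEXT, UNIQUE (Col1, Col2))"
)

rows0 = [(0, 0, "a"), (1, 11, None)]                # stands in for df0
rows1 = [(0, 0, "a"), (1, 11, None), (2, 22, "c")]  # stands in for df1
insert_new_rows(con, "test_db", ["Col1", "Col2", "Col3"], rows0)
insert_new_rows(con, "test_db", ["Col1", "Col2", "Col3"], rows1)

result = con.execute(
    "SELECT Col1, Col2, Col3 FROM test_db ORDER BY Col1"
).fetchall()
print(result)
```

Because the values never pass through str(), None is stored as NULL without any special handling. Note that SQL treats NULLs inside a UNIQUE constraint as distinct from each other, so this only deduplicates reliably when the key columns themselves are non-null.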

Mike · Accepted Answer · 2022-04-26 06:55:40Z

0

For example, in my SQLite 3 database I used a temporary table.

I inserted all the DataFrame data into the temporary table:

df0.to_sql(con=con, name='temptable', if_exists='append', index=False)
df1.to_sql(con=con, name='temptable', if_exists='append', index=False)

Then I copy only the new data and drop (delete) the temporary table:

con.executescript('''
INSERT INTO test_db
SELECT DISTINCT temptable.* FROM temptable
LEFT JOIN test_db ON
   test_db.Col1 = temptable.Col1
WHERE test_db.Col1 IS NULL; -- only rows that are not yet present in 'test_db'

DROP TABLE temptable;
''')
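The staging-table idea can be tried end to end with the standard-library sqlite3 module alone. The following is a minimal sketch under a few assumptions: the tables are created by hand rather than by to_sql, the plain executemany calls stand in for the DataFrame inserts, and a row counts as a duplicate only when both Col1 and Col2 match (the query above compares Col1 only). It uses NOT EXISTS instead of a LEFT JOIN, which is a swapped-in but equivalent way to express the same anti-join.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Target table that already holds the rows of df0.
con.execute("CREATE TABLE test_db (Col1 INTEGER, Col2 INTEGER)")
con.executemany("INSERT INTO test_db VALUES (?, ?)", [(0, 0), (1, 11)])

# Staging table; stands in for df1.to_sql(con=con, name='temptable', ...).
con.execute("CREATE TABLE temptable (Col1 INTEGER, Col2 INTEGER)")
con.executemany("INSERT INTO temptable VALUES (?, ?)", [(0, 0), (1, 11), (2, 22)])

con.executescript("""
INSERT INTO test_db
SELECT DISTINCT t.* FROM temptable AS t
WHERE NOT EXISTS (
    SELECT 1 FROM test_db AS d
    WHERE d.Col1 = t.Col1 AND d.Col2 = t.Col2
);  -- keep only staged rows that have no match in test_db

DROP TABLE temptable;
""")

rows = con.execute("SELECT Col1, Col2 FROM test_db ORDER BY Col1").fetchall()
print(rows)
```

SELECT DISTINCT also guards against the same row being staged more than once, for example when several DataFrames are appended to the staging table before the copy.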

edited Apr 26, 2022 at 6:55

answered Apr 26, 2022 at 5:13


XYZ · Accepted Answer · 2022-09-26 09:20:57Z

There are two ways:

1. If the data in the database is not big, read it into a DataFrame and combine the two columns (Col1 and Col2) into a new column, e.g. combined_column, then save its values in a list, combined_column_list. Build the same combined_column for df0 and df1, keep only the rows whose combined_column does not appear in combined_column_list, and insert those filtered rows directly into the database table.

2. Insert df0 and df1 into a temporary table, e.g. one named "temp". Then use Python's pymysql to run the following code:

import pymysql

conn = pymysql.connect(host=DB_ip, user=DB_user, password=DB_password, database=DB_name)
cur = conn.cursor()

# Copy only the (Col1, Col2) pairs that test_db does not already contain
temp_query = "INSERT INTO test_db (SELECT * FROM temp WHERE (`Col1`, `Col2`) NOT IN (SELECT `Col1`, `Col2` FROM test_db));"
cur.execute(temp_query)
conn.commit()

This will only insert the new data to the database table.
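The first approach, filtering in pandas before inserting, could look roughly like the sketch below. It is an illustration rather than the answerer's code: the helper name append_new_rows is hypothetical, an in-memory SQLite database is used for the demonstration, and instead of building a combined column it uses DataFrame.merge with indicator=True, which gives the same anti-join on the key columns. It assumes the target table already exists and is small enough to read into memory.

```python
import sqlite3

import pandas as pd


def append_new_rows(df, con, table, key_cols):
    """Append only the rows of df whose key columns are not already in the table."""
    existing = pd.read_sql(f'SELECT * FROM "{table}"', con)[key_cols].drop_duplicates()
    # The _merge column marks rows found only in df as "left_only".
    merged = df.drop_duplicates(subset=key_cols).merge(
        existing, on=key_cols, how="left", indicator=True
    )
    new_rows = merged.loc[merged["_merge"] == "left_only", list(df.columns)]
    new_rows.to_sql(con=con, name=table, if_exists="append", index=False)
    return len(new_rows)


con = sqlite3.connect(":memory:")
df0 = pd.DataFrame({"Col1": [0, 1], "Col2": [0, 11]})
df1 = pd.DataFrame({"Col1": [0, 1, 2], "Col2": [0, 11, 22]})

# The first write creates the table; later writes go through the helper.
df0.to_sql(con=con, name="test_db", if_exists="append", index=False)
added = append_new_rows(df1, con, "test_db", ["Col1", "Col2"])

df_out = pd.read_sql("SELECT * FROM test_db", con)
print(added)
print(df_out)
```

The check and the insert are two separate steps here, so another writer could add the same rows in between. If that matters, a database-side constraint or the temporary-table approach is the safer choice.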

Bart77 · Accepted Answer · 2020-07-23 16:07:09Z

-2

Add this code to your function:

remove_duplicate = 'EXEC remove_duplicate'
cursor.execute(remove_duplicate)
cursor.commit()

And create this procedure in your database:

CREATE PROCEDURE remove_duplicate AS
BEGIN
    ;WITH duplicates AS (
        SELECT col1, col2,
               ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS number_of_duplicates
        FROM dbo.table
    )
    DELETE FROM duplicates WHERE number_of_duplicates > 1
END
GO

answered Jul 23, 2020 at 16:07
