12

I have some experience with python but very new to the SQL thing and trying to use pandas.to_sql to add table data into my database, but when I add I want it to check if the data exists before append

This are my 2 dataframes

>>> df0.to_markdown()
|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |     00 |
|  1 |      1 |     11 |

>>> df1.to_markdown()
|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |     00 |
|  1 |      1 |     11 |
|  2 |      2 |     22 |

So here I use the pandas to_sql

>>> df0.to_sql(con=con, name='test_db', if_exists='append', index=False)
>>> df1.to_sql(con=con, name='test_db', if_exists='append', index=False)

Here I check my data inside the database file

>>> df_out = pd.read_sql("""SELECT * FROM test_db""", con)
>>> df_out.to_markdown()
|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |      0 |
|  1 |      1 |     11 |
|  2 |      0 |      0 | # Duplicate
|  3 |      1 |     11 | # Duplicate
|  4 |      2 |     22 | 

But I want my database to look like this, so I don't want to add the duplicate data to my database

|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |      0 |
|  1 |      1 |     11 |
|  3 |      2 |     22 | 

Is there any option I can set or some line of code to add to make this happend?

Thankyou!

edit: There are some SQL code to only pull unique data, but what I want to do is don't add the data to the database in the first place

5
  • 1
    There is no option to do it with pandas.to_sql. You should use insted INSERT ... ON DUPLICATE KEY ... Commented May 11, 2020 at 8:35
  • Have you checked this link - Postgres: INSERT if does not exist already Commented May 11, 2020 at 8:57
  • Reply Comment#2 Yes, but that includes SQL code, I'm looking vanilla pandas code :)) Commented May 11, 2020 at 12:35
  • @AnonymousAnonymous you do know that you are writing a SQL query string to the pd.to_sql Commented May 11, 2020 at 20:08
  • @tidakdiinginkan No I don't, but I'm trying to make a function on top of pd.to_sql so I can blindly throw in bunch of data without need of checking, and let the function check for me instead :) Commented May 12, 2020 at 6:41

4 Answers 4

8

Don't use to_sql a simple query can work

query = text(f""" INSERT INTO test_db VALUES {','.join([str(i) for i in list(df0.to_records(index=False))])} ON CONFLICT ON CONSTRAINT test_db_pkey DO NOTHING""")

self.engine.connect().execute(query)

For each DataFrame change df0 to df1

Follow these link for a better understanding

Sign up to request clarification or add additional context in comments.

1 Comment

This method doesn't work with <NA> (eg. null boolean) or nan (eg. null string) values
0

For example in my Sqlite 3 database I used temporary table:

I inserted all the dataframe data into temporary table:

df0.to_sql(con=con, name='temptable', if_exists='append', index=False)
df1.to_sql(con=con, name='temptable', if_exists='append', index=False)

Then I copy only new data and drop (delete) the table:

con.executescript('''
INSERT INTO test_db
SELECT test_db.* FROM temptable 
LEFT JOIN test_db on 
   test_db.Col1 = temptable.Col1
WHERE test_db.Col1 IS NULL; -- only items, that not presented in 'test_db' table

DROP TABLE temptable;
''')

Comments

0

There are two ways:

  1. if the data from database is not big, read the data from database into the dataframe, and combine the two columns (Col1 and Col2) to create a new column, i.e. combined_column, and save this into a list combined_column_list. Filter out those rows from df0 and df2, whose corresponding combined_column does not appear in the combined_column_list, and inserted the filtered rows directly to the database table.

  2. Insert df1 and df2 to a temporary table, e.g. with name "temp". Using python pymysql to run the following code:

    conn = pymysql.connect(host=DB_ip, user=DB_user,passwd=DB_password,db=DB_name)    
    cur = conn.cursor()
    
    temp_query =" insert into  test_db (select * from temp where ( `Col1`, `Col2`) not in (select `Col1`, `Col2` from test_db ));"
    cur.execute(temp_query)
    conn.commit()  
    

This will only insert the new data to the database table.

Comments

-2

add this code to your function

remove_duplicate = 'EXEC remove_duplicate'
cursor.execute(remove_duplicate)
cursor.commit()

and create procedure in your db:

    CREATE PROCEDURE remove_duplicate AS
BEGIN
;WITH duplicates as (SELECT col1, col2,
                    ROW_NUMBER() OVER (PARTITION BY col1, col2, ORDER BY col1) AS number_of_duplicates
    FROM dbo.table)
    DELETE FROM duplicates WHERE number_of_duplicates > 1

END
go

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.