I’m trying to read about 10 million rows (a single column) from a database table into Python as efficiently as possible, and I’m not sure whether my current approach is reasonable or whether I’m missing some obvious optimizations.
Approach 1: cursor + fetchmany
On average, it takes around 1.2 minutes to read 10 million rows.
sql = f"SELECT {col_id} FROM {table_id}"
raw_conn = engine.raw_connection()
try:
cursor = raw_conn.cursor()
cursor.execute(sql)
total_rows = 0
while True:
rows = cursor.fetchmany(chunk_size)
if not rows:
break
# Direct string conversion - fastest approach
values.extend(str(row[0]) for row in rows)
Approach 2: pandas read_sql with chunks
On average, this takes around 2 minutes to read 10 million rows.
sql = f"SELECT {col_id} FROM {table_id} WHERE {col_id} IS NOT NULL"
values: List[str] = []
for chunk in pd.read_sql(sql, engine, chunksize=CHUNK_SIZE):
# .astype(str) keeps nulls out (already filtered in SQL)
values.extend(chunk.iloc[:, 0].astype(str).tolist())
What is the most efficient way to read this many rows from the table into Python?
Are these timings (~1.2–2 minutes for 10 million rows) reasonable, or can they be significantly improved with a different pattern (e.g., driver settings, batching strategy, multiprocessing, or a different library)?
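For reference, a third pattern I’ve been considering but have not benchmarked yet is streaming the result server-side with SQLAlchemy’s stream_results / yield_per execution options, so the driver fetches rows in batches instead of buffering everything client-side. This is only a rough sketch: col_id and table_id are the same placeholders as in my snippets above, the DSN is made up, and the batch size of 50,000 is a guess.

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@host/db")  # placeholder DSN

values = []
with engine.connect() as conn:
    # stream_results avoids buffering the full result set client-side;
    # yield_per controls how many rows are fetched per round trip.
    result = conn.execution_options(stream_results=True, yield_per=50_000).execute(
        text(f"SELECT {col_id} FROM {table_id}")
    )
    for partition in result.partitions():  # yields lists of up to yield_per rows
        values.extend(str(row[0]) for row in partition)

Would something like this (or another library entirely) be expected to beat the fetchmany loop above, or is ~1 minute simply what moving 10 million values over the wire costs?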