Pandas + SQLAlchemy
Contents
14. Pandas + SQLAlchemy¶
這邊先講結論
連線
一率用 sqlalchemy 的
create_engine
來定義連線 engine,各種資料庫都是 (mysql, mssql, …)。例如:engine = create_engine("mysql+pymysql://root:my-secret-pw@localhost:3307/mydev")
用 conn 來建立連線,例如
conn = engine.connect()
可以直接用 raw sql 的語句,但記得在前面加上 sqlalchemy.sql 裡的 text(),才不會讓 sql 的特殊符號被影響到。例如
sql_string = "select * from table_A where col_1 like 'C%';"
,那%
在 sql alchemy 中會被當作特殊字元,且沒有逃脫符號可以幫你忙,所以正確做法是這樣:sql_string2 = text(sql_string)
,再拿 sql_string2 去做事。raw sql 的寫法,就直接用在 CRUD 中,例如:
“改”, “刪”,例如
conn.execute("update table_A set col_1 = 17;")
“讀”,不會用
conn.execute(sql_string2)
的作法,因為他回傳的不是我想要的 pandas dataframe。會用pd.read_sql(sql_string2, engine)
這種方式來做。得到的就會是 pd.DataFrame。但不變的是,我還是用 raw sql 語句,只是丟給不同 function 來做而已。CRUD 裡面的 “增”,你可以用懶人法
pd.DataFrame.to_sql()
直接傳上去,缺點就是沒有 schema,所以他自動幫你生的 schema 就是文字和數字而已,沒有 datetime。如果要更細緻的做,就直接用 raw sql 語句,做法就是和剛剛一樣:conn.execute(sql_string)
14.1. packages¶
from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pymysql
import pandas as pd
14.2. 連線¶
engine = create_engine("mysql+pymysql://root:my-secret-pw@localhost:3307/mydev")
conn = engine.connect()
14.3. [增]¶
14.3.1. 用 pd.DataFrame.to_sql()
¶
這是最簡單的方式,也是之前在 R 的時候使用的方式
簡單來說,就是不管 sql table schema,全都用 文字, 數字 的形式來存。犧牲 datetime 的格式。
有關時間的操作,都是從 sql 拉到 local 變成 dataframe 後,再自己轉成時間格式來處理
之後可以再查查看,應該有機會也可以定義 時間, ENUM, SET 等格式。
user2 = pd.DataFrame({
"username": ["hank","mike","pink","shark"],
"height": [185, 172, 153, 198],
"blood_type":["O", "A", "B", "O"]
})
user2
username | height | blood_type | |
---|---|---|---|
0 | hank | 185 | O |
1 | mike | 172 | A |
2 | pink | 153 | B |
3 | shark | 198 | O |
user2.to_sql(
name = "user2", # table name
con = engine,
if_exists = "replace", # 'fail', 'replace' 就是 overwrite, 'append'
index = False
)
4
pd.read_sql("select * from user2", engine)
username | height | blood_type | |
---|---|---|---|
0 | hank | 185 | O |
1 | mike | 172 | A |
2 | pink | 153 | B |
3 | shark | 198 | O |
new = pd.DataFrame({
"username": ["pinpin","mimi"],
"height": [160, 170],
"blood_type": ["B", "B"]
})
new.to_sql("user2", engine, if_exists="append", index = False)
2
pd.read_sql("select * from user2", engine)
username | height | blood_type | |
---|---|---|---|
0 | hank | 185 | O |
1 | mike | 172 | A |
2 | pink | 153 | B |
3 | shark | 198 | O |
4 | pinpin | 160 | B |
5 | mimi | 170 | B |
14.3.2. 用 sql 語句來做¶
這種作法對熟悉 sql 的人來說蠻直覺的,因為就完全使用 sql 的語句來做
sql_string = text('''
CREATE TABLE IF NOT EXISTS employee(
id INT NOT NULL AUTO_INCREMENT,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
title VARCHAR(100) DEFAULT NULL,
salary DOUBLE DEFAULT NULL,
hire_date DATE NOT NULL,
notes TEXT,
PRIMARY KEY (id)
);
''')
conn.execute(sql_string)
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x11adf54c0>
pd.read_sql("select * from employee", engine)
id | first_name | last_name | title | salary | hire_date | notes |
---|
sql_string = text('''
INSERT INTO employee (first_name, last_name, title, salary, hire_date) VALUES
('Robin', 'Jackman', 'Software Engineer', 5500, '2001-10-12'),
('Taylor', 'Edward', 'Software Architect', 7200, '2002-09-21'),
('Vivian', 'Dickens', 'Database Administrator', 6000, '2012-08-29'),
('Harry', 'Clifford', 'Database Administrator', 6800, '2015-12-10'),
('Eliza', 'Clifford', 'Software Engineer', 4750, '1998-10-19'),
('Nancy', 'Newman', 'Software Engineer', 5100, '2007-01-23'),
('Melinda', 'Clifford', 'Project Manager', 8500, '2013-10-29'),
('Jack', 'Chan', 'Test Engineer', 6500, '2018-09-07'),
('Harley', 'Gilbert', 'Software Architect', 8000, '2000-07-17');
''')
conn.execute(sql_string)
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x107a25fa0>
pd.read_sql("select * from employee", engine)
id | first_name | last_name | title | salary | hire_date | notes | |
---|---|---|---|---|---|---|---|
0 | 1 | Robin | Jackman | Software Engineer | 5500.0 | 2001-10-12 | None |
1 | 2 | Taylor | Edward | Software Architect | 7200.0 | 2002-09-21 | None |
2 | 3 | Vivian | Dickens | Database Administrator | 6000.0 | 2012-08-29 | None |
3 | 4 | Harry | Clifford | Database Administrator | 6800.0 | 2015-12-10 | None |
4 | 5 | Eliza | Clifford | Software Engineer | 4750.0 | 1998-10-19 | None |
5 | 6 | Nancy | Newman | Software Engineer | 5100.0 | 2007-01-23 | None |
6 | 7 | Melinda | Clifford | Project Manager | 8500.0 | 2013-10-29 | None |
7 | 8 | Jack | Chan | Test Engineer | 6500.0 | 2018-09-07 | None |
8 | 9 | Harley | Gilbert | Software Architect | 8000.0 | 2000-07-17 | None |
這邊補充一些 sql 語句
# 寫明 欄位名稱 和 值,如果該表有其他欄位沒被提到,就用 NULL 或 default 帶入
sql_string = text('''
INSERT INTO person(name, phone, age) VALUES ("ABC", "1234", 20);
''')
sql_string = text('''
INSERT INTO person(name, phone, age) VALUES
("ABC", "1234", 20),
("DEF", "5678", 21);
''')
# 沒寫明欄位名稱,那就是默認要照該表所有欄位順序來做
sql_string = text('''
INSERT INTO person VALUES ("ABC", "1234", 20);
''')
sql_string = text('''
INSERT INTO person VALUES
("ABC", "1234", 20),
("DEF", "5678", 21);
''')
14.4. [查] Read¶
14.4.1. 用 pd.read_sql()
查整張表¶
pd.read_sql("employee", engine)
id | first_name | last_name | title | salary | hire_date | notes | |
---|---|---|---|---|---|---|---|
0 | 1 | Robin | Jackman | Software Engineer | 5500.0 | 2001-10-12 | None |
1 | 2 | Taylor | Edward | Software Architect | 7200.0 | 2002-09-21 | None |
2 | 3 | Vivian | Dickens | Database Administrator | 6000.0 | 2012-08-29 | None |
3 | 4 | Harry | Clifford | Database Administrator | 6800.0 | 2015-12-10 | None |
4 | 5 | Eliza | Clifford | Software Engineer | 4750.0 | 1998-10-19 | None |
5 | 6 | Nancy | Newman | Software Engineer | 5100.0 | 2007-01-23 | None |
6 | 7 | Melinda | Clifford | Project Manager | 8500.0 | 2013-10-29 | None |
7 | 8 | Jack | Chan | Test Engineer | 6500.0 | 2018-09-07 | None |
8 | 9 | Harley | Gilbert | Software Architect | 8000.0 | 2000-07-17 | None |
14.4.2. 用 sql 語句 (推薦)¶
一樣,繼續用
pd.read_sql()
就好,你丟 sql 字串進去就可以了
sql_string = text('''
SELECT *
FROM employee
WHERE salary > 5000 and last_name = "Clifford"
''')
pd.read_sql(sql_string, engine)
id | first_name | last_name | title | salary | hire_date | notes | |
---|---|---|---|---|---|---|---|
0 | 4 | Harry | Clifford | Database Administrator | 6800.0 | 2015-12-10 | None |
1 | 7 | Melinda | Clifford | Project Manager | 8500.0 | 2013-10-29 | None |
既然可以下 sql 指令,那各種 order by, filter, group_by, aggregate 的整令,當然就都可以下了
14.4.2.1. order by¶
sql_string = text('''
select * from employee
order by salary;
''')
pd.read_sql(sql_string, engine)
id | first_name | last_name | title | salary | hire_date | notes | |
---|---|---|---|---|---|---|---|
0 | 5 | Eliza | Clifford | Software Engineer | 4750.0 | 1998-10-19 | None |
1 | 6 | Nancy | Newman | Software Engineer | 5100.0 | 2007-01-23 | None |
2 | 1 | Robin | Jackman | Software Engineer | 5500.0 | 2001-10-12 | None |
3 | 3 | Vivian | Dickens | Database Administrator | 6000.0 | 2012-08-29 | None |
4 | 8 | Jack | Chan | Test Engineer | 6500.0 | 2018-09-07 | None |
5 | 4 | Harry | Clifford | Database Administrator | 6800.0 | 2015-12-10 | None |
6 | 2 | Taylor | Edward | Software Architect | 7200.0 | 2002-09-21 | None |
7 | 9 | Harley | Gilbert | Software Architect | 8000.0 | 2000-07-17 | None |
8 | 7 | Melinda | Clifford | Project Manager | 8500.0 | 2013-10-29 | None |
sql_string = text('''
select * from employee
order by salary desc;
''')
pd.read_sql(sql_string, engine)
id | first_name | last_name | title | salary | hire_date | notes | |
---|---|---|---|---|---|---|---|
0 | 7 | Melinda | Clifford | Project Manager | 8500.0 | 2013-10-29 | None |
1 | 9 | Harley | Gilbert | Software Architect | 8000.0 | 2000-07-17 | None |
2 | 2 | Taylor | Edward | Software Architect | 7200.0 | 2002-09-21 | None |
3 | 4 | Harry | Clifford | Database Administrator | 6800.0 | 2015-12-10 | None |
4 | 8 | Jack | Chan | Test Engineer | 6500.0 | 2018-09-07 | None |
5 | 3 | Vivian | Dickens | Database Administrator | 6000.0 | 2012-08-29 | None |
6 | 1 | Robin | Jackman | Software Engineer | 5500.0 | 2001-10-12 | None |
7 | 6 | Nancy | Newman | Software Engineer | 5100.0 | 2007-01-23 | None |
8 | 5 | Eliza | Clifford | Software Engineer | 4750.0 | 1998-10-19 | None |
sql_string = text('''
select * from employee
order by salary desc, first_name;
''')
pd.read_sql(sql_string, engine)
id | first_name | last_name | title | salary | hire_date | notes | |
---|---|---|---|---|---|---|---|
0 | 7 | Melinda | Clifford | Project Manager | 8500.0 | 2013-10-29 | None |
1 | 9 | Harley | Gilbert | Software Architect | 8000.0 | 2000-07-17 | None |
2 | 2 | Taylor | Edward | Software Architect | 7200.0 | 2002-09-21 | None |
3 | 4 | Harry | Clifford | Database Administrator | 6800.0 | 2015-12-10 | None |
4 | 8 | Jack | Chan | Test Engineer | 6500.0 | 2018-09-07 | None |
5 | 3 | Vivian | Dickens | Database Administrator | 6000.0 | 2012-08-29 | None |
6 | 1 | Robin | Jackman | Software Engineer | 5500.0 | 2001-10-12 | None |
7 | 6 | Nancy | Newman | Software Engineer | 5100.0 | 2007-01-23 | None |
8 | 5 | Eliza | Clifford | Software Engineer | 4750.0 | 1998-10-19 | None |
14.4.2.2. limit¶
pd.read_sql(text("select * from employee limit 3;"), engine)
id | first_name | last_name | title | salary | hire_date | notes | |
---|---|---|---|---|---|---|---|
0 | 1 | Robin | Jackman | Software Engineer | 5500.0 | 2001-10-12 | None |
1 | 2 | Taylor | Edward | Software Architect | 7200.0 | 2002-09-21 | None |
2 | 3 | Vivian | Dickens | Database Administrator | 6000.0 | 2012-08-29 | None |
14.4.2.3. like¶
就是 R 的
str_detect()
# 常用語法
sql_string1 = text("select * from employee where last_name like 'C%';") # C開頭的,%表隨意字元不限字數
sql_string2 = text("select * from employee where last_name like '%i%';") # 中間出現過 i 的
sql_string3 = text("select * from employee where last_name like '%an';") # an 結尾的
sql_string4 = text("select * from employee where last_name like '__an';") # 知道長度是4,結尾是 an
sql_string5 = text("select * from employee where last_name like 'J______';")
pd.read_sql(sql_string1, engine)
id | first_name | last_name | title | salary | hire_date | notes | |
---|---|---|---|---|---|---|---|
0 | 4 | Harry | Clifford | Database Administrator | 6800.0 | 2015-12-10 | None |
1 | 5 | Eliza | Clifford | Software Engineer | 4750.0 | 1998-10-19 | None |
2 | 7 | Melinda | Clifford | Project Manager | 8500.0 | 2013-10-29 | None |
3 | 8 | Jack | Chan | Test Engineer | 6500.0 | 2018-09-07 | None |
14.4.2.4. count(*)¶
# 整張 table 有幾列
sql_string = text('''
select count(*) from employee;
''')
pd.read_sql(sql_string, engine)
count(*) | |
---|---|
0 | 9 |
# title 為 Software Engineer 的table,列數有幾列
sql_string = text('''
select count(*) from employee where title="Software Engineer";
''')
pd.read_sql(sql_string, engine)
count(*) | |
---|---|
0 | 3 |
# 各種 title 各有幾列
sql_string = text('''
select title, count(*) from employee
group by title;
''')
pd.read_sql(sql_string, engine)
title | count(*) | |
---|---|---|
0 | Database Administrator | 2 |
1 | Project Manager | 1 |
2 | Software Architect | 2 |
3 | Software Engineer | 3 |
4 | Test Engineer | 1 |
14.4.2.5. distinct¶
# 有哪些 title?
sql_string = text('''
select distinct(title) from employee;
''')
pd.read_sql(sql_string, engine)
title | |
---|---|
0 | Software Engineer |
1 | Software Architect |
2 | Database Administrator |
3 | Project Manager |
4 | Test Engineer |
# 有多少種 title
sql_string = text('''
select count(distinct(title)) from employee;
''')
pd.read_sql(sql_string, engine)
count(distinct(title)) | |
---|---|
0 | 5 |
14.4.2.6. min, max, sum, avg¶
# 平均薪水, 最低薪水, 最高薪水, 薪水總合 各是多少
sql_string = text('''
select min(salary), max(salary), sum(salary), avg(salary) from employee;
''')
pd.read_sql(sql_string, engine)
min(salary) | max(salary) | sum(salary) | avg(salary) | |
---|---|---|---|---|
0 | 4750.0 | 8500.0 | 58350.0 | 6483.333333 |
# 換個名稱比較好讀
sql_string = text('''
select min(salary) as min_salary,
max(salary) as max_salary,
sum(salary) as sum_salary,
avg(salary) as avg_salary
from employee;
''')
pd.read_sql(sql_string, engine)
min_salary | max_salary | sum_salary | avg_salary | |
---|---|---|---|---|
0 | 4750.0 | 8500.0 | 58350.0 | 6483.333333 |
# 各種 title 的薪水比較
sql_string = text('''
select title,
count(*) as n,
min(salary) as min_salary,
max(salary) as max_salary,
sum(salary) as sum_salary,
avg(salary) as avg_salary
from employee
group by title
order by avg_salary desc;
''')
pd.read_sql(sql_string, engine)
title | n | min_salary | max_salary | sum_salary | avg_salary | |
---|---|---|---|---|---|---|
0 | Project Manager | 1 | 8500.0 | 8500.0 | 8500.0 | 8500.000000 |
1 | Software Architect | 2 | 7200.0 | 8000.0 | 15200.0 | 7600.000000 |
2 | Test Engineer | 1 | 6500.0 | 6500.0 | 6500.0 | 6500.000000 |
3 | Database Administrator | 2 | 6000.0 | 6800.0 | 12800.0 | 6400.000000 |
4 | Software Engineer | 3 | 4750.0 | 5500.0 | 15350.0 | 5116.666667 |
14.4.2.7. having¶
這邊只有一個觀念:想在 group_by+summarise 後,想對結果的資料表做篩選,那要用 having,不能用 where
所以,having就是 group by 後,拿來做篩選用的 where。
結論:
要這樣寫:
select title, count(*) as n from employee group by title having n>1;
不能這樣寫:
select title, count(*) as n from employee group by title where n>1;
sql_string = text('''
select title, count(*) as n from employee group by title having n>1;
''')
pd.read_sql(sql_string, engine)
title | n | |
---|---|---|
0 | Database Administrator | 2 |
1 | Software Architect | 2 |
2 | Software Engineer | 3 |
14.4.2.8. logic operators¶
14.5. [改] Update¶
因為 pandas 只有做 讀 和 寫 的 wrapper,分別是
pd.DataFrame.read_sql()
和pd.DataFrame.to_sql()
對於 update 和 delete,之前在 R 裡面是用
dbSendUpdate(sql)
,就是直接執行 sql 指令在 python,就要先建立連線
conn = engine.connect()
,建立後,就可以 execute 原生的 sql 語句conn.execute(sql_string)
sql_string = text('''
update user2
set username = 'hank2'
where height = 185
''')
conn.execute(sql_string)
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x11ae9bd90>
pd.read_sql("select * from user2", engine)
username | height | blood_type | |
---|---|---|---|
0 | hank2 | 185 | O |
1 | mike | 172 | A |
2 | pink | 153 | B |
3 | shark | 198 | O |
4 | pinpin | 160 | B |
5 | mimi | 170 | B |
14.6. [刪] Delete¶
刪除 table 中的所有列 (但表和 schema 還在)
sql_string = text('''
delete from user2
''')
# 刪除所有列
# delete from user2
# 刪除某些列
# delete from user2 where xxx
conn.execute(sql_string)
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x11ae70550>
pd.read_sql("select * from user2", engine)
username | height | blood_type |
---|
刪除整張表
sql_string = text('''
drop table user2
''')
conn.execute(sql_string)
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x11af145e0>
# pd.read_sql("select * from user2", engine)