Python Web Scraping: Douban Top 250 Movies

I originally started learning web scraping to grab information I wanted, and also to build up some data-handling experience for studying machine learning later. Enough talk; on to the code.

Code

# __author__ = 'c1rew'
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup
import re
import time
import pymysql.cursors

header = {'User-Agent':
          'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'}

urls = ['https://movie.douban.com/top250?start={}&filter='.format(str(i)) for i in range(0, 250, 25)]

movies = []

def parse_one_movie(cursor, infos):
    url = infos.find("a", href=re.compile(r"https://movie\.douban\.com/subject/[0-9]+/"))['href']
    titles = infos.find_all("span", class_="title")
    title = ""
    if isinstance(titles, list):
        for item in titles:
            title = title + item.get_text()
    else:
        title = titles.get_text()
    other_title = infos.find("span", class_="other").string[3:]  # strip the leading " / "
    rating_num = infos.find("span", class_="rating_num").string
    movie_info = infos.find("div", class_="bd").p.get_text().strip()
    movie_infos = movie_info.split('\n')
    director_actor = movie_infos[0].strip()
    other_infos = movie_infos[1].strip().split('/')
    if len(other_infos) == 3:
        year = other_infos[0].strip()
        region = other_infos[1].strip()
        movie_type = other_infos[2].strip()
    else:  # e.g. No. 82, 大闹天宫, which lists several years
        year = other_infos[0].strip() + "," + other_infos[1].strip() + "," + other_infos[2].strip() + "," + other_infos[3].strip()
        region = other_infos[4].strip()
        movie_type = other_infos[5].strip()
    introduce = infos.find("span", class_="inq")
    # some movies have no one-line introduction, so guard against None here
    if introduce is not None:
        introduce = introduce.string
    else:
        introduce = ""

    movie = [title,
             other_title,
             url,
             rating_num,
             director_actor,
             year,
             region,
             movie_type,
             introduce]
    movies.append(tuple(movie))


def get_url_info(cursor, url):
    url_info = requests.get(url, headers=header)  # headers must be passed as a keyword argument
    content = url_info.content.decode('utf-8')
    soup = BeautifulSoup(content, "html.parser")
    items = soup.find_all("div", class_="item")

    for item in items:
        parse_one_movie(cursor, item)
    time.sleep(1)  # wait 1 second between page requests to avoid getting banned

sql_create = "CREATE TABLE IF NOT EXISTS douban_top_movie " \
             "( ID INT NOT NULL AUTO_INCREMENT, " \
             " PRIMARY KEY(ID), " \
             " title VARCHAR(128), " \
             " other_title VARCHAR(128), " \
             " url VARCHAR(512), " \
             " rating_num VARCHAR(64), " \
             " director_actor VARCHAR(1024), " \
             " year VARCHAR(64), " \
             " region VARCHAR(128), " \
             " movie_type VARCHAR(128), " \
             " introduce VARCHAR(1024)" \
             ");"

# connect to the local MySQL database; mysql.server must already be running
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='',
                             db='douban_infos_db',
                             charset='utf8mb4')

try:
    with connection.cursor() as cursor:
        cursor.execute(sql_create)

        # take one page at a time from the list; each page yields 25 movies
        for url in urls:
            print(url)
            get_url_info(cursor, url)

        sqlcmd = '''insert into douban_top_movie (
                    title,
                    other_title,
                    url,
                    rating_num,
                    director_actor,
                    year,
                    region,
                    movie_type,
                    introduce
                    ) values(%s, %s, %s, %s, %s, %s, %s, %s, %s)'''

        cursor.executemany(sqlcmd, movies)
    connection.commit()
finally:
    connection.close()
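
After the script finishes, a quick sanity check confirms that all 250 rows landed in the table. A minimal sketch, reusing the connection settings from the script above:

import pymysql

connection = pymysql.connect(host='localhost', user='root', password='',
                             db='douban_infos_db', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM douban_top_movie")
        print(cursor.fetchone()[0])  # should print 250 after a full run
finally:
    connection.close()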

Runtime environment:

Python 3.5.2 |Anaconda 4.2.0 (x86_64)

Mac mini

macOS Sierra

Version 10.12.2

If your local environment differs, you may run into some errors; adjust accordingly when they come up. If you don't want to store the data in MySQL, Redis or a CSV/Excel spreadsheet works just as well.
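
For example, a minimal sketch of the CSV route, assuming the movies list has already been filled by the scraping code above (the filename is just an example):

import csv

# write the scraped tuples to a CSV file; the column order matches the INSERT above
with open('douban_top_movie.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'other_title', 'url', 'rating_num',
                     'director_actor', 'year', 'region', 'movie_type', 'introduce'])
    writer.writerows(movies)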

MySQL Notes

While exporting with a SQL statement from the command line, MySQL threw an error:

The MySQL server is running with the --secure-file-priv option so it cannot execute this statement

After some googling, it turned out to be a MySQL configuration issue: /etc/my.cnf needs to be modified.
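
You can verify the option's current value in the MySQL shell first:

SHOW VARIABLES LIKE 'secure_file_priv';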

If that file doesn't exist there, copy the default configuration from MySQL's install directory; on my Mac it lives at:

/usr/local/Cellar/mysql/5.7.17/support-files/my-default.cnf

After copying it over, add secure_file_priv="" at the end of my.cnf.
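
In other words, the tail of my.cnf ends up looking like the snippet below; the option belongs under the [mysqld] section, and mysql.server needs a restart for the change to take effect:

[mysqld]
secure_file_priv=""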

Now the previously stored data can be exported to a CSV file with a SQL statement, so it can later be uploaded to BDP for some simple statistical analysis:

SELECT * FROM douban_top_movie   
INTO OUTFILE '~/work/topmovie.csv'
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';
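
If you'd rather not change the server configuration at all, a client-side export avoids --secure-file-priv entirely. A minimal sketch using pymysql and the csv module (the output filename is just an example):

import csv
import pymysql

connection = pymysql.connect(host='localhost', user='root', password='',
                             db='douban_infos_db', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM douban_top_movie")
        rows = cursor.fetchall()
    # rows are fetched to the client, so --secure-file-priv never applies
    with open('topmovie.csv', 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(rows)
finally:
    connection.close()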

Data Analysis

I only picked a few metrics that display well. I didn't tokenize the genre field; had I done so, a pie chart of genres would have shown the data better (see the sketch below).
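
The tokenizing itself is a short job with collections.Counter. A minimal sketch, assuming movies is the list built by the scraper above and that Douban separates genres with spaces (movie_type is the 8th element of each tuple):

from collections import Counter

# e.g. a movie_type string like "剧情 犯罪" splits into two genres
genre_counts = Counter()
for movie in movies:
    genre_counts.update(movie[7].split())
print(genre_counts.most_common(10))  # top 10 genres, ready for a pie chart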


Reference

While writing this code I consulted some similar material; it also serves as a record of my practice while taking a Python web-scraping course.

I'd recommend the courses at 七月在线; there are plenty of discounts during sales: 七月在线 (PS: I have no stake in it!)

Some reference links:

Python web scraping: Douban Music Top 250 (python爬虫之豆瓣音乐top250)

Learning XPath for Python scraping (Python爬虫之Xpath学习)

Python MySQL database operations (Python MySQL数据库操作)