ablog

Notes from a clumsy, restless engineer

Scraping the Hatena Fotolife RSS feed with Python and downloading the images

Python

A Python script that scrapes the Hatena Fotolife RSS feed, extracts the image URIs, and downloads the images.

RSS (yohei-a's fotolife)

...

<item rdf:about="http://f.hatena.ne.jp/yohei-a/20161211091424">

...

<dc:date>2016-12-11T09:14:24+09:00</dc:date>
<hatena:imageurl>
http://cdn-ak.f.st-hatena.com/images/fotolife/y/yohei-a/20161211/20161211091424.png ★ this is the URI to extract
</hatena:imageurl>
<hatena:imageurlsmall>
http://cdn-ak.f.st-hatena.com/images/fotolife/y/yohei-a/20161211/20161211091424_m.jpg
</hatena:imageurlsmall>
<hatena:imageurlmedium>
http://cdn-ak.f.st-hatena.com/images/fotolife/y/yohei-a/20161211/20161211091424_120.jpg
</hatena:imageurlmedium>
<hatena:syntax>f:id:yohei-a:20161211091424p:image</hatena:syntax>
<hatena:colors>
<hatena:color>white</hatena:color>
<hatena:color>blue</hatena:color>
</hatena:colors>
</item>
<item rdf:about="http://f.hatena.ne.jp/yohei-a/20161211091423">

...

Python script (dl_fotolife.py)

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://f.hatena.ne.jp/yohei-a/rss')
soup = BeautifulSoup(html, "html.parser")

for item in soup.find_all("hatena:imageurl"): # loop over each <hatena:imageurl> element
	img_uri = item.contents[0].strip() # take the URI inside the tag, dropping any surrounding whitespace or newlines
	img_filename = os.path.basename(img_uri)
	r = urllib2.urlopen(img_uri)
	f = open(img_filename, "wb")
	f.write(r.read())
	r.close()
	f.close()
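
The script above targets Python 2 (urllib2). For reference, here is a rough Python 3 sketch of the same idea using only the standard library. It is an assumption-laden variant, not the script I actually ran: the function names extract_image_urls and download_all are made up for illustration, and it matches elements by their local name "imageurl" rather than relying on the exact hatena namespace URI.

```python
import os
import urllib.request
import xml.etree.ElementTree as ET


def extract_image_urls(rss_text):
    """Return the text of every <hatena:imageurl> element, whitespace stripped.

    Hypothetical helper: matches on the local tag name only, so it does not
    need to know the namespace URI bound to the "hatena" prefix.
    """
    root = ET.fromstring(rss_text)
    urls = []
    for elem in root.iter():
        # ElementTree expands prefixed names to "{namespace-uri}localname"
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "imageurl" and elem.text:
            urls.append(elem.text.strip())
    return urls


def download_all(feed_url, dest_dir="."):
    """Fetch the feed and save every full-size image into dest_dir (hypothetical helper)."""
    with urllib.request.urlopen(feed_url) as resp:
        rss_text = resp.read()
    for url in extract_image_urls(rss_text):
        path = os.path.join(dest_dir, os.path.basename(url))
        with urllib.request.urlopen(url) as r, open(path, "wb") as f:
            f.write(r.read())
```

Calling download_all('http://f.hatena.ne.jp/yohei-a/rss') should then behave roughly like dl_fotolife.py, assuming the feed is well-formed XML. The thumbnail elements (imageurlsmall, imageurlmedium) are skipped because their local names differ.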

Run it

$ python dl_fotolife.py

Result

$ ls -lt|head
total 7264
-rw-r--r--+ 1 yazekats None 226274 Dec 31 13:12 20160515181139.png
-rw-r--r--+ 1 yazekats None  26866 Dec 31 13:12 20160529032112.jpg
-rw-r--r--+ 1 yazekats None  18160 Dec 31 13:12 20160529033413.jpg
-rw-r--r--+ 1 yazekats None  21304 Dec 31 13:12 20160529033826.jpg
-rw-r--r--+ 1 yazekats None  22360 Dec 31 13:12 20160530093602.jpg
-rw-r--r--+ 1 yazekats None  19574 Dec 31 13:12 20160530093603.jpg
-rw-r--r--+ 1 yazekats None  53302 Dec 31 13:12 20160530094316.jpg
-rw-r--r--+ 1 yazekats None  65751 Dec 31 13:12 20160530184902.jpg
-rw-r--r--+ 1 yazekats None  28564 Dec 31 13:12 20160530184903.jpg

References

Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:
for child in title_tag.children:
    print(child)
# The Dormouse's story
Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation