Scraping with CSS and (pure) Python

Code discovery of the day: cssselect2 by Simon Sapin, a pure-Python implementation of fancy CSS3 selectors. Here, I’m scraping some data using the “unique selector” Firefox gives for a table element embedded in a page (one with no unique id of its own). The library is not exactly well documented, and pip installation seems (for me, temporarily?) broken, so I used the package from GitHub. See you later, lxml!

from urllib.request import urlopen
from xml.etree import ElementTree as ET

import html5lib
from cssselect2 import ElementWrapper

url = "http://libregraphicsmeeting.org/2014/program/index.html"
# The "unique selector" copied from Firefox for one row of the schedule table
selector = "#table_day_2 > tbody:nth-child(2) > tr:nth-child(6)"

# Parse the page into a plain ElementTree (tag names without the XHTML
# namespace), then wrap the tree so cssselect2 can query it
t = html5lib.parse(urlopen(url), namespaceHTMLElements=False)
doc = ElementWrapper.from_html_root(t)

# Plural form...
# for match in doc.query_all(selector):

# Singular
match = doc.query(selector)

# Serialize each matching paragraph inside the row back to markup
b = {}
b['title'] = ET.tostring(match.query("p.schedule_title").etree_element, encoding="unicode")
b['presenter'] = ET.tostring(match.query("p.schedule_presenter").etree_element, encoding="unicode")
b['summary'] = ET.tostring(match.query("p.schedule_summary").etree_element, encoding="unicode")
b['biography'] = ET.tostring(match.query("p.schedule_biography").etree_element, encoding="unicode")
print(b)
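
To unpack what a Firefox-style selector like the one above actually asks for, here is a minimal sketch that resolves the same kind of path by hand, using only the standard library. Everything in it is an assumption for illustration: the HTML is a made-up, much smaller stand-in for the real schedule page, nth_child is a hypothetical helper (not part of cssselect2), and the row index is 3 rather than 6 to keep the sample short.

```python
from xml.etree import ElementTree as ET

# Hypothetical, simplified stand-in for the schedule table (not the real page).
HTML = """
<table id="table_day_2">
  <thead><tr><th>Time</th><th>Talk</th></tr></thead>
  <tbody>
    <tr><td>09:00</td><td><p class="schedule_title">Opening</p></td></tr>
    <tr><td>10:00</td><td><p class="schedule_title">Second talk</p></td></tr>
    <tr><td>11:00</td><td><p class="schedule_title">Third talk</p></td></tr>
  </tbody>
</table>
"""


def nth_child(parent, tag, n):
    """Return parent's n-th child (1-based) if it has the given tag, else None.

    Mirrors "parent > tag:nth-child(n)": the position counts every element
    child of the parent, not only the children named `tag`.
    """
    children = list(parent)
    if 1 <= n <= len(children) and children[n - 1].tag == tag:
        return children[n - 1]
    return None


root = ET.fromstring(HTML.strip())

# "#table_day_2": the element whose id attribute is "table_day_2"
table = next(el for el in root.iter() if el.get("id") == "table_day_2")
# "> tbody:nth-child(2)": a tbody that is the table's second child (after thead)
tbody = nth_child(table, "tbody", 2)
# "> tr:nth-child(3)": a tr that is the tbody's third child
row = nth_child(tbody, "tr", 3)

title = row.find(".//p[@class='schedule_title']").text
print(title)
```

Because every step is positional, a selector like this keeps working only while the page layout stays the same: add or remove a row above the target and the same selector quietly points at a different talk.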

2 Comments

  1. Mmh. Interesting. Thx for the tip.
    Are you working on a poster for the LGM? 😉

  2. Aha, good eyes! Yes something for LGM, but not a poster … all will be revealed soon 😉
