Proof of Concept: A single timeline to scrub through the entire contents of the Constant online video archive.

Constant’s video archive has a humble but very functional primary interface: a set of folders made available on the web via the Apache server’s default “directory listing” functionality.

video.constant.listingconstantvideotimeline/

Step One: A basic spider to walk through the links, collecting basic HTTP information such as “Content-Type” (the MIME or file type) and “Content-Length” (the file size).

httpspider.py

#!/usr/bin/env python
#-*- coding:utf-8 -*-
import sys, urllib2, html5lib, urlparse, csv
 
out = csv.writer(sys.stdout)
out.writerow(('url','content-type','content-length'))
 
def spider (url):
    visited[url] = True      # "remember" where we've been
    f = urllib2.urlopen(url) # open up a connection to the URL
    if url != f.url:         # when redirected, use the new url
        url = f.url
        visited[url] = True
 
    # use html5lib to parse the HTML into a tree structure
    t = html5lib.parse(f, treebuilder="etree", namespaceHTMLElements=False)
    # loop over all the link (anchor) elements in the page
    for a in t.iter("a"):
        href = a.attrib.get("href")
        if href and not (href.startswith("?") or href == '/'):
            url2 = urlparse.urljoin(url, href)
            f2 = urllib2.urlopen(url2)
            ctype = f2.headers.get('content-type') or ''  # header may be absent; avoid calling startswith on None
            clen = f2.headers.get('content-length')
            f2.close()
 
            if ctype.startswith("video/"):                                # report videos
                out.writerow((url2, ctype, clen))
            elif ctype.startswith("text/html") and url2 not in visited: # recursively process HTML
                spider(url2)
 
visited = {}
url = sys.argv[1]
spider(url)
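
As a side note, the link filter inside the loop above can be tried out without touching the network. The following is a minimal sketch for Python 3 (where urllib2 and urlparse no longer exist), using only the standard library's html.parser in place of html5lib. The names LinkCollector and listing_links, the example.org URL, and the sample listing fragment are all made up for illustration; they are not part of the original script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def listing_links(base_url, html_text):
    """Return absolute URLs for the links in a directory listing page.

    Applies the same filter as the spider: skip links starting with "?"
    (Apache's column-sorting links) and the bare site root "/".
    """
    collector = LinkCollector()
    collector.feed(html_text)
    return [urljoin(base_url, h) for h in collector.hrefs
            if not (h.startswith("?") or h == "/")]


# Hypothetical listing fragment, made up for illustration.
sample = '''
<a href="?C=N;O=D">Name</a>
<a href="/">Parent Directory</a>
<a href="2010/">2010/</a>
<a href="intro.ogv">intro.ogv</a>
'''
links = listing_links("http://example.org/videos/", sample)
```

Only the folder and the video file survive the filter; the sorting link and the root link are dropped before any request would be made.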

Run with the command:

python httpspider.py http://video.constantvzw.org

This script uses Python’s csv module to dump the results as a spreadsheet with three columns (URL, content-type, and content-length):

video.constant.spreadsheet01

This result gets piped into the next part of the pipeline, the ffmpeg “sniffer”, which feeds each row’s URL to ffmpeg and scrapes the results to pull out three things: the video’s duration, its frame size, and any “metadata” added by the encoding software (often the ffmpeg2theora program in this case).

ffmpegsniffer.py

import sys, subprocess, re, csv
from pprint import pprint
 
def extract_metadata (text):
    ret = {}
    for line in text.splitlines():
        if ':' in line:
            (name, value) = line.split(':', 1)
            if not name.endswith("http") and (name.upper() == name):
                ret[name.strip().lower().decode('utf-8')] = value.strip().decode('utf-8')
    return ret
 
timecodepat = re.compile(r"Duration: (\d+):(\d+):(\d+)\.(\d+)")
def extract_duration (text):
    m = timecodepat.search(text)
    if m:
        parts = m.groups()
        return (int(parts[0])*3600) + (int(parts[1])*60) + int(parts[2]) + float("0."+parts[-1])
 
reader = csv.reader(sys.stdin)
headers = reader.next()
headers.extend(('duration','metadata'))
 
writer = csv.writer(sys.stdout)
writer.writerow(headers)
 
for row in reader:
    url = row[0].strip()
    if url:
        # nb ffmpeg output is stderr (not stdout)
        popen = subprocess.Popen(["ffmpeg", "-i", url], stderr=subprocess.PIPE)
        o = popen.communicate()[1]
        duration = extract_duration(o)
        metadata = extract_metadata(o)
        row.append(duration)
 
        # Just glom the metadata together as a string
        md = []
        for k in sorted(metadata):
            md.append(u"{0}: {1}".format(k, metadata[k]))
        row.append(u"\n".join(md).encode("utf-8"))
 
        writer.writerow(row)
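
The listing above records the duration and the metadata. The frame size could probably be scraped from the same ffmpeg output in a similar way. What follows is only a sketch for Python 3: the function name extract_framesize, the pattern, and the sample text are assumptions, and the exact layout of ffmpeg's stream line varies between versions, so the pattern may need adjusting.

```python
import re

# Assumed shape of ffmpeg's video stream line, e.g.
#   Stream #0:0: Video: theora, yuv420p, 320x240 [SAR 1:1 DAR 4:3], 25 fps
# The real text differs between ffmpeg versions.
framesizepat = re.compile(r"Video:.*?\b(\d{2,5})x(\d{2,5})\b")


def extract_framesize(text):
    """Return (width, height) as ints, or None if no video stream line matches."""
    m = framesizepat.search(text)
    if m:
        return int(m.group(1)), int(m.group(2))
    return None


# Made-up sample of ffmpeg's stderr output, for illustration only.
sample = (
    "  Duration: 00:03:25.48, start: 0.000000, bitrate: 512 kb/s\n"
    "    Stream #0:0: Video: theora, yuv420p, 320x240 [SAR 1:1 DAR 4:3], 25 fps\n"
)
size = extract_framesize(sample)
```

The result could then be appended to each row as an extra column, next to the duration.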

So running the command:

python httpspider.py http://video.constantvzw.org | python ffmpegsniffer.py

Results in the original spreadsheet of data, with some metadata added from ffmpeg:
video.constant.spreadsheet02

Last but not least, the resulting spreadsheet is loaded using d3 in javascript, and a virtual timeline is constructed to “scrub through” the full 61 hours of the collection, showing the metadata (as well as providing a link to actually load the video — which may or may not be playable depending on the format/browser you’re using to view it). The full sketch interface is viewable here.

video.constant.timeline
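
The arithmetic behind such a virtual timeline is small: place the videos end to end, remember where each one starts, and look up which video a scrub position falls into. The actual interface does this with d3 in JavaScript; the Python 3 sketch below only illustrates the idea. The function names build_timeline and locate and the durations are made up, and positions are assumed to lie between 0 and the total length.

```python
from bisect import bisect_right
from itertools import accumulate


def build_timeline(durations):
    """Return (start offsets in seconds, total length) for videos laid end to end."""
    ends = list(accumulate(durations))
    if not ends:
        return [], 0.0
    starts = [0.0] + ends[:-1]
    return starts, ends[-1]


def locate(starts, position):
    """Map a timeline position to (video index, seconds into that video).

    Assumes 0 <= position < total length of the timeline.
    """
    i = bisect_right(starts, position) - 1
    return i, position - starts[i]


# Made-up durations in seconds, standing in for the 'duration' column.
durations = [120.0, 30.5, 300.0]
starts, total = build_timeline(durations)
index, offset = locate(starts, 140.0)
```

With these values, position 140.0 falls 20 seconds into the second video (index 1). The binary search keeps the lookup fast even for a long list of videos.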