This week I found myself thinking about media “streams” (versus “files”) and a possible parallel to pipelines and their “promiscuity”. A stream is essentially a stream by virtue of functioning without “access to the whole” of the content; for instance, audio can start playing before the entire track has been loaded. This has several implications, not least of which is enabling the transmission of live events (where no “end” necessarily exists yet).

In some coding work, I was adjusting a script that spidered an HTML site (you give it a starting URL and it checks that page for links, then follows those links to other pages and so on). The initial script output the resulting links as CSV; when I ran it, the rows of output appeared “in realtime” as the links were being followed, so I could “follow along” from the commandline. In one instance I noticed that my script seemed to incorrectly “crawl up the directory hierarchy” in a way I wasn’t expecting, and may even have been stuck in a loop, revisiting the same links. A simple Ctrl-C allowed me to stop the process.
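
To make the contrast concrete, a minimal sketch of such a streaming spider could look like the following; the file name, the CSV columns and the standard-library-only approach are my own illustrative assumptions, not the actual proof of concept code. The important detail is that each row is written (and flushed) as soon as a link is found:

    # spider.py (hypothetical): crawl from a start URL and stream links as CSV rows
    import csv
    import sys
    import urllib.parse
    import urllib.request
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        """Collect the href attributes of <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def spider(start_url):
        writer = csv.writer(sys.stdout)
        seen = set()
        queue = [start_url]
        while queue:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except Exception:
                continue
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                absolute = urllib.parse.urljoin(url, href)
                # emit the row immediately so any downstream process can start work
                writer.writerow([url, absolute])
                sys.stdout.flush()
                # only follow links "below" the starting URL (a crude guard against
                # crawling up the directory hierarchy or wandering off-site)
                if absolute.startswith(start_url) and absolute not in seen:
                    queue.append(absolute)

    if __name__ == "__main__":
        spider(sys.argv[1])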

The script (part of the Spider+Sniff+Scrape Timeline proof of concept code) was in fact designed to feed into another script, a “sniffer” based on ffmpeg, which reads each row, takes the URL and runs another program on it to collect and add in additional information: in this case the duration and size of a video.
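
As a hedged sketch (not the project’s actual code) of what such a “sniffer” stage could look like: read CSV rows from standard input as they arrive, ask ffprobe (part of the ffmpeg suite) for each URL’s format metadata, and write the row back out with extra columns appended. The column layout, and the use of ffprobe rather than ffmpeg itself, are my assumptions:

    # sniff.py (hypothetical): enrich streamed CSV rows with duration and size
    import csv
    import json
    import subprocess
    import sys

    def probe(url):
        """Ask ffprobe for format metadata; return (duration, size) or (None, None)."""
        cmd = ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", url]
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
            fmt = json.loads(result.stdout).get("format", {})
            return fmt.get("duration"), fmt.get("size")
        except (subprocess.CalledProcessError, json.JSONDecodeError, OSError):
            return None, None

    reader = csv.reader(sys.stdin)
    writer = csv.writer(sys.stdout)
    for row in reader:
        url = row[-1]                      # assume the link is the last column
        duration, size = probe(url)
        writer.writerow(row + [duration, size])
        sys.stdout.flush()                 # keep the pipeline streaming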

In any case, I was busy changing the script to output RDF instead of CSV rows. In my first implementation, however, I made a change that meant the output no longer occurred while the process was running; in other words, it no longer streamed. (Using rdflib, I first created an empty graph object, then, as the spider ran, triples were silently added to the graph, and finally, when the spider was finished, I called the graph.serialize() function to actually produce the output.) The trouble with this is that if the script started crawling paths I hadn’t expected, I wouldn’t see the error immediately, or might never see it at all if the script got stuck in a loop. This could be mitigated by adding some sort of “trace” output (say, printing messages to stderr), but the shift is still significant: the script’s final output is either everything or nothing depending on whether it can finish, and it only delivers its result at the end of however long that takes (so no other process can begin in the meantime).
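
The non-streaming version follows roughly this pattern; this is a minimal sketch in which the choice of predicate and the hard-coded example data are placeholders of mine rather than the actual spider output:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import RDFS

    def spider_to_rdf(pages):
        """pages: an iterable of (url, label) pairs, e.g. as produced by the spider."""
        graph = Graph()                          # 1. start with an empty graph
        for url, label in pages:
            # 2. triples are silently accumulated in memory; nothing is printed yet
            graph.add((URIRef(url), RDFS.label, Literal(label)))
        # 3. output only appears once the whole crawl has finished
        return graph.serialize(format="turtle")  # a string in recent rdflib versions

    if __name__ == "__main__":
        pages = [("http://example.org/a", "Page A"), ("http://example.org/b", "Page B")]
        print(spider_to_rdf(pages))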

“In the end” the change doesn’t exactly matter (given enough time, and assuming the script finishes, the result will be the same); but in fact the choice of implementation has quite some potential impact on how the script can be used in practice, and how tolerant (if at all) it can be of error. In the “streaming” case, even a faulty script produces partial output that can potentially still be salvaged and used to continue some pipeline of work, whereas the all-or-nothing-ness of the non-streaming approach imposes a similar strictness on any adjoining pipeline elements, making any error lead to complete failure. When lines are properly streamed, as in my initial pipeline, the “spider” and “sniff” scripts can in principle run in parallel, with early spider results immediately passed on to the sniffer for checking while the spider continues its operation.
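
That parallelism comes almost for free once both stages stream: something like python spider.py URL | python sniff.py on the commandline already runs the two processes concurrently, and the same wiring can be expressed in Python (using the hypothetical script names from the sketches above):

    # run the (hypothetical) spider and sniffer concurrently, connected by a pipe;
    # the sniffer sees each row as soon as the spider writes and flushes it
    import subprocess
    import sys

    spider = subprocess.Popen(
        [sys.executable, "spider.py", "http://example.org/"],
        stdout=subprocess.PIPE,
    )
    sniff = subprocess.Popen(
        [sys.executable, "sniff.py"],
        stdin=spider.stdout,
    )
    spider.stdout.close()   # so sniff receives EOF once the spider exits
    sniff.wait()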

(In the end, a further change of implementation could be to directly output the RDF as triples (in so-called “n-triples” form) as they are produced. In fact, in rdflib (and other RDF libraries), a “stream of statements” is sometimes described when working with graph structures.)
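
A sketch of that variant, again with invented example data: each statement is written out in N-Triples form the moment it is produced, using the .n3() rendering that rdflib terms provide, and flushed so a downstream reader sees it immediately:

    import sys
    from rdflib import Literal, URIRef
    from rdflib.namespace import RDFS

    def emit(subject, predicate, obj):
        """Print a single N-Triples statement and flush it straight away."""
        print(f"{subject.n3()} {predicate.n3()} {obj.n3()} .")
        sys.stdout.flush()

    # placeholder example; in the spider this would happen inside the crawl loop
    emit(URIRef("http://example.org/a"), RDFS.label, Literal("Page A"))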

… to be continued …