Saturday, January 19, 2008

I promised to explain that little bit of python black-magic I posted a couple of days ago. It may take a couple of posts ...

But to start with, here's some code I just found, that I wrote about 18 months ago, to create a "pipeline" to do various transformations to lines of text. Basically I want to transform wiki markup into html, but I wanted a flexible mechanism which allowed me to add or remove various transformations at runtime so I can rejig the same transformation for slightly different targets. A good example of this occurs in the original SdiDesk, where the production of HTML to be viewed in SdiDesk is *almost* the same as the production of HTML for export to a static site, except for the link tags, who's href property for internal use needs to be "About:Blank" whereas the external site needs the actual URL of the page.

Here's my original code (with a couple of example transformations) ...





class PipeNode :
"""Base class for pipeline nodes.
They must implement
process() and getName()"""

def process(self, item) :
return item

def getName(self) : return "PipeNode"

class Pipeline :

def __init__(self) :
self.pipe = []
self.index = 0
self.item = None

def addNode(self, n) :
self.pipe.append(n)

def load(self, item) :
self.index = 0
self.item = item

def next(self) :
self.item = self.pipe[self.index].process(self.item)
self.index = self.index + 1

def getCurrentItem(self) :
return self.item

def getCurrentIndex(self) :
return self.index

def getCurrentNode(self) :
return self.pipe[self.index]

def eop(self) :
"""End of pipe"""
return self.index == len(self.pipe)

def run(self) :
while (not self.eop() ) :
self.next()

class BlankLines (PipeNode) :

def getName(self) : return "blankline-to-para"

def process(self, item) :
if item == '' :
return "<p/>"
else :
return item

class Bullet (PipeNode) :

def process(self, item) :
if item[0] == '*' :
return '<li>' + item[1:] + '</li>'
else :
return item

class LinewiseStringPipeline (Pipeline) :

"""
Based on the generic pipeline, this cuts a long string into lines
(separated by '\n') and processes them one by one
"""

def __init__(self, sep='\n') :
Pipeline.__init__(self)
self.sep = sep

def runAll(self, s) :
if len(self.pipe) == 0 :
return s
build = ''
lines = s.split(self.sep)
for x in lines :
self.load(x)
self.run()
build = build + self.getCurrentItem() + self.sep
build = build[0:-1]
return build

# test it
pl = LinewiseStringPipeline(r'\n')
pl.addNode( BlankLines() )
pl.addNode( Bullet() )

page = """here

is

* some
* data"""

print pl.runAll(page)




You'll see it's all good object oriented stuff. I define a base-class generic "PipeNode" which doesn't really do anything, but acts as a definition of the interface of a section of the pipe. Its important method is process, which takes an input and delivers some kind of output.

Then there's the pipe object itself. A straight-forward list of nodes which knows how to run ie. pull the items through each of the processing nodes. It keeps track of which section of the pipe the item is in using index.

Then there's the LinewiseStringPipeline which knows how to cut a large marked up page into single lines (wiki markup is applied on a line-by-line basis), run them separately through the pipe, and then concatenates them back together at the end.

Finally there are a couple of example PipeNode filters ... one which turns blank lines into blank paragraphs, and * into list-items.

I hope it's comprehensible. It's a little over-engineered with extra methods that are not strictly necessary for the operation. (Such as admin functions I guessed would be useful.)

And it is extremely verbose.

Now compare the same thing in my new, functional style which dispenses with objects and classes but seriously uses generators and closures.



def makeSection(f) :
def gen(source) :
for x in source :
yield f(x)
return gen


def makeFilter(test) :
def gen(source) :
for x in source :
if test(x) : yield x
return gen


def pipeline(generators) :
def p(source) :
old = source
for g in generators :
new = g(old)
old = new
return old
return p


def blankLines(item) :
if item == '' :
return "<p/>"
else :
return item


def bullet(item) :
if item[0] == '*' :
return '<li>' + item[1:].strip() + '</li>'
else :
return item


def linewisePipe(page, pipe) :
return '\n'.join(pipe(page.split('\n')))


# test it
pipe = pipeline([ makeSection(blankLines) ,
makeSection(bullet) ])
page = """here

is

* some
* data"""

print linewisePipe(page,pipe)




So, how does it work? Instead of handling each stage of the pipe with an object with only one significant method : process, it now handles each with a single function. Except that function is, in fact, a generator (ie. it's a function who's execution frame doesn't disappear when it yields it's return value, but continues in memory, and gets resumes where it left off).

The reason for this, is that these generators take a in iterator for their argument. And only pull and process the next item from the iterator before yielding it. The use of a generator is important because once it's up and running it's the generator itself which is remembering which item is currently being processed.

Of course, writing generators is a mildly more complicated than writing functions, so I've written some "make" functions to turn ordinary functions into them. makeSection(f) takes an ordinary x -> y function and returns the generator as a closure; note how the actual processing function is bound to f at the call of makeSection.

The result of calling makeSection, then, is a new generator which a) sucks something from an iterator (called "source"), b) calls f on it, c) yields it (ie. returns it, but hangs around with the next item from the iterator in "x", which it will yield next time it's called)

To simply use one of these sections on its own you could do something like this :




s = makeSection(lambda x: x * 2)
for x in s( [1,2,3,4,5] ) :
print x




Closure s takes an iterator as an argument (in this case, the list [1,2,3,4,5] ). It's result is also an iterator. (Which is what Generators look like from outside)

In fact the "for" loop keyword is implicitly calling next() on it every time to get a new value for x. When it does so, it pulls the next value out of s. Now, inside s, we have the body of "gen" as defined in makeSection. This is applying the function f (which doubles its argument x) on each value that comes through.

So sctions are always generator functions that are started with an iterator as input and at every step of the iteration return the next item from their source iterator transformed. The transformation itself depends on the argument given to the call to makeSection.

Filters are the same, except in this case, they only pass on items that meet the "test" criteria.

The next part of the program is to chain a number of these generators together into a pipeline. In fact, "pipeline" is another function which returns a closure. It takes a list of sections and filters and it's this list which is stored in the closure. (Be careful here, the reference to the list not the actual list content - so there's a danger of changing them after the execution of pipeline)

What comes out of pipeline is a new generator. One which takes an initial iterator as source, and on each next, sucks the next item from that source all the way through all the stages of the pipeline and spits it out.

So what are the advantages of this over my original object oriented version? It's certainly far "cleverer". To me, today, it looks more elegant. And it's much easier to start using. My specific transformation functions don't need to know anything about pipelines - they don't need to be defined as methods on new classes that are explicitly sub-classed from PipeNode. Instead each function can be written in ignorance of it's role, and then turned into a section or filter when needed. (Often just before assembling the pipe.)

Pipes are easy to assemble using the pipeline function, and I believe are very intuitive to use.

Of course, the object version could be tweaked to look more like the functional version to the outside user. The Pipeline object could be an iterator, making it possible to use it in a for loop. We could perhaps make it take a list of sections in it's constructor rather than force the programmer to write multiple explicit addNodes.

Nevertheless, I'm smitten with the new style. I'm replacing the old code with the new today.

No comments: