Wednesday 23 January 2013

About Iterators using blessed File Ties

This posting will try to limit itself to summarizing Iterators and their benefits, a brief detour into re-blessing tied files for greater profit, and how Iterators relate to lexical analysis. You are expected to understand Perl's approach to object oriented programming, and you'll probably want to learn about tied filehandles if you're keen to understand the internals.


The book Higher-Order Perl by Mark Jason Dominus devotes a lot of pages to the subject of Iterators, which are basically objects representing the process of stepping through a bigger object, be it an array with an integer step index, a file with lines, or any other kind of container.

In a nutshell, an iterator must contain a reference to the container being traversed, and it must remember the current position attained. Iterators can have a few more common features; an important one I've used* is the ability to 'peek' at the next item without advancing the position.

* this peek feature isn't actually used by my Marpa parser for 4GL...

Iterators can also have methods defined to return other interesting facts about the state. An important one is the current line number when reading a file source, and also the name of the file itself. These two features are useful for error reporting or tracing activity.
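For example, given an iterator $iter with accessor methods named fname and lineno (the names I use in the sketches further down), an error report becomes a one-liner:
warn "suspicious input at ", $iter->fname, " line ", $iter->lineno, "\n";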

Iterators commonly have a 'start' (prepare) method, a 'next item' call, maybe a test for end of iteration, and maybe a 'finish' or wrap-up method. Typical use is
my $iter = Iter::Class->new($container);
while($iter->hasMore) {
    my $item = $iter->next;
    process $item;
}
$iter->close;
or the more usual
my $iter = Iter::Class->new($container);
while(my $item = $iter->next) {
    process $item;
}
$iter->close;
Some of these calls, such as close, may not be needed, depending on the type of the container and the cleverness of the programmer.
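If the container is a plain array, a minimal sketch of such a class might look like this (the name Iter::Array and its internals are my own invention, just to show the shape):
package Iter::Array;

sub new {
    my ($class, $container) = @_;   # $container is an array reference
    return bless { _items => $container, _pos => 0 }, $class;
}

sub hasMore { my $self = shift; return $self->{_pos} < @{$self->{_items}}; }
sub next    { my $self = shift; return $self->{_items}[$self->{_pos}++]; }
sub peek    { my $self = shift; return $self->{_items}[$self->{_pos}]; }    # look without advancing
sub close   { }                     # nothing to clean up for an array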

The syntax and features of Perl make it possible to create an iterator that behaves like a file handle. You can write:
my $iter = Iter::Class->new($container);    # some things don't change...
while(not eof($iter)) {
    my $item = <$iter>;
    process $item;
}
close $iter;
I've only used the eof() call to prove the point that they behave like files, but the usual Perlish file reading loop can be used.

So how do you make something that behaves like a file? If you don't know, you need to read perldoc perltie. Go ahead, I'll wait. Make sure you understand tied file handles, because that's what I'll be using. I'll also be re-blessing THAT into another class for other purposes; I'll explain that because you probably haven't bought a copy of "Object Oriented Perl" by Damian Conway (and if not, why not? Get it!)

A useful piece of extra code is a factory method to create the tied filehandles, because the code is a bit on the messy side. More on that soon.

Now that you understand tied file handles (and of course you already understand Perl's method of 'blessing' references into classes), I can explain the point of re-blessing the reference into another class after tie-ing it.

A tied filehandle responds to all the usual Perl file operators and calls, such as <$handle>, readline($handle), eof($handle), seek($handle), close($handle) etc. The ability to respond to all of these depends on providing a suitable method in your class, as you've just read over in perldoc.

Once you have a $ref which is a tied file handle, what if you want to make some value-added method call on the reference, for example $ref->peek or $ref->lineno which are not in the standard set of file operators? It turns out you can re-bless the handle into another class, and then it will respond to the methods of that class too. Best wisdom is to bless it into a different class than the one you used in the tie operation.
package Iterator::File::Tie;
use fields qw(_handle _lineno);

sub TIEHANDLE {
    my ($class, $handle) = @_;

    my Iterator::File::Tie $impl = fields::new($class);
    $impl->{_handle} = $handle;
    $impl->{_lineno} = 0;

    return $impl;
}

sub READLINE { ... }
etc

package Iterator::File;
use Symbol ();    # provides Symbol::gensym(), used in new() below
sub peek { ... }
sub lineno { ... }
sub new {
    my $class = shift;
    my $fhandle = shift;
    my $self = Symbol::gensym();
    tie *$self, 'Iterator::File::Tie', $fhandle;
    bless $self, $class;   # re-bless the reference
    return $self;
}

package main;
open my $fhandle, '<', "filename" or die "cannot open filename: $!\n";
my $iter = Iterator::File->new($fhandle);
while(<$iter>) {
    if($_ eq 'foo' && $iter->peek eq 'bar') {
        ...
    }
}
Notice how the tied file handle (the anonymous glob returned from gensym) is tied into one class, Iterator::File::Tie, while the class Iterator::File not only defines the extra methods but is also a very convenient place to put the factory method new().

The thing to get your head around in the new() routine is how the re-blessing works with the tying. First an anonymous glob is created by Symbol::gensym(), which is immediately tied to the Iterator::File::Tie class; the $fhandle is passed in so that TIEHANDLE can store it inside the little alien that takes over the soul of the glob. In this instance, the alien is a hash created by fields::new(), which also performs the bless into the tie class, in case you were looking for it (time to read perldoc fields and perldoc base).

At this point, the glob has been marked internally by Perl as being an el-fako tied filehandle, so whenever it is passed to an IO operator, the special methods in the ::Tie class will be applied.

HOWEVER - the glob reference itself is not actually a class object, it is still a simple mindless object without any class, kind of like Paris Hilton in that regard. It turns out the reference can ALSO be blessed into a class alongside the tie class. In this way, we not only get the proper response to IO operators, but we can call other methods too.
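A quick check back in package main makes the dual identity visible (using the $iter created in the snippet above):
print ref($iter), "\n";          # Iterator::File      - the class of the re-blessed glob ref
print ref(tied(*$iter)), "\n";   # Iterator::File::Tie - the class of the alien behind the tie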

As an aside, you may be wondering why create an Iterator that wraps a filehandle - one which is more than capable of responding to IO operators all on its own. The reason is, you can add extra power to the Iterator but keep it in a neat package. The extra features can be things like full-line lookahead, full-line pushback, keeping track of the line number and so on. It's also nice for consistency when chaining layers of iterators.

Back to the story. The only remaining thing to know is exactly how to write the methods in the second module. Since the real goodies of the state of the iterator are buried inside the little alien that replaced the soul of the glob, you merely need to get access to the alien. It goes like this:
sub fname : method {
    my $self = $_[0];
    my $alien = tied(*$self);
    return $alien->{fname};
}
Actually, a more appropriate name for $alien is $impl - i.e. the "implementation object" of the tie. As you can see, the ancillary methods are given the reference ($self) and then you get access to the embedded object using the tied() function. From then on you have full access to whatever simple structure (usually a hash) handles the state of the iterator.

Note: this sample iterator over ordinary files is a real one, and its code will be posted in the next blog entry, since it's usually the first iterator in my iterator chains.
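In the meantime, here is a rough sketch of how the still-elided pieces could be filled in - just enough to show the shape, using the _handle and _lineno fields declared earlier (peek needs a small buffer, so I leave that to the Peek iterator shown below):
package Iterator::File::Tie;

sub READLINE {
    my Iterator::File::Tie $impl = $_[0];
    my $handle = $impl->{_handle};
    my $line = <$handle>;
    $impl->{_lineno}++ if defined $line;    # count lines as they are handed out
    return $line;                           # scalar-context reads only, to keep the sketch short
}

sub EOF   { my $impl = $_[0]; return eof($impl->{_handle}); }
sub CLOSE { my $impl = $_[0]; return close($impl->{_handle}); }

package Iterator::File;

sub lineno : method {
    my $self = $_[0];
    my $impl = tied(*$self);    # same trick as fname above
    return $impl->{_lineno};
}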

Clear as mud? This has turned out to be a bit of a rushed tutorial on re-blessed tied filehandles and not so much about iterators. So it's time to correct that deficit. We still haven't tied it back to Lexical analysis, or given much of an explanation of the "chained iterators" I keep talking about.

Chaining Iterators

Now that we have a (very simple) iterator which doesn't do much more than a file handle apart from letting us write extra methods, what else can we do?

The next step could be a simple iterator which detabs the input stream and removes trailing blanks. Here's the READLINE function in such an iterator:
package Iterator::Detabber::Tie;
...
sub READLINE {
    my Iterator::Detabber::Tie $impl = $_[0];
    my $handle = $impl->{_handle};

    local $_;
    if(defined($_ = <$handle>)) {
        # this mess expands each tab out to the next 8-column tab stop
        1 while s/^([^\t]*)(\t)/$1 . (' ' x (length($2)*8 - length($1)%8))/e;
        s/ +$//;    # remove trailing blanks
    }

    return $_;
}  # READLINE
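The matching wrapper class is almost pure boilerplate, following the same factory pattern as Iterator::File - here's a sketch (the name Iterator::Detabber matches the chain shown further down):
package Iterator::Detabber;
use Symbol ();

sub new {
    my ($class, $source) = @_;      # $source is any readable handle - usually another iterator
    my $self = Symbol::gensym();
    tie *$self, 'Iterator::Detabber::Tie', $source;
    return bless $self, $class;     # re-bless the glob ref, same trick as before
}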
Another iterator might perform look-ahead on behalf of other iterators:
package Iterator::Peek::Tie;
...
my $reader = sub {      # put another token on the queue
    my Iterator::Peek::Tie $impl = $_[0];
    my $handle = $impl->{_handle};
    my $item;

    if(defined($item = <$handle>)) {
        push @{$impl->{_queue}}, $item;
        return 1;
    }

    return undef;   # signal failure to fetch - source must be at eof
};  # reader

sub READLINE {
    my Iterator::Peek::Tie $impl = $_[0];
    # just need 1 item on the queue
    $reader->($impl) unless @{$impl->{_queue}};
    # take the item off the queue
    return shift @{$impl->{_queue}};
}   # READLINE
...
package Iterator::Peek;
...
sub peek {
    my $self = $_[0];
    my $impl = tied(*$self);    # reach the implementation object behind the tied glob
    # just need 1 item on the queue
    $reader->($impl) unless @{$impl->{_queue}};
    return $impl->{_queue}[0];
}
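That queue also makes full-line pushback (mentioned earlier as a nice extra) almost free. A hypothetical unread method - not part of my real code - could go in the same package:
sub unread {
    my ($self, @items) = @_;
    my $impl = tied(*$self);                # the implementation object behind the glob
    unshift @{$impl->{_queue}}, @items;     # the next READLINE hands these back first
    return;
}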
To put it all together:
my $fi  = Iterator::File->new("filename");
my $dti = Iterator::Detabber->new($fi);
my $pki = Iterator::Peek->new($dti);

while(my $item = <$pki>) {
    if($item eq "foo" && $pki->peek eq "bar") {
        ...
    }
}
Each iterator is given another iterator as its source "handle". The chains can be stacked up as deeply as your imagination can take you.

The book Higher-Order Perl shows all sorts of chains. One of the more interesting styles is progressive analysis that can end up with dozens of iterators chained together. In fact one of the support functions creates closures around iterators, each closure handling a single kind of item to be recognised.

One of my own scripts goes like this:
my $iter = Iterator::Tokenize->new(-file => $ARGV[0]) or die;
my $peeker = Iterator::Peek->new($iter, 'comment', 'space');
my $parser = Iterator::FGL->new($peeker);
my $reform = Iterator::Reformat->new($parser);
my $assem = Iterator::Assemble->new($reform);

print while <$assem>;
This script uses a tokeniser Iterator, which is read by a Peek iterator, which is being read by the FGL reader (it needs the look-ahead provided by Peek), in turn read by a Reformatter which hacks the recorded line number and column number of each token, effectively performing a pretty-print on the input stream. Finally the Reformatter is being read by an Assemble Iterator which reads the stream of tokens and glues it back together into a simple line-oriented stream for printing.

Each iterator can pretend it's doing a simple read of a file, perform some work on it, and emit its own stream of items, be they strings or objects. Not only that, but each iterator can read its handle at several points around its loop. For example, if an iterator needs to handle continuation lines:
while(<$handle>) {
    # is there a trailing backslash (continuation line)?
    while(s/\\$//) {
        last if eof($handle);   # safety first!
        chomp;                  # drop the newline so the continuation joins cleanly
        $_ .= <$handle>;
    }
    ....
}
I hope this page has popped up a light-bulb over your head about iterators.

Next postings will cover the actual iterators used by my Marpa-based 4GL parser. If you're lucky, there will be fewer words to wade through.

Iterators for Lexical analysis

Finally, the tie-in to lexing. No pun intended. OR IS IT?

A critical layer in the Iterator chain is a tokenizer or Lexer. Recognizing distinct tokens in an input stream is necessary before a parser can process the language of interest. Wouldn't it be nice if you could have some sort of handle which can yield a token whenever you ask it? Enter Iterators, stage left.

Here's the important snippet of a simple lexing iterator:

sub READLINE {
    my Iterator::Tokenize::Tie $impl = $_[0];
    my $handle = $impl->{_handle};
    local $_;

    # make sure our queue has something on it
    while(! @{$impl->{_queue}}) {
        last unless defined($_ = <$handle>);
        chomp;

        # following are the two-character tokens of
        # interest. Anything else is single char or
        # a compound eg strings, numbers, idents
        while(/\G(<=|<>|==|>=|!=|.)/gc) {
            my $buf = $1;
            my $type;

            if($buf eq '"') {
                # double-quote string
                # DANGER! string may not be terminated
                # if there is a syntax error.
                # Deal with it.
                /\G((\\.|[^"])*)"?/gc;
                ($type, $buf) = ('string', $1);
            }
            elsif($buf eq "'") {
                # single-quote string
                # DANGER! string may not be terminated
                # if there is a syntax error.
                # Deal with it.
                /\G((\\.|[^'])*)'?/gc;
                ($type, $buf) = ('string', $1);
            }
            elsif(substr($buf, 0, 1) =~ /\d/) {
                # number
                /\G(\d*)/gc;
                # don't forget to include the original
                # character captured above...
                ($type, $buf) = ('number', "$buf$1");
            }
            elsif(substr($buf, 0, 1) =~ /[a-z_]/i) {
                # identifier
                /\G(\w*)/gc;
                ($type, $buf) = ('ident', "$buf$1");
            }
            elsif($buf eq ' ') {
                /\G(\s*)/gc;
                ($type, $buf) = ('space', "$buf$1");
            }
            else {
                $type = 'symbol';
            }

            push @{$impl->{_queue}},
                 Token::Lex->new($type, $buf);
            # The full token type has more attributes
        }
    }

    # happily, if the queue is empty due to eof,
    # shift returns undef, and so do we.
    return shift @{$impl->{_queue}};
}
A real tokenizer will be a bit more complicated (hell yeah! for the 4GL version) and of course this thing should probably be table-driven so it can support different languages, but you get the picture. The beauty is, complex details such as handling comments, multi-line comments, etc can be isolated from more mundane activities such as reading lines, detabbing, keeping track of line numbers, etc.
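By table-driven I mean something along these lines - a sketch only, with the table contents invented for illustration:
my @token_table = (
    [ string => qr/"(?:\\.|[^"])*"?|'(?:\\.|[^'])*'?/ ],
    [ number => qr/\d+/                               ],
    [ ident  => qr/[a-z_]\w*/i                        ],
    [ space  => qr/\s+/                               ],
    [ symbol => qr/<=|<>|==|>=|!=|./                  ],
);

sub next_token {            # expects the current line in $_, with pos() maintained by /gc
    for my $entry (@token_table) {
        my ($type, $pattern) = @$entry;
        return ($type, $1) if /\G($pattern)/gc;
    }
    return;                 # end of line reached
}
Supporting a different language then just means swapping in a different table, while the surrounding queue and line-reading machinery stays exactly the same.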

My Marpa parser for 4GL uses the following layers:
  • File reader
  • detabber and blank-trimmer
  • tokeniser
  • token munger which looks for keywords and special symbols
  • alternative-builder which determines if a token is ambiguous
The last two layers won't mean much right now, but they are key for dealing with the weird and wonderful syntax that governs Informix 4GL, and interact with a powerful feature of Marpa.

Finally I can say - stay tuned for the next installment!

Happy Programming.
