Tuesday, 22 January 2013

A class to hold lexical tokens

All language processors have to deal with words, otherwise known as tokens. You could simply keep them in the form of strings, but most usually you need at least a Type field too. Other useful attributes of a token could include things like the name of the source file, the line number and column number it was found at, perhaps it's length as a string, and so on.

My tokens are composed as a pair of classes, the first being a simple base class containing the filename and line number. I split these out because they kept turning up in a number of iterators, and it was getting silly copying the code around.

The second class inherits from this base, and contains all the attributes I care about.

The first class is lib/Token/Base.pm:
use strict;
use warnings;
no warnings 'uninitialized';

package Token::Base;
our $VERSION = v1.0.0;

# Token::Base
#   used by a number of Iterators in various guises.
# Fields:
#   fname   => name of source file
#   lineno  => line number within file.
use fields qw(_fname _lineno);

sub fname : method lvalue {
    my Token::Base $self = $_[0];

sub lineno : method lvalue {
    my Token::Base $self = $_[0];

sub new {
# ->new($fname, $lineno)
    my $class = $_[0];
    my Token::Base $self = fields::new($class);
    $self->{_fname} = $_[1];
    $self->{_lineno} = $_[2];
    return $self;
}  # new

As with all my modules here, I'm using the fields module due to the slight performance benefits AND the ability to detect spelling mistakes when accessing the fields. If you use plain hashes then a simple spelling mistake on the field name will go unnoticed (at first...). If you use the fields module, you can also declare an 'advisory' type for your my variables, which allows the compiler to prepare faster access to the named fields:
my Some::Class $x = whatever;
$self->{field} = whatever;
The other important point to note here is the use of "lvalue" methods. In the language of compilers, an lvalue is anything that can appear on the left-side of assignment - ie something with memory that can be splatted. By using the lvalue attribute on the method declarations, you can write things like:
$token->filename = "blah";
Perl currently has very strict rules on the definition of lvalue methods. The assignable item MUST be the last "statement" in the method, you can't use the return statement, and you can't even have a trailing semicolon after the expression. If you mess this up, it will go unnoticed (at first...)

A subtle point: I'm not using the accessors internally (ie I'm using ->{_field} directly) because there are noticable performance benefits. External users of the module do not have that luxury - at least, not for fields named with a leading underscore. This is how package maintainers keep their control! (but then they give it away with lvalue methods???)

For more information about method attributes, see perldoc attributes but be warned, it gets hard-core. The method and lvalue attributes are simpler aspects of the feature!

The token class proper is lib/Token/Lex.pm:
use strict;
use warnings;
no warnings 'uninitialized';

use feature "switch";

package Token::Lex;
our $VERSION = v1.0.0;
use Carp;

my $caseless = 1;
# by default, the cmp operator (tokenCmp) is
# case-insensitive

# call Token::Lex::Caseless(1 | 0) to switch case
# sensitivity of tokenCmp
# Not worth exporting this - it will only be called
# once so spell it out!!!
sub Caseless($) {
    # keep the value clean no matter what's passed in.
    $caseless = $_[0] ? 1 : 0;
}  # Caseless

use base 'Token::Base';

use fields qw(_colno _type _nlines
              _len _text _caseless);

use overload
    'cmp' => \&tokenCmp,
    '""'  => \&tokenStringer;

sub colno : method lvalue {
    my Token::Lex $self = $_[0];

sub type : method lvalue {
    my Token::Lex $self = $_[0];

sub nlines : method lvalue {
    my Token::Lex $self = $_[0];

sub len : method lvalue {
    my Token::Lex $self = $_[0];

sub text : method lvalue {
    my Token::Lex $self = $_[0];

sub caseless : method lvalue {
    my Token::Lex $self = $_[0];

# following is only for internal use only. I think.
# can't use caseless above because it has to follow
# the rule for lvalue methods.
# that's kind of tragic, and this whole situation will
# probably need improving.
sub _caseless : method {
    my Token::Lex $self = $_[0];
    return $self->caseless // $caseless;

sub new {
    my $class = shift;
    my $fname = shift;
    my $lineno = shift;
    my Token::Lex $self = fields::new($class);

    while(@_) {
        my $opt = shift;
        my $val = shift;

        given($opt) {
          when ('-colno')  { $self->{_colno}  = $val }
          when ('-type')   { $self->{_type}   = $val }
          when ('-nlines') { $self->{_nlines} = $val }
          when ('-len')    { $self->{_len}    = $val }
          when ('-text')   { $self->{_text}   = $val }
          default          {
            croak "Bad option to ",
                   ref $self, "->new: $opt"

    $self->fname = $fname;
    $self->lineno = $lineno;
    return $self;
}  # new

sub tokenCmp : method {
# usage:
#   $tok eq "something"
#   $tok lt $tok2
#    etc
    my Token::Lex $self = $_[0];
    my $arg = $_[1];
    my $reversed = $_[2];

    # If comparing a Token to a Token,
    # need to convert 2nd arg to a plain string
    $arg = $arg->text if ref($arg)
                      && $arg->isa('Token::Lex');

    my $val = $self->{_text};
    if($self->_caseless) {
        $arg = lc $arg;
        $val = lc $val;
    # NOTE: if comparing two tokens, but the
    #  ->caseless setting of each are different
    # then the left object wins. Try not to be
    # this clumsy in your code...

    my $result = $val cmp $arg;
    $result = 0 - $result if $reversed;
    return $result;
}   # tokenCmp

sub tokenStringer : method {
    my Token::Lex $self = $_[0];
    my $text = $self->{_text};
    substr($text, 37) = "..." if length($text) > 40;

    $text = "Token<" . $self->{_type} . "> ($text)";
    $text = "[" . $self->lineno . "] " . $text
        if $self->lineno;
    $text = $self->fname . $text if $self->fname;

    return $text;
}   # tokenStringer
sub is : method {
# Usage: $tok->is('type') or $tok->is('type', 'text').
# If only type passed, tests for that.
# If text is also passed, then that must compare true
# along with type.

# Performs a smart-match on the text, so you can pass
# in a regex if you like.

    my Token::Lex $self = $_[0];

    # immediate failure if type is not a match
    return undef if $self->{_type} ne $_[1];
    # immediate success if no text comparison to do
    return $self if @_ < 3;

    my $text = $self->{_text};
    my $arg = $_[2];

    if($self->_caseless) {
        $text = lc $text;
        $arg = lc $arg unless ref($arg) eq 'Regexp';

    # here comes the smart-match which can handle regex
    return undef unless $text ~~ $arg;

    return $self;
}   # is

sub isnt : method {
# this is just a convenient negation of the
#   ->is() method.

    my Token::Lex $self = shift;
    return $self->is(@_) ? undef : $self;
}   # isnt


Points of interest

Case sensitivity

The class supports switching on whether comparisons are case sensitive or not. By default it is case-insensitive. I've even allowed for each token to contain it's own caseless flag, if the class is ever used in a complex program with several distinct languages. The code only prepares for the per-object flag, and reads it too, but does not otherwise support passing it into the constructor. Sub-classes can look after themselves if that is ever needed.


  • colno
    column number within the source line for the token
  • type
    assigned token type (eg 'Number', 'String', etc...)
  • nlines
    if the token covers many lines then this counts how many. Examples are Here-Docs, a multiline comment, multi-line strings, etc. This was needed by my re-assembler iterator which needed to keep track of where it was so it could insert blank lines as needed. If it's an ordinary token then nlines is zero. Rule of thumb: if there's a newline in the token, start setting this (my Iterator does...)
  • len
    plain length of the token's text in actual chars.
  • text
    body of the token. May be redundant if the type uniquely identifies the token, but the text is put in anyway.
  • caseless
    sub-classes can choose to assign this, and it will win on a token-by-token case (pun intended). This module does not, however, put any effort into handling it beyond testing it.


RTFM - perldoc overload
An overload is provided for Perl's cmp operator. Given this, all use of the operators lt, gt, le, gt, eq and ne will be mapped by Perl to call this cmp operator. The eq and ne operators are the most likely ones to be used. The cmp operator defaults to case-insensitive.

The stringifying overload produces a visually descriptive string which is NOT suitable for use other than tracing and debugging. Always use direct access to the type and text fields within your code. By the way, the stringifying overload is called when you do things like "$token etc etc" or $token . "something". Avoid that for working code. Use it only in tracing/debugging.

Functions and methods

sub Token::Lex::Caseless($)
Allows external control of the case insensitivity.

attribute accessors
Assignable accessors are provided for all the new fields.

sub new($fname, $lineno, options)
The filename and line number must be provided. Other fields can be passed in as -field => value pairs.

tokenCmp() and tokenStringer()
Next follows the guts of the overloads. Nothing more to say about them.

$token->is('type') and $token->is('type', text)
These two methods make it easy to test the complex value of a token. If you pass in only a type, then it must be of that type. If you also pass in some text, it must also compare favourably with that text. Note that the 2nd argument can actually be a regexp object such as qr/wierd-token-pattern/. Inside the code you'll see use of the new ~~ operator provided from Perl 5.10 forwards.

Speaking of new Perl operators, I'm also using // in several places. Look 'em up in perldoc perlop.

$token->isnt('type') and $token->isnt('type', text)
This thing is just a convenient negation of the ->is() method.

No comments:

Post a Comment