CA4 Project - Functional Specification
Parse::Combinator, a combinator parser for Perl
Dave Madden 97304531
12 January 2001
Introduction
As a perl programmer, I have often used perl in problems involving
text-handling. Perl has developed a good reputation over time in this
regard, because it allows for very quick and easy development, and has
what is generally considered to be the best regular expression engine
currently in widespread use.
However, despite this, perl's native functionality is lacking. Regular
expressions simply are not powerful enough to deal with complex
grammars. Parsing emails (RFC822) is incredibly convoluted using regular
expressions. There are several parsing modules (libraries) available for
perl, such as Parse-RecDescent, libparse, Parse-Yapp, and
Parse-Vipar. Most of these are unfortunately underutilizing by the perl
community. This might be partly to do with the fact that they do not give
perl an edge over other tools and languages - Yapp, ParseLex, and
Parse-Vipar, for example, are very like existing UNIX tools such as Yacc,
Bison and Lex. This means that they make it possible to do Yacc and Lex
like tasks in Perl, but do not make it especially easy to allow the use of
this functionality as an integral part of the language. RecDescent is
powerful, but again, uses its own "little language" rather than perl
functions.
Advantages offered by combinator parsing
A combinator parser offers the possibility of a more integrated parsing
process. As Daan Leijen puts it, "Combinator parsers are written and used
within the same programming language as the rest of the program. There is
no gap between the grammar formalism (Yacc) and the actual programming
language used (C). Users learn just one language!". Parsers are
manipulable as part of the language. For example, if you were building a
file viewer, that needs to parse several types of text, and apply common
functionality to them, you could construct a custom-parser dynamically
within the program, and pass it around within the language.
In addition, new combinators can be created. This allows the use of
entirely new ways of constructing grammars. See the glossary.
Scope of this document
This document describes the basic architecture, aims and functionality of
the combinator parser project.
Overview of the project
Glossery
In order to better explain the architecture of the project, the glossary
provides brief descriptions of some of the main elements of the
project.
- BNF: A notation used to describe grammars. More powerful but
more verbose than regular expressions.
- Combinator: A conbinator is a higher-order function. That is,
it is a function which manipulates other functions. In our case,
combinators serve much the same role as metacharacters do in regular
expressions. Of course, we can create and use new combinators, allowing
for totally new functionality.
- CPAN: The Comprehensive Perl Archive Network. The main public
archive of Perl modules.
- EBNF: Extended BNF. Uses regex-like metacharacters.
- Functional programming: A type of programming language (or
methodology, as in this project) which tends to use recursive solutions,
avoids assignments, looping and gotos, and which uses combinators. For the
purposes of this project, the key issue is the use of combinators. Perl
itself is not a functional language, and as such, not all of the project
will be functional; the parts relating directly to the problem to be
solved will solve the parsing problem using functional techniques. This is
analogous to the advantages of combinator parsing in perl - the power to
combine the flexibility of combinators and combinator parsing with the
practical usability of perl.
- Gofer: A variant of Haskell.
- Haskell: A modern functional language.
- Miranda(tm): A functional language (tm Research Software
Limited)
- Metacharacter: Special character in a regular expression, which
dictates the interpretation of the literal
(normal) characters. Metacharacters offer repetition and alternation, for
example.
- Module: A perl library.
- Perl: A scripting language, which is normally used for
scripting or procedural programming. It also supports OO programming, and
functional programming to a lesser degree. This is the language in which
this combinator parser will be developed and used.
- Regular expression: A simple but powerful way to express many
non-recursive text patterns. Also called "regex" or "regexp".
Project objectives
These are the project objectives:
- To formulate and implement a way of using the functional methodolgy
within Perl, a non-functional language.
- To produce the fundamental parts of the combinator parser architecure,
and package these in one or more perl modules.
- To provide additional functionality on top of the core combinator
parser itself, to make it easier to use, more flexible, and to provide
common functionality pre-written.
The combinator parser's structure is to be topologically and semantically
based on [E]BNF. This requires the combinators to offer recursion,
sequencing etc. Objective 3 might explicitly provide a BNF-like grammar if
it is found that perl programmers aren't satisfied with using the
combinators directly. (This would be less powerful than using the
combinators directly.) Exactly how Objective 3 is dealt with will depend
largely on programmer feed-back.
Project constraints
There are two major constraints:
- Perl is not a functional language. This means that applying a
functional methodology will be difficult, and not always practical.
- Time. A module requires significant amounts of usage and testing by
programmers to refine its API and internals. Time is the main limitation
on this, as in most projects.
System architecture
There are three tiers to the architecture, corresponding to the three
Project Objectives:
- Tier 1: Functional programming functionality. This will involve
both using appropriate coding techniques, and implementing perl functions
to mimic the functionality of common functional language functions, as
needed. It will involve finding and exploiting ways of using perl's
features to act like functional features. e.g. Using perl arrays (or array
refs, or objects containing arrays) as sequences.
- Tier 2: Combinator parser core. This consists of several
sub-parts:
- Primitive parsers - These are the most low level of parsers,
equivalent to the primitive data types in a programming language. They are
the building blocks of more advanced parsers.
- Basic parsers - These are simple parsers built from the
primitive parsers. They do simple, specific jobs, such as matching a
single character of a specific type.
- Combinators - Higher-order functions used to structure basic
parsers.
- Tier 3: Additional Functionality. Exactly what this includes
will depend on the feed-back I get from other perl programmers. It will
certainly include general parsers. i.e. high level parsers build
using combinators, other general parsers, and basic parsers. These general
parsers will include a library of common patterns. It will also include
additional ways of using Tier 2 functionality, through alternative
BNF-style notation and/or suitable general parsers and combinators.
-------------------------------------------------------------
| General parser / combinator library | Additional notation | Tier 3
-------------------------------------------------------------
| Basic Parsers | Tier 2
-------------------------------------------------------------
| Primitive Parsers | Basic combinators |
-------------------------------------------------------------
| Functional methodology functionality | Tier 1
-------------------------------------------------------------
Schedule
January:
Functional Specification
Simple prototype. This will be a single module, containing primitive and
basic parsers, and basic combinators. The functional elements will be
dealt with in an ad hoc manner. (Prototypes of Tier 1 and Tier 2)
February:
Abstract out functional elements (Tier 1) into seperate functions and/or a
seperate module as needed. Document clearly how all functional aspects are
being dealt with. This might be a minor job, initially.
Re-code Tier 2 as a seperate module (or modules).
Write some general parsers (part of Tier 3).
Write initial documentation.
Find out more about CPAN packaging.
Start getting feed-back from perl coders.
March:
Initial prototype of Tier 3 functionality.
Get more feed-back.
April (after exams):
Do a re-code. Some code will be retained from before if suitable. Any
major new unforseen elements will be introduced / dealt with at this
point.
May:
Debug, test, get more feedback, modify Tier 3 based on this. Tier 1 and 2
should require much less modification at this point.
Documentation, packaging.
References
Academic papers:
The main papers upon which this project is significantly based are about
implementing combinator parsers in
actual functional languages such as Gofer, Miranda(tm) and
Haskell.
Graham Hutton. (1992)
Higher-order functions for parsing.
Journal of Functional Programming 2: 232-343.
http://www.cs.nott.ac.uk/Department/Staff/gmh/parsing.ps
Graham Hutton and Erik Meijer. (1993)
Functional Pearls - Monadic Parsing in Haskell.
http://www.cs.nott.ac.uk/~gmh//pearl.ps
Graham Hutton and Erik Meijer. (1996)
Monadic Parser Combinators.
Technical report NOTTCS-TR-96-4. Department of Computer Science,
University of Nottingham.
http://www.cs.nott.ac.uk/Department/Staff/gmh/monparsing.ps.
Comparitive implementations:
One of the implementations I examined was Parsec, a combinator parser for
Haskell, by Daan Leijan
Parsec: http://www.cs.uu.nl/~daan
Functional programming:
"Functional Programming : Practice and Theory" (Bruce J. MacLennan, ISBN
0-201-13744-5)
Functional programming:
http://cm.bell-labs.com/cm/cs/who/wadler/topics/monads.html
Haskell: www.haskell.org
Haskell: "Haskell: The Craft of Functional Programming" (Simon Thompson,
ISBN 0-201-34275-8)
Additional topics:
Perl: www.perl.org
Perl modules: www.cpan.org