CA4 Project - Functional Specification

Parse::Combinator, a combinator parser for Perl

Dave Madden 97304531
12 January 2001

Introduction
As a perl programmer, I have often used perl in problems involving text-handling. Perl has developed a good reputation over time in this regard, because it allows for very quick and easy development, and has what is generally considered to be the best regular expression engine currently in widespread use.

However, despite this, perl's native functionality is lacking. Regular expressions simply are not powerful enough to deal with complex grammars. Parsing emails (RFC822) is incredibly convoluted using regular expressions. There are several parsing modules (libraries) available for perl, such as Parse-RecDescent, libparse, Parse-Yapp, and Parse-Vipar. Most of these are unfortunately underutilizing by the perl community. This might be partly to do with the fact that they do not give perl an edge over other tools and languages - Yapp, ParseLex, and Parse-Vipar, for example, are very like existing UNIX tools such as Yacc, Bison and Lex. This means that they make it possible to do Yacc and Lex like tasks in Perl, but do not make it especially easy to allow the use of this functionality as an integral part of the language. RecDescent is powerful, but again, uses its own "little language" rather than perl functions.

Advantages offered by combinator parsing
A combinator parser offers the possibility of a more integrated parsing process. As Daan Leijen puts it, "Combinator parsers are written and used within the same programming language as the rest of the program. There is no gap between the grammar formalism (Yacc) and the actual programming language used (C). Users learn just one language!". Parsers are manipulable as part of the language. For example, if you were building a file viewer, that needs to parse several types of text, and apply common functionality to them, you could construct a custom-parser dynamically within the program, and pass it around within the language.
In addition, new combinators can be created. This allows the use of entirely new ways of constructing grammars. See the glossary.

Scope of this document
This document describes the basic architecture, aims and functionality of the combinator parser project.

Overview of the project
Glossery
In order to better explain the architecture of the project, the glossary provides brief descriptions of some of the main elements of the project.

BNF: A notation used to describe grammars. More powerful but more verbose than regular expressions.
Combinator: A conbinator is a higher-order function. That is, it is a function which manipulates other functions. In our case, combinators serve much the same role as metacharacters do in regular expressions. Of course, we can create and use new combinators, allowing for totally new functionality.
CPAN: The Comprehensive Perl Archive Network. The main public archive of Perl modules.
EBNF: Extended BNF. Uses regex-like metacharacters.
Functional programming: A type of programming language (or methodology, as in this project) which tends to use recursive solutions, avoids assignments, looping and gotos, and which uses combinators. For the purposes of this project, the key issue is the use of combinators. Perl itself is not a functional language, and as such, not all of the project will be functional; the parts relating directly to the problem to be solved will solve the parsing problem using functional techniques. This is analogous to the advantages of combinator parsing in perl - the power to combine the flexibility of combinators and combinator parsing with the practical usability of perl.
Gofer: A variant of Haskell.
Haskell: A modern functional language.
Miranda(tm): A functional language (tm Research Software Limited)
Metacharacter: Special character in a regular expression, which dictates the interpretation of the literal (normal) characters. Metacharacters offer repetition and alternation, for example.
Module: A perl library.
Perl: A scripting language, which is normally used for scripting or procedural programming. It also supports OO programming, and functional programming to a lesser degree. This is the language in which this combinator parser will be developed and used.
Regular expression: A simple but powerful way to express many non-recursive text patterns. Also called "regex" or "regexp".

Project objectives
These are the project objectives:

To formulate and implement a way of using the functional methodolgy within Perl, a non-functional language.
To produce the fundamental parts of the combinator parser architecure, and package these in one or more perl modules.
To provide additional functionality on top of the core combinator parser itself, to make it easier to use, more flexible, and to provide common functionality pre-written.

The combinator parser's structure is to be topologically and semantically based on [E]BNF. This requires the combinators to offer recursion, sequencing etc. Objective 3 might explicitly provide a BNF-like grammar if it is found that perl programmers aren't satisfied with using the combinators directly. (This would be less powerful than using the combinators directly.) Exactly how Objective 3 is dealt with will depend largely on programmer feed-back.

Project constraints
There are two major constraints:

Perl is not a functional language. This means that applying a functional methodology will be difficult, and not always practical.
Time. A module requires significant amounts of usage and testing by programmers to refine its API and internals. Time is the main limitation on this, as in most projects.

System architecture
There are three tiers to the architecture, corresponding to the three Project Objectives:

Tier 1: Functional programming functionality. This will involve both using appropriate coding techniques, and implementing perl functions to mimic the functionality of common functional language functions, as needed. It will involve finding and exploiting ways of using perl's features to act like functional features. e.g. Using perl arrays (or array refs, or objects containing arrays) as sequences.
Tier 2: Combinator parser core. This consists of several sub-parts:
- Primitive parsers - These are the most low level of parsers, equivalent to the primitive data types in a programming language. They are the building blocks of more advanced parsers.
- Basic parsers - These are simple parsers built from the primitive parsers. They do simple, specific jobs, such as matching a single character of a specific type.
- Combinators - Higher-order functions used to structure basic parsers.
Tier 3: Additional Functionality. Exactly what this includes will depend on the feed-back I get from other perl programmers. It will certainly include general parsers. i.e. high level parsers build using combinators, other general parsers, and basic parsers. These general parsers will include a library of common patterns. It will also include additional ways of using Tier 2 functionality, through alternative BNF-style notation and/or suitable general parsers and combinators.

-------------------------------------------------------------
| General parser / combinator library | Additional notation |	Tier 3
-------------------------------------------------------------
| 			Basic Parsers		  	    |	Tier 2
-------------------------------------------------------------
| Primitive Parsers		| Basic combinators	    |
-------------------------------------------------------------
| Functional methodology functionality			    |	Tier 1
-------------------------------------------------------------

Schedule
January:
Functional Specification
Simple prototype. This will be a single module, containing primitive and basic parsers, and basic combinators. The functional elements will be dealt with in an ad hoc manner. (Prototypes of Tier 1 and Tier 2)

February:
Abstract out functional elements (Tier 1) into seperate functions and/or a seperate module as needed. Document clearly how all functional aspects are being dealt with. This might be a minor job, initially.
Re-code Tier 2 as a seperate module (or modules).
Write some general parsers (part of Tier 3).
Write initial documentation.
Find out more about CPAN packaging.
Start getting feed-back from perl coders.

March:
Initial prototype of Tier 3 functionality.
Get more feed-back.

April (after exams):
Do a re-code. Some code will be retained from before if suitable. Any major new unforseen elements will be introduced / dealt with at this point.

May:
Debug, test, get more feedback, modify Tier 3 based on this. Tier 1 and 2 should require much less modification at this point.
Documentation, packaging.

References
Academic papers:
The main papers upon which this project is significantly based are about implementing combinator parsers in actual functional languages such as Gofer, Miranda(tm) and Haskell.

Graham Hutton. (1992)
Higher-order functions for parsing.
Journal of Functional Programming 2: 232-343.
http://www.cs.nott.ac.uk/Department/Staff/gmh/parsing.ps

Graham Hutton and Erik Meijer. (1993)
Functional Pearls - Monadic Parsing in Haskell.
http://www.cs.nott.ac.uk/~gmh//pearl.ps

Graham Hutton and Erik Meijer. (1996)
Monadic Parser Combinators.
Technical report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham.
http://www.cs.nott.ac.uk/Department/Staff/gmh/monparsing.ps.

Comparitive implementations:
One of the implementations I examined was Parsec, a combinator parser for Haskell, by Daan Leijan
Parsec: http://www.cs.uu.nl/~daan

Functional programming:
"Functional Programming : Practice and Theory" (Bruce J. MacLennan, ISBN 0-201-13744-5)
Functional programming: http://cm.bell-labs.com/cm/cs/who/wadler/topics/monads.html
Haskell: www.haskell.org
Haskell: "Haskell: The Craft of Functional Programming" (Simon Thompson, ISBN 0-201-34275-8)

Additional topics:
Perl: www.perl.org
Perl modules: www.cpan.org