光阴冢

We make choices and life has a way of making us pay for them.

ML for the working programmer

Jul 7, 2020  

学习如何给 $\lambda$-Calculus 写一个好看的 Parser 和 Interpreter。

Scanning, or lexical analysis

首先,把 String 转化为 Token List。我们可以先写出期望的 signature

1
2
3
4
5
signature LEXICAL =
sig
  datatype token = Id of string | Key of string
  val scan : string -> token list
end;

在此之前,我们想更 Generic 一点,不想先把 Keyword 限制死。因此实现的 Lexical 最好是个 Functor,只要符合了 KEYWORDsignature,就可以有一个对应的 Lexical 了。

1
2
3
4
5
signature KEYWORD =
sig
  val alphas : string list
  and symbols : string list
end;

好了,接下来看看 Lexical 怎么写。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
functor Lexical (Keyword : KEYWORD) : LEXICAL =
struct
  datatype token = Key of string | Id of string;
  fun member (x:string, l) = List.exists (fn y => x = y) l;
  fun alphaTok a =
      if member (a, Keyword.alphas) then Key(a) else Id(a);
  
  (* scanning of a symbolic keyword *)
  fun symbolic (sy, ss) =
      case Substring.getc ss of
          NONE => (Key sy, ss)
        | SOME (c, ss1) =>
          if member(sy, Keyword.symbols)
             orelse not (Char.isPunct c)
          then (Key sy, ss)
          else symbolic (sy ^ String.str c, ss1);
  (* scanning a substring into a list of tokens *)
  fun scanning (toks, ss) =
      case Substring.getc ss of
          NONE => rev toks
        | SOME (c, ss1) =>
          if Char.isAlphaNum c
          then (* identifier or keyword *)
              let val (id, ss2) = Substring.splitl Char.isAlphaNum ss
                  val tok = alphaTok (Substring.string id)
              in scanning (tok::toks, ss2)
              end
          else if Char.isPunct c
          then (* special symbol *)
              let val (tok, ss2) = symbolic (String.str c, ss1)
              in scanning (tok::toks, ss2)
              end
          else (*ignore spaces, line breaks, control characters *)
              scanning (toks, Substring.dropl (not o Char.isGraph) ss);
  fun scan a = scanning([], Substring.full a);
end;

多态的函数可能会很慢,因此定义了一个固定类型的 member 函数用来看是否是在 Keyword.alphas 中。使用 Substring 类型来一个字符一个字符读入。如果是字母或者数字,那么状态机进入看这个是 Id 还是 Key 的状态;如果读入的是标点,那可能是一个 Symbol ;其他的不可打印字符则忽略不看。

symbolic 函数返回这个字符串中最前面的 Symbol 以及剩下的字符串。

A toolkit for top-down parsing

一个类型 $\tau$ 对应的 parser,可以做到读入一个 Token List,返回一个该类型的值和剩下的 Token List,即这个函数的类型是:

\[ \text{token list} \rightarrow \tau \times \text{token list} \]

简单记这个类型为 $\tau~\text{phrase}$。并不是所有的 $\tau~\text{phrase}$ 类型的函数都是一个 parser,parser 不能增加或者修改后面的 Token List。

Parsing primitive phrases

  1. id 识别 Id _
  2. $ 'a 识别 Key a
  3. empty pair 一个空列表给它,多态类型:$\text{‘a list} \rightarrow \text{‘a list} \times \text{‘a list}$。

Alternative phrases

  1. ph1 || ph2:组合两个相同类型的 parser,哪个 parse 成功了就返回哪个。
  2. !!ph 为了处理 ph parser 可能产生的 SyntexError

Consecutive phrases

  1. ph1 -- ph2:先使用 ph1,剩下的交给 ph2
  2. a $-- ph:resembles $ a -- ph

Modifying the meaning

ph >> f returns (f(x), toks) instead of (x, toks)

Repetition

1
2
fun repeat ph toks = ( ph -- repeat ph >> (op::)
                      || empty ) toks;

运算优先级定义如下:

1
2
3
4
infix 6 $--;
infix 5 --;
infix 3 >>;
infix 0 ||;

所以 parser repeat ph,要么是一系列的 ph 组合,失败后则是 parser empty

The ML code of the parser

首先是 PARSE 的签名:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
signature PARSE =
sig
  exception SyntaxErr of string
  type token
  val id : token list -> string * token list
  val $  : string -> token list -> string * token list
  val empty : 'a -> 'b list * 'a
  val || : ('a -> 'b) * ('a -> 'b) -> 'a -> 'b
  val !! : ('a -> 'b * 'c) -> 'a -> 'b * 'c
  val -- : ('a -> 'b * 'c) * ('c -> 'd * 'e) -> 'a -> ('b * 'd) * 'e
  val $-- : string * (token list -> 'a * 'b) -> token list -> 'a * 'b
  val >> : ('a -> 'b * 'c) * ('b -> 'd) -> 'a -> 'd * 'c
  val repeat : ('a -> 'b * 'a) -> 'a -> 'b list * 'a
  val infixes :
      (token list -> 'a * token list) * (string -> int) *
      (string -> 'a -> 'a -> 'a) -> token list -> 'a * token list
  val reader: (token list -> 'a * 'b list) -> string -> 'a
end;

然后是实现:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
functor Parsing (Lex: LEXICAL) : PARSE =
struct
  type token = Lex.token;

  exception SyntaxErr of string;

  (*Phrase consisting of an identifier*)
  fun id (Lex.Id a :: toks) = (a,toks)
    | id toks = raise SyntaxErr "Identifier expected";

  (*Phrase consisting of the keyword 'a' *)
  fun $a (Lex.Key b :: toks) = if a=b then (a,toks)
             else raise SyntaxErr a
    | $a _ = raise SyntaxErr "Symbol expected";

  (*The empty phrase!*)
  fun empty toks = ([],toks);

  (*Alternative phrases*)
  fun (ph1 || ph2) toks = ph1 toks
        handle SyntaxErr _ => ph2 toks;

  fun !! ph toks = ph toks
      handle SyntaxErr msg => raise Fail ("Syntax error: " ^ msg);

  (*One phrase then another*)
  fun (ph1 -- ph2) toks =
      let val (x,toks2) = ph1 toks
    val (y,toks3) = ph2 toks2
      in  ((x,y), toks3)  end;

  (*Application of f to the result of a phrase*)
  fun (ph>>f) toks =
      let val (x,toks2) = ph toks
      in  (f x, toks2)  end;

  fun (a $-- ph) = ($a -- !!ph >> #2);

  (*Zero or more phrases*)
  fun repeat ph toks = (   ph -- repeat ph >> (op::)
                        || empty   ) toks;

  fun infixes (ph,prec_of,apply) =
    let fun over k toks = next k (ph toks)
        and next k (x, Lex.Key(a)::toks) =
              if prec_of a < k then (x, Lex.Key a :: toks)
              else next k ((over (prec_of a) >> apply a x) toks)
          | next k (x, toks) = (x, toks)
    in  over 0  end;

  (*Scan and parse, checking that no tokens remain*)
  fun reader ph a =
   (case ph (Lex.scan a) of
        (x, []) => x
      | (_, _::_) => raise SyntaxErr "Extra characters in phrase");

end;

Example: parsing and displaying types

一个用 Poly/ML 写的例子