Dr Tom Ridge
2014-09-14
briefly discuss mechanized verification, and what verified parsing might mean
what has been done so far
the real world?
future plans
Informal (eg Wikipedia): For all x, if x is an even number greater than 2, then it can be expressed as the sum of two primes.
Formal (eg FOL): ∀ x. even(x) ∧ x > 2 → ∃ y z. prime(y) ∧ prime(z) ∧ (x = y + z)
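A statement in this formal shape is also something a machine can work with directly. As a hedged illustration (my own sketch, not part of the talk), here is OCaml that checks the statement on small instances. This is testing, not proof: the conjecture itself is open.

```ocaml
(* Naive trial-division primality test (fine for small n). *)
let is_prime n =
  if n < 2 then false
  else
    let rec loop d = d * d > n || (n mod d <> 0 && loop (d + 1)) in
    loop 2

(* holds x: the body of the formal statement above, for a single x.
   Odd x, and x <= 2, satisfy it vacuously. *)
let holds x =
  if x mod 2 <> 0 || x <= 2 then true
  else
    let rec loop y =
      y <= x && ((is_prime y && is_prime (x - y)) || loop (y + 1))
    in
    loop 2

(* e.g. holds 28 = true, via 28 = 5 + 23 *)
```

A proof assistant such as HOL or Coq goes further: rather than checking instances, it manipulates the quantified statement itself.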
General logics such as HOL and ZFC can express most formal systems; in particular, you can define programs and reason about them
Many verifiers have the same dream
The dream is: a perfect world, or at least, a world where computers function perfectly, code never goes wrong etc.
This is unrealistic, but it provides a strong motivation: building a paradise on earth etc
Each system component must be verified. People talk about the "verified stack", consisting of verified OS, verified networking and file systems, verified compilers, verified runtimes etc.
Why? Parsing is an integral part of many systems.
Example application: system security often depends on checking inputs to the system, which could be done in a general way using parsing.
So a verified parser could be a key component of a lot of verified systems.
Verified parsing consists of using a machine to prove "some theorem" about "some code" (a parser or parser generator)
The question is: what is the theorem, and what is the code?
For the code: could be a parser generator, or generated parser, or parser for a particular language, or ...
Words such as "correct", "sound" and "complete" are used a lot, and with different meanings
A weak statement:
A stronger statement(?):
A strong statement:
grammar_to_parser in Fig. 3
HOL proofs available from github, via here
OCaml code available in the P1 module from the P3 distribution; defns sufficiently short (~40 lines) that the code can be ported to your favourite functional language in a very short time
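To illustrate the kind of statements at stake (a toy formulation of my own, not the CPP'11 theorems), consider the grammar S -> "(" S ")" | "" and a trivial parser for it:

```ocaml
(* Parse trees for S -> "(" S ")" | "" : Node t derives "(" ^ yield t ^ ")". *)
type tree = Leaf | Node of tree

(* yield t is the string that parse tree t derives *)
let rec yield = function
  | Leaf -> ""
  | Node t -> "(" ^ yield t ^ ")"

(* Toy parser: accepts k open parens followed by k close parens. *)
let parse (s : string) : tree option =
  let n = String.length s in
  if n mod 2 <> 0 then None
  else begin
    let k = n / 2 in
    let ok = ref true in
    String.iteri
      (fun i c -> if (i < k && c <> '(') || (i >= k && c <> ')') then ok := false)
      s;
    if not !ok then None
    else
      let rec build d = if d = 0 then Leaf else Node (build (d - 1)) in
      Some (build k)
  end

(* A weak statement ("sound"):   parse s = Some t  implies  yield t = s.
   A strong statement ("complete"): parse s = None  implies no tree yields s.
   A verified parser comes with machine-checked *proofs* of such facts;
   here we can only test them on examples. *)
let sound_on s = match parse s with Some t -> yield t = s | None -> true
```

The gap between "sound" and "complete" is exactly the gap between the weak and strong theorems above: a parser that always returns None is trivially sound.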
As a starting point, we should really aim for: a verified deterministic parser (that is actually useful, say, LALR(1) or LR(1)) and a verified general parser (again, that is actually useful, say, Earley)
(Ridge, 2011) is obviously great (!), but it isn't either of these.
For general parsing: CYK has been verified, but it is not really useful
What about deterministic parsers?
"TRX: A Formally Verified Parser Interpreter" is described here. This is for PEGs. Not publicly available (?)
Availability also seems to be a problem with the similar Ynot packrat PEG parser.
Also of interest is the work on verifying a parser for a C compiler by Jourdan, Pottier and Leroy. This is for an LR(1) parser generator. Distributed as part of the Compcert verified compiler distribution, but specialized to C (?)
So it seems that actually using these to parse your favourite grammar is difficult
And as mentioned, the theorems are often not what you might want
A bit disappointing
A verified parser generator for deterministic parsing (eg for LR(1)), and a verified parser for general CFGs (eg Earley; I am currently working on this)
For both, strong correctness theorems
Usable real-world implementations with the expected performance characteristics
Verified performance wrt some abstract machine
The real-world is a complex place!
Performance (theoretical, and real-world), memory consumption, user interface, debugging information, interaction with input (eg does the input string need to be completely in memory?), memory access patterns, interaction with GC, etc etc
In particular, performance (theoretical and real-world) is very important
Performance essentially the same as with traditional combinator parsing (on the grammars where traditional combinators terminate)
Can handle all grammars (good)... but left-recursive grammars are slow to parse (bad)
One of the aims of the CPP'11 work (beyond verification) was to allow the user to write a parser for whatever grammar they want, without having to worry about the implementation (no need to fiddle with the grammar etc)
But if performance is poor on some grammars, then this goal hasn't been met - users do worry about whether the grammar will result in poor performance
Combine combinator parsing interface with an Earley parser backend. This is the subject of the SLE talk.
Basic approach: allow the user to write combinator parsers as before. Extract a grammar from this, perform the parse using an Earley parser, and then use the Earley parse results to guide the action phase.
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> mkntparser "E" (
    ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
    |||| ((a "1") >>>> (fun _ -> `LF("1")))
    |||| ((a "") >>>> (fun _ -> `LF(""))))
  i)

let results = run_parser parse_E "1111111111"
This approach combines the flexibility of combinator parsing with the performance of Earley parsing. Easy to port to other languages.
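To make the backend concrete, here is a minimal Earley recogniser in OCaml. This is a textbook sketch (with my own names for the types and functions), not P3's actual implementation, which needs far more careful engineering; note that it handles left recursion and empty productions, where naive combinator parsing loops.

```ocaml
(* Symbols: nonterminals are named, terminals are single characters. *)
type symbol = NT of string | TM of char

(* An Earley item: nt -> rhs, with a dot position, started at input
   position [start]. *)
type item = { nt : string; rhs : symbol list; dot : int; start : int }

(* g maps each nonterminal to its productions; every NT used must be bound. *)
let earley (g : (string * symbol list list) list) (start_nt : string)
    (input : string) : bool =
  let n = String.length input in
  let chart = Array.make (n + 1) [] in
  let add k it =
    if List.mem it chart.(k) then false
    else (chart.(k) <- it :: chart.(k); true)
  in
  (* seed with the start nonterminal's productions *)
  List.iter (fun rhs -> ignore (add 0 { nt = start_nt; rhs; dot = 0; start = 0 }))
    (List.assoc start_nt g);
  for k = 0 to n do
    (* iterate to a fixpoint at position k (also handles empty productions) *)
    let changed = ref true in
    while !changed do
      changed := false;
      List.iter
        (fun it ->
          let next =
            if it.dot < List.length it.rhs then Some (List.nth it.rhs it.dot)
            else None
          in
          match next with
          | Some (NT b) ->
              (* predictor: expect b starting at position k *)
              List.iter
                (fun rhs ->
                  if add k { nt = b; rhs; dot = 0; start = k } then changed := true)
                (List.assoc b g)
          | Some (TM c) ->
              (* scanner: consume input.[k] *)
              if k < n && input.[k] = c then
                ignore (add (k + 1) { it with dot = it.dot + 1 })
          | None ->
              (* completer: it.nt fully parsed from it.start to k *)
              List.iter
                (fun w ->
                  if w.dot < List.length w.rhs
                     && List.nth w.rhs w.dot = NT it.nt
                  then if add k { w with dot = w.dot + 1 } then changed := true)
                chart.(it.start))
        chart.(k)
    done
  done;
  List.exists
    (fun it -> it.nt = start_nt && it.start = 0 && it.dot = List.length it.rhs)
    chart.(n)

(* The grammar from the combinator example above:
   earley g "E" "1111111111" = true, earley g "E" "2" = false. *)
let g = [ ("E", [ [ NT "E"; NT "E"; NT "E" ]; [ TM '1' ]; [] ]) ]
```

The list-based chart keeps the sketch short; a realistic implementation replaces it with indexed sets, which is where the engineering discussed below begins.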
What about the real world?
Real-world performance is hard to achieve. For example, CYK is O(n^3) theoretically, but for an input of 10^6 bytes (~1 MB) we may need an array of (the order of) 10^12 bytes (~1000 GB); this is a lot of main memory for a relatively small input.
My observations when engineering Earley for real-world performance:
Achieving the O(n^3) bound was difficult. Data structures exist that give O(n^3) performance, but they aren't feasible for large inputs - we must accept O(n^3 ln n) performance in the worst case, and smoothly transition from one bound to the other.
O(n^3) runtime is not acceptable for grammars which users expect to be parsed in linear time (another reason CYK is not suitable for the real world). We have to make the parser perform well on all the different subclasses. This is a difficult thing to achieve.
Compare to Happy on 5 sample small grammars
On 2 of the 5, P3 is much much better
E_EEE: E -> E E E | "1" | ""
Grammar | Input size | Happy/s | P3/s
E_EEE   | 020        | 0.14    | 0.10
E_EEE   | 040        | 6.87    | 0.13
E_EEE   | 100        | 2535.88 | 0.50
On 2 others, P3 is better, but only by a factor of 10 or so. On the remaining one, Happy appeared to loop for non-empty inputs.
See here for comparison with Happy
I think these are very good results; getting these results took an immense amount of time
This talk is really just a very brief overview of the kind of things that have been done, some things I'm working on, and the places we might want to go
Disclaimer: I am not a parsing person, I am a formal methods person, so I need enlightening!
Please come and tell me lots of interesting stuff about parsing, what the problems are, and where you think we need to go!