2005-11-26
Replacement for Read class
So here's something that has annoyed me for ages: the standard Read class in Haskell. OK, so you can derive it automatically at the same time as you derive Show. But if you want to write your own instance of Show to make your data printer even slightly out of the ordinary, then you'll have to write your own instance of Read to match it. Writing an instance of Read is far more complicated than it ought to be. The list-comprehension style is OK, but having to plumb the trailing unconsumed input all the way through it is rather tedious and off-putting. This plumbing problem was solved ages ago, when someone invented monadic parser combinators.
But wait, there's another annoying aspect of Read as well: rubbish error-handling. If anything at all goes wrong - for instance you edit the string produced by show and forget a comma, or you write your instance of Read incorrectly - then boom, your program crashes with the error "no parse". No opportunity to catch the error and do something graceful. No indication of what actually went wrong. It just dies and leaves you in the dark. But wait, the monadic parser combinator people solved that problem eons ago as well!
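To be concrete, this is the failure mode I mean, using only the derived instances (a throwaway example; the exact wording of the exception depends on your implementation, but GHC's looks like this):
-- The derived Read for pairs expects exactly what Show produced, e.g. "(1,2)".
-- Lose the comma while editing the file and the whole program falls over:
oops :: (Int, Int)
oops = read "(1 2)"
-- evaluating 'oops' raises something like
--     *** Exception: Prelude.read: no parse
-- with no hint of what was wrong, or where.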
[Historical note: the Read class was originally defined (actually known as the Text class) long before any of us FP people even knew what a monad was, so we can't really blame the Haskell committee. It was a reasonable choice at the time.]
Now, I'm aware that Koen Claessen and Simon PJ have come up with a replacement for Read called ReadP, based on monadic parser combinators. It is even used internally by GHC when you derive Read. But the trouble is, their scheme is designed to get better efficiency and avoid backtracking, not to improve the error handling. The interface to the class Read remains the same. So my thought is, why not propose a complete replacement of the Read class, interface and all? Here's what I came up with:
-- | First a general parser monad, t = token type, a = result value.
--   Errors are returned through the Either type.
newtype Parser t a = P ([t] -> (Either String a, [t]))

instance Functor (Parser t)
instance Monad   (Parser t)

-- | Apply a parser to an input token sequence.
runParser :: Parser t a -> [t] -> (Either String a, [t])

-- | A TextParser is a specialisation of the standard parser monad
--   over character strings.
type TextParser a = Parser Char a

-- | The class @Parse@ is a replacement for @Read@.
class Parse a where
    parse     :: TextParser a
    parseList :: TextParser [a]

-- | If there already exists a Read instance for a type, then we can make
--   a Parser for it.
parseByRead :: Read a => String -> TextParser a
The monadic combinators still provide back-tracking, through a standard choice operator, but the key thing is that the caller of runParser gets to see whether there was an error, allowing it to do something sensible, and if there was an error, there is an explanatory message to go with it.
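For example, here is a small wrapper of my own (not part of the library, and the name parseEither is just made up) that gives a Read-like one-shot interface while keeping the error available for the caller:
-- Hypothetical helper: run a complete parse, handing back either the
-- value or the explanatory message, and discarding any leftover input.
parseEither :: Parse a => String -> Either String a
parseEither s = fst (runParser parse s)
-- so a caller can decide for itself what failure should mean:
--     case parseEither str of
--       Left msg -> putStrLn ("bad input: " ++ msg)
--       Right v  -> ...carry on with v...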
Also, defining an instance of the Parse class is dead easy. Here's an example:
data Shape =
    Circle  ShapeStyle Double
  | Polygon ShapeStyle [DoublePoint]

instance Parse Shape where
    parse = oneOf
        [ do{ isWord "Circle";  return Circle  `apply` parse `apply` parse }
        , do{ isWord "Polygon"; return Polygon `apply` parse `apply` parse }
        ] `adjustErr` (++"\nexpected a Shape (Circle,Polygon)")
Even dealing with field notation is rather simple:
data ShapeStyle = ShapeStyle
    { styleStrokeWidth :: Int
    , styleFill        :: Colour
    }

instance Parse ShapeStyle where
    parse = do{ isWord "ShapeStyle"
              ; return ShapeStyle
                    `discard` isWord "{" `apply` field "styleStrokeWidth"
                    `discard` isWord "," `apply` field "styleFill"
                    `discard` isWord "}"
              }
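To round off the example, here is the sort of driver a caller can now write (a hypothetical loadShape of my own, not part of HaXml, assuming Colour has a Parse instance too): it reports a failure and carries on, rather than dying with "no parse".
-- Read a Shape back from a file, without crashing on malformed input.
loadShape :: FilePath -> IO (Maybe Shape)
loadShape file =
    do{ s <- readFile file
      ; case runParser parse s of
          (Left msg, _) -> do{ putStrLn ("could not parse "++file++": "++msg)
                             ; return Nothing }
          (Right v,  _) -> return (Just v)
      }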
I've been using this Parse class for a few weeks now, and it is a big improvement over Read, so much so that I'm thinking of proposing it (or something similar) for Haskell-prime, the new version of the language standard. The actual set of parser combinators used to underlie the class is not so important. I'm using my own recently developed set, called Text.ParserCombinators.Poly, based originally on the Hutton-Meijer combinators from 1996. But that is just because I'm most familiar with them. I'm sure Daan Leijen's Parsec combinators would be just as good a choice.
For now, both Poly and TextParser are distributed in HaXml-1.14 (also available in fptools CVS), and will no doubt continue to evolve over the next few weeks. Once they are fairly settled, I might distribute them as a separate package, not tied to HaXml.
2005-11-17
Bytecode experiments
I have another undergraduate starting a project this week. Originally he was to look at replacing the spineless G-machine bytecode used in nhc98 with a three-address instruction code. That is still the plan, only now he might use yhc as a platform instead.
The idea is simple. Standard G-machine instructions use the stack heavily as temporary storage. But this is potentially bad for performance, because of all the memory accesses. Using registers for temporary storage would potentially be much faster. This is what native-code compilers do of course. But I would like to continue to have a portable bytecode, running on a small interpretive virtual machine. So part of the problem will be to map temporaries to real registers in the VM itself.
However, I expect that to be relatively straightforward, and the major work will be defining and generating a register-based bytecode. I'm thinking of translating the already-generated G-code to register code by simulating the stack.
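To illustrate the kind of translation I have in mind, here is a toy sketch (a made-up instruction set, nothing like the real spineless G-machine code): the translator keeps a compile-time stack of register names in place of the run-time stack of values, and invents a fresh register for every intermediate result.
data StackCode = Push Int | Add                 -- toy stack-based code
data RegCode   = LoadI Reg Int                  -- r := literal
               | AddR  Reg Reg Reg              -- r := r1 + r2
                 deriving Show
type Reg       = Int

-- Translate by simulating the stack: it holds register names, not values.
translate :: [StackCode] -> [RegCode]
translate = go 0 []
  where
    go _    _         []              = []
    go next stk       (Push n : rest) = LoadI next n
                                          : go (next+1) (next:stk) rest
    go next (a:b:stk) (Add    : rest) = AddR next a b
                                          : go (next+1) (next:stk) rest
    go _    _         (Add    : _)    = error "stack underflow"
For instance, translate [Push 1, Push 2, Add] gives [LoadI 0 1, LoadI 1 2, AddR 2 1 0]. The harder part in the real thing is then mapping those unlimited virtual registers onto whatever small set of real temporaries the VM can afford.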
Ultimately, I'm hoping for a decent speed-up, maybe 2x or better. But I guess there are quite a few pitfalls that could mean we only get an improvement of, say, 20%. Or maybe it will go spectacularly well and we'll get a 3x improvement. We'll find out in a few months' time.