-
Notifications
You must be signed in to change notification settings - Fork 348
Why TAB cycle indentation for Haskell is a hard problem
I remember when I started to use haskell-indent
and was hit by some
bugs in that indentation mode I though 'Boy, this can't be that hard
to get right!'. Wow, was I wrong! And here we are, over a year into
fixing haskell-indentation
a decent TAB-cycle Haskell indentation is
not even close. Let me describe why it is such a surprisingly hard
problem.
To make things clear note that we are talking about TAB-cycle based indentation here, that is when a user presses TAB key they should be presented with a list of syntactically valid and aesthetically pleasing indentation points for the line of code containing cursor while taking into account the surrounding context.
Whitespace is semantically significant in Haskell in the sense that some rules take it into account how to group expressions. This mechanism is called layout rule or off-side rule.
Off-side rule says that Haskell parser should add an implicit opening or closing braces or semicolons in places determined by keywords, newlines and relative amounts of indentation of lines. Informally layout rules is described in Section 2.7 of The Haskell 98 Report and formally specified in Section 9.3.
There are two major problems lurking in there.
Most rules of layout look normal but one should catch your attention as a somewhat non-standard if not hipster construct:
- insert a virtual '}' if parse-error(t) (Note 5)
What does this mean? Helpful Note 5 explains (some details omitted):
The side condition parse-error(t) is to be interpreted as follows: if the tokens generated so far together with the next token t represent an invalid prefix of the Haskell grammar, and the tokens generated so far followed by the token "}" represent a valid prefix of the Haskell grammar, then parse-error(t) is true.
To know if parse-error(t)
occurs it is required to fully implement
Haskell parser.
Let me repeat: for indentation to know when blocks close it must fully implement Haskell parser that accepts every Haskell language prefix. This is a lot to ask!
To understand how involved and against tools available in Emacs this requirement is lets look closer at the commas in the following expression:
[ e | let x, y :: String
x = "abc"
y = "def", e <- x ++ y ]
The let
keyword introduces a layout block, that is the easy part. Interestingly, the comma
between x
and y
does not produce a parse error and therefore does
not close the block. But there is another comma just after y = "def"
that can't possibly be a part of the let
construct so an implicit
closing brace is inserted just before it.
Compare this to how Python defines indentation grouping by means of lines, characters and columns, which are all concepts native to Emacs and therefore easy to implement. On top of that Python layout rules are independent of the rest of Python syntax and therefore provide a bit useful redundancy. For example statement header ends with a colon and it is expected that a block follows, but the existence of the block is determined solely by indentation.
def func():
pass
It would be much easier to provide intuitive indentation grouping if it was defined using more fundamental concepts like chars and lines instead of full Haskell grammar. It is a shame this is not the case.
According to Haskell's layout rule an implicit open brace is inserted when a token after one of layout keywords is something else than an open brace. This works with tokens on separate lines or on the same line as the opening keyword.
And this causes issues for indent-region
. In theory indent-region
could be used to fixup aesthetics of part of the code while keeping
semantics intact. Sadly, indent-region
is used also to fixup code indentation
that some other editing command broke. The most interesting example is
splice
operation from smartparens
or paredit
. Given the code:
func = (do putStrLn "abc"
putStrLn "def"
putStrLn "ghi")
Splice will remove parentheses:
func = do putStrLn "abc"
putStrLn "def"
putStrLn "ghi"
And fixup indentation by calling indent-region
on lines 2 and 3.
Sadly semantics of this program has been already broken and
there is no way for indent-region
to know how to fix it, it's too late. Note how
smoothly it would work if a newline would be required to start a new
layout block:
func = (do
putStrLn "abc"
putStrLn "def"
putStrLn "ghi")
splice:
func = do
putStrLn "abc"
putStrLn "def"
putStrLn "ghi"
fixup:
func = do
putStrLn "abc"
putStrLn "def"
putStrLn "ghi"
It is such a shame that Haskell grammar prevents usage of indent-region
in this way.
I've described but two harder issues with Haskell grammar definition and by no means it is only those two that cause headaches.
Haskell grammar make the indentation problem unnecessarily hard and
renders native Emacs concepts like indent-region
impossible to use in
practice. haskell-indentation
is an ambitious project and it tries
to support most of syntactic extensions out of the box and degrade
gracefully when a completely new syntactic construct is seen, support
for code while it is still being edited and is not yet in compilable
and support for Haskell-derived languages like Haskell+CPP,
Haskell+hsc, Haskell+c2hs, etc.
haskell-indentation
of course has competition. There is
structured-haskell-mode
that offers paredit
-style editing commands
and always keeps the code in a syntactically valid state. There is
hindent
that will standardize indentation for any valid Haskell
program it supports. Both of this approaches require shm
and hindent
to fully know the Haskell grammar and do not support Haskell-derived
language or newest extensions. Finally, there are ideas for simpler
indentation modes like ones reacting only to whitespace or only to
token boundaries.
I kind of like how haskell-indentation
works (it can be just Stockholm
syndrome, though). Recursive-descent Haskell parser that drives
haskell-indentation
reacts to key Haskell keywords and parenthesized
constructs. To have some form of graceful degradation
haskell-indentation
just slurps everything that it does not
recognize. It works quite well if a user agrees to work in the subset
of Haskell that does not present issues but can get royally confused
when an unrecognized construct comes along.
Recursive-descent parsers are easy to hack, there is an extensive test
suite for haskell-indentation
and most hairy places in the code are
already untangled. Although it looks like a never ending journey each
individual fixup is doable in finite amount of time on a weekend. So
there is way to do progress on haskell-indentation
, sadly there
always be something still left to fix.