Why TAB cycle indentation for Haskell is a hard problem

I remember when I started to use haskell-indent and was hit by some bugs in that indentation mode I though 'Boy, this can't be that hard to get right!'. Wow, was I wrong! And here we are, over a year into fixing haskell-indentation a decent TAB-cycle Haskell indentation is not even close. Let me describe why it is such a surprisingly hard problem.

To make things clear note that we are talking about TAB-cycle based indentation here, that is when a user presses TAB key they should be presented with a list of syntactically valid and aesthetically pleasing indentation points for the line of code containing cursor while taking into account the surrounding context.

Whitespace is semantically significant in Haskell in the sense that some rules take it into account how to group expressions. This mechanism is called layout rule or off-side rule.

Off-side rule says that Haskell parser should add an implicit opening or closing braces or semicolons in places determined by keywords, newlines and relative amounts of indentation of lines. Informally layout rules is described in Section 2.7 of The Haskell 98 Report and formally specified in Section 9.3.

There are two major problems lurking in there.

Context sensitive implicit blocks

Most rules of layout look normal but one should catch your attention as a somewhat non-standard if not hipster construct:

insert a virtual '}' if parse-error(t) (Note 5)

What does this mean? Helpful Note 5 explains (some details omitted):

The side condition parse-error(t) is to be interpreted as follows: if the tokens generated so far together with the next token t represent an invalid prefix of the Haskell grammar, and the tokens generated so far followed by the token "}" represent a valid prefix of the Haskell grammar, then parse-error(t) is true.

To know if parse-error(t) occurs it is required to fully implement Haskell parser.

Let me repeat: for indentation to know when blocks close it must fully implement Haskell parser that accepts every Haskell language prefix. This is a lot to ask!

To understand how involved and against tools available in Emacs this requirement is lets look closer at the commas in the following expression:

   [ e | let x, y :: String
             x = "abc"
             y = "def", e <- x ++ y ]

The let keyword introduces a layout block, that is the easy part. Interestingly, the comma between x and y does not produce a parse error and therefore does not close the block. But there is another comma just after y = "def" that can't possibly be a part of the let construct so an implicit closing brace is inserted just before it.

Compare this to how Python defines indentation grouping by means of lines, characters and columns, which are all concepts native to Emacs and therefore easy to implement. On top of that Python layout rules are independent of the rest of Python syntax and therefore provide a bit useful redundancy. For example statement header ends with a colon and it is expected that a block follows, but the existence of the block is determined solely by indentation.

def func():
    pass

It would be much easier to provide intuitive indentation grouping if it was defined using more fundamental concepts like chars and lines instead of full Haskell grammar. It is a shame this is not the case.

Implicit blocks open on the same line

According to Haskell's layout rule an implicit open brace is inserted when a token after one of layout keywords is something else than an open brace. This works with tokens on separate lines or on the same line as the opening keyword.

And this causes issues for indent-region. In theory indent-region could be used to fixup aesthetics of part of the code while keeping semantics intact. Sadly, indent-region is used also to fixup code indentation that some other editing command broke. The most interesting example is splice operation from smartparens or paredit. Given the code:

func = (do putStrLn "abc"
           putStrLn "def"
           putStrLn "ghi")

Splice will remove parentheses:

func = do putStrLn "abc"
           putStrLn "def"
           putStrLn "ghi"

And fixup indentation by calling indent-region on lines 2 and 3. Sadly semantics of this program has been already broken and there is no way for indent-region to know how to fix it, it's too late. Note how smoothly it would work if a newline would be required to start a new layout block:

func = (do
            putStrLn "abc"
            putStrLn "def"
            putStrLn "ghi")

splice:

func = do
            putStrLn "abc"
            putStrLn "def"
            putStrLn "ghi"

fixup:

func = do
    putStrLn "abc"
    putStrLn "def"
    putStrLn "ghi"

It is such a shame that Haskell grammar prevents usage of indent-region in this way.

Where do we go from here?

I've described but two harder issues with Haskell grammar definition and by no means it is only those two that cause headaches.

Haskell grammar make the indentation problem unnecessarily hard and renders native Emacs concepts like indent-region impossible to use in practice. haskell-indentation is an ambitious project and it tries to support most of syntactic extensions out of the box and degrade gracefully when a completely new syntactic construct is seen, support for code while it is still being edited and is not yet in compilable and support for Haskell-derived languages like Haskell+CPP, Haskell+hsc, Haskell+c2hs, etc.

haskell-indentation of course has competition. There is structured-haskell-mode that offers paredit-style editing commands and always keeps the code in a syntactically valid state. There is hindent that will standardize indentation for any valid Haskell program it supports. Both of this approaches require shm and hindent to fully know the Haskell grammar and do not support Haskell-derived language or newest extensions. Finally, there are ideas for simpler indentation modes like ones reacting only to whitespace or only to token boundaries.

I kind of like how haskell-indentation works (it can be just Stockholm syndrome, though). Recursive-descent Haskell parser that drives haskell-indentation reacts to key Haskell keywords and parenthesized constructs. To have some form of graceful degradation haskell-indentation just slurps everything that it does not recognize. It works quite well if a user agrees to work in the subset of Haskell that does not present issues but can get royally confused when an unrecognized construct comes along.

Recursive-descent parsers are easy to hack, there is an extensive test suite for haskell-indentation and most hairy places in the code are already untangled. Although it looks like a never ending journey each individual fixup is doable in finite amount of time on a weekend. So there is way to do progress on haskell-indentation, sadly there always be something still left to fix.

Why TAB cycle indentation for Haskell is a hard problem

Context sensitive implicit blocks

Implicit blocks open on the same line

Where do we go from here?

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally