| 1 | + |
| 2 | +# MD4C Readme |
| 3 | + |
| 4 | +* Home: http://github.com/mity/md4c |
| 5 | +* Wiki: http://github.com/mity/md4c/wiki |
| 6 | +* Issue tracker: http://github.com/mity/md4c/issues |
| 7 | + |
| 8 | +MD4C stands for "Markdown for C" and that's exactly what this project is about. |
| 9 | + |
| 10 | + |
| 11 | +## What is Markdown |
| 12 | + |
| 13 | +In short, Markdown is the markup language this `README.md` file is written in. |
| 14 | + |
| 15 | +The following resources can explain more if you are unfamiliar with it: |
| 16 | +* [Wikipedia article](http://en.wikipedia.org/wiki/Markdown) |
| 17 | +* [CommonMark site](http://commonmark.org) |
| 18 | + |
| 19 | + |
| 20 | +## What is MD4C |
| 21 | + |
| 22 | +MD4C is a Markdown parser implementation in C, with the following features: |
| 23 | + |
| 24 | +* **Compliance:** Generally, MD4C aims to be compliant with the latest version of the |
| 25 | + [CommonMark specification](http://spec.commonmark.org/). Currently, we are |
| 26 | + fully compliant with CommonMark 0.31. |
| 27 | + |
| 28 | +* **Extensions:** MD4C supports some commonly requested and accepted extensions. |
| 29 | + See below. |
| 30 | + |
| 31 | +* **Performance:** MD4C is [very fast](https://talk.commonmark.org/t/2520). |
| 32 | + |
| 33 | +* **Compactness:** The MD4C parser is implemented in one source file and one |
| 34 | + header file. There are no dependencies other than the standard C library. |
| 35 | + |
| 36 | +* **Embedding:** The MD4C parser is easy to reuse in other projects. Its API is |
| 37 | + very straightforward: there is actually just one function, `md_parse()`. |
| 38 | + |
| 39 | +* **Push model:** MD4C parses the complete document and calls a few callback |
| 40 | + functions provided by the application to inform it about the start/end of |
| 41 | + every block, the start/end of every span, and any textual content. |
| 42 | + |
| 43 | +* **Portability:** MD4C builds and works on Windows and POSIX-compliant OSes. |
| 44 | + (It should be simple to make it run on most other platforms too, as long as |
| 45 | + the platform provides the C standard library, including heap memory |
| 46 | + management.) |
| 47 | + |
| 48 | +* **Encoding:** MD4C by default expects UTF-8 encoding of the input document. |
| 49 | + But it can be compiled to recognize ASCII-only control characters (i.e. to |
| 50 | + disable all Unicode-specific code), or (on Windows) to expect UTF-16 (i.e. |
| 51 | + what is on Windows commonly called just "Unicode"). See more details below. |
| 52 | + |
| 53 | +* **Permissive license:** MD4C is available under the [MIT license](LICENSE.md). |
| 54 | + |
| 55 | + |
| 56 | +## Using MD4C |
| 57 | + |
| 58 | +### Parsing Markdown |
| 59 | + |
| 60 | +If you just need to parse a Markdown document, include `md4c.h` and link |
| 61 | +against the MD4C library (`-lmd4c`); alternatively, add `md4c.[hc]` directly |
| 62 | +to your code base, as the parser is implemented in a single C source |
| 63 | +file. |
| 64 | + |
| 65 | +The main function provided is `md_parse()`. It takes text in Markdown |
| 66 | +syntax and a pointer to a structure which provides pointers to several callback |
| 67 | +functions. |
| 68 | + |
| 69 | +As `md_parse()` processes the input, it calls the callbacks (when entering or |
| 70 | +leaving any Markdown block or span, and when outputting any textual content of |
| 71 | +the document), allowing the application to convert it into another format or |
| 72 | +render it onto the screen. |
| 73 | + |
| 74 | + |
| 75 | +### Converting to HTML |
| 76 | + |
| 77 | +If you need to convert Markdown to HTML, include `md4c-html.h` and link against |
| 78 | +the MD4C-HTML library (`-lmd4c-html`); alternatively, add the sources `md4c.[hc]`, |
| 79 | +`md4c-html.[hc]` and `entity.[hc]` to your code base. |
| 80 | + |
| 81 | +To convert Markdown input, call the `md_html()` function. It takes the Markdown |
| 82 | +input and calls the provided callback function, which is fed chunks of the |
| 83 | +HTML output. A typical callback implementation just appends the chunks to a |
| 84 | +buffer or writes them to a file. |
| 85 | + |
| 86 | + |
| 87 | +## Markdown Extensions |
| 88 | + |
| 89 | +The default behavior is to recognize only Markdown syntax defined by the |
| 90 | +[CommonMark specification](http://spec.commonmark.org/). |
| 91 | + |
| 92 | +However, with appropriate flags, the behavior can be tuned to enable some |
| 93 | +extensions: |
| 94 | + |
| 95 | +* With the flag `MD_FLAG_COLLAPSEWHITESPACE`, any non-trivial whitespace is |
| 96 | + collapsed into a single space. |
| 97 | + |
| 98 | +* With the flag `MD_FLAG_TABLES`, GitHub-style tables are supported. |
| 99 | + |
| 100 | +* With the flag `MD_FLAG_TASKLISTS`, GitHub-style task lists are supported. |
| 101 | + |
| 102 | +* With the flag `MD_FLAG_STRIKETHROUGH`, strike-through spans are enabled |
| 103 | + (text enclosed in tilde marks, e.g. `~foo bar~`). |
| 104 | + |
| 105 | +* With the flag `MD_FLAG_PERMISSIVEURLAUTOLINKS`, permissive URL autolinks |
| 106 | + (not enclosed in `<` and `>`) are supported. |
| 107 | + |
| 108 | +* With the flag `MD_FLAG_PERMISSIVEEMAILAUTOLINKS`, permissive e-mail |
| 109 | + autolinks (not enclosed in `<` and `>`) are supported. |
| 110 | + |
| 111 | +* With the flag `MD_FLAG_PERMISSIVEWWWAUTOLINKS`, permissive WWW autolinks |
| 112 | + without any scheme specified (e.g. `www.example.com`) are supported. MD4C |
| 113 | + then assumes `http:` scheme. |
| 114 | + |
| 115 | +* With the flag `MD_FLAG_LATEXMATHSPANS`, LaTeX math spans (`$...$`) and |
| 116 | + LaTeX display math spans (`$$...$$`) are supported. (Note though that the |
| 117 | + HTML renderer outputs them verbatim in a custom tag `<x-equation>`.) |
| 118 | + |
| 119 | +* With the flag `MD_FLAG_WIKILINKS`, wiki-style links (`[[link label]]` and |
| 120 | + `[[target article|link label]]`) are supported. (Note that the HTML renderer |
| 121 | + outputs them in a custom tag `<x-wikilink>`.) |
| 122 | + |
| 123 | +* With the flag `MD_FLAG_UNDERLINE`, underscore (`_`) denotes an underline |
| 124 | + instead of an ordinary emphasis or strong emphasis. |
| 125 | + |
| 126 | +A few features of CommonMark (those that some people see as misfeatures) may be |
| 127 | +disabled with the following flags: |
| 128 | + |
| 129 | +* With the flag `MD_FLAG_NOHTMLSPANS` or `MD_FLAG_NOHTMLBLOCKS`, raw inline |
| 130 | + HTML or raw HTML blocks respectively are disabled. |
| 131 | + |
| 132 | +* With the flag `MD_FLAG_NOINDENTEDCODEBLOCKS`, indented code blocks are |
| 133 | + disabled. |
| 134 | + |
| 135 | + |
| 136 | +## Input/Output Encoding |
| 137 | + |
| 138 | +The CommonMark specification declares that any sequence of Unicode code points |
| 139 | +is a valid CommonMark document. |
| 140 | + |
| 141 | +However, on closer inspection, Unicode plays a role in only a few very specific |
| 142 | +situations when parsing Markdown documents: |
| 143 | + |
| 144 | +1. For detection of word boundaries when processing emphasis and strong |
| 145 | + emphasis, some classification of Unicode characters (whether it is |
| 146 | + a whitespace or a punctuation) is needed. |
| 147 | + |
| 148 | +2. For (case-insensitive) matching of a link reference label with the |
| 149 | + corresponding link reference definition, Unicode case folding is used. |
| 150 | + |
| 151 | +3. For translating HTML entities (e.g. `&amp;`) and numeric character |
| 152 | + references (e.g. `&#35;` or `&#xcab;`) into their Unicode equivalents. |
| 153 | + |
| 154 | + Note, however, that MD4C leaves this translation to the renderer/application, |
| 155 | + as the renderer is the one that knows the output encoding and whether it |
| 156 | + actually needs to perform this kind of translation. (For example, when the |
| 157 | + renderer outputs HTML, it may leave the entities untranslated and defer the |
| 158 | + work to a web browser.) |
| 159 | + |
| 160 | +MD4C relies on this property of CommonMark and the implementation is, to |
| 161 | +a large degree, encoding-agnostic. Most of the MD4C code only assumes that the |
| 162 | +encoding of your choice is compatible with ASCII, i.e. that the code points |
| 163 | +below 128 have the same numeric values as in ASCII. |
| 164 | + |
| 165 | +Any input MD4C does not understand is simply seen as part of the document text |
| 166 | +and sent to the renderer's callback functions unchanged. |
| 167 | + |
| 168 | +The two situations (word boundary detection and link reference matching) where |
| 169 | +MD4C has to understand Unicode are handled according to the following |
| 170 | +preprocessor macros (as defined at the time MD4C is being built): |
| 171 | + |
| 172 | +* If the preprocessor macro `MD4C_USE_UTF8` is defined, MD4C assumes UTF-8 for the |
| 173 | + word boundary detection and for the case-insensitive matching of link labels. |
| 174 | + |
| 175 | + When none of these macros is explicitly used, this is the default behavior. |
| 176 | + |
| 177 | +* On Windows, if the preprocessor macro `MD4C_USE_UTF16` is defined, MD4C uses |
| 178 | + `WCHAR` instead of `char` and assumes UTF-16 encoding in those situations. |
| 179 | + (UTF-16 is what Windows developers usually call just "Unicode" and what |
| 180 | + the Win32 API generally works with.) |
| 181 | + |
| 182 | + Note that because this macro also affects the types in `md4c.h`, you have |
| 183 | + to define the macro both when building MD4C and when including |
| 184 | + `md4c.h`. |
| 185 | + |
| 186 | + Also note this is only supported in the parser (`md4c.[hc]`). The HTML |
| 187 | + renderer does not support this and you will have to write your own custom |
| 188 | + renderer to use this feature. |
| 189 | + |
| 190 | +* If the preprocessor macro `MD4C_USE_ASCII` is defined, MD4C assumes nothing but |
| 191 | + an ASCII input. |
| 192 | + |
| 193 | + That effectively means that non-ASCII whitespace or punctuation characters |
| 194 | + won't be recognized as such and that link reference matching will work in |
| 195 | + a case-insensitive way only for ASCII letters (`[a-zA-Z]`). |
| 196 | + |
| 197 | + |
| 198 | +## Documentation |
| 199 | + |
| 200 | +The API of the parser is quite well documented in the comments in `md4c.h`. |
| 201 | +Similarly, the Markdown-to-HTML API is described in its header `md4c-html.h`. |
| 202 | + |
| 203 | +There is also a [project wiki](http://github.com/mity/md4c/wiki) which provides |
| 204 | +some more comprehensive documentation. Note, however, that it is incomplete and some |
| 205 | +details may be somewhat outdated. |
| 206 | + |
| 207 | + |
| 208 | +## FAQ |
| 209 | + |
| 210 | +**Q: How does MD4C compare to other Markdown parsers?** |
| 211 | + |
| 212 | +**A:** First, some other implementations combine the Markdown parser and HTML |
| 213 | +generator into a single entangled piece of code hidden behind an interface which |
| 214 | +only allows conversion from Markdown to HTML. They are often unusable if you |
| 215 | +want to process the input in any other way. |
| 216 | + |
| 217 | +Second, most parsers (if not all of them; at least within the scope of the C/C++ |
| 218 | +languages) are full DOM-like parsers: They construct an abstract syntax tree (AST) |
| 219 | +representation of the whole Markdown document. That takes time and leads to |
| 220 | +a bigger memory footprint. |
| 221 | + |
| 222 | +Building an AST is completely fine as long as you need it. If you don't, there is |
| 223 | +a very high chance that using MD4C will be substantially faster and less hungry |
| 224 | +in terms of memory consumption. |
| 225 | + |
| 226 | +Last but not least, some Markdown parsers are implemented in a naive way. When |
| 227 | +fed with a [smartly crafted input pattern](test/pathological_tests.py), they |
| 228 | +may exhibit quadratic (or even worse) parsing times. What MD4C can still parse |
| 229 | +in a fraction of a second may turn into long minutes or possibly hours with them. |
| 230 | +Hence, when such a naive parser is used to process an input from an untrusted |
| 231 | +source, the possibility of denial-of-service attacks becomes a real danger. |
| 232 | + |
| 233 | +A lot of our effort went into providing linear parsing times no matter what |
| 234 | +kind of crazy input the MD4C parser is fed with. (If you encounter an input pattern |
| 235 | +which leads to super-linear parsing times, please do not hesitate to report it |
| 236 | +as a bug.) |
| 237 | + |
| 238 | +**Q: Does MD4C perform any input validation?** |
| 239 | + |
| 240 | +**A:** No. And we are proud of it. :-) |
| 241 | + |
| 242 | +The CommonMark specification states that any sequence of Unicode characters is |
| 243 | +a valid Markdown document. (In practice, this more or less always means UTF-8 |
| 244 | +encoding.) |
| 245 | + |
| 246 | +In other words, according to the specification, it does not matter whether some |
| 247 | +Markdown syntax construction is in some way broken or not. If it's broken, it |
| 248 | +won't be recognized and the parser should see it just as a verbatim text. |
| 249 | + |
| 250 | +MD4C takes this a step further: It sees any sequence of bytes as a valid input, |
| 251 | +following completely the GIGO philosophy (garbage in, garbage out). I.e. any |
| 252 | +ill-formed UTF-8 byte sequence will propagate to the respective callback as |
| 253 | +a part of the text. |
| 254 | + |
| 255 | +If you need to validate that the input is, say, a well-formed UTF-8 document, |
| 256 | +you have to do it on your own. The easiest way to do this is to simply |
| 257 | +validate the whole document before passing it to the MD4C parser. |
| 258 | + |
| 259 | + |
| 260 | +## License |
| 261 | + |
| 262 | +MD4C is covered by the MIT license; see the file `LICENSE.md`. |
| 263 | + |
| 264 | + |
| 265 | +## Links to Related Projects |
| 266 | + |
| 267 | +Ports and bindings to other languages: |
| 268 | + |
| 269 | +* [commonmark-d](https://github.com/AuburnSounds/commonmark-d): |
| 270 | + Port of MD4C to D language. |
| 271 | + |
| 272 | +* [markdown-wasm](https://github.com/rsms/markdown-wasm): |
| 273 | + Port of MD4C to WebAssembly. |
| 274 | + |
| 275 | +* [PyMD4C](https://github.com/dominickpastore/pymd4c): |
| 276 | + Python bindings for MD4C. |
| 277 | + |
| 278 | +Software using MD4C: |
| 279 | + |
| 280 | +* [imgui_md](https://github.com/mekhontsev/imgui_md): |
| 281 | + Markdown renderer for [Dear ImGui](https://github.com/ocornut/imgui). |
| 282 | + |
| 283 | +* [MarkDown Monolith Assembler](https://github.com/1Hyena/mdma): |
| 284 | + A command line tool for building browser-based books. |
| 285 | + |
| 286 | +* [QOwnNotes](https://www.qownnotes.org/): |
| 287 | + A plain-text file notepad and todo-list manager with markdown support and |
| 288 | + ownCloud / Nextcloud integration. |
| 289 | + |
| 290 | +* [Qt](https://www.qt.io/): |
| 291 | + Cross-platform C++ GUI framework. |
| 292 | + |
| 293 | +* [Textosaurus](https://github.com/martinrotter/textosaurus): |
| 294 | + Cross-platform text editor based on Qt and Scintilla. |
| 295 | + |
| 296 | +* [8th](https://8th-dev.com/): |
| 297 | + Cross-platform concatenative programming language. |