Writing a parser for Textile in Gleam
Textile
Textile is a markup language, just like Markdown and its extensions, AsciiDoc, TeX, troff, etc. You can check out GingerBill's post about the topic, an interesting read. The Textile website describes the language as such:
Textile is a markup language (like Markdown) for formatting text in a blog or a content management system (CMS). Textile has been around since 2002, and implementations exist for many major CMS.
What was interesting to me when I came across Textile was that I had never seen anyone use it, or even mention it at all, at least not as much as Markdown, which has become the universal default for READMEs, docs, and even LLM outputs. So I looked into it, and the syntax felt like a breath of fresh air. It also felt similar to some parts of GML, which I only looked into properly after reading GingerBill's post.
Looking briefly at the ecosystem on the Textile GitHub org, I found a spec from 18 years ago and some parser implementations written in PHP and Python, neither of which my blog was going to depend on. I took the existing test fixtures (all in YAML) from the PHP parser, as well as the original spec repo, and used them to form the basis of the testing for my Gleam parser. So far the fixtures have been really helpful, but I have yet to pass a large percentage of them, partly because I am taking a different approach to rendering compared to the spec, and partly because I am outright parsing incorrectly in some places.
On the previous version of this website I used Markdoc. While nice, I didn't like that I had to rely on JS, and it felt too finicky to extend for my liking, since I had to write FFI bindings between JavaScript and Gleam.
My goal with this new iteration (particularly the first step) was simple: write a parser and migrate my minimal content from the existing Markdoc content to Textile, then build the SSG part with a system for defining functions and includes. So far I have accomplished some of that.
Textile syntax at a glance
| Content | Textile | HTML | Markdown |
|---|---|---|---|
| Paragraph | p. paragraph | <p>paragraph</p> | paragraph |
| Heading 1 | h1. heading | <h1>heading</h1> | # heading |
| Heading 2 | h2. heading | <h2>heading</h2> | ## heading |
| Heading 3 | h3. heading | <h3>heading</h3> | ### heading |
| Bold | **bold** | <b>bold</b> | **bold** |
| Strong | *strong* | <strong>strong</strong> | **strong** |
| Italic | __italic__ | <i>italic</i> | *italic* |
| Emphasis | _emphasis_ | <em>emphasis</em> | *emphasis* |
| Deleted | -deleted- | <del>deleted</del> | ~~deleted~~ |
| Inserted | +inserted+ | <ins>inserted</ins> | N/A |
| Superscript | text ^super^ | text <sup>super</sup> | N/A |
| Subscript | text ~sub~ | text <sub>sub</sub> | N/A |
| Inline code | @code@ | <code>code</code> | `code` |
| Citation | ??citation?? | <cite>citation</cite> | N/A |
| Block code | bc. code | <pre><code>code</code></pre> | ```code``` |
| Blockquote | bq. quote | <blockquote>quote</blockquote> | > quote |
| Unordered list | * item | <ul><li>item</li></ul> | * item |
| Ordered list | # item | <ol><li>item</li></ol> | 1. item |
| Link | "text":url | <a href="url">text</a> | [text](url) |
| Image | !url(alt)! | <img src="url" alt="alt"> | ![alt](url) |
| Footnote ref | text[1] | text<sup>1</sup> | N/A |
| Footnote def | fn1. text | <p id="fn1">text</p> | N/A |
| No textile | ==raw== | N/A | N/A |
| Span | %span% | <span>span</span> | N/A |
| Preformatted | pre. text | <pre>text</pre> | N/A |
| Comment | ###. comment or HTML Comments <!-- comment --> | <!-- comment --> | N/A |
Some of the N/As for Markdown may well exist; they just depend on the implementation, use case, and ecosystem.
Writing a parser for Textile
This is my second attempt at writing a parser for a grammar; I tried once before with GraphQL using `Swift`. I will have a repo up for that when this post goes up. It was just an attempt at getting better at writing them.
Writing a parser in Gleam allowed me to lean on Gleam's pattern matching, so essentially, after tokenising, the parser matches on exact token scenarios to produce the resulting AST[1].
The parsing process
For some context, here is my understanding of the parsing process.
The lexer
The lexer takes in some string, which may or may not contain correct Textile syntax, and outputs individual tokens. We can think of tokens as atoms: the smallest meaningful units of a string. For an example input "p. Hello World", running the lex(...) function gives us tokens as such:
```gleam
[
  BlockMarker("p."),
  Whitespace(" "),
  Text("Hello"),
  Whitespace(" "),
  Text("World"),
]
```
Here we have extracted the individual tokens from the string. You might assume each letter becomes its own token; while that is valid in some grammars, for our purposes full words separated by whitespace work better.
Another benefit of this approach, specifically for our Textile parser, is that it pays off when using Gleam's list patterns, which let us express our expectations of the syntax directly as patterns and perform actions on them.
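To make this concrete, here is a minimal sketch of a lexer along these lines. The token type and the `is_block_keyword` helper are my assumptions, not the actual parser's API, and a real lexer would cover far more token kinds:

```gleam
import gleam/list
import gleam/string

pub type Token {
  BlockMarker(String)
  Whitespace(String)
  Text(String)
}

// Split on spaces, tag known block signatures as BlockMarker,
// and re-insert the whitespace between words as its own token.
pub fn lex(input: String) -> List(Token) {
  input
  |> string.split(" ")
  |> list.map(fn(word) {
    case is_block_keyword(word) {
      True -> BlockMarker(word)
      False -> Text(word)
    }
  })
  |> list.intersperse(Whitespace(" "))
}

fn is_block_keyword(word: String) -> Bool {
  case word {
    "p." | "h1." | "h2." | "h3." | "bc." | "bq." | "pre." -> True
    _ -> False
  }
}
```

With this, `lex("p. Hello World")` yields the token list shown above.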
Syntax analysis
Now that we have our `List(Token)`, we can construct an AST for our Textile input. To do that, we can group the tokens into the categories below, based on the patterns they share: blocks, inline, modifiers and attributes, and tables.
- Block tokens start on a new line, so our example input `p. Hello World` is a block, and we use a pattern that expects a `BlockMarker(..)`. With this pattern we instruct the `parse_blocks(...)` function: when it encounters a list whose first token is a `BlockMarker(block_token)`, it enters this branch, runs the code inside, and matches on the text `block_token`; depending on its value we get different behaviour, which in our example is parsing the paragraph.
```gleam
[BlockMarker(block_token), ..rest] -> {
  .. = parse_block(block_token, rest)
  ...
}
```
For our example input, the match on `block_token` checks `"p."` and steps into the `parse_paragraph(...)` function:
```gleam
"p." -> parse_paragraph(...)
```
The parser then returns a list of nodes representing the AST of our input document: `Node(Para(modifiers: [], content: List(Node(TextNode("Hello...")))))`.
- Inline: extending our Hello World example paragraph, we add a link, `p. Hello World, "Wikipedia":https://en.wikipedia.org/`. This input is analysed and produces the tokens below, expanding on the earlier ones.
```gleam
[
  BlockMarker("p."),
  Whitespace(" "),
  Text("Hello"),
  Whitespace(" "),
  Text("World,"),
  Whitespace(" "),
  Quote,
  Text("Wikipedia"),
  Quote,
  Colon,
  Text("https"),
  Colon,
  DoubleSlash,
  Text("en"),
  Period,
  Text("wikipedia"),
  Period,
  Text("org"),
  Slash,
]
```
- Modifiers and attributes: Textile allows us to adjust the way text is rendered inline, with tokens like `<`, `>`, `=`, and `<>` for alignment, as well as `_`, `**`, and `__` to make text emphasized, bold, or italic. Modifiers are applied to the text inline, and attributes to the blocks or cells.
- Tables: tables in Textile are similar to Markdown tables, except they allow defining a table header, table footer, summary, etc. (full spec here).
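Putting the block-level dispatch together, a rough Gleam sketch of the shape described above might look like this. The `Node` constructors and the `parse_block` and `inline_nodes` helpers are my assumptions for illustration, not the actual parser's API:

```gleam
import gleam/list

// Mirrors the lexer's token type from earlier in the post.
pub type Token {
  BlockMarker(String)
  Whitespace(String)
  Text(String)
}

pub type Node {
  Para(content: List(Node))
  Heading(level: Int, content: List(Node))
  TextNode(String)
}

// Walk the token list, dispatching whenever a BlockMarker leads.
pub fn parse_blocks(tokens: List(Token), acc: List(Node)) -> List(Node) {
  case tokens {
    [] -> list.reverse(acc)
    [BlockMarker(block_token), ..rest] -> {
      let #(node, rest) = parse_block(block_token, rest)
      parse_blocks(rest, [node, ..acc])
    }
    [_, ..rest] -> parse_blocks(rest, acc)
  }
}

// Choose a node constructor based on the block signature.
fn parse_block(marker: String, rest: List(Token)) -> #(Node, List(Token)) {
  case marker {
    "h1." -> #(Heading(1, inline_nodes(rest)), [])
    _ -> #(Para(inline_nodes(rest)), [])
  }
}

// Keep only Text tokens as TextNode children, for this toy example.
fn inline_nodes(tokens: List(Token)) -> List(Node) {
  list.filter_map(tokens, fn(token) {
    case token {
      Text(text) -> Ok(TextNode(text))
      _ -> Error(Nil)
    }
  })
}
```

The real parser would also thread modifiers and attributes through, and consume tokens only up to the end of the current block rather than to the end of the list.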
The SSG
Now that I have the Textile parser done (I added a few extensions that I will discuss in another post), for the SSG I implemented a frontmatter extraction bit, a simple includes pattern with support for passing arguments to the includes for adding partial .textile files, as well as a loop. I plan to add other functions as I find need for them, one of them being an extract-headings function for building a table of contents.
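As an example of the frontmatter-extraction step, here is a minimal sketch. The `---` delimiter convention and the function name are my assumptions; the actual SSG may work differently:

```gleam
import gleam/string

// Split a source file into #(frontmatter, body), assuming the
// frontmatter sits between two "---" lines at the top of the file.
pub fn split_frontmatter(source: String) -> #(String, String) {
  case string.split(source, "---\n") {
    ["", frontmatter, ..body] -> #(frontmatter, string.join(body, "---\n"))
    _ -> #("", source)
  }
}
```

On input like `"---\ntitle: Hello\n---\np. Hello World"`, this returns the `title: Hello` block separately from the Textile body.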
This was nice to work on. I have found parsing to be a pattern shared between linguistics and computer science that I enjoy, and getting better at it could help me move faster on some other ideas I have for the future.
Till next time.
The repo for the Gleam Textile parser is NOT yet up!
[1] AST: abstract syntax tree.