Set up a module structure (#44)

2025-07-27 17:17:45 +00:00 · 2023-04-07 18:06:51 +02:00 · 2023-04-07 18:06:51 +02:00 · 16f7a3bd80
commit 16f7a3bd80
parent baa0a9f2f0
38 changed files with 2291 additions and 454 deletions
--- a/README.md
+++ b/README.md
@@ -32,8 +32,291 @@ Notable differences of `cstree` compared to `rowan`:
  - Performance optimizations for tree traversal: persisting red nodes allows tree traversal methods to return references. You can still `clone` to obtain an owned node, but you only pay that cost when you need to.

 ## Getting Started
-The main entry points for constructing syntax trees are `GreenNodeBuilder` and `SyntaxNode::new_root` for green and red trees respectively.
-See `examples/s_expressions` for a guided tutorial to `cstree`.
+
+If you're looking at `cstree`, you're probably looking at or already writing a parser and are considering using
+concrete syntax trees as its output. We'll talk more about parsing below -- first, let's have a look at what needs
+to happen to go from input text to a `cstree` syntax tree:
+
+ 1. Define an enumeration of the types of tokens (like keywords) and nodes (like "an expression")
+ that you want to have in your syntax and implement `Language`
+
+ 2. Create a `GreenNodeBuilder` and call `start_node`, `token` and `finish_node` from your parser  
+
+ 3. Call `SyntaxNode::new_root` or `SyntaxNode::new_root_with_resolver` with the resulting
+ `GreenNode` to obtain a syntax tree that you can traverse
+
+Let's walk through the motions of parsing a (very) simple language into `cstree` syntax trees.
+We'll just support addition and subtraction on integers, from which the user is allowed to construct a single,
+compound expression. They will, however, be allowed to write nested expressions in parentheses, like `1 - (2 + 5)`.
+
+### Defining the language
+First, we need to list the different parts of our language's grammar.
+We can do that using an `enum` with a unit variant for each terminal and non-terminal.
+The `enum` needs to be convertible to a `u16`, so we use the `repr` attribute to ensure it uses the correct
+representation.
+
+```rust
+#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
+#[repr(u16)]
+enum SyntaxKind {
+    /* Tokens */
+    Int,    // 42
+    Plus,   // +
+    Minus,  // -
+    LParen, // (
+    RParen, // )
+    /* Nodes */
+    Expr,
+    Root,
+}
+```
+
+Most of these are tokens to lex the input string into, like numbers (`Int`) and operators (`Plus`, `Minus`).
+We only really need one type of node: expressions.
+Our syntax tree's root node will have the special kind `Root`; all other nodes will be
+expressions containing a sequence of arithmetic operations potentially involving further, nested
+expression nodes.
+
+To use our `SyntaxKind`s with `cstree`, we need to tell it how to convert a kind to just a number (using the
+`#[repr(u16)]` that we added) and back by implementing the `Language` trait. We can also tell `cstree` about tokens that
+always have the same text through the `static_text` method on the trait. This is useful for the operators and
+parentheses, but not possible for numbers, since an integer token may be produced from the input `3`, but also from
+other numbers like `7` or `12`. We implement `Language` on an empty type, just so we can give it a name.
+
+```rust
+#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
+pub struct Calculator;
+
+impl Language for Calculator {
+    // The tokens and nodes we just defined
+    type Kind = SyntaxKind;
+
+    fn kind_from_raw(raw: RawSyntaxKind) -> Self::Kind {
+        // This just needs to be the inverse of `kind_to_raw`, but could also
+        // be an `impl TryFrom<u16> for SyntaxKind` or any other conversion.
+        match raw.0 {
+            0 => SyntaxKind::Int,
+            1 => SyntaxKind::Plus,
+            2 => SyntaxKind::Minus,
+            3 => SyntaxKind::LParen,
+            4 => SyntaxKind::RParen,
+            5 => SyntaxKind::Expr,
+            6 => SyntaxKind::Root,
+            n => panic!("Unknown raw syntax kind: {n}"),
+        }
+    }
+
+    fn kind_to_raw(kind: Self::Kind) -> RawSyntaxKind {
+        RawSyntaxKind(kind as u16)
+    }
+
+    fn static_text(kind: Self::Kind) -> Option<&'static str> {
+        match kind {
+            SyntaxKind::Plus => Some("+"),
+            SyntaxKind::Minus => Some("-"),
+            SyntaxKind::LParen => Some("("),
+            SyntaxKind::RParen => Some(")"),
+            _ => None,
+        }
+    }
+}
+```
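
The comment in `kind_from_raw` mentions that the conversion could also be an `impl TryFrom<u16> for SyntaxKind`. A minimal sketch of that alternative follows, using only the standard library. The `KINDS` table and the `String` error type are assumptions made for illustration and are not part of this commit.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
#[repr(u16)]
enum SyntaxKind {
    /* Tokens */
    Int,
    Plus,
    Minus,
    LParen,
    RParen,
    /* Nodes */
    Expr,
    Root,
}

impl TryFrom<u16> for SyntaxKind {
    // Assumption: a plain `String` error keeps the sketch short.
    type Error = String;

    fn try_from(raw: u16) -> Result<Self, Self::Error> {
        // Hypothetical lookup table; it must list the variants in declaration
        // order so that `kind as u16` is the index of `kind`.
        const KINDS: [SyntaxKind; 7] = [
            SyntaxKind::Int,
            SyntaxKind::Plus,
            SyntaxKind::Minus,
            SyntaxKind::LParen,
            SyntaxKind::RParen,
            SyntaxKind::Expr,
            SyntaxKind::Root,
        ];
        KINDS
            .get(usize::from(raw))
            .copied()
            .ok_or_else(|| format!("Unknown raw syntax kind: {raw}"))
    }
}
```

With such an impl, `kind_from_raw` could delegate to `SyntaxKind::try_from(raw.0)` and decide for itself whether to panic on an unknown value.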
+
+### Parsing into a green tree
+With that out of the way, we can start writing the parser for our expressions.
+For the purposes of this introduction to `cstree`, I'll assume that there is a lexer that yields the following
+tokens:
+
+```rust
+#[derive(Debug, PartialEq, Eq, Clone, Copy)]
+pub enum Token<'input> {
+    // Note that number strings are not yet parsed into actual numbers,
+    // we just remember the slice of the input that contains their digits
+    Int(&'input str),
+    Plus,
+    Minus,
+    LParen,
+    RParen,
+    // A special token that indicates that we have reached the end of the file
+    EoF,
+}
+```
+
+A simple lexer that yields such tokens is part of the full `readme` example, but we'll be busy enough with the
+combination of `cstree` and the actual parser, which we define like this:
+
+```rust
+pub struct Parser<'input> {
+             // `Peekable` is a standard library iterator adapter that allows
+             // looking ahead at the next item without removing it from the iterator yet
+    lexer:   Peekable<Lexer<'input>>,
+    builder: GreenNodeBuilder<'static, 'static, Calculator>,
+}
+
+impl<'input> Parser<'input> {
+    pub fn new(input: &'input str) -> Self {
+        Self {
+            // we get `peekable` from implementing `Iterator` on `Lexer`
+            lexer:   Lexer::new(input).peekable(),
+            builder: GreenNodeBuilder::new(),
+        }
+    }
+
+    pub fn bump(&mut self) -> Option<Token<'input>> {
+        self.lexer.next()
+    }
+}
+```
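
The `Lexer` type used by `Parser` is not shown in this part of the README. As a rough sketch of what such a lexer might look like, the block below implements `Iterator` over the `Token` type from above using only the standard library. The fields `input` and `at_eof`, the decision to skip whitespace, and the panic on unknown characters are all assumptions for illustration; the lexer in the repository's `readme` example may differ.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Token<'input> {
    Int(&'input str),
    Plus,
    Minus,
    LParen,
    RParen,
    EoF,
}

pub struct Lexer<'input> {
    // The part of the input that has not been lexed yet (hypothetical field)
    input: &'input str,
    // Whether the single `EoF` token has already been yielded (hypothetical field)
    at_eof: bool,
}

impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        Self { input, at_eof: false }
    }
}

impl<'input> Iterator for Lexer<'input> {
    type Item = Token<'input>;

    fn next(&mut self) -> Option<Self::Item> {
        // Assumption: whitespace is skipped, since `Token` has no variant for it
        let input = self.input.trim_start();
        let Some(first) = input.chars().next() else {
            // End of input: yield `EoF` exactly once, then end the iterator
            self.input = input;
            if self.at_eof {
                return None;
            }
            self.at_eof = true;
            return Some(Token::EoF);
        };
        let (token, len) = match first {
            '+' => (Token::Plus, 1),
            '-' => (Token::Minus, 1),
            '(' => (Token::LParen, 1),
            ')' => (Token::RParen, 1),
            c if c.is_ascii_digit() => {
                // Take the longest run of ASCII digits, keeping it as a slice of the input
                let len = input
                    .find(|c: char| !c.is_ascii_digit())
                    .unwrap_or(input.len());
                (Token::Int(&input[..len]), len)
            }
            // Assumption: a sketch may simply give up on anything else
            c => panic!("Unexpected character: '{c}'"),
        };
        self.input = &input[len..];
        Some(token)
    }
}
```

Because the sketch yields `EoF` before ending, a parser that peeks at the next token always finds one while there is still something left to decide.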
+
+In contrast to parsers that return abstract syntax trees, with `cstree` the syntax tree nodes for
+all elements in the language grammar will have the same type: `GreenNode` for the inner ("green")
+tree and `SyntaxNode` for the outer ("red") tree.  Different kinds of nodes (and tokens) are
+differentiated by their `SyntaxKind` tag, which we defined above.
+
+You can implement many types of parsers with `cstree`. To get a feel for how it works, consider
+a typical recursive descent parser. With a more traditional AST, one would define different AST
+structs for struct or function definitions, statements, expressions and so on. Inside the
+parser, the components of any element, such as all fields of a struct or all statements inside a
+function, are parsed first and then the parser wraps them in the matching AST type, which is
+returned from the corresponding parser function.
+
+Because `cstree`'s syntax trees are untyped, there is no explicit AST representation that the parser
+would build.  Instead, parsing into a CST using the `GreenNodeBuilder` follows the source code more
+closely in that you tell `cstree` about each new element you enter and all tokens that the parser
+consumes. So, for example, to parse a struct definition the parser first "enters" the struct
+definition node, then parses the `struct` keyword and type name, then parses each field, and finally
+"finishes" parsing the struct node.
+
+The most trivial example is the entry point of our parser, which just creates a root node
+containing the whole expression (we could do without a specific root node if any expression was
+a node, in particular if we wrapped integer literal tokens inside `Expr` nodes).
+
+```rust
+pub fn parse(&mut self) -> Result<(), String> {
+    self.builder.start_node(SyntaxKind::Root);
+    self.parse_expr()?;
+    self.builder.finish_node();
+    Ok(())
+}
+```
+
+As there isn't a static AST type to return, the parser is very flexible as to what is part of a
+node. In the previous example, if the user is adding a new field to the struct and has not yet
+typed the field's type, the CST node for the struct doesn't care if there is no child node for
+it. Similarly, if the user is deleting fields and the source code currently contains a leftover
+field name, this additional identifier can be a part of the struct node without any
+modifications to the syntax tree definition. This property is the key to why CSTs are such a
+good fit as a lossless input representation, which requires the syntax tree to mirror the
+user-specific layout of whitespace and comments around the AST items.
+
+In the parser for our simple expression language, we'll also have to deal with the fact that,
+when we see a number, the parser doesn't yet know whether there will be additional operations
+following that number. That is, in the expression `1 + 2`, it can only know that it is parsing
+a binary operation once it sees the `+`. The event-like model of building trees in `cstree`,
+however, implies that when reaching the `+`, the parser would have to have already entered an
+expression node in order for the whole input to be part of the expression.
+
+To get around this, `GreenNodeBuilder` provides the `checkpoint` method, which we can call to
+"remember" the current position in the input. For example, we can create a checkpoint before the
+parser parses the first `1`.  Later, when it sees the following `+`, it can create an `Expr` node
+for the whole expression using `start_node_at`:
+
+```rust
+fn parse_lhs(&mut self) -> Result<(), String> {
+    // An expression may start either with a number, or with an opening parenthesis that is
+    // the start of a parenthesized expression
+    let next_token = *self.lexer.peek().unwrap();
+    match next_token {
+        Token::Int(n) => {
+            self.bump();
+            self.builder.token(SyntaxKind::Int, n);
+        }
+        Token::LParen => {
+            // Wrap the grouped expression inside a node containing it and its parentheses
+            self.builder.start_node(SyntaxKind::Expr);
+            self.bump();
+            self.builder.static_token(SyntaxKind::LParen);
+            self.parse_expr()?; // Inner expression
+            if self.bump() != Some(Token::RParen) {
+                return Err("Missing ')'".to_string());
+            }
+            self.builder.static_token(SyntaxKind::RParen);
+            self.builder.finish_node();
+        }
+        Token::EoF => return Err("Unexpected end of file: expected expression".to_string()),
+        t => return Err(format!("Unexpected start of expression: '{t:?}'")),
+    }
+    Ok(())
+}
+
+fn parse_expr(&mut self) -> Result<(), String> {
+    // Remember our current position
+    let before_expr = self.builder.checkpoint();
+
+    // Parse the start of the expression
+    self.parse_lhs()?;
+
+    // Check if the expression continues with `+ <more>` or `- <more>`
+    let Some(next_token) = self.lexer.peek() else {
+        return Ok(());
+    };
+    let op = match *next_token {
+        Token::Plus => SyntaxKind::Plus,
+        Token::Minus => SyntaxKind::Minus,
+        Token::RParen | Token::EoF => return Ok(()),
+        t => return Err(format!("Expected operator, found '{t:?}'")),
+    };
+
+    // If so, retroactively wrap the (already parsed) LHS and the following RHS
+    // inside an `Expr` node
+    self.builder.start_node_at(before_expr, SyntaxKind::Expr);
+    self.bump();
+    self.builder.static_token(op);
+    self.parse_expr()?; // RHS
+    self.builder.finish_node();
+    Ok(())
+}
+```
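
To build some intuition for why a checkpoint lets the parser wrap already-parsed elements retroactively, here is a toy model of an event-style builder written with only the standard library. It is not `cstree`'s implementation: `ToyBuilder`, `Element`, and `build_one_plus_two` are hypothetical names, and node kinds are plain strings to keep the sketch short. The idea is that a checkpoint only records how many finished elements exist so far, so a node started "at" that checkpoint later adopts everything added since.

```rust
#[derive(Debug, PartialEq)]
enum Element {
    Token(String),
    Node(&'static str, Vec<Element>),
}

// A checkpoint is just the number of finished elements at the time it was taken
#[derive(Clone, Copy)]
struct Checkpoint(usize);

#[derive(Default)]
struct ToyBuilder {
    // Open nodes: their kind and the index of their first child in `children`
    parents: Vec<(&'static str, usize)>,
    // Finished tokens and nodes that have not been adopted by a parent yet
    children: Vec<Element>,
}

impl ToyBuilder {
    fn token(&mut self, text: &str) {
        self.children.push(Element::Token(text.to_string()));
    }

    fn checkpoint(&self) -> Checkpoint {
        Checkpoint(self.children.len())
    }

    fn start_node(&mut self, kind: &'static str) {
        // Starting a node "now" is the same as starting it at a fresh checkpoint
        let now = self.checkpoint();
        self.start_node_at(now, kind);
    }

    fn start_node_at(&mut self, checkpoint: Checkpoint, kind: &'static str) {
        self.parents.push((kind, checkpoint.0));
    }

    fn finish_node(&mut self) {
        let (kind, first_child) = self
            .parents
            .pop()
            .expect("finish_node called without a matching start_node");
        // Everything added since the node's starting position becomes its children
        let children = self.children.split_off(first_child);
        self.children.push(Element::Node(kind, children));
    }
}

// Mirrors the calls `parse` and `parse_expr` would make for the input `1 + 2`
fn build_one_plus_two() -> Vec<Element> {
    let mut builder = ToyBuilder::default();
    builder.start_node("Root");
    let before_expr = builder.checkpoint();
    builder.token("1"); // LHS, parsed before we know an operator follows
    builder.start_node_at(before_expr, "Expr"); // wrap the LHS retroactively
    builder.token("+");
    builder.token("2"); // RHS
    builder.finish_node(); // Expr
    builder.finish_node(); // Root
    builder.children
}
```

In this model, the `1` token ends up inside the `Expr` node even though it was added before that node was started, which is the behaviour the parser above relies on.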
+
+### Obtaining the parser result
+
+Our parser is now capable of parsing our little arithmetic language, but its methods don't return
+anything. So how do we get our syntax tree out? The answer lies in `GreenNodeBuilder::finish`, which
+finally returns the tree that we have painstakingly constructed.
+
+```rust
+impl Parser<'_> {
+    pub fn finish(mut self) -> (GreenNode, impl Interner) {
+        assert!(self.lexer.next().map(|t| t == Token::EoF).unwrap_or(true));
+        let (tree, cache) = self.builder.finish();
+        (tree, cache.unwrap().into_interner().unwrap())
+    }
+}
+```
+
+`finish` also returns the cache it used to deduplicate tree nodes and tokens, so you can re-use it
+for parsing related inputs (e.g., different source files from the same crate may share a lot of
+common function and type names that can be deduplicated). See `GreenNodeBuilder`'s documentation for
+more information on this, in particular the `with_cache` and `from_cache` methods. Most importantly
+for us, we can extract the `Interner` that contains the source text of the tree's tokens from the
+cache, which we need if we want to look up things like variable names or the value of numbers for
+our calculator.
+
+To work with the syntax tree, you'll want to upgrade it to a `SyntaxNode` using
+`SyntaxNode::new_root`.  You can also use `SyntaxNode::new_root_with_resolver` to combine tree and
+interner, which lets you directly retrieve source text and makes the nodes implement `Display` and
+`Debug`. The same output can be produced from `SyntaxNode`s by calling the `debug` or `display`
+method with a `Resolver`. To visualize the whole syntax tree, pass `true` for the `recursive`
+parameter on `debug`, or simply debug-print a `ResolvedNode`:
+
+```rust
+let input = "11 + 2-(5 + 4)";
+let mut parser = Parser::new(input);
+parser.parse().unwrap();
+let (tree, interner) = parser.finish();
+let root = SyntaxNode::<Calculator>::new_root_with_resolver(tree, interner);
+dbg!(root);
+```

 ## AST Layer
 While `cstree` is built for concrete syntax trees, applications are quite easily able to work with either a CST or an AST representation, or freely switch between them.