mirror of https://github.com/RGBCube/cstree synced 2025-07-27 17:17:45 +00:00

Set up a module structure (#44)

DQ 2023-04-07 18:06:51 +02:00 committed by GitHub
parent baa0a9f2f0
commit 16f7a3bd80
38 changed files with 2291 additions and 454 deletions

287
README.md

@ -32,8 +32,291 @@ Notable differences of `cstree` compared to `rowan`:
- Performance optimizations for tree traversal: persisting red nodes allows tree traversal methods to return references. You can still `clone` to obtain an owned node, but you only pay that cost when you need to.
## Getting Started
The main entry points for constructing syntax trees are `GreenNodeBuilder` and `SyntaxNode::new_root` for green and red trees respectively.
See `examples/s_expressions` for a guided tutorial to `cstree`.
If you're looking at `cstree`, you're probably looking at or already writing a parser and are considering using
concrete syntax trees as its output. We'll talk more about parsing below -- first, let's have a look at what needs
to happen to go from input text to a `cstree` syntax tree:
1. Define an enumeration of the types of tokens (like keywords) and nodes (like "an expression")
that you want to have in your syntax and implement `Language`
2. Create a `GreenNodeBuilder` and call `start_node`, `token` and `finish_node` from your parser
3. Call `SyntaxNode::new_root` or `SyntaxNode::new_root_with_resolver` with the resulting
`GreenNode` to obtain a syntax tree that you can traverse
Let's walk through the motions of parsing a (very) simple language into `cstree` syntax trees.
We'll just support addition and subtraction on integers, from which the user is allowed to construct a single,
compound expression. They will, however, be allowed to write nested expressions in parentheses, like `1 - (2 + 5)`.
### Defining the language
First, we need to list the different parts of our language's grammar.
We can do that using an `enum` with a unit variant for any terminal and non-terminal.
The `enum` needs to be convertible to a `u16`, so we use the `repr` attribute to ensure it uses the correct
representation.
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
#[repr(u16)]
enum SyntaxKind {
    /* Tokens */
    Int,    // 42
    Plus,   // +
    Minus,  // -
    LParen, // (
    RParen, // )
    /* Nodes */
    Expr,
    Root,
}
```
Most of these are tokens to lex the input string into, like numbers (`Int`) and operators (`Plus`, `Minus`).
We only really need one type of node: expressions.
Our syntax tree's root node will have the special kind `Root`; all other nodes will be
expressions containing a sequence of arithmetic operations potentially involving further, nested
expression nodes.
To use our `SyntaxKind`s with `cstree`, we need to tell it how to convert them to and from plain numbers (the
`#[repr(u16)]` that we added) by implementing the `Language` trait. We can also tell `cstree` about tokens that
always have the same text through the `static_text` method on the trait. This is useful for the operators and
parentheses, but not possible for numbers, since an integer token may be produced from the input `3`, but also from
other numbers like `7` or `12`. We implement `Language` on an empty type, just so we can give it a name.
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Calculator;
impl Language for Calculator {
    // The tokens and nodes we just defined
    type Kind = SyntaxKind;

    fn kind_from_raw(raw: RawSyntaxKind) -> Self::Kind {
        // This just needs to be the inverse of `kind_to_raw`, but could also
        // be an `impl TryFrom<u16> for SyntaxKind` or any other conversion.
        match raw.0 {
            0 => SyntaxKind::Int,
            1 => SyntaxKind::Plus,
            2 => SyntaxKind::Minus,
            3 => SyntaxKind::LParen,
            4 => SyntaxKind::RParen,
            5 => SyntaxKind::Expr,
            6 => SyntaxKind::Root,
            n => panic!("Unknown raw syntax kind: {n}"),
        }
    }

    fn kind_to_raw(kind: Self::Kind) -> RawSyntaxKind {
        RawSyntaxKind(kind as u16)
    }

    fn static_text(kind: Self::Kind) -> Option<&'static str> {
        match kind {
            SyntaxKind::Plus => Some("+"),
            SyntaxKind::Minus => Some("-"),
            SyntaxKind::LParen => Some("("),
            SyntaxKind::RParen => Some(")"),
            _ => None,
        }
    }
}
```
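The comment in `kind_from_raw` mentions that the conversion could also be an `impl TryFrom<u16> for SyntaxKind`. A minimal sketch of that alternative, using only the standard library, might look like the following. The `KINDS` table and the choice of `u16` as the error type are assumptions made for illustration; they are not part of `cstree`.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
#[repr(u16)]
enum SyntaxKind {
    /* Tokens */
    Int,    // 42
    Plus,   // +
    Minus,  // -
    LParen, // (
    RParen, // )
    /* Nodes */
    Expr,
    Root,
}

impl TryFrom<u16> for SyntaxKind {
    // Assumption: an unknown raw value is simply handed back as the error.
    type Error = u16;

    fn try_from(raw: u16) -> Result<Self, Self::Error> {
        // Hypothetical lookup table. Listing the variants in declaration
        // order keeps each index equal to the `#[repr(u16)]` discriminant.
        const KINDS: [SyntaxKind; 7] = [
            SyntaxKind::Int,
            SyntaxKind::Plus,
            SyntaxKind::Minus,
            SyntaxKind::LParen,
            SyntaxKind::RParen,
            SyntaxKind::Expr,
            SyntaxKind::Root,
        ];
        KINDS.get(usize::from(raw)).copied().ok_or(raw)
    }
}
```

With such an impl, `kind_from_raw` could delegate to `SyntaxKind::try_from(raw.0)` and decide in one place how to treat unknown values, instead of repeating the `match`.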
### Parsing into a green tree
With that out of the way, we can start writing the parser for our expressions.
For the purposes of this introduction to `cstree`, I'll assume that there is a lexer that yields the following
tokens:
```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Token<'input> {
    // Note that number strings are not yet parsed into actual numbers,
    // we just remember the slice of the input that contains their digits
    Int(&'input str),
    Plus,
    Minus,
    LParen,
    RParen,
    // A special token that indicates that we have reached the end of the file
    EoF,
}
```
A simple lexer that yields such tokens is part of the full `readme` example, but we'll be busy enough with the
combination of `cstree` and the actual parser, which we define like this:
```rust
pub struct Parser<'input> {
    // `Peekable` is a standard library iterator adapter that allows
    // looking ahead at the next item without removing it from the iterator yet
    lexer: Peekable<Lexer<'input>>,
    builder: GreenNodeBuilder<'static, 'static, Calculator>,
}

impl<'input> Parser<'input> {
    pub fn new(input: &'input str) -> Self {
        Self {
            // we get `peekable` from implementing `Iterator` on `Lexer`
            lexer: Lexer::new(input).peekable(),
            builder: GreenNodeBuilder::new(),
        }
    }

    pub fn bump(&mut self) -> Option<Token<'input>> {
        self.lexer.next()
    }
}
```
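The `Lexer` type that `Parser::new` relies on is not shown in this section. As a rough idea of what it needs to provide (an `Iterator` over `Token`s that ends with `EoF`), here is a minimal sketch using only the standard library. It is not the lexer from the `readme` example: the field names, the decision to skip whitespace, and the panic on unknown characters are all assumptions made to keep the sketch short.

```rust
use std::iter::Peekable;
use std::str::CharIndices;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Token<'input> {
    Int(&'input str),
    Plus,
    Minus,
    LParen,
    RParen,
    EoF,
}

// Hypothetical lexer: yields every token of the input, then `EoF` once.
pub struct Lexer<'input> {
    input: &'input str,
    chars: Peekable<CharIndices<'input>>,
    // Set after `EoF` has been yielded, so that it is produced only once
    at_eof: bool,
}

impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        Self { input, chars: input.char_indices().peekable(), at_eof: false }
    }
}

impl<'input> Iterator for Lexer<'input> {
    type Item = Token<'input>;

    fn next(&mut self) -> Option<Self::Item> {
        // Simplification: whitespace is skipped instead of becoming a token
        while let Some(&(_, c)) = self.chars.peek() {
            if !c.is_whitespace() {
                break;
            }
            self.chars.next();
        }
        let (start, c) = match self.chars.next() {
            Some(pair) => pair,
            None if self.at_eof => return None,
            None => {
                self.at_eof = true;
                return Some(Token::EoF);
            }
        };
        let token = match c {
            '+' => Token::Plus,
            '-' => Token::Minus,
            '(' => Token::LParen,
            ')' => Token::RParen,
            c if c.is_ascii_digit() => {
                // Extend the slice for as long as more digits follow
                let mut end = start + c.len_utf8();
                while let Some(&(idx, d)) = self.chars.peek() {
                    if !d.is_ascii_digit() {
                        break;
                    }
                    end = idx + d.len_utf8();
                    self.chars.next();
                }
                Token::Int(&self.input[start..end])
            }
            // A real lexer would report an error here instead of panicking
            other => panic!("Unexpected character: {:?}", other),
        };
        Some(token)
    }
}
```

Because the sketch drops whitespace, a tree built from its tokens would not reproduce the input byte for byte. A lexer intended for lossless trees would emit whitespace as tokens of its own kind.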
In contrast to parsers that return abstract syntax trees, with `cstree` the syntax tree nodes for
all elements in the language grammar will have the same type: `GreenNode` for the inner ("green")
tree and `SyntaxNode` for the outer ("red") tree. Different kinds of nodes (and tokens) are
differentiated by their `SyntaxKind` tag, which we defined above.
You can implement many types of parsers with `cstree`. To get a feel for how it works, consider
a typical recursive descent parser. With a more traditional AST, one would define different AST
structs for struct or function definitions, statements, expressions and so on. Inside the
parser, the components of any element, such as all fields of a struct or all statements inside a
function, are parsed first and then the parser wraps them in the matching AST type, which is
returned from the corresponding parser function.
Because `cstree`'s syntax trees are untyped, there is no explicit AST representation that the parser
would build. Instead, parsing into a CST using the `GreenNodeBuilder` follows the source code more
closely in that you tell `cstree` about each new element you enter and all tokens that the parser
consumes. So, for example, to parse a struct definition the parser first "enters" the struct
definition node, then parses the `struct` keyword and type name, then parses each field, and finally
"finishes" parsing the struct node.
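To make the contrast between the two styles concrete, the following sketch models `1 + 2` both ways with plain Rust types. None of these types come from `cstree`: `AstExpr`, `Kind` and `Element` are hypothetical stand-ins, and the `Whitespace` kind is added only to show how a uniform tree can keep the input's layout.

```rust
// AST style: a dedicated Rust type per grammar element, built bottom-up.
#[derive(Debug, PartialEq)]
enum AstExpr {
    Int(i64),
    Add(Box<AstExpr>, Box<AstExpr>),
}

// CST style: every element has the same Rust type and carries a kind tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Int,
    Plus,
    Whitespace,
    Expr,
}

#[derive(Debug, PartialEq)]
enum Element {
    Token(Kind, String),
    Node(Kind, Vec<Element>),
}

impl Element {
    // Concatenating the text of all tokens reproduces the original input.
    fn text(&self) -> String {
        match self {
            Element::Token(_, text) => text.clone(),
            Element::Node(_, children) => children.iter().map(Element::text).collect(),
        }
    }
}

fn one_plus_two_ast() -> AstExpr {
    AstExpr::Add(Box::new(AstExpr::Int(1)), Box::new(AstExpr::Int(2)))
}

fn one_plus_two_cst() -> Element {
    Element::Node(
        Kind::Expr,
        vec![
            Element::Token(Kind::Int, "1".to_string()),
            Element::Token(Kind::Whitespace, " ".to_string()),
            Element::Token(Kind::Plus, "+".to_string()),
            Element::Token(Kind::Whitespace, " ".to_string()),
            Element::Token(Kind::Int, "2".to_string()),
        ],
    )
}
```

The AST keeps only what matters for evaluation, while the uniform tree keeps every token, so `one_plus_two_cst().text()` gives back `1 + 2` including its spaces.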
The most trivial example is the root node for our parser, which just creates a root node
containing the whole expression (we could do without a specific root node if any expression was
a node, in particular if we wrapped integer literal tokens inside `Expr` nodes).
```rust
pub fn parse(&mut self) -> Result<(), String> {
    self.builder.start_node(SyntaxKind::Root);
    self.parse_expr()?;
    self.builder.finish_node();
    Ok(())
}
```
As there isn't a static AST type to return, the parser is very flexible as to what is part of a
node. In the struct definition example from before, if the user is adding a new field to the struct and has not yet
typed the field's type, the CST node for the struct doesn't care if there is no child node for
it. Similarly, if the user is deleting fields and the source code currently contains a leftover
field name, this additional identifier can be a part of the struct node without any
modifications to the syntax tree definition. This property is the key to why CSTs are such a
good fit as a lossless input representation, which requires the syntax tree to mirror the
user-specific layout of whitespace and comments around the AST items.
In the parser for our simple expression language, we'll also have to deal with the fact that,
when we see a number, the parser doesn't yet know whether there will be additional operations
following that number. That is, in the expression `1 + 2`, it can only know that it is parsing
a binary operation once it sees the `+`. The event-like model of building trees in `cstree`,
however, implies that when reaching the `+`, the parser would have to have already entered an
expression node in order for the whole input to be part of the expression.
To get around this, `GreenNodeBuilder` provides the `checkpoint` method, which we can call to
"remember" the current position in the input. For example, we can create a checkpoint before the
parser parses the first `1`. Later, when it sees the following `+`, it can create an `Expr` node
for the whole expression using `start_node_at`:
```rust
fn parse_lhs(&mut self) -> Result<(), String> {
    // An expression may start either with a number, or with an opening parenthesis that is
    // the start of a parenthesized expression
    let next_token = *self.lexer.peek().unwrap();
    match next_token {
        Token::Int(n) => {
            self.bump();
            self.builder.token(SyntaxKind::Int, n);
        }
        Token::LParen => {
            // Wrap the grouped expression inside a node containing it and its parentheses
            self.builder.start_node(SyntaxKind::Expr);
            self.bump();
            self.builder.static_token(SyntaxKind::LParen);
            self.parse_expr()?; // Inner expression
            if self.bump() != Some(Token::RParen) {
                return Err("Missing ')'".to_string());
            }
            self.builder.static_token(SyntaxKind::RParen);
            self.builder.finish_node();
        }
        Token::EoF => return Err("Unexpected end of file: expected expression".to_string()),
        t => return Err(format!("Unexpected start of expression: '{t:?}'")),
    }
    Ok(())
}

fn parse_expr(&mut self) -> Result<(), String> {
    // Remember our current position
    let before_expr = self.builder.checkpoint();

    // Parse the start of the expression
    self.parse_lhs()?;

    // Check if the expression continues with `+ <more>` or `- <more>`
    let Some(next_token) = self.lexer.peek() else {
        return Ok(());
    };
    let op = match *next_token {
        Token::Plus => SyntaxKind::Plus,
        Token::Minus => SyntaxKind::Minus,
        Token::RParen | Token::EoF => return Ok(()),
        t => return Err(format!("Expected operator, found '{t:?}'")),
    };

    // If so, retroactively wrap the (already parsed) LHS and the following RHS
    // inside an `Expr` node
    self.builder.start_node_at(before_expr, SyntaxKind::Expr);
    self.bump();
    self.builder.static_token(op);
    self.parse_expr()?; // RHS
    self.builder.finish_node();
    Ok(())
}
```
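To build some intuition for why `checkpoint` and `start_node_at` can wrap elements that have already been emitted, here is a toy builder written with only the standard library. It is a simplified model and not `cstree`'s implementation: `ToyBuilder`, `Element`, `Kind` and the representation of `Checkpoint` as an index are all assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Int,
    Plus,
    Expr,
    Root,
}

#[derive(Debug, PartialEq)]
enum Element {
    Token(Kind, String),
    Node(Kind, Vec<Element>),
}

// In this toy model, a checkpoint is the number of pending elements
// at the time it was taken.
#[derive(Debug, Clone, Copy)]
struct Checkpoint(usize);

#[derive(Default)]
struct ToyBuilder {
    // Open nodes: their kind and the index of their first child in `children`
    parents: Vec<(Kind, usize)>,
    // Finished elements that have not been attached to a parent yet
    children: Vec<Element>,
}

impl ToyBuilder {
    fn start_node(&mut self, kind: Kind) {
        let first_child = self.children.len();
        self.parents.push((kind, first_child));
    }

    fn token(&mut self, kind: Kind, text: &str) {
        self.children.push(Element::Token(kind, text.to_string()));
    }

    fn checkpoint(&self) -> Checkpoint {
        Checkpoint(self.children.len())
    }

    fn start_node_at(&mut self, checkpoint: Checkpoint, kind: Kind) {
        // The new node adopts everything emitted since the checkpoint.
        // (This sketch does not validate the checkpoint.)
        self.parents.push((kind, checkpoint.0));
    }

    fn finish_node(&mut self) {
        let (kind, first_child) = self.parents.pop().expect("no open node");
        let children = self.children.split_off(first_child);
        self.children.push(Element::Node(kind, children));
    }

    fn finish(mut self) -> Element {
        assert!(self.parents.is_empty(), "unfinished nodes");
        assert_eq!(self.children.len(), 1, "expected a single root");
        self.children.pop().unwrap()
    }
}

// Replays the builder calls that a parser would make for `1+2`.
fn build_one_plus_two() -> Element {
    let mut builder = ToyBuilder::default();
    builder.start_node(Kind::Root);
    let before_expr = builder.checkpoint();
    builder.token(Kind::Int, "1");
    // Only now do we see `+` and learn that `1` was the LHS of an expression
    builder.start_node_at(before_expr, Kind::Expr);
    builder.token(Kind::Plus, "+");
    builder.token(Kind::Int, "2");
    builder.finish_node(); // Expr
    builder.finish_node(); // Root
    builder.finish()
}
```

In this model nothing is attached to a parent until `finish_node` is called, so starting a node "in the past" only requires remembering an earlier index into the list of pending elements.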
### Obtaining the parser result
Our parser is now capable of parsing our little arithmetic language, but its methods don't return
anything. So how do we get our syntax tree out? The answer lies in `GreenNodeBuilder::finish`, which
finally returns the tree that we have painstakingly constructed.
```rust
impl Parser<'_> {
    pub fn finish(mut self) -> (GreenNode, impl Interner) {
        assert!(self.lexer.next().map(|t| t == Token::EoF).unwrap_or(true));
        let (tree, cache) = self.builder.finish();
        (tree, cache.unwrap().into_interner().unwrap())
    }
}
```
`finish` also returns the cache it used to deduplicate tree nodes and tokens, so you can re-use it
for parsing related inputs (e.g., different source files from the same crate may share a lot of
common function and type names that can be deduplicated). See `GreenNodeBuilder`'s documentation for
more information on this, in particular the `with_cache` and `from_cache` methods. Most importantly
for us, we can extract the `Interner` that contains the source text of the tree's tokens from the
cache, which we need if we want to look up things like variable names or the value of numbers for
our calculator.
To work with the syntax tree, you'll want to upgrade it to a `SyntaxNode` using
`SyntaxNode::new_root`. You can also use `SyntaxNode::new_root_with_resolver` to combine tree and
interner, which lets you directly retrieve source text and makes the nodes implement `Display` and
`Debug`. The same output can be produced from `SyntaxNode`s by calling the `debug` or `display`
method with a `Resolver`. To visualize the whole syntax tree, pass `true` for the `recursive`
parameter on `debug`, or simply debug-print a `ResolvedNode`:
```rust
let input = "11 + 2-(5 + 4)";
let mut parser = Parser::new(input);
parser.parse().unwrap();
let (tree, interner) = parser.finish();
let root = SyntaxNode::<Calculator>::new_root_with_resolver(tree, interner);
dbg!(root);
```
## AST Layer
While `cstree` is built for concrete syntax trees, applications are quite easily able to work with either a CST or an AST representation, or freely switch between them.