📜 Add better documentation across the compiler. (#3)

These changes pay particular attention to API endpoints, to try to ensure that any rustdocs generated are detailed and sensible. A good next step, eventually, might be to include doctest examples as well; for the moment, though, it's not clear that they would provide a lot of value. In addition, this does a couple of refactors that simplify the code base in ways that make things clearer or, at least, briefer.
2023-05-13 14:34:48 -05:00
parent f4594bf2cc
commit 1fbfd0c2d2
28 changed files with 1550 additions and 432 deletions
--- a/src/syntax/ast.rs
+++ b/src/syntax/ast.rs
@@ -1,12 +1,32 @@
 use crate::syntax::Location;

+/// The set of valid binary operators.
 pub static BINARY_OPERATORS: &[&str] = &["+", "-", "*", "/"];

+/// A structure representing a parsed program.
+///
+/// One `Program` is associated with exactly one input file, and the
+/// vector is arranged in exactly the same order as the parsed file.
+/// Because this is the syntax layer, the program is guaranteed to be
+/// syntactically valid, but may be nonsense. There could be attempts
+/// to use unbound variables, for example, until after someone runs
+/// `validate` and it comes back without errors.
 #[derive(Clone, Debug, PartialEq)]
 pub struct Program {
    pub statements: Vec<Statement>,
 }

+/// A parsed statement.
+///
+/// Statements are guaranteed to be syntactically valid, but may be
+/// complete nonsense at the semantic level. Which is to say, all the
+/// print statements were correctly formatted, and all the variables
+/// referenced are definitely valid symbols, but they may not have
+/// been defined or anything.
+///
+/// Note that equivalence testing on statements is independent of
+/// source location; it is testing if the two statements say the same
+/// thing, not if they are the exact same statement.
 #[derive(Clone, Debug)]
 pub enum Statement {
    Binding(Location, String, Expression),
@@ -28,6 +48,12 @@ impl PartialEq for Statement {
    }
 }

+/// An expression in the underlying syntax.
+///
+/// Like statements, these expressions are guaranteed to have been
+/// formatted correctly, but may not actually make any sense. Also
+/// like Statements, the [`PartialEq`] implementation does not take
+/// source positions into account.
 #[derive(Clone, Debug)]
 pub enum Expression {
    Value(Location, Value),
@@ -54,7 +80,9 @@ impl PartialEq for Expression {
    }
 }

+/// A value from the source syntax
 #[derive(Clone, Debug, PartialEq, Eq)]
 pub enum Value {
+    /// The value of the number, and an optional base that it was written in
    Number(Option<u8>, i64),
 }
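The location-independent equality described in the doc comments above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code from this commit: `Loc` and `Stmt` are hypothetical stand-ins for `Location` and `Statement`, and the bound expression is simplified to a plain `i64`.

```rust
/// Hypothetical stand-in for `Location`: a file index plus a byte offset.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Loc {
    pub file_idx: usize,
    pub offset: usize,
}

/// Hypothetical, simplified stand-in for `Statement`.
#[derive(Clone, Debug)]
pub enum Stmt {
    Binding(Loc, String, i64),
    Print(Loc, String),
}

impl PartialEq for Stmt {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            // Compare names and values, deliberately ignoring the locations.
            (Stmt::Binding(_, n1, v1), Stmt::Binding(_, n2, v2)) => n1 == n2 && v1 == v2,
            (Stmt::Print(_, n1), Stmt::Print(_, n2)) => n1 == n2,
            // Different kinds of statement never say the same thing.
            _ => false,
        }
    }
}
```

With this shape, two `Stmt::Binding` values for `x = 5` compare equal even if one came from offset 0 of file 0 and the other from offset 42 of file 1, which is the property that makes before/after comparisons of a transformed program convenient in tests.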
--- a/src/syntax/eval.rs
+++ b/src/syntax/eval.rs
@@ -4,11 +4,23 @@ use crate::eval::{EvalEnvironment, EvalError, Value};
 use crate::syntax::{Expression, Program, Statement};

 impl Program {
+    /// Evaluate the program, returning either an error or what it prints out when run.
+    ///
+    /// Doing this evaluation is particularly useful for testing, to ensure that if we
+    /// modify a program in some way it does the same thing on both sides of the
+    /// transformation. It's also sometimes just nice to know what a program will be
+    /// doing.
+    ///
+    /// Note that the errors here are slightly more strict than what we enforce at runtime.
+    /// For example, we check for overflow and underflow errors during evaluation, and
+    /// we don't check for those in the compiled code.
    pub fn eval(&self) -> Result<String, EvalError> {
        let mut env = EvalEnvironment::empty();
        let mut stdout = String::new();

        for stmt in self.statements.iter() {
+            // at this point, evaluation is pretty simple. just walk through each
+            // statement, in order, and record printouts as we come to them.
            match stmt {
                Statement::Binding(_, name, value) => {
                    let actual_value = value.eval(&env)?;
@@ -40,6 +52,7 @@ impl Expression {
                let mut arg_values = Vec::with_capacity(args.len());

                for arg in args.iter() {
+                    // yay, recursion! makes this pretty straightforward
                    arg_values.push(arg.eval(env)?);
                }
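The doc comment on `eval` notes that evaluation checks for overflow and underflow even though compiled code does not. One way such a check could look, using the standard library's checked arithmetic, is sketched below. `SketchError` and `apply_binary` are hypothetical names for illustration only; the real `EvalError` and primitive dispatch are not shown in this diff.

```rust
/// Hypothetical error type; the real `EvalError` is not shown in this diff.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchError {
    /// The operation overflowed, underflowed, or divided by zero.
    Arithmetic(String),
    /// The operator is not one of "+", "-", "*", "/".
    UnknownOperator(String),
}

/// Apply a binary operator to two `i64` values, reporting an error
/// instead of wrapping or panicking when the result does not fit.
pub fn apply_binary(op: &str, left: i64, right: i64) -> Result<i64, SketchError> {
    let result = match op {
        "+" => left.checked_add(right),
        "-" => left.checked_sub(right),
        "*" => left.checked_mul(right),
        // `checked_div` yields `None` for division by zero and for
        // `i64::MIN / -1`, the one division that overflows.
        "/" => left.checked_div(right),
        _ => return Err(SketchError::UnknownOperator(op.to_string())),
    };

    result.ok_or_else(|| SketchError::Arithmetic(op.to_string()))
}
```

For example, `apply_binary("+", i64::MAX, 1)` returns an `Arithmetic` error rather than silently wrapping around.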

--- a/src/syntax/location.rs
+++ b/src/syntax/location.rs
@@ -1,5 +1,9 @@
 use codespan_reporting::diagnostic::{Diagnostic, Label};

+/// A source location, for use in pointing users towards warnings and errors.
+///
+/// Internally, locations are very tied to the `codespan_reporting` library,
+/// and the primary use of them is to serve as anchors within that library.
 #[derive(Clone, Debug, Eq, PartialEq)]
 pub struct Location {
    file_idx: usize,
@@ -7,10 +11,22 @@ pub struct Location {
 }

 impl Location {
+    /// Generate a new `Location` from a file index and an offset from the
+    /// start of the file.
+    ///
+    /// The file index is based on the file database being used. See the
+    /// `codespan_reporting::files::SimpleFiles::add` function, which is
+    /// normally where we get this index.
    pub fn new(file_idx: usize, offset: usize) -> Self {
        Location { file_idx, offset }
    }

+    /// Generate a `Location` for a completely manufactured bit of code.
+    ///
+    /// Ideally, this is used only in testing, as any code we generate as
+    /// part of the compiler should, theoretically, be tied to some actual
+    /// location in the source code. That being said, this can be used in
+    /// a pinch ... just maybe try to avoid it if you can.
    pub fn manufactured() -> Self {
        Location {
            file_idx: 0,
@@ -18,27 +34,73 @@ impl Location {
        }
    }

+    /// Generate a primary label for a [`Diagnostic`], based on this source
+    /// location.
+    ///
+    /// Note, this is just the [`Label`], you'll want to fill in the [`Diagnostic`]
+    /// with a lot more information.
+    ///
+    /// Primary labels are the things that are the key cause of the message.
+    /// If, for example, it was an error to bind a variable named "x", and
+    /// then have another binding of a variable named "x", the second one
+    /// would likely be the primary label (because that's where the error
+    /// actually happened), but you'd probably want to make the first location
+    /// the secondary label to help users find it.
    pub fn primary_label(&self) -> Label<usize> {
        Label::primary(self.file_idx, self.offset..self.offset)
    }

+    /// Generate a secondary label for a [`Diagnostic`], based on this source
+    /// location.
+    ///
+    /// Note, this is just the [`Label`], you'll want to fill in the [`Diagnostic`]
+    /// with a lot more information.
+    ///
+    /// Secondary labels are the things that are involved in the message, but
+    /// aren't necessarily a problem in and of themselves. If, for example, it
+    /// was an error to bind a variable named "x", and then have another binding
+    /// of a variable named "x", the second one would likely be the primary
+    /// label (because that's where the error actually happened), but you'd
+    /// probably want to make the first location the secondary label to help
+    /// users find it.
    pub fn secondary_label(&self) -> Label<usize> {
        Label::secondary(self.file_idx, self.offset..self.offset)
    }

-    pub fn range_label(&self, end: &Location) -> Vec<Label<usize>> {
-        if self.file_idx == end.file_idx {
-            vec![Label::primary(self.file_idx, self.offset..end.offset)]
-        } else if self.file_idx == 0 {
-            // if this is a manufactured item, then ... just try the other one
-            vec![Label::primary(end.file_idx, end.offset..end.offset)]
+    /// Given this location and another, generate a primary label that
+    /// specifies the area between those two locations.
+    ///
+    /// See [`Self::primary_label`] for some discussion of primary versus
+    /// secondary labels. If the two locations are the same, this method does
+    /// the exact same thing as [`Self::primary_label`]. If this item was
+    /// generated by [`Self::manufactured`], it will act as if you'd called
+    /// `primary_label` on the argument. Otherwise, it will generate the obvious
+    /// span.
+    ///
+    /// This function will return `None` only in the case that you provide
+    /// locations from two different files, which it cannot sensibly handle.
+    pub fn range_label(&self, end: &Location) -> Option<Label<usize>> {
+        if self.file_idx == 0 {
+            return Some(end.primary_label());
+        }
+
+        if self.file_idx != end.file_idx {
+            return None;
+        }
+
+        if self.offset > end.offset {
+            Some(Label::primary(self.file_idx, end.offset..self.offset))
        } else {
-            // we'll just pick the first location if this is in two different
-            // files
-            vec![Label::primary(self.file_idx, self.offset..self.offset)]
+            Some(Label::primary(self.file_idx, self.offset..end.offset))
        }
    }

+    /// Return an error diagnostic centered at this location.
+    ///
+    /// Note that this [`Diagnostic`] will have no information associated with
+    /// it other than that (a) there is an error, and (b) that the error is at
+    /// this particular location. You'll need to extend it with actually useful
+    /// information, like what kind of error it is.
    pub fn error(&self) -> Diagnostic<usize> {
        Diagnostic::error().with_labels(vec![Label::primary(
            self.file_idx,
@@ -46,6 +108,12 @@ impl Location {
        )])
    }

+    /// Return an error diagnostic centered at this location, with the given message.
+    ///
+    /// This is much more useful than [`Self::error`], because it actually provides
+    /// the user with some guidance. That being said, you still might want to add
+    /// even more information to it, using [`Diagnostic::with_labels`],
+    /// [`Diagnostic::with_notes`], or [`Diagnostic::with_code`].
    pub fn labelled_error(&self, msg: &str) -> Diagnostic<usize> {
        Diagnostic::error().with_labels(vec![Label::primary(
            self.file_idx,
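The branch structure of the new `range_label` can be exercised without pulling in `codespan_reporting`. The sketch below mirrors the logic in the diff but returns a plain `(file_idx, range)` pair instead of a `Label`; `Loc` and `range_span` are hypothetical names used only for this illustration.

```rust
use std::ops::Range;

/// Hypothetical stand-in for `Location`.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Loc {
    pub file_idx: usize,
    pub offset: usize,
}

/// Mirrors the decisions made by `range_label`, returning the file index
/// and byte range that the label would cover.
pub fn range_span(start: &Loc, end: &Loc) -> Option<(usize, Range<usize>)> {
    // As in the diff, file index 0 is treated as "manufactured", so we
    // fall back to an empty span at the other location.
    if start.file_idx == 0 {
        return Some((end.file_idx, end.offset..end.offset));
    }

    // A span across two different files cannot be represented.
    if start.file_idx != end.file_idx {
        return None;
    }

    // Order the offsets so the range always runs forwards.
    if start.offset > end.offset {
        Some((start.file_idx, end.offset..start.offset))
    } else {
        Some((start.file_idx, start.offset..end.offset))
    }
}
```

Note that passing the two locations in either order yields the same span, which is the behavior the final `if`/`else` in the diff adds over the old version.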
--- a/src/syntax/parser.lalrpop
+++ b/src/syntax/parser.lalrpop
@@ -1,14 +1,32 @@
+//! The parser for NGR!
+//!
+//! This file contains the grammar for the NGR language; a grammar is a nice,
+//! machine-readable way to describe how your language's syntax works. For
+//! example, here we describe a program as a series of statements, statements
+//! as either variable binding or print statements, etc. As the grammar gets
+//! more complicated, using tools like [`lalrpop`] becomes even more important.
+//! (Although, at some point, things can become so complicated that you might
+//! eventually want to leave lalrpop behind.)
+//!
 use crate::syntax::{LexerError, Location};
 use crate::syntax::ast::{Program,Statement,Expression,Value};
 use crate::syntax::tokens::Token;
 use internment::ArcIntern;

+// one cool thing about lalrpop: we can pass arguments. in this case, the
+// file index of the file we're parsing. we combine this with the file offset
+// that Logos gives us to make a [`crate::syntax::Location`].
 grammar(file_idx: usize);

+// this is a slightly odd way to describe this, but: consider this section
+// as describing the stuff that is external to the lalrpop grammar that it
+// needs to know to do its job.
 extern {
-    type Location = usize;
+    type Location = usize; // Logos, our lexer, implements locations as
+                           // offsets from the start of the file.
    type Error = LexerError;

+    // here we redeclare all of the tokens.
    enum Token {
        "=" => Token::Equals,
        ";" => Token::Semi,
@@ -22,57 +40,123 @@ extern {
        "*" => Token::Operator('*'),
        "/" => Token::Operator('/'),

+        // the previous items just match their tokens, and if you try
+        // to name and use "their value", you get their source location.
+        // For these, we want "their value" to be their actual contents,
+        // which is why we put their types in angle brackets.
        "<num>" => Token::Number((<Option<u8>>,<i64>)),
        "<var>" => Token::Variable(<ArcIntern<String>>),
    }
 }

 pub Program: Program = {
+    // a program is just a set of statements
    <stmts:Statements> => Program {
        statements: stmts
    }
 }

 Statements: Vec<Statement> = {
+    // a list of statements is either a list of statements followed by
+    // one more statement (note, here, that you can name the result of a
+    // sub-parse using <name: subrule>) ...
    <mut stmts:Statements> <stmt:Statement> => {
        stmts.push(stmt);
        stmts
    },
+
+    // ... or it's nothing. This may feel like an awkward way to define
+    // lists of things -- and it is a bit awkward -- but there are actual
+    // technical reasons that you want to (a) use recursion to define
+    // these, and (b) use *left* recursion, specifically. That's why, in
+    // this file, all of the recursive cases are to the left, like they
+    // are above.
+    //
+    // the details of why left recursion is better are actually pretty
+    // fiddly and in the weeds, and if you're interested you should look
+    // up LALR parsers versus LL parsers; both their differences and how
+    // they're constructed, as they're kind of neat.
+    //
+    // but if you're just writing grammars with lalrpop, then you should
+    // just remember that you should always use left recursion, and be
+    // done with it. 
    => {
        Vec::new()
    }
 }

 pub Statement: Statement = {
+    // A statement can be a variable binding. Note, here, that we use this
+    // funny @L thing to get the source location before the variable, so that
+    // we can say that this statement spans across everything.
    <l:@L> <v:"<var>"> "=" <e:Expression> ";" => Statement::Binding(Location::new(file_idx, l), v.to_string(), e),
+
+    // Alternatively, a statement can just be a print statement.
    "print" <l:@L> <v:"<var>"> ";" => Statement::Print(Location::new(file_idx, l), v.to_string()),
 }

+// Expressions! Expressions are a little fiddly, because we're going to
+// use a little bit of a trick to make sure that we get operator precedence
+// right. The trick works by creating a top-level `Expression` grammar entry
+// that just points to the thing with the *weakest* precedence. In this case,
+// we have addition, subtraction, multiplication, and division, so addition
+// and subtraction have the weakest precedence.
+//
+// Then, as we go down the precedence tree, each item will recurse (left!)
+// to other items at the same precedence level. The right hand operator, for
+// binary operators (which is all of ours, at the moment) will then be one
+// level stronger precedence. In addition, we'll let people just fall through
+// to the next level; so if there isn't an addition or subtraction, we'll just
+// fall through to the multiplication/division case.
+//
+// Finally, at the bottom, we'll have the core expressions (like constants,
+// variables, etc.) as well as a parenthesized version of `Expression`, which
+// gets us right up top again.
+//
+// It's a little hard to give an easy intuition for why this works to solve
+// all your operator precedence problems, but for myself it helped to run
+// through a few examples. Consider thinking about how you want to parse
+// something like "1 + 2 * 3", for example, versus "1 + 2 + 3" or
+// "1 * 2 + 3", and hopefully that'll help.
 Expression: Expression = {
    AdditiveExpression,
 }

+// we group addition and subtraction under the heading "additive"
 AdditiveExpression: Expression = {
    <e1:AdditiveExpression> <l:@L> "+" <e2:MultiplicativeExpression> => Expression::Primitive(Location::new(file_idx, l), "+".to_string(), vec![e1, e2]),
    <e1:AdditiveExpression> <l:@L> "-" <e2:MultiplicativeExpression> => Expression::Primitive(Location::new(file_idx, l), "-".to_string(), vec![e1, e2]),
    MultiplicativeExpression,
 }

+// similarly, we group multiplication and division under "multiplicative"
 MultiplicativeExpression: Expression = {
    <e1:MultiplicativeExpression> <l:@L> "*" <e2:AtomicExpression> => Expression::Primitive(Location::new(file_idx, l), "*".to_string(), vec![e1, e2]),
    <e1:MultiplicativeExpression> <l:@L> "/" <e2:AtomicExpression> => Expression::Primitive(Location::new(file_idx, l), "/".to_string(), vec![e1, e2]),
    AtomicExpression,
 }

+// finally, we describe our lowest-level expressions as "atomic", because
+// they cannot be further divided into parts
 AtomicExpression: Expression = {
+    // just a variable reference
    <l:@L> <v:"<var>"> => Expression::Reference(Location::new(file_idx, l), v.to_string()),
+    // just a number
    <l:@L> <n:"<num>"> => {
        let val = Value::Number(n.0, n.1);
        Expression::Value(Location::new(file_idx, l), val)
    },
+    // a tricky case: also just a number, but using a negative sign. an
+    // alternative way to do this -- and we may do this eventually -- is
+    // to implement a unary negation expression. this has the odd effect
+    // that the user never actually writes down a negative number; they just
+    // write positive numbers which are immediately sent to a negation
+    // primitive!
    <l:@L> "-" <n:"<num>"> => {
        let val = Value::Number(n.0, -n.1);
        Expression::Value(Location::new(file_idx, l), val)
    },
+    // finally, let people parenthesize expressions and get back to a
+    // lower precedence
    "(" <e:Expression> ")" => e,
 }
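To run through the examples suggested in the grammar comments ("1 + 2 * 3", "1 + 2 + 3", "1 * 2 + 3"), it can help to see the same layering written by hand. The sketch below is not generated by lalrpop and is not part of this commit; `Parser` and `parse` are hypothetical names. It follows the same three levels (additive, multiplicative, atomic) but renders the resulting tree as a fully parenthesized string. In a hand-written parser the left-recursive rules become loops, which is what produces left associativity. It handles only unsigned integers and parentheses, with no variables or negative literals.

```rust
/// Hypothetical mini-parser mirroring the grammar's precedence layers.
/// Whitespace is stripped up front, which is a simplification.
pub struct Parser {
    chars: Vec<char>,
    pos: usize,
}

impl Parser {
    pub fn new(input: &str) -> Self {
        let chars = input.chars().filter(|c| !c.is_whitespace()).collect();
        Parser { chars, pos: 0 }
    }

    fn peek(&self) -> Option<char> {
        self.chars.get(self.pos).copied()
    }

    /// Expression: just points at the weakest precedence level.
    pub fn expression(&mut self) -> Option<String> {
        self.additive()
    }

    /// AdditiveExpression: the right operand is one level stronger.
    fn additive(&mut self) -> Option<String> {
        let mut left = self.multiplicative()?;
        while let Some(op) = self.peek().filter(|c| *c == '+' || *c == '-') {
            self.pos += 1;
            let right = self.multiplicative()?;
            left = format!("({} {} {})", left, op, right);
        }
        Some(left)
    }

    /// MultiplicativeExpression: the right operand is atomic.
    fn multiplicative(&mut self) -> Option<String> {
        let mut left = self.atomic()?;
        while let Some(op) = self.peek().filter(|c| *c == '*' || *c == '/') {
            self.pos += 1;
            let right = self.atomic()?;
            left = format!("({} {} {})", left, op, right);
        }
        Some(left)
    }

    /// AtomicExpression: a number, or a parenthesized `Expression`.
    fn atomic(&mut self) -> Option<String> {
        match self.peek()? {
            '(' => {
                self.pos += 1;
                let inner = self.expression()?;
                if self.peek()? != ')' {
                    return None;
                }
                self.pos += 1;
                Some(inner)
            }
            c if c.is_ascii_digit() => {
                let start = self.pos;
                while self.peek().map_or(false, |c| c.is_ascii_digit()) {
                    self.pos += 1;
                }
                Some(self.chars[start..self.pos].iter().collect())
            }
            _ => None,
        }
    }
}

/// Parse a whole input, returning `None` on any error or leftover input.
pub fn parse(input: &str) -> Option<String> {
    let mut parser = Parser::new(input);
    let tree = parser.expression()?;
    if parser.pos == parser.chars.len() {
        Some(tree)
    } else {
        None
    }
}
```

Under these assumptions, `parse("1 + 2 * 3")` groups the multiplication first, while `parse("1 + 2 + 3")` groups the leftmost addition first.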
--- a/src/syntax/simplify.rs
+++ b/src/syntax/simplify.rs
@@ -1,63 +0,0 @@
-use crate::syntax::ast::{Expression, Program, Statement};
-
-impl Program {
-    pub fn simplify(mut self) -> Self {
-        let mut new_statements = Vec::new();
-        let mut gensym_index = 1;
-
-        for stmt in self.statements.drain(..) {
-            new_statements.append(&mut stmt.simplify(&mut gensym_index));
-        }
-
-        self.statements = new_statements;
-        self
-    }
-}
-
-impl Statement {
-    pub fn simplify(self, gensym_index: &mut usize) -> Vec<Statement> {
-        let mut new_statements = vec![];
-
-        match self {
-            Statement::Print(_, _) => new_statements.push(self),
-            Statement::Binding(_, _, Expression::Reference(_, _)) => new_statements.push(self),
-            Statement::Binding(_, _, Expression::Value(_, _)) => new_statements.push(self),
-            Statement::Binding(loc, name, value) => {
-                let (mut prereqs, new_value) = value.rebind(&name, gensym_index);
-                new_statements.append(&mut prereqs);
-                new_statements.push(Statement::Binding(loc, name, new_value))
-            }
-        }
-
-        new_statements
-    }
-}
-
-impl Expression {
-    fn rebind(self, base_name: &str, gensym_index: &mut usize) -> (Vec<Statement>, Expression) {
-        match self {
-            Expression::Value(_, _) => (vec![], self),
-            Expression::Reference(_, _) => (vec![], self),
-            Expression::Primitive(loc, prim, mut expressions) => {
-                let mut prereqs = Vec::new();
-                let mut new_exprs = Vec::new();
-
-                for expr in expressions.drain(..) {
-                    let (mut cur_prereqs, arg) = expr.rebind(base_name, gensym_index);
-                    prereqs.append(&mut cur_prereqs);
-                    new_exprs.push(arg);
-                }
-
-                let new_name = format!("<{}:{}>", base_name, *gensym_index);
-                *gensym_index += 1;
-                prereqs.push(Statement::Binding(
-                    loc.clone(),
-                    new_name.clone(),
-                    Expression::Primitive(loc.clone(), prim, new_exprs),
-                ));
-
-                (prereqs, Expression::Reference(loc, new_name))
-            }
-        }
-    }
-}
--- a/src/syntax/tokens.rs
+++ b/src/syntax/tokens.rs
@@ -4,8 +4,30 @@ use std::fmt;
 use std::num::ParseIntError;
 use thiserror::Error;

+/// A single token of the input stream; used to help the parsing go down
+/// more easily.
+///
+/// The key way to generate this structure is via the [`Logos`] trait.
+/// See the [`logos`] documentation for more information; we use the
+/// [`Token::lexer`] function internally.
+///
+/// The first step in the compilation process is turning the raw string
+/// data (in UTF-8, which is its own joy) into a sequence of more sensible
+/// tokens. Here, for example, we turn "x=5" into three tokens: a
+/// [`Token::Variable`] for "x", a [`Token::Equals`] for the "=", and
+/// then a [`Token::Number`] for the "5". Later on, we'll worry about
+/// making sense of those three tokens.
+///
+/// For now, our list of tokens is relatively straightforward. We'll
+/// need/want to extend these later.
+///
+/// The [`std::fmt::Display`] implementation for [`Token`] should
+/// round-trip; if you lex a string generated with the [`std::fmt::Display`]
+/// trait, you should get back the exact same token.
 #[derive(Logos, Clone, Debug, PartialEq, Eq)]
 pub enum Token {
+    // Our first set of tokens are simple characters that we're
+    // going to use to structure NGR programs.
    #[token("=")]
    Equals,

@@ -18,12 +40,20 @@ pub enum Token {
    #[token(")")]
    RightParen,

+    // Next we take care of any reserved words; I always like to put
+    // these before we start recognizing more complicated regular
+    // expressions. I don't think it matters, but it works for me.
    #[token("print")]
    Print,

+    // Next are the operators for NGR. We only have 4, now, but
+    // we might extend these later, or even make them user-definable!
    #[regex(r"[+\-*/]", |v| v.slice().chars().next())]
    Operator(char),

+    /// Numbers capture both the value we read from the input,
+    /// converted to an `i64`, as well as the base the user used
+    /// to write the number, if they did so.
    #[regex(r"0b[01]+", |v| parse_number(Some(2), v))]
    #[regex(r"0o[0-7]+", |v| parse_number(Some(8), v))]
    #[regex(r"0d[0-9]+", |v| parse_number(Some(10), v))]
@@ -31,12 +61,23 @@ pub enum Token {
    #[regex(r"[0-9]+", |v| parse_number(None, v))]
    Number((Option<u8>, i64)),

+    // Variables; this is a very standard, simple set of characters
+    // for variables, but feel free to experiment with more complicated
+    // things. I chose to force variables to start with a lower case
+    // letter, too.
    #[regex(r"[a-z][a-zA-Z0-9_]*", |v| ArcIntern::new(v.slice().to_string()))]
    Variable(ArcIntern<String>),

+    // the next token will be an error token
    #[error]
+    // we're actually just going to skip whitespace, though
    #[regex(r"[ \t\r\n\f]+", logos::skip)]
+    // this is an extremely simple version of comments, just line
+    // comments. More complicated /* */ comments can be harder to
+    // implement, and didn't seem worth it at the time.
    #[regex(r"//.*", logos::skip)]
+    /// This token represents that some core error happened in lexing;
+    /// possibly that something didn't match anything at all.
    Error,
 }

@@ -63,19 +104,28 @@ impl fmt::Display for Token {
    }
 }

+/// A sudden and unexpected error in the lexer.
 #[derive(Debug, Error, PartialEq, Eq)]
 pub enum LexerError {
+    /// The `usize` here is the offset at which we ran into the problem,
+    /// measured from the start of the file.
    #[error("Failed lexing at {0}")]
    LexFailure(usize),
 }

 #[cfg(test)]
 impl Token {
+    /// Create a variable token with the given name. Very handy for
+    /// testing.
    pub(crate) fn var(s: &str) -> Token {
        Token::Variable(ArcIntern::new(s.to_string()))
    }
 }

+/// Parse a number in the given base, returning a pair of the base and the
+/// parsed number. This is just a helper used for all of the number
+/// regular expression cases, which kicks off to the obvious Rust
+/// standard library function.
 fn parse_number(
    base: Option<u8>,
    value: &Lexer<Token>,
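The body of `parse_number` is not shown in this diff, so the following is only a guess at its general shape, assuming the "obvious Rust standard library function" is `i64::from_str_radix`. `parse_number_sketch` is a hypothetical name, and it takes the matched text as a `&str` rather than a `Lexer<Token>` so that it stands alone.

```rust
use std::num::ParseIntError;

/// Parse the text matched by one of the number regexes.
///
/// When `base` is `Some`, the text is assumed to start with a two-character
/// prefix such as "0b" or "0o", which is skipped. The base must be in the
/// range 2..=36, because `from_str_radix` panics otherwise.
pub fn parse_number_sketch(
    base: Option<u8>,
    text: &str,
) -> Result<(Option<u8>, i64), ParseIntError> {
    let (radix, digits) = match base {
        // No explicit base: plain decimal digits, no prefix to strip.
        None => (10, text),
        // Explicit base: drop the prefix; an empty remainder becomes a
        // parse error rather than a panic.
        Some(radix) => (u32::from(radix), text.get(2..).unwrap_or("")),
    };

    let value = i64::from_str_radix(digits, radix)?;
    Ok((base, value))
}
```

Keeping the `Option<u8>` in the result preserves how the user wrote the number, which matches the `Number(Option<u8>, i64)` shape used by both `Token` and `Value` in this commit.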
--- a/src/syntax/validate.rs
+++ b/src/syntax/validate.rs
@@ -2,6 +2,13 @@ use crate::syntax::{Expression, Location, Program, Statement};
 use codespan_reporting::diagnostic::Diagnostic;
 use std::collections::HashMap;

+/// An error we found while validating the input program.
+///
+/// These errors indicate that we should stop trying to compile
+/// the program, because it's just fundamentally broken in a way
+/// that we're not going to be able to work through. As with most
+/// of these errors, we recommend converting them to a [`Diagnostic`]
+/// and using [`codespan_reporting`] to present them to the user.
 pub enum Error {
    UnboundVariable(Location, String),
 }
@@ -16,6 +23,13 @@ impl From<Error> for Diagnostic<usize> {
    }
 }

+/// A problem we found validating the input that isn't critical.
+///
+/// These are things that the user might want to do something about,
+/// but we can keep going without it being a problem. As with most of
+/// these things, if you want to present this information to the user,
+/// the best way to do so is via [`From`] and [`Diagnostic`], and then
+/// interactions via [`codespan_reporting`].
 #[derive(Debug, PartialEq, Eq)]
 pub enum Warning {
    ShadowedVariable(Location, Location, String),
@@ -37,6 +51,11 @@ impl From<Warning> for Diagnostic<usize> {
 }

 impl Program {
+    /// Validate that the program makes semantic sense, not just syntactic sense.
+    ///
+    /// This checks for things like references to variables that don't exist, for
+    /// example, and generates warnings for things that are inadvisable but not
+    /// actually a problem.
    pub fn validate(&self) -> (Vec<Error>, Vec<Warning>) {
        let mut errors = vec![];
        let mut warnings = vec![];
@@ -53,6 +72,15 @@ impl Program {
 }

 impl Statement {
+    /// Validate that the statement makes semantic sense, not just syntactic sense.
+    ///
+    /// This checks for things like references to variables that don't exist, for
+    /// example, and generates warnings for things that are inadvisable but not
+    /// actually a problem. Since statements appear in a broader context, you'll
+    /// need to provide the set of variables that are bound where this statement
+    /// occurs. We use a `HashMap` to map these bound variables to the locations
+    /// where they're bound, because these locations are handy when generating errors
+    /// and warnings.
    pub fn validate(
        &self,
        bound_variables: &mut HashMap<String, Location>,