📜 Add better documentation across the compiler. (#3)

These changes pay particular attention to API endpoints, to try to
ensure that any rustdocs generated are detailed and sensible. A good
next step, eventually, might be to include doctest examples, as well.
For the moment, it's not clear that they would provide a lot of value,
though.

In addition, this does a couple refactors to simplify the code base in
ways that make things clearer or, at least, briefer.
2023-05-13 14:34:48 -05:00
parent f4594bf2cc
commit 1fbfd0c2d2
28 changed files with 1550 additions and 432 deletions

View File

@@ -1,12 +1,32 @@
use crate::syntax::Location;
/// The set of valid binary operators.
pub static BINARY_OPERATORS: &[&str] = &["+", "-", "*", "/"];
/// A structure representing a parsed program.
///
/// One `Program` is associated with exactly one input file, and the
/// vector is arranged in exactly the same order as the parsed file.
/// Because this is the syntax layer, the program is guaranteed to be
/// syntactically valid, but may be nonsense. There could be attempts
/// to use unbound variables, for example, until after someone runs
/// `validate` and it comes back without errors.
#[derive(Clone, Debug, PartialEq)]
pub struct Program {
pub statements: Vec<Statement>,
}
/// A parsed statement.
///
/// Statements are guaranteed to be syntactically valid, but may be
/// complete nonsense at the semantic level. Which is to say, all the
/// print statements were correctly formatted, and all the variables
/// referenced are definitely valid symbols, but they may not have
/// been defined or anything.
///
/// Note that equivalence testing on statements is independent of
/// source location; it is testing if the two statements say the same
/// thing, not if they are the exact same statement.
#[derive(Clone, Debug)]
pub enum Statement {
Binding(Location, String, Expression),
@@ -28,6 +48,12 @@ impl PartialEq for Statement {
}
}
/// An expression in the underlying syntax.
///
/// Like statements, these expressions are guaranteed to have been
/// formatted correctly, but may not actually make any sense. Also
/// like Statements, the [`PartialEq`] implementation does not take
/// source positions into account.
#[derive(Clone, Debug)]
pub enum Expression {
Value(Location, Value),
@@ -54,7 +80,9 @@ impl PartialEq for Expression {
}
}
/// A value from the source syntax
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Value {
/// The value of the number, and an optional base that it was written in
Number(Option<u8>, i64),
}
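The doc comments above say that equality on statements and expressions ignores source locations. As a rough illustration of what such a hand-written `PartialEq` can look like, here is a minimal sketch. The `SketchLocation` and `SketchStatement` types are simplified, hypothetical stand-ins, not the real `Location` and `Statement` from this crate, and the binding holds a bare `i64` instead of an `Expression`.

```rust
// Hypothetical, simplified stand-ins for the real AST types.
#[derive(Clone, Debug)]
pub struct SketchLocation {
    pub file_idx: usize,
    pub offset: usize,
}

#[derive(Clone, Debug)]
pub enum SketchStatement {
    Binding(SketchLocation, String, i64),
    Print(SketchLocation, String),
}

impl PartialEq for SketchStatement {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            // Compare the name and the bound value, but deliberately skip
            // the locations: two statements that say the same thing are
            // equal no matter where they appeared in the source.
            (SketchStatement::Binding(_, n1, v1), SketchStatement::Binding(_, n2, v2)) => {
                n1 == n2 && v1 == v2
            }
            (SketchStatement::Print(_, n1), SketchStatement::Print(_, n2)) => n1 == n2,
            // Different kinds of statement are never equal.
            _ => false,
        }
    }
}
```

With this shape, a binding of `x` to `5` parsed from one file compares equal to the same binding parsed from a different file or offset, which is handy when testing that a transformation preserved meaning.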

View File

@@ -4,11 +4,23 @@ use crate::eval::{EvalEnvironment, EvalError, Value};
use crate::syntax::{Expression, Program, Statement};
impl Program {
/// Evaluate the program, returning either an error or what it prints out when run.
///
/// Doing this evaluation is particularly useful for testing, to ensure that if we
/// modify a program in some way it does the same thing on both sides of the
/// transformation. It's also sometimes just nice to know what a program will be
/// doing.
///
/// Note that the errors here are slightly more strict than we enforce at runtime.
/// For example, we check for overflow and underflow errors during evaluation, and
/// we don't check for those in the compiled code.
pub fn eval(&self) -> Result<String, EvalError> {
let mut env = EvalEnvironment::empty();
let mut stdout = String::new();
for stmt in self.statements.iter() {
// at this point, evaluation is pretty simple. just walk through each
// statement, in order, and record printouts as we come to them.
match stmt {
Statement::Binding(_, name, value) => {
let actual_value = value.eval(&env)?;
@@ -40,6 +52,7 @@ impl Expression {
let mut arg_values = Vec::with_capacity(args.len());
for arg in args.iter() {
// yay, recursion! makes this pretty straightforward
arg_values.push(arg.eval(env)?);
}
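The `eval` documentation above notes that the evaluator checks for overflow and underflow even though compiled code does not. One way to get that behavior in Rust is the standard library's `checked_*` arithmetic. The sketch below assumes a tiny hypothetical expression type (`SketchExpr`, `SketchOp`, `SketchEvalError`); none of these names come from the crate, and the real `EvalEnvironment` and `EvalError` may look quite different.

```rust
use std::collections::HashMap;

// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum SketchEvalError {
    UnboundVariable(String),
    Overflow,
    DivideByZero,
}

// Hypothetical operator and expression types for this sketch only.
pub enum SketchOp {
    Add,
    Sub,
    Mul,
    Div,
}

pub enum SketchExpr {
    Number(i64),
    Reference(String),
    Primitive(SketchOp, Box<SketchExpr>, Box<SketchExpr>),
}

pub fn eval(expr: &SketchExpr, env: &HashMap<String, i64>) -> Result<i64, SketchEvalError> {
    match expr {
        SketchExpr::Number(n) => Ok(*n),
        SketchExpr::Reference(name) => env
            .get(name)
            .copied()
            .ok_or_else(|| SketchEvalError::UnboundVariable(name.clone())),
        SketchExpr::Primitive(op, left, right) => {
            // Recursion does the heavy lifting: evaluate both arguments first.
            let l = eval(left, env)?;
            let r = eval(right, env)?;
            // The checked_* methods return None instead of wrapping around.
            let result = match op {
                SketchOp::Add => l.checked_add(r),
                SketchOp::Sub => l.checked_sub(r),
                SketchOp::Mul => l.checked_mul(r),
                SketchOp::Div => {
                    if r == 0 {
                        return Err(SketchEvalError::DivideByZero);
                    }
                    l.checked_div(r)
                }
            };
            result.ok_or(SketchEvalError::Overflow)
        }
    }
}
```

Under these assumptions, `i64::MAX + 1` is reported as an error during evaluation, whereas compiled code that does not check would silently wrap.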

View File

@@ -1,5 +1,9 @@
use codespan_reporting::diagnostic::{Diagnostic, Label};
/// A source location, for use in pointing users towards warnings and errors.
///
/// Internally, locations are very tied to the `codespan_reporting` library,
/// and the primary use of them is to serve as anchors within that library.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct Location {
file_idx: usize,
@@ -7,10 +11,22 @@ pub struct Location {
}
impl Location {
/// Generate a new `Location` from a file index and an offset from the
/// start of the file.
///
/// The file index is based on the file database being used. See the
/// `codespan_reporting::files::SimpleFiles::add` function, which is
/// normally where we get this index.
pub fn new(file_idx: usize, offset: usize) -> Self {
Location { file_idx, offset }
}
/// Generate a `Location` for a completely manufactured bit of code.
///
/// Ideally, this is used only in testing, as any code we generate as
/// part of the compiler should, theoretically, be tied to some actual
/// location in the source code. That being said, this can be used in
/// a pinch ... just maybe try to avoid it if you can.
pub fn manufactured() -> Self {
Location {
file_idx: 0,
@@ -18,27 +34,73 @@ impl Location {
}
}
/// Generate a primary label for a [`Diagnostic`], based on this source
/// location.
///
/// Note, this is just the [`Label`], you'll want to fill in the [`Diagnostic`]
/// with a lot more information.
///
/// Primary labels are the things that are the key cause of the message.
/// If, for example, it was an error to bind a variable named "x", and
/// then have another binding of a variable named "x", the second one
/// would likely be the primary label (because that's where the error
/// actually happened), but you'd probably want to make the first location
/// the secondary label to help users find it.
pub fn primary_label(&self) -> Label<usize> {
Label::primary(self.file_idx, self.offset..self.offset)
}
/// Generate a secondary label for a [`Diagnostic`], based on this source
/// location.
///
/// Note, this is just the [`Label`], you'll want to fill in the [`Diagnostic`]
/// with a lot more information.
///
/// Secondary labels are the things that are involved in the message, but
/// aren't necessarily a problem in and of themselves. If, for example, it
/// was an error to bind a variable named "x", and then have another binding
/// of a variable named "x", the second one would likely be the primary
/// label (because that's where the error actually happened), but you'd
/// probably want to make the first location the secondary label to help
/// users find it.
pub fn secondary_label(&self) -> Label<usize> {
Label::secondary(self.file_idx, self.offset..self.offset)
}
pub fn range_label(&self, end: &Location) -> Vec<Label<usize>> {
if self.file_idx == end.file_idx {
vec![Label::primary(self.file_idx, self.offset..end.offset)]
} else if self.file_idx == 0 {
// if this is a manufactured item, then ... just try the other one
vec![Label::primary(end.file_idx, end.offset..end.offset)]
/// Given this location and another, generate a primary label that
/// specifies the area between those two locations.
///
/// See [`Self::primary_label`] for some discussion of primary versus
/// secondary labels. If the two locations are the same, this method does
/// the exact same thing as [`Self::primary_label`]. If this item was
/// generated by [`Self::manufactured`], it will act as if you'd called
/// `primary_label` on the argument. Otherwise, it will generate the obvious
/// span.
///
/// This function will return `None` only in the case that you provide
/// locations from two different files, which it cannot sensibly handle.
pub fn range_label(&self, end: &Location) -> Option<Label<usize>> {
if self.file_idx == 0 {
return Some(end.primary_label());
}
if self.file_idx != end.file_idx {
return None;
}
if self.offset > end.offset {
Some(Label::primary(self.file_idx, end.offset..self.offset))
} else {
// we'll just pick the first location if this is in two different
// files
vec![Label::primary(self.file_idx, self.offset..self.offset)]
Some(Label::primary(self.file_idx, self.offset..end.offset))
}
}
/// Return an error diagnostic centered at this location.
///
/// Note that this [`Diagnostic`] will have no information associated with
/// it other than that (a) there is an error, and (b) that the error is at
/// this particular location. You'll need to extend it with actually useful
/// information, like what kind of error it is.
pub fn error(&self) -> Diagnostic<usize> {
Diagnostic::error().with_labels(vec![Label::primary(
self.file_idx,
@@ -46,6 +108,12 @@ impl Location {
)])
}
/// Return an error diagnostic centered at this location, with the given message.
///
/// This is much more useful than [`Self::error`], because it actually provides
/// the user with some guidance. That being said, you still might want to add
/// even more information to it, using [`Diagnostic::with_labels`],
/// [`Diagnostic::with_notes`], or [`Diagnostic::with_code`].
pub fn labelled_error(&self, msg: &str) -> Diagnostic<usize> {
Diagnostic::error().with_labels(vec![Label::primary(
self.file_idx,
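The three cases that the `range_label` documentation describes (a manufactured start, two different files, and offsets given in either order) can be exercised without `codespan_reporting` at all. The sketch below is an approximation under stated assumptions: `SketchLoc` is a hypothetical stand-in for `Location`, and it returns a plain `(file index, byte range)` pair where the real method returns a `Label<usize>`.

```rust
use std::ops::Range;

// Hypothetical stand-in for the real `Location` type.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SketchLoc {
    file_idx: usize,
    offset: usize,
}

impl SketchLoc {
    pub fn new(file_idx: usize, offset: usize) -> Self {
        SketchLoc { file_idx, offset }
    }

    /// Compute the file index and byte span between `self` and `end`.
    pub fn range(&self, end: &SketchLoc) -> Option<(usize, Range<usize>)> {
        // File index 0 marks a manufactured location in this sketch, so we
        // fall back to an empty span at the other location.
        if self.file_idx == 0 {
            return Some((end.file_idx, end.offset..end.offset));
        }
        // A span across two different files makes no sense.
        if self.file_idx != end.file_idx {
            return None;
        }
        // Accept the two offsets in either order.
        let (low, high) = if self.offset > end.offset {
            (end.offset, self.offset)
        } else {
            (self.offset, end.offset)
        };
        Some((self.file_idx, low..high))
    }
}
```

Returning `Option` rather than a `Vec` makes the "two different files" case explicit at the call site, which appears to be the motivation for the signature change in this hunk.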

View File

@@ -1,14 +1,32 @@
//! The parser for NGR!
//!
//! This file contains the grammar for the NGR language; a grammar is a nice,
//! machine-readable way to describe how your language's syntax works. For
//! example, here we describe a program as a series of statements, statements
//! as either variable binding or print statements, etc. As the grammar gets
//! more complicated, using tools like [`lalrpop`] becomes even more important.
//! (Although, at some point, things can become so complicated that you might
//! eventually want to leave lalrpop behind.)
//!
use crate::syntax::{LexerError, Location};
use crate::syntax::ast::{Program,Statement,Expression,Value};
use crate::syntax::tokens::Token;
use internment::ArcIntern;
// one cool thing about lalrpop: we can pass arguments. in this case, the
// file index of the file we're parsing. we combine this with the file offset
// that Logos gives us to make a [`crate::syntax::Location`].
grammar(file_idx: usize);
// this is a slightly odd way to describe this, but: consider this section
// as describing the stuff that is external to the lalrpop grammar that it
// needs to know to do its job.
extern {
type Location = usize;
type Location = usize; // Logos, our lexer, implements locations as
// offsets from the start of the file.
type Error = LexerError;
// here we redeclare all of the tokens.
enum Token {
"=" => Token::Equals,
";" => Token::Semi,
@@ -22,57 +40,123 @@ extern {
"*" => Token::Operator('*'),
"/" => Token::Operator('/'),
// the previous items just match their tokens, and if you try
// to name and use "their value", you get their source location.
// For these, we want "their value" to be their actual contents,
// which is why we put their types in angle brackets.
"<num>" => Token::Number((<Option<u8>>,<i64>)),
"<var>" => Token::Variable(<ArcIntern<String>>),
}
}
pub Program: Program = {
// a program is just a set of statements
<stmts:Statements> => Program {
statements: stmts
}
}
Statements: Vec<Statement> = {
// a list of statements is either a list of statements followed by another
// statement (note, here, that you can name the result of a sub-parse
// using <name: subrule>) ...
<mut stmts:Statements> <stmt:Statement> => {
stmts.push(stmt);
stmts
},
// ... or it's nothing. This may feel like an awkward way to define
// lists of things -- and it is a bit awkward -- but there are actual
// technical reasons that you want to (a) use recursion to define
// these, and (b) use *left* recursion, specifically. That's why, in
// this file, all of the recursive cases are to the left, like they
// are above.
//
// the details of why left recursion is better are actually pretty
// fiddly and in the weeds, and if you're interested you should look
// up LALR parsers versus LL parsers; both their differences and how
// they're constructed, as they're kind of neat.
//
// but if you're just writing grammars with lalrpop, then you should
// just remember that you should always use left recursion, and be
// done with it.
=> {
Vec::new()
}
}
pub Statement: Statement = {
// A statement can be a variable binding. Note, here, that we use this
// funny @L thing to get the source location before the variable, so that
// we can say that this statement spans across everything.
<l:@L> <v:"<var>"> "=" <e:Expression> ";" => Statement::Binding(Location::new(file_idx, l), v.to_string(), e),
// Alternatively, a statement can just be a print statement.
"print" <l:@L> <v:"<var>"> ";" => Statement::Print(Location::new(file_idx, l), v.to_string()),
}
// Expressions! Expressions are a little fiddly, because we're going to
// use a little bit of a trick to make sure that we get operator precedence
// right. The trick works by creating a top-level `Expression` grammar entry
// that just points to the thing with the *weakest* precedence. In this case,
// we have addition, subtraction, multiplication, and division, so addition
// and subtraction have the weakest precedence.
//
// Then, as we go down the precedence tree, each item will recurse (left!)
// to other items at the same precedence level. The right hand operator, for
// binary operators (which is all of ours, at the moment) will then be one
// level stronger precedence. In addition, we'll let people just fall through
// to the next level; so if there isn't an addition or subtraction, we'll just
// fall through to the multiplication/division case.
//
// Finally, at the bottom, we'll have the core expressions (like constants,
// variables, etc.) as well as a parenthesized version of `Expression`, which
// gets us right up top again.
//
// It's a little hard to give an easy intuition for why this works to solve
// all your operator precedence problems, but for myself it helped to run
// through a few examples. Consider thinking about how you want to
// parse something like "1 + 2 * 3", for example, versus "1 + 2 + 3" or
// "1 * 2 + 3", and hopefully that'll help.
Expression: Expression = {
AdditiveExpression,
}
// we group addition and subtraction under the heading "additive"
AdditiveExpression: Expression = {
<e1:AdditiveExpression> <l:@L> "+" <e2:MultiplicativeExpression> => Expression::Primitive(Location::new(file_idx, l), "+".to_string(), vec![e1, e2]),
<e1:AdditiveExpression> <l:@L> "-" <e2:MultiplicativeExpression> => Expression::Primitive(Location::new(file_idx, l), "-".to_string(), vec![e1, e2]),
MultiplicativeExpression,
}
// similarly, we group multiplication and division under "multiplicative"
MultiplicativeExpression: Expression = {
<e1:MultiplicativeExpression> <l:@L> "*" <e2:AtomicExpression> => Expression::Primitive(Location::new(file_idx, l), "*".to_string(), vec![e1, e2]),
<e1:MultiplicativeExpression> <l:@L> "/" <e2:AtomicExpression> => Expression::Primitive(Location::new(file_idx, l), "/".to_string(), vec![e1, e2]),
AtomicExpression,
}
// finally, we describe our lowest-level expressions as "atomic", because
// they cannot be further divided into parts
AtomicExpression: Expression = {
// just a variable reference
<l:@L> <v:"<var>"> => Expression::Reference(Location::new(file_idx, l), v.to_string()),
// just a number
<l:@L> <n:"<num>"> => {
let val = Value::Number(n.0, n.1);
Expression::Value(Location::new(file_idx, l), val)
},
// a tricky case: also just a number, but using a negative sign. an
// alternative way to do this -- and we may do this eventually -- is
// to implement a unary negation expression. that approach has the odd effect
// that the user never actually writes down a negative number; they just
// write positive numbers which are immediately sent to a negation
// primitive!
<l:@L> "-" <n:"<num>"> => {
let val = Value::Number(n.0, -n.1);
Expression::Value(Location::new(file_idx, l), val)
},
// finally, let people parenthesize expressions and get back to a
// lower precedence
"(" <e:Expression> ")" => e,
}
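To make the precedence layering in the grammar comments concrete, here is a small hand-written parser with the same three levels (additive, multiplicative, atomic). This is only a sketch and not how lalrpop works internally: a recursive-descent parser cannot use left recursion directly, so each level uses a loop that builds the same left-leaning tree. The `SketchExpr` type, the `parse` function, and the whitespace-separated token format are all assumptions made for the example.

```rust
#[derive(Debug, PartialEq)]
pub enum SketchExpr {
    Num(i64),
    Bin(char, Box<SketchExpr>, Box<SketchExpr>),
}

pub struct SketchParser<'a> {
    tokens: Vec<&'a str>,
    pos: usize,
}

impl<'a> SketchParser<'a> {
    pub fn new(input: &'a str) -> Self {
        // Tokens must be separated by whitespace in this sketch.
        SketchParser { tokens: input.split_whitespace().collect(), pos: 0 }
    }

    fn peek(&self) -> Option<&'a str> {
        self.tokens.get(self.pos).copied()
    }

    // Weakest precedence: "+" and "-". The right-hand side is one level
    // stronger, and the loop makes the result left-associative.
    pub fn additive(&mut self) -> Option<SketchExpr> {
        let mut left = self.multiplicative()?;
        while let Some(op) = self.peek().filter(|t| *t == "+" || *t == "-") {
            self.pos += 1;
            let right = self.multiplicative()?;
            left = SketchExpr::Bin(op.chars().next()?, Box::new(left), Box::new(right));
        }
        Some(left)
    }

    // Next level: "*" and "/", falling through to atoms.
    fn multiplicative(&mut self) -> Option<SketchExpr> {
        let mut left = self.atomic()?;
        while let Some(op) = self.peek().filter(|t| *t == "*" || *t == "/") {
            self.pos += 1;
            let right = self.atomic()?;
            left = SketchExpr::Bin(op.chars().next()?, Box::new(left), Box::new(right));
        }
        Some(left)
    }

    // Strongest level: numbers, or a parenthesized expression that takes
    // us right back up to the top.
    fn atomic(&mut self) -> Option<SketchExpr> {
        let tok = self.peek()?;
        self.pos += 1;
        if tok == "(" {
            let inner = self.additive()?;
            if self.peek()? != ")" {
                return None;
            }
            self.pos += 1;
            Some(inner)
        } else {
            tok.parse::<i64>().ok().map(SketchExpr::Num)
        }
    }
}

/// Parse a whole input, failing if any tokens are left over.
pub fn parse(input: &str) -> Option<SketchExpr> {
    let mut parser = SketchParser::new(input);
    let expr = parser.additive()?;
    if parser.pos == parser.tokens.len() { Some(expr) } else { None }
}
```

Running the examples suggested above through this sketch: `1 + 2 * 3` groups the multiplication first, `1 - 2 - 3` groups as `(1 - 2) - 3`, and `( 1 + 2 ) * 3` lets the parentheses override the default.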

View File

@@ -1,63 +0,0 @@
use crate::syntax::ast::{Expression, Program, Statement};
impl Program {
pub fn simplify(mut self) -> Self {
let mut new_statements = Vec::new();
let mut gensym_index = 1;
for stmt in self.statements.drain(..) {
new_statements.append(&mut stmt.simplify(&mut gensym_index));
}
self.statements = new_statements;
self
}
}
impl Statement {
pub fn simplify(self, gensym_index: &mut usize) -> Vec<Statement> {
let mut new_statements = vec![];
match self {
Statement::Print(_, _) => new_statements.push(self),
Statement::Binding(_, _, Expression::Reference(_, _)) => new_statements.push(self),
Statement::Binding(_, _, Expression::Value(_, _)) => new_statements.push(self),
Statement::Binding(loc, name, value) => {
let (mut prereqs, new_value) = value.rebind(&name, gensym_index);
new_statements.append(&mut prereqs);
new_statements.push(Statement::Binding(loc, name, new_value))
}
}
new_statements
}
}
impl Expression {
fn rebind(self, base_name: &str, gensym_index: &mut usize) -> (Vec<Statement>, Expression) {
match self {
Expression::Value(_, _) => (vec![], self),
Expression::Reference(_, _) => (vec![], self),
Expression::Primitive(loc, prim, mut expressions) => {
let mut prereqs = Vec::new();
let mut new_exprs = Vec::new();
for expr in expressions.drain(..) {
let (mut cur_prereqs, arg) = expr.rebind(base_name, gensym_index);
prereqs.append(&mut cur_prereqs);
new_exprs.push(arg);
}
let new_name = format!("<{}:{}>", base_name, *gensym_index);
*gensym_index += 1;
prereqs.push(Statement::Binding(
loc.clone(),
new_name.clone(),
Expression::Primitive(loc.clone(), prim, new_exprs),
));
(prereqs, Expression::Reference(loc, new_name))
}
}
}
}

View File

@@ -4,8 +4,30 @@ use std::fmt;
use std::num::ParseIntError;
use thiserror::Error;
/// A single token of the input stream; used to help the parsing go down
/// more easily.
///
/// The key way to generate this structure is via the [`Logos`] trait.
/// See the [`logos`] documentation for more information; we use the
/// [`Token::lexer`] function internally.
///
/// The first step in the compilation process is turning the raw string
/// data (in UTF-8, which is its own joy) into a sequence of more sensible
/// tokens. Here, for example, we turn "x=5" into three tokens: a
/// [`Token::Variable`] for "x", a [`Token::Equals`] for the "=", and
/// then a [`Token::Number`] for the "5". Later on, we'll worry about
/// making sense of those three tokens.
///
/// For now, our list of tokens is relatively straightforward. We'll
/// need/want to extend these later.
///
/// The [`std::fmt::Display`] implementation for [`Token`] should
/// round-trip; if you lex a string generated with the [`std::fmt::Display`]
/// trait, you should get back the exact same token.
#[derive(Logos, Clone, Debug, PartialEq, Eq)]
pub enum Token {
// Our first set of tokens are simple characters that we're
// going to use to structure NGR programs.
#[token("=")]
Equals,
@@ -18,12 +40,20 @@ pub enum Token {
#[token(")")]
RightParen,
// Next we take care of any reserved words; I always like to put
// these before we start recognizing more complicated regular
// expressions. I don't think it matters, but it works for me.
#[token("print")]
Print,
// Next are the operators for NGR. We only have 4, now, but
// we might extend these later, or even make them user-definable!
#[regex(r"[+\-*/]", |v| v.slice().chars().next())]
Operator(char),
/// Numbers capture both the value we read from the input,
/// converted to an `i64`, as well as the base the user used
/// to write the number, if they did so.
#[regex(r"0b[01]+", |v| parse_number(Some(2), v))]
#[regex(r"0o[0-7]+", |v| parse_number(Some(8), v))]
#[regex(r"0d[0-9]+", |v| parse_number(Some(10), v))]
@@ -31,12 +61,23 @@ pub enum Token {
#[regex(r"[0-9]+", |v| parse_number(None, v))]
Number((Option<u8>, i64)),
// Variables; this is a very standard, simple set of characters
// for variables, but feel free to experiment with more complicated
// things. I chose to force variables to start with a lower case
// letter, too.
#[regex(r"[a-z][a-zA-Z0-9_]*", |v| ArcIntern::new(v.slice().to_string()))]
Variable(ArcIntern<String>),
// the next token will be an error token
#[error]
// we're actually just going to skip whitespace, though
#[regex(r"[ \t\r\n\f]+", logos::skip)]
// this is an extremely simple version of comments, just line
// comments. More complicated /* */ comments can be harder to
// implement, and didn't seem worth it at the time.
#[regex(r"//.*", logos::skip)]
/// This token represents that some core error happened in lexing;
/// possibly that something didn't match anything at all.
Error,
}
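The `Token` documentation describes turning "x=5" into a variable, an equals sign, and a number. The real lexer is generated by Logos from the attributes above. Purely as an illustration of the idea, the hand-rolled sketch below handles just those three token kinds plus whitespace. `SketchToken` and `lex` are hypothetical names, and the sketch ignores number bases, operators, and comments.

```rust
// Hypothetical, much smaller token type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum SketchToken {
    Equals,
    Number(i64),
    Variable(String),
}

/// Split the input into tokens, returning `None` on anything unexpected.
pub fn lex(input: &str) -> Option<Vec<SketchToken>> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            // Whitespace is skipped, never turned into a token.
            i += 1;
        } else if c == '=' {
            tokens.push(SketchToken::Equals);
            i += 1;
        } else if c.is_ascii_digit() {
            // Consume a run of decimal digits.
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            let text: String = chars[start..i].iter().collect();
            tokens.push(SketchToken::Number(text.parse().ok()?));
        } else if c.is_ascii_lowercase() {
            // Variables start with a lower case letter, then allow
            // letters, digits, and underscores.
            let start = i;
            while i < chars.len() && (chars[i].is_ascii_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            tokens.push(SketchToken::Variable(chars[start..i].iter().collect()));
        } else {
            return None;
        }
    }
    Some(tokens)
}
```

A generated lexer is usually faster and less error-prone than this, but the hand-written version shows what each regular expression in the real `Token` definition is standing in for.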
@@ -63,19 +104,28 @@ impl fmt::Display for Token {
}
}
/// A sudden and unexpected error in the lexer.
#[derive(Debug, Error, PartialEq, Eq)]
pub enum LexerError {
/// The `usize` here is the offset at which we ran into the problem, given
/// from the start of the file.
#[error("Failed lexing at {0}")]
LexFailure(usize),
}
#[cfg(test)]
impl Token {
/// Create a variable token with the given name. Very handy for
/// testing.
pub(crate) fn var(s: &str) -> Token {
Token::Variable(ArcIntern::new(s.to_string()))
}
}
/// Parse a number in the given base, return a pair of the base and the
/// parsed number. This is just a helper used for all of the number
/// regular expression cases, which kicks off to the obvious Rust
/// standard library function.
fn parse_number(
base: Option<u8>,
value: &Lexer<Token>,
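The body of `parse_number` is cut off in this hunk, so the following is only a guess at its general shape: a sketch that works on a plain `&str` instead of a `Lexer<Token>`, assumes every explicit base comes with a two-character prefix such as `0b`, and delegates to `i64::from_str_radix`. The name `parse_number_sketch` is hypothetical.

```rust
use std::num::ParseIntError;

/// Parse `text` in the given base, returning the base alongside the value.
/// Assumes that when `base` is `Some`, the text starts with a two-character
/// prefix (for example "0b" or "0o") that should be skipped.
pub fn parse_number_sketch(
    base: Option<u8>,
    text: &str,
) -> Result<(Option<u8>, i64), ParseIntError> {
    let (radix, digits) = match base {
        // No explicit base means plain decimal, with no prefix to strip.
        None => (10u8, text),
        // `get` avoids a panic if the text is shorter than the prefix; an
        // empty digit string then fails in `from_str_radix` below.
        Some(radix) => (radix, text.get(2..).unwrap_or("")),
    };
    // Note: `from_str_radix` panics if the radix is outside 2..=36, so the
    // caller must only pass sensible bases.
    let value = i64::from_str_radix(digits, u32::from(radix))?;
    Ok((base, value))
}
```

Keeping the original base in the result is what lets `Value::Number` remember how the user wrote the literal, so it can be printed back the same way.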

View File

@@ -2,6 +2,13 @@ use crate::syntax::{Expression, Location, Program, Statement};
use codespan_reporting::diagnostic::Diagnostic;
use std::collections::HashMap;
/// An error we found while validating the input program.
///
/// These errors indicate that we should stop trying to compile
/// the program, because it's just fundamentally broken in a way
/// that we're not going to be able to work through. As with most
/// of these errors, we recommend converting them to a [`Diagnostic`]
/// and using [`codespan_reporting`] to present them to the user.
pub enum Error {
UnboundVariable(Location, String),
}
@@ -16,6 +23,13 @@ impl From<Error> for Diagnostic<usize> {
}
}
/// A problem we found validating the input that isn't critical.
///
/// These are things that the user might want to do something about,
/// but we can keep going without it being a problem. As with most of
/// these things, if you want to present this information to the user,
/// the best way to do so is via [`From`] and [`Diagnostic`], and then
/// interactions via [`codespan_reporting`].
#[derive(Debug, PartialEq, Eq)]
pub enum Warning {
ShadowedVariable(Location, Location, String),
@@ -37,6 +51,11 @@ impl From<Warning> for Diagnostic<usize> {
}
impl Program {
/// Validate that the program makes semantic sense, not just syntactic sense.
///
/// This checks for things like references to variables that don't exist, for
/// example, and generates warnings for things that are inadvisable but not
/// actually a problem.
pub fn validate(&self) -> (Vec<Error>, Vec<Warning>) {
let mut errors = vec![];
let mut warnings = vec![];
@@ -53,6 +72,15 @@ impl Program {
}
impl Statement {
/// Validate that the statement makes semantic sense, not just syntactic sense.
///
/// This checks for things like references to variables that don't exist, for
/// example, and generates warnings for things that are inadvisable but not
/// actually a problem. Since statements appear in a broader context, you'll
/// need to provide the set of variables that are bound where this statement
/// occurs. We use a `HashMap` to map these bound variables to the locations
/// where they're bound, because these locations are handy when generating errors
/// and warnings.
pub fn validate(
&self,
bound_variables: &mut HashMap<String, Location>,
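The validation pass described above threads a map of bound variables through the statements, reporting unbound references as errors and rebinding as warnings. The sketch below shows that flow with simplified, hypothetical types: `SketchStmt` carries a bare offset instead of a `Location`, and a binding lists the variables its expression refers to instead of holding a full `Expression`. The `SketchError` and `SketchWarning` names are likewise invented for the example.

```rust
use std::collections::HashMap;

// Hypothetical, flattened statement type: (offset, name, variables used).
pub enum SketchStmt {
    Binding(usize, String, Vec<String>),
    Print(usize, String),
}

#[derive(Debug, PartialEq)]
pub enum SketchError {
    // Offset of the use, and the name that was not bound.
    UnboundVariable(usize, String),
}

#[derive(Debug, PartialEq)]
pub enum SketchWarning {
    // Offset of the original binding, offset of the new one, and the name.
    ShadowedVariable(usize, usize, String),
}

pub fn validate(program: &[SketchStmt]) -> (Vec<SketchError>, Vec<SketchWarning>) {
    let mut errors = vec![];
    let mut warnings = vec![];
    // Maps each bound variable to the offset where it was bound.
    let mut bound: HashMap<String, usize> = HashMap::new();

    for stmt in program {
        match stmt {
            SketchStmt::Binding(offset, name, uses) => {
                // Check the right-hand side before recording the new binding.
                for used in uses {
                    if !bound.contains_key(used) {
                        errors.push(SketchError::UnboundVariable(*offset, used.clone()));
                    }
                }
                // `insert` hands back the previous offset if the name was
                // already bound, which is exactly the shadowing case.
                if let Some(original) = bound.insert(name.clone(), *offset) {
                    warnings.push(SketchWarning::ShadowedVariable(original, *offset, name.clone()));
                }
            }
            SketchStmt::Print(offset, name) => {
                if !bound.contains_key(name) {
                    errors.push(SketchError::UnboundVariable(*offset, name.clone()));
                }
            }
        }
    }
    (errors, warnings)
}
```

Storing the binding location in the map, rather than just a set of names, is what makes it possible to point the shadowing warning at both the original and the new binding.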