This post looks at how to use C# Source Generators to build an external DSL to represent mathematical expressions.
The code for this post is on the roslyn-sdk repository.
A recap of C# Source Generators
There are two other articles describing C# Source Generators on this blog, Introducing C# Source Generators and New C# Source Generator Samples. If you’re new to generators, you might want to read them first.
Let’s just remind ourselves of what they are. You can think of a Source Generator as a function that runs at compile time. It takes some inputs and produces C# code.
Program Parse Tree -> Additional Files -> File Specific Options -> C# Code
This conceptual view is implemented in the ISourceGenerator
interface.
public interface ISourceGenerator {
void Execute(GeneratorExecutionContext context);
void Initialize(GeneratorInitializationContext context);
}
You implement the Execute
method and get the inputs through the context
object. The Initialize
function is more rarely used.
The context
parameter to Execute
contains the inputs.
context.Compilation
is the parse tree for the program and everything else needed by the compiler (settings, references, etc.).context.AdditionalFiles
gives you the additional files in the project.context.AnalyzerConfigOptions.GetOptions
provides the options for each additional file.
The additional files are added to the project file using this syntax. Also, notice the file specific options that you can retrieve in your generator code.
<AdditionalFiles Include="Cars.csv" CsvLoadType="OnDemand" CacheObjects="true" />
You are not limited to these inputs. A C# generator is just a bit of code that runs at compile time. The code can do whatever it pleases. For example, it could download information from a website (not a good idea). But the three inputs above are the most logical ones as they are part of the project. It is the recommended way to do it.
As a side note, a different source generators’ metaphor is the anthropomorphization of the compiler. Mrs. Compiler goes about her business of generating the parse tree and then she stops and asks you: “Do you have anything to add to what I have done so far?”
The scenario
You work for an engineering company that employes many mathematicians. The formulas that underpin the business are spread out through the large C# codebase. The company would like to centralize them and make them easy to write and understand for their mathematicians.
They would like the calculations to be written in pure math, but have the same performance as C# code. For example, they would like the code to end up being inlined at the point of usage. Here is an example of what they would like to write:
AreaSquare(l) = pow(l, 2)
AreaRectangle(w, h) = w * h
AreaCircle(r) = pi * r * r
Quadratic(a, b, c) = {-b + sqrt(pow(b,2) - 4 * a * c)} / (2 * a)
GoldenRatio = 1.61803
GoldHarm(n) = GoldenRatio + 1 * ∑(i, 1, n, 1 / i)
D(x', x'', y', y'') = sqrt(pow([x'-x''],2) + pow([y'-y''], 2))
You notice several things that differentiate this language from C#:
- No type-annotations.
- Different kinds of parenthesis.
- Invalid C# characters in identifiers.
- Special syntax for the summation symbol (
∑
).
Despite the differences, the language structure is similar to C# methods and properties. You think you should be able to translate each line of the language to a snippet of valid C# code.
You decide to use Source Generators for this task because they plug directly into the normal compiler workflow and because in the future the code might need to access the parse tree for the enclosing program.
One could use Regex substitutions to go from this language to C#, but that approach is problematic for two reasons.
- The language structure is not completely identical to C# (i.e., you need to generate special code for
∑
) - More importantly, you expose yourself to code injection attack. A disgruntled mathematician could write code to mint bitcoins inside your language. By properly parsing the language you can whitelist the available functions.
Hooking up the inputs
Here is the implementation of the Execute
method for the ISourceGenerator
interface.
public void Execute(GeneratorExecutionContext context)
{
foreach (AdditionalText file in context.AdditionalFiles)
{
if (Path.GetExtension(file.Path).Equals(".math", StringComparison.OrdinalIgnoreCase))
{
if(!libraryIsAdded)
{
context.AddSource("___MathLibrary___.cs", SourceText.From(libraryCode, Encoding.UTF8));
libraryIsAdded = true;
}
// Load formulas from .math files
var mathText = file.GetText();
var mathString = "";
if(mathText != null)
{
mathString = mathText.ToString();
} else
{
throw new Exception($"Cannot load file {file.Path}");
}
// Get name of generated namespace from file name
string fileName = Path.GetFileNameWithoutExtension(file.Path);
// Parse and gen the formulas functions
var tokens = Lexer.Tokenize(mathString);
var code = Parser.Parse(tokens);
var codeFileName = $@"{fileName}.cs";
context.AddSource(codeFileName, SourceText.From(code, Encoding.UTF8));
}
}
}
The code scans the additional files from the project file and operates on the ones with the extension .math
.
Firstly, it adds to the project a C# library file containing some utility functions. Then it gets the text for the Math file (aka the formulas), parses the language, and generates C# code for it.
This snippet is the minimum code to hook up a new language into your C# project. You can do more here. You can inspect the parse tree or gather more options to influence the way the language is parsed and generated, but this is not necessary in this case.
Writing the parser
This section is standard compiler fare. If you are familiar with lexing, parsing, and generating code, you can jump directly to the next section. If you are curious, read on.
We are implementing the following two lines from the code above.
var tokens = Lexer.Tokenize(mathString);
var code = Parser.Parse(tokens);
The goal of these lines is to take the Math language and generate the following valid C# code. You can then call any of the generated functions from your existing code.
using static System.Math;
using static ___MathLibrary___.Formulas; // For the __MySum__ function
namespace Maths {
public static partial class Formulas {
public static double AreaSquare (double l ) => Pow ( l , 2 ) ;
public static double AreaRectangle (double w ,double h ) => w * h ;
public static double AreaCircle (double r ) => PI * r * r ;
public static double Quadratic (double a ,double b ,double c ) => ( - b + Sqrt ( Pow ( b , 2 ) - 4 * a * c ) ) / ( 2 * a ) ;
public static double GoldenRatio => 1.61803 ;
public static double GoldHarm (double n ) => GoldenRatio + 1 * ___MySum___ ((int) 1 ,(int) n ,i => 1 / i ) ;
public static double D (double xPrime ,double xSecond ,double yPrime ,double ySecond ) => Sqrt ( Pow ( ( xPrime - xSecond ) , 2 ) + Pow ( ( yPrime - ySecond ) , 2 ) ) ;
}
}
I just touch on the most important points of the implementation, the full code is here.
This is not production code. For the sake of simplicity, I had to fit it in one sample file without external dependencies. It is probably wiser to use a parser generator to future-proof the implementation and avoid errors.
With such caveats out of the way, the lexer is Regex based. It uses the following Token
definition and Regexps.
public enum TokenType {
Number,
Identifier,
Operation,
OpenParens,
CloseParens,
Equal,
EOL,
EOF,
Spaces,
Comma,
Sum,
None
}
public struct Token {
public TokenType Type;
public string Value;
public int Line;
public int Column;
}
/// ... More code not shown
static (TokenType, string)[] tokenStrings = {
(TokenType.EOL, @"(rn|r|n)"),
(TokenType.Spaces, @"s+"),
(TokenType.Number, @"[+-]?((d+.?d*)|(.d+))"),
(TokenType.Identifier, @"[_a-zA-Z][`'""_a-zA-Z0-9]*"),
(TokenType.Operation, @"[+-/*]"),
(TokenType.OpenParens, @"[([{]"),
(TokenType.CloseParens, @"[)]}]"),
(TokenType.Equal, @"="),
(TokenType.Comma, @","),
(TokenType.Sum, @"∑")
};
The Tokenize
function just goes from the source text to a list of tokens.
using Tokens = System.Collections.Generic.IEnumerable<MathsGenerator.Token>;
static public Tokens Tokenize(string source) {
It is too long to show here. Follow the link above for the gory details.
The parser’s grammar is described below.
/* EBNF for the language
lines = {line} EOF
line = {EOL} identifier [lround args rround] equal expr EOL {EOL}
args = identifier {comma identifier}
expr = [plus|minus] term { (plus|minus) term }
term = factor { (times|divide) factor };
factor = number | var | func | sum | matrix | lround expr rround;
var = identifier;
func = identifier lround expr {comma expr} rround;
sum = ∑ lround identifier comma expr comma expr comma expr rround;
*/
It is implemented as a recursive descendent parser.
The Parse
function is below and illustrates a few of the design decisions.
public static string Parse(Tokens tokens) {
var globalSymbolTable = new SymTable();
var symbolTable = new SymTable();
var buffer = new StringBuilder();
var en = tokens.GetEnumerator();
en.MoveNext();
buffer = Lines(new Context {
tokens = en,
globalSymbolTable = globalSymbolTable,
symbolTable = symbolTable,
buffer = buffer
});
return buffer.ToString();
}
globalSymbolTable
is used to store the symbols that are whitelisted and the global symbols that are generated during the parsing of the language.symbolTable
is for the parameters to a function and gets cleared at the start of each new line.buffer
contains the C# code that is generated while parsing.Lines
is the first mutually recursive function and maps to the first line of the grammar.
A typical example of one of such recursive functions is below.
private static void Line(Context ctx) {
// line = {EOL} identifier [lround args rround] equal expr EOL {EOL}
ctx.symbolTable.Clear();
while(Peek(ctx, TokenType.EOL))
Consume(ctx, TokenType.EOL);
ctx.buffer.Append("tpublic static double ");
AddGlobalSymbol(ctx);
Consume(ctx, TokenType.Identifier);
if(Peek(ctx, TokenType.OpenParens, "(")) {
Consume(ctx, TokenType.OpenParens, "("); // Just round parens
Args(ctx);
Consume(ctx, TokenType.CloseParens, ")");
}
Consume(ctx, TokenType.Equal);
Expr(ctx);
ctx.buffer.Append(" ;");
Consume(ctx, TokenType.EOL);
while(Peek(ctx, TokenType.EOL))
Consume(ctx, TokenType.EOL);
}
This shows the manipulation of both symbol tables, the utility functions to advance the tokens stream, the call to the other recursive functions, and emitting the C# code.
Not very elegant, but it gets the job done.
We whitelist all the functions in the Math
class.
static HashSet<string> validFunctions =
new HashSet<string>(typeof(System.Math).GetMethods().Select(m => m.Name.ToLower()));
For most Tokens, there is a straightforward translation to C#.
private static StringBuilder Emit(Context ctx, Token token) => token.Type switch
{
TokenType.EOL => ctx.buffer.Append("n"),
TokenType.CloseParens => ctx.buffer.Append(')'), // All parens become rounded
TokenType.OpenParens => ctx.buffer.Append('('),
TokenType.Equal => ctx.buffer.Append("=>"),
TokenType.Comma => ctx.buffer.Append(token.Value),
// Identifiers are normalized and checked for injection attacks
TokenType.Identifier => EmitIdentifier(ctx, token),
TokenType.Number => ctx.buffer.Append(token.Value),
TokenType.Operation => ctx.buffer.Append(token.Value),
TokenType.Sum => ctx.buffer.Append("MySum"),
_ => Error(token, TokenType.None)
};
But identifiers need special treatment to check the whitelisted symbols and replace invalid C# characters with valid strings.
private static StringBuilder EmitIdentifier(Context ctx, Token token) {
var val = token.Value;
if(val == "pi") {
ctx.buffer.Append("PI"); // Doesn't follow pattern
return ctx.buffer;
}
if(validFunctions.Contains(val)) {
ctx.buffer.Append(char.ToUpper(val[0]) + val.Substring(1));
return ctx.buffer;
}
string id = token.Value;
if(ctx.globalSymbolTable.Contains(token.Value) ||
ctx.symbolTable.Contains(token.Value)) {
foreach (var r in replacementStrings) {
id = id.Replace(r.Key, r.Value);
}
return ctx.buffer.Append(id);
} else {
throw new Exception($"{token.Value} not a known identifier or function.");
}
}
There is a lot more that could be said about the parser. In the end, the implementation is not important. This one is far from perfect.
Practical advice
As you build your own Source Generators, there are a few things that make the process smoother.
- Write most code in a standard
Console
project. When you are happy with the result, copy and paste it to your source generator. This gives you a good developer experience (i.e., step line by line) for most of your work. - Once you have copied your code to the source generator, and if you still have problems, use
Debug.Launch
to launch the debugger at the start of theExecute
function. - Visual Studio currently has no ability to unload a source generator once loaded. Modifications to the generator itself will only take effect after you closed and reopened your solution.
These are teething problems that hopefully will be fixed in new releases of Visual Studio. For now, you can use the above workarounds.
Conclusion
Source generators allow you to embed external DSLs into your C# project. This post shows how to do this for a simple mathematical language.
The post Using C# Source Generators to create an external DSL appeared first on .NET Blog.
source https://devblogs.microsoft.com/dotnet/using-c-source-generators-to-create-an-external-dsl/
Comments
Post a Comment