gRPC performance improvements in .NET 5

gRPC is a modern open source remote procedure call framework. There are many exciting features in gRPC: real-time streaming, client-to-server code generation, and great cross-platform support to name a few. The most exciting to me, and consistently mentioned by developers who are interested in gRPC, is performance.

Last year Microsoft contributed a new implementation of gRPC for .NET to the CNCF. Built on top of Kestrel and HttpClient, gRPC for .NET makes gRPC a first-class member of the .NET ecosystem.

In our first gRPC for .NET release, we focused on gRPC’s core features, compatibility, and stability. In .NET 5, we made gRPC really fast.

gRPC and .NET 5 are fast

In a community-run benchmark of different gRPC server implementations, .NET gets the highest requests per second after Rust, and is just ahead of C++ and Go.

This result builds on top of the work done in .NET 5. Our benchmarks show .NET 5 server performance is 60% faster than .NET Core 3.1. .NET 5 client performance is 230% faster than .NET Core 3.1.

Stephen Toub discusses dotnet/runtime changes in his Performance Improvements in .NET 5 blog post. Check it out to read about improvements in HttpClient and HTTP/2.

In the rest of this blog post I’ll talk about the improvements we made to make gRPC fast in ASP.NET Core.

HTTP/2 allocations in Kestrel

gRPC uses HTTP/2 as its underlying protocol. A fast HTTP/2 implementation is the most important factor when it comes to performance. Our gRPC server builds on top of Kestrel, an HTTP server written in C# that is designed with performance in mind. Kestrel is a top contender in the TechEmpower benchmarks, and gRPC benefits from a lot of the performance improvements in Kestrel automatically. However, there are many HTTP/2-specific optimizations that were made in .NET 5.

Reducing allocations is a good place to start. Fewer allocations per HTTP/2 request means less time doing garbage collection (GC). And CPU time “wasted” in GC is CPU time not spent serving HTTP/2 requests.

The performance profiler above is measuring allocations over 100,000 gRPC requests. The live object graph’s sawtooth-shaped pattern indicates memory building up, then being garbage collected. About 3.9 KB is being allocated per request. Let’s try to get that number down!

dotnet/aspnetcore#18601 adds pooling of streams in an HTTP/2 connection. This one change almost cuts allocations per request in half. It enables reuse of internal types like Http2Stream, and publicly accessible types like HttpContext and HttpRequest, across multiple requests.

Once streams are pooled a range of optimizations become available:

dotnet/aspnetcore#19356 reuses input and output Pipe instances. Pipe is the single biggest contributor to allocations.
dotnet/aspnetcore#19431 reuses known header string values. Related to header reuse, dotnet/aspnetcore#19457 adds HTTP/2 pseudo headers as known headers. String allocations use the third most bytes.
dotnet/aspnetcore#19695 and dotnet/aspnetcore#19629 reuse some smaller per-request objects.

While pooling is great when a server is under load, we want to free up memory that is no longer used. dotnet/aspnetcore#24767 removes streams from the pool if they haven’t been used by an HTTP request in the last 5 seconds.
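
To make the pooling-with-expiry idea concrete, here is a minimal sketch. It is not Kestrel’s stream pool: the ExpiringPool<T> type, its members, and the explicit nowTicks parameter are all hypothetical and exist only for illustration. It assumes items are returned in time order, so the oldest entries sit at the front of the list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of a pool that drops idle entries.
// This is not Kestrel's implementation.
public sealed class ExpiringPool<T> where T : class
{
    // Entries are appended on return, so the list stays ordered oldest to newest.
    private readonly List<(T Item, long ReturnedTicks)> _items =
        new List<(T Item, long ReturnedTicks)>();
    private readonly long _idleTimeoutTicks;

    public ExpiringPool(TimeSpan idleTimeout) => _idleTimeoutTicks = idleTimeout.Ticks;

    public int Count => _items.Count;

    // Rent the most recently returned item so older items can age out.
    public bool TryRent(out T item)
    {
        if (_items.Count == 0)
        {
            item = null;
            return false;
        }

        int last = _items.Count - 1;
        item = _items[last].Item;
        _items.RemoveAt(last);
        return true;
    }

    public void Return(T item, long nowTicks) => _items.Add((item, nowTicks));

    // Remove entries that have been idle for longer than the timeout.
    public void RemoveExpired(long nowTicks)
    {
        int expired = 0;
        while (expired < _items.Count &&
               nowTicks - _items[expired].ReturnedTicks > _idleTimeoutTicks)
        {
            expired++;
        }

        _items.RemoveRange(0, expired);
    }
}
```

Renting from the end and expiring from the front means a busy connection keeps reusing the same few objects, while objects that were only needed during a burst are released once they have sat idle.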

There are many smaller allocation savings. dotnet/aspnetcore#19783 removes allocations in Kestrel’s HTTP/2 flow control. A resettable ManualResetValueTaskSourceCore<T> type replaces allocating a new object each time flow control is triggered. dotnet/aspnetcore#19273 replaces an array allocation with stackalloc when validating the HTTP request path. dotnet/aspnetcore#19277 and dotnet/aspnetcore#19325 eliminate some unintended allocations related to logging. dotnet/aspnetcore#22557 avoids allocating a Task<T> if a task is already complete. And finally, dotnet/aspnetcore#19732 saves a string allocation by special-casing a content-length of 0. Because every allocation matters.
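
As a rough illustration of the resettable awaitable pattern, the sketch below wraps ManualResetValueTaskSourceCore<T> in a reusable object. The ResettableSignal class and its WaitAsync and Complete members are hypothetical names, not Kestrel’s flow control types, and the sketch assumes a single awaiter at a time.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical reusable awaitable built on ManualResetValueTaskSourceCore<T>.
// One instance can back many ValueTask<int> operations, one at a time.
public sealed class ResettableSignal : IValueTaskSource<int>
{
    // Mutable struct: the field must not be readonly.
    private ManualResetValueTaskSourceCore<int> _core;

    // Hands out a ValueTask tied to the current version of the core.
    public ValueTask<int> WaitAsync() => new ValueTask<int>(this, _core.Version);

    // Completes the pending operation.
    public void Complete(int value) => _core.SetResult(value);

    int IValueTaskSource<int>.GetResult(short token)
    {
        try
        {
            return _core.GetResult(token);
        }
        finally
        {
            // Reset after the result is consumed so the instance can be reused.
            _core.Reset();
        }
    }

    ValueTaskSourceStatus IValueTaskSource<int>.GetStatus(short token) =>
        _core.GetStatus(token);

    void IValueTaskSource<int>.OnCompleted(
        Action<object> continuation,
        object state,
        short token,
        ValueTaskSourceOnCompletedFlags flags) =>
        _core.OnCompleted(continuation, state, token, flags);
}
```

Each ValueTask<int> handed out must be awaited (or have its result read) exactly once, which is the usual contract for ValueTask and is what makes the reuse safe.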

Per-request memory in .NET 5 is now just 330 B, a decrease of 92%. The sawtooth pattern has also disappeared. Reduced allocations mean garbage collection didn’t run at all while the server processed 100,000 gRPC calls.

Reading HTTP headers in Kestrel

A hot path in HTTP/2 is reading and writing HTTP headers. An HTTP/2 connection supports concurrent requests over a TCP socket, a feature called multiplexing. Multiplexing allows HTTP/2 to make efficient use of connections, but only the headers for one request on a connection can be processed at a time. HTTP/2’s HPack header compression is stateful and depends on order. Processing HTTP/2 headers is a bottleneck, so it has to be as fast as possible.

dotnet/aspnetcore#23083 optimizes the performance of HPackDecoder. The decoder is a state machine that reads incoming HTTP/2 HEADERS frames. The approach is sound: the state machine allows Kestrel to decode frames as they arrive. However, the decoder was checking state after parsing each byte. Another problem was that literal values (the header names and values) were copied multiple times. Optimizations in this PR include:

Tighten parsing loops. For example, if we’ve just parsed a header name then the value must come afterwards. There is no need to check the state machine to figure out the next state.
Skip literal parsing altogether. Literals in HPack have a length prefix. If we know the next 100 bytes are a literal then there is no need to inspect each byte. Mark the literal’s location and resume parsing at its end.
Avoid copying literal bytes. Previously literal bytes were always copied to an intermediary array before being passed to Kestrel. Most of the time this isn’t necessary and instead we can just slice the original buffer and pass a ReadOnlySpan<byte> to Kestrel.

Together these changes significantly decrease the time it takes to parse headers. Header size is almost no longer a factor. The decoder marks the start and end position of a value and then slices that range.
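
The slicing idea can be sketched in a few lines. This is a simplified illustration, not the HPackDecoder code: the LiteralSlicing class and ReadLiteral method are hypothetical, and the sketch assumes a literal that is not Huffman encoded and whose length is under 127, so it fits in a single prefix byte.

```csharp
using System;

// Hypothetical sketch of skipping over a length-prefixed literal without copying.
public static class LiteralSlicing
{
    // Reads one literal starting at 'position' and advances past it.
    // Assumes a single-byte length prefix and no Huffman encoding.
    public static ReadOnlySpan<byte> ReadLiteral(ReadOnlySpan<byte> buffer, ref int position)
    {
        byte prefix = buffer[position];
        if ((prefix & 0x80) != 0)
        {
            throw new NotSupportedException("Huffman encoded literals are omitted from this sketch.");
        }

        int length = prefix & 0x7F;
        if (length == 0x7F)
        {
            throw new NotSupportedException("Multi-byte lengths are omitted from this sketch.");
        }

        // Slice the original buffer: no per-byte inspection and no intermediary array.
        ReadOnlySpan<byte> literal = buffer.Slice(position + 1, length);
        position += 1 + length;
        return literal;
    }
}
```

Because the work is a bounds check and a slice, the cost no longer grows with the size of the literal, which is in line with the benchmark results that follow.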

private HPackDecoder _decoder = CreateDecoder();
private byte[] _smallHeader = new byte[] { /* HPack bytes */ };
private byte[] _largeHeader = new byte[] { /* HPack bytes */ };
private IHttpHeadersHandler _noOpHandler = new NoOpHeadersHandler();

[Benchmark]
public void SmallDecode() =>
    _decoder.Decode(_smallHeader, endHeaders: true, handler: _noOpHandler);

[Benchmark]
public void LargeDecode() =>
    _decoder.Decode(_largeHeader, endHeaders: true, handler: _noOpHandler);

Method	Runtime	Mean	Ratio
SmallDecode	.NET Core 3.1	111.20 ns	1.00
SmallDecode	.NET 5.0	71.90 ns	0.65
LargeDecode	.NET Core 3.1	49,083.00 ns	1.00
LargeDecode	.NET 5.0	98.68 ns	0.002

Once headers have been decoded, Kestrel needs to validate and process them. For example, special HTTP/2 headers like :path and :method need to be set onto HttpRequest.Path and HttpRequest.Method, and other headers need to be converted to strings and added to the HttpRequest.Headers collection.

Kestrel has the concept of known request headers. Known headers are a selection of commonly occurring request headers that have been optimized for fast setting and getting. dotnet/aspnetcore#24730 adds an even faster path for setting HPack static table headers to the known headers. The HPack static table gives 61 common header names and values a number ID that can be sent instead of the full name. A header with a static table ID can use the optimized path to bypass some validation and quickly be set in the collection based on its ID. dotnet/aspnetcore#24945 adds extra optimization for static table IDs with a name and value.

Adding HPack response compression

Prior to .NET 5, Kestrel supported reading HPack compressed headers in requests, but it didn’t compress response headers. The obvious advantage of response header compression is less network usage, but there are performance benefits as well. It’s faster to write a couple of bits for a compressed header than it is to encode and write the header’s full name and value as bytes.

dotnet/aspnetcore#19521 adds initial HPack static compression. Static compression is pretty simple: if the header is in the HPack static table then write the ID to identify the header instead of the longer name.
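
To show why static compression saves bytes, here is a small sketch of the encoding choice. It is not Kestrel’s encoder: the StaticHPackSketch class is hypothetical, it includes only three entries from the static table in RFC 7541 Appendix A, and the fallback path assumes ASCII names and values shorter than 127 bytes written without Huffman encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch of HPack static compression for a single header.
public static class StaticHPackSketch
{
    // A few entries from the HPack static table (RFC 7541 Appendix A).
    private static readonly Dictionary<(string, string), int> StaticTable =
        new Dictionary<(string, string), int>
        {
            [(":method", "GET")] = 2,
            [(":method", "POST")] = 3,
            [(":status", "200")] = 8,
        };

    public static byte[] Encode(string name, string value)
    {
        // Indexed header field: a single byte with the high bit set plus the index.
        if (StaticTable.TryGetValue((name, value), out int index))
        {
            return new[] { (byte)(0x80 | index) };
        }

        // Fallback: literal header field without indexing, with a new name.
        byte[] nameBytes = Encoding.ASCII.GetBytes(name);
        byte[] valueBytes = Encoding.ASCII.GetBytes(value);
        if (nameBytes.Length >= 127 || valueBytes.Length >= 127)
        {
            throw new NotSupportedException("Multi-byte lengths are omitted from this sketch.");
        }

        var result = new byte[3 + nameBytes.Length + valueBytes.Length];
        result[0] = 0x00;
        result[1] = (byte)nameBytes.Length;
        nameBytes.CopyTo(result, 2);
        result[2 + nameBytes.Length] = (byte)valueBytes.Length;
        valueBytes.CopyTo(result, 3 + nameBytes.Length);
        return result;
    }
}
```

With this sketch, a ":status: 200" header takes one byte, while a header that is not in the table, such as "grpc-status: 0", takes 15 bytes.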

Dynamic HPack header compression is more complicated, but also provides bigger gains. Response header names and values are tracked in a dynamic table and are each assigned an ID. As a response’s headers are written, the server checks to see if the header name and value are in the table. If there is a match then the ID is written. If there isn’t then the full header is written, and it is added to the table for the next response. There is a maximum size of the dynamic table, so adding a header to it may evict other headers in first-in, first-out order.

dotnet/aspnetcore#20058 adds dynamic HPack header compression. To quickly search for headers the dynamic table groups header entries using a basic hash table. To track order and evict the oldest headers, entries maintain a linked list. To avoid allocations, removed entries are pooled and reused.
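
The lookup and eviction behavior can be sketched with standard collections. This is a deliberately simplified stand-in, not the data structure from the PR: it uses a HashSet and a Queue instead of a custom hash table with a linked list and pooled entries, it does not assign IDs, and the DynamicTableSketch name and its members are hypothetical. Entry size follows the HPack rule of name length plus value length plus 32, assuming ASCII strings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a size-limited header table with FIFO eviction.
public sealed class DynamicTableSketch
{
    private const int EntryOverhead = 32; // Per-entry overhead defined by HPack.

    private readonly HashSet<(string Name, string Value)> _lookup =
        new HashSet<(string Name, string Value)>();
    private readonly Queue<(string Name, string Value)> _order =
        new Queue<(string Name, string Value)>();
    private readonly int _maxSize;
    private int _size;

    public DynamicTableSketch(int maxSize) => _maxSize = maxSize;

    public int Count => _order.Count;

    public bool Contains(string name, string value) => _lookup.Contains((name, value));

    // Returns true when the header is already tracked (an ID could be written).
    // Otherwise adds it, evicting the oldest entries until it fits.
    public bool TryMatchOrAdd(string name, string value)
    {
        var key = (Name: name, Value: value);
        if (_lookup.Contains(key))
        {
            return true;
        }

        int entrySize = name.Length + value.Length + EntryOverhead;
        while (_order.Count > 0 && _size + entrySize > _maxSize)
        {
            var oldest = _order.Dequeue();
            _lookup.Remove(oldest);
            _size -= oldest.Name.Length + oldest.Value.Length + EntryOverhead;
        }

        // An entry larger than the whole table empties it and is not stored.
        if (entrySize <= _maxSize)
        {
            _order.Enqueue(key);
            _lookup.Add(key);
            _size += entrySize;
        }

        return false;
    }
}
```

For example, with a maximum size of 100 and entries of 34 bytes each, adding a third entry evicts the first one.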

Using Wireshark, we can see the impact of header compression on response size for this example gRPC call. .NET Core 3.x writes 77 B, while .NET 5 is only 12 B.

Protobuf message serialization

gRPC for .NET uses the Google.Protobuf package as the default serializer for messages. Protobuf is an efficient binary serialization format. Google.Protobuf is designed for performance, using code generation instead of reflection to serialize .NET objects. There are some modern .NET APIs and features that can be added to it to reduce allocations and improve efficiency.

The biggest improvement to Google.Protobuf is support for modern .NET IO types: Span<T>, ReadOnlySequence<T> and IBufferWriter<T>. These types allow gRPC messages to be serialized directly using buffers exposed by Kestrel. This saves Google.Protobuf allocating an intermediary array when serializing and deserializing Protobuf content.

Support for Protobuf buffer serialization was a multi-year effort between Microsoft and Google engineers. Changes were spread across multiple repositories.

protocolbuffers/protobuf#7351 and protocolbuffers/protobuf#7576 add support for buffer serialization to Google.Protobuf. This is by far the biggest and most complicated change. Three attempts were made to add this feature before the right balance between performance, backwards compatibility and code reuse was found. Protobuf reading and writing uses many performance oriented features and APIs added to C# and .NET Core:

Span<T> and C# ref struct types enable fast and safe access to memory. Span<T> represents a contiguous region of arbitrary memory. Using span lets us serialize to managed .NET arrays, stack allocated arrays, or unmanaged memory, without using pointers. Span<T> and .NET protect us against buffer overflow.
stackalloc is used to create stack-based arrays. stackalloc is a useful tool to avoid allocations when a small buffer is required.
Low-level methods such as MemoryMarshal.GetReference(), Unsafe.ReadUnaligned() and Unsafe.WriteUnaligned() convert directly between primitive types and bytes.
BinaryPrimitives has helper methods for efficiently converting between .NET primitive types and bytes. For example, BinaryPrimitives.ReadUInt64LittleEndian reads little endian bytes and returns an unsigned 64 bit number. Methods provided by BinaryPrimitives are heavily optimized and use vectorization.
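
A few of these features can be combined in a short sketch. The SpanSketch class and its methods are hypothetical and are not taken from Google.Protobuf; they only show a stack-allocated Span<byte> being used with BinaryPrimitives, with no heap allocation and no pointers.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical sketch combining Span<T>, stackalloc and BinaryPrimitives.
public static class SpanSketch
{
    // Writes a value to a stack-allocated buffer and reads it back.
    public static ulong RoundTrip(ulong value)
    {
        Span<byte> buffer = stackalloc byte[sizeof(ulong)];
        BinaryPrimitives.WriteUInt64LittleEndian(buffer, value);
        return BinaryPrimitives.ReadUInt64LittleEndian(buffer);
    }

    // Returns the first byte of the little endian encoding,
    // which is the least significant byte of the value.
    public static byte FirstLittleEndianByte(ulong value)
    {
        Span<byte> buffer = stackalloc byte[sizeof(ulong)];
        BinaryPrimitives.WriteUInt64LittleEndian(buffer, value);
        return buffer[0];
    }
}
```

Indexing past the end of the span, for example buffer[8] here, throws IndexOutOfRangeException rather than reading adjacent stack memory.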

A great thing about modern C# and .NET is it is possible to write fast, efficient, low-level libraries without sacrificing memory safety. When it comes to performance, .NET lets you have your cake and eat it too!

private TestMessage _testMessage = CreateMessage();
private ReadOnlySequence<byte> _testData = CreateData();
private IBufferWriter<byte> _bufferWriter = CreateWriter();

[Benchmark]
public byte[] ToByteArray() =>
    _testMessage.ToByteArray();

[Benchmark]
public void ToBufferWriter() =>
    _testMessage.WriteTo(_bufferWriter);

[Benchmark]
public IMessage FromByteArray() =>
    TestMessage.Parser.ParseFrom(CreateBytes());

[Benchmark]
public IMessage FromSequence() =>
    TestMessage.Parser.ParseFrom(_testData);

Method	Runtime	Mean	Ratio	Allocated
ToByteArray	.NET 5.0	1,133.82 ns	1.00	184 B
ToBufferWriter	.NET 5.0	589.05 ns	0.51	64 B
FromByteArray	.NET 5.0	409.88 ns	1.00	1960 B
FromSequence	.NET 5.0	381.03 ns	0.92	1776 B

Adding support for buffer serialization to Google.Protobuf is just the first step. More work is required for gRPC for .NET to take advantage of the new capability:

grpc/grpc#18865 and grpc/grpc#19792 add ReadOnlySequence<byte> and IBufferWriter<byte> APIs to the gRPC serialization abstraction layer in Grpc.Core.Api.
grpc/grpc#23485 updates gRPC code generation to glue the changes in Google.Protobuf to Grpc.Core.Api.
grpc/grpc-dotnet#376 and grpc/grpc-dotnet#629 update gRPC for .NET to use the new serialization abstractions in Grpc.Core.Api. This code is the integration between Kestrel and gRPC. Because Kestrel’s IO is built on top of System.IO.Pipelines, we can use its buffers during serialization.

The end result is gRPC for .NET serializes Protobuf messages directly to Kestrel’s request and response buffers. Intermediary array allocations and byte copies have been eliminated from gRPC message serialization.
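
To give a feel for what writing directly to a buffer looks like, the sketch below writes a Protobuf-style varint into an IBufferWriter<byte>. It is an illustration only, not code from Google.Protobuf or gRPC for .NET: the BufferWriterSketch class and WriteVarint32 method are hypothetical, and ArrayBufferWriter<byte> stands in for the buffer that a pipeline would expose.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch of serializing straight into a caller-provided buffer.
public static class BufferWriterSketch
{
    // Writes 'value' as a base-128 varint: 7 bits per byte, low bits first,
    // with the high bit set on every byte except the last.
    public static void WriteVarint32(IBufferWriter<byte> writer, uint value)
    {
        // A 32-bit varint needs at most 5 bytes.
        Span<byte> span = writer.GetSpan(5);
        int written = 0;

        while (value >= 0x80)
        {
            span[written++] = (byte)((value & 0x7F) | 0x80);
            value >>= 7;
        }

        span[written++] = (byte)value;

        // Tell the writer how many bytes were actually used.
        writer.Advance(written);
    }
}
```

The writer owns the memory, so the serializer never allocates an array of its own. Calling WriteVarint32 with an ArrayBufferWriter<byte> and the value 300 leaves the two bytes 0xAC 0x02 in the writer.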

Wrapping Up

Performance is a feature of .NET and gRPC, and as cloud apps scale it is more important than ever. I think all developers can agree it is fun to make fast apps, but performance has real world impact. Lower latency and higher throughput mean fewer servers. It is an opportunity to save money, reduce power use and build greener apps.

As is obvious from this tour, a lot of changes have gone into gRPC, Protobuf and .NET aimed at improving performance. Our benchmarks show a 60% improvement in gRPC server RPS and a 230% improvement in gRPC client RPS.

.NET 5 RC2 is available now, and the official .NET 5 release is in November. To try out the performance improvements and to get started using gRPC with .NET, the best place to start is the Create a gRPC client and server in ASP.NET Core tutorial.

We look forward to hearing about apps built with gRPC and .NET, and to your future contributions in the dotnet and grpc repos!

The post gRPC performance improvements in .NET 5 appeared first on ASP.NET Blog.
