Can we test it? Yes, we can! - Mitchell Hashimoto - https://www.youtube.com/watch?v=MqC3tudPH6w

Hello, everyone. Thanks for coming. I'll try to make it positive, I guess. I think it's optimistic, but let's just dive right into it. So this is something we've all heard before. A PR is opened. Maybe it's trivial, maybe it's not trivial. Who knows? Maybe the author proactively said this, maybe they didn't. Maybe you ask why there aren't tests, but usually you'll just see it: this phrase, "this can't be tested," often followed by "but it's OK because I just verified it manually." I've said this before. You've said this before. Someone in this room has probably said this in the past 24 to 48 hours. It's fine.

The truth, however, is usually "this can't be tested yet," or replace "yet" with "easily." And "easily" is okay, because sometimes you can make the argument that something could be tested, but the work involved in testing it is not worth it. Tons of stuff like that exists. Maybe if there's a catastrophic bug, it's not an issue. Maybe you're just running this in, you know, reproducible environments, so it's not a big deal. That's all fine. There are loads of those situations. This talk is not at all about when to test. I'll let you make those value decisions yourself. What I want to talk about is how things can be tested.

So my goal with this talk is to give you a tour of concepts and strategies that can be applied to a variety of different situations, to give you some pattern matching capability for how you might be able to apply them. I think it's very similar to how all of us in here would agree that there's value in learning many different programming languages, even if you work every day with only a single one. Learning multiple different languages tends to make you a better programmer. Similarly, I think that just being exposed to different testing challenges, even if you may not directly hit the examples I show you, will make everyone a better tester. And so that's really my goal today.

Okay, so you understand the goals of the talk. Why should you listen to me? What's my experience? I'll go through this relatively quickly. My background is that I started a company called HashiCorp. We made a bunch of tools. I was part of all those initial engineering teams. So it's either I'm sorry or you're welcome. I don't know if you love them — feel free to tell me. If you don't love them, then I didn't do it, it was somebody else. But yeah, I did that for about 12 years. And importantly, I was part of laying out a lot of the initial testing strategies for this software. Some of the properties there: they tended to be networked systems, mostly with large scale in terms of nodes — that dimension, those axes — and there was security sensitivity, since one of the things we made was Vault.

More recently — I left HashiCorp at the end of 2023, if someone didn't know that — for the past few years I've been working on a project called Ghostty, which is a terminal emulator. It's not a company, it's just a passion project, just for fun. But it has dramatically different properties. This is desktop software. It's cross platform across macOS and Linux. It has GPU rendering in it. And I think this is interesting because each of these experiences has given me dramatically different testing environments. And I'm someone who loves testing.
So I wanted to figure this out, and I want to share what I've figured out over the past 15 years.

All right, so going back to 2017, I gave a talk at GopherCon titled Advanced Testing with Go. And as the title and venue may suggest, this was fairly Go specific — a lot of the examples were Go. But over the years a lot of people have come up to me — this is one of the most watched talks I've ever given — and said, even if I don't write Go, I found value in watching this talk. And it's been almost 10 years, and I felt that having a sequel to this talk would be a good idea, because I've learned a lot more, I've experienced more complex situations, and I was sort of cut short in that one and wanted to talk about more. An important part of that is I'm trying not to overlap at all with that talk. There's one concept that overlaps with it, only because we're going to use it in the things I talk about later. But if some of the things I talk about seem like I'm jumping into something advanced, that talk — even though it has "advanced" in the title — has a lot more elementary concepts. And even if you don't write Go, if you like this talk, I think you would like watching that one as well.

Okay, so let's start getting into it. Here's what we're taught, right? Like in school, or in the testing chapters of programming books, or something like that. We're taught that there's some operation, there's some output, you run it and you expect or assert that output. And some stuff is like this. And if what you're doing is like this, then great, it's easy, we all have a great time, and generally we'll write tests. But the issue is — let me make sure I get this right — the reality is that in a lot of real world cases you might run into something where add is actually subprocessing to a shell, running bc for some reason, and you wanted it to work on Windows and now it doesn't work on Windows. This is silly. This is not reality. The only ecosystem that would possibly do this is probably JavaScript. But the whole point is that reality is actually messy, right? Most tests really don't come down to that simple "run a thing, get an output, you're happy" situation. At least in my experience, a lot of the things that I actually care to have work don't fall into that category. They fall into these categories where they have side effects, or there's a complex state of the world that needs to be created in order to even run them in the first place. Networking, other devices like GPUs, concurrency, and so on and so forth — the list goes on. And I'm not going to go into detail about this. I just hope that people agree and have noticed this in their own code. A lot of these are what give rise to — you work in one of these categories and say — "hey, this can't be tested, but it works, maybe."

Okay, so the last thing I want to cover, one more setup slide before we get into it, is the two sides to testing. There are two parts to testing, and they're both equally important. This is also gonna be the slide format: if the slide header starts with a test tube, we're gonna be talking about testing strategy. If the slide header starts with soap bubbles, we're gonna be talking about testability. Testing strategy doesn't need an explanation — that's how to test something. Without this, tests don't exist. Everyone gets this one, it's easy. The second one is just as important.
But the one that I find gets ignored more often, regardless of engineering experience, is testability. If software isn't made testable, there actually are cases where you could say "this can't be tested" and you would be telling the truth. So you need to pair testability with actual testing strategy in order to be able to test everything. And the way you achieve testability is generally things like proper code structure, making the right APIs available, making something friendly for automation, things like that. So I'm gonna be talking about both of these, intermixed, with the slide design guiding it, in order to get us to be able to test more things.

All right, so let's get started with our first testing strategy. Oh, another important point: the slides are roughly in order from simplest to more complicated or more advanced. So if you feel that some of this stuff is like, oh, this is obvious and this whole talk's gonna be obvious — I suspect it won't. I hope it won't. But based on the earlier talk, this is a pretty advanced group, so maybe so. But let's get started with the first one: snapshot testing.

Sometimes this is also called golden file testing, ground truth testing, other things like that. I used to call it some of those phrases, but I feel like snapshot testing is more the norm nowadays culturally, so that's what I'm gonna call it here. The scenario where you want to do snapshot testing, generally, is when you have some sort of complex output format where programmatically coding the comparisons is difficult. If you're familiar with snapshot testing, that's the scenario you'd be familiar with. A lesser discussed scenario, but I think possibly more powerful, is that snapshot testing often gives you better diffs to understand why something failed. And I'm going to show you an example of this.

So here's — oh, this came out way smaller than I expected — here's an example directly from Ghostty, my terminal emulator. When I previewed this locally, that image was really small and the code was really big, so I don't know. But you don't need to read the code. The code is Zig — I'm going to use Zig examples in this, but this talk is totally language agnostic, it doesn't matter, so ignore that. Basically, what we're doing here is: Ghostty has an embedded sprite font within the terminal emulator. There are a bunch of glyphs that terminals render that depend on the grid size to be perfect. The ones you're probably most familiar with are Powerline-type glyphs, like the arrows. If you see this right now in your terminal — you use a font and the arrows don't quite perfectly match up with your grid size — then the terminal's not doing this. What Ghostty does is programmatically rasterize those glyph code points given the current grid dimensions, so they're always pixel perfect. This is not uncommon; a handful of terminals do this. But the issue is: how do you test this? How do we make sure it works? Specifically, one of the first things we found was that we had an off by one error when the grid size was odd in either dimension, and we were one pixel off. Oh, and also, new glyphs are introduced all the time, we've swapped out the library we use to rasterize, things like that. How do we make sure we don't regress this stuff? This is a perfect case for snapshot testing. What we do is this:
We render a bitmap of every glyph that we programmatically rasterize, we commit that directly to the repository, and we compare against it. Pretty straightforward: that's the snapshot, and that's how we compare. What the code was trying to show you there is that there's a basic step at the bottom where we read this ground truth, we run the actual rasterization, and we just compare. There's nothing fancy there.

But one of the things I mentioned is that one of the helpful properties of ground truth testing, snapshot testing, is better diffs. And in this case, since our comparison is an image, we can apply standard image diffing techniques to it. And we do do this in the repo: when the test fails, we also generate the diff and dump it to the file system as part of the test run. And again, I thought this would be a little bit more readable, but what I did was artificially modify our code to draw vertical bars one pixel off to the right. It produced a failure, and the diffs are on the right. It's a bit hard to see, but there are some green lines in there — that's the diff that gets generated. And the thing I want to point out here is that this is a more helpful diff, because prior to this we did have some tests, and they would just basically say "it didn't match" or "this pixel in this place was wrong." One thing snapshot testing gives you is that snapshots usually have more context than a single assertion. Like, yes, this one pixel is wrong, but here are all the pixels, so you can see. Standard assertions in a unit test type environment usually don't include the greater context, and snapshots tend to. So in this case, I could see very clearly that vertical bars are an issue, and I could see it's across multiple glyphs. And with my experience with the code base, when I see something like this, I immediately think some common function that draws vertical bars is wrong. And that's what I had, you know, broken here.

This doesn't just apply to images — images make for an easy visual example. For example, with Terraform, about a decade ago, one thing we ran into was that the first step Terraform does is build a resource graph that then gets executed. We had a bunch of tests around building that graph, and we would get failures saying this expected node or this expected edge doesn't exist, and it would be very difficult — we were spending a lot of time as engineers figuring out how the graph had changed. So one of the things I changed was this: even though it wasn't hard to program those assertions, the diffing was very hard for us. So I started dumping the expected graph and the graph we got, and generating a DOT format that actually colored the missing edges red, the extra edges green, same for vertices, and so on. Then you could load the whole graph, and debugging these things became much, much easier. That was actually a text diffing format — you could render it into an image, but it was just text diffing. But again, since we're missing one vertex but we can see all the vertices, the greater context around it, debugging became easier. So I think this is actually the bigger benefit of snapshot testing, but something to keep in mind.
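To make the snapshot pattern concrete, here's a minimal golden-file sketch. It's in Go rather than Ghostty's Zig, since the talk is language agnostic, and everything in it is invented for illustration — the renderGlyph stand-in, the testdata path, and the -update flag are not Ghostty's real code. The structure is the one described above: read the committed snapshot, run the real code, compare bytes, and on failure dump the actual output as an artifact so it can be diffed with full context.

```go
// snapshot_test.go — a minimal golden-file ("snapshot") test sketch.
package sprite

import (
	"bytes"
	"flag"
	"os"
	"path/filepath"
	"testing"
)

// -update regenerates the committed snapshots instead of comparing against them.
var update = flag.Bool("update", false, "rewrite golden files instead of comparing")

// renderGlyph is a stand-in for a real rasterizer: it draws a fake
// width x height bitmap so this example is self-contained and deterministic.
func renderGlyph(width, height int) []byte {
	img := make([]byte, width*height)
	for x := 0; x < width; x++ {
		img[(height/2)*width+x] = 0xFF // a horizontal bar through the middle
	}
	return img
}

func TestGlyphSnapshot(t *testing.T) {
	got := renderGlyph(9, 17) // odd dimensions: the off-by-one case mentioned above

	golden := filepath.Join("testdata", "glyph_9x17.golden")
	if *update {
		if err := os.MkdirAll("testdata", 0o755); err != nil {
			t.Fatal(err)
		}
		if err := os.WriteFile(golden, got, 0o644); err != nil {
			t.Fatal(err)
		}
		return
	}

	want, err := os.ReadFile(golden)
	if err != nil {
		t.Fatalf("read golden: %v (run `go test -update` to create it)", err)
	}
	if !bytes.Equal(got, want) {
		// Dump the actual output next to the golden file so a human (or an
		// image diff tool) can inspect the whole bitmap, not just one pixel.
		actual := golden + ".actual"
		_ = os.WriteFile(actual, got, 0o644)
		t.Errorf("snapshot mismatch; wrote actual output to %s", actual)
	}
}
```

The same skeleton works for any byte-comparable output: a rasterized bitmap, a DOT dump of a resource graph, a serialized buffer.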
Okay, this next section I didn't mean to keep in here, so I'm actually gonna skip it. That one overlaps with the GopherCon talk as well, so you can look into that one. It was also kind of an opinionated one that people tend to get upset about, so it's fine, we should skip it anyway.

Okay, let's talk about a testability subject. So, soap bubbles: isolating side effects. I actually think, if there's no other part of this talk that you pay attention to, this is the number one force multiplying strategy that exists to make code testable. This comes up over and over in different contexts. And the scenario is this: you have some sort of behavior you want to test, but it's reliant on some sort of external I/O or otherwise complex system that's filled with side-effect type behavior. This is a really common case of "this can't be tested." And what you're looking for, the goal with this, is: within this I/O, complexity, state soup, you're trying to find the purely functional behavior stuck in there, extract it out, and reorder things in order to get something that you can mostly test. You can usually get most of the complexity tested as a result of this. What you get is something that is obviously testable — I'm going to show you an example of this — but it becomes a garbage in, garbage out type of test. What I mean by that is you can't, in this case, simulate the external side-effecty things, so you're going to artificially provide those inputs. And if you provide garbage and you test garbage, you're gonna get garbage out. You need to make sure that the inputs you're going to be providing to these sorts of tests are actually realistic in the real world. Again, that's conceptual — let's see it in practice.

Okay, so here is something. We're gonna start visually and then see code, but I actually doubt we're gonna see any code given the slide sizes. Let's try to understand this visually. Here's a simplified but real example from Ghostty. One of the things a terminal emulator has to do — probably the main thing it has to do — is, when you press something on the keyboard, encode that into some format which is then sent to your shell or whatever running program. And then it does whatever — basic things like, if you press Ctrl+R at your shell prompt, you tend to expect a reverse search to show up, so on and so forth. Early versions of Ghostty had a keyboard input handler that looked like the above image. Basically, what would happen here is a user would press a key, and we would read mouse state — which seems odd, but mouse state actually affects whether a key should be encoded at all. Like, if you're actively highlighting something and you press certain keys, you might want to move the highlight, shift the highlight, things like that. So we would first grab mouse state, check it, respond in some way. Then we would check keyboard state, because we need to know what keys are pressed, what modifiers are pressed — is it a repeat, is it a first press, things like that — and do stuff around that. Then we would read terminal settings. There's a variety of settings that affect how it's encoded: are we doing legacy encoding or Kitty encoding, and within each of those, are we encoding alternate Unicode code points, are we encoding control characters differently — there's a bunch of stuff. And then finally we encode the key and write it out to the actual pty that we have. And this was untested, because setting up mouse state, keyboard state, stuff like that is non trivial. I originally approached this as "oh, this isn't a testable thing," and I punted it to a full end to end test. I figured one day I would probably spin up a VM or something, synthesize inputs, and assert something.
I just punted it away — we'll figure it out later. But then I got punched in the face enough with this code, constant regressions happening here, a lot of complexity, that I realized I had to do something, even if that something was that VM based test right now. So I actually sat down and focused on: I need to make this testable, because this can't go on.

And what I realized is what these colors are showing. If you could see the colors: the yellow is the stuff that's dependent on external state and systems, and the green is the stuff that doesn't need any of that — it's sort of pure, it just has some inputs and gives you a set of outputs and doesn't touch any external systems. And it's really easy to see here because it's colorized, and I simplified it into distinct categories of function calls and also made it alternating. But hopefully people understand that the reality of this function at the time was that all of this was intermixed: we would grab state when we needed it and run conditionals on it. And it wasn't obvious, at least to me. It took me a few hours of really staring at this code before the shape of something suddenly emerged. And what I saw emerge was this, at the bottom: if I took all of the stuff that grabbed external state, moved it to the top, and turned it into a read, process, write order, then I could isolate that, provide artificial inputs there, and test this green thing that is pure — it just has inputs and gives you an output — and then it becomes that testing 101, expect one plus two equals three sort of environment. And in this case, most of the complexity, most of the bugs, most of the issues were in this green thing. So we were able to really dramatically eliminate a bunch of issues. It also made it much easier to fuzz and things like that — and I'm not going to talk about fuzzing in this; there's enough of that here.

Yep, that's what I expected. This is the actual code from the key encoder. I wish you could see the bottom — the bottom is more important. But for the top, just know that each line (I think you can all see lines, just not the text that's in them) is a piece of state, whether it's a structure, a boolean, an integer, a character. It's a piece of state that's needed to do the key encoding. What I'm trying to visualize here is how much state is actually required for a terminal to produce a valid key encoding. In total, there's something like 15 different fields here. Some of it is produced by the operating system, some of it's produced by the terminal and its internal settings. The bottom is an actual test I copied out directly, verbatim, zero edits, to show the types of regressions we can now test against. And what the bottom test is doing — just believe me, and you can look at the slides later — is testing how we encode a certain input from a Russian keyboard layout, with Kitty keyboard settings set to also add alternate Unicode characters, and it expects we get the right thing. This is the reason I had to test this: because every time I would fix a bug or implement a feature, I would regress some layout that to me is very foreign, that I of course was not running day to day as a Russian speaker. And also, getting into this specific Kitty configuration is very difficult. So this is very common — there are Russian, Japanese, Chinese, Hungarian, all sorts of very language specific test cases in there to make sure we constantly do the right thing.
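To illustrate the read, process, write shape, here's a tiny sketch in Go. Ghostty's real encoder is Zig and has far more fields; the KeyState struct, the Encode function, and the encoding rules here are all made up for illustration. The point is that the pure middle step only sees a plain value and returns bytes, so the test never has to fake a window system, a keyboard, or a pty.

```go
// keyenc_test.go — sketch of the read → process → write split.
// The "read" step (querying mouse, keyboard, and terminal state) happens
// elsewhere and just fills in KeyState; Encode is the pure "process" step;
// writing the bytes to the pty is the "write" step and stays out of this test.
package keyenc

import (
	"bytes"
	"testing"
)

// KeyState is every piece of external state the encoder needs, gathered up
// front. A real encoder has many more fields; this is trimmed way down.
type KeyState struct {
	Codepoint       rune // what the active keyboard layout produced
	Ctrl            bool // modifier state read from the OS
	SelectionActive bool // mouse/selection state: suppresses encoding entirely
}

// Encode is pure: no I/O, no globals, just KeyState in and pty bytes out.
func Encode(s KeyState) []byte {
	if s.SelectionActive {
		return nil // key is consumed by the selection, nothing goes to the pty
	}
	if s.Ctrl && s.Codepoint >= 'a' && s.Codepoint <= 'z' {
		return []byte{byte(s.Codepoint-'a') + 1} // legacy control encoding, e.g. Ctrl+R -> 0x12
	}
	return []byte(string(s.Codepoint)) // plain UTF-8
}

func TestEncode(t *testing.T) {
	cases := []struct {
		name string
		in   KeyState
		want []byte
	}{
		{"plain key", KeyState{Codepoint: 'a'}, []byte("a")},
		{"ctrl-r reverse search", KeyState{Codepoint: 'r', Ctrl: true}, []byte{0x12}},
		{"non-ASCII layout", KeyState{Codepoint: 'я'}, []byte("я")},
		{"swallowed by selection", KeyState{Codepoint: 'a', SelectionActive: true}, nil},
	}
	for _, tc := range cases {
		if got := Encode(tc.in); !bytes.Equal(got, tc.want) {
			t.Errorf("%s: got %q, want %q", tc.name, got, tc.want)
		}
	}
}
```

And because Encode is a pure function of its input, it's also exactly the kind of thing you can throw a fuzzer at later.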
And I think because of this, Ghostty has one of the most complete and stable key encoding implementations out there. So this helped a lot. And this is the key point of isolating the side effects. Basically every section from here on out is going to continue to show examples of this: we're going to take snapshot testing, we're going to take isolating side effects, and we're going to bring them together to do something more and more complicated.

All right. GPUs. What a crazy time for GPUs. Thankfully I don't work at OpenAI, so I'm not going to talk about AI at all in this talk. I'm just going to talk about GPU programming in general. It could apply to AI, but in my case it applies to rendering and some compute. So I want to talk about GPU programming and GPU testing, actually. As background, Ghostty is a GPU rendered terminal emulator. What that means is, when you run the terminal, that main thing you see with your cursor blinking — that whole thing is just an image that I'm rendering via the GPU. We also do a little bit of compute on the GPU. Personally, it was my first foray into writing any kind of GPU code, starting a few years ago. Prior to that, I lived purely on the CPU. And given that, when I started coming into it, one of the first things I asked was: all right, how do I test this? My initial feeling and response was the classic "can't be tested." This seemed like a very obvious thing that I should spin up a VM and take screenshots of — I was rendering, after all — so that seemed to be the right solution. But for years our renderers didn't have unit tests. And for years, similar to key encoding, we would just whack-a-mole regressions constantly in the renderer, in the GPU code. So finally I sat down and said, I need to figure this out.

Resources on how to test GPU logic are surprisingly scant. If you do a web search for how to test GPUs, it's one of those rare things where the first page of Google is just completely garbage. There's one response that has an idea that's kind of interesting, but it was just an idea that no one implemented, so it kind of leads you in a direction that you then have to figure out on your own. I went from that to going to the Khronos Group, Apple, and Microsoft and downloading the reference material for Direct3D, Metal, and OpenGL and Vulkan — all their docs, anything they could provide me, in big text format. And I did Command+F searching and LLM-assisted searching on that. I looked for test, verify, snapshot, bug, stability — all these terms. And the amazing thing is, the total reference material across those three vendors for their language specifications, driver specifications, and so on is something like 4,000 pages of PDF, and I didn't get a single hit on any of those terms. The words didn't pop up one time. So as far as I could find, there are zero official resources on how to do testing with GPUs, and very, very few people have even cared to ask the question, at least publicly, where it was indexed. So that's where I was left. I took that mostly as a challenge: well, I feel pretty good about testing, I like testing, maybe I could figure this out. And I think I did — at least, something that works pretty well for me. So this is where I'm just going to add a quick disclaimer, which you could probably tell based on my experience here: I'm not an expert on GPUs. I never worked on complicated 3D games.
My terminal emulator is one of the only things, beyond toy Advent of Code examples, where I've ever used a GPU. So I'm not quite sure that my techniques here actually generalize that well. In my defense, Ghostty does have about 15,000 lines of renderer code split across Metal and OpenGL — we have separate renderers for both systems — and that includes both the CPU code and the shaders for the GPU. The shaders are about 2,000 lines, so 13,000 on the CPU, 2,000 on the GPU. That's what I was working with; that's what I was trying to test.

The core realization I came to here is that GPU programming has two sides — I'm going to go into that background. There's the CPU side, to prepare the data and then usually process the results, and the GPU part that actually runs the shaders. And I felt that we can test each of those in isolation. Specifically, and really clearly, a GPU is just a pure function evaluator, which makes testing really easy but setting up the workloads really hard — which is kind of a funny thing to run into. But let's go ahead and look at this more visually, and start with the CPU side. We can actually read this basic thing.

As a point of background for those less familiar with GPU programming: the CPU does have to do some work to prepare the GPU. The CPU has to put together the right data, the right steps — basically these little job descriptions that it then eventually offloads and submits to the GPU and says, here's a bunch of crap, go do it, and I'll come back, or you tell me when you're done and I'll read it later. That's the really general way to think about a GPU. The work the CPU does in preparation for the GPU can be roughly thought of as this top function. It's a function that takes in some sort of state of the world. If you're doing rendering, this might be called the scene state: what monsters are on the level, where are they, where's the camera, where's the player, what planet am I on. That's the scene state that exists. It brings in the scene state, and as a result it produces a couple of sets of values: it produces a graph — so you get vertices and edges — and it produces data attachments for that graph, meaning which nodes need access to what data.

The graph is just a graph. It has vertices and edges, and the vertices are operations: stuff like vertex shading, fragment shading, compute shading, things like that. The edges are the data dependencies between the steps — the vertex shader is going to produce some sort of output that the fragment shader needs to bring in, or the fragment shader needs access to a certain texture, things like that. Those are the edges that exist in the graph. And the data attachments are literally byte buffers. Historically you'd probably call these textures, but the more modern graphics APIs out there — Vulkan, Metal, and later Direct3D — all tend to just call these buffers now, because it's really just a set of bytes. Sometimes the bytes have structure to them — they're RGBA with dimensions and stride and you have a texture — but sometimes they're just bytes, because you're just computing stuff. So data is just bytes, and therefore you can do whatever you want with it. So to test the CPU side, we have to apply that technique of isolating the side effects. In this case, the side effects are all the API calls to the GPU itself to prepare and submit this workload.
So what I ended up doing was creating an intermediary: we bring in the scene state, we produce the graph and the data, and then I assert that the graph has the right shape — which I have a bunch of experience with from Terraform — and for the data, I do snapshot testing, bringing that back, because it's just bytes, so we're able to do some structured snapshot verification on it. After that, there's a small amount of simple, untested code which just translates the graph and the data attachments into GPU API calls. That stuff never really changes; I'm happy to keep it untested until there are end to end tests. In this way we're able to test that, given a certain scene state, we're producing the right workloads for the GPU. We're not sure the GPU is going to do the right thing with that, but this is still one big part of the equation. I mean, this is 13,000 of the 15,000 lines of code that Ghostty runs in order to render a scene. So this is a big one.

Then on the GPU side, it's visually much simpler. GPUs have no access to disk, no access to networking, no access to any other peripherals — they only have access to their own memory. So a GPU, by definition, is pretty much just a pure function evaluator: it has some sort of computation, it has data, and it outputs data. And that's something that's really juicy to test, really easy to test. But the funny thing about it is, like I said, the hard part is actually submitting the workload. So the general idea with the GPU side is: I want to artificially construct some set of input buffers that my pipeline is going to expect, and I'm going to ensure the output buffers are CPU readable. So instead of writing to a framebuffer that might never come back to the CPU — it's going to render to the screen — don't render to the screen, render to this other CPU readable memory I have. Then I submit actual GPU work. We run unit tests on the CPU; we're going to run unit tests for the GPU on the GPU. Submit the actual GPU work and then compare the output buffers — hand waving here: snapshot testing, actually parsing the data, whatever you want to do, but compare the output in some way. That's the general idea. It's hard, right?

So in practice, what I found, and what I've really only gotten to work well enough that I've shipped it, is full render passes with snapshot testing. What I do is artificially create the scene state — usually a very small terminal, like a two by two terminal — send it to the actual GPU, get an image out, and then compare images, and I expect pixel-for-pixel, byte-equivalent images. It sounds kind of like end to end testing. It is, in a certain way. I would say it's not quite a unit test, it's more of an integration test, but I think it's still a very robust, powerful test, for two reasons. One, it's much faster: getting a window made, submitting GPU work, grabbing a screenshot, and comparing it is instant on any modern computer. It's really, really fast. And two, it is very robust: we're really tightly controlling the input scene state, we're running one render pass, and we're grabbing exactly one frame of results out of the other side and comparing against it. We don't have to worry about standard end to end test issues like timing, synthesized inputs, window positions, chrome, and all this other stuff around the edges. We get exactly the image, perfectly cropped for what we're trying to compare, and it's a single frame.
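Going back to the CPU side for a second, here's a structural sketch of what that intermediary test can look like. This is not Ghostty's actual renderer: the types, the pass names, and the buffer layout are all invented for illustration, and it's in Go rather than Zig. The shape is the point — a pure function from scene state to (pass graph, byte buffers), a graph-shape assertion, and a snapshot-style byte comparison on the data attachments, with the thin layer that turns this into real GPU API calls left out of the test entirely.

```go
// renderplan_test.go — sketch of testing the CPU side of a GPU renderer in
// isolation: scene state in, (pass graph, byte buffers) out, no GPU API calls.
package renderplan

import (
	"bytes"
	"encoding/binary"
	"testing"
)

// Scene is a tiny stand-in for the renderer's input state.
type Scene struct {
	Cols, Rows int
}

// Pass is a node in the work graph; Needs are its edges (data dependencies).
type Pass struct {
	Name  string
	Needs []string
}

// Frame is what would later be translated into actual GPU submissions.
type Frame struct {
	Passes  []Pass
	Buffers map[string][]byte // named byte buffers ("data attachments")
}

// BuildFrame is the pure planning step: one cell coordinate pair per grid
// cell packed into a vertex buffer, plus a fixed two-pass graph.
func BuildFrame(s Scene) Frame {
	verts := new(bytes.Buffer)
	for r := 0; r < s.Rows; r++ {
		for c := 0; c < s.Cols; c++ {
			binary.Write(verts, binary.LittleEndian, [2]uint32{uint32(c), uint32(r)})
		}
	}
	return Frame{
		Passes: []Pass{
			{Name: "cells"},
			{Name: "composite", Needs: []string{"cells"}},
		},
		Buffers: map[string][]byte{"cell_vertices": verts.Bytes()},
	}
}

func TestBuildFrame(t *testing.T) {
	f := BuildFrame(Scene{Cols: 2, Rows: 2}) // the "tiny 2x2 terminal" idea

	// Assert the graph shape: which passes exist and what they depend on.
	if len(f.Passes) != 2 || len(f.Passes[1].Needs) != 1 || f.Passes[1].Needs[0] != "cells" {
		t.Fatalf("unexpected pass graph: %+v", f.Passes)
	}

	// Snapshot the data attachment: 4 cells, 2 little-endian uint32s each.
	want := []byte{
		0, 0, 0, 0, 0, 0, 0, 0,
		1, 0, 0, 0, 0, 0, 0, 0,
		0, 0, 0, 0, 1, 0, 0, 0,
		1, 0, 0, 0, 1, 0, 0, 0,
	}
	if !bytes.Equal(f.Buffers["cell_vertices"], want) {
		t.Errorf("cell_vertices buffer changed:\n got %v\nwant %v", f.Buffers["cell_vertices"], want)
	}
}
```

The GPU side then picks up where this leaves off: submit the real buffers, render into CPU-readable memory, and compare the single resulting frame like any other snapshot.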
So that full render pass approach is very robust. It works. For the future, I do want to test shaders in more isolation, and I've already done a bunch of this. I'm going to go through a few of the things I've done as proofs of concept — they work, but they all have trade-offs I'm not quite happy with, so I haven't actually shipped any of it. That's my disclaimer for this; the proofs of concept all do work, though. The full render pass obviously tests the full input-to-image path, but individual shaders themselves have, or can have, quite a lot of complexity. There are conditionals, there are loops, there are obvious edge cases that I see and want to test in some way. And it's sometimes hard to elicit those edge cases through an initial scene state, or at least to visualize them in a resulting output state. So I want to get closer to a unit test with shaders. To do that I've been trying a variety of techniques, trying to figure out the best way forward. Again, these are all things that I haven't found a lot of people trying. I'm not going to talk about each in detail, because I'm still learning quite a bit about it, but I'll cover them at a high level. If people know more details about this, I'd love to hear about it.

So, just a couple of concepts. OpenGL — I know a lot of people are hyped about Vulkan and things like that, but OpenGL still works — has this feature called transform feedback, and what it basically allows you to do is capture the output of some shaders — some types of shaders, not all of them — into a CPU readable buffer. It's actually a really nice API, because you literally just add strings naming the variables that you want, and it grabs them and throws them, in order, into an output buffer. And that's perfect. I don't know what this was made for — I've never seen it used. I did a Sourcegraph search and things like that, and I don't see it used; there are very few API calls. It doesn't really exist in Vulkan, so clearly there wasn't enough value to move it forward. But it is kind of perfect for testing some kinds of shaders. And that's the issue: it only applies to some. On the Metal side, Metal doesn't have transform feedback, and neither does Direct3D. So what I've found I have to do there is extract shared logic into compute shaders — non-rendering, just compute shaders — make each side call this shared library, then run it through a compute shader, and kind of build my own transform feedback mechanism. That requires a lot more code, and it requires restructuring GPU code, which is standard for testability but isn't great. And it works. The nice thing is that it works for every kind of shader, no matter what — you have to be able to extract the logic into a compute shader, but I've never found a case where you can't. So there's a lot of promise here. I unfortunately just didn't get the future side to a production state. So, I don't know, hopefully one day I can blog or talk about it and have it all figured out. But the result of all this is that I feel confident we're now able to test our renderers. The full render pass thing I have is very robust and fast. And you don't actually need GPU hardware to do it, because you can run it against software drivers. We can assume the drivers work — we're not trying to verify drivers here — so just run it against software drivers. And again, we're just running one frame, so it's super fast.
And yeah, it's interesting, and I think it highlights snapshot testing and isolating side effects really well.

Okay, the last topic I want to talk about is VM testing. There are some things that do end up requiring this. Specifically, to test those yellow boxes that I had earlier, the only way to really do it is to really make it happen. And the only way to simulate things like keyboard and mouse and other types of events is through things like VMs. And it's not just keyboard or mouse, right — this is also the network failure stuff that Antithesis is really good at, disk failures, things like that. It's best done through a hypervisor layer. I'm gonna apologize here, because we're gonna mention Nix, and I know a lot of people feel that Nix enthusiasm is pretty exhausting and don't want to hear about it. So I am sorry. In my defense, I feel pretty confident that I have a good grasp of dev and test environments, virtualization, containerization — it was my whole career for like 15 years — and I don't know any other technology that can achieve what I'm about to show you very well. So I'm going to use Nix, and I'm sorry but not sorry at the same time.

Okay, so let's first talk about VM testing without the Nix part — okay, so if you're having an emotional reaction, we'll start here. There are some things, like I said, that just require an end to end test, and VMs are the best for that. You can maybe get away with containers — just use them interchangeably if you want, but in this case I'm going to keep using the word VM. VM testing lets you model really complex, pretty much arbitrarily complex, states of the world: specific kernel and software versions, and specifically the interplay between those. For me, on the desktop side, it lets me simulate different locales, different keyboard layouts, things like that. And you sometimes just need that. So the idea is that you spin up a full VM, you actually run software, synthesize events, and then somehow assert that what you wanted to happen happened. That's usually through screenshots or SSH commands — SSH commands being things like: this process is running, this file exists, this file has these contents, whatever. Those are usually the two mechanisms.

Okay, now we bring in the Nix. Why do we have to bring in the Nix? I'm glad we can read this one, actually. I'm bringing in Nix because Nix provides a full, first party testing framework that has access to Nix. And these three properties are really important. First, it's first party: Nix actually uses this to test Nix itself, so it's not going away. It's running every day, it's running right now — there are thousands of jobs queued up right now by the Nix project to run these types of tests. Second, it's an actual test framework. It doesn't just define how to spin up a VM; it has a full API for writing tests and asserting they pass. And that's important, because it's not just "run a Docker image" or something — Docker provides the runtime for a container, but it's not gonna give you any of the tools to actually test. This gives you both sides of the equation. And then third — oops, sorry, go back one — since it is powered by Nix, you get full access... well, the language is probably a detriment, I'm gonna be honest, but the access to Nix packages is a benefit, because you get access to basically every version of every piece of semi-popular software that has existed for the past decades.
And this is really important, because you can pin specific versions of everything, down to the kernel, libc, everything. So this is the only way I've found, when the most annoying desktop users of all time — Debian users — bring an ancient version of long term support software and say "this thing doesn't work," to actually verify whether it works.

So let's take a look at what this looks like. The first step for any VM test is to actually define the machine. I put an optional pluralization there because Nix lets you define multiple machines and do networking between them. We're not going to talk about that — that's like a whole talk, probably a whole degree. The machine configuration is just — and these air quotes are doing a lot of work — it is "just" the NixOS configuration. You can put anything in there that you would configure a full NixOS installation with. Like I said, that means anything: kernels, drivers, users, packages, everything.

Step two is actually defining the tests. The tests are written in Python, not in Nix, so you put a string or embed a file with your tests here, and Nix gives you this full Python API with some nice high level stuff, like waiting for systemd units, since NixOS uses systemd. You can even do OCR: you can wait for certain text to appear on the screen and it just handles that for you. And it's just Python. One of the things you get out of this is you can actually access a REPL, so you can have the VM up and just use the REPL to play around with your tests. So that's good. In this case we're defining a test that the alice user — I forgot to mention this on the previous slide: we installed Firefox for alice — has Firefox, and root does not have Firefox. Obviously a toy example, but you could do anything here.

And then step three is you have to run them. Again, there's hand waving going on here, as there has to be whenever you talk about Nix, but basically the important thing is that you have a mechanism built into the framework to run the tests, to get a REPL, to debug the tests, to develop the tests, to run a single test. Everything is there for you within a single command of some sort. So this is a full end to end thing for handling VM tests. And like I said, I just haven't found anything else that's focused on providing this level of flexibility with a focus on testing. There are tons of frameworks — I built some of them — to spin up machines, to spin up VMs, but not to complete the end to end part of it.
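For reference, the three steps come together roughly like this. This mirrors the alice/Firefox example from the NixOS manual rather than anything Ghostty-specific, and the exact entry point shown here (pkgs.testers.runNixOSTest) and the flake or CI wiring around it will vary by setup.

```nix
# A minimal NixOS VM test, roughly the alice/Firefox example described above.
{ pkgs, ... }:

pkgs.testers.runNixOSTest {
  name = "firefox-for-alice";

  # Step 1: define the machine(s). This is "just" NixOS configuration:
  # kernels, drivers, users, packages -- anything a real install could have.
  nodes.machine = { pkgs, ... }: {
    users.users.alice = {
      isNormalUser = true;
      packages = [ pkgs.firefox ];
    };
  };

  # Step 2: the test itself is Python, with a high-level API per machine.
  testScript = ''
    machine.wait_for_unit("default.target")

    # alice has Firefox on her PATH...
    machine.succeed("su -- alice -c 'which firefox'")

    # ...and root does not.
    machine.fail("su -- root -c 'which firefox'")
  '';
}
```

Step 3, running it, then goes through the normal Nix build tooling; the framework also ships an interactive driver, which is where the REPL mentioned above comes from.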
In practice, what am I actually using NixOS VMs for? Again, stuff you really can't test without a full sandbox environment. The thing that really triggered me, the thing that really made me do this, was input methods. I'm a fairly calm person, but input methods on Linux made me pound the table a few times. For those that don't know, an input method is basically any way you input certain Asian languages; emoji keyboards are an input method; the ability to input any character that's not represented on your physical keyboard is an input method; handwriting as well. On Linux, the input method, the input method framework, and the windowing system slash compositor are all developed by different people. And this is where I think the struggles of Linux really shine, because very specific versions of different things just behave wildly differently. And it drove me crazy.

So I needed to use VM testing to test input methods. I love Linux, I use NixOS, but that's in contrast to something like Apple, where you can really clearly tell there's some vertical mandate that all these things must work together in lockstep. So much easier as an app developer — but, you know, it's what I have to work with. And then there's other stuff, desktop integrations: making sure that "Open in Ghostty" appears on right click. How can you possibly test that without actually right clicking and taking a screenshot? These are things that also just break constantly on Linux with various version upgrades. So, perfect for VMs. There's a lot of complaining about Linux up here, but this is the work you have to do to make, in my opinion, a stable desktop app experience for desktop Linux.

I'm just going to put this here for later — take a picture, or just download the slides later. Here are resources where you can learn a lot more about VM testing. It's extremely powerful, but like everything in Nix, the learning curve is like a sheer vertical cliff. So, you know, if you want to traverse the Wall from Game of Thrones, these are the resources you're going to need. I think the benefit is the payoff: for the right type of testing that you need, there's nothing that compares.

And so that's it. Thank you. I know we only covered five topics here, but I think it was a lot. Most importantly, again, if you got nothing else out of it: isolating side effects is a super powerful technique, and I wanted to show multiple examples of that. That's what I tried to do. And if you want more, see the GopherCon talk from 2017. So thank you.