Translation: 38C3 - The Design Decisions behind the first Open-Everything FABulous FPGA

Transcript Translation

38C3 - The Design Decisions behind the first Open-Everything FABulous FPGA - https://www.youtube.com/watch?v=3Lll9_-gYGg

So our next speaker will be Dirk, and he will be talking about the design decisions behind the first open-everything FABulous FPGA. Because we all like to have FPGAs to do stuff really fast. Please give him a warm welcome and enjoy the talk.
Yeah, thanks for the kind introduction, and welcome to my talk. So I'm a university guy, and our group is called Novel Computing Technologies. We are mostly working on FPGAs: looking into CAD tools, doing a lot of work on partial reconfiguration, and looking into reliability aspects. For example, on a modern data center FPGA you have a gigabit of configuration data, and that may be impacted by single event upsets. We do applications related to FPGAs. Today's talk will be on our work on embedded FPGAs. We also look into hardware security, and we have one fancy project that is putting 16,000 RISC-V threads on a single FPGA. We get a performance of more than one RISC-V MIPS per lookup table. If you want to know how it's done, Riyadh is somewhere around and will join us later in a breakout.
Okay, one slide on security before I really go to the main topic, because there are many people working on security. Usually people try to minimize power on a chip. We ask the question the other way around: how much power could you possibly burn on a chip? We call this the power hammering potential. And the power hammering potential of a data center FPGA that you get at Amazon Web Services is something like 15 kilowatts. How do I get to this number? You can take a 6 GHz ring oscillator, fan out on all wires, and record your power. Then you can say: OK, with eight lookup tables we can manage to burn 100 milliwatts. The data center FPGA has 1.2 million lookup tables, and that gives you the 15 kilowatts. And by the way, the fastest switching activity we managed was 8 GHz. Then we are trying to fry these chips, and we are ramping up the power.
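The 15 kilowatt figure above follows from a linear extrapolation of the numbers given in the talk. A minimal sketch of that arithmetic is shown here; the function and constant names are illustrative, and the assumption that power scales linearly with the number of lookup tables is the simplification implied by the talk, not a measured property of the whole device.

```python
# Back-of-the-envelope check of the power hammering estimate.
# All three constants are the figures quoted in the talk.
LUTS_PER_GROUP = 8            # eight lookup tables wired as ring oscillators
POWER_PER_GROUP_MW = 100      # about 100 milliwatts per group of eight
TOTAL_LUTS = 1_200_000        # lookup tables on the data center FPGA


def power_hammering_potential_mw(total_luts, luts_per_group, power_per_group_mw):
    """Scale the per-group power linearly to the whole device (assumption)."""
    groups = total_luts // luts_per_group
    return groups * power_per_group_mw


estimate_mw = power_hammering_potential_mw(
    TOTAL_LUTS, LUTS_PER_GROUP, POWER_PER_GROUP_MW
)
print(f"{estimate_mw / 1_000_000:.1f} kW")
```

Working in integer milliwatts keeps the calculation exact; the result is an upper-bound style estimate, not a power level the chip was actually driven at.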
And interestingly, that is a high power density, in the same order of magnitude as the heat density on the surface of the sun. The chip survives 25 watts. But we are not giving up. We saw 5% aging within one week. We are also working on hardware trojans directly at the bitstream level. But anyway, back to the main topic.
A short introduction into FPGAs for the people not so familiar with them, and how you might compare them to a CPU. This is the idea of the fundamental von Neumann model. You have your CPU somewhere and your ALU, and you get an instruction stream in that tells you what operands to fetch, what operation to do, and maybe where to write your result back. The important thing is that you have a relatively expensive memory operation to get your instruction in with each cycle, and you have some effort to decode it, and that is costly. So in order to amortize the cost, people started using SIMD. Instead of doing one load, you do four or eight or however many loads, and the corresponding multiplications and so forth. And then, since we have billions of transistors on a chip, we do kind of copy and paste. And that is, super simplified, the model of a GPU.
On an FPGA we could play the same game, but usually what we try to do is more like a data path. The way it works is that what we do on the CPU side in a sequential execution, we basically unroll in space on the chip. Instead of a load instruction, I put maybe a DMA unit in, and instead of a multiply instruction, I put a multiplier in, and so forth. Or if I have an instruction like a rotate, that's the thing in the middle, this may be just wiring. So I would not even count this as an instruction in the FPGA case. And then the idea is that you operate this more like an assembly line. So if, say, the multiplier puts its result out and it goes into the adder, you will already start with the next multiplication.
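The assembly-line idea can be illustrated with a small clocked simulation. This is only a sketch under stated assumptions: the three stages (a load, a multiply by 3, an add of 1) and the function `run_pipeline` are hypothetical and chosen for illustration, and each stage is assumed to take exactly one cycle.

```python
def run_pipeline(stages, items):
    """Simulate a clocked pipeline where every stage works in every cycle.

    stages: list of one-argument functions, one per hardware unit.
    items:  input values fed in, one per cycle.
    Returns (results, cycles).
    """
    regs = [None] * len(stages)      # output register of each stage
    pending = list(items)
    results = []
    cycles = 0
    while pending or any(r is not None for r in regs[:-1]):
        # Update from the back so each stage consumes last cycle's value.
        for i in reversed(range(1, len(stages))):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](pending.pop(0)) if pending else None
        if regs[-1] is not None:
            results.append(regs[-1])
        cycles += 1
    return results, cycles


# Hypothetical data path: load -> multiply by 3 -> add 1.
stages = [lambda x: x, lambda x: x * 3, lambda x: x + 1]
inputs = [0, 1, 2, 3]

results, pipelined_cycles = run_pipeline(stages, inputs)
sequential_cycles = len(inputs) * len(stages)   # one operation at a time

print(results, pipelined_cycles, sequential_cycles)
```

With the pipeline full, one result leaves per cycle, so the unrolled data path needs stages + items - 1 cycles where the one-operation-at-a-time model needs stages × items.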
Another way to look at it: on a CPU, if you program, you program against a given architecture or instruction set, right? On an FPGA, there is no such thing as an architecture. You have to build the architecture in the first place. And that allows you to do an awful lot of optimizations. However, all this programmability comes at a cost, and this is why you may want to customize what you make programmable. A nice way to visualize how FPGAs actually work is: you only put into your chip what you need to solve your problem. You get the idea.
Okay, how do FPGAs work? Well, the basic function generator is a lookup table, the most trivial thing that you could imagine. You have a truth table, and the output you may take out to build a bigger combinatorial path, or you may go through a flip-flop. And if you want to implement, let's say, a NAND gate or an AND gate: here, the green one is an AND gate. You just fill out your logic table and you say, I have zeros everywhere except for the last entry, where all my inputs are one, and I get a one out. The red one is an example for an OR.
The way this is implemented, shown on the left side of the figure, is basically just a large multiplexer. The blue boxes that you see are the configuration bits that you will program when you upload your bitstream. And then in front of your lookup table, you have, again, multiplexers. The difference is that inside the lookup table, the blue configuration boxes, these latches, are connected to the data inputs of the multiplexer, whereas for the routing, they are connected to the select inputs of the multiplexer. But the message I want to give is that, regardless of whether it's logic or routing, fundamentally FPGAs are made of huge multiplexers and an awful lot of configuration latches. That means if you want to optimize something, you will actually look into optimizing multiplexers and those latches.
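The distinction between logic and routing described above can be sketched in a few lines. This is a minimal behavioral model, not the FABulous implementation: the class `Lut` and the helpers `lut_from_function` and `routing_mux` are hypothetical names, and the bit ordering (first input is the most significant select bit) is an assumption made for the example.

```python
from itertools import product


class Lut:
    """k-input lookup table: config bits feed the DATA inputs of a multiplexer,
    and the logic inputs drive its SELECT lines."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError("a k-input LUT needs 2**k configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    def evaluate(self, inputs):
        index = 0
        for bit in inputs:               # inputs[0] is the most significant bit
            index = (index << 1) | (1 if bit else 0)
        return self.config_bits[index]


def lut_from_function(k, fn):
    """Fill the truth table by enumerating all 2**k input combinations."""
    return Lut(k, [int(fn(*bits)) for bits in product((0, 1), repeat=k)])


def routing_mux(wires, select_config):
    """Routing multiplexer: here the config bits drive the SELECT lines,
    and the incoming wires are the data inputs."""
    index = 0
    for bit in select_config:
        index = (index << 1) | bit
    return wires[index]


and4 = lut_from_function(4, lambda a, b, c, d: a and b and c and d)
or4 = lut_from_function(4, lambda a, b, c, d: a or b or c or d)

print(and4.config_bits)   # zeros everywhere except the last entry
print(or4.config_bits)    # ones everywhere except the first entry
print(routing_mux(["north", "east", "south", "west"], [1, 0]))
```

In both cases the hardware is a multiplexer plus configuration latches; only the role of the latches differs, which is why optimizing those two primitives pays off across the whole fabric.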
And that is relatively simple, but gives you most of the efficiency gain that you can get. And then you compose a big fabric out of that. This is actually a screenshot from within the tools. With our FABulous toolchain you can later do bitstream generation, and then you can load your netlist in this editor and see how it's physically implemented on the chip. It looks pretty familiar if you know the tools from Xilinx, Altera, or whatever FPGA vendor you use. And we have different blocks. The big square that you see is a switch matrix. This is where all the routing is condensed. The small blocks are logic blocks. And then we have different columns. Some are for I/O, some are for logic. The red columns that you see on the very right of the small fabric are memory. The green ones are DSP blocks for arithmetic. But you get the idea.
Okay. Whenever you start a new project, I think you have to justify why you do it. Sometimes it's for the sake of curiosity, but here I really had a reason. The origin of this project is a major project where we wanted to make reconfigurable chips with memristors. We got 6.3 million in the UK to investigate memristor technologies for building reconfigurable chips. And our task was: let's build FPGAs out of memristors. Memristors are basically analog tunable resistors. These resistors need a high voltage to be programmed, and then you have a resistive state. But at the end of the day we want to have zeros and ones. So we need discriminators, we need some programming logic, voltage shifters and so forth. So we did all the heavy-lifting engineering for our first chips. Then we also started doing the first version of a chip. This was done with industry tools, but it showed us that we can actually build FPGAs. And in order to do such work, we need an ecosystem.
So we need something that generates the FPGA fabric, but we also need something that allows us to program that thing at the end of the day. There is a toolchain out there, VPR, done by the University of Toronto. That thing is more than 20 years old, and it's kind of the gold standard. It is a good tool, nothing to complain about, but it's difficult to control. And I'm a little bit of a control freak. I want to have full control over the fabric aspects. So this is why we did something on our own. There are other open source FPGA projects like OpenFPGA, and there's a project from Princeton. They are all based around VPR, but they are not very optimized, in terms of area for example. I didn't like that, because at the end of the day we want to show that we can eventually gain a benefit, right? So we need a reasonably good quality of results to get credible results. So it was basically time to do something new, and we called this FABulous.
So what is FABulous doing? Well, it's a fully integrated framework for building embedded FPGAs. Here at the top, this is fundamentally the code generation. At the very top you specify your fabric. It's basically just a few CSV files that you throw in. We have a primitive library, for example with lookup tables, arithmetic blocks, these kinds of things. And then FABulous generates the RTL code for you, some constraints, and also, in the middle, an architecture graph that we need later for place and route. Then at the bottom we are running a backend flow to generate the actual chip. The output is the so-called GDS. That's the file that you send to your fab, and it contains basically the information for the masks. On the right-hand side of the figure, that's the bitstream compilation flow. Most of that work is actually done by other projects. For the chip compilation we can use industry tools like the ones from Cadence, but there are also open source tools like OpenLane.
And OpenLane is these days incredibly stable and basically allows civilians to make their own chips.
Okay. We aimed to make our framework pretty versatile. For example, it allows you to define awkward shapes; in this case, it's like an inverse T shape. You can put in your own logic or DSP or whatever tiles; you can have custom tiles. For example, if you want to do something specific for machine learning or, let's say, post-quantum crypto, you can do this. We have some users that play exactly that game. Or if you want to adjust the routing, you also have possibilities to do this. Think of it this way: you have a sensor here, and you do some processing until the data goes into a memory. So you may have a data flow in one direction, and you may need fewer wires in the other direction. And that can eventually help you save area.
If you want to start doing FPGAs, you are somehow a bit crazy or mental. This is a figure that EE Times composed once. It shows all the FPGA startups over a couple of years, and the list is by far not complete. The vast majority of these companies failed at the end of the day. So there must be something that's not so easy. A common pitfall: sometimes these companies do architectures that are inherently flawed. For example, there was a company called Tabula. What they wanted to do in the architecture could never ever work. But there are other companies that miss out on the tool aspects behind it. If you want to make FPGAs, the tools, for example for bitstream compilation, are the much harder problem than doing the silicon itself.
One comment, as a bit of a reflection on this: unfortunately, a lot of these FPGA companies make chips that are used for military purposes. And I see a lot of people that are happy about Gowin, which is a cheap Chinese FPGA. It's basically a Lattice clone, and they have pretty tight ties with the Chinese military.
So keep this in mind if you work on these chips. But anyway, back to the design decisions.
So we know now that lookup tables are the fundamental function generator, and we have to decide how big the lookup table should be. A large body of work had been done in this direction. People said: let's take a 4-input, a 5-input, a 6-input lookup table and see how well this actually works and how well they are utilized. The idea is that if I take bigger lookup tables, I may need fewer of them, but quite often I will not utilize all the inputs, right? And that leaves something unused. What this table shows, with the different colors, is this: the blue ones are for LUT4, the red ones for LUT5, and the yellow ones for LUT6. And it shows that if you go for large lookup tables, you will not even use half of them; only something like one third will be used at full capacity. So it looks like having large lookup tables is maybe not the very best idea. However, large lookup tables have one advantage. They are usually better in power and latency, because you have fewer levels of logic on average, and that helps you to save something.
However, when you buy an FPGA, you think you buy lookup tables or logic, but at the end of the day you pay for the routing. It's nice that you have all these logic blocks that you can wire up together, but this is really expensive. On an ASIC, I basically just draw a metal wire from A to Z. On an FPGA, I have to go through all kinds of switches to get there. And 70 to 80% of your chip area is dedicated to the routing. It's that expensive. So if you go to Xilinx, Altera, Lattice, you name it, you think you buy lookup tables. Well, you pay for the routing. So you have to be very mindful in deciding what your routing fabric should look like.
One thing that you use is a simple observation: in a netlist, if you build hardware, you have some components that are tightly connected and some components that are more loosely connected. You try to reflect this in your fabric by clustering some of these lookup tables together, giving them, let's say, denser connectivity, and then having maybe a little bit less connectivity from cluster to cluster. This was also research that had been done before, where people tried out different lookup table sizes and different cluster sizes. The sweet spot is the one that is circled in the middle: you take 4-input lookup tables and you group something like four to 10 of them together. And this is exactly what we did in FABulous. However, if you go to a more advanced process node, let's say 22 nm, these curves move a little bit towards larger lookup table sizes. And this is why, for example, Xilinx and Altera these days use LUT6. That's not the only reason, but it's one of the major ones.
This is an interesting one. Fundamentally it looks like, well, it's just multiplexers, right? But if you want to implement this on a chip, there are sometimes constraints you should obey. It's not mandatory, but you should. What this figure shows is that if you want to implement a wide multiplexer, you can't just take a 16-input multiplexer. It's done over multiple levels. In this case it's shown for two-input multiplexers, and you see I do this over three levels. What I want to show here is: consider the last, green level. If, for example, the top path were faster than the bottom path, then your latency would depend on the input that you have on the green level. And that is a nasty problem if you do chip analysis, because your timing would have to be aware of the logic state. And this is not how we usually do timing. You could compute this, but it's a nasty one.
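The multi-level construction of a wide multiplexer can be sketched as follows. This is a behavioral illustration only: `mux2`, `mux_tree`, and `levels_needed` are hypothetical helper names, the tree is assumed to be balanced, and the model says nothing about the per-path delays that cause the timing problem described above.

```python
def mux2(a, b, sel):
    """Two-input multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def mux_tree(inputs, select_bits):
    """Wide multiplexer built from levels of two-input multiplexers.

    select_bits[0] drives the first level (next to the data inputs),
    select_bits[-1] drives the final level (next to the output).
    """
    if len(inputs) != 2 ** len(select_bits):
        raise ValueError("need 2**len(select_bits) inputs")
    level = list(inputs)
    for sel in select_bits:
        # Every level halves the number of candidate signals.
        level = [mux2(level[i], level[i + 1], sel) for i in range(0, len(level), 2)]
    return level[0]


def levels_needed(n_inputs):
    """Number of two-input multiplexer levels for n_inputs data inputs."""
    return (n_inputs - 1).bit_length()


wires = [f"w{i}" for i in range(8)]
# Select index 5 = 0b101: first-level bit 1, second-level bit 0, last-level bit 1.
print(mux_tree(wires, [1, 0, 1]))
print(levels_needed(8), levels_needed(16))
```

Eight inputs need three levels, matching the three levels in the figure described above, and sixteen inputs need four. In silicon, each path through such a tree can have a different delay, which is where the state-dependent timing issue comes from.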
You don't have to understand all the details. I just want to make the point that just throwing random gates on the chip is sometimes not enough to get a high-quality fabric. We actually did a tool together with somebody at Berkeley that implements these things automatically. The way it works is that you have kind of a recursive structure to implement all these figures.
Okay, then I may want to implement bigger lookup tables. One way to do this is to take adjacent lookup tables and, with a multiplexer, build bigger ones. Another thing people do is what we call fracturable lookup tables: you build large lookup tables and you split them into pieces. Again, this is what Altera is actually doing. However, if you look at lookup tables, the input routing multiplexers are actually more expensive than the lookup tables themselves.
I put this figure in because it looks cool, but I'll tell you what it is. What you see here is for older Xilinx FPGAs. They have eight lookup tables, A to H, and each lookup table has four inputs. What I have shown is how many inputs go into my switch matrix. So I have a switch matrix, and I have a wire that comes in. That wire eventually goes to one lookup table input, and it goes maybe to another lookup table input. If they share this wire, I count this. On the diagonals, that tells you how many inputs I have on that specific multiplexer. And what it shows is that basically the lookup tables A, B, C, D share exactly the same inputs. That observation is something that we used to reduce some cost in the routing. Okay, but these are low-level details.
Another important thing is that if you build, for example, arithmetic, you usually use something like carry chains.
It's like a ripple-carry adder that you may remember from your foundations in computer engineering, where you have a full adder and the carry goes from one cell to the next cell to the next cell. You ripple a carry through. This is what Lattice is actually doing in front of their lookup table. They have this carry logic in front of the lookup table. While this works, it fundamentally only allows you to do addition. I will skip this one for time reasons, so that I don't run short. It's not so important.
Xilinx, however, does it in a different way. They do the carry chain after the lookup table. It's not important to understand all the details, but if you do this after the lookup table, you can do fast arithmetic, and you can also use it to implement wide logic gates. Sometimes you have, for example, wide comparators, wide AND gates or OR gates, these kinds of things, and you can implement those. An interesting point is that Xilinx fundamentally keeps the principle of operation in their logic. What they did, let's say, 20 years ago is fundamentally the same thing that they do today. The only thing is that the lookup tables have become bigger. These eight blocks that you see highlighted there are those lookup tables, and you can combine multiple lookup tables through those multiplexers to form bigger lookup tables.
Okay, to summarize this: there is not really a winner. All these different FPGAs, and I haven't mentioned everything here, have their individual pros and cons. The real point I want to make is that they fundamentally all work. And the three most important things, if you want to do FPGAs, are: number one, tools; number two, tools; and number three, well, tools. So when we made our design decisions, we actually looked at what support we get from synthesis tools, and what we can do with place and route.
And we made decisions accordingly.
Then, as I mentioned, routing is the elephant in the room. What this figure shows is the adjacency of Xilinx FPGAs. What does that mean? Each dot that you see in this picture is a possible programmable connection. A column represents one multiplexer, and everything horizontal that goes in is an input into my switch matrix. What this means is: I have this large block that does my routing, and then I have wires that come from, let's say, other clusters. Those would be my inputs. And then I have maybe lookup table inputs, or wires that go to other clusters. Those would be my columns, and so forth. This is shown here. What the figure shows is that the routing is incredibly sparse. That's point one. Point two is that it also shows how much of the routing is lookup table inputs and how much of the routing is for adjacent wires. I won't drag you too much into the details here. I'm very low level, I know, but the message is that an awful lot of the routing is just the inputs to your lookup tables. That was the figure with the blue boxes that you saw. So if you implement such a routing fabric, you have to be a little bit mindful when designing these switches. And we get almost similar density for our routing here.
A few chips that we did. There should be a video. The video should be. Actually, that was the first bigger FPGA. Maybe you can see it a little bit. This was the first major FPGA that we did. And it's this street animation. It uses DSP blocks. The bitmap that you see is stored in internal block RAMs. We even do a square root for the rendering in the logic. So you get basically all the elements that you get in a modern FPGA, and they work together. We had an interesting bug with that chip.
The way you do the clock distribution is shown there: you have a central clock point where you pump your clock in, and then it's distributed from tile to tile to tile. What we did differently is that the block RAMs, these are the large blocks that you see at the right of this chip, got their clock through a different clock tree. So there is a bit of a phase offset between the clock that we had in the fabric and the one at these block RAMs, which ended up in hold time violations. That is a nasty thing, but you can compensate for it with the routing tools. For the demo that you saw, it took us something like 100 routing runs before we got a demo that was working. Today we can do this in a more structured way, but when we did this demo, it took something like 100 attempts to get it routed.
This was our first almost open-everything FPGA. I say almost because there was one tiny step that we couldn't do with open source tools, and we actually had to use an industry tool, Innovus. However, this was done in the SkyWater 130 nanometer process. It comes with an open PDK, so you can basically go to the git repo, download this, and explore the chip all the way down to the transistor geometry.
Yes, we did another chip, and this one is done entirely with open source tools. If you check out that git repo, it compiles all the way through to the final fabric without any user intervention. It's a tiny FPGA by today's standards. To give you an idea, a data center FPGA has more than a million lookup tables. We are not even getting a thousand. So it's to some extent tiny, but you have to be aware that we have just something like 10 square millimeters of area here, and we are talking about a 130 nanometer process. If you want to investigate this, that's the git repo.
The next step: we don't have a proper timing model for that one.
So if you do place and route, we look a little bit into the crystal ball to see if it works. But we know that we can do better. So there's some homework to be done. We have to do a little bit more on the physical constraints to automate this, and a couple more optimizations. We want to do more tape-outs. We want to do GlobalFoundries 22, maybe TSMC, IHP and so forth. And we actually want to do an open-everything RISC-V plus embedded FPGA board. This is like, oops, it's in an audio cassette package. So this is the first open-everything chip that we did. We want to develop this a little bit further and put a RISC-V core in with an embedded FPGA.
And there is one thing where I'm not so sure what I should do, and where I will ask you what you would prefer. We can take memory macros that have very poor performance and very poor density, so they give me only something like 12 megahertz and only something like 1 kilobyte. Or I can use Siemens macros that are not open, where I get two to three times the capacity and 100 megahertz. What would you prefer? Open everything, and sacrifice performance and density? Or would you like to have the industry macros, so that you can still download the chip later, but the memory macro would be a black box for you? So what would the audience prefer? Who would go for open everything? Okay, actually I incline to that. Anybody on the industry side? Okay, I guess that is a clear vote.
I have to stop here. If you want to look a little bit more, you can scan the QR code on my logo here. That gets you to the documentation website. More on that one: we actually run a summer school, FPGA Ignite, in Heidelberg. The chip that you see on the right-hand side is a. Sorry. During this summer school we actually do chips that get taped out, and participants get them back. This was a chip that we did. If you are interested in doing such things and making your own chip, stay tuned and go to the website. Thanks to all the people that make this research actually possible.
And one final thing that I skipped: we have a meetup later, an open source FPGA and open ASIC design meetup. It's at 4 o'clock today, close to the Chaos Post. So see us there if you want to see more. Thanks for your attention.
Thanks for the talk. Do we have questions? I see a question. Let's start there.
Is this on? You asked whether we wanted the denser SRAM or the open SRAM. I want both. The Siemens cells are denser because they break the foundry rules. They use special rules for SRAM cells that violate the other design rules. I think you should have someone design your own pushed-design-rule, high-density SRAM. But that's just me. I wanted to thank you for bringing back partial reconfiguration. That's something the industry has ignored for many, many years because the defense customers don't care about it. So I'm very glad to see that we now have FPGAs that we can do partial reconfiguration research on. It really is the killer app for FPGAs.
My question for you is not so much about the technical material, but about the naming. In the software world, the definition of free software includes fundamental freedoms, and one of those is to modify any part of the software. In order for software to be considered free software or open source, you must be able to modify any part. On slides 31 and 33 you had two die photos of chips. And I noticed quite distinctly in those die photos the inclusion of the Caravel management engine, at the bottom edge of one die photo and at the far right edge of the die photo on slide 33. I know this because for a very long time I was targeting SkyWater 130. Unfortunately, Google requires the inclusion of this management engine on every chip that goes through their tape-outs. And the management engine sits between your design and the outside world, with the ability to man-in-the-middle every single pin.
So my question for you is: if the inclusion of this Caravel management engine is mandatory and it cannot be modified, does this really qualify as an open-everything FPGA?
Yes. So what you see there is not Caravel. That's just the pad ring. So there is no Caravel at all.
Is this IHP or SkyWater?
That's SkyWater.
So they're allowing tape-outs without the management engine now?
We did a paid one. The Google program is dead anyway. So that tape-out costs 10 grand.
I asked them for this three years ago and offered them more than ten grand, and they wouldn't do it. So I'm glad to see they've changed their minds.
And the good news is you will get way more I/Os. There's something coming up that gives you over 100 I/Os. Actually, the configuration of the I/O pins is done through the FPGA fabric. We wire configuration bits out, and you can configure the I/O pins, so it almost feels like what you get from the big vendors.
Thank you.
Next question. Same microphone. These are microphones nobody likes.
Yes. I have a question about the timing analysis. Do you also want to design your FPGA, or actually the routing part, so that you can also model the stray capacitance and inductance of your nets?
Yes. What you have to do is basically a SPICE simulation to get the exact arrival times for your clocks. And then you basically have to query your netlist for all the, let's say, multiplexer-to-multiplexer or multiplexer-to-lookup-table wire segments. In reality it's a bit more complicated, because if you use pass-transistor multiplexers, then switching on multiple paths may change things like the slew rate. So different fan-outs have an impact on that. All of that has to feed back into the model. But that is exactly the part that we are missing so that you can do full-blown place and route.
Yes, please go.
Yeah, so thank you, professor, for explaining all this.
This is all proprietary information most of the time, and we don't know how these FPGAs work. We just use them as users. So this was very enlightening. My question would be about the slide where you showed the CPU, the GPU with SIMD, and then the data flow of the FPGA. I think that was it, yeah. So my question would be: why would these vendors, or later also you, want a soft CPU, a RISC-V, on an FPGA? What would be the use case when you have literally built your own pipeline here?
Right. Okay. I call them embedded FPGAs because reconfigurability is expensive, and you only want to put it in where it helps you. If you need a CPU, it will be a hardened CPU, and you would put the fabric beside it to do just the things that have to be reconfigurable. So, like the last thing that I was referring to, this box that we want to do will have a hardened core. But actually, the fabric is also designed to do embedded CPUs or soft CPUs. The reason for that is, as I just told you, the most useless thing to run on an FPGA, an embedded FPGA, is a CPU. And everybody asks me: have you done this and this and this CPU on your fabric? And they will only say your fabric is useful if you can run this and this and this CPU, well knowing that this is something you would never really run on it. Ergo, we put some support in, like distributed memory for a register file or so, to accommodate that. But this also expresses that you can tailor the fabric to your needs. So if you do, let's say, just hardcore arithmetic, you can put in more DSP blocks, or specialized DSP blocks, or tensor blocks, or you name it.
Thank you. Microphone number one.
Yeah, thanks for the talk. It was really enlightening. My question is regarding the tape-out. You said that you will bring out a little box. Is there some way to get notified when this is coming up? And also to get not only the modules, but maybe also the bare chips to use in our own designs?
Not sure if I get your question right. Is it related to the summer school? Because we are not a chip broker or anything like that; we run these things basically in house. To be fair, we used Efabless, and they have a relatively streamlined service. You submit your GDS, and they do the fabrication and the packaging. And you even get, like this white thing, if you can see it from the distance, they even solder the chips onto a small breakout board. And you even get a PCB. That one is our PCB that we made, but you get something back from them, and that's kind of streamlined. I don't want to advertise for them, and I hope that there will be more vendors coming up with this. But that's an entry-level price of about 10 grand that you pay. There are cheaper options like Tiny Tapeout, and if you want to get it for free, join us in our next summer school.
Okay. Yeah. The question was more regarding whether there's any tape-out planned by you.
If you have something, we may find some space for you. If you have a good idea, we may just throw it in. So, yes, ping us.
Thank you.
Microphone 3.
Hi. First of all, thank you for the talk. You mentioned memristors, and I would like to ask: do you see potential for spintronics technologies like MTJs to improve FPGA design, in density, efficiency or reliability?
Yes.
Thank you.
Okay, so hold on. Yes. One thing where it may actually fly high: recall the multiplexer tree for my lookup table. If I can basically store 2 bits in each cell in an analog form by using a ReRAM cell, so I use it in analog mode, then I need only half as many cells. My multiplexer tree is only half the size, so that's great. However, now I have the problem of discriminating four states into two bits. That costs me something. If you can do this cheaply, you will win. And we found a way to do this cheaply. It's still research. But this is one of the games, for example, that we want to play.
Reliability-wise, ReRAM technology is in theory inherently reliable, in the sense that it is resistant to single event upsets. However, the technology itself still has headroom for improvement. Yes, if this answers the question. Thank you. Unfortunately we are at the end of our time slot, but I'm sure you can still talk to our speaker afterwards, so please give him another round of applause.
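The multi-level cell idea from the last answer (two bits per ReRAM cell, so half as many cells, at the price of discriminating four states) can be sketched as follows. This is only an illustration under stated assumptions: the level encoding, the normalized conductance values, and the thresholds are all hypothetical, and this is not the speaker's (still unpublished) discrimination scheme.

```python
# Sketch: storing a LUT truth table in cells that hold 2 bits each.
# Assumption: each cell stores one of four levels, modeled as a normalized
# conductance in [0, 1]. Real devices and read circuits differ.

def pack_truth_table(bits):
    """Pack pairs of configuration bits into 4-level cell values (0..3)."""
    if len(bits) % 2:
        raise ValueError("need an even number of configuration bits")
    return [bits[i] | (bits[i + 1] << 1) for i in range(0, len(bits), 2)]


def level_to_conductance(level):
    """Hypothetical analog value programmed for a level (centre of its band)."""
    return (level + 0.5) / 4


def quantize(conductance, thresholds=(0.25, 0.5, 0.75)):
    """Discriminate an analog value into one of four levels (the costly step)."""
    return sum(conductance > t for t in thresholds)


def read_bit(analog_cells, index):
    """Read configuration bit `index` back from the 2-bit-per-cell storage."""
    level = quantize(analog_cells[index // 2])
    return (level >> (index % 2)) & 1


# A 4-input AND truth table: 16 bits, only the last entry is 1.
and4_bits = [0] * 15 + [1]
cells = [level_to_conductance(level) for level in pack_truth_table(and4_bits)]
print(len(and4_bits), "bits stored in", len(cells), "cells")
```

The multiplexer tree then only has to select among 8 cells instead of 16; the remaining input picks one of the two bits delivered by the discriminator.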

38C3 - The Design Decisions behind the first Open-Everything FABulous FPGA - https://www.youtube.com/watch?v=3Lll9_-gYGg

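So our next speaker will be Dirk, and he will be talking about the design decisions behind the first open-everything FABulous FPGA. We all like to have FPGAs to do stuff really fast, so please give him a warm welcome and enjoy the talk. Yeah, thanks for the kind introduction, and welcome to my talk. I'm a university guy, and our group is called Novel Computing Technologies. We mostly work on FPGAs: we look into CAD tools, do a lot of work on partial reconfiguration, and look into reliability aspects. For example, a modern data center FPGA holds gigabits of configuration data, and that may be impacted by single event upsets. We also do applications related to FPGAs. Today I will talk about our work on embedded FPGAs. We also look into hardware security, and we have one fancy project that puts 16,000 RISC-V threads on a single FPGA. We get a performance of more than one RISC-V MIPS per lookup table. If you want to know how it's done, Riyadh is here somewhere and will join us later in a breakout session. Okay, one slide on security before I go to the main topic, because many people work on security. Usually people try to minimize power on a chip. We ask the question the other way around: how much power could you possibly burn on a chip? We call this the power hammering potential. The power hammering potential of a data center FPGA that you get at Amazon Web Services is something like 15 kilowatts. How do I get to this number? You take a 6 GHz ring oscillator, fan it out on all wires, and record the power. Then you can say that with eight lookup tables we manage to achieve 100 milliwatts. The data center FPGA has 1.2 million lookup tables, and that gives you the 15 kilowatts. By the way, the fastest switching activity we managed was 8 GHz. Then we try to fry these chips and ramp up the power. Interestingly, this high power density is in the same order of magnitude as the heat density at the surface of the sun. The chip survives at 25 watts. But we don't give up. We observed 5% aging within one week. We also work on hardware Trojans directly in the bitstream. But back to the topic. Here is a brief introduction for those who are not so familiar with FPGAs, in comparison with a CPU. This is the basic idea of the von Neumann model. You have a CPU somewhere with an ALU, and it receives an instruction stream that tells it which operands to fetch, which operation to perform, and where to write the result. The important point is that in every cycle there is a relatively expensive memory operation to fetch the instruction, and decoding takes effort, which is costly. To amortize that cost, people started to use SIMD: instead of one load you do four or eight or more loads, with the corresponding multiplications and so on. And if you have billions of transistors on a chip, you do a kind of copy and paste. Yeah, this is a very simplified GPU model. You can do the same on an FPGA, but what we usually want to do is more of a datapath. The way it works is that what you execute sequentially on the CPU side is basically unrolled in space on the chip. So instead of a load instruction you put in a DMA unit, instead of a multiply instruction you put in a multiplier, and so on. Or if you have an instruction like a rotate, the one in the middle, that may be just wiring, so on an FPGA you would not consider it an instruction at all. And the idea is to operate this like an assembly line. For example, when the multiplier delivers a result and it goes into the adder, the multiplier will always start the next multiplication.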
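As a quick check of the power hammering figure quoted above, the arithmetic can be written out. The only assumption added here is that the measured per-group power scales linearly to the whole device, which is how the number in the talk appears to be derived.

```python
# Back-of-the-envelope check of the "power hammering potential" from the talk.
LUTS_PER_GROUP = 8          # from the talk: eight lookup tables ...
POWER_PER_GROUP_W = 0.100   # ... burn about 100 mW
TOTAL_LUTS = 1_200_000      # from the talk: LUTs on the data center FPGA


def power_hammering_potential_w(total_luts, luts_per_group, power_per_group_w):
    """Scale the measured per-group power linearly to the whole device."""
    return total_luts / luts_per_group * power_per_group_w


potential_w = power_hammering_potential_w(
    TOTAL_LUTS, LUTS_PER_GROUP, POWER_PER_GROUP_W
)
print(f"{potential_w / 1000:.1f} kW")
```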
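Another way to look at it is that when you program a CPU, you program for a given architecture or instruction set. On an FPGA there is no architecture; you first have to build the architecture. That allows a tremendous amount of optimization. But this programmability comes at a cost, so you may want to tailor what is programmable. A good way to visualize how an FPGA really works is that you put on the chip only what is needed to solve the problem. Okay. So how does an FPGA work? The basic function generator is a lookup table, the simplest thing you can imagine. You have a truth table, and you take the output to build a larger combinational path or pass it through a flip-flop. Yeah. For example, if you want to implement a NAND gate or an AND gate: the green one here is an AND gate. You fill in the truth table and put a zero everywhere except in the last entry, where all inputs are one; there the output is one. The red one is an example of an OR. Yeah. How this is implemented is shown on the left of the figure: it is basically a big multiplexer. The blue boxes that you see are configuration bits that you program when you upload the bitstream. And in front of the lookup table there are multiplexers again. The difference is that inside the lookup table the blue configuration boxes, these latches, are connected to the data inputs of the multiplexer, whereas for routing they are connected to the select inputs of the multiplexer. The message I want to give is that, whether it is logic or routing, an FPGA is basically made of huge multiplexers and an enormous number of configuration latches. So if you want to optimize something, you will actually optimize the multiplexers and their latches. That is relatively simple, but it is where you get most of the efficiency gains. Then you compose a big fabric out of that. This is actually a screenshot from inside our tool. With our fabric toolchain you can later do the bitstream generation, and if you load a netlist into this editor you see how it is physically implemented on the chip. Yeah, it looks pretty similar to the tools from Synopsys, Altera, and so on, the FPGA vendors that you use. And we have different kinds of blocks. The small or large squares you see are switch matrices; this is where all the routing is concentrated. The small blocks are the logic blocks. And there are different columns: some are for I/O, some for logic. The red column, at the far right of this small fabric, is memory. The green ones are DSP blocks for arithmetic. Okay. Whenever you start a new project you have to justify why you do it. Sometimes you do it out of curiosity, but here there really was a reason. The origin of this project was a large project that wanted to build reconfigurable chips with memristors. We got 6.3 million pounds in the UK to look into memristor technology for reconfigurable chips, and our task was to build an FPGA with memristors. A memristor is basically an analog tunable resistor. These resistors need a high voltage to be programmed, and then you have a resistive state. But in the end we want zeros and ones, so you need discriminators, programming logic, voltage converters, and so on. So we did all the hard engineering work for a first chip, and then we started to make a first version of the chip. It was done with industrial tools, but it showed that we can actually build an FPGA. To do these things you need an ecosystem. You need something that generates the FPGA fabric, and you also need something that lets you program it. And there is a toolchain called VPR from the University of Toronto. It is more than 20 years old and is kind of the gold standard.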
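The lookup table described in this passage (configuration bits on the data inputs of a multiplexer, with the logic inputs acting as the select) can be sketched in a few lines. This is a minimal behavioral model, not FABulous code; the class and function names are made up for illustration.

```python
class LookupTable:
    """A k-input LUT: 2**k configuration bits feed the data inputs of a mux."""

    def __init__(self, num_inputs, config_bits):
        if len(config_bits) != 2 ** num_inputs:
            raise ValueError("need exactly 2**num_inputs configuration bits")
        self.num_inputs = num_inputs
        self.config_bits = list(config_bits)

    def evaluate(self, *inputs):
        """The logic inputs act as the select lines (inputs[0] is the LSB)."""
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for position, bit in enumerate(inputs):
            index |= (bit & 1) << position
        return self.config_bits[index]


def and_lut(num_inputs):
    """Truth table with zeros everywhere except the last entry (all inputs 1)."""
    bits = [0] * (2 ** num_inputs)
    bits[-1] = 1
    return LookupTable(num_inputs, bits)


def or_lut(num_inputs):
    """Truth table with ones everywhere except the first entry (all inputs 0)."""
    bits = [1] * (2 ** num_inputs)
    bits[0] = 0
    return LookupTable(num_inputs, bits)


def routing_mux(candidate_wires, select_config):
    """Routing: here the configuration drives the select, the wires are data."""
    return candidate_wires[select_config]


and4 = and_lut(4)
or4 = or_lut(4)
print(and4.evaluate(1, 1, 1, 1), and4.evaluate(1, 0, 1, 1))
print(or4.evaluate(0, 0, 0, 0), or4.evaluate(0, 1, 0, 0))
```

The contrast between `LookupTable.evaluate` and `routing_mux` mirrors the point made in the talk: the same multiplexer structure is used in both cases, and only the role of the configuration bits changes.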
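It is a good tool and there is nothing to complain about, but it is difficult to control. I'm a bit of a control freak, and I want full control over the fabric side. So we built something else on our own. There are other open-source FPGA projects, such as OpenFPGA, and there is one from Princeton; they are all based on VPR but are not very optimized, for example in terms of area. So I did not like that, because in the end we want to show that we can ultimately gain an advantage, and for that we need a decent quality of results so that the results are credible. Yeah. So basically it was time to do something new, and we called it FABulous. What does FABulous do? Well, it is a fully integrated framework for building embedded FPGAs. At the top here, this is fundamentally code generation. At the very top you specify the fabric; you basically just put in a few CSV files. We have a primitive library with lookup tables, arithmetic blocks, this kind of thing. FABulous then generates RTL code and constraints, but in the middle it also generates an architecture graph that is needed later for place and route. Then at the bottom you run a backend flow to generate the actual chip, and the output is a so-called GDS file. This is the file that you send to the fab; it basically contains the information for the masks. The picture on the right shows the bitstream compilation; most of that work is actually done in other projects. For the chip compilation you can use industrial tools such as Cadence, but there is also the open-source tool OpenLane. These days OpenLane is very stable, so it lets ordinary people build their own chips. We wanted to make the framework very versatile. For example, you can define unusual shapes, like the inverted T shape in this case. You can put in your own logic or DSPs or other things, and custom tiles are possible too. For example, if you want to do something specific for machine learning or post-quantum cryptography, you can, and some users are doing exactly that. If you want to tweak the routing, that is possible as well. Think of it this way: you have a sensor here and do some processing until the data goes into memory. The data may flow mostly in one direction, so you may get away with fewer wires in some directions, which in the end helps you save resources. If you want to do FPGAs or start an FPGA company, you are in some way a bit crazy. This is a picture that EE Times once made. It shows all the FPGA startups over the years, and the list is by no means complete. Most of these companies eventually failed. So there must be something that is not easy, and there must be common pitfalls. Sometimes these companies build inherently flawed architectures. For example, there was a company called Tabula; the architecture they wanted to do could never work. But usually the other companies miss the tool side. If you want to build an FPGA, the tools, for example the bitstream compilation, are a much harder problem than making the silicon itself. As a somewhat reflective comment: unfortunately, many of these FPGA companies make chips that are used for military purposes. And I see many people who are happy about Gowin, which makes cheap Chinese FPGAs. They are basically Lattice clones, and the company has pretty good ties to the Chinese military. Keep that in mind when you deal with these chips. But anyway, back to the design decisions. We now know that the lookup table is the basic function generator.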
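And we have to decide on the size of the lookup table. A lot of research has been done on this; people took 4-input, 5-input, and 6-input lookup tables and checked how well they work and how well they are utilized. The idea is this: with larger lookup tables you may need fewer of them, but often you do not use all inputs, so some of them stay unused. In this chart the colors mean: blue is LUT4, red is LUT5, yellow is LUT6. It shows that with large lookup tables less than half, or only about a third, are used at full capacity. So a large lookup table may not be the best choice. However, large lookup tables have one advantage: you usually have fewer logic levels, which helps with power and latency. When you buy an FPGA, you think you are buying lookup tables or logic, but in the end you pay for the routing. It is nice to have many logic blocks that you can connect, but this is very expensive. In an ASIC you basically draw a metal wire from A to Z. In an FPGA you have to go through all sorts of switches to get there. And 70 to 80% of the chip area is devoted to routing; that is how expensive it is. You think you buy lookup tables from Xilinx, Altera, Lattice, and so on, but you actually pay for the routing. Yeah. So you have to be very careful when you decide what the routing fabric should look like. One approach comes from a simple observation: when you build hardware from a netlist, some components are tightly connected and some are loosely connected. You try to reflect this in the fabric: you cluster some of these lookup tables together and give them denser connectivity, and then you may have somewhat less connectivity between the clusters. This was also studied before; people tried different lookup table sizes and different cluster sizes. The sweet spot is the part circled in the middle: you use 4-input lookup tables and group 4 to 10 of them. This is exactly what we did in FABulous. However, if you go to more advanced process nodes, for example 22 nm, these curves shift a bit toward larger lookup table sizes. This is one of the reasons why Xilinx and Altera use LUT6 these days. It is not the only reason, but one of the main ones. This is an interesting part. Basically it looks like just a multiplexer, but if you implement it on a chip there are sometimes constraints that you should follow. You are not obliged to, but it is better to follow them. What this picture shows is that if you want to implement a wide multiplexer, you cannot simply use a 16-input multiplexer; it is done over several levels. In this case I show it with 2-input multiplexers, and as you see it is done over three layers or levels. What I want to show here is this: consider the last, green level. If, for example, the upper path is faster than the lower path, the delay will depend on this second input at the green level. Okay? This is a very tricky problem when you do chip analysis, because the timing has to be aware of the logic state, and that is not how we usually deal with timing. You can compute it, but it is very complicated. You do not need to understand every detail. The point I want to make is that to get a high-quality fabric, it is sometimes not enough to just throw gates randomly onto the chip. We actually built a tool, together with someone from Berkeley, that implements these things automatically. It works by using a kind of recursive structure to implement all of these pictures. Okay. Then you may want to implement larger lookup tables.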
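The multi-level multiplexer described above can be sketched behaviorally: a wide multiplexer is composed of 2-input multiplexers, one level per select bit. This is only a functional model for illustration (an 8-input tree needs three levels, a 16-input tree four); it says nothing about the physical implementation or the tool mentioned in the talk.

```python
def mux2(a, b, select):
    """2-input multiplexer: returns a when select is 0, b when select is 1."""
    return b if select else a


def mux_tree(data, select_bits):
    """Wide multiplexer built from levels of 2-input muxes.

    select_bits[0] controls the first level (it is the LSB of the index).
    """
    if len(data) != 2 ** len(select_bits):
        raise ValueError("need 2**len(select_bits) data inputs")
    level = list(data)
    for select in select_bits:
        # Each level halves the number of signals.
        level = [
            mux2(level[i], level[i + 1], select)
            for i in range(0, len(level), 2)
        ]
    return level[0]


def levels_needed(num_inputs):
    """Number of 2-input mux levels for a tree with num_inputs data inputs."""
    return (num_inputs - 1).bit_length()


data = list(range(8))
print(mux_tree(data, [1, 0, 1]))           # selects index 0b101
print(levels_needed(8), levels_needed(16))
```

Because a signal entering at a different level or input pin of the tree sees a different path to the output, the delay of the whole structure depends on which inputs are selected, which is the state-dependent timing issue raised in the talk.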
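One way to do this is to combine adjacent lookup tables into a larger one with a multiplexer. The other way is to use what is called a fracturable lookup table: you build a large lookup table and break it into pieces. This is what Altera actually does. But if you look at a lookup table, the input routing multiplexers are actually more expensive than the lookup table itself. I put this picture in because it looks cool, but let me explain what it is. What you see here is for an older Xilinx FPGA. There are eight lookup tables, A to H, and each lookup table has four inputs. What I show is the number of inputs going into the switch matrix. There is a switch matrix, a wire comes in, and it eventually goes to one lookup table input and also to another lookup table input; whenever they share this port, we count it. On the diagonal you see how many inputs a particular multiplexer has. What this shows is that basically the lookup tables A, B, C, and D share exactly the same inputs. We used this observation to cut some cost in the routing. But that is a low-level detail. Another important point is that, for example, when you implement arithmetic you usually use carry chains. This is like the carry-ripple adder that you may remember from your digital logic class, from the basics of computer engineering: you have a full adder, and the carry ripples from one cell to the next and on to the next. This is what Lattice actually does in front of the lookup table; they have this kind of carry logic in front of the lookup table. It works, but fundamentally it only lets you do additions. In the interest of time I will skip this part; it is not that important. Xilinx does it differently: they put the carry chain after the lookup table. You do not need to understand every detail, but if you do this after the lookup table you can also do fast arithmetic, and you can also use it to implement wide logic gates. Sometimes you have, for example, a wide comparator, a wide AND gate or OR gate, this kind of thing, and you can implement it this way. The interesting thing is that Xilinx has basically kept the working principle of their logic. What they did about 20 years ago is fundamentally the same as what they do today; the only difference is that the lookup tables got bigger. The eight highlighted blocks you see here are the lookup tables. You can then combine several lookup tables through multiplexers to form a larger lookup table. To summarize, there is really no winner. All these different FPGAs, and not all of them are mentioned here, have their pros and cons. The key message is that all of these fundamentally work. If you want to do FPGAs, the three most important things are: first, tools; second, tools; and third, tools. When we made our design decisions we actually looked at these things. We looked at what support we could get from the synthesis tool and what we could do in place and route, and decided accordingly. And as I mentioned, routing is the biggest issue. What this picture shows is the adjacency of a Xilinx FPGA. Let us look at what that means. Every dot you see in this picture is a possible programmable connection, a column represents one multiplexer, and every horizontal line is an input to the switch matrix. What this means is: I have a big block that handles the routing, and then I have wires coming from other clusters; those would be my inputs. And I may also have wires that go to lookup table inputs or elsewhere.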
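Two ideas from this passage lend themselves to a small behavioral sketch: building a 5-input lookup table from two 4-input ones plus a multiplexer, and the carry-ripple adder behind a carry chain. The helper names are made up for illustration, and the code models function only, not how any vendor places the carry logic relative to the lookup table.

```python
def lut(config_bits, inputs):
    """Evaluate a LUT: the inputs select one configuration bit (inputs[0] = LSB)."""
    index = sum((bit & 1) << pos for pos, bit in enumerate(inputs))
    return config_bits[index]


def lut5_from_two_lut4(low_bits, high_bits, inputs):
    """Combine two LUT4s with a 2-input mux driven by the fifth input."""
    *lower_inputs, top = inputs
    return lut(high_bits, lower_inputs) if top else lut(low_bits, lower_inputs)


def full_adder(a, b, carry_in):
    """One cell of the carry chain: returns (sum bit, carry out)."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out


def ripple_carry_add(a_bits, b_bits, carry_in=0):
    """The carry ripples from one cell to the next (bit lists are LSB first)."""
    carry = carry_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry


def to_bits(value, width):
    return [(value >> k) & 1 for k in range(width)]


def from_bits(bits):
    return sum(bit << k for k, bit in enumerate(bits))


# 5-input parity split across two LUT4 truth tables.
parity5 = [bin(i).count("1") & 1 for i in range(32)]
print(lut5_from_two_lut4(parity5[:16], parity5[16:], to_bits(0b10110, 5)))

sum_bits, carry_out = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(sum_bits), carry_out)
```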
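Those would be my columns, as shown here. What this picture expresses is that the routing is incredibly sparse. The first and second points show how much of the routing goes to lookup table inputs and how much is for adjacent wires. I will not go too deep into the details; I know this is very low level. But the message is that a substantial part of the routing is simply the inputs to the lookup tables. That was the picture with the blue boxes that you saw. So if you implement such a routing fabric, you should take some care when designing these switches. And we get almost the same routing density here, on a few chips that we made. There should be a video. Actually, this was our first large FPGA. You can see it a little. It was the first major FPGA we made. This is a distance animation. It uses the DSP blocks, and the bitmap you see is stored in the internal block RAM. For the rendering it also computes a square root in logic, so you basically get all the elements of a modern FPGA working together. There was an interesting bug on that chip. The clock distribution works as shown there: there is a central clock point where the clock is injected, and it is distributed from tile to tile. Yeah. What we did differently was that the block RAMs, the large blocks you see on the right of this chip, were clocked through a different clock tree. There was some phase relation between the fabric clock and these block RAMs, and this caused hold time violations. That is actually a tricky problem. You can compensate for it with the routing tool. For the demo you saw, it took about 100 routing runs until we had a working demo, because we could not do it in a structured way as we can today. This was our first almost open-everything FPGA. I say almost everything because there was one very small step that we could not do with open-source tools, so we actually had to use an industrial tool called Innovus. It was made in the SkyWater 130 nanometer process. It comes with an open PDK, so you can basically download this, go to the Git repository, and explore the whole chip down to the transistor structures. Yeah. We then made another chip. That one was made entirely with open-source tools. If you check out that Git repository, it compiles all the way to the final fabric without user intervention. By today's standards it is a very small FPGA. Yeah. State-of-the-art FPGAs have more than a million lookup tables; we have fewer than a thousand. So it is rather small, but you have to know that the area we are dealing with here is only about 10 square millimeters, and we are in a 130 nanometer process. Okay. If you want to look into it, that is the Git repo. As for next steps: we do not have a proper timing model. So when you do place and route, you look a bit into the crystal ball to see whether it works. But we know we can do better. We have homework to do: we need to work a bit more on the physical constraints to automate this, and a few more optimizations are needed. We want to do more tape-outs: GlobalFoundries 22, maybe TSMC, IHP, and so on. And we actually want to build an open-source RISC-V plus embedded FPGA board. It comes in something like an audio cassette case. This is the first fully open-source chip we made. Yeah.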
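The hold time problem mentioned earlier in this passage (block RAMs clocked through a different clock tree than the fabric) follows the usual textbook hold check, sketched below. The delay values are purely hypothetical, not measurements from the chip, and the helper names are invented for illustration.

```python
def hold_slack_ns(clock_to_q_ns, data_path_delay_ns, hold_time_ns, clock_skew_ns):
    """Textbook hold check; a negative result is a hold violation.

    clock_skew_ns is capture clock arrival minus launch clock arrival.
    If the capturing block sees its clock edge late, new data can race
    through and overwrite what the capturing flip-flop was meant to sample.
    """
    return clock_to_q_ns + data_path_delay_ns - (hold_time_ns + clock_skew_ns)


def extra_routing_delay_needed_ns(clock_to_q_ns, data_path_delay_ns,
                                  hold_time_ns, clock_skew_ns):
    """Delay a router would have to add to the data path to fix the violation."""
    slack = hold_slack_ns(clock_to_q_ns, data_path_delay_ns,
                          hold_time_ns, clock_skew_ns)
    return max(0.0, -slack)


# Hypothetical numbers: a short fabric path into a block RAM whose clock
# arrives 0.8 ns later than the fabric clock.
slack = hold_slack_ns(0.3, 0.2, 0.1, 0.8)
fix = extra_routing_delay_needed_ns(0.3, 0.2, 0.1, 0.8)
print(f"slack {slack:.2f} ns, extra delay needed {fix:.2f} ns")
```

Adding delay on the data path is what compensating with the routing tool amounts to: a longer route raises `data_path_delay_ns` until the slack is no longer negative.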
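We want to take this a bit further and put a RISC-V core together with the embedded FPGA. And there is one thing where I am not sure what to do, so I would like to ask what you prefer. We can use memory macros with very low performance and very low density, which give us only around 12 MHz and about 1 kilobyte. Or I can use a Siemens macro that is not open, and with that I get two to three times the capacity and 100 MHz. Which would you prefer: everything open while sacrificing performance and density, or using industrial macros, where you can download the chip later but the memory macro is a black box? What does the audience prefer? Who would go for everything open? Yeah, I actually agree with that. Who is on the industry side? Okay, I think that is a clear vote. I have to wrap up here. If you want to see more, scan the QR code in my logo; it takes you to the documentation website, where you can find out more. We also run the FPGA Ignite summer school in Heidelberg. The chip you see on the right... sorry. In this summer school we build a chip that is actually taped out, and the participants get it. This is the chip we made. If you are interested in these things and want to build your own chip, stay tuned and visit the website. Thanks to everyone who made this research possible. One last thing that I skipped: later there is a meetup on open-source FPGAs and open ASIC design. It is today at 4 o'clock, near the cows. If you want to know more, see you there. Thank you for listening. Thanks for the talk. Are there any questions? It looks like there are. We'll start over there. Is this on? You asked whether we want dense SRAM or open SRAM; I want both. The reason the Siemens cell is denser is that it breaks the foundry rules: it uses special rules for SRAM cells that violate the other design rules. I think someone has to push the design rules on their own and design a dense SRAM. But that is my personal opinion. Thank you for bringing partial reconfiguration back. It is something the industry has ignored for a long time because the defense customers are not interested. I am really glad that there is now an FPGA on which you can do partial reconfiguration research; it really is the killer app for FPGAs. The question I want to ask is less about technical content and more about naming, coming from the software world. The definition of free software includes three fundamental freedoms, one of which is the freedom to modify every part of the software. For software to be considered free software or open source, you have to be able to modify every part of it. I looked at the chip, and in the die photo I noticed very clearly, at the bottom edge of the die photo and on the right edge on slide 33, that the Caravel management engine is included. I know it because I targeted SkyWater 130 for a long time. Unfortunately, Google requires this management engine to be included in every chip that goes through their tape-outs. And the management engine sits between your design and the outside world, with the ability to manipulate every pin in between. So my question is: if the inclusion of this Caravel management engine is mandatory and it cannot be modified, can this really be called a fully open FPGA? Yes. So what you saw there is not that; it is just a pad ring. There is no Caravel at all. Is this IHP or SkyWater? SkyWater.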
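So they do allow tape-outs without the management engine. We did a paid one rather than going through Google. The Google program is over anyway. Yes, that tape-out cost 10,000 dollars. I asked for this three years ago and offered more than 10,000 dollars, but they would not do it. I am glad to see they changed their mind. And the good news is that you will get many more I/Os. There will soon be something that actually gives you more than 100 I/Os. The I/O pins are configured through the FPGA fabric. You can route configuration bits to the outside to configure the I/O pins, so it feels similar to what the big vendors offer. Thank you. Next question, same microphone. The microphone nobody likes. Yes. I have a question about timing analysis. When you design the FPGA, do you also want to design the routing so that you model the parasitic capacitance and inductance of the nets as well? Yes. What you have to do is basically run SPICE simulations to get the exact arrival times of the clock. And you basically have to query the netlist for every multiplexer-to-multiplexer or multiplexer-to-lookup-table wire segment. In practice it is a bit more complicated. With pass-transistor multiplexers, the slew rate can change when you switch different paths, and different fan-outs have an effect too. So all of this has to go into the model. Yeah. But that is exactly the part we are missing in order to do a complete place and route. Yes, please go ahead. Yes, thank you for the explanation, professor. This is all proprietary information most of the time. How these FPGAs work, we do not know; we just use them as users. So this was very enlightening. My question is about the slide where you showed the data flow of the CPU, the GPU, and the FPGA. I think it was that slide. The question is why these vendors, or you, as you mentioned later, would want a soft CPU such as a RISC-V on an FPGA. What is the use case when you have already built your own pipeline? Right. I call it embedded FPGA. Reconfigurability is expensive, so you only want to put it where it helps. If you need a CPU, you use a hard CPU and put the fabric beside it to handle only the parts that need to be reconfigurable. So the box I mentioned at the end will have a hard core. Yeah. And actually our fabric is different in that it is designed to also handle embedded CPUs or soft CPUs. The reason is, as I just said, that the most useful thing to run on an embedded FPGA is a CPU. Everyone asks me: have you run this and that CPU on your fabric? And they will only say your fabric is useful if it can run this and that CPU, even though in reality you would never run such a thing on it. So we added some support, such as distributed memory for a register file. But this also shows that you can tailor the fabric to your needs. For example, if you only do hardcore arithmetic, you can put in more DSP blocks, specialized DSP blocks, tensor blocks, you name it. Thank you. Microphone number one. Yeah, thanks for the talk; it was very enlightening. My question is about the tape-out. You said you will bring out a small box. Is there a way to get notified when it comes out? And can one get not only the modules but maybe also bare chips to use in one's own designs? I am not sure I understood the question correctly. Is it related to the summer school?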
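Because we are not a chip broker; we basically run these things in house. To be fair, we used Efabless, and they have a relatively streamlined service. You submit your GDS and they do the fabrication and the packaging, and, this white thing if you look at it from a distance, they even solder the chips onto a small breakout board. You even get a PCB; this one is our own PCB that we made. But you get something streamlined back from them. As for the entry price: I do not want to advertise for them, and I hope more vendors will come up with such a service, but that is an entry-level price of about 10,000 dollars. There are cheaper options like Tiny Tapeout, and if you want to get it for free, join our next summer school. Okay. The question was rather whether you have a tape-out planned. If you have something, we may find some space for you. If you have a good idea, we may just throw it in. So, yes, ping us. Thank you. Microphone 3. Hi. First of all, thank you for the talk. You mentioned memristors, and I would like to ask: do you see potential for spintronics technologies like MTJs to improve FPGA design in terms of density, efficiency, or reliability? Yes. Thank you. Okay, hold on. Yes. One place where it could really fly is, if you recall, the multiplexer tree of the lookup table. If I can store 2 bits in each cell in analog form by using a ReRAM cell, that is, I use it in analog mode, then I need only half as many cells. The multiplexer tree is also only half the size, which is great. However, now I have the problem of discriminating four states into two bits, and that costs something. If you can do this cheaply, you win. And we found a way to do it cheaply. It is still research. But this is one of the games that we want to play. In terms of reliability, ReRAM technology is in theory inherently reliable, in the sense that it is resistant to single event upsets. However, the technology itself still has headroom for improvement. Yes, I hope this answers the question. Thank you. Unfortunately we are at the end of our time slot, but I am sure you can still talk to our speaker afterwards. So please give him another round of applause.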