How to Read Math in Deep Learning Paper? - https://www.youtube.com/watch?v=YXWxVxQ6AeY Here you are, reading an interesting deep learning paper. The abstract looks awesome. The introduction is great. Then you get into the methods section. At first everything is fine, until you hit the wall of formulas. Your mind gets a bit foggy and you decide to glance over them quickly. Guess what comes afterward? More formulas. At this point, your mind is completely blanked out and you have a hard time even following the text. You finally set the paper aside. In this video, I'll show you how to read the math section of a deep learning paper with a simple method I use. We will be using the quasi-hyperbolic Adam (QHAdam) formula as an example. For those that are new to the channel, I'm Yasin, a published researcher and machine learning practitioner who loves to teach. The method I use has five steps, which we will go through. The core idea is that you need to slow down when you read math in a deep learning paper and work through the formulas by hand. So let's get started by taking a deep breath. Feeling overwhelmed by math is completely natural. This feeling is experienced by many others, even seasoned machine learning practitioners. The first thing you should realize, though, is that you're not supposed to take one glance at a formula and get it, especially if the formulas have arbitrary single-letter names like x, a, b. What you're supposed to do is work through the formula, understand each of the elements that compose it, and make multiple attempts at understanding how everything connects together. At the end of this exercise, you should be left with an intuition about why something works. That intuition is what you will recall next time you see the formula. But first off, take a deep breath. Now that we have the right mindset, the first step is to identify all the formulas in the paper that are shown or referred to.
The ones that are shown in the paper should be kept within their logical block. What I mean by that is that each block of formulas is building up to a result of some sort. Identify these groups of results. Sometimes it's not too clear, so do your best. The ones that you need to make sure you track down and note are those that are referred to in the paper but not shown. Those are usually a good source of confusion, since the authors will assume you already know what they mean. Also be aware that a later formula usually connects logically to the ones shown earlier in the paper. Once you've identified all of them, I would strongly advise you to take them out of the digital world into the real world, aka onto a piece of paper. This might just be a personal preference, but I've always found that you need as many degrees of freedom as possible to move around mathematics; the variables aren't just variables, they mean something. Being able to take your pen and sketch out the motion the formula is going through, or how a variable refers to a specific concept, is critical for your understanding. Prepare to work on the formulas just as you would a Lego building set. After you've taken all the formulas out, now is the time to translate the symbols you are seeing into meaning. Math is as much about symbols as a poem is about letters. Getting the meaning behind the symbols at a deep level is absolutely crucial to being able to read the formula. The best way I've found to parse math when encountering a new formula is to slowly and surely study each of the symbols and understand their interactions together. Here's how I do it. We'll use this formula from the quasi-hyperbolic Adam paper. I'll show you an overview, and then I'll walk through it myself. First, I want to understand what each of the individual symbols means and put a name on them. Just in this example, there are twelve symbols that we need to understand which aren't explicitly explained in the image.
As you work through this first step, you will realize that a lot of the formulas reuse the same symbols. Hopefully they don't change their meaning within the same paper. From now on, name them in your head instead of reading "alpha" or "epsilon" or "beta". After that, I look at the connections between these different entities and how their interaction transforms things with one meaning into something completely new. This is where I do a lot of the work to build out an intuition: take examples and work through them to really unveil the motions these ideas are going through. Finally, I study how each of these individual ideas constructs the final result for a logical block. Ideally, at the end, I should be able to walk someone (or a rubber duck) through how we go from these atomic entities to the larger concepts and how they all connect together. This can take a while, by the way, especially if I'm missing a core concept that I need to fetch externally. All right, let's jump into the formulas now and work through the core one, so that you see how it's done. Remember the first step: don't freak out. We're going to be reviewing the paper "Quasi-Hyperbolic Momentum and Adam for Deep Learning". We're not actually going to review the paper, just the formulas. This is a cool one because it looks scary, but it's actually pretty simple, and there's some information that is not in the paper, so you have to go get it. So we're going to go through the whole flow together. But first thing first, we take a deep breath. This is not that hard. You see here we have some formulas. You see another formula here, right? And then there's a whole bunch of them. And we have the big one, which is this scary one, all the mumbo jumbo. So let's get to it. Okay, step one: we have to identify all the formulas that are shown and referred to. The first one that we have here is not an actual formula. It's kind of like a template, right?
But it's important: it's this one, the definition of an optimization algorithm. And then you have that form, right? The parameter at time t plus one is the parameter at time t minus something. What is cool with this paper is that they really break it down, from the primitives to the actual novelty. So let's take it out and put it with our stuff. The next one we have is plain stochastic gradient descent, which is just over here. It's also pretty simple: we have theta minus alpha times this stuff. The second one we have is the momentum version of stochastic gradient descent. Okay? So we're making some good headway right now. Then we get to the first new thing, which is quasi-hyperbolic momentum. This is not QHAdam, it's QHM. And here we are starting to get a lot of letters, but it builds on the things that we've seen before. Let's take it out. Okay. Afterward, we have a whole bunch of stuff. They're doing comparisons between that thing, QHM, and Nesterov's accelerated gradient, PID control, a synthesized Nesterov variant, an accelerated stochastic gradient descent — a whole bunch of them. We don't need to dive into them. Basically, they're making the claim that QHM recovers all of them if you set the parameters properly. This is not too important, so we skip them. Then we get to QHAdam, which is where things get a bit out of control. So let's take this whole thing out and stop a bit here. Let's look at our flow. We're starting with the primitive. Then we have stochastic gradient descent. We have the momentum one. Cool. We have QHM. So far, everything makes sense. Then we jump straight to QHAdam, right? It's the QHM idea applied to Adam — but they haven't defined Adam. So we need to go get that, and this is external to the paper. So here we are in the Adam paper. This is the original one.
And then we have this beautiful algorithm that walks us through the whole thing that we need. So we can use this one. And there we go. Now we have the whole gang. The whole gang is there. We have this one, and then we have QHAdam. Job's done. Okay, now ideally you take these out of here and start to work on them on a piece of paper, like I'm doing over here. The reason for this, like I said before, is that I don't have a lot of degrees of freedom on screen. I have my mouse — a crappy, crappy mouse — and I'd have to write like this to work on them. I much prefer to use my hands: circle stuff, name something here, take another piece of paper and do some derivation if I need to, then bring it back and have all of my stuff on the table where I can move it around and look at it, right? For demonstration purposes, I'm going to do it all on screen, but ideally on paper; I already took these out and worked through them on a real piece of paper. Okay, so let's work through this piece of alphabet soup. If we start with QHAdam, you're going to freak out, for sure. What the hell is happening here? We have these ones — are these derivatives? No. We have g at t plus one, you have s here, you have a bunch of variables; beta looks important, it looks like a hyperparameter, right? And you have these on top, the nu's we'll see in a moment, and it's absolutely unreadable. This is total madness. So the trick, right: if a paper does the whole flow, the whole journey, there's a reason behind it. There's a reason why in the paper this is section five and not section zero, right? So take a step back. Okay, here we are, took a step back. We start with the primitive. Here we have theta, which is the model parameters, right?
So every time we see a theta, don't say "theta" in your head; say "parameter". Here, the parameter at t plus one is equal to the parameter at time t minus something. That's how you should read it. So now we know what theta is. Then we have L of theta, and what this means is that it's just the loss function to be minimized via the parameters that we have. So for the L in everything we're going to see, we should say in our head "loss function". Then we have L hat — because if you look at all of them, they have a little hat on them; there's never an L alone. And this is the little thing that will confuse you. You have L hat of theta. What the heck does that mean? It just means that it's an approximation of the loss function, over a mini-batch. That's all it means. So if you were coding this thing, you are given a mini-batch — say about 50 samples or whatever — and you average the losses over it, and that's how you make a step for the parameters on that mini-batch. You don't take the whole dataset every time to make a step. This is why it's an approximation of the loss function: you're approximating that this number is the loss. In some cases, if you're doing literal stochastic gradient descent, you take just one of the data points. Okay, so we're good. Now, what the heck is the little inverted triangle (the nabla)? It's the gradient of the function L. Again, you're never going to see the gradient of L alone; you're going to see the gradient of L hat, the approximation, because you're taking one sample data point, or multiple, at time t. So just with this, you have four different symbols, but this is what it means: it's an approximation of the gradient at time t. Okay, we're making some headway here with all of this stuff, right?
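To make the "hat" concrete, here is a tiny sketch of my own (not from the paper): a made-up per-example loss l_i(theta) = (theta - x_i)^2, whose mini-batch gradient approximates the full-dataset gradient.

```python
import random

# Toy example (my own, not from the paper). Per-example loss:
# l_i(theta) = (theta - x_i)^2, so its gradient is 2 * (theta - x_i).

def full_gradient(theta, data):
    """Gradient of L(theta): averaged over the whole dataset."""
    return sum(2 * (theta - x) for x in data) / len(data)

def minibatch_gradient(theta, data, batch_size, rng):
    """Gradient of L-hat(theta): same formula, but on a random mini-batch."""
    batch = rng.sample(data, batch_size)
    return sum(2 * (theta - x) for x in batch) / len(batch)

rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(1000)]
print(full_gradient(0.0, data))               # the exact gradient
print(minibatch_gradient(0.0, data, 50, rng)) # a noisy estimate, close to it
```

Running this a few times with different mini-batches shows the estimate jumping around the exact value — that noise is exactly what the hat notation is flagging.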
A whole bunch of the unknowns are already sorted out, but you have to stop and really fix in your head what all of these mean. Then you have: the vector operations are element-wise. Good. And then you have g, s, v, w — all of these are auxiliary buffers, right? A buffer, in the programming sense, is just an array of some sort where you store something in between steps. You see here there are two steps, so this is one buffer; here there are two buffers, because you have multiple steps, and there are operations happening on these buffers. So whenever we see g, s, v, w, and sometimes even a, they're going to be buffers. And g is special: it's the momentum buffer. We're going to see it again in momentum. Finally, all of this stuff is subscripted by t, which is the step in time of the optimization. So, good job. Now we have a good understanding of what the heck is happening, and we can move one step further on our road to understanding QHAdam. Here we have stochastic gradient descent, the plain one. You see we have the parameter at t plus one equal to the parameter at time t, minus — same form — alpha times the gradient of the approximation of the loss at the parameter theta at time t. So what does that mean? Let's break it down a bit. This is the current parameter that we plug in. Then we calculate the approximation of the loss function, and we take the gradient of that. This we know. And then there's this alpha here — it's an alpha, actually, not an a. And alpha, as they say here, is called the learning rate. This is a hyperparameter that you can tune so that you take small steps or big steps. So that's it.
The only thing that is new here is this learning rate. So now, every time we see alpha afterward, we'll keep in mind that it means the learning rate, right? And if we see an a instead, it's maybe a buffer. We're good here; we've got stochastic gradient descent. And let's also not forget the form that this thing is taking. The form, I feel, is almost as important as the symbols, once you have the idea in your head. Now, when I see this form, it tells me: this is an optimization step that I'm taking. So if I see the same form somewhere else, I'll know that this is what they mean. Let's look at momentum, for example. If I just glance over these two formulas, this one and this one, I know for sure this one is the optimization step, right? It's the same form as this one over here, and the same form as this one. The other one is not the same form — it's something else — but you see that it's connected to it. So now we recognize the form this is taking, and we see that some stuff has been added. We still have our learning rate here, right? We still have our good old parameters. Over here we have g, which is now called the momentum buffer. And then we have beta, which is called the exponential discount factor. So this thing is now "the exponential discount factor" in my head. And what is happening here? If we have a beta of zero, this term gets killed and this one goes to one, and we just recover stochastic gradient descent. So there seems to be a weighted average between two things: this one, which is the gradient of the approximation of the loss at the parameter at time t, and this other thing, which is what happened previously — the previous buffer, right?
So what it's effectively doing is taking into consideration the previous optimization steps we've done. Say the step was super big before and now it's super small, and beta controls this — say it's at 0.9. Then you take into consideration what was happening before, but dampened down just a bit. This is how you effectively get the ball rolling. Without momentum, if we had a big optimization step at t minus one, and now at time t the step is small, we go big step, then small step; it doesn't matter what happened previously. If you picture it in your head, it moves in a stop-motion kind of way: big motion, small motion, big motion, medium motion. Here, though, it's not like that. It takes into consideration what happened in the past. So if the past step was big and technically this step should be small, it combines this one and this one, so it will still be a big step — a bit smaller than the previous one, though. And if all the steps we calculate stay small, it goes gradually slower and slower and slower. So now, in your head, your intuition, you can understand why this is called momentum: if you see the parameters as a ball rolling through the loss landscape, it literally has momentum. It has some prior acceleration from before, and there's some friction in the plane. Okay, so take a good snapshot of this, because it will have the same form in QH momentum. Okay, so we have the two together. This was the stuff from before, and this is now QHM. If we had them on paper, we would just look at them one after the other.
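Before comparing them, the momentum update itself can be sketched (a minimal sketch on the same toy quadratic loss, which is my own choice):

```python
# Momentum SGD in the paper's exponential-moving-average form:
#   g_{t+1}     = beta * g_t + (1 - beta) * grad    # the momentum buffer
#   theta_{t+1} = theta_t - alpha * g_{t+1}         # same form as the SGD step
# With beta = 0 the buffer is just the current gradient: plain SGD again.

def momentum_step(theta, g, grad, alpha, beta):
    g = beta * g + (1 - beta) * grad   # weighted average: past steps vs. current gradient
    theta = theta - alpha * g
    return theta, g

theta, g = 0.0, 0.0
alpha, beta = 0.1, 0.9
for _ in range(200):
    grad = 2 * (theta - 5.0)   # toy loss gradient, standing in for the mini-batch gradient
    theta, g = momentum_step(theta, g, grad, alpha, beta)
print(theta)  # the "rolling ball" still settles near the minimum at 5
```

If you print theta inside the loop, you can watch it overshoot the minimum and swing back — that oscillation is the rolling-ball behavior described above.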
If you were to superimpose them: you have the beta g t here, same thing. You have one minus beta times the gradient of L hat, same thing, giving g at t plus one. So the momentum buffer is exactly the same thing; nothing has changed here. This is good — we know what this means. What is different is this whole part. Remember when I said that this was kind of a weighted average of two things? It's doing the same trick again. This thing is the weighted average between this quantity, which is very similar to this one, and the rest here, which is what we had before. So what is going on? Here we have what they call nu, the immediate discount factor. It's a weighted average of the momentum update step and the plain stochastic gradient descent update step. Let's take a look back at stochastic gradient descent. So here we are; we have both of them. You have stochastic gradient descent here, and momentum here. What the nu is actually doing is taking an average between what would happen if we were only following stochastic gradient descent, and what would happen if we were only following the momentum step. Because momentum is this, and stochastic gradient descent is this. So this is what it means. There's nothing inherently complicated happening here. We either follow momentum more, or stochastic gradient descent more, and we keep those two things running in parallel. That's it. That's all that QHM is doing, essentially. There's nothing fancier than this. That's the whole idea. And it's explained in the paper, but when you see it here, it also becomes very, very obvious. So now we're getting there. Let's move into Adam.
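Before that, the QHM update we just walked through can be sketched, together with a one-step check of the recovery claims (the toy numbers are my own):

```python
# QHM (quasi-hyperbolic momentum): the update direction is a weighted average,
# controlled by the immediate discount factor nu, of the plain SGD direction
# and the momentum direction:
#   g_{t+1}     = beta * g_t + (1 - beta) * grad
#   theta_{t+1} = theta_t - alpha * ((1 - nu) * grad + nu * g_{t+1})
# nu = 0 recovers plain SGD; nu = 1 recovers momentum.

def qhm_step(theta, g, grad, alpha, beta, nu):
    g = beta * g + (1 - beta) * grad
    direction = (1 - nu) * grad + nu * g   # average of SGD and momentum directions
    return theta - alpha * direction, g

# One-step sanity check of the recovery claims (toy numbers, my own):
theta, g, grad, alpha, beta = 1.0, 0.5, 2.0, 0.1, 0.9

sgd_theta = theta - alpha * grad
th0, _ = qhm_step(theta, g, grad, alpha, beta, nu=0.0)
print(th0 == sgd_theta)   # True: nu = 0 is plain SGD

g_next = beta * g + (1 - beta) * grad
mom_theta = theta - alpha * g_next
th1, _ = qhm_step(theta, g, grad, alpha, beta, nu=1.0)
print(th1 == mom_theta)   # True: nu = 1 is momentum
```

Trying a few nu values in between is exactly the "work through examples by hand" step: you see the update slide continuously from the SGD direction toward the momentum direction.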
And if you take something out of another paper, you have to be very careful: the variables might not mean the same thing as the ones you've already seen. So you take the idea, maybe rewrite it a bit, and then you can incorporate it back into your stuff. So let's take a look at this. You can see right away it's a bit different: it's not a formula, it's an algorithm. The authors decided to present it this way instead, so you have to be aware that there will be some slight differences, and be flexible. So here: step size — this is our alpha, the learning rate. You have beta one and beta two, which are the exponential decay rates for the moment estimates. These betas play the same role as the beta in momentum, but now there are two of them. You have f, the stochastic objective function with parameters theta, which would be L in our case. You have theta, which is the parameter vector. This is good. Then there's a first moment vector and a second moment vector — two moment vectors, whereas in momentum you had only one. And t is the timestep. So now what we see: we calculate the gradient, and then we calculate the two moment vectors at the same time. Each moment vector has a specific formula. This one is the same thing as before — nothing different here. This one is different because we're taking g at time t squared. So it's g t and g t squared; that's the only difference between the two. Then it does something a bit different: a bias correction of the first moment estimate and a bias correction of the second moment estimate. These two things are corrected before being plugged into the update. This is why you're going to see four buffers: the two raw buffers — you're just adding one more than momentum — plus their corrected versions.
And then you take these two, correct them — so it's the corrected ones that matter now, not the raw ones — and you plug the two corrected quantities into the update rule. And the update rule, if you take a quick look at it, has the same form again: the previous parameters, minus some learning rate times something, whatever it is. It's just that in this case, that something is the bias-corrected first moment divided by the square root of the bias-corrected second moment, plus some epsilon, which is a small constant used so you don't divide by zero when this thing is way too small. You see here it's very small. And that's the form of Adam. Now you know it. Keep it in your head before going into the other stuff. Okay, now we're ready for QHAdam. If you take a quick look over here: yes, there are a lot of letters, but one thing to notice is that this is the same stuff as in Adam. If you were to rewrite it, the first four buffers are exactly the same. So if you've worked through Adam, you understand pretty well what is going on here. Where the difference lies is over here. You see, we have the exact same idea as in QHM. Looking back at QHM, we had the parameter nu, the immediate discount factor, interpolating between two things: either stochastic gradient descent if nu is zero, or momentum if nu equals one, with a continuum in between. It looks like the same thing in QHAdam, but not exactly, because if you take a quick look, we have a nu one and a nu two. On my piece of paper, what I would do here is try a bunch of values for these to understand what the algorithm is actually doing. But here they already tell you: if you have nu one and nu two equal to one, right?
You get Adam back, because if you have that, this term goes to zero. So this is knocked out, and this is knocked out; it doesn't exist anymore because of the one. So you have the first moment buffer divided by the square root of the second moment buffer, plus an epsilon — that's exactly Adam, the same thing as over there. It shows that with these parameters you can recover a lot of different algorithms, and you get a continuous space based on the values of nu one and nu two, where at each extreme you get something different. With QHAdam, you can recover RMSProp if you set nu one to zero and nu two to one. If you set nu one equal to beta one and nu two to one, you get NAdam, and you can get a whole bunch of others like that. So that's the idea: instead of having only one optimization algorithm, it mixes the power of multiple ones to create something a bit different. And if you look at the experiments section of the paper, it shows why it's better, more efficient, and so on. But if we only look at the formulas, that's basically it. It's the exact same idea as QHM, now applied on two axes, essentially, and that's it. Now that we've done all that work of understanding, I would distill each of the logical formulas down into an intuition that you understand. Very important here: you don't need to be mathematically rigorous in your intuition. It just needs to make logical sense to you. Write that down in the paper. And now you can breeze through the paper from start to finish, and if need be, you can always go and dig back down to follow your train of thought.
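To close the loop on QHAdam, here's a sketch of a single QHAdam step next to a single Adam step, numerically checking the recovery claim that nu1 = nu2 = 1 gives Adam back (a minimal sketch on scalars; the toy numbers are my own, and this follows the bias-corrected form of the update):

```python
import math

# QHAdam step: the nus mix the raw gradient with the bias-corrected buffers:
#   theta_{t+1} = theta_t - alpha *
#       ((1 - nu1)*grad + nu1*g_hat) / (sqrt((1 - nu2)*grad**2 + nu2*s_hat) + eps)

def qhadam_step(theta, g, s, grad, t, alpha, beta1, beta2, nu1, nu2, eps=1e-8):
    g = beta1 * g + (1 - beta1) * grad          # first-moment buffer
    s = beta2 * s + (1 - beta2) * grad ** 2     # second-moment buffer
    g_hat = g / (1 - beta1 ** t)                # bias corrections
    s_hat = s / (1 - beta2 ** t)
    num = (1 - nu1) * grad + nu1 * g_hat
    den = math.sqrt((1 - nu2) * grad ** 2 + nu2 * s_hat) + eps
    return theta - alpha * num / den, g, s

def adam_step(theta, m, v, grad, t, alpha, beta1, beta2, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / (math.sqrt(v_hat) + eps), m, v

# nu1 = nu2 = 1 knocks out the raw-gradient terms, leaving exactly Adam:
args = dict(theta=1.0, grad=0.3, t=1, alpha=0.001, beta1=0.9, beta2=0.999)
qh, _, _ = qhadam_step(g=0.0, s=0.0, nu1=1.0, nu2=1.0, **args)
ad, _, _ = adam_step(m=0.0, v=0.0, **args)
print(qh == ad)  # True
```

Setting nu1 = 0 instead drops the momentum numerator entirely, which is the RMSProp-style update mentioned above — exactly the kind of parameter-twiddling worth doing on paper to build the intuition.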
This ties up nicely with my video on how to read deep learning papers in general, which I did a few weeks ago. You can check it out over here. And that's it for today. I hope you enjoyed the video. Don't forget to like it if that was the case, and leave a comment if you have any questions. I'm here to help. Have a great week, everyone, and see you in the next video.