Translation: How to Read Math in Deep Learning Paper?

Transcript Translation

How to Read Math in Deep Learning Paper? - https://www.youtube.com/watch?v=YXWxVxQ6AeY

Here you are, reading an interesting deep learning paper. The abstract looks awesome. The introduction is great. Then you get into the methods section. At first everything is fine, until you hit the wall of formulas. Your mind gets a bit foggy and you decide to glance over them quickly. Guess what comes afterward? More formulas. At this point, your mind has completely blanked out and you have a hard time even following the text. You finally set the paper aside.

In this video, I'll show you how to read the math section of a deep learning paper with a simple method I use. We will be using the quasi-hyperbolic Adam (QHAdam) gradient descent formula as an example. For those who are new to the channel, I'm Yasin, a published researcher and machine learning practitioner who loves to teach. The method I use has these five steps, which we will go through. The core idea is that you need to slow down when you read math in a deep learning paper and work through the formulas by hand.

So let's get started by taking a deep breath. Feeling overwhelmed by math is completely natural. This feeling is experienced by many others, even seasoned machine learning practitioners. The first thing you should realize, though, is that you're not supposed to take one glance at a formula and get it, especially if the formula has arbitrary single-letter names like x, a, b. What you're supposed to do is work through the formula, understand each of the elements that compose it, and make multiple attempts at understanding how everything connects together. At the end of this exercise, you should be left with an intuition about why something is working. That intuition is what you will recall the next time you see the formula. But first off, take a deep breath.

Now that we have the right mindset, the first step is to identify all the formulas in the paper that are shown or referred to. The ones that are shown in the paper should be kept within their logical block.
What I mean by that is that each of the formula blocks is trying to come down to a result of some sort. Identify these groups of results. Sometimes it's not too clear, so do your best. The ones that you need to make sure you track down and note are those that are referred to in the paper but not shown. Those are usually a good source of confusion, since the authors will assume you already know what they mean. Also be aware that a later formula usually connects logically to the ones shown earlier in the paper.

Once you have identified all of them, I would strongly advise you to take them out of the digital world into the real world, a.k.a. onto a piece of paper. This might be just a personal preference, but I've always found that you need as many degrees of freedom as possible to move the math around. The variables aren't just variables; they mean something. Being able to take your pen and sketch out the motion the formula is going through, or how a variable refers to a specific topic, is critical for your understanding. Prepare to work on the formulas just as you would on a Lego building set.

After you have taken all these formulas out, now is the time to translate the symbols you are seeing into meaning. Math is as much about symbols as a poem is about letters. The meaning behind the symbols is absolutely crucial for you to get at a deep level to be able to read the formula. The best way I've found to parse math when encountering a new formula is to slowly and surely study each of the symbols and understand how they interact. Here's how I do it. We'll use this formula from the quasi-hyperbolic Adam paper. I'll show you an overview, and then I'll walk through it myself.

First, I want to understand what each of the individual symbols means and put a name on them. Just in this example, there are twelve symbols that we need to understand, which aren't explicitly explained in the image.
As you work through this first step, you will realize that a lot of the formulas reuse the same symbols. Hopefully, they don't change their meaning within the same paper. From now on, when reading them, name them in your head instead of reading alpha or epsilon or beta.

After that, I look at the connections between these different entities and how their interaction transforms things with one meaning into something completely new. This is where I do a lot of the work to build an intuition: I take examples and work through them to really unveil the motions these ideas are going through.

Finally, I study how each of these individual ideas constructs the final result for a logical block. Ideally, at the end, I should be able to walk someone, or a rubber duck, through how we go from these atomic entities to these larger concepts and how they all connect together. This can take a while, by the way, especially if I'm missing a core concept that I need to fetch externally.

All right, let's jump into the formulas now and work through the core one, so that you see how it's done. So remember the first step: don't freak out. We're going to be reviewing this paper, "Quasi-hyperbolic momentum and Adam for deep learning". We're not going to actually review the paper, just the formulas. This is a cool one because it looks scary, but it's actually pretty simple, and there's some information that is not in the paper, so you have to go and get it. We're going to go through the whole flow together. But first things first, we take a deep breath. This is not that hard. You see, here we have some formulas. You see another formula here, right? And then there's a whole bunch of them. And we have the big one, which is this scary one, all the mumbo jumbo. So let's get to it.

Okay, so step one: now we have to identify all the formulas that are shown and referred to. The first one that we have here is not an actual formula. It's kind of like a template, right?
But it's important to know. It's this one, the definition of an optimization algorithm. And then you have that thing, right? The parameter at time t plus one is the parameter at time t minus something. What is cool with this paper is that they really break it down from the very primitive to the actual novelty. So let's take it out and put it into our stuff.

The next one we have is plain stochastic gradient descent, which is just over here. It's also pretty simple. We have, like, this minus alpha times this stuff. The second one we have is the momentum version of stochastic gradient descent. Okay? So we're making some good headway right now. Then we get the first new thing, which is quasi-hyperbolic momentum. So this is not QHAdam, it's QHM. Here we are starting to get a lot of letters, but it builds on the thing that we've seen before. Let's take it out.

Okay. Afterward, we have a whole bunch of stuff. They're doing comparisons between that thing, QHM, and things like Nesterov's accelerated gradient, PID control, a synthesized Nesterov variant, accelerated stochastic gradient descent. A whole bunch of them. We don't need to dive into them. Basically, they're making the claim that QHM recovers all of them if you set the parameters properly. This is not too important, so we skip them.

Then we get to QHAdam, which is where things get a bit out of control. So let's take this whole thing out, and let's stop a bit here and look at our flow. We're starting with the primitive. Then we have this one, stochastic gradient descent. We have the momentum one. Cool. We have QHM. So far, everything makes sense. Then we jump straight to QHAdam, right? It's the QHM thing applied to Adam, but they haven't shown Adam. So we need to get that out, and this is external to the paper. So here we are in the Adam paper. This is the original one.
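
For reference, the formulas pulled out of the QHM paper in this step can be written out as below. The video only points at them on screen, so this is a reconstruction from the spoken description (theta for the parameters, L hat for the mini-batch loss, alpha for the learning rate, beta for the discount factors, and the Greek letter nu for the "v" the walkthrough mentions later). Treat the exact typesetting and the primed names for the bias-corrected buffers as assumptions to check against the paper.

```latex
\begin{align*}
\text{SGD:}\quad
  & \theta_{t+1} = \theta_t - \alpha \,\nabla \hat{L}_t(\theta_t) \\[4pt]
\text{Momentum:}\quad
  & g_{t+1} = \beta\, g_t + (1-\beta)\,\nabla \hat{L}_t(\theta_t) \\
  & \theta_{t+1} = \theta_t - \alpha\, g_{t+1} \\[4pt]
\text{QHM:}\quad
  & g_{t+1} = \beta\, g_t + (1-\beta)\,\nabla \hat{L}_t(\theta_t) \\
  & \theta_{t+1} = \theta_t - \alpha \left[ (1-\nu)\,\nabla \hat{L}_t(\theta_t) + \nu\, g_{t+1} \right] \\[4pt]
\text{QHAdam:}\quad
  & g_{t+1} = \beta_1 g_t + (1-\beta_1)\,\nabla \hat{L}_t(\theta_t),
    \qquad g'_{t+1} = \frac{g_{t+1}}{1-\beta_1^{\,t+1}} \\
  & s_{t+1} = \beta_2 s_t + (1-\beta_2)\left(\nabla \hat{L}_t(\theta_t)\right)^2,
    \qquad s'_{t+1} = \frac{s_{t+1}}{1-\beta_2^{\,t+1}} \\
  & \theta_{t+1} = \theta_t - \alpha
    \left[ \frac{(1-\nu_1)\,\nabla \hat{L}_t(\theta_t) + \nu_1\, g'_{t+1}}
                {\sqrt{(1-\nu_2)\left(\nabla \hat{L}_t(\theta_t)\right)^2 + \nu_2\, s'_{t+1}} + \epsilon} \right]
\end{align*}
```

Every block ends in the same "parameter at t plus one equals parameter at t minus something" line, which is the template form described at the start of this step.
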
And then we have this beautiful algorithm that walks us through the whole thing that we need. So we can use this one. And there we go. Now we have the whole gang. The whole gang is there. We have this one, and then we have QHAdam. Job's done.

Okay, now ideally, you take these out of here and start to work on them on a piece of paper, like I'm doing over here. The reason for this, like I said before, is that I don't have a lot of degrees of freedom here. I have my mouse here, a crappy, crappy mouse, and I'm going to write like this and work on them. I much prefer to have my pens, circle stuff, name something here, take another piece of paper and do some derivation if I need to, and then bring it back and have all of my stuff on the table, where I can move it around and look at it, right? For demonstration purposes, I'm going to do it all over here, but ideally you'd do what I already did: take them out and work through them on a real piece of paper.

Okay, so let's work through this piece of alphabet soup. If we start with QHAdam, you're going to freak out. For sure you're going to freak out, like, what the hell is happening here? We have these ones: are these derivatives? No. We have g at t plus one. You have s here. You have a bunch of betas; these look important, they look like hyperparameters, right? And you have these on top, the v's that we saw previously. But this thing is absolutely unreadable. This is total madness. So the trick, right: if a paper walks you through the whole flow and the whole journey, there's a reason behind it. There's a reason why in the paper this is, like, section five and not section zero, right? So take a step back.

Okay, here we are, we took a step back. We start with the primitive. Here we have theta, which is the model parameters, right?
So every time we see a theta from now on, don't say "theta" in your head. Say "parameter". Here, the parameter at t plus one is equal to the parameter at time t minus something. That's how you should read it.

So now we know what theta is. Then we have L of theta. What this means is just a loss function to be minimized through the parameters that we have. So for the L in everything that we're going to see, we should say "loss function" in our head. Then we have L hat, because if you look at all of them, they have a little hat on them; there's never an L alone. And this is the little thing that will confuse you. You have L hat of theta. What the heck does that mean? It just means that it's an approximation of the loss function over a mini-batch. That's all it means. If you were coding this thing, you would be given a mini-batch, say 50 examples or whatever, and you would average the loss over them. And this is how you make a step for the parameters with that mini-batch. You don't take the whole dataset every time to make a step, and I think this is why it's an approximation of the loss function: you're approximating the number for the loss. In some cases, if you're doing literally stochastic gradient descent, you're going to take just one of the data points. Okay, so we're good.

Now, what the heck is the little inverted triangle in front of the L? It's the gradient of the function L. Again, you're never going to see the gradient of L itself. You're going to see the gradient of L hat, the approximation, because you're taking one sample data point, or several, at time t. So just with this, you have four different symbols, but this is what they mean together: an approximation of the gradient at time t. Okay, we're making some headway here with all of this stuff, right?
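
To make the difference between the loss L and its mini-batch approximation L hat concrete, here is a minimal Python sketch. The toy dataset, the one-parameter model y = theta * x, and the function names are all made up for illustration; none of them come from the paper or the video.

```python
import random


def loss(theta, batch):
    """Mean squared error of the toy model y = theta * x over `batch`."""
    return sum((theta * x - y) ** 2 for x, y in batch) / len(batch)


def grad(theta, batch):
    """Derivative of `loss` with respect to theta, averaged over `batch`."""
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)


# Hypothetical dataset: y = 3x exactly, so the best theta is 3.
data = [(float(x), 3.0 * x) for x in range(1, 101)]

# A mini-batch of 50 examples, like the "about 50" mentioned above.
rng = random.Random(0)
minibatch = rng.sample(data, 50)

theta = 1.0
full_gradient = grad(theta, data)         # gradient of L: uses every example
approx_gradient = grad(theta, minibatch)  # gradient of L hat: mini-batch estimate
print(full_gradient, approx_gradient)
```

The two printed numbers differ but point in the same direction, which is all the optimizer needs from the approximation. Passing a batch of size one to `grad` gives the "literally stochastic" case.
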
That's already a whole bunch of the unknowns that are out of the way, but you have to stop and really put in your head what all of these mean. And then you have that operations on vectors are element-wise. Good. And then you have all of these: g, s, v, w. All of these are auxiliary buffers, right? A buffer, in the programming sense, is just an array of some sort where you store something between some number of steps. You see here, there are two steps, so this is one buffer, right? Here it is two buffers, because you have multiple steps, and you have all of these, which are related to these buffers. So there are operations on buffers happening. Okay, so whenever we see g, s, v, w, and sometimes even a, they're going to be buffers. And g is special: it's the momentum buffer. We're going to see it in momentum as well. Finally, all of these can be subscripted by t, which is a step in time for the optimization. So, good job. Now we have a good understanding of what the heck is happening, and we can move one step further on our road to understanding QHAdam.

So here we have stochastic gradient descent, the plain one. You see that the parameter at t plus one is equal to the parameter at time t minus, you see, same form, alpha times the gradient of the approximation of the loss with the parameter theta at t. So what does that mean? Let's break it down a bit. This is the parameter, the current one, that we plug in here. Then we calculate the approximation of the loss function, and we take the gradient of that. So this we know. And then there's this a here, right here. This is actually an alpha; it's not the a that is up here. And alpha is called the learning rate. This is a hyperparameter that you can tune so that you take small steps or big steps. So that's it.
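
The plain SGD update just read out, parameter at t plus one equals parameter at t minus alpha times the gradient estimate, could be sketched as follows. The quadratic toy loss and the names `sgd_step` and `toy_gradient` are assumptions for illustration only.

```python
def sgd_step(theta, grad_estimate, alpha):
    """One plain SGD step: theta_next = theta - alpha * gradient estimate."""
    return theta - alpha * grad_estimate


def toy_gradient(theta):
    """Gradient of the hypothetical loss (theta - 3)^2, standing in for L hat."""
    return 2.0 * (theta - 3.0)


theta = 0.0
for t in range(100):
    # alpha is the learning rate: larger values take bigger steps.
    theta = sgd_step(theta, toy_gradient(theta), alpha=0.1)

print(theta)  # ends up very close to the minimizer, 3
```

Each step here only looks at the current gradient; nothing is remembered from earlier steps, which is the property momentum changes.
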
The only thing that is new here is this learning rate. So every time we see alpha from now on, we keep in mind that it means the learning rate, right? And if we see a instead, it may be a buffer. We're good here. We've got stochastic gradient descent.

Let's also not forget the form that this thing takes. The form, I feel, is almost as important as the symbols, as soon as you have the idea in your head. Now, when I see this, this thing is telling me that this is an optimization step that I'm taking. So if I see the same form somewhere else, I'll know that this is what they mean.

Let's look at momentum, for example. If I just glance over these two formulas, this one and this one, I know for sure that this one is the optimization step, right? It has the same form as this one over here, and the same form as this one. The other one doesn't have the same form; it's something else. But you see that it is connected to the first. So now we recognize the form it is taking, and we see that some stuff has been added. We still have our learning rate here, and we still have our good old parameters over here. We have g, which is now called the momentum buffer. And then we have b, actually beta, which is called the exponential discount factor. So this thing is now the exponential discount factor in my head.

And what is happening here? If we have a beta of zero, this term cancels out and this one goes to one, so we just recover stochastic gradient descent. So it looks like a weighted average between two things, right? Here it's between this, which is the gradient of the estimate of the loss for the parameter at t, and this other thing, which is what happened previously: the previous buffer, right?
So what it is effectively doing is taking into consideration the previous optimization steps we've done. Say the step was super big before and now it's super small, and beta, which controls this, is at, let's say, 0.9. What you do is take into consideration what was happening before, but dampen it down just a bit. This is how you're effectively getting this ball rolling.

Because in this other case, plain stochastic gradient descent, if we had a big, big, big optimization step at t minus one, and now we're at time t and the step is small, we go a big step and then a small step, right? It doesn't matter. It will move. If you picture it in your head, it moves in a kind of stop-motion way. You can have a big motion and then a small motion, big motion, medium motion. What happened previously doesn't matter. Here, though, it's not like this. It's taking into consideration what happened in the past. So if the past step was big and technically this step should be small, it actually combines this one and this one, so it will still be a big step. A bit smaller than the previous one, though, right? It will be maybe something like this. And if all the steps that we calculate keep being small, it will gradually go slower and slower and slower.

So now, in your intuition, you can understand why this is called momentum: if you see the parameter as a ball that is rolling through this loss landscape, it literally has momentum. It carries some prior acceleration from before, and there's some friction in this plane. Okay, so now take a good snapshot of this, because it will have the same form in QH momentum.

Okay, so we have the two together. This was the stuff from before, and this is now QHM, right? Let's take a look at them. If I had them on paper, we would just look at them one after the other.
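
The rolling-ball intuition can be checked with a few lines of Python. This is a minimal sketch of the momentum update as described above (a weighted average of the previous buffer and the new gradient); the function name and the made-up gradient sequence are assumptions for illustration.

```python
def momentum_step(theta, g, grad, alpha, beta):
    """One momentum step.

    g is the momentum buffer and beta the exponential discount factor.
    """
    g_next = beta * g + (1.0 - beta) * grad  # weighted average: past vs. present
    theta_next = theta - alpha * g_next      # same "theta minus something" form
    return theta_next, g_next


# Hypothetical gradients: one big one, followed by small ones.
theta, g = 0.0, 0.0
buffers = []
for grad in [10.0, 0.1, 0.1, 0.1]:
    theta, g = momentum_step(theta, g, grad, alpha=0.1, beta=0.9)
    buffers.append(g)

print(buffers)
```

Even though the gradient drops by a factor of 100 after the first step, the buffer (and therefore the step size) only shrinks gradually, which is the "slower and slower" behavior described above. With `beta=0.0` the buffer is just the current gradient and the update is plain SGD again.
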
If you were to superimpose them, you have the beta times g at t here: same thing. You have the one minus beta: same thing. So the first line, g at t plus one, this momentum buffer, is exactly the same. Nothing has changed here. This is good; we know what this means. What is different here is this whole part, right? Remember when I said that this is kind of a weighted average of two things? It's doing the same trick again. This thing is a weighted average between this quantity, which is very similar to this one, and the rest here, which is what we had before.

So what is going on? Effectively, here we have what they call v (written with the Greek letter nu in the paper), the immediate discount factor. It gives a weighted average of the momentum update step and the plain stochastic gradient descent update step. Then, let's take a look back at stochastic gradient descent. So here we are, we have both of them. You have stochastic gradient descent here, and then we have momentum here. What v is actually doing is taking an average between what would happen if we were only following stochastic gradient descent and what would happen if we were only following the momentum step, right? Because momentum is this and stochastic gradient descent is this. So this is what this means. There's nothing inherently complicated happening here. We're going to follow either momentum more or stochastic gradient descent more, and we keep those two things running in parallel. That's it. That's all that QHM is doing, essentially. There's nothing fancier than this. That's the whole idea, and it's explained in the paper. But when you see it here, it also becomes very, very obvious. So now we're getting there. Let's move on to Adam.
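
The "average between SGD and momentum" reading of QHM can be sketched directly. This is a minimal scalar version under the description above; `qhm_step` and its argument names are hypothetical, with `nu` standing for the immediate discount factor.

```python
def qhm_step(theta, g, grad, alpha, beta, nu):
    """One QHM step: blend the plain gradient with the momentum buffer."""
    g_next = beta * g + (1.0 - beta) * grad       # momentum buffer, unchanged
    direction = (1.0 - nu) * grad + nu * g_next   # weighted average of the two
    return theta - alpha * direction, g_next


# nu = 0 ignores the buffer (plain SGD); nu = 1 uses only the buffer (momentum).
theta_sgd, _ = qhm_step(1.0, 5.0, 2.0, alpha=0.5, beta=0.9, nu=0.0)
theta_mom, g_next = qhm_step(1.0, 5.0, 2.0, alpha=0.5, beta=0.9, nu=1.0)
theta_mix, _ = qhm_step(1.0, 5.0, 2.0, alpha=0.5, beta=0.9, nu=0.7)
print(theta_sgd, theta_mom, theta_mix)
```

Trying the two extremes first, then a value in between, is the same "try a bunch of variations" exercise the walkthrough recommends doing on paper.
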
If you take something from another paper, you have to be very careful about the variables, because they most likely won't be the same as the ones you saw. So you have to take this idea and maybe rewrite it a bit, and then you can incorporate it back into your stuff.

So let's take a look at this. You can see right away that it's a bit different. It's not an actual formula, it's an algorithm. The authors decided to do it this way instead of having a formula. So you have to be aware that there are going to be some slight differences, and be flexible.

Here, the step size is like our alpha, so it's also the learning rate. You have beta one and beta two, which are the exponential decay rates for the moment estimates. These betas are the same kind of beta as in momentum, but you have two now. You have f, which is a stochastic objective function with parameters theta; that should be L in our case. You have theta, which is the parameter vector. This is good. Then there's a first moment vector and a second moment vector. So there are two moment vectors, where in momentum you have only one. And t is the time step.

So now what we see is that we calculate the gradient, and then we calculate the two moment vectors at the same time. The first moment vector has a specific formula: this one, same thing as before, nothing different here. The second one is different because we're taking g at time t to the power of two. So it's a g t and a g t squared. That's all there is. That's the only difference between the two.

Then it does something a bit different: a bias correction of the first moment estimate and a bias correction of the second moment estimate. So those two things are corrected before being plugged into the formula. This is why you're going to see four buffers, right? You have the two buffers, because you're just adding one more than in momentum.
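
The Adam steps being read off the algorithm here (gradient, two moment buffers, bias correction), together with the update rule the walkthrough gets to next, can be sketched for a single scalar parameter as below. The function name is hypothetical, and the default values are the ones commonly quoted from the Adam paper; treat them as assumptions to verify.

```python
import math


def adam_step(theta, m, v, grad, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment buffer
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment buffer (g squared)
    m_hat = m / (1.0 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)              # bias-corrected second moment
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step from zero-initialized buffers with a made-up gradient of 2.0.
theta, m, v = adam_step(1.0, 0.0, 0.0, grad=2.0, t=1)
print(theta)
```

On the very first step the bias correction undoes the shrinkage caused by the zero-initialized buffers, so the parameter moves by roughly `alpha` regardless of the gradient's size.
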
And then you take these two and correct them, so that it's these that matter, not the other ones anymore. Then you take these two quantities and plug them into the update rule. And if you take a quick look at the update rule, you see the same thing, right? You have the previous parameter, some learning rate, and something, whatever it is. It's just that in this case it's the bias-corrected first moment divided by the square root of the bias-corrected second moment plus some epsilon, which is a small constant that keeps you from dividing by zero if this thing is way too small. You see here, it's very small. And that's the form of Adam. Now you know it, right? Keep it in your head before going into the other stuff.

Okay, now we're ready for QHAdam. If you take a quick look over here, yes, there are a lot of letters, but one thing to notice is that it's the same stuff as Adam here. If you were to rewrite this, the first four buffers are exactly the same. So if you work through this, you understand pretty well what is going on here. Where the difference lies is over here, right? You see, we have the exact same idea as QHM. If you look back at QHM, we have the parameter v, the immediate discount factor, which moves between two things. In QHM, it's stochastic gradient descent if v equals zero, or momentum if v equals one, and you can have a continuous blend between one and the other. It looks like the same thing in QHAdam, but not exactly, because if you take a quick look, we have v one and we have v two, right? If I had this on my piece of paper, what I would do here is try a bunch of variations of these to understand what the algorithm is actually doing. But here, they already tell you. If you have v one and v two equal to one, right?
You get Adam back, because with those values this goes to zero, so this is knocked out, and this is knocked out too. This factor doesn't matter anymore because it's a one. So you have the first moment buffer divided by the square root of the second moment buffer plus an epsilon. That's directly Adam. It's the same thing as this one over there. So it shows that with these parameters you can recover a lot of different algorithms, and you have a kind of continuous space here, based on the value of v, and the values of v one and v two, and at each extreme you get something different. With QHAdam, if you set v one to zero and v two to one, you recover RMSProp. If you set v one equal to beta one and v two equal to one, you get NAdam, and you can get a whole bunch of others like that.

So this is the idea, right? Instead of having only one algorithm for optimization, it's mixing the power of multiple ones in order to create something that is a bit different. And if you look at the experiments section in the paper, it shows why it's better, more efficient, and so on. But if we only look at the formula, that's basically it. It's the exact same idea as before for QHM, now applied to two axes, essentially. And that's it.

Now that we've done all that work of understanding, I would distill each of the logical formula blocks down into an intuition that you understand. Very important here: you don't need to be mathematically rigorous in your intuition. It just needs to make logical sense to you. Write that down on the paper. Now you can get through the paper from start to finish, and if need be, you can always go back and dig down to follow the train of thought that led to this intuition.

This ties in nicely with the video on how to read deep learning papers in general that I did a few weeks ago. You can check it out over here. And that's it for today. I hope you enjoyed the video. Don't forget to like it if you did, and leave a comment if you have any questions. I'm here to help. Have a great week, everyone, and see you in the next video.
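
As a companion to the QHAdam walkthrough above, here is a minimal scalar sketch of the update, written from the description in the video: Adam's two bias-corrected buffers, plus one blending factor for the numerator and one for the denominator. The function name, argument names (`nu1`, `nu2` for the spoken "v one" and "v two"), and default values are assumptions for illustration, not the paper's reference code.

```python
import math


def qhadam_step(theta, g, s, grad, t,
                alpha=0.001, beta1=0.9, beta2=0.999,
                nu1=1.0, nu2=1.0, eps=1e-8):
    """One QHAdam step for a scalar parameter; t is the 1-based step count."""
    g = beta1 * g + (1.0 - beta1) * grad        # first moment buffer
    s = beta2 * s + (1.0 - beta2) * grad ** 2   # second moment buffer
    g_hat = g / (1.0 - beta1 ** t)              # bias corrections, as in Adam
    s_hat = s / (1.0 - beta2 ** t)
    # The QHM trick, applied twice: blend the raw gradient with each buffer.
    numerator = (1.0 - nu1) * grad + nu1 * g_hat
    denominator = math.sqrt((1.0 - nu2) * grad ** 2 + nu2 * s_hat) + eps
    return theta - alpha * numerator / denominator, g, s


# nu1 = nu2 = 1: both raw-gradient terms are knocked out, leaving Adam's form.
adam_like, _, _ = qhadam_step(1.0, 0.0, 0.0, grad=2.0, t=1, nu1=1.0, nu2=1.0)

# nu1 = 0, nu2 = 1: the momentum buffer no longer affects the step, which is
# the RMSProp-style setting mentioned above. A stale buffer value changes nothing.
with_stale_buffer, _, _ = qhadam_step(1.0, 100.0, 0.0, grad=2.0, t=1, nu1=0.0, nu2=1.0)
with_empty_buffer, _, _ = qhadam_step(1.0, 0.0, 0.0, grad=2.0, t=1, nu1=0.0, nu2=1.0)
print(adam_like, with_stale_buffer, with_empty_buffer)
```

Plugging in the extreme values of the two blending factors, as done in the last few lines, is a quick way to confirm for yourself which known optimizer each corner corresponds to.
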

딥러닝 논문에서 수학을 읽는 방법? - https://www.youtube.com/watch?v=YXWxVxQ6AeY

여기서 흥미로운 딥러닝 논문을 읽고 있습니다. 초록이 굉장해 보입니다. 서론이 훌륭합니다. 그런 다음 방법 섹션으로 들어갑니다. 처음에는 공식의 벽에 부딪힐 때까지 모든 것이 괜찮습니다. 정신이 약간 흐릿해지고 빠르게 훑어보기로 합니다. 그 다음에 무엇이 나올까요? 더 많은 공식입니다. 이 시점에서 정신이 완전히 비어 있고 텍스트 정보를 따라가는 것조차 어렵습니다. 마침내 논문을 옆에 둡니다. 이 비디오에서는 간단한 방법을 사용하여 딥러닝 논문의 수학 섹션을 읽는 방법을 보여드리겠습니다. 채널을 처음 보는 분들을 위해 준 쌍곡선 애니메이션 경사 하강 공식을 예로 들어보겠습니다. 저는 가르치는 것을 좋아하는 출판 연구자이자 머신 러닝 실무자인 야신입니다. 제가 사용하는 방법에는 다음 5단계가 있으며 이를 살펴보겠습니다. 핵심 아이디어는 딥러닝 논문에서 수학을 읽을 때 속도를 늦추고 수식을 직접 풀어야 한다는 것입니다. 그럼 심호흡을 하면서 시작해 봅시다. 수학에 압도당하는 느낌은 완전히 자연스러운 일입니다. 이런 느낌은 많은 사람들, 심지어 노련한 머신 러닝 실무자도 경험합니다. 하지만 가장 먼저 깨달아야 할 것은 수식을 한 번 보고 이해해야 한다는 것입니다. 특히 수식에 x, a, b와 같이 임의의 단일 문자 이름이 있는 경우 더욱 그렇습니다. 해야 할 일은 수식을 풀고 그것을 구성하는 각 요소를 이해하고 모든 것이 어떻게 연결되는지 이해하기 위해 여러 번 시도하는 것입니다. 이 연습을 마치면 왜 무언가가 작동하는지에 대한 직감이 생길 것입니다. 그 직감은 다음에 수식을 볼 때 떠올리게 될 것입니다. 하지만 우선 심호흡을 하세요. 이제 올바른 사고방식을 갖추었으므로 첫 번째 단계는 논문에서 표시되거나 참조되는 모든 수식을 식별하는 것입니다. 논문에 표시된 수식은 논리적 블록 내에 있어야 합니다. 제가 말하고자 하는 것은 각 공식 블록이 어떤 종류의 결과로 귀결되려고 한다는 것입니다. 이러한 결과 그룹을 식별하세요. 때로는 너무 명확하지 않으므로 최선을 다하세요. 반드시 추적하여 기록해야 할 것은 논문에 언급되었지만 표시되지 않은 것입니다. 다른 공식이 이미 그 의미를 알고 있다고 가정하기 때문에 일반적으로 혼란의 원인이 됩니다. 또한 이후 공식은 일반적으로 논문에서 이전에 표시된 공식과 논리적으로 연결된다는 점에 유의하세요. 모든 공식을 식별한 후에는 디지털 세계에서 실제 세계로, 즉 종이에 옮기는 것이 좋습니다. 이는 개인적인 선호도일 수 있지만 수학을 이동하려면 가능한 한 많은 자유도가 필요하다는 것을 항상 알게 되었습니다. 변수는 단순히 변수가 아니라 무언가를 의미합니다. 펜을 들고 공식이 거치는 동작이나 변수가 특정 주제를 참조하는 방식을 스케치할 수 있는 것은 이해에 중요합니다. 레고 조립 세트를 만들 때처럼 공식을 작업할 준비를 하세요. 이 모든 공식을 꺼낸 후에는 이제 보이는 기호를 의미로 번역할 때입니다. 수학은 시가 글자에 대한 것만큼 기호에 대한 것입니다. 기호 뒤에 숨은 의미는 공식을 읽을 수 있도록 깊은 수준에서 파악하는 데 절대적으로 중요합니다. 새로운 공식을 접했을 때 mat를 구문 분석하는 가장 좋은 방법은 각 기호를 천천히 그리고 확실히 연구하고 상호 작용을 이해하는 것입니다. 제가 하는 방법은 다음과 같습니다. 준 쌍곡선 Adam 논문의 이 공식을 사용하겠습니다. 개요를 보여드린 다음 직접 살펴보겠습니다. 먼저 각 개별 기호의 의미를 이해하고 이름을 붙이고 싶습니다. 이 예에서만 이미지에 명확하게 설명되지 않은 12개의 기호를 이해해야 합니다. 이 첫 번째 단계를 진행하면서 많은 공식이 동일한 기호를 재사용하고 있다는 것을 깨닫게 될 것입니다. 이들을 읽을 때 같은 논문 내에서 의미가 바뀌지 않기를 바랍니다. 지금부터는 알파나 엡실론 또는 베타를 읽는 대신 머릿속에서 이름을 붙이세요. 그 후에, 저는 이러한 서로 다른 개체 간의 연결과 이러한 상호 작용이 어떻게 하나의 의미를 가진 어떤 것들을 완전히 새로운 것으로 바꾸는지 살펴봅니다. 
여기서 저는 직관을 구축하고, 예를 들어, 그것들을 통해 이러한 아이디어들이 겪는 움직임을 실제로 밝혀내기 위해 많은 작업을 합니다. 마지막으로, 저는 이러한 개별 아이디어들이 어떻게 논리적 블록에 대한 최종 결과를 구성하는지 연구합니다. 이상적으로는 마지막에 저는 우리가 이러한 원자적 개체에서 이러한 더 큰 개념으로 어떻게 이동하는지, 그리고 그것들이 어떻게 모두 연결되는지에 대해 누군가를 걷게 하거나 오리를 문지르게 할 수 있어야 합니다. 그런데, 특히 외부에서 가져와야 하는 핵심 개념이 빠진 경우 시간이 좀 걸릴 수 있습니다. 좋아요, 이제 공식으로 넘어가서 핵심 개념을 살펴보겠습니다. 그러면 어떻게 이루어지는지 알 수 있습니다. 그러니 첫 번째 단계를 기억하세요. 당황하지 마세요. 그래서 우리는 이 논문을 검토할 것입니다. 미친 확률적 모멘텀입니다. 그리고 Adam, 딥 러닝을 위해 우리는 실제로 논문을 검토하지 않고 공식만 검토할 것입니다. 이건 무섭게 보이기 때문에 멋진데, 사실 꽤 간단하고 논문에 없는 정보도 있습니다. 그러니 꺼내야 합니다. 그럼 전체 흐름을 함께 살펴보겠습니다. 하지만 우선 심호흡을 하세요. 그렇게 어렵지 않습니다. 여기 몇 가지 공식이 있습니다. 여기 또 다른 공식이 보이죠? 그리고 정말 많은 공식이 있습니다. 그리고 가장 큰 공식이 있는데, 무섭고 횡설수설입니다. 그럼 시작해 볼까요. 좋아요, 1단계로, 이제 표시된 모든 공식을 식별하고 참조해야 합니다. 여기 있는 첫 번째 공식은 실제 공식이 아닙니다. 일종의 템플릿과 같죠? 하지만 이것이 최적화 알고리즘의 정의와 같은 것이라는 것을 아는 것이 중요합니다. 그리고 그런 것이 있죠? 그러니까, 시간 t에서 매개변수를 더한 것은 시간 t에서 매개변수를 뺀 것과 같습니다. 이 논문의 장점은 원시적인 것에서 실제 참신한 것까지 정말 세분화했다는 것입니다. 그러니 꺼내서 우리 자료에 넣어 봅시다. 다음은 평면 확률적 경사 하강법인데, 바로 여기 있습니다. 또한 매우 간단합니다. 알파에서 이것을 곱한 것과 같은 마이너스 값이 있습니다. 두 번째는 확률적 경사 하강법의 모멘텀 버전입니다. 알겠어요? 지금 약간의 여유를 두고 있습니다. 그런 다음 첫 번째는 준포물선 모멘텀입니다. 이것은 qH edem이 아니라 qhM입니다. 여기서 많은 글자를 얻기 시작했지만, 이전에 본 것과 같은 것으로 쌓입니다. 꺼내 봅시다. 알겠어요. 그 후에 많은 자료가 있습니다. 그들은 QHM과 Nesterov 가속 경사 Pid 제어와 같은 것을 비교하고 있습니다. 우리는 합성된, 합성된 네스테로프 변형, 가속 확률적 경사 하강법을 가지고 있습니다. 그것들이 아주 많습니다. 우리는 그것들에 대해 깊이 파고들 필요가 없습니다. 기본적으로, 그들은 QHm이 그것들을 모두 복구한다고 주장합니다. 매개변수를 적절히 변경한다면, 이것은 그렇게 중요하지 않습니다. 그래서 우리는 그것들을 건너뜁니다. 그런 다음 QH Adam으로 넘어가는데, 여기서 상황이 약간 통제 불능이 됩니다. 그러니 이 모든 것을 제거하고 여기서 잠시 멈추도록 합시다. 우리의 흐름을 살펴보겠습니다. 우리는 원시에서 시작합니다. 그리고 이것이 있습니다. 이것은 확률적 경사 하강법입니다. 우리는 모멘텀 1을 가지고 있습니다. 멋지네요. 우리는 Qhm을 가지고 있습니다. 지금까지는 모든 것이 의미가 있습니다. 그런 다음 우리는 바로 QH Adam으로 넘어가죠, 그렇죠? 그래서 그것은 Adam에 적용된 Qhm이지만, 우리는. 그들은 Adam을 언급하지 않았습니다. 그래서 우리는 그것을 꺼내야 합니다. 그리고 이것은 논문 외부에 있습니다. 그래서 여기 우리는 Adam 논문에 있습니다. 그래서 이것은 원본입니다. 그리고 우리에게 필요한 모든 것을 안내하는 아름다운 알고리즘이 있습니다. 그래서 이걸 사용할 수 있습니다. 그리고 저기 있습니다. 이제 우리는 모든 갱단을 가지고 있습니다. 모든 갱단이 거기에 있습니다. 이게 있고, 그런 다음 qhlm 작업이 완료되었습니다. 좋습니다. 
이제 이상적으로는 이것들을 꺼내서 여기서 꺼내서 제가 여기에서 하는 것처럼 종이에 작업을 시작할 것입니다. 이유가 맞죠? 앞서 말했듯이, 여기에는 자유도가 많지 않기 때문입니다. 여기에 마우스가 있습니다. 엉터리, 엉터리 마우스, 이렇게 쓰고 작업할 것입니다. 저는 끝을 원 모양으로 표시하고 여기에 무언가 이름을 붙이고, 다른 종이를 가져와서 필요하다면 파생 작업을 시도하고, 다시 가져와서 테이블 위에 모든 것을 놓고 움직이고 볼 수 있도록 하는 것을 훨씬 선호합니다. 맞죠? 데모 목적으로, 여기 전부 다 하겠지만, 그냥 멍하니 할게요. 방금 꺼내서 진짜 종이에 풀어냈어요. 좋아요, 알파벳 수프를 풀어보죠. 여기 있는 것, qh 원자로 시작하면, 틀림없이 미칠 거예요. 도대체 여기서 무슨 일이 일어나고 있는 거지? 이런 것들이 있는데, 미분인가요? 아니요, g, t 더하기 1, 더하기 1이 있어요. 여기 s가 있고, 베타 변수가 있어요. 중요해 보이네요. 하이퍼파라미터처럼 보이죠? 그리고 위에 v가 있는데, 이전에 봤던 거지만, 이건 이게 전부고 읽을 수 없어요. 완전 미친 짓이에요. 그러니까 요령 요령, 그렇죠? 전체 흐름, 전체 흐름과 전체 여정을 다루는 논문이 있다면, 그 뒤에 이유가 있어요. 논문에서 이게 5절이고 0절이 아닌 데는 이유가 있죠, 맞죠? 그러니 한 걸음 물러서세요. 좋아요, 여기 있습니다. 한 걸음 물러서세요. 맞죠? 원시 함수부터 시작합니다. 여기에는 세타가 있는데, 모델 매개변수죠? 그러니까 매번 데이터를 볼 때마다. 머릿속에서 그렇게 말하지 마세요. t+1에서의 매개변수, t+1에서의 매개변수는 시간 t에서의 매개변수에서 무언가를 뺀 것과 같다고 합시다. 그렇게 읽어야 합니다. 이제 세타가 무엇인지 알았으니, 세타의 l이 있습니다. 이것이 의미하는 바는 매개변수를 통해 최소화해야 할 손실 함수라는 것입니다. 그래서 볼 모든 것의 l은 헤드 손실 함수라고 해야 하고, l hat이 있습니다. 왜냐하면 모든 것을 보면, 마치 작은 추가가 있는 것처럼, l만 있는 경우는 없기 때문입니다. 그리고 이것이 여러분을 혼란스럽게 할 작은 것입니다. l의 세타 hat이 있습니다. 이게 무슨 뜻일까요? 그냥 손실 함수의 근사치라는 뜻이죠? 미니 배치에 대해서는 그게 전부입니다. 이걸 코딩했다면 미니 배치를 제공하는 거죠? 그러니까 50 정도나 되는 거죠. 그런 다음 이걸 통해 평균을 냅니다. 그리고 이렇게 해서 매개변수를 미니 배치로 옮기는 단계가 만들어집니다. 매번 전체 데이터 세트를 가져와서 만들지 않습니다. 제 생각에 이게 마지막 함수의 근사치인 이유인 것 같습니다. 손실에 대한 숫자라고 근사하는 겁니다. 어떤 경우에는 문자 그대로 확률적 준비 상태를 하는 경우 데이터 포인트 중 하나를 가져옵니다. 좋아요, 그럼 좋습니다. 이제, 작은 삼각형이 뭐예요? 역 l이죠? 함수 l의 기울기입니다. 다시 말하지만, 절대 볼 수 없을 겁니다. l의 기울기를 볼 수 없을 겁니다. 여기서 기울기를 보게 될 텐데, 근사치의 유연한 모자는 시간 t에서 샘플 데이터 포인트 하나 또는 여러 개를 가져오기 때문입니다. 그래서 이것만으로, 네 가지 다른 기호가 있지만, 이것이 의미하는 바입니다. 시간 t에서의 기울기의 근사치와 같습니다. 좋아요, 우리는 이 모든 것과 같은 진전을 이루고 있습니다, 맞죠? 이미 일종의 미지수가 많이 나와 있지만, 여러분은 꼭대기에 올라서서 이 모든 것이 무엇을 의미하는지 머릿속에 넣어야 합니다. 그리고 벡터는 연산 요소별로 있습니다. 좋습니다. 그리고 blob, GS, vW가 있습니다. 이 모든 것이 보조 버퍼인 것처럼요, 맞죠? 그래서 프로그래밍 관점에서 버퍼는 x개 사이의 단계 사이에 무언가를 저장할 일종의 배열과 같습니다, 맞죠? 여기서 두 단계가 있는 것처럼 보입니다. 그래서 이것은 하나의 버퍼, 맞죠? 여기서는 여러 단계가 있고 이 모든 것이 이 버퍼와 관련이 있기 때문에 두 개의 버퍼와 같습니다. 버퍼에서 발생하는 작업이 있습니다. 알겠습니다. 
GSVW와 a를 볼 때마다 때때로 버퍼가 됩니다. 그리고 g는 특별합니다. 모멘텀 버퍼입니다. 모멘텀에서도 볼 수 있습니다. 다시 한 번, 여기에서 볼 수 있습니다. 마지막으로 이 모든 것은 t에 의해 구독 가능하며, 이는 최적화를 위한 시간 단계입니다. 잘하셨습니다. 이제 무슨 일이 일어나고 있는지 잘 이해했습니다. qh 원자를 이해하기 위한 길로 한 단계 더 나아갈 수 있습니다. 여기서는 확률적 격자가 있습니다. 평범한 격자입니다. t에서 매개변수가 1 더하기 t에서 매개변수를 빼는 것과 같다는 것을 알 수 있습니다. 알파 시간에서 동일한 형태를 볼 수 있습니다. D에서 매개변수 세타를 갖는 손실 근사자의 기울기입니다. 그러면 무슨 뜻일까요? 조금 나눠서 설명하죠, 그렇죠? 이게 매개변수이고, 현재 여기서 잼한 매개변수입니다. 그리고 손실 함수의 근사치를 계산하고, 그 기울기를 취할 겁니다. 그러니까 이건 알죠. 그리고 여기 a가 있습니다. 바로 여기요. 이건 알파인데, 사실 여기 있는 a가 아니에요. 그리고 알파는 여기서 학습률이라고 합니다. 이건 작은 단계나 큰 단계를 취할 수 있도록 조정할 수 있는 하이퍼 매개변수입니다. 그게 다예요. 여기서 새로운 것은 이 학습률뿐입니다. 그래서 알파를 볼 때마다, 그 후에, 학습률 매개변수를 의미한다는 걸 기억해야 합니다, 맞죠? 그리고 대신 a를 본다면, 버퍼일 수도 있고, 여기서는 괜찮습니다. 우리는 세트에 확률적 그리드를 얻었습니다. 그리고 이것이 취하는 형태도 잊지 마세요, 맞죠? 제가 느끼기에 형태는 상징만큼이나 중요합니다. 머릿속에 아이디어가 떠오르는 대로 말이죠. 이제 이걸 볼 때, 이게 제게는 최적화 단계라고 말해주는 거죠. 그러니까 다른 곳에서 같은 형태를 본다면, 누군가, 어딘가에서, 이게 그들이 의미하는 바라고 알 수 있을 겁니다. 모멘텀을 살펴보죠. 예를 들어, 이 두 공식을 훑어보면, 이거랑 이거면 확실히 알 수 있어요. 이게 최적화 단계죠? 여기 있는 것과 같은 형태고, 저것과 같은 형태예요. 다른 하나는 같은 형태가 아니고, 뭔가 다른 거예요. 하지만 이 하나가 저것과 연결되어 있다는 걸 알 수 있죠. 이제 우리는 취하고 있는 형태를 인식하고, 추가된 것들이 있다는 걸 알 수 있어요. 여기에는 여전히 학습 속도가 있죠? 여전히 오래된 매개변수가 있죠. 여기에는 g가 있는데, 이제 모멘텀 버퍼라고 합니다. 버퍼죠? 그리고 b가 있는데, 이것을 지수 할인 계수라고 합니다. 사실, B 베타입니다. 그래서 이것은 지금 제 머릿속에서 지수 할인 계수입니다. 그리고 여기서 무슨 일이 일어나고 있는 걸까요? 베타가 0이라면요. 그래서 이것은 제거되고 이것은 1로 갑니다. 어떤 의미에서 우리는 확률적 격자를 복구합니다. 그래서 두 가지 사이에 가중 평균이 있는 것 같았죠? 여기 이것은 t에서 매개변수에 대한 마지막 추정치의 기울기인 이것 사이입니다. 그리고 이 다른 것은, 이 다른 것은 이전에 일어난 일입니다. 이전 버퍼, 맞죠? 그래서 효과적으로 하는 일은 우리가 한 이전 최적화를 고려하는 것입니다. 이전에 매우 컸지만 지금은 매우 작고 베타가 제어하고 있다고 가정해 보겠습니다. 예를 들어 0.9입니다. 여러분이 할 일은 이전에 일어난 일을 고려하는 것이지만, 그것을 약간 낮추는 것입니다. 그래서 이것이 여러분이 얻는 효과적인 방법입니다. 당신은 공을 굴리게 만들고 있어요. 이 경우, 맞죠, 만약 t-1에서 크고, 크고, 큰 최적화 단계가 있었고, 지금 t 시점에 있고 단계가 작다면, 우리는 큰 단계로 이동하고 나서 작은 단계로 이동할 겁니다, 맞죠? 상관없습니다. 움직일 겁니다. 머릿속으로 보면, 정지 모션처럼 움직일 겁니다. 큰 동작이 있고, 작은 동작, 큰 동작, 중간 동작이 있을 수 있습니다. 하지만 이전에 여기서 무슨 일이 일어났는지는 중요하지 않습니다. 그게 아닙니다. 과거에 무슨 일이 일어났는지 고려하는 겁니다. 그러니 과거가 컸다면, 맞죠, 그리고 기술적으로 이 단계는 작아야 하지만, 실제로는 이 단계와 이 단계를 취할 겁니다. 
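So it will still be big. It will still be a big step, just a little smaller than this one, right? Probably something like this, right? And if every step we compute keeps being small, it will get slower and slower. So now, in your head, intuitively, you can understand why this is called momentum. Because if you literally picture the parameters as a ball rolling across this loss landscape, it really does have momentum. It has a bit of acceleration left over from before, and there's a bit of friction on this surface. Okay, now take a good snapshot of this, right? Because QH momentum is going to have the same form. Okay, we have the two together. This was the previous thing, and this now is QHM, right? Let's take a look. If I had them on paper I would lay one on top of the other; here we'll go through them in turn. If you overlay them, you have the same beta times g at t here. You have the same 1 minus beta. Like the first one, it gives g at t plus 1. So this momentum buffer is exactly the same thing. Nothing has changed here. This is good. We know what this means and why it's here. It's this whole part that is different, right? Remember when I said this was a weighted average of two things? They're doing the same trick again. Right. This is a weighted average between these two quantities. It's very similar to this, right? But this quantity is something, plus the rest here, which is what we had before. Right. So what's going on? Actually, they have a name for v here; they call v the immediate discount factor. So it's a weighted average of the momentum update step and the stochastic gradient descent update. Let's look at stochastic gradient descent again. Here it is. We have both. Here's stochastic gradient descent and here's momentum. What v actually does is take an average between what would happen if you only followed stochastic gradient descent right now and what would happen if you only followed the momentum step. Because momentum is this and stochastic gradient descent is this. So that's what it means, right? There's nothing inherently complicated here. We either follow momentum more or follow stochastic gradient descent more, and we run the two in parallel. That's all. That's basically all QHM does. There's nothing fancier than that. That's the whole idea. They explain it in the paper, but when you look at it here it becomes very, very clear. Now we're getting there. Let's move on to Adam. When you pull something in from a different paper, you have to be very careful about variables like this one. It's likely the same as another variable you've already seen. So you can take the idea, rewrite it slightly, and then fold it back into your own notes. Let's take a look. As you can see, it's a bit different. It's not really a formula; it's an algorithm. Right? So these authors decided to present it this way instead of using a formula. You have to be flexible and expect small differences like that. Here the step size is alpha. So that's the learning rate as well. You have beta 1 and beta 2, which are the exponential decay rates for the moment estimates. Right. This beta is the same beta as in momentum. But now there are two. Then you have f, a stochastic objective function with parameters theta, which would be L in our case. You have theta, which is the parameter vector. Good. Here you have the first moment vector, and then the second moment vector. So where momentum has just one vector, here there are two. And t is the time step. So what we see now is that you compute the gradient and then compute the two moment vectors at the same time.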
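The QHM idea described above, a weighted average between the plain gradient step and the momentum step, can be sketched as follows. This is a scalar toy version; the function name and the numbers are assumptions for illustration, and v here stands for the immediate discount factor.

```python
def qhm_step(theta, g, grad, alpha, beta, v):
    """One QHM step: blend the raw gradient with the momentum buffer."""
    g_next = beta * g + (1.0 - beta) * grad    # same buffer update as momentum
    update = (1.0 - v) * grad + v * g_next     # v = 0: SGD, v = 1: momentum
    return theta - alpha * update, g_next


# Hypothetical starting point, buffer, and gradient for the demonstration.
theta, g, grad = 1.0, 5.0, 2.0

theta_v0, g_next = qhm_step(theta, g, grad, alpha=0.1, beta=0.9, v=0.0)
theta_v1, _ = qhm_step(theta, g, grad, alpha=0.1, beta=0.9, v=1.0)
theta_mix, _ = qhm_step(theta, g, grad, alpha=0.1, beta=0.9, v=0.7)

print(theta_v0, theta_v1, theta_mix)
```

Setting v to 0 gives the plain SGD step, setting it to 1 gives the momentum step, and any value in between lands between the two results.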
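The moment vectors each have a specific formula. This one is the same as before, isn't it? Nothing different here. This one is different, because you take g at t to the power 2. So it's g t and g t squared. That's all. That's the only difference between the two. Then they do something slightly different. They apply a bias correction to the first moment estimate and a bias correction to the second moment estimate. So both of these get corrected before they go into the formula. That's why there are four buffers, right? You have two buffers because you add one more than momentum has. Then you take those two and correct them, so that the corrected ones are the ones that matter. Then you take these two quantities and put them into the update rule. If you look at the update rule for a second, you see the same thing, right? The previous parameters, the learning rate, and something. Right? In this case, it's the bias-corrected first moment divided by the square root of the second moment, plus epsilon. Epsilon is typically just there so you don't divide by zero when the denominator gets too small. It's very small. And that's the form of Adam. Got it? Keep it in mind before we move on to the next thing. Okay, now we're ready for QHAdam. If you look at it for a second, there are a lot of letters, but one thing to notice is that it's the same as Adam. If you rewrite it, the first four buffers are exactly the same. So once you've looked at those, you should have a pretty good idea of what's going on here. Right? The difference is here. Right? We have exactly the same idea as in QHM. If you look back at QHM, you have the same idea with the parameter v. v is the immediate discount factor that slides between the two. In QHM, if v is 0 you get stochastic gradient descent, and if v is 1 you get momentum. You can have anything on the continuum between the two. In QHAdam it looks like the same thing, but not exactly. If you look for a second, there's a v1 and a v2, right? So if I had this on my own paper, what I would do here is try several variations of these to understand what the algorithm is actually doing. But here they already tell you. If v1 and v2 are both 1, right? You get Adam back. Because in that case this goes to 0. So this is gone. This is gone. And this no longer changes anything, because it's 1. So you have the first moment buffer divided by the square root of the second moment buffer, plus epsilon, in exactly the same place. It's the same as here. So with these parameters you can recover a lot of different algorithms, and you have a continuous space, let's say here, depending on the values of v1 and v2, right? And at each extreme you get something different. So with QHAdam, if you set v1 to 0, right, and v2 to 1, you recover RMSProp. If you make v1 equal to beta 1 and v2 equal to 1, you get NAdam, and you can get several others that way. So that's the kind of thing it is. That's the idea, right? Instead of using just one algorithm for optimization, you blend the strengths of several algorithms to make something slightly different. If you look at the experiments section of the paper, you can see why it's better and more efficient. But if you only look at the formulas, that's basically all there is. It's exactly the same idea as before, right? You apply QHM now, you basically have two axes, and that's it. Now that you've done all the work of understanding, summarize each logical group of formulas into an intuition that you understand. This is very important here. Your intuition doesn't need to be mathematically rigorous.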
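As a way of trying out variations of v1 and v2 by hand, here is a minimal scalar sketch of an Adam step and a QHAdam step side by side. It is written from the description in the transcript, not copied from either paper: the function names, the 0-based step index t (so the bias correction uses t + 1), and the toy numbers are all assumptions.

```python
import math


def adam_step(theta, g, s, grad, t, alpha, beta1, beta2, eps=1e-8):
    """One Adam step on a scalar; t is the 0-based step index (assumption)."""
    g_next = beta1 * g + (1.0 - beta1) * grad        # first moment buffer
    s_next = beta2 * s + (1.0 - beta2) * grad ** 2   # second moment buffer
    g_hat = g_next / (1.0 - beta1 ** (t + 1))        # bias-corrected first moment
    s_hat = s_next / (1.0 - beta2 ** (t + 1))        # bias-corrected second moment
    return theta - alpha * g_hat / (math.sqrt(s_hat) + eps), g_next, s_next


def qhadam_step(theta, g, s, grad, t, alpha, beta1, beta2, v1, v2, eps=1e-8):
    """One QHAdam step: the same four buffers, blended with the raw gradient."""
    g_next = beta1 * g + (1.0 - beta1) * grad
    s_next = beta2 * s + (1.0 - beta2) * grad ** 2
    g_hat = g_next / (1.0 - beta1 ** (t + 1))
    s_hat = s_next / (1.0 - beta2 ** (t + 1))
    numerator = (1.0 - v1) * grad + v1 * g_hat
    denominator = math.sqrt((1.0 - v2) * grad ** 2 + v2 * s_hat) + eps
    return theta - alpha * numerator / denominator, g_next, s_next


# Hypothetical first step from theta = 1.0 with empty buffers and gradient 2.0.
args = dict(theta=1.0, g=0.0, s=0.0, grad=2.0, t=0,
            alpha=0.1, beta1=0.9, beta2=0.999)
theta_adam, _, _ = adam_step(**args)
theta_qh, _, _ = qhadam_step(v1=1.0, v2=1.0, **args)

print(theta_adam, theta_qh)
```

With v1 and v2 both set to 1 the raw-gradient terms drop out, so the two functions return the same parameters, matching the claim that QHAdam recovers Adam in that case.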
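It just has to make logical sense to you. Write it down on the paper. Now you can go through the paper from start to finish. If you need to, you can dig back in at any time and retrace the train of thought that led to this intuition. This goes well with the video I made a few weeks ago on how to read deep learning papers in general. You can check it out here. That's it for today. I hope you enjoyed the video. If you did, please give it a like, and if you have any questions, leave a comment. I'm here to help. Have a great weekend, everyone, and see you in the next video.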