Part 12 - Gradient Descent Algorithm
In the previous lecture we saw how hard it is to adjust every weight in an entire neural network based on the feedback errors that are propagated back through the network during backpropagation.
For a simple linear classifier with one parameter, it was easy to find the relationship between the Error and how much the parameter's value should be adjusted to reduce that error, the difference between the network's predicted value (output) and the target value, through this formula.
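As a reminder, here is a sketch of roughly what that formula looked like, assuming the single-parameter classifier y = Ax from the earlier lecture with error E = t - y (the symbols are my assumption, not necessarily the exact notation used before):

```latex
% Hypothetical reconstruction: single-parameter linear classifier y = Ax,
% error E = t - y. The parameter change that removes the error in one step:
\Delta A = \frac{E}{x}
```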
That is because the output of the whole network depends on only one parameter, which we can tweak easily.
The problem with a neural network is that its output depends on many parameters, the weights, with each weight influencing an output that has also been influenced by the other weights.
This makes the final output of a neural network a function of functions (a multivariate function), so you cannot make a weight the subject of this function using linear algebra.
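To see why, here is a sketch of what the output of a small two-layer network looks like when written out in full (the symbols, including the activation function σ, are mine, for illustration):

```latex
% Output of a small two-layer network: a function of functions.
% Every weight is buried inside nested activations, so no single weight
% can be isolated as the subject of the equation by plain algebra.
o = \sigma\!\left( \sum_{j} w^{(2)}_{j} \, \sigma\!\left( \sum_{i} w^{(1)}_{ij} x_i \right) \right)
```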
And we cannot try every possible value of the weights to find the correct ones by brute force, as we saw.
The solution to this problem is the Gradient Descent algorithm, which we are going to look at in this lecture.
Before we continue, remember that the Error E is the difference between the network output o and the target value t.
So, if the output of the neural network is a complex function of many functions and variables that we cannot even write down (meaning we cannot even know the shape of its graph/curve), then the Error of the network is likewise a complex function of many variables that we cannot write down. Remember that the target value t is a constant (the training data does not change throughout training).
So the Error function, or Loss function by its other name (it is sometimes called the Loss function because it represents the loss of accuracy between the network output and the target value), is a multivariate function.
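In symbols, a sketch of that statement (the squared form is one common choice of Loss, an assumption on my part, not necessarily the exact one used in this series):

```latex
% The Error/Loss as a function of all the weights.
% t is a constant (the training data); only the weights vary.
E(w_1, w_2, \dots, w_n) = \big( t - o(w_1, w_2, \dots, w_n) \big)^2
```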
And remember that a neural network can have very many weights, so the graph/curve of this function produces very complex, higher-dimensional shapes.
For example, for a simple neural network with just two weights, this would be the shape of its Error function (the Error being a function of those two weights).
It is a fairly complex graph for just two weights; with more than two weights we move into higher-dimensional shapes that we cannot even visualize.
Another key thing to note on that graph is that although the function is complex, it is differentiable.
I know you have not forgotten calculus yet, but let's remind ourselves what we mean when we say a function is differentiable.
A function is differentiable if it changes smoothly as its variables change; there is no abrupt change or jump at any point on the graph, so you can draw a tangent line at any point you like.
So we can measure how little the function changes as one of its variables changes; in other words, we can measure its rate of change, hence differentiable.
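A minimal Python sketch of that idea (the function f here is just an illustration, not our real Loss function): estimate the rate of change at a point by taking a tiny step h on either side.

```python
# Estimate the rate of change (derivative) of a smooth function at a point
# using a small central difference.

def rate_of_change(f, x, h=1e-6):
    # (f(x + h) - f(x - h)) / (2h) approximates df/dx at x
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2           # illustrative smooth function

print(rate_of_change(f, 3.0))  # ~6.0, since df/dx = 2x
```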
Now to the main point: how will Gradient Descent solve this problem? And what exactly is Gradient Descent?
This is the formal definition of Gradient Descent:
It is a first-order iterative optimization algorithm for finding a local minimum of a differentiable multivariate function.
(If the last time you studied pure mathematics was Advanced level, I suspect you have never come across this method, so let's take a little time to understand it.)
We already know what a differentiable function means (if we can calculate its rate of change, then it is differentiable), but what about the terms optimization, first-order, iterative, a local minimum, and multivariate function? What are those?
Ok, optimization is a method of finding the "best" or most efficient solution from a set of many possible solutions or choices.
To put it in an example: imagine you want to budget fuel for your car on a trip to place A. If there are 7 routes you could take, optimization is choosing the shortest one, because it will save more fuel than any of the others.
What does local minimum mean? Ok, let's go back in time, to Advanced level.
In differentiation, a minimum point or points (also called local minima) are the values of the variable(s) that produce the lowest possible value of the function.
Imagine this function, y = x².
The minimum points are the values of x that make y smallest. For a single-variable function like this, we can find the minimum points by setting its first order, or first derivative (the gradient), to 0.
(When we say first-order we mean the first derivative of the function, i.e. its gradient.)
So (0, 0) is where the function y has its lowest value. This is the graph of the function y showing that minimum point, or local minimum.
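Worked out, the steps look like this:

```latex
% Minimum of y = x^2 by setting the first derivative (gradient) to zero.
y = x^2, \qquad \frac{dy}{dx} = 2x, \qquad 2x = 0 \;\Rightarrow\; x = 0,\; y = 0
```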
Fine, but you will also come across this other term: the Global minimum. As the words suggest, a local minimum is a point on the graph where the function has its lowest value, BUT only compared with nearby points, hence the word local.
On the other hand, a Global minimum is a point on the graph where the function has its lowest value compared with all other points.
Putting it in our car example: a local minimum is a route that is short compared with the routes near it, while the global minimum is the shortest route of all, which is the one that matters most for fuel optimization.
Our function y = x² is simple enough that its local minimum is also its global minimum; there is no point with a lower value of y than (0, 0).
Now, picture a complex function with a complex graph like this.
If you imagine that curve as one big valley, then there are smaller valleys (local minima) inside it.
But the very bottom of the whole valley is its global minimum, the point that lies lower than any other point.
For a very complex curve it is hard to find the global minimum (to reach the very bottom of the valley, in our analogy), so our hope is to at least find a local minimum that sits close to the global minimum.
What does iterative mean? Gradient descent is iterative because it does not hand us the local minimum (or the global minimum, if we are lucky) directly in one step; it repeats over many steps until it is confident it has reached a local minimum of the function.
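In standard update-rule form (η here is the learning rate, the size of each step; this is the conventional notation, not this lecture's):

```latex
% One gradient descent iteration: take a step opposite to the gradient.
w_{k+1} = w_k - \eta \, \nabla E(w_k)
```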
I have no need to say more about Algorithm because I am sure you know what it means, but in a few words it is a step-by-step solution to a problem.
A multivariate function is any function that depends on many variables (or dimensions, by another name).
For example, a function z = f(x, y) is a multivariate function of x and y. And so gradient descent can operate in a higher-dimensional space.
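For instance, taking the illustrative two-variable function f(x, y) = x² + y² (my example, not the lecture's), the gradient is a vector with one rate of change per variable, and the same recipe extends to any number of dimensions:

```latex
% Gradient of a two-variable function: one partial derivative per variable.
f(x, y) = x^2 + y^2, \qquad
\nabla f = \left( \frac{\partial f}{\partial x}, \; \frac{\partial f}{\partial y} \right) = (2x, \; 2y)
```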
The power of Gradient descent is that we do not need to know the function itself to find its local minimum, the way we needed the equation of y = x² in order to differentiate it. That criterion fits our problem exactly, because we do not know the actual equation of our Loss function.
Using Gradient descent we can estimate its local minimum without having to derive or know the whole Loss function (which we could not do anyway).
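A minimal Python sketch of that claim: even when the loss is a black box whose equation we cannot write down, we can still estimate its slope at the current weights by nudging each weight a tiny amount and re-evaluating. The two-weight loss below is a made-up stand-in, purely for illustration.

```python
# Estimate the gradient of a "black-box" loss numerically, one weight at a time.
# We never need the loss function's equation, only the ability to evaluate it.

def numerical_gradient(loss, weights, h=1e-6):
    grad = []
    for i in range(len(weights)):
        w_plus, w_minus = list(weights), list(weights)
        w_plus[i] += h
        w_minus[i] -= h
        # slope of the loss along weight i
        grad.append((loss(w_plus) - loss(w_minus)) / (2 * h))
    return grad

# Illustrative stand-in for the unknown Loss function (two weights)
def loss(w):
    return (w[0] - 1) ** 2 + (w[1] + 2) ** 2

print(numerical_gradient(loss, [0.0, 0.0]))  # ~[-2.0, 4.0]
```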
You might ask yourself how all of that explanation relates to reducing the Error or Loss function, which is the thing we actually want.
Well, imagine our Loss function were a simple quadratic equation like the y = x² above.
Finding its local minimum is exactly the same as reducing its value, because that is the defining property of a local minimum: the point where the function has its lowest value.
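A quick Python sketch of that equivalence, using x² as the stand-in quadratic: every downhill step lowers the function's value, so descending toward the minimum and reducing the loss are the same act.

```python
# Each gradient descent step on y = x**2 reduces y, heading to the minimum at x = 0.
x = 4.0
for step in range(5):
    grad = 2 * x           # dy/dx of x**2 (the slope where we stand)
    x -= 0.25 * grad       # step opposite to the gradient
    print(step, round(x, 4), round(x ** 2, 6))
```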
And reducing the loss function is what learning means for a network: when there is no longer a big difference between the neural network's output and the target value, we say the neural network has learned.
So if we can find the local minimum of this complex and unknown Loss function using gradient descent, we can now train our neural network.
And because the variables of this loss function are the weights of the network, gradient descent will automatically calculate and assign the correct values of the weights, values able to produce the correct outputs by the end of training.
In other words, gradient descent will go and find for us the values of the weights that reduce the Error or Loss function the most, the very thing we were struggling to find.
I say the "most", not completely, because we are dealing with complex "valleys" or shapes in higher-dimensional space, so it is hard to find the global minimum; but gradient descent is efficient enough to give us a local minimum that lies very close to the global minimum.
So a neural network can never be completely correct all the time, but to be fair, neither are we humans.
So how does gradient descent actually work?
Gradient descent takes a much simpler and more intuitive route to solving our problem of training a neural network.
To understand it, imagine this scenario.
You are standing at the top of a valley, or of a mountain with many peaks (an analogy for our complex Loss function). Your goal is to get down (to the local or global minima, where the error has its lowest value), but there is a problem.
It is night and dark, you have no map and you do not know the way (the Loss function is unknown).
The only thing you have is a torch that can light up three or four steps ahead.
In this situation, the Gradient descent algorithm tells you to study the spot where you are standing, see which direction slopes uphill, and then head the opposite way.
In other words, find the direction that leads downhill (descent), against the slope (opposite of the gradient). This algorithm is very intuitive; essentially it says that no matter where you find yourself in this valley, with every step you take, do not climb up.
At every spot you reach, repeat the same routine: shine your torch to see where the ground rises, then step down.
That is why it is called Gradient descent (meaning: always take steps opposite to the gradient, or slope).
If you do this again and again (iteratively), you will find yourself at the bottom of the valley.
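Putting the whole analogy into one minimal Python sketch (the gradient function and the learning rate are illustrative assumptions, not the real network's values): at every step, measure the slope where you stand and move the weights a small step the opposite way.

```python
# Gradient descent: repeatedly step opposite to the gradient (downhill).

def gradient_descent(grad, weights, learning_rate=0.1, steps=100):
    for _ in range(steps):
        g = grad(weights)  # which way is uphill, right where we stand?
        # take a small step in the opposite (downhill) direction
        weights = [w - learning_rate * gi for w, gi in zip(weights, g)]
    return weights

# Illustrative "valley" whose bottom (minimum) is at w = (1, -2)
grad = lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)]

print(gradient_descent(grad, [5.0, 5.0]))  # ~[1.0, -2.0]
```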
Continued here.