Create Your Own Neural Network From Scratch (A.I - 101)

Part 13 - Mean Squared Error (Loss) Function and the Mathematics of the Gradient Descent Algorithm


From the very beginning, we have seen that the Error is the difference between the network prediction, O, and the target value from the training dataset, T:
$$E = T - O$$

It is a very simple equation: it measures the difference between the output and the target value, and our job is to reduce that difference as much as we can. But will it serve us well where we are heading?

NB: The error function goes by many names: Error function, Loss function and Cost function. They all refer to the same thing. In some cases there can be subtle differences, but most of the time the terms are used interchangeably.

The table below shows three candidate error functions, including the one above:
Output (O)    Target (T)    T - O    |T - O|    (T - O)^2
0.5           0.4           -0.1      0.1        0.01
0.7           0.8            0.1      0.1        0.01
Sum                          0.0      0.2        0.02

We have the network output, O, the target output, T, and three kinds of error functions.

The first is the simple difference between the output and the target we already know. The weakness of this approach appears when we want to judge the overall performance of the whole network by adding up all the errors: this error function has a tendency to cancel some errors against others and gives us a wrong judgement of the network's performance.

You can see in the table that if we take the sum of all the errors we get 0, implying that overall the network has no error (that it performs well), even though it has made two incorrect predictions (0.5 instead of 0.4, and 0.7 instead of 0.8).

The reason is that negative and positive errors cancel each other out. Even when they do not cancel completely to 0, you can see this is not a good way of measuring the network error.
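
To make that cancellation concrete, here is a minimal sketch (plain Python, using the two predictions from the table above) that sums each of the three candidate error functions:

```python
# The two predictions and targets from the table above
outputs = [0.5, 0.7]
targets = [0.4, 0.8]

simple_sum  = sum(t - o for o, t in zip(outputs, targets))         # signed errors cancel out
abs_sum     = sum(abs(t - o) for o, t in zip(outputs, targets))    # always positive
squared_sum = sum((t - o) ** 2 for o, t in zip(outputs, targets))  # positive and smooth

print(simple_sum)   # ~0.0: the network looks "perfect", which is misleading
print(abs_sum)      # ~0.2
print(squared_sum)  # ~0.02
```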

The second kind of error function takes the absolute value of each error (it keeps only positive values) in order to avoid the cancellation of errors.
The downside of this approach is that the absolute value function does not have a smooth curve: its slope changes abruptly where it reaches its minimum (e.g. at x = 0), so it is not differentiable there.

Here is a graph of the simple absolute value function so you can see what we mean:
[Graph: y = |x|, a V-shaped curve with a sharp corner at x = 0]

We saw earlier that for a function to be differentiable, it has to change smoothly as its parameters change, with no discontinuities or abrupt changes of any kind. Yet you can see that as x approaches 0 (at x = 0), the absolute value function switches abruptly (from decreasing to increasing, or vice versa).
We cannot use this approach, because the gradient descent algorithm, which depends on the differentiability of the function (its first derivative) to work, cannot deal with a V-shaped curve of this kind. Moreover, its slope does not shrink as we approach the minimum, so there is a risk of overshooting and bouncing around that point forever.
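
A quick sketch of that bouncing behaviour (my own toy example, plain Python with a fixed step size): on |x| the slope is always +1 or -1, so a constant-size step oscillates around the minimum, while on x^2 the slope shrinks as we approach 0 and the steps settle down.

```python
def step_abs(x, lr):
    # slope of |x| is +1 for x > 0 and -1 for x < 0 (undefined at exactly 0)
    grad = 1.0 if x > 0 else -1.0
    return x - lr * grad

def step_square(x, lr):
    # slope of x**2 is 2x, which shrinks as x approaches the minimum at 0
    return x - lr * 2 * x

x_abs, x_sq, lr = 0.25, 0.25, 0.1
for _ in range(10):
    x_abs, x_sq = step_abs(x_abs, lr), step_square(x_sq, lr)

print(x_abs)  # keeps jumping between values around 0 instead of settling
print(x_sq)   # steadily shrinks towards 0
```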

The third kind of error function is the squared error function, meaning we simply square the error function we are already used to:
$$E = (T - O)^2$$

We already know that the squared error function is differentiable: it changes smoothly as its variables change, at any point on the graph. So we have solved our problem of errors cancelling each other, and we still keep the advantage of being able to use gradient descent.

This candidate error function has another advantage as well: the derivative of a squared function is easy to find, so it is also cheap to compute.

The Mathematics of Gradient Descent

If we imagine the overall error function as a curve, then every weight in the neural network represents a particular location, or point, on that curve.

Here is a 2D (plane) representation of what we have just said:
[Figure: the error E plotted as a curve against a single weight w, with the slope (gradient) shown at one point on the curve]

What we now need to know is the slope, or gradient, of the error function at that location or weight (generally, at any location or weight). In the figure above, $w$ represents any particular weight in the neural network and $\frac{\partial E}{\partial w}$ is the slope of the error function with respect to this weight ONLY; formally, this is the partial derivative of the error with respect to this weight while treating all the other weights as constant.

(Mathematically there is no difference between the ordinary derivative you are used to and a partial derivative; the difference is that in a partial derivative we differentiate the function with respect to only one variable while keeping the others constant. In our case we want the gradient of the error function with respect to this one particular variable, so we have no need to worry about all the other weights; we keep them constant.
We use partial derivatives whenever we deal with a multivariate function, as in this case.)
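
A small worked example of my own (not from the network) to show what "keeping the other variables constant" means:

$$f(x, y) = x^2 y + y^3, \qquad \frac{\partial f}{\partial x} = 2xy, \qquad \frac{\partial f}{\partial y} = x^2 + 3y^2$$

Differentiating with respect to $x$ treats $y$ as a fixed number, exactly as we treat all the other weights when we differentiate the error with respect to one weight.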

What we are keen to know here is the direction of the slope, so that we can move this weight in the opposite direction.

Now let us look at this diagram: a simple neural network with 3 layers, each layer having 2 neurons.
[Diagram: a 3-layer network (input, hidden, output) with 2 neurons per layer, labelled with the notation described below]

Although we will refer to this diagram, we will generalize our derivation so that it fits any neural network of any size.

Let us look at the notation in that diagram:

$o_i$ represents any input signal into the network

$w_{ij}$ represents any weight connecting any node in the input layer to any node in the hidden layer

$w_{jk}$ represents any weight connecting any node in the hidden layer to any node in the output layer; for a hidden layer, it represents any weight connecting any node in the previous layer to any node in the current (or next) layer. Remember that the output of one hidden layer serves as the input to the next hidden layer

$o_k$ represents the final output of the neural network at any output node

$e_j$ is the combined backpropagated error at any hidden-layer node (the sum of the error fractions propagated back to that node from the output layer)

$t_k$ is the target value at any output node (this value is constant because it comes from the training data)

$i$, $j$ and $k$ denote any particular node in the input layer ($i$), the hidden layer ($j$) and the output layer ($k$)

As you can see, we have made sure to stay as general as possible so that our final equation fits any neural network of any size.

To begin, let us focus on the weight between any node of the hidden layer and any node of the output layer it connects to.
In the notation above, this particular weight is called $w_{jk}$.

So, imagine we are standing on top of this mountain of the error function $E$ (in a deep neural network, this is a very complex, higher-dimensional space), at the location $w_{jk}$, and we want to know which way the ground rises (the slope, or gradient) so that we can head in the opposite direction. (Remember we have no map and we do not know where the downhill path is, so our best option is to check where the ground rises and avoid going that way; that is the whole logic of gradient descent.)
And remember that we are focusing only on this weight; we ignore all the others.

So, mathematically, we need to compute the partial derivative of the error $E$ with respect to this weight $w_{jk}$, which is the steepness (gradient) of this mountain (the error space) at this location (this weight).

Mathematically, that gradient is:

$$\frac{\partial E}{\partial w_{jk}}$$



But $E$ is the sum of all the squared errors at the output nodes. (Remember, we are now using the squared error function as our best error function, for the reasons we saw in the previous lecture.)

So,

$$E = \sum_{n} \left(t_n - o_n\right)^2$$

where:
$n$ is the index of a particular output node (it runs from n = 1 up to the total number of output nodes, which depends on the size of the neural network you have designed)
$t_n$ is the target value at that particular node
$o_n$ is the output, or prediction, produced by that node

So, expanding our expression, we get:

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}} \sum_{n} \left(t_n - o_n\right)^2$$

Now, let us expand the sigma notation:

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}} \left[ \left(t_1 - o_1\right)^2 + \left(t_2 - o_2\right)^2 + \dots + \left(t_n - o_n\right)^2 \right]$$

As we can see, we would have to take the derivative of every single squared-error term with respect to the weight $w_{jk}$.

That is computationally expensive, considering that we could have thousands or even billions of these terms depending on the number of output neurons / nodes.

But recall what we learned during the forward pass: each weight influences the output of the neuron it is connected to, and has no influence on the output of a node it is not connected to.
That is why, when we propagate errors back during backpropagation, each node receives its fraction of the error only from the output nodes it is connected to.

To be continued...
 
$w_{jk}$ is one particular weight; say it connects some node $j$ in the hidden layer with some node $k$ in the output layer, so it influences the output of that output node only and not the output of any other node.

In other words, $w_{jk}$ only influences the output $o_k$, i.e. the output where $n = k$.

Therefore the derivative of every other squared error with respect to $w_{jk}$ is $0$, except for the squared error where $n = k$.


So, with that logic, we have simplified our calculation enormously:

$$\frac{\partial E}{\partial w_{jk}} = 0 + 0 + \dots + \frac{\partial}{\partial w_{jk}}\left(t_k - o_k\right)^2 + \dots + 0$$

Removing those zeros, we get this simpler expression:

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}}\left(t_k - o_k\right)^2$$

Despite this simplification, we still have to find the derivative of $E$ with respect to $w_{jk}$.


But we know that $o_k$ is itself a function of $w_{jk}$, because its value is influenced by $w_{jk}$ during the linear transformation, that is, the multiplication of the incoming signals by the weights. (Note that at this stage we treat the incoming signal as a constant, so the only real variable of $o_k$ is the weight $w_{jk}$.)

Having said that, it is clear we are dealing with a composite function, because we are trying to find the derivative of one function (the error) with respect to a variable (the weight) that is itself a variable of another function (the output) inside the same expression.

Ladies & gentlemen......

The Chain Rule

You may remember this rule from Pure Mathematics Paper 1.

If a variable z depends on the variable y, which itself depends on the variable x (that is, y and z are dependent variables), then z depends on x as well, via the intermediate variable y. In this case, the chain rule is expressed as

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$
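
A tiny worked example of my own to make the rule concrete: if $z = y^2$ and $y = 3x$, then

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} = (2y)(3) = 6y = 18x,$$

which matches differentiating $z = (3x)^2 = 9x^2$ directly.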


In our case, $z$ is $E$, $y$ is $o_k$, and $x$ is $w_{jk}$.

To get the derivative of the error with respect to the weight, $\frac{\partial E}{\partial w_{jk}}$, we must first find the derivative of the error with respect to the output, $\frac{\partial E}{\partial o_k}$, together with the derivative of the output with respect to the weight, $\frac{\partial o_k}{\partial w_{jk}}$, as the chain rule requires.

Let us start with the derivative of the error with respect to the output.

To be continued...
 


Let us start with the derivative of the error with respect to the output,

$$\frac{\partial E}{\partial o_k}$$


Let us say $u = t_k - o_k$, so that $E = u^2$ (remember, this is just the squared error equation that remained).

Therefore, by the chain rule:

$$\frac{\partial E}{\partial o_k} = \frac{\partial E}{\partial u} \cdot \frac{\partial u}{\partial o_k}$$


Now, let us apply the power rule to compute the derivative of the error with respect to $u$ (a function of the output):

$$\frac{\partial E}{\partial u} = \frac{\partial}{\partial u}\, u^2$$


Power rule

The power rule is used to find the slope of polynomial functions and any other function that contains an exponent with a real number. In other words, it helps to take the derivative of a variable raised to a power (exponent).

$$\frac{d}{dx}\, x^n = n\, x^{\,n-1}$$


We get:

$$\frac{\partial E}{\partial u} = 2u$$

Then what remains is the derivative of $u$ with respect to the output,

$$\frac{\partial u}{\partial o_k}$$

This one is easy. Remember that $u = t_k - o_k$, where $t_k$ is a constant (it is the target value from the training data at this node), so the derivative is simply $-1$.


Now, taking the product of those derivatives according to the chain rule, we get:

$$\frac{\partial E}{\partial o_k} = 2u \cdot (-1) = -2u$$

Remember that $u = t_k - o_k$, so:

$$\frac{\partial E}{\partial o_k} = -2\left(t_k - o_k\right)$$

Therefore,

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial o_k} \cdot \frac{\partial o_k}{\partial w_{jk}}$$

becomes

$$\frac{\partial E}{\partial w_{jk}} = -2\left(t_k - o_k\right) \cdot \frac{\partial o_k}{\partial w_{jk}}$$

But remember that $o_k$ is obtained by applying the sigmoid function to the combined moderated signals coming from the hidden layer, so mathematically:

$$o_k = \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)$$

where $o_j$ is the input coming from hidden-layer node $j$.
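
As a quick sketch of that forward step in code (a minimal illustration with made-up numbers for the two hidden outputs and the two weights feeding one output node):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values: outputs of the two hidden nodes and the weights into output node k
o_hidden = [0.6, 0.4]          # o_j for j = 1, 2
w_jk     = [0.9, -0.3]         # w_1k, w_2k

u   = sum(w * o for w, o in zip(w_jk, o_hidden))  # combined moderated signal into node k
o_k = sigmoid(u)                                   # output of node k
print(o_k)
```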


Now, expanding our equation:

$$\frac{\partial E}{\partial w_{jk}} = -2\left(t_k - o_k\right) \cdot \frac{\partial}{\partial w_{jk}}\, \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)$$

Now let us focus on this part of the equation:

$$\frac{\partial}{\partial w_{jk}}\, \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)$$

Let us assign a label to the inner part of this composite function, as we did above:

$$u = \sum_{j} w_{jk} \cdot o_j$$

(Notice that we want to find the derivative of this whole function with respect to the weight $w_{jk}$, but the weight is also a variable inside the sigmoid function.)
To be continued...
 

Then we apply the chain rule as before:

$$\frac{\partial}{\partial w_{jk}}\, \operatorname{sigmoid}(u) = \frac{\partial\, \operatorname{sigmoid}(u)}{\partial u} \cdot \frac{\partial u}{\partial w_{jk}}$$

Let us start with the first part, which is essentially the derivative of the sigmoid function:

$$\frac{\partial\, \operatorname{sigmoid}(u)}{\partial u}$$

Here we do not need to do the calculation ourselves; we use the standard result for the derivative of the sigmoid function, which is:

$$\frac{\partial\, \operatorname{sigmoid}(u)}{\partial u} = \operatorname{sigmoid}(u)\left(1 - \operatorname{sigmoid}(u)\right)$$
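
You can sanity-check that standard result numerically; this small sketch (my own check, not part of the derivation) compares the formula against a central finite-difference approximation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # standard result: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central finite difference
print(sigmoid_prime(x), numeric)  # the two values agree to several decimal places
```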

With that done, let us move on to the second part, the derivative of $u$ with respect to the weight:

$$\frac{\partial u}{\partial w_{jk}}$$


Remember that

$$u = \sum_{j} w_{jk} \cdot o_j$$

so:

$$\frac{\partial u}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}} \sum_{j} w_{jk} \cdot o_j$$


Expanding that sigma notation:

$$\frac{\partial u}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}}\left(w_{1k}\, o_1 + w_{2k}\, o_2 + \dots + w_{jk}\, o_j + \dots\right)$$

we get:

$$\frac{\partial u}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}}\left(w_{1k}\, o_1\right) + \frac{\partial}{\partial w_{jk}}\left(w_{2k}\, o_2\right) + \dots + \frac{\partial}{\partial w_{jk}}\left(w_{jk}\, o_j\right) + \dots$$


As before, we would have to compute the derivative of each term. But notice that all the terms that do not contain $w_{jk}$ simply do not change with respect to it: the outputs $o_1, o_2, \dots$ are not influenced by the weight $w_{jk}$, and the other weights $w_{1k}, w_{2k}, \dots$ are being held constant (remember, this whole operation is a partial derivative), so the derivative of each of those terms is simply $0$.


For the one term that does contain our weight, the derivative of $w_{jk}$ with respect to itself is $1$, so we are left with its constant factor, which is $o_j$.

So:

$$\frac{\partial}{\partial w_{jk}}\, \operatorname{sigmoid}(u) = \operatorname{sigmoid}(u)\left(1 - \operatorname{sigmoid}(u)\right) \cdot o_j$$


But

$$u = \sum_{j} w_{jk} \cdot o_j$$

so:

$$\frac{\partial}{\partial w_{jk}}\, \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right) = \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)\left(1 - \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)\right) \cdot o_j$$


Finally, the derivative of the error $E$ of the neural network with respect to any weight between the hidden and output layers, $w_{jk}$, i.e. $\frac{\partial E}{\partial w_{jk}}$, can now be written out in full.


With plenty of dust and sweat, we have derived the expression used to compute the gradient of the error function with respect to any weight between the hidden and output layers.

Here is the full expression:

$$\frac{\partial E}{\partial w_{jk}} = -2\left(t_k - o_k\right) \cdot \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)\left(1 - \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)\right) \cdot o_j$$
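
As a sanity check on the whole derivation, here is a minimal sketch that evaluates this gradient for one output node and compares it with a numerical estimate (all the values are made up for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def error(w, o_hidden, t_k):
    # squared error at output node k for a given set of weights w_jk
    o_k = sigmoid(sum(wj * oj for wj, oj in zip(w, o_hidden)))
    return (t_k - o_k) ** 2

# Hypothetical values: two hidden outputs, their weights into node k, and the target
o_hidden = [0.6, 0.4]
w        = [0.9, -0.3]
t_k      = 0.8

# Analytical gradient with respect to w[0], straight from the derived expression
u   = sum(wj * oj for wj, oj in zip(w, o_hidden))
o_k = sigmoid(u)
grad_analytic = -2 * (t_k - o_k) * o_k * (1 - o_k) * o_hidden[0]

# Numerical gradient via a small finite difference on w[0]
h = 1e-6
grad_numeric = (error([w[0] + h, w[1]], o_hidden, t_k)
                - error([w[0] - h, w[1]], o_hidden, t_k)) / (2 * h)

print(grad_analytic, grad_numeric)  # the two should agree closely
```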


Let us take a moment to formalize what we have learned so far:

1. Partial Derivative

In the context of neural networks, a partial derivative represents the rate of change of a function (typically a loss function) with respect to one of its input variables, while keeping the other variables constant. Specifically, when training a neural network, we compute the partial derivatives of the loss function with respect to each weight in the network to understand how small changes in that weight will affect the overall loss.

2. Chain Rule

The chain rule is a fundamental rule in calculus used to compute the derivative of a composite function. In neural networks, the chain rule is used extensively during backpropagation, a process where we calculate the gradients of the loss function with respect to each weight.

3. Power Rule

The power rule is a basic derivative rule in calculus, which states that:
$$\frac{d}{dx}\, x^n = n\, x^{\,n-1}$$


4. Why a Weight Only Influences the Output of the Node It Is Associated With

In a neural network, each weight is associated with a specific connection between two neurons. Therefore, the effect of a weight change is localized to the node (neuron) it is directly connected to, which in turn influences the output of that specific node.

5. Why This Fact Simplifies the Gradient Descent Calculation

The fact that a weight only influences the output of the node it is associated with simplifies the gradient descent calculation because it allows us to break down the problem into smaller, more manageable pieces. When computing the gradient of the loss with respect to a particular weight, we only need to consider how changes in that weight affect the specific node's output, rather than having to account for the entire network all at once.
During backpropagation, we can calculate the gradient of the loss with respect to each weight individually, then update each weight accordingly. This locality reduces the computational complexity and makes it feasible to train large neural networks efficiently using gradient descent and its variants (like stochastic gradient descent)

Next: Derivative of Error with respect to any weight between input and hidden layers.
 


Part 14 - Derivative of Error with respect to any weight between input and hidden layers


In the previous lecture, we saw how to compute the gradient of the error function $E$ with respect to any weight between any node of the hidden layer and any node of the output layer, $w_{jk}$:

$$\frac{\partial E}{\partial w_{jk}} = -2\left(t_k - o_k\right) \cdot \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)\left(1 - \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)\right) \cdot o_j$$

Now we need to compute the derivative of the error function $E$ with respect to any weight between any node of the input layer and any node of the hidden layer, $w_{ij}$.

Mathematically, we need to compute:

$$\frac{\partial E}{\partial w_{ij}}$$


Instead of deriving it again from scratch, we can use symmetry and the physical interpretation of the equation for $\frac{\partial E}{\partial w_{jk}}$.


Let us look at each part of that equation and its physical interpretation:

$\left(t_k - o_k\right)$ has a simple interpretation: it is just the error, the difference between the target value and the predicted value at any instance of the output node. Because the training data does not tell us which target value the hidden-layer nodes are supposed to reach, we can equate this expression with the backpropagated error at the hidden node, $e_j$.


$\sum_{j} w_{jk} \cdot o_j$ represents the sum of the combined moderated signals between the hidden-layer nodes and the output layer; we simply equate it with the sum of the combined moderated signals between the input-layer nodes and the hidden layer, $\sum_{i} w_{ij} \cdot o_i$.


$o_j$ is the output from any node of the hidden layer; we equate it with $o_i$, which is the input signal into the network.

So, based on this symmetry and the physical interpretation of the equation, we see that $\frac{\partial E}{\partial w_{ij}}$ is:

$$\frac{\partial E}{\partial w_{ij}} = -2\, e_j \cdot \operatorname{sigmoid}\left(\sum_{i} w_{ij} \cdot o_i\right)\left(1 - \operatorname{sigmoid}\left(\sum_{i} w_{ij} \cdot o_i\right)\right) \cdot o_i$$



Now let us write the two equations side by side:

$$\frac{\partial E}{\partial w_{jk}} = -2\left(t_k - o_k\right) \cdot \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)\left(1 - \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)\right) \cdot o_j$$

$$\frac{\partial E}{\partial w_{ij}} = -2\, e_j \cdot \operatorname{sigmoid}\left(\sum_{i} w_{ij} \cdot o_i\right)\left(1 - \operatorname{sigmoid}\left(\sum_{i} w_{ij} \cdot o_i\right)\right) \cdot o_i$$


Because we are interested in the direction of the gradient, we can drop the $2$ completely: it is just a constant that scales the magnitude of the gradient, and we care about its direction rather than its magnitude.

So this is the final result:

$$\frac{\partial E}{\partial w_{jk}} = -\left(t_k - o_k\right) \cdot \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)\left(1 - \operatorname{sigmoid}\left(\sum_{j} w_{jk} \cdot o_j\right)\right) \cdot o_j$$

$$\frac{\partial E}{\partial w_{ij}} = -\, e_j \cdot \operatorname{sigmoid}\left(\sum_{i} w_{ij} \cdot o_i\right)\left(1 - \operatorname{sigmoid}\left(\sum_{i} w_{ij} \cdot o_i\right)\right) \cdot o_i$$


Using these two equations we can compute the gradient of the error function with respect to any weight in the network; this is the key step in the gradient descent calculation.
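
To see both formulas in action, here is a minimal NumPy sketch for a hypothetical 2-2-2 network (all numbers are made up; I recombine the output errors into $e_j$ using the transposed hidden-to-output weights, one common way of sharing the error back across the connections, consistent with the earlier backpropagation parts):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 2-2-2 network: inputs, targets, and arbitrarily chosen weight matrices
o_i  = np.array([0.9, 0.1])                 # input signals
t_k  = np.array([0.8, 0.4])                 # target values at the output nodes
W_ij = np.array([[0.9, 0.3],                # weights input -> hidden  (rows: hidden node j)
                 [0.2, 0.8]])
W_jk = np.array([[0.4, 0.5],                # weights hidden -> output (rows: output node k)
                 [0.6, 0.1]])

# Forward pass
o_j = sigmoid(W_ij @ o_i)                   # hidden outputs
o_k = sigmoid(W_jk @ o_j)                   # network outputs

# Backward pass using the two derived gradient formulas (factor of 2 already dropped)
e_k = t_k - o_k                             # output errors (t_k - o_k)
e_j = W_jk.T @ e_k                          # backpropagated errors at the hidden nodes
grad_W_jk = -np.outer(e_k * o_k * (1 - o_k), o_j)   # dE/dw_jk for every hidden->output weight
grad_W_ij = -np.outer(e_j * o_j * (1 - o_j), o_i)   # dE/dw_ij for every input->hidden weight

print(grad_W_jk)
print(grad_W_ij)
```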

Let us take this time to summarize everything we have learned so far:

The symmetry in neural networks refers to the idea that the method used to calculate the error gradient and update weights in one layer can be similarly applied to other layers due to the network's consistent structure. For example, the process for adjusting weights between the hidden and output layers involves calculating the error (difference between target and actual outputs), the signals entering the output layer, and the outputs from the hidden layer. This same approach can be applied to the input-to-hidden layer by considering the backpropagated error from the hidden layer, the signals entering the hidden layer, and the outputs from the input layer. This symmetry simplifies the training process by allowing the same principles to be reused across different parts of the network.

Next: How to update weights using Gradient Descent
 


Part 15 - How to update weights using Gradient Descent

In the previous lecture, we saw how to compute the gradient of the error function with respect to any weight.
In other words, we know where, in this higher-dimensional space of error, the steepest slope is.
Gradient descent asks us to take a step in the opposite direction, i.e. to change the value of the weight in the direction opposite to the gradient.

If the gradient points in the positive direction, we move the weight in the negative direction; if the gradient points in the negative direction, we move the weight in the positive direction.

Mathematically, what we are saying is that the new value of a weight (the weight update) is obtained as follows:

$$w_{jk}^{\,new} = w_{jk}^{\,old} - \alpha \cdot \frac{\partial E}{\partial w_{jk}}$$

for any weight between the hidden layer and the output layer, and

$$w_{ij}^{\,new} = w_{ij}^{\,old} - \alpha \cdot \frac{\partial E}{\partial w_{ij}}$$

for any weight between the input layer and the hidden layer.

$\alpha$ is the learning rate; in the earlier lectures we used $L$. It is the same thing, only the notation has changed.
We saw the benefit of the learning rate in those lectures: it lets us take a step in the direction gradient descent points us, but only partially, so that we do not over-commit to that one direction.

We then repeat this process until we approach a local minimum of the error function, in other words until we have reduced the error (the difference between the target value and the predicted value) as far as we can.

These are the steps of the gradient descent algorithm:

1. We start by initializing the weights randomly.

2. For each weight, we compute the partial derivative of the loss (error) function with respect to that weight, i.e. the gradient.

3. We then update the value of the weight by taking a step in the opposite direction of the gradient.

4. We repeat steps 2 and 3 until we are satisfied with the last updated values of the weights.

Here is pseudocode for the Gradient Descent algorithm:

[Pseudocode: the gradient descent update loop]


In simple English:

new weight = old weight - (learning rate x gradient of the error with respect to that weight)
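
To tie the pieces together, here is a minimal runnable sketch of that loop (my own illustration, not the pseudocode from the image): it repeatedly computes the gradient for a single weight feeding one sigmoid output node and steps the weight against the gradient.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical single connection: one hidden output o_j, one weight w_jk, one target t_k
o_j, t_k = 0.6, 0.9
w_jk  = 0.1            # step 1: start from an initial weight (fixed here, normally random)
alpha = 0.5            # learning rate

for step in range(200):
    o_k  = sigmoid(w_jk * o_j)                        # forward pass
    grad = -(t_k - o_k) * o_k * (1 - o_k) * o_j       # step 2: gradient dE/dw_jk (factor 2 dropped)
    w_jk = w_jk - alpha * grad                        # step 3: move against the gradient
    # step 4: repeat until the error is small enough

print(w_jk, sigmoid(w_jk * o_j))  # the output has moved towards the target 0.9
```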


To be continued...
 