Thursday 20 November 2014

Linear Regression - Gradient Descent - Cost Function - Octave

Overview:
1. Intuition for Hypothesis.
2. Cost Function.
3. Gradient Descent (minimum squared error).
4. Algorithm Implementation.


Let's Begin:
A small part of a larger data set:


   x1 (sq feet)        y (price)
   2104                399900
   1600                329900
   2400                369000
   1416                232000
   3000                539900
   1985                299900
   ..                  ...
What do you want?
Just before we start, have a look at the data above and the graph below. The data set shown above is a subset of a larger data set. In the graph below, the x-axis is the first column of the data set (sq feet) and the y-axis is the second column, i.e. the output (price).
 
Fig-1
Assume that the black line indicates the best-fit line for our whole data set, i.e. the line with minimum squared error. Don't worry, we will see what minimum squared error means shortly.

What good is the straight line? Suppose tomorrow you are asked what the price of a 3500 sq feet house would be. By looking at the straight line you can make a fair estimate that the price would be somewhere around 600000 (indicated by the red dot in the above plot).
Now the question arises: how do we find this straight line? The answer lies in the COST FUNCTION.


Intuition:
What, according to you, is a good hypothesis? For me, a good hypothesis is one in which the difference between my output (h(x)) and the original output (y) is minimum.
For example, consider the above data set and assume we have 100 rows. For the first instance (row or record) we need to find parameters (theta) which, when multiplied with the row's column values, give a hypothesis as close as possible to the original output.

The equation of a straight line says:
       Y=mx + C 

To make this clearer, let's take the first record of the above data:
              x1                        Y 
                2104                 399900

Looking at the above data (a row vector), the straight line equation asks us to find the value of "m", the slope, which when multiplied by "x" gives a value equal to "Y". The term "C" is the intercept term (the Y-intercept), i.e. the point where the straight line crosses the Y-axis. For a straight line passing through the origin, C is 0.

Using the intuition of the straight line, we add an intercept column x0 to our original data set, with the value 1 in every row.
x0                  x1                        Y 
    1                  2104                 399900

To get a deep and clear understanding of the importance of the intercept term, run the complete algorithm both with and without the intercept term and compare the graphs. You will see a tremendous difference.

Our approach should be to find a theta_0 for x0 and a theta_1 for x1 so that

       h(x) = theta_0 * x0 + theta_1 * x1

is as close as possible to Y.
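As a tiny Octave illustration with made-up theta values (the real values are computed later by gradient descent):

% Hypothetical parameter values, purely for illustration
theta_0 = 70000;                   % made-up intercept term
theta_1 = 150;                     % made-up price per sq foot

x0 = 1;                            % intercept column value
x1 = 2104;                         % sq feet of the first record

h = theta_0*x0 + theta_1*x1        % hypothesis; should come out reasonably close to Y = 399900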


Cost Function:

The cost function is the equation that helps us figure out the straight line that fits the data. In Fig-1, the black line is the result of minimizing the cost function.


For m training examples, the cost function is:

       J(θ) = (1/(2m)) * Σ ( hθ(x(i)) − y(i) )² ,   summed over i = 1 to m

“The cost function says: for the parameter value theta, make the difference between the hypothesis and the original output as small as possible.”

You must be wondering: why do we use the above formula? What is it? What is the intuition behind it? Let's nail it down with a small task.

Question: Suppose you have an equation as shown:

(a-1)² + (a-2)² + (a-3)²
and you are asked to compute the value of a for which the expression is minimum. Without a second thought you can say: hey, it's 2! But did you stop to think about why it is 2? There lies a pattern.
2 is the average of 1, 2 and 3. Therefore, you can always compute the average to get to the minimum. To find the minimum of the above expression we write:

       ( (a-1)² + (a-2)² + (a-3)² ) / 3
Now, stare at the expression above: you will notice that it too is doing an average. Indeed, setting the derivative 2[(a−1)+(a−2)+(a−3)] to zero gives a = (1+2+3)/3 = 2, the average. Therefore the intuition behind minimum squared error lies in averaging.
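You can verify this numerically with a quick Octave sketch:

a = 0:0.01:4;                            % candidate values of a
f = (a-1).^2 + (a-2).^2 + (a-3).^2;      % the expression to minimize
[f_min, idx] = min(f);
a(idx)                                    % prints 2, the average of 1, 2 and 3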



hθ(x) in the above equation is the hypothesis, or the estimation:

        hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn

x0 is the intercept column with all the values as 1
x1 is our first column (sq feet)
x2 is the second column (number of rooms)
θ0,θ1,θ2 are the computed parameters.

  

Understanding the Cost function even better:

(hθ(x) − y)² is the squared error. This term measures the difference between our hypothesis and the original output for the given parameter theta; minimizing the cost function minimizes this difference.


To understand the cost function better, let's do a small exercise.
Take the first column x1 (sq feet) and the original output y (price), compute the cost function over a range of theta values, and see for which value of theta the cost j_theta is minimum.
Range of theta: 1.0300e+005 to 1.1099e+005

Compute the cost function j_theta for each theta value in the above range. I get the curve below (which is convex); each point on the curve is the output of the cost function for one theta value.
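Below is a minimal Octave sketch of this sweep. It assumes x1 and y are already loaded as column vectors (with x1 feature scaled, as in the plots) and that theta_0 is held fixed at 3.4041e+005, the value used later in this post:

% Sweep theta_1 over the given range and record the cost at each value
theta_0     = 3.4041e+005;                          % assumed fixed for this exercise
theta_range = linspace(1.0300e+005, 1.1099e+005, 200);
m           = length(y);
j_theta     = zeros(size(theta_range));

for k = 1:length(theta_range)
    h          = theta_0 + theta_range(k) * x1;     % hypothesis for every row
    j_theta(k) = (1/(2*m)) * sum((h - y).^2);       % squared-error cost
end

plot(theta_range, j_theta);                         % convex curve, one point per theta value
xlabel('theta\_1');
ylabel('J(theta)');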



Fig_2 



The red point indicates the theta value for which j_theta is minimum.

We see that theta at the minimum is   :   theta_1 = 1.0691e+005
and for simplicity let's assume       :   theta_0 = 3.4041e+005
Finally, we substitute the values of theta_0 and theta_1 into the hypothesis equation, i.e. hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn, and plot the straight line:
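Here is a small sketch of that step in Octave, assuming x1 is the feature-scaled sq-feet column and y the price column:

theta_0 = 3.4041e+005;
theta_1 = 1.0691e+005;

h = theta_0 + theta_1 * x1;        % hypothesis for every training example

plot(x1, y, 'rx');                 % the raw data points
hold on;
plot(x1, h, 'k-');                 % the fitted straight line (as in Fig_3)
xlabel('sq feet (feature scaled)');
ylabel('price');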


Fig_3

Isn't this similar to what we drew at the start? That confirms that the theta values are indeed correct.
Note: If you are wondering why the range of the x-axis is (-2, 4), don't stress about it. It is due to another technique known as Feature Scaling, which we shall see later.


Calculating Theta: 

The theta values can be calculated by many methods, such as Gradient Descent, Newton's method, Stochastic Gradient Descent, etc. Let us, for now, concentrate on Gradient Descent.


The Gradient Descent formula says (repeat until convergence, updating every θj simultaneously):

       θj := θj − α * (1/m) * Σ ( hθ(x(i)) − y(i) ) * xj(i) ,   summed over i = 1 to m

Intuition: 

Consider the below graph: 

Fig_4
Suppose you start from the green point at the top; to reach the red point you need to take tiny steps downhill. Each tiny step follows the slope at that point, and we know that the derivative of a function at a point is the slope of the function's graph at that point.

Hence we take the derivative of our cost function J(θ) with respect to each parameter θj.

Calculating the derivative, we get

       ∂J(θ)/∂θj = (1/m) * Σ ( hθ(x(i)) − y(i) ) * xj(i) ,   summed over i = 1 to m

which is exactly the right-hand side term of our Gradient Descent formula.
The size of each step is controlled by the learning rate alpha. The value of alpha shouldn't be too small or too large; a reasonable range of values to try is (0.03, 0.06, 0.09, ..., 1).

Note: The number of theta values equals the number of feature columns plus one (for theta_0), and all theta values must be updated simultaneously.
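For reference, the simultaneous update can also be written compactly with matrix operations. This is just a sketch of the same rule that the loop-based script at the end implements, assuming x already contains the intercept column (an m x (n+1) matrix) and y is the m x 1 output vector:

m        = rows(x);
alpha    = 0.07;                      % learning rate (an assumed value)
max_iter = 100;
theta    = zeros(columns(x), 1);      % one theta per column, including theta_0

for iter = 1:max_iter
    h     = x * theta;                % hypothesis for every row at once
    grad  = (1/m) * (x' * (h - y));   % all partial derivatives together
    theta = theta - alpha * grad;     % simultaneous update of every theta_j
end

The loop-based script at the end of this post does exactly this, one theta and one row at a time.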


The Complete Algorithm:


1. Load the data set.
2. Perform feature scaling to the data set.

     for example                          
  x0                x1             x2               Y 
  1                2104           3            399900 
  1                2105           4            400000
  1                2200           3            410000

The values in x1 and x2 differ hugely in scale, which can be a problem for the algorithm, so we feature-scale the columns x1 and x2 with the formula

       x := (x − μ) / σ

where μ is the mean and σ is the standard deviation of that column.

After feature scaling, the above data becomes:
       x0               x1              x2                 y 
   1.00000     -0.58639     -0.57735       399900
   1.00000     -0.56826      1.15470       400000
   1.00000      1.15465     -0.57735       410000

3. Run the algorithm on the feature-scaled data for, say, 50 to 100 iterations. One iteration is one tiny step that brings us closer to the minimum; in Fig_4 the small violet line between two green points is just one iteration.
 Note: Depending on the data you may need to choose a larger number of iterations; it can go as high as 10000 or even beyond.

4. Record the cost function (j_theta) at each iteration and plot j_theta against the number of iterations. The j_theta value will decrease after each iteration and finally settle at its minimum, showing a flat line (beyond that point the graph won't decrease further).
  The following graph shows an example:


 

Note: If you don't get a similar graph but instead see the cost oscillating (increasing and decreasing), try varying the alpha value, most likely by decreasing it.

5. Once you have your graph, take the final theta values and multiply them with your feature-scaled data set,
         hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn
and plot your straight line.

6. To make a prediction on new incoming data, don't just multiply the data by the obtained theta values directly. First feature-scale the new data (using the mean and standard deviation computed from the training set) and then multiply it by the obtained theta values, summing the terms to get your predicted outcome.
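Here is a sketch of that prediction step in Octave, assuming mn, sd and theta are the mean, standard deviation and parameter vectors produced by the script below, and using a made-up new record of 1650 sq feet and 3 rooms:

new_x = [1650, 3];                           % made-up example: sq feet and number of rooms
new_x = (new_x - mn(2:end)) ./ sd(2:end);    % scale with the *training* mean and std
new_x = [1, new_x];                          % put the intercept term back
predicted_price = new_x * theta              % h(x) = theta_0*x0 + theta_1*x1 + ...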

   
For those who are familiar with OCTAVE, the script below can help you understand the whole procedure properly:

I have used loops and avoided vectorization as far as possible, because people unfamiliar with vectorization might find it difficult to follow:

Enjoy :)

clear all
clc

% 1. Load the data set
x=load('feature_dataset_xaxis.ext');
y=load('original_output_yaxis.ext');

% 2. Get the size: m is the number of instances (rows) and n the number of attributes (features)
[m,n]=size(x);

% 3. Add the intercept column (x0 = 1 for every row)
x=[ones(m,1),x];

% 4. Perform feature scaling on the feature columns (column 1 is the intercept, so it is skipped)

mn = mean(x);
sd = std(x);
for mnsd =2:n+1
    x(:,mnsd) = (x(:,mnsd) - mn(mnsd))./ sd(mnsd);
end


% 5. Choose alpha and the number of iterations

max_iter=200;                      % number of gradient descent iterations
alpha=1;                           % learning rate
theta = zeros(size(x(1,:)))';      % theta is an (n+1)x1 column vector, so x (m x (n+1)) times theta is a valid product
theta_u = zeros(size(x(1,:)))';    % theta_u holds the updated values, for simultaneous update
j_theta=zeros(max_iter,1);         % j_theta stores the cost J(theta) at every iteration

% 6. The gradient descent algorithm

   for num_iter=1:max_iter
       j_cost_each=0;
       theta=theta_u;                  % theta_u for simultaneous update
      

      % Calculating The cost function 
        for i=1:m
            h=0;
               for j=1:n+1
                   h=h+(theta(j)*x(i,j)); 
               end
               j_cost_each=j_cost_each+ ((h-y(i)) * (h-y(i))); 
         end
         j_theta(num_iter)=(1/(2*m)) * j_cost_each;         

   

        %  Calculating Gradient Descent   

         for j = 1:n+1    %  n+1 because n is the size before adding the bias column     
             grad(j) = 0;
             for i=1:m
                 grad(j) += ((x(i,:)*theta)-y(i))*x(i,j);      % x(i,:)*theta is the hypothesis for row i;
                                                               % its error is multiplied by the j-th feature value x(i,j)
             end
             grad(j)=grad(j)/m;
             theta_u(j)=theta(j)- alpha * grad(j);
         end  




         % A small question may arise: why two loops, the outer one over theta (j) and the inner one over the rows (i)?
         % The reason is that, just as in the matrix (vectorized) implementation of linear regression,
         % the gradient for theta_0 needs every row's error multiplied by the x0 column and summed,
         % the gradient for theta_1 needs every row's error multiplied by the x1 column and summed,
         % and so on; finally each theta value is updated from its own gradient.
    end

theta = theta_u;                   % keep the final simultaneous update in theta

figure
plot(1:max_iter, j_theta, 'b', 'LineWidth', 2)     % cost J(theta) at every iteration
xlabel('Number of iterations')
ylabel('Cost J')

