Thursday 20 November 2014

Linear Regression - Gradient Descent - Cost Function - Octave

Overview:
1. Intuition for Hypothesis.
2. Cost Function.
3. Gradient Descent (minimum squared error).
4. Algorithm Implementation.


Let's Begin:
A small part of a larger data set:


   x1 (sq feet)        y (price)
   2104                399900
   1600                329900
   2400                369000
   1416                232000
   3000                539900
   1985                299900
   ..                  ...
What do you want?
Just before we start, have a look at the data above and the graph below. The data set shown above is a subset of a larger data set. In the graph below, the x-axis is the first column of the data set (sq feet) and the y-axis is the second column, i.e. the output (price).
 
Fig-1
Assume that the black line indicates the best-fit line for our whole data set, i.e. the line with minimum squared error. Don't worry, we will see what minimum squared error means shortly.

What good is the straight line? Suppose tomorrow you are asked what the price of a 3500 sq feet house would be. By looking at the straight line you can make a fair estimate that the price would be somewhere around 600000 (indicated by the red dot in the above plot).
Now the question arises: how do we find this straight line? The answer lies in the COST FUNCTION.


Intuition:
What, according to you, is a good hypothesis? For me, a good hypothesis is one in which the difference between my output (h(x)) and the original output (y) is minimum.
For example, consider the above data set and assume we have 100 rows. For the first instance (row or record) we need to find parameters (theta) which, when multiplied with the row's column values, give a hypothesis as close as possible to the original output.

The equation of a straight line says:
       Y=mx + C 

To make this clearer, let's take the first record of the above data:
              x1                        Y 
                2104                 399900

Looking at the above data (a row vector), the straight line equation asks us to find the value of "m", the slope, which when multiplied by "x" gives a value equal to "Y". The term "C" is the intercept term (the Y-intercept), i.e. the point where the straight line crosses the Y-axis. For a straight line passing through the origin, C is 0.

Using the intuition of the straight line, we add an intercept column x0 to our original data set, with the value 1 in every row.
x0                  x1                        Y 
    1                  2104                 399900

To get a deep and clear understanding of the importance of the intercept term, run the complete algorithm both with and without the intercept term and compare the graphs. You will see a tremendous difference.

Our approach should be to find a theta_0 for x0 and a theta_1 for x1 so that

       h(x) = theta_0 * x0 + theta_1 * x1

is as close as possible to Y.
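As a tiny Octave illustration with made-up theta values (the real values are computed later by gradient descent):

% Hypothetical parameter values, purely for illustration
theta_0 = 70000;                   % made-up intercept term
theta_1 = 150;                     % made-up price per sq foot

x0 = 1;                            % intercept column value
x1 = 2104;                         % sq feet of the first record

h = theta_0*x0 + theta_1*x1        % hypothesis; should come out reasonably close to Y = 399900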


Cost Function:

The cost function is the equation that helps us figure out the straight line that fits the data. In Fig-1, the black line is the result of minimizing the cost function.


For m training examples, the cost function is:

       J(θ) = (1/(2m)) * Σ ( hθ(x(i)) − y(i) )² ,   summed over i = 1 to m

“The cost function says: for the parameter value theta, make the difference between the hypothesis and the original output as small as possible.”

You must be wondering: why do we use the above formula? What is it? What is the intuition behind it? Let's nail it down with a small task.

Question: Suppose you have an equation as shown:

(a-1)² + (a-2)² + (a-3)²
and you are asked to compute the value of a for which the expression is minimum. Without a second thought you can say: hey, it's 2! But did you stop to think about why it is 2? There lies a pattern.
2 is the average of 1, 2 and 3. Therefore, you can always compute the average to get to the minimum. To find the minimum of the above expression we write:

       ( (a-1)² + (a-2)² + (a-3)² ) / 3
Now, stare at the expression above: you will notice that it too is doing an average. Indeed, setting the derivative 2[(a−1)+(a−2)+(a−3)] to zero gives a = (1+2+3)/3 = 2, the average. Therefore the intuition behind minimum squared error lies in averaging.
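You can verify this numerically with a quick Octave sketch:

a = 0:0.01:4;                            % candidate values of a
f = (a-1).^2 + (a-2).^2 + (a-3).^2;      % the expression to minimize
[f_min, idx] = min(f);
a(idx)                                    % prints 2, the average of 1, 2 and 3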



hθ(x) in the above equation is the hypothesis, or the estimation:

        hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn

x0 is the intercept column with all the values as 1
x1 is our first column (sq feet)
x2 is the second column (number of rooms)
θ0,θ1,θ2 are the computed parameters.

  

Understanding the Cost function even better:

(hθ(x) − y)² is the squared error. This term measures the difference between our hypothesis and the original output for the given parameter theta; minimizing the cost function minimizes this difference.


To understand the cost function better, let's do a small exercise.
Take the first column x1 (sq feet) and the original output y (price), compute the cost function over a range of theta values, and see for which value of theta the cost j_theta is minimum.
Range of theta: 1.0300e+005 to 1.1099e+005

Compute the cost function j_theta for each theta value in the above range. I get the curve below (which is convex); each point on the curve is the output of the cost function for one theta value.
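Below is a minimal Octave sketch of this sweep. It assumes x1 and y are already loaded as column vectors (with x1 feature scaled, as in the plots) and that theta_0 is held fixed at 3.4041e+005, the value used later in this post:

% Sweep theta_1 over the given range and record the cost at each value
theta_0     = 3.4041e+005;                          % assumed fixed for this exercise
theta_range = linspace(1.0300e+005, 1.1099e+005, 200);
m           = length(y);
j_theta     = zeros(size(theta_range));

for k = 1:length(theta_range)
    h          = theta_0 + theta_range(k) * x1;     % hypothesis for every row
    j_theta(k) = (1/(2*m)) * sum((h - y).^2);       % squared-error cost
end

plot(theta_range, j_theta);                         % convex curve, one point per theta value
xlabel('theta\_1');
ylabel('J(theta)');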



Fig_2 



The red point indicates the theta value for which j_theta is minimum.

We see that theta at the minimum is   :   theta_1 = 1.0691e+005
and for simplicity let's assume       :   theta_0 = 3.4041e+005
Finally, we substitute the values of theta_0 and theta_1 into the hypothesis equation, i.e. hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn, and plot the straight line:
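Here is a small sketch of that step in Octave, assuming x1 is the feature-scaled sq-feet column and y the price column:

theta_0 = 3.4041e+005;
theta_1 = 1.0691e+005;

h = theta_0 + theta_1 * x1;        % hypothesis for every training example

plot(x1, y, 'rx');                 % the raw data points
hold on;
plot(x1, h, 'k-');                 % the fitted straight line (as in Fig_3)
xlabel('sq feet (feature scaled)');
ylabel('price');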


Fig_3

Isn't this similar to what we drew at the start? That confirms that the theta values are indeed correct.
Note: If you are wondering why the range of the x-axis is (-2, 4), don't stress about it. It is due to another technique known as Feature Scaling, which we shall see later.


Calculating Theta: 

The theta values can be calculated by many methods, such as Gradient Descent, Newton's method, Stochastic Gradient Descent, etc. Let us, for now, concentrate on Gradient Descent.


The Gradient Descent formula says (repeat until convergence, updating every θj simultaneously):

       θj := θj − α * (1/m) * Σ ( hθ(x(i)) − y(i) ) * xj(i) ,   summed over i = 1 to m

Intuition: 

Consider the below graph: 

Fig_4
Suppose you start from the green point at the top; to reach the red point you need to take tiny steps downhill. Each tiny step follows the slope at that point, and we know that the derivative of a function at a point is the slope of the function's graph at that point.

Hence we take the derivative of our cost function J(θ) with respect to each parameter θj.

Calculating the derivative, we get

       ∂J(θ)/∂θj = (1/m) * Σ ( hθ(x(i)) − y(i) ) * xj(i) ,   summed over i = 1 to m

which is exactly the right-hand side term of our Gradient Descent formula.
The size of each step is controlled by the learning rate alpha. The value of alpha shouldn't be too small or too large; a reasonable range of values to try is (0.03, 0.06, 0.09, ..., 1).

Note: The number of theta values equals the number of feature columns plus one (for theta_0), and all theta values must be updated simultaneously.
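For reference, the simultaneous update can also be written compactly with matrix operations. This is just a sketch of the same rule that the loop-based script at the end implements, assuming x already contains the intercept column (an m x (n+1) matrix) and y is the m x 1 output vector:

m        = rows(x);
alpha    = 0.07;                      % learning rate (an assumed value)
max_iter = 100;
theta    = zeros(columns(x), 1);      % one theta per column, including theta_0

for iter = 1:max_iter
    h     = x * theta;                % hypothesis for every row at once
    grad  = (1/m) * (x' * (h - y));   % all partial derivatives together
    theta = theta - alpha * grad;     % simultaneous update of every theta_j
end

The loop-based script at the end of this post does exactly this, one theta and one row at a time.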


The Complete Algorithm:


1. Load the data set.
2. Perform feature scaling to the data set.

     for example                          
  x0                x1             x2               Y 
  1                2104           3            399900 
  1                2105           4            400000
  1                2200           3            410000

The values in x1 and x2 differ hugely in scale, which can be a problem for the algorithm, so we feature-scale the columns x1 and x2 with the formula

       x := (x − μ) / σ

where μ is the mean and σ is the standard deviation of that column.

After feature scaling, the above data becomes:
       x0               x1              x2                 y 
   1.00000     -0.58639     -0.57735       399900
   1.00000     -0.56826      1.15470       400000
   1.00000      1.15465     -0.57735       410000

3. Run the algorithm on the feature-scaled data for, say, 50 to 100 iterations. One iteration is one tiny step that brings us closer to the minimum; in Fig_4 the small violet line between two green points is just one iteration.
 Note: Depending on the data you may need to choose a larger number of iterations; it can go as high as 10000 or even beyond.

4. Record the cost function (j_theta) at each iteration and plot j_theta against the number of iterations. The j_theta value will decrease after each iteration and finally settle at its minimum, showing a flat line (beyond that point the graph won't decrease further).
  The following graph shows an example:


 

Note: If you don't get a similar graph but instead see the cost oscillating (increasing and decreasing), try varying the alpha value, most likely by decreasing it.

5. Once you have your graph, take the final theta values and multiply them with your feature-scaled data set,
         hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn
and plot your straight line.

6. To make a prediction on new incoming data, don't just multiply the data by the obtained theta values directly. First feature-scale the new data (using the mean and standard deviation computed from the training set) and then multiply it by the obtained theta values, summing the terms to get your predicted outcome.
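Here is a sketch of that prediction step in Octave, assuming mn, sd and theta are the mean, standard deviation and parameter vectors produced by the script below, and using a made-up new record of 1650 sq feet and 3 rooms:

new_x = [1650, 3];                           % made-up example: sq feet and number of rooms
new_x = (new_x - mn(2:end)) ./ sd(2:end);    % scale with the *training* mean and std
new_x = [1, new_x];                          % put the intercept term back
predicted_price = new_x * theta              % h(x) = theta_0*x0 + theta_1*x1 + ...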

   
For those who are familiar with OCTAVE, the script below can help you understand the whole procedure properly:

I have used loops and avoided vectorization as far as possible, because people unfamiliar with vectorization might find it difficult to follow:

Enjoy :)

clear all
clc

% 1. Load the data set
x=load('feature_dataset_xaxis.ext');
y=load('original_output_yaxis.ext');

% 2. Get the size: m is the number of instances (rows) and n the number of attributes (features)
[m,n]=size(x);

% 3. Add the intercept column (x0 = 1 for every row)
x=[ones(m,1),x];

% 4. Perform feature scaling on the feature columns (column 1 is the intercept, so it is skipped)

mn = mean(x);
sd = std(x);
for mnsd =2:n+1
    x(:,mnsd) = (x(:,mnsd) - mn(mnsd))./ sd(mnsd);
end


% 5. Choose alpha and the number of iterations

max_iter=200;                      % number of gradient descent iterations
alpha=1;                           % learning rate
theta = zeros(size(x(1,:)))';      % theta is an (n+1)x1 column vector, so x (m x (n+1)) times theta is a valid product
theta_u = zeros(size(x(1,:)))';    % theta_u holds the updated values, for simultaneous update
j_theta=zeros(max_iter,1);         % j_theta stores the cost J(theta) at every iteration

% 6. The gradient descent algorithm

   for num_iter=1:max_iter
       j_cost_each=0;
       theta=theta_u;                  % theta_u for simultaneous update
      

      % Calculating The cost function 
        for i=1:m
            h=0;
               for j=1:n+1
                   h=h+(theta(j)*x(i,j)); 
               end
               j_cost_each=j_cost_each+ ((h-y(i)) * (h-y(i))); 
         end
         j_theta(num_iter)=(1/(2*m)) * j_cost_each;         

   

        %  Calculating Gradient Descent   

         for j = 1:n+1    %  n+1 because n is the size before adding the bias column     
             grad(j) = 0;
             for i=1:m
                 grad(j) += ((x(i,:)*theta)-y(i))*x(i,j);      % x(i,:)*theta is the hypothesis for row i;
                                                               % its error is multiplied by the j-th feature value x(i,j)
             end
             grad(j)=grad(j)/m;
             theta_u(j)=theta(j)- alpha * grad(j);
         end  




         % A small question may arise: why two loops, the outer one over theta (j) and the inner one over the rows (i)?
         % The reason is that, just as in the matrix (vectorized) implementation of linear regression,
         % the gradient for theta_0 needs every row's error multiplied by the x0 column and summed,
         % the gradient for theta_1 needs every row's error multiplied by the x1 column and summed,
         % and so on; finally each theta value is updated from its own gradient.
    end

theta = theta_u;                   % keep the final simultaneous update in theta

figure
plot(1:max_iter, j_theta, 'b', 'LineWidth', 2)     % cost J(theta) at every iteration
xlabel('Number of iterations')
ylabel('Cost J')

