Data science, Machine Learning, or any Mathematical Optimization related technical interviews encounter the most common question on one of the properties of the method of Steepest Descent. It is also called Gradient Descent method. The successive directions of the steepest descent are normal to one another. They ask for proof or some example to explain this property. In this article, I start with the proof and give a simple example to describe it –

Let x_0 is the starting point of the iterative Steepest descent method to solve f(x) whose
extremum is x^*.
We update the iterative point as follows:
Next successive point x_1 = x_0 - t \cdot \nabla{f(x_0)}.
The new point is a function of t, the step size.

If x_1 = x_0 - t \cdot \nabla f(x_0)} is the function of t that decides the new point, the useful value of t can be calculated by setting \nabla{f(x_1)} = 0.
Note, derivative is w.r.t to t,

    \[\nabla{f(x_1)} = 0.\]

    \[\mbox{or~ } \nabla{f(x_1)} = \frac{{\delta{f(x_1)}}}{\delta{x_1}} \frac{\delta{x_1}}{\delta{t}} = 0.\]

    \[\mbox{or~ } \nabla{f(x1)} = \frac{{\delta{f(x1)}}}{\delta{x1}} \frac{{\delta{(x0 - t \cdot \nabla{f(x0)})}}}{\delta{t}} = 0, \mbox{~here $x_1 = x_0 - t \cdot \nabla{f(x_0)}$}.\]

If we further simplify

    \[\frac{{\delta{f(x_1)}}}{\delta{x_1}}. ( 0 - 1 \cdot \nabla{f(x_0)}) = 0\]

Now we have,

    \[\frac{{\delta{f(x_1)}}}{\delta{x_1}} \cdot \nabla{f(x_0)} = 0.\]


    \[\mbox{Or~~} \frac{{\delta{f(x_1)}}}{\delta{x_1}} \cdot \frac{{\delta{f(x_0)}}}{\delta{x_0}} = 0.\]

From above it looks clear that the product of successive gradients is = 0. Hence from the basic definition of gradient, successive gradients are orthogonal to each other.
Note: \frac{{\delta{f(x_{i})}}}{\delta{x_{i}}} means derivative of f at point x = x_i.

Steepest Descent
Steepest Descent, Source: https://trond.hjorteland.com/

Understand with one simple example

From example, f(x) = x_{1}^2 + 3 x_{2}^2 is a function of two variables.
Starting point x_0 = (6,3) is given.
Now, gradient of f(x) at x = x_0 is

(1)   \begin{equation*}\nabla {f(x_0)} = 2 x_1 (\hat{i}) + 6 x_2 (\hat{j}).\end{equation*}


According to Steepest Descent rule, new update point x_1 (t) = x_0 + t time the negative of the gradient of f(x) at x = x_0.
i.e., x_1 =(x_1 (\hat{i}) + x_2 (\hat{j}) ) - t ( 2 x_1 (\hat{i}) + 6 x_2 (\hat{j}))
So, x_1 = (1- 2 t) x_1 (\hat{i}) + (1 - 6 t) x_2 (\hat{j}))
Note, new point x_1 is a function of t, we call it the step size.
Function value at new point will also be a function of t,
f(x_1) = (1- 2 t)^2 x_1^2 + 3 ((1 - 6 t))^2 x_{2}^2.
Now, gradient of f(x) at x = x_1~ w.r.t.~ t is

(2)   \begin{equation*}\nabla {f(x_1)} = \nabla g(t) = 2 (1- 2 t) (-2) x_1^2 + 6 ((1 - 6 t)) (-6) x_{2}^2.\end{equation*}


We set \nabla g(t) = 0, it gives us t = 0.2.
Now new point we have is
x_1 = (1- 2 t) x_1 (\hat{i}) + (1 - 6 t) x_2 (\hat{j})) = (3.484,- 0.744).

From 1 and 2 we have,
x_0 = (6,3) and x_1 = (3.484,- 0.744)
The dot product of two vectors:

    \[2.6 (\hat{i}) + 6.3 (\hat{j}). 3.484 (\hat{i}) + -0.774 (\hat{j}) =0.\]

Also read Column Generation Method for Cutting Stock Problem