# Import packages
%load_ext autoreload
%autoreload 2
from logistic import LogisticRegression, GradientDescentOptimizer, NewtonOptimizer
import torch
from matplotlib import pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
Abstract
In this blog post, we expand upon our previous implementation of logistic regression to add another optimization option. In our updated implementation of logistic regression, available at logistic.py, we can now optimize our logistic regression model with both gradient descent (with momentum) and Newton’s method. After implementing this extension, we demonstrate with an example that logistic regression with both optimizers converges to the same solution. Using the same example, we reveal that logistic regression with Newton’s method commonly converges to a solution in fewer training iterations than gradient descent. We also illustrate that selecting an appropriate value for \(\alpha\) is critical to Newton’s method, as the optimizer will not converge to a solution if \(\alpha\) is too large. Finally, we count the number of operations required in each iteration of model training for both gradient descent and Newton’s method, revealing that Newton’s method is more computationally expensive than gradient descent, especially when working with many features.
Generating Training Data
First, we import the packages we will need for this assignment; this is done in the setup cell at the top of this post.
Next, we generate the data that we will use to train our model. Thank you to Professor Chodrow for providing the functions for generating and visualizing training data.
# define function to generate data
def classification_data(n_points = 300, noise = 0.2, p_dims = 2):
    y = torch.arange(n_points) >= int(n_points/2)
    y = 1.0*y
    X = y[:, None] + torch.normal(0.0, noise, size = (n_points, p_dims))
    X = torch.cat((X, torch.ones((X.shape[0], 1))), 1)
    return X, y
# define function to plot data
def plot_data(X, y, ax):
    assert X.shape[1] == 3, "This function only works for data created with p_dims == 2"
    targets = [0, 1]
    markers = ["o", ","]
    for i in range(2):
        ix = y == targets[i]
        ax.scatter(X[ix, 0], X[ix, 1], s = 20, c = y[ix], facecolors = "none", edgecolors = "darkgrey", cmap = "BrBG", vmin = -0.5, vmax = 1.5, alpha = 0.5, marker = markers[i])
    ax.set(xlabel = r"$x_1$", ylabel = r"$x_2$")
# set seed
torch.manual_seed(1234)

# generate data
X, y = classification_data(n_points = 500, noise = 0.6)

# plot data
fig, ax = plt.subplots(figsize = (5, 5))
plot_data(X, y, ax)
ax.set_title("Training Data");
Convergence to an Appropriate Decision Boundary
To begin evaluating the efficacy of our model, we fit logistic regression on our training data with regular gradient descent and with Newton’s method, illustrating that both options converge to the same solution.
# set seed
torch.manual_seed(1234)

# initialize logistic regression model and gradient descent optimizer
LR_gradient = LogisticRegression()
opt_gradient = GradientDescentOptimizer(LR_gradient)

# initialize vector to record loss values
loss_vec_gradient = [10, 5]

# fit model
while loss_vec_gradient[-2] - loss_vec_gradient[-1] > 0.00001:

    # update model
    opt_gradient.step(X, y, alpha = 0.07, beta = 0)

    # calculate and record loss
    loss = LR_gradient.loss(X, y)
    loss_vec_gradient.append(loss)

# remove initialization values
loss_vec_gradient = loss_vec_gradient[2:]
# reset seed
torch.manual_seed(1234)

# initialize logistic regression model and newton's optimizer
LR_newton = LogisticRegression()
opt_newton = NewtonOptimizer(LR_newton)

# initialize vector to record loss values
loss_vec_newton = [10, 5]

# fit model
while loss_vec_newton[-2] - loss_vec_newton[-1] > 0.00001:

    # update model
    opt_newton.step(X, y, alpha = 10)

    # calculate and record loss
    loss = LR_newton.loss(X, y)
    loss_vec_newton.append(loss)

# remove initialization values
loss_vec_newton = loss_vec_newton[2:]
Now that we have fit our models, let’s inspect the decision boundaries in the context of our training data.
# define function to draw line
def draw_line(w, x_min, x_max, ax, **kwargs):
    w_ = w.flatten()
    x = torch.linspace(x_min, x_max, 101)
    y = -(w_[0]*x + w_[2])/w_[1]
    l = ax.plot(x, y, **kwargs)

# plot decision boundaries
fig, ax = plt.subplots(figsize = (5, 5))
plot_data(X, y, ax)
draw_line(LR_gradient.w, x_min = -1.25, x_max = 2.25, ax = ax, color = "black", label = "Gradient Descent")
draw_line(LR_newton.w, x_min = -1.25, x_max = 2.25, ax = ax, color = "red", label = "Newton's Optimizer")
ax.legend(loc = "lower left")
ax.set_title("Training Data and Decision Boundaries");
The decision boundaries for the gradient descent optimizer and Newton’s optimizer are virtually identical! This result makes me confident that we have correctly implemented logistic regression with Newton’s optimizer.
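As a quick numerical check on this claim, we can also compare the learned weight vectors directly. Below is a small sketch that reuses the `w` attribute we passed to `draw_line` above; the tolerance of \(0.01\) is an arbitrary choice for illustration, not something our implementation prescribes.

# compare the fitted weight vectors from the two optimizers
print("Gradient descent weights:", LR_gradient.w)
print("Newton's method weights: ", LR_newton.w)
print("Weights agree to within 0.01:", torch.allclose(LR_gradient.w, LR_newton.w, atol = 0.01))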
The Speed of Newton’s Optimizer
In the previous section we saw that logistic regression with gradient descent and Newton’s method converge to the same solution. Naturally, one might wonder whether both options approach the solution at the same rate. As it turns out, when an appropriate value of \(\alpha\) is selected, logistic regression with Newton’s optimizer converges to a solution in far fewer iterations than logistic regression with classic gradient descent. To illustrate this statement, we plot the change in loss over the iterations of both models’ training.
# plot the changes in loss
plt.plot(loss_vec_gradient, color = "slategrey", lw = 3, label = "Gradient Descent")
plt.plot(loss_vec_newton, color = "#A37933", lw = 3, label = "Newton's Optimizer")
labs = plt.gca().set(xlabel = "Iteration", ylabel = "Loss")
plt.legend(loc = "upper right")
plt.title("Loss Function Throughout Model Training");
As we would expect, the two optimizers begin with identical loss values. However, from the very beginning, the loss of logistic regression with Newton's optimizer decreases more rapidly than that of logistic regression with gradient descent. In fact, Newton's approach reaches our stopping criterion, an incremental improvement to the loss of less than \(0.00001\), in just over \(200\) iterations, whereas the gradient descent approach requires over \(1,100\) iterations. For this data and this choice of \(\alpha\), Newton's optimizer requires dramatically fewer iterations than regular gradient descent.
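We can also read these iteration counts off directly rather than eyeballing the plot: each entry in the loss vectors we recorded corresponds to one training iteration, so their lengths give the iteration counts.

# count training iterations via the length of each recorded loss vector
print("Gradient descent iterations:", len(loss_vec_gradient))
print("Newton's method iterations: ", len(loss_vec_newton))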
When \(\alpha\) is Too Large
The learning rate \(\alpha\) impacts how far logistic regression with Newton’s method moves the decision boundary in any given iteration. When \(\alpha\) is very small, each change to the decision boundary becomes similarly small, such that it takes many iterations to converge to an adequate solution. In the example above, \(\alpha\) was large enough that logistic regression with Newton’s optimizer converged to a solution more rapidly than ordinary gradient descent. But what happens when \(\alpha\) is substantially larger? To answer this question, we experiment with fitting a model where \(\alpha = 10,000\) in the code chunk below.
# reset seed
torch.manual_seed(1234)

# initialize logistic regression model and newton's optimizer
LR_newton_10000 = LogisticRegression()
opt_newton_10000 = NewtonOptimizer(LR_newton_10000)

# initialize vector to record loss values
loss_vec_newton_10000 = [10, 5]

# initialize counter
counter = 0

# fit model
while loss_vec_newton_10000[-2] - loss_vec_newton_10000[-1] > 0.00001:

    # update model
    opt_newton_10000.step(X, y, alpha = 10000)

    # calculate and record loss
    loss = LR_newton_10000.loss(X, y)
    loss_vec_newton_10000.append(loss)

    # update counter
    counter += 1

# remove initialization values
loss_vec_newton_10000 = loss_vec_newton_10000[2:]

# print results
print("Total iterations:", counter, "\nResulting loss vector", loss_vec_newton_10000)
Total iterations: 1
Resulting loss vector [tensor(nan)]
Apparently, by increasing \(\alpha\) to \(10,000\), we prevented the logistic regression model with Newton's optimizer from converging to a solution. In fact, with \(\alpha\) set so high, the model produces a loss value of NaN, terminating our training loop after a single iteration. This likely occurred because the large value of \(\alpha\) moved the weights so far that the predicted probabilities saturated to exactly \(0\) or \(1\), producing infinite and undefined (NaN) values in the loss computation.
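To see one way that NaN values can arise, here is a minimal sketch, independent of our logistic.py implementation, of how a saturated sigmoid poisons the binary cross-entropy loss; the score of \(-1000\) and the label of \(0\) are made up purely for illustration.

# a score so extreme that the sigmoid underflows to exactly 0 in float32
s = torch.tensor(-1000.0)
sigma = torch.sigmoid(s)                     # tensor(0.)

# for a label of 0, the term y*log(sigma) evaluates to 0 * (-inf) = nan
y_label = torch.tensor(0.0)
loss_term = -(y_label*torch.log(sigma) + (1 - y_label)*torch.log(1 - sigma))
print(sigma, loss_term)                      # tensor(0.) tensor(nan)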
Operation Counting
Let \(C_{gd}\) represent the total computational cost of fitting a logistic regression model with gradient descent and \(C_{nm}\) represent the total computational cost of fitting a logistic regression model with Newton’s method.
Suppose that gradient descent converges to an adequate solution in \(t_{gd}\) iterations and Newton’s method converges to an adequate solution in \(t_{nm}\) iterations. Furthermore, per Professor Chodrow’s instructions, we assume that the following operations are associated with the following computational costs.
- Computing the loss: \(c\) computational units
- Computing the gradient: \(2c\) units
- Computing the Hessian: \(pc\) units
- Inverting a \(p \times p\) matrix: \(k_1p^\gamma\) units
- Performing the matrix-vector multiplication required by Newton’s method: \(k_2p^2\) units
In each iteration of gradient descent, we calculate the loss and the gradient once, leading to the following total computational cost.
\(C_{gd} = t_{gd} (c + 2c) = t_{gd}(3c)\)
In each iteration of Newton’s method, we calculate the loss, the gradient, and the Hessian once, invert the Hessian, and perform the matrix-vector multiplication, leading to the following total computational cost.
\(C_{nm} = t_{nm}(c + pc + k_1p^\gamma + k_2p^2 + 2c) = t_{nm}(3c + pc + k_1p^\gamma + k_2p^2)\)
If Newton’s method requires fewer computational units to complete than gradient descent, we have the following inequalities.
\[C_{nm} < C_{gd}\] \[t_{nm}(3c + pc + k_1p^\gamma + k_2p^2) < t_{gd}(3c)\] \[\frac{(3c + pc + k_1p^\gamma + k_2p^2)}{3c} < \frac{t_{gd}}{t_{nm}}\]
Thus, Newton’s method will only require less computational effort than gradient descent if the total number of iterations required to complete gradient descent, \(t_{gd}\), is more than \(\frac{3c + pc + k_1p^\gamma + k_2p^2}{3c}\) times the number of iterations required to complete Newton’s method.
Since \(p\) is raised to the power of \(2\) and to the power of \(\gamma\) in this expression, the ratio between \(t_{gd}\) and \(t_{nm}\) required for Newton’s method to be more efficient than gradient descent grows rapidly as \(p\) increases. Practically speaking, gradient descent is unlikely to need that many times more iterations than Newton’s method when the dataset has many features. For this reason, if working on a dataset with many features, gradient descent will almost certainly be more efficient overall than Newton’s method.
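To get a feel for how quickly this break-even ratio grows, the short sketch below evaluates \(\frac{3c + pc + k_1p^\gamma + k_2p^2}{3c}\) for a few values of \(p\) using entirely made-up constants \(c = 1\), \(k_1 = k_2 = 1\), and \(\gamma = 2.4\); the exact numbers depend on these constants, but the overall trend does not.

# evaluate the break-even ratio (3c + pc + k1*p**gamma + k2*p**2) / (3c)
# for increasing numbers of features p, using hypothetical cost constants
c, k1, k2, gamma = 1.0, 1.0, 1.0, 2.4

for p in [2, 10, 100, 1000]:
    ratio = (3*c + p*c + k1*p**gamma + k2*p**2) / (3*c)
    print(f"p = {p:>4}: gradient descent must take over {ratio:,.0f} times as many iterations")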
Conclusion
Completing this assignment gave me the opportunity to understand the strengths and weaknesses of two different optimization methods for logistic regression. By implementing and experimenting with both gradient descent and Newton’s method, I learned that Newton’s method may converge to an adequate solution in fewer training iterations than gradient descent. By counting the number of operations required by each optimizer, I discovered that fewer iterations does not necessarily mean lower computational cost. In fact, when working with a large number of features, gradient descent is almost certainly less computationally expensive than Newton’s method. In summary, if given the choice between gradient descent and Newton’s method on a dataset with many features, choose gradient descent!