Let’s compare logistic regression with AutoGluon, an easy-to-use AutoML library!
We’ll use the spambase data from the UCI ML Repository, which has $4601$ examples, $57$ features, and $0/1$ (non-spam/spam) labels, to find out which algorithm is better at detecting spam.
Snapshot of data
# | $X_1$ | $X_2$ | $X_3$ | $\cdots$ | $X_{56}$ | $X_{57}$ | $Y$ |
---|---|---|---|---|---|---|---|
0 | $.00$ | $.64$ | $.64$ | $\cdots$ | $61$ | $278$ | $1$ |
1 | $.21$ | $.28$ | $.50$ | $\cdots$ | $101$ | $1028$ | $1$ |
$\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ | $\vdots$ |
4600 | $.00$ | $.00$ | $.65$ | $\cdots$ | $5$ | $40$ | $0$ |
To keep it simple, let’s ignore the data aspects (class imbalance, normalization, etc.).
Logistic Regression
We’ll implement logistic regression with $\ell_2$ regularization from scratch (using numpy only).
Import Spambase Data
First, import the data, change the encoding from $0/1$ to $-1/1$ for convenience, and do a train-test split.
import numpy as np
np.random.seed(1)
data = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/"
+ "spambase/spambase.data", delimiter=',')
data = np.random.permutation(data)
data[:, -1] = data[:, -1]*2 - 1  # switch labels from 0/1 to -1/1
train_y, test_y = data[: 3000, -1:], data[3000:, -1:] # (3000, 1), (1601, 1)
train_x, test_x = data[: 3000, :-1], data[3000:, :-1] # (3000, 57), (1601, 57)
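Earlier we said we’d ignore class imbalance, but it’s worth a one-line peek at the label balance anyway (a quick sketch; the $\approx 39\%$ spam share is what I’d expect for spambase):

# share of spam examples (label 1 after the -1/1 relabeling)
print("spam share:", np.mean(train_y == 1))  # roughly 0.39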
Before we proceed, let’s do a very basic feature engineering step: add a column of $1$’s (the intercept term) and, for each original feature, an extra “indicator” that is $1$ if the corresponding $x$ is positive and $0$ otherwise, so that $[3,0,1]$ turns into $[1, 3, 0, 1, 1, 0, 1]$.
def phi(x):
    ''' adds a column of ones (intercept)
    and 0/1 indicators for positive features
    '''
m = x.shape[0]
x1 = np.hstack((np.ones((m, 1)), x))
x1 = np.hstack((x1, (x > 0).astype(float)))
return x1
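A quick usage check on the example from above (a minimal sketch, not part of the pipeline):

print(phi(np.array([[3., 0., 1.]])))  # [[1. 3. 0. 1. 1. 0. 1.]]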
Loss & Optimizer
It helps to write down the loss, the gradient of the loss, and the Hessian of the loss for logistic regression. The $\ell_2$-penalized loss is ($m$ is sample size, $n$ is dimensionality)
\[\begin{align*} L &= \left[\sum_{i=1}^m \ln(1+e^{-y_i\mathrm{x_i}' \mathrm{w}})\right] + \frac{\lambda}{2}\sum_{j=2}^n w_j^2 \end{align*}.\]Note that the first weight is not penalized since it corresponds to the intercept. The gradient and Hessian are
\[\begin{align*} \nabla_\mathrm{w} L &= - \left[\sum_{i=1}^m (1-s_i)y_i \mathrm{x}_i\right] + \lambda\mathrm{w}^* \\ \nabla_\mathrm{w}^2 L &= \left[\sum_{i=1}^m s_i(1-s_i)\mathrm{x}_i\mathrm{x}_i'\right] + \lambda I^*_n \end{align*}\](the $(1-s_i)$ factor comes from $\frac{d}{dz}\ln(1+e^{-z}) = -(1-s)$ evaluated at $z = y_i\mathrm{x}_i'\mathrm{w}$), where
\(s_i = (1+e^{-y_i\mathrm{x}'_i \mathrm{w}})^{-1},\qquad \mathrm{w}^* = \begin{bmatrix} 0 & w_2 & \cdots & w_n \end{bmatrix}', \qquad I^*_n = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}\).
def lrloss(w, x, y, lmbda):
''' logistic loss '''
wstar = w.copy()
wstar[0] = 0
    return (np.sum(np.logaddexp(0, -y*(x@w)))  # stable log(1+exp(-z))
            + .5*lmbda*np.sum(wstar*wstar))
def lrgrad(w, x, y, lmbda):
''' logistic gradient '''
s = 1/(1+np.exp(-y*(x@w)))
wstar = w.copy()
wstar[0] = 0
return -np.sum((1-s)*x*y, 0)[:, np.newaxis] + lmbda*wstar
def lrhess(w, x, y, lmbda):
''' logistic hessian '''
s = 1/(1+np.exp(-y*(x@w)))
istar = np.eye(w.shape[0])
istar[0, 0] = 0
return x.T@np.diag((s*(1-s))[:, 0])@x + lmbda*istar
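Before optimizing, it’s worth verifying the gradient code against central finite differences (a hedged sketch; numgrad, the subset size, and the random test point are my own choices, not part of the original pipeline):

def numgrad(f, w, eps=1e-6):
    ''' central finite-difference gradient '''
    g = np.zeros_like(w)
    for j in range(w.shape[0]):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2*eps)
    return g

xt, yt = phi(train_x[:100]), train_y[:100]  # small subset keeps this fast
wt = .01*np.random.randn(xt.shape[1], 1)    # random test point
diff = np.abs(lrgrad(wt, xt, yt, .1) - numgrad(lambda w: lrloss(w, xt, yt, .1), wt))
print(diff.max())  # should be tiny, around 1e-7 or smaller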
Optimization
To estimate the weights, let’s implement a Newton-Raphson optimizer,
\[\begin{align*} \mathrm{w}_{t+1} = \mathrm{w}_t - H^{-1}_{\mathrm{w}_t} g_{\mathrm{w}_t} \end{align*},\]where $H$ is the Hessian matrix and $g$ is the gradient vector.
This method is relatively straightforward, except when it doesn’t work: far from the optimum, a full Newton step can overshoot and actually increase the loss.
We’ll check whether the loss decreased after each iteration and alternate between Newton and gradient steps. Each time a second-order step fails to decrease the loss, double the step size and try a gradient step instead. If that also fails, keep halving the step size until the gradient step succeeds (yields a smaller loss). As soon as it does, return to second-order steps. Once the step sizes become too small, we are done.
(not a great implementation but demonstrates the idea)
def newton(w, fn, gradfn, hessfn):
''' newton-raphson optimizer'''
oldf = fn(w)
eta = 1
while True:
g = gradfn(w)
h = hessfn(w)
neww = w - np.linalg.solve(h, g) # newton's step
newf = fn(neww)
if newf >= oldf: # failed to decrease the loss :(
eta *= 2 # double the step size
while eta > 1e-10:
neww = w - eta*g # gradient step
newf = fn(neww)
if newf < oldf: # exit if loss decreased!
break
eta *= .5 # o/w keep halving
if eta < 1e-10: # finish if step size too small
return w
oldf = newf
w = neww
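As a quick smoke test of the optimizer (a toy quadratic of my own, not part of the original pipeline), a single Newton step should land exactly on the minimizer:

a = np.array([[1.], [2.], [3.]])
wopt = newton(np.zeros((3, 1)),
              lambda w: float(.5*np.sum((w - a)**2)),  # loss
              lambda w: w - a,                         # gradient
              lambda w: np.eye(3))                     # hessian
print(wopt.ravel())  # expect [1. 2. 3.]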
Train time!
We should be able to train the model now!
def trainlr(x, y, lmbda):
''' training to produce w estimate '''
    w0 = np.zeros((x.shape[1], 1))  # initializing at zero works well for LR
return newton(w0, lambda w: lrloss(w, x, y, lmbda),
lambda w: lrgrad(w, x, y, lmbda),
lambda w: lrhess(w, x, y, lmbda))
def lrerrorrate(x, y, w):
''' share of incorrectly classified '''
    return np.sum(y*(x@w) < 0)/y.shape[0]
lmb = .1
mytrain_x, mytest_x = phi(train_x), phi(test_x)
myw = trainlr(mytrain_x, train_y, lmb)
print(lrerrorrate(mytest_x, test_y, myw))
Which gives a test error of about $5.2\%$!
Cross-Validation
Let’s implement 3-fold cross-validation and plot the CV error as a function of $\lambda$.
def xvalideval(x, y, lmbda):
    ''' total validation error summed over folds '''
nfold = 3
px = phi(x)
m = x.shape[0]
splits = np.linspace(0, m, nfold+1).astype(int)
v = 0
for low, high in zip(splits[0:-1], splits[1:]):
# validation bucket, from low to high
validx = px[low:high, :]
validy = y[low:high, :]
# training bucket, everything else
trainx = np.vstack((px[:low, :], px[high:, :]))
trainy = np.vstack((y[:low, :], y[high:, :]))
v += lrerrorrate(validx, validy, trainlr(trainx, trainy, lmbda))
return v
ls = np.logspace(-4, 2, 10)
xverr = ls.copy()
bestv, bestl = None, None
for i, l in enumerate(ls):
xverr[i] = xvalideval(train_x, train_y, l)
if bestv is None or bestv > xverr[i]:
bestv = xverr[i]
bestl = l
import matplotlib.pyplot as plt
plt.semilogx(ls, xverr)
plt.ylabel('CV error', size=16)
plt.xlabel(r'$\lambda$', size=16)  # semilogx already puts the axis on a log scale
plt.show()
Hence we have $\lambda \approx 1$ achieving the lowest validation error.
testerr = lrerrorrate(phi(test_x), test_y, trainlr(phi(train_x), train_y, bestl))
print(f"lambda = {bestl}, test error = {testerr}")
Producing a test error of $\boxed{5.1\%}$, not bad!
AutoGluon
For comparison, let’s see how an AutoML library performs: it will automatically preprocess the data, then train and ensemble a range of models.
from autogluon.tabular import TabularPredictor, TabularDataset
train_data, test_data = TabularDataset(data[:3000]), TabularDataset(data[3000:])
predictor = TabularPredictor(label=57).fit(train_data)
y_pred = predictor.predict(test_data.iloc[:, :57])
testerr = np.mean(test_y.ravel()*y_pred.to_numpy() < 0)
print(f"test error = {testerr}")
Yielding a test error of $\boxed{3.6\%}$, beating LR by $1.5$ percentage points!
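To see where the gain comes from, AutoGluon can list every model it trained along with its score via `TabularPredictor.leaderboard` (a quick sketch; the exact columns vary by AutoGluon version):

# per-model scores on the held-out data; the weighted ensemble is usually on top
print(predictor.leaderboard(test_data))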
Conclusion