The First Equation of Machine Learning: Linear Regression in Vanilla Rust

A deep dive into the fundamental equation of Machine Learning, exploring the math behind Linear Regression and implementing it in Rust from scratch.


As I dive into the world of Artificial Intelligence, moving away from my daily work as an SAP Full Stack Developer, I've realized I need to start at the very beginning. Today is quite literally my first day learning Machine Learning. Before getting lost in the hype of LLMs or complex neural networks, I wanted to strip away the magic and understand the absolute foundation.

For me, that meant looking at the very first equation of Machine Learning: Linear Regression.

The Mathematical Foundation

At its core, linear regression attempts to model the relationship between a scalar response and one or more explanatory variables by fitting a linear equation to observed data.

When I looked at Simple Linear Regression (a single variable), the hypothesis function is defined simply as:

y = wx + b

Where:

  • y is our predicted output.
  • x is our input feature.
  • w is the weight (the slope).
  • b is the bias (the intercept).

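In code, the hypothesis really is just that line. A minimal sketch (the `predict` helper and its sample values are mine, not part of the post's program):

```rust
// Hypothesis: predict y for a given input x using weight w and bias b.
fn predict(w: f64, b: f64, x: f64) -> f64 {
    w * x + b
}

fn main() {
    // With w = 2.0 and b = 1.0, the line is y = 2x + 1.
    let y = predict(2.0, 1.0, 3.0);
    println!("{}", y); // prints 7
}
```
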
To find the optimal values for w and b, I needed a way to measure how far off my predictions were from the actual true values in my dataset. In machine learning, we do this using a Cost Function. The most standard one for this use case is the Mean Squared Error (MSE):

J(w, b) = \frac{1}{2N} \sum_{i=1}^{N} (y^{(i)} - (wx^{(i)} + b))^2

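That cost is only a few lines of Rust. A minimal sketch using the same 1/(2N) convention (the `mse` function and its toy data are illustrative assumptions, not the post's code):

```rust
// Mean Squared Error with the 1/(2N) convention from the formula above.
fn mse(w: f64, b: f64, xs: &[f64], ys: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let sum: f64 = xs
        .iter()
        .zip(ys)
        .map(|(x, y)| {
            let err = y - (w * x + b); // residual: actual minus predicted
            err * err
        })
        .sum();
    sum / (2.0 * n)
}

fn main() {
    let xs = [1.0, 2.0, 3.0];
    let ys = [2.0, 4.0, 6.0];
    // A perfect fit (w = 2, b = 0) should give zero cost.
    println!("{}", mse(2.0, 0.0, &xs, &ys)); // prints 0
}
```
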
Once we have the cost, we use Gradient Descent to iteratively update our parameters and minimize that cost. We calculate the partial derivatives (the gradients) with respect to w and b:

w = w - \alpha \frac{\partial J}{\partial w} = w - \alpha \frac{1}{N} \sum_{i=1}^{N} (wx^{(i)} + b - y^{(i)})x^{(i)}

b = b - \alpha \frac{\partial J}{\partial b} = b - \alpha \frac{1}{N} \sum_{i=1}^{N} (wx^{(i)} + b - y^{(i)})

Here, α is our learning rate, dictating how big a step we take in the direction of the steepest descent.
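
Those two update rules translate almost symbol-for-symbol into code. A minimal sketch of a single gradient-descent step, isolated from the training loop (the `gradient_step` function and its tiny dataset are mine, for illustration):

```rust
// One gradient-descent update for (w, b), following the update rules above.
fn gradient_step(w: f64, b: f64, xs: &[f64], ys: &[f64], alpha: f64) -> (f64, f64) {
    let n = xs.len() as f64;
    let mut dw = 0.0;
    let mut db = 0.0;
    for (x, y) in xs.iter().zip(ys) {
        let error = (w * x + b) - y; // (wx + b - y)
        dw += error * x;             // partial derivative term for w
        db += error;                 // partial derivative term for b
    }
    // Average the sums (the 1/N factor) and step against the gradient.
    (w - alpha * dw / n, b - alpha * db / n)
}

fn main() {
    let xs = [1.0, 2.0];
    let ys = [2.0, 4.0];
    // Starting from w = b = 0 with alpha = 0.1.
    let (w, b) = gradient_step(0.0, 0.0, &xs, &ys, 0.1);
    println!("after one step: w = {:.2}, b = {:.2}", w, b); // w = 0.50, b = 0.30
}
```
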

Implementation in Vanilla Rust

I could have easily used Python with PyTorch or Scikit-Learn to do this in three lines of code. But to truly grasp the mechanics, I decided to write it from scratch using a systems language. Rust forces me to be explicit about memory and types, which I find incredibly helpful for building solid mental models.

Here is my pure Rust implementation, using nothing but standard vectors:

fn main() {
    let xs: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0, 5.0];
    let ys: Vec<f64> = vec![2.0, 4.0, 5.5, 8.0, 10.5];
    let n = xs.len() as f64;

    let mut w: f64 = 0.0;
    let mut b: f64 = 0.0;
    
    let learning_rate: f64 = 0.01;
    let epochs: usize = 1000;

    for epoch in 0..epochs {
        // Accumulate the gradient sums and the squared-error cost.
        let mut dw = 0.0;
        let mut db = 0.0;
        let mut cost = 0.0;

        for i in 0..xs.len() {
            let prediction = w * xs[i] + b;
            let error = prediction - ys[i];

            dw += error * xs[i]; // contribution to dJ/dw: (wx + b - y) * x
            db += error;         // contribution to dJ/db: (wx + b - y)
            cost += error * error;
        }

        // Average the sums: mean gradients and the 1/(2N) MSE cost.
        dw /= n;
        db /= n;
        cost /= 2.0 * n;

        // Step against the gradient.
        w -= learning_rate * dw;
        b -= learning_rate * db;

        if epoch % 200 == 0 {
            println!("Epoch {:>4} | Cost: {:.4} | w: {:.4}, b: {:.4}", epoch, cost, w, b);
        }
    }

    println!("Final Equation: y = {:.4}x + {:.4}", w, b);
    
    let test_x = 6.0;
    let pred_y = w * test_x + b;
    println!("Prediction for x={}: {:.4}", test_x, pred_y);
}
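
As a sanity check on what gradient descent converges to (this isn't in the program above), simple linear regression also has a closed-form least-squares solution: w = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − w·x̄. A minimal sketch computing it for the same dataset:

```rust
// Closed-form ordinary-least-squares fit for simple linear regression:
// w = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2), b = mean_y - w * mean_x.
fn ols_fit(xs: &[f64], ys: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;
    let num: f64 = xs
        .iter()
        .zip(ys)
        .map(|(x, y)| (x - mean_x) * (y - mean_y))
        .sum();
    let den: f64 = xs.iter().map(|x| (x - mean_x) * (x - mean_x)).sum();
    let w = num / den;
    (w, mean_y - w * mean_x)
}

fn main() {
    // Same dataset as the gradient-descent program.
    let xs = [1.0, 2.0, 3.0, 4.0, 5.0];
    let ys = [2.0, 4.0, 5.5, 8.0, 10.5];
    let (w, b) = ols_fit(&xs, &ys);
    println!("closed form: w = {:.4}, b = {:.4}", w, b);
}
```

With enough epochs, the trained w and b should land very close to this analytical answer, which is a nice way to confirm the loop is doing the right thing.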

By building it from scratch in Rust, I forced myself to understand the raw math before hiding it behind a library. It’s a small step, but it demystifies the learning process. It builds the intuition I know I’ll need as I continue this journey into more complex models. Not bad for day one.