Moving to Matrices: Multiple Linear Regression in Vanilla Rust

Stepping up from simple lines to multi-dimensional planes. A deep dive into Multiple Linear Regression, Feature Scaling, and Matrix math in Rust.

As I dive into the world of Artificial Intelligence, moving away from my daily work as an SAP Full Stack Developer, I’ve realized I need to start at the very beginning. Yesterday, I looked at Simple Linear Regression. Today, it’s time to level up to Multiple Linear Regression. Before getting lost in the hype of LLMs or complex neural networks, I want to continue stripping away the magic to understand the absolute foundation.

When dealing with real-world data, you rarely have just one variable. For example, if you want to predict the price of an apartment, you don’t just look at the surface area; you also look at the number of rooms.

The Mathematical Foundation

At its core, linear regression assumes that the output (the price) is a linear combination of the input features. Instead of a single slope, we now have multiple weights—one for each feature.

Our new hypothesis function predicts an apartment’s price from its surface and its number of rooms. Each feature gets its own weight: how many additional thousands of euros each square meter and each room contributes. On top of that sits a bias (the base price when all features are 0).
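
Written out for our two features (x₁ = surface, x₂ = rooms), the hypothesis is:

ŷ = w₁·x₁ + w₂·x₂ + b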

Moving to Matrices

Handling equations one by one in code is slow and hard to read. To predict all examples simultaneously, we use matrices: an input matrix X of dimension n × p (where n is the number of examples and p is the number of features) multiplied by a weight vector w, so that every prediction is computed in one shot as ŷ = Xw + b (with the bias b added to each row).
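
As a minimal sketch of what that product looks like with plain nested Vecs (the helper below is my own illustration, not part of the final program), each prediction is the dot product of one row of X with the weight vector, plus the bias:

fn predict(x: &[Vec<f64>], w: &[f64], b: f64) -> Vec<f64> {
    // One dot product per row: y_hat[i] = sum_j(x[i][j] * w[j]) + b
    x.iter()
        .map(|row| row.iter().zip(w).map(|(xi, wi)| xi * wi).sum::<f64>() + b)
        .collect()
}

fn main() {
    // Two examples, p = 2 features each (surface, rooms); weights chosen for illustration
    let x = vec![vec![40.0, 1.0], vec![70.0, 2.0]];
    println!("{:?}", predict(&x, &[3.5, 20.0], 0.0)); // [160.0, 285.0]
}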

The Importance of Feature Scaling

When moving to multiple variables, we encounter a new problem: different scales. In our dataset, the surface area might range from 40 to 120, while the number of rooms only ranges from 1 to 3. Without normalization, the algorithm’s gradient will be completely dominated by the surface area, making the learning process extremely slow or unstable.

To fix this, we use Min-Max Normalization to force every feature into a [0, 1] range.
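
For each feature j, every value is rescaled using that feature’s minimum and maximum observed in the training data:

x′ = (x − minⱼ) / (maxⱼ − minⱼ)

so the smallest observed value maps to 0 and the largest to 1.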

Cost Function and Gradients

To measure the error of our model, we still use the Mean Squared Error (MSE). It has a convex “bowl” shape, meaning it has a single global minimum that our algorithm can slide down to.
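
Concretely, with n examples, the cost tracked in the code below is the halved mean of the squared residuals:

J(w, b) = (1/(2n)) · Σᵢ (ŷᵢ − yᵢ)²

The factor of 1/2 is pure convenience: it cancels the 2 that appears when differentiating the square, leaving cleaner gradients.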

To find the bottom of that bowl, we calculate the gradients using the chain rule. In vectorized matrix form, the gradient requires the transposed input matrix to ensure the dimensions are compatible.
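
In vectorized form, with residual vector (ŷ − y), the gradients are:

∂J/∂w = (1/n) · Xᵀ(ŷ − y)
∂J/∂b = (1/n) · Σᵢ (ŷᵢ − yᵢ)

Xᵀ is p × n and the residual is n × 1, so the product is p × 1: exactly one gradient entry per weight.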

Implementation in Vanilla Rust

I could easily use Python with PyTorch to do this in three lines of code. But to truly grasp the mechanics, I am continuing to write everything from scratch in Rust. Since this is vanilla Rust without an external math crate like ndarray, I’m using nested vectors to perform the matrix operations by hand, following the same pseudo-code you would find in any textbook.

Here is the implementation of a Batch Gradient Descent algorithm using the apartment dataset:

fn main() {
    // Dataset
    // X = [Surface (m2), Rooms]
    let raw_x: Vec<Vec<f64>> = vec![
        vec![40.0, 1.0],
        vec![70.0, 2.0],
        vec![100.0, 3.0],
        vec![120.0, 3.0],
    ];
    // y = Price (k)
    let y: Vec<f64> = vec![160.0, 285.0, 410.0, 480.0];
    
    let n = y.len() as f64;
    let p = raw_x[0].len();

    // 1. Min-Max Normalization
    let mut min_x = vec![f64::MAX; p];
    let mut max_x = vec![f64::MIN; p];
    
    for row in &raw_x {
        for j in 0..p {
            if row[j] < min_x[j] { min_x[j] = row[j]; }
            if row[j] > max_x[j] { max_x[j] = row[j]; }
        }
    }

    let mut x_scaled: Vec<Vec<f64>> = vec![vec![0.0; p]; raw_x.len()];
    for i in 0..raw_x.len() {
        for j in 0..p {
            x_scaled[i][j] = (raw_x[i][j] - min_x[j]) / (max_x[j] - min_x[j]);
        }
    }

    // 2. Initialization
    let mut w: Vec<f64> = vec![0.0; p];
    let mut b: f64 = 0.0;
    
    let learning_rate: f64 = 0.1;
    let epochs: usize = 10000;

    // 3. Training Loop
    for epoch in 0..epochs {
        let mut dw = vec![0.0; p];
        let mut db = 0.0;
        let mut cost = 0.0;

        for i in 0..raw_x.len() {
            // Forward pass: y_hat = X * w + b
            let mut prediction = b;
            for j in 0..p {
                prediction += w[j] * x_scaled[i][j];
            }
            
            // Residuals
            let error = prediction - y[i]; 
            
            // Gradients
            for j in 0..p {
                dw[j] += error * x_scaled[i][j];
            }
            db += error;
            cost += error * error;
        }

        // Averages and parameter updates
        for j in 0..p {
            dw[j] /= n;
            w[j] -= learning_rate * dw[j];
        }
        db /= n;
        b -= learning_rate * db;
        
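        // Halved MSE: J = (1/(2n)) * Σ error²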
        cost /= 2.0 * n;

        // Print progress
        if epoch % 2000 == 0 || epoch == epochs - 1 {
            println!("Epoch {:>5} | Cost: {:.4} | w1: {:.4}, w2: {:.4}, b: {:.4}", epoch, cost, w[0], w[1], b);
        }
    }

    // 4. Inference: Predicting a new price
    let test_x = vec![85.0, 2.0]; // 85m2, 2 rooms
    let mut pred_y = b;
    
    // Crucial: We must normalize test data using training min/max
    for j in 0..p {
        let test_scaled = (test_x[j] - min_x[j]) / (max_x[j] - min_x[j]);
        pred_y += w[j] * test_scaled;
    }
    
    // Should interpolate around ~340k
    println!("\nPrediction for 85m2, 2 rooms: {:.2}k", pred_y); 
}
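
A quick sanity check on that expected output: this toy dataset is exactly linear, since price = 3.5 · surface + 20 · rooms fits all four training points (for instance, 3.5 · 100 + 20 · 3 = 410). For 85 m² and 2 rooms, that gives 3.5 · 85 + 20 · 2 = 337.5, which is why the prediction should land around 340k.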

By building it from scratch in Rust, I forced myself to understand matrix transformations, feature scaling pitfalls, and the calculus behind gradients before hiding them behind a framework. It demystifies the learning process, showing that these algorithms learn simply by predicting, measuring the error, computing the direction of improvement, and adjusting the weights, over and over.

Not bad for day two.

Further Reading

If you want to read my full mathematical breakdown of this algorithm, you can grab the PDF notes below (Note: written in French).

Régression Linéaire Multiple (PDF)