
Back To Basics: An Introduction to Bayesian Modeling

I was preparing another blog series on how transformers work when the real world disrupted my focus and once again pulled me into the world of Bayesian inference. This time it was not geopolitical tensions that caught my attention but the relentless ascent of the S&P 500 despite the perceptibly turbulent social and political environment of the United States. Confused, I sought answers by trying to identify relationships between economic activity and the share prices of the largest American corporations. Assuming this relationship to be noisy, I reached for the tool that can quantify that noise via reported uncertainty. This blog post is an account of that journey, starting with a review of the tools I plan to use to learn the relationships of interest.


This first installment is introductory. A subsequent post will attempt to recreate the results of a decades-old case study with modern data. In a final installment, a hypothesis about relationships in modern times will be proposed and audited for credibility using the tooling from the first two installments.


Key:

  • Orange Boxes are code

  • Yellow boxes are key ideas or bullets for the section

  • Purple boxes are deep dives and should be skippable if the reader does not want further details

  • Green boxes contain definitions and equations (not code)

Background

To begin, let us review how Bayesian inference works. Imagine we have some measurable properties and an unmeasurable property of interest. For example, we might know the current rate of inflation but we do not know the median annualized rate of return of an equity in the S&P 500 five years from now.


If we assume the relationship between these properties is linear, we can write a simple linear regression equation to express it:


RoR = W * rate_of_inflation + B                   [1]


We could estimate W and B by fitting the expression to data, using gradient descent to learn the values of W and B that minimize the loss between predictions and observations in our dataset. This process yields single best-fit values for W and B.
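For concreteness, here is a minimal sketch of that point-estimate approach. The data is synthetic and the variable names are illustrative; I also use closed-form least squares instead of gradient descent for brevity, since both recover the same single best fit:

import numpy as np

# Hypothetical synthetic data: rate of return as a noisy linear function of inflation
rng = np.random.default_rng(0)
rate_of_inflation = rng.normal(3.5, 0.3, size=100)
ror = 0.5 * rate_of_inflation + 0.1 + rng.normal(0, 0.2, size=100)

# Solve for the single best-fit W and B (ordinary least squares)
A = np.column_stack([rate_of_inflation, np.ones_like(rate_of_inflation)])
coef, *_ = np.linalg.lstsq(A, ror, rcond=None)
W, B = coef
print(W, B)  # point estimates only; no uncertainty attached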


How confident are we that these estimates accurately capture the underlying relationship between the rate of inflation and the rate of return? The loss function error only measures the accuracy of the estimation relative to our dataset. It’s reasonable to assume we have not accounted for all the variables that affect the rate of return. 


Bayesian modeling enables us to quantify the uncertainty due to these missing signals. By treating W and B as random variables rather than point estimates, we can learn a distribution rather than a single value for each variable. A distribution is a more accurate representation of a variable in the context of a complex environment because its value is subject to variation from an immeasurable number of influencing factors.


Exhibit A: The loss from the squared residuals between observed data and predictions measures the accuracy of the model. Our confidence in the model is quantified by this loss.

Exhibit B: A distribution of relationships is learned so rather than a loss we have a variance and a mean. 


Why is Variance More Useful Than Loss?

  • Bayesian inference gives us parameter level insight: a variance for each weight rather than a point estimate that minimizes the composite loss value.

  • Bayesian inference allows us to construct credible intervals given an observation (see the sketch below). A model learned via linear regression will only give us a point estimate.
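To make the second bullet concrete: once we have posterior samples for W and B (as we will later in this post), an interval estimate is simple percentile arithmetic over the samples. The arrays below are hypothetical stand-ins for draws from an inference engine:

import numpy as np

# Hypothetical posterior draws; in practice these come from the inference engine
rng = np.random.default_rng(1)
w_samples = rng.normal(0.5, 0.1, size=2000)
b_samples = rng.normal(0.3, 0.05, size=2000)

x_obs = 3.5  # a single observed inflation value
y_samples = w_samples * x_obs + b_samples  # induced distribution over predictions

low, high = np.percentile(y_samples, [2.5, 97.5])
print(f"95% credible interval: [{low:.3f}, {high:.3f}]")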

Data Collection

Let us start with a very basic example: assume there is a linear relationship between three fundamental metrics and the annualized rate of return of an equity.


The annualized rate of return (ARR) of an equity is the geometric average rate at which the investment grows per year over a given period, assuming compounding. If an equity’s value changes from P_0 to P_T, the annualized rate of return is defined by

r_a = (P_T / P_0)^(1/T) - 1

where T is the number of years. We will define our time period on a monthly basis, so T = 1/12 for our toy example.
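As a quick sanity check of the formula (the prices are made up): a 1% gain over one month annualizes to roughly 12.7%, because the monthly gain compounds twelve times.

# Hypothetical prices one month apart
P0, PT = 100.0, 101.0
T = 1 / 12  # one month, in years
ra = (PT / P0) ** (1 / T) - 1
print(ra)  # ~0.1268, i.e. about a 12.7% annualized rate of return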


Next let us look at appropriate independent variables. These are the values we will use to try and predict the dependent variable, the ARR. Since we are predicting monthly patterns we need data that also changes monthly. I have chosen the following:

  • Inflation. The mean inflation for the last year was 3.58 with variance 0.11 and standard deviation 0.33. Inflation is measured in percent per year.

  • Money Supply. From September 2024 to September 2025 the mean money supply in the United States was 21681.98 billion dollars, with a variance of 91575 and a standard deviation of 302.

  • Earnings per share (a measure of profitability per share). For the October 2024 to October 2025 year the mean earnings per share was 221.56, variance 15.9, standard deviation 3.99.


Deep Dive 0: Geometric Averaging

”Geometric” means multiplicative averaging rather than additive. In additive averaging you sum the annual returns and divide by the number of years. In multiplicative averaging you multiply the annual growth factors and take the Tth root. Geometric averaging is an accurate measurement of investment return because of compounding. For example, suppose an account grows by 50% in year one and loses 50% in year two, and I invest 100 dollars into that account. If I used additive averaging I would compute (50% - 50%)/2 = 0%. But that suggests I would still have $100 in that account after two years. In reality, after the first year I have $150 and at the end of year two I have $75, so it was not a 0% return investment! Now let’s calculate it using geometric averaging: ((1 + 0.5)(1 - 0.5))^(1/2) - 1 ≈ -13.4%. This suggests that after year one I have $86.60 and after another year I have $75.00, which is an accurate reflection of reality.
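The arithmetic in this example is easy to verify in a few lines:

# +50% in year one, -50% in year two, starting from $100
growth_factors = [1.5, 0.5]
geo = (growth_factors[0] * growth_factors[1]) ** 0.5 - 1
print(geo)                    # ~ -0.134, the geometric average return
print(100 * (1 + geo) ** 2)   # ~ 75 dollars, matching reality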

Defining the Model

Next let us transform our relationship defined in [1] into a Bayesian model:


Model 1

# Weights

μ_{0,j} ~ N(0, 1),          j = 0, 1, 2

σ_{0,j} ~ HalfNormal(1),    j = 0, 1, 2

W_j ~ N(μ_{0,j}, σ_{0,j}),  j = 0, 1, 2


# bias

μ_1 ~ N(0, 1)

σ_1 ~ HalfNormal(1)

B ~ N(μ_1, σ_1)


# observation

σ_2 ~ HalfNormal(1)

# More accurate models will use a Beta distribution for Y since ARR lies in [0, 1], but for simplicity we will just use a Normal distribution.

Y ~ N(X · [W_0, W_1, W_2]^T + B, σ_2)


We need to encode this into a program so we can train it. I have chosen to use numpyro. 


Step 1: Encode the model. 

This is a function that accepts the independent variables X and returns a distribution.


Code 1: Model 1 Implementation

import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist


def model(X):

   # Weights

   n_features = X.shape[1]

   mu_0 = numpyro.sample("mu_0", dist.Normal(jnp.zeros(n_features), 1.0))

   sigma_0 = numpyro.sample("sigma_0", dist.HalfNormal(jnp.ones(n_features)))

   weight_node = numpyro.sample("w", dist.Normal(mu_0, sigma_0))


   # bias

   mu_1 = numpyro.sample("mu_1", dist.Normal(0., 1.0))

   sigma_1 = numpyro.sample("sigma_1", dist.HalfNormal(1.0))

   bias_node = numpyro.sample("b", dist.Normal(mu_1, sigma_1))

  

   # observation

   sigma_node = numpyro.sample("sigma", dist.HalfNormal(1.0))

   mu_node = jnp.dot(X, weight_node) + bias_node

   leaf_dist = dist.Normal(mu_node, sigma_node)

   return leaf_dist


Step 2: Generate Data. 

Before we train the model by fitting real data to the observations, it is useful to synthesize data first so we can test that the model learns correctly. We synthesize the data by selecting values for our random variables μ_{0,j}, σ_{0,j}, μ_1, σ_1, W_j, and B to create a distribution and then sampling N values from that distribution. Here I sample data from distributions parameterized with values from real data (see the previous section for the mean and standard deviation listings for earnings, inflation, and money supply) because I want the unit test to be as realistic as possible.


Code 2: Data generation

import jax.numpy as jnp
import numpy as np
import numpyro.distributions as dist
from jax import random


# again, note that this is not really the right distribution but just follow
# along for the purpose of learning model building basics
def generate_data(W, B, X, sigma_y=1.0, rng_key=random.PRNGKey(0)):
    mu_node = jnp.dot(X, W) + B
    leaf_dist = dist.Normal(mu_node, sigma_y)
    Y = leaf_dist.sample(rng_key)
    return Y


rng = np.random.default_rng(42)
n_samples = 1000  # or however many you want
n_features = 3  # EPS, Inflation, Money Supply

# Initialize matrix
X = np.zeros((n_samples, n_features))

# --- Inflation (percent per year) ---
infl_mean = 3.58
infl_std = 0.33
X[:, 1] = rng.normal(loc=infl_mean, scale=infl_std, size=n_samples)

# --- Money Supply (billions of USD) ---
ms_mean = 21681.98
ms_std = 302
X[:, 2] = rng.normal(loc=ms_mean, scale=ms_std, size=n_samples)

# --- Earnings per Share (USD per share) ---
eps_mean = 221.56
eps_std = 3.99
X[:, 0] = rng.normal(loc=eps_mean, scale=eps_std, size=n_samples)

# --- Hierarchical parameters for synthetic data ---
mu_0 = jnp.array([0.5, -0.3, 0.7])        # top-level means for weights
sigma_0 = jnp.array([0.1, 0.2, 0.15])     # top-level std for weights
mu_1 = 0.3                                # top-level mean for bias
sigma_1 = 0.05                            # std for bias

# Sample W and B from hierarchical priors
rng_key = random.PRNGKey(0)
W = dist.Normal(mu_0, sigma_0).sample(rng_key)
print(W)
rng_key, subkey = random.split(rng_key)
B = dist.Normal(mu_1, sigma_1).sample(subkey)
print(B)

# Generate synthetic observations
rng_key, subkey = random.split(rng_key)
Y = generate_data(W, B, X, sigma_y=1.0, rng_key=subkey)



Note that the features have vastly different scales and will need to be normalized before training. Otherwise we could have gradient instability that could cause divergences. 


Deep Dive 1: Normalization

We train on

X_norm = (X - mean(X)) / std(X)   (applied column-wise per feature)

Y_norm = (Y - mean(Y)) / std(Y)


Thus we learn the following relationship:

Y_norm = W_norm · X_norm + B_norm


Thus if we want the unnormalized W and B, we substitute the normalization expressions:

(Y - mean(Y)) / std(Y) = W_norm · (X - mean(X)) / std(X) + B_norm


After rearranging the terms so that we match the linear relationship, we get the following:

Y = SUM[i=1..3]{ (std(Y)/std(X[i])) W_norm[i] X[i] } + mean(Y) + std(Y) B_norm - SUM[i=1..3]{ (std(Y)/std(X[i])) W_norm[i] mean(X[i]) }


Thus to get de-normalized weights and bias we use the following conversion:

W_orig[i] = (std(Y)/std(X[i])) W_norm[i]                                  [1.1]

B_orig = mean(Y) + std(Y) B_norm - SUM[i=1..3]{ W_orig[i] mean(X[i]) }
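A quick numerical check of [1.1] is reassuring. The sketch below uses noiseless synthetic data with made-up parameters, fits the normalized regression in closed form, and then inverts the conversion to recover the original W and B:

import numpy as np

rng = np.random.default_rng(3)
# Synthetic features with the rough scales of EPS, inflation, and money supply
X = rng.normal([221.56, 3.58, 21681.98], [3.99, 0.33, 302.0], size=(500, 3))
W_true, B_true = np.array([0.5, -0.3, 0.7]), 0.3
Y = X @ W_true + B_true  # noiseless, so recovery should be exact

# Normalize, fit the normalized relationship, then invert per [1.1]
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
Yn = (Y - Y.mean()) / Y.std()
coef, *_ = np.linalg.lstsq(np.column_stack([Xn, np.ones(len(Xn))]), Yn, rcond=None)
W_norm, B_norm = coef[:3], coef[3]

W_orig = (Y.std() / X.std(axis=0)) * W_norm
B_orig = Y.mean() + Y.std() * B_norm - np.sum(W_orig * X.mean(axis=0))
print(W_orig, B_orig)  # recovers ~[0.5, -0.3, 0.7] and ~0.3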

Step 3: Fit the Model

We then fit these samples to our model and ask the inference engine to learn the values of the random variables given the observations we generated. These values should match the values we chose to generate the samples. The distribution learned by the inference engine is called the posterior distribution because it models our beliefs about the weights and bias after observing the data.


I wrapped the training code in a Predictor class so that the normalizer is paired with the posterior distribution. Markov Chain Monte Carlo (MCMC) is the algorithm we use to approximate the often intractable posterior distribution. MCMC depends on a sampler to propose samples from the posterior. The No-U-Turn Sampler (NUTS) is a Hamiltonian Monte Carlo variant that uses Hamiltonian dynamics to propose new states (i.e. a snapshot of the graph).


Code 3: Fit the model

import jax
import jax.numpy as jnp
import numpyro
from numpyro.infer import MCMC, NUTS


class Normalizer:

   def __init__(self):

       self.means = None

       self.stds = None


   def fit(self, X):

       self.means = jnp.mean(X, axis=0)

       self.stds = jnp.std(X, axis=0)

        # guard against division by zero for constant features
        self.stds = jnp.where(self.stds == 0, 1.0, self.stds)


   def transform(self, X):

       return (X - self.means) / self.stds


   def inverse_transform(self, X_norm):

       return X_norm * self.stds + self.means


# the train method pairs the model from code box 1 with observations. This is needed by the No-U-Turn algorithm to compute gradients of the log posterior for sampling

def train(X, y):

   graph = model(X)

   numpyro.sample("obs", graph, obs=y)



class Predictor:

   def __init__(self, X_train, y_train):

       # Normalize the data

       self.normalizer = Normalizer()

       self.normalizer.fit(X_train)

       X = self.normalizer.transform(X_train)

       self.y_normalizer = Normalizer()

       self.y_normalizer.fit(y_train.reshape(-1, 1))

       Y = self.y_normalizer.transform(y_train.reshape(-1, 1)).ravel()


       # Use Hamiltonian Monte Carlo sampler to compute the posterior distribution

       kernel = NUTS(train)

       # MCMC computes integrals via sampling from an approximation of the posterior distribution.

       mcmc = MCMC(kernel, num_warmup=1000, num_samples=2000, num_chains=4)

       rng_key = jax.random.PRNGKey(0)

       mcmc.run(rng_key, X=jnp.array(X), y=jnp.array(Y))

       posterior_samples = mcmc.get_samples()

       self.posterior = posterior_samples



Step 4: Testing the Model

We have our model, the data, and the inference engine. We can’t make predictions yet but we can test that the model learns appropriate distributions for our random variables. 

Below is the full code and the output. Let’s step through what the output means so we can verify that the model inference is working as expected. 


Code 4: Evaluate Inference (Data Synthesis + Model + inference)

import jax.numpy as jnp

import numpy as np

import numpyro.distributions as dist

from create_model import Predictor

import matplotlib.pyplot as plt

from jax import random


def generate_data(W, B, X, sigma_y=1.0, rng_key=random.PRNGKey(0)):

   mu_node = jnp.dot(X, W) + B

   leaf_dist = dist.Normal(mu_node, sigma_y)

   Y = leaf_dist.sample(rng_key)

   return Y


if __name__ == "__main__":

   rng = np.random.default_rng(42)

   n_samples = 1000

   n_features = 3  # EPS, Inflation, Money Supply


   # Initialize matrix

   X = np.zeros((n_samples, n_features))


   # --- Inflation (percent per year) ---

   infl_mean = 3.58

   infl_std = 0.33

   X[:, 1] = rng.normal(loc=infl_mean, scale=infl_std, size=n_samples)


   # --- Money Supply (billions of USD) ---

   ms_mean = 21681.98

   ms_std = 302

   X[:, 2] = rng.normal(loc=ms_mean, scale=ms_std, size=n_samples)


   # --- Earnings per Share (USD per share) ---

   eps_mean = 221.56

   eps_std = 3.99

   X[:, 0] = rng.normal(loc=eps_mean, scale=eps_std, size=n_samples)


   # --- Hierarchical parameters for synthetic data ---

   mu_0 = jnp.array([0.5, -0.3, 0.7])        # top-level means for weights

   sigma_0 = jnp.array([0.1, 0.2, 0.15])     # top-level std for weights

   mu_1 = 0.3                                # top-level mean for bias

   sigma_1 = 0.05                             # std for bias


   # Sample W and B from hierarchical priors

   rng_key = random.PRNGKey(0)

   W = dist.Normal(mu_0, sigma_0).sample(rng_key)

   print(W)

   rng_key, subkey = random.split(rng_key)

   B = dist.Normal(mu_1, sigma_1).sample(subkey)

   print(B)


   # Generate synthetic observations

   rng_key, subkey = random.split(rng_key)

   Y = generate_data(W, B, X, sigma_y=1.0, rng_key=subkey)


   # next, invoke our inference engine to infer W and B given X and Y

   # and check the results against the values we chose for W and B

   predictor = Predictor(X, Y)

   posterior = predictor.posterior


   # Extract normalized posterior samples

   weights_norm = posterior["w"]  # shape: (num_samples, num_features)

   bias_norm = posterior["b"]  # shape: (num_samples,)

   sigma_norm = posterior["sigma"]  # observation noise in normalized space


   # Retrieve normalizer stats

   X_stds = predictor.normalizer.stds

   X_means = predictor.normalizer.means

   y_std = predictor.y_normalizer.stds[0]

   y_mean = predictor.y_normalizer.means[0]




   # denormalize the weights (see deep dive box 1)

   weights_orig = weights_norm * (y_std / X_stds)  # shape: (num_samples, num_features)


   # Posterior mean and std for weights

   weights_mean = weights_orig.mean(axis=0)

   weights_std = weights_orig.std(axis=0)


   # denormalize the bias (see deep dive box 2)

   bias_orig = y_mean + y_std * bias_norm - jnp.sum(weights_orig * X_means, axis=1)

   bias_mean = bias_orig.mean()

   bias_std = bias_orig.std()


   for i, (w_mean, w_std) in enumerate(zip(weights_mean, weights_std)):

     print(f"Weight w[{i}] (original scale) → mean: {w_mean:.4f}, std: {w_std:.4f}")


   print(f"Bias (original scale) → mean: {bias_mean:.4f}, std: {bias_std:.4f}")


   # Inverse-transform sigma

   sigma_orig = sigma_norm * y_std


   # Plotting

   num_features = weights_orig.shape[1]

   fig, axes = plt.subplots(num_features + 2, 1, figsize=(8, 2 * (num_features + 2)))


   # Plot weights

   for i in range(num_features):

       ax = axes[i]

       ax.hist(weights_orig[:, i], bins=50, density=True, alpha=0.7, color='skyblue')

       ax.axvline(weights_orig[:, i].mean(), color='red', linestyle='--', label='mean')

       ax.set_title(f'Posterior of w[{i}] (original scale)')

       ax.legend()


   # Plot bias

   ax = axes[num_features]

   ax.hist(bias_orig, bins=50, density=True, alpha=0.7, color='lightgreen')

   ax.axvline(bias_orig.mean(), color='red', linestyle='--', label='mean')

   ax.set_title('Posterior of bias (original scale)')

   ax.legend()


   # Plot sigma

   ax = axes[num_features + 1]

   ax.hist(sigma_orig, bins=50, density=True, alpha=0.7, color='lightcoral')

   ax.axvline(sigma_orig.mean(), color='red', linestyle='--', label='mean')

   ax.set_title('Posterior of sigma (original scale)')

   ax.legend()


   plt.tight_layout()

   plt.show()





OUTPUT:

# We sampled the following values for W_0, W_1, W_2. So in this test case we assumed that the impact on ARR due to EPS is 0.68, the impact due to inflation is -0.39, and the impact due to money supply is 0.75.

[ 0.6816086  -0.39652464  0.75098336]

# The bias represents the expected value of Y when all predictors are 0; all effects from the features are additive deviations from this baseline. In this case we sampled 0.23 as our baseline. So if the other features were not impacting return, you could expect an ARR of 0.23. This is high, but that is because we chose 0.3 as the mean of the normal distribution that B is sampled from. In the real world we will learn this value from real data, so it will not be that high.

0.23742306

# The means of the learned distributions are very close to the values we sampled

# The standard deviations are also roughly close (note that the standard deviations were chosen as 0.1, 0.2, 0.15)

Weight w[0] (original scale) → mean: 0.6865, std: 0.0082

Weight w[1] (original scale) → mean: -0.4360, std: 0.1002

Weight w[2] (original scale) → mean: 0.7510, std: 0.0001


# bias is a bit…off

Bias (original scale) → mean: -0.6182, std: 2.8705


Deep Dive 2: Recoverability of Chosen Weights from Scaled Weights 

Why is our bias so off? 

Recall from our normalization deep dive how to recover the unnormalized bias:


B_orig = mean(Y) + std(Y) B_norm - SUM[i=1..3]{ W_orig[i] mean(X[i]) }                [2.1]


In theory, then, it should be possible to recover B_orig from B_norm. Let’s evaluate the RHS of the above expression with the results from our model.


B_orig = 16414.285
         + 229.9271 × [-2.9098275e-04  5.2468775e-05 -7.1851828e-05 ...  3.9109091e-05  6.5002852e-05  2.6932839e-04]
         - [16414.602 16414.912 16418.488 ... 16418.744 16415.883 16413.402]

       = 16414.285
         + [-0.06690482  0.01206399 -0.01652068 ...  0.00899224  0.01494592  0.0619259]
         - [16414.602 16414.912 16418.488 ... 16418.744 16415.883 16413.402]

       = [-0.3828125 -0.6152344 -4.21875 ... -4.4492188 -1.5820312  0.9453125]


Bias (original scale) → mean: -0.6182, std: 2.8705


This is not even close to our selection of 0.23 for B! 

Why is it so off, and is this expected? To gain some insight, let’s quantify the inverse function’s sensitivity to error by rewriting [2.1] as a function of W_norm and B_norm so that we can measure the change in B_orig when the learned W_norm or B_norm change.


f(W_norm, B_norm) = mean(Y) + std(Y) B_norm - SUM[i=1..3]{ W_orig[i] mean(X[i]) }

                  = mean(Y) + std(Y) B_norm - SUM[i=1..3]{ (std(Y)/std(X[i])) W_norm[i] mean(X[i]) }


We want to measure the change in B_orig. B_orig is the output, and the change of the output relative to changes in the input is the gradient of the function. That gives us the following sensitivities, in magnitude (see equation [1.1] from the normalization deep dive):


|∂f/∂B_norm| = std(Y)

|∂f/∂W_norm[i]| = (std(Y)/std(X[i])) mean(X[i])                          [2.2]


This means a small change ΔB_norm will change f by std(Y) ΔB_norm. std(Y) is 229.92, but since the samples of B_norm hover near zero with very little spread, changes in the normalized B have little effect on f in practice. The change in f due to a small change in W_norm[2], however, is (229.92 / 302) × 21682 × ΔW_norm[2] = 16507 × ΔW_norm[2]. So small changes in W_norm[2] result in 16507 times that change in f! This is quite amplified. This analysis gives us an intuition but not a measurement of recoverability. To measure recoverability we want the variance of f, which (treating the samples as independent) can be computed by the following equation:


var(B_orig) = var(f(W_norm, B_norm)) = std(Y)^2 var(B_norm) + SUM[i=1..3]{ ((std(Y)/std(X[i])) mean(X[i]))^2 var(W_norm[i]) }

            = 229.92^2 (0.00014232732)^2 + (229.92 × 3.58/0.33)^2 (0.00014217)^2 + (229.92 × 21681.98/302)^2 (0.00014155)^2 + (229.92 × 221.56/3.99)^2 (0.00014508)^2

            = 9.017

(The 0.000142... values are the standard deviations of the normalized posterior samples, squared here to obtain variances.) The variance of B_orig is 9.017, which translates to a standard deviation of around 3. This means the average sample deviates about three units from the mean, which is a large quantity when you expect a value near 0.23. The high variance is due to the large means of the features, as shown in the gradient calculations in [2.2].

This tells us that the uncertainty in the original bias propagated from the normalized weights is too high to reliably recover a meaningful value for the original bias. The good news is that this is not strictly necessary: as long as the variance of the model Y ~ N(W·X + B, σ^2) is low enough, the model is still considered useful.
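If you want to check this propagation against the samples directly, the formula is a couple of lines over the quantities already computed in Code 4 (a sketch assuming those variable names; the formula ignores correlations between the posterior samples, so expect agreement in order of magnitude rather than exact equality):

import jax.numpy as jnp

# Gradients from [2.2], evaluated with the fitted normalizer statistics
grad_b = y_std                          # |df/dB_norm|
grad_w = (y_std / X_stds) * X_means     # |df/dW_norm[i]|

# Propagated variance of the de-normalized bias, per the formula above
var_pred = grad_b**2 * jnp.var(bias_norm) + jnp.sum(grad_w**2 * jnp.var(weights_norm, axis=0))
print(var_pred, jnp.var(bias_orig))  # the two should be of the same order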


Plotting the samples drawn from the inference engine shows the distribution of posterior samples for each weight in the original scale. The width of a distribution corresponds to the posterior uncertainty of that weight. Overall the results look good: the weight distributions are relatively narrow, and the large feature means explain the wider bias distribution.


Step 5: Model Evaluation Via Posterior Convergence

The test from the previous section helps build an intuition for what the inference engine calculates, and it is a good starting point when you want to perform model introspection. In practice, though, we are working with real data, not synthesized data. This means we don’t have expected values for W and B. Instead, we judge whether the distributions the inference engine learned are meaningful by analyzing the posterior convergence of the model. This task is usually performed by inspecting two values: the R-hat and the effective sample size.


We know the MCMC chains have converged if the R-hat is close to 1 for all parameters and the effective sample size is sufficiently large, ideally at least 400 per parameter. Choose a sufficiently large chain count (4), sample size (2000), and warm-up size (1000) to achieve this.
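NumPyro reports both diagnostics from the fitted MCMC object. A minimal sketch, assuming the mcmc object built in Code 3:

# Prints mean, std, quantiles, n_eff, and r_hat for every parameter
mcmc.print_summary()

# Or programmatically, via the diagnostics module
from numpyro.diagnostics import summary

stats = summary(mcmc.get_samples(group_by_chain=True))
for name, s in stats.items():
    print(name, s["r_hat"], s["n_eff"])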


Deep Dive 3: R Hat

R-hat is a ratio of variances that indicates the convergence across a sequence of MCMC chains. Specifically, for a sampling with m chains of size n,


R̂ = sqrt(total_variance / within_chain_variance)                     [2]


between_chain_variance = (n/(m-1)) SUM[j=1..m] (μ_j - μ̄)^2

where μ_j is the mean of chain j and μ̄ is the mean of all the chain means.


within_chain_variance = (1/m) SUM[j=1..m] s_j^2

where s_j^2 is the sample variance of chain j.


total_variance = ((n-1)/n) within_chain_variance + between_chain_variance/n


Intuitively, if the variance within each chain is close to the variance across chains, then the chains were likely drawn from the same distribution. Thus we aim for an R-hat close to 1. Note that because the between-chain variance depends on the means of the chains, chains with different means produce a high between-chain variance, which inflates the numerator of [2] and leads to an R-hat greater than 1.
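These definitions translate directly into code. A sketch of the computation for an (m chains × n samples) array:

import numpy as np

def r_hat(chains):
    # chains has shape (m, n): m chains with n samples each
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    between = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)
    within = np.mean(chains.var(axis=1, ddof=1))
    total = (n - 1) / n * within + between / n
    return np.sqrt(total / within)

# Four chains drawn from the same distribution -> R-hat close to 1
rng = np.random.default_rng(0)
print(r_hat(rng.normal(0.0, 1.0, size=(4, 2000))))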




Deep Dive 4: Effective Sample Size

Note that in order to trust the R-hat we have to ensure that the effective sample size is sufficiently large. The Effective Sample Size (ESS) metric estimates the number of effectively independent samples in an MCMC chain. Intuitively, if each chain is 2000 elements in size, the ESS tells us roughly how many of those samples are effectively unique given the autocorrelation between samples.


ESS = n / autocorrelation_time

where n is the number of elements in a chain.


autocorrelation_time = 1 + 2 SUM[k=1..P] ρ_k,   for P sufficiently large


ρ_k = (1/(n-k)) SUM[t=1..n-k] (θ_t - μ)(θ_{t+k} - μ) / s^2


where s^2 is the variance of the chain, μ is the mean of the chain, and θ_t is the value of the t-th element of the chain.


ρ_k, the lag-k autocorrelation, measures how correlated a sample is with the sample k steps later. If samples are highly correlated then the autocorrelation time will be high, indicating that each sample is correlated with nearby samples.
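And the same recipe in code, truncating the autocorrelation sum at a fixed lag P (real implementations choose the cutoff adaptively; this is a sketch):

import numpy as np

def effective_sample_size(chain, P=100):
    n = len(chain)
    mu, s2 = chain.mean(), chain.var()
    # lag-k autocorrelations, per the definition above
    rho = [np.sum((chain[: n - k] - mu) * (chain[k:] - mu)) / ((n - k) * s2)
           for k in range(1, P + 1)]
    tau = 1 + 2 * np.sum(rho)  # autocorrelation time
    return n / tau

# Independent draws should give an ESS near the raw sample count
rng = np.random.default_rng(0)
print(effective_sample_size(rng.normal(size=2000)))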



Step 6: Making Predictions

In sections 4 and 5 we covered techniques for model introspection: is the model we built usable for collecting insights? But we have yet to actually use the model to collect insights! To add this capability, we add another method to our Predictor class, predict. This method accepts a feature matrix X and draws outcomes from the distribution defined by the MCMC chains we analyzed in the previous section.


Code 5: The predict method

def predict(self, X):

   # Normalize test inputs

   X_test_norm = self.normalizer.transform(X)

   weights = self.posterior['w']

   bias = self.posterior['b']


   y_preds_norm = jnp.dot(weights, X_test_norm.T) + bias[:, None]


   # Convert back to original scale

   y_preds = self.y_normalizer.inverse_transform(y_preds_norm.T)

   mean_preds = jnp.mean(y_preds, axis=1)

   std_preds = jnp.std(y_preds, axis=1)

   return mean_preds, std_preds
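Usage then looks like this (a sketch; the feature values are hypothetical and predictor is the trained object from the earlier steps):

import numpy as np

# One new observation: EPS, inflation, money supply (hypothetical values)
X_new = np.array([[225.0, 3.4, 21900.0]])
mean_preds, std_preds = predictor.predict(X_new)

# A rough 95% band around the posterior predictive mean
print(mean_preds[0] - 2 * std_preds[0], mean_preds[0] + 2 * std_preds[0])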


I called predict on the generated features, compared the predictions against the generated labels, and plotted the results, which were a mean and standard deviation for each prediction. In this case the standard deviation was so small it would not render on my plot, so I multiplied it by a scaling factor of 5000 so you can see the results conceptually.

I admit this is not super useful, since the predictions (mean in red and standard deviation in blue) almost exactly match the label (black), so it is difficult to see any delta between actual and predicted. But at least we understand now how to get predictions from our model. Now let’s leave the playground and see what information we can gather from the real world!

