I was preparing another blog series on how transformers work when the real world disrupted my focus and once again pulled me into the world of Bayesian inference. This time it was not geopolitical tensions that caught my attention but the relentless ascent of the S&P 500 despite the perceptibly turbulent social and political environment of the United States. Confused, I sought answers by trying to identify relationships between economic activity and the share prices of the largest American corporations. Assuming these relationships to be noisy, I reached for a tool that quantifies that noise as reported uncertainty. This blog post is a recounting of that journey, starting with a review of the tools I plan to use to learn the relationships of interest.
This first installment is introductory. A subsequent post will attempt to recreate the results of a decades-old case study with modern data. In a final installment, I will propose a hypothesis about these relationships in modern times and audit its credibility using the tooling from the first two installments.
Key:
Orange boxes are code
Yellow boxes are key ideas or bullets for the section
Purple boxes are deep dives and should be skippable if the reader does not want further details
Green boxes contain definitions and equations (not code)
Background
To begin, let us review how Bayesian inference works. Imagine we have some measurable properties and an unmeasurable property of interest. For example, we might know the current rate of inflation but we do not know the median annualized rate of return of an equity in the S&P 500 five years from now.
If we assume the relationship between these properties is linear we can write a simple linear regression equation to express this:
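In its simplest form (one measured property $x$, such as the inflation rate), the expression is just a line with an unknown slope and intercept:

```latex
y = W x + B
```

Here $W$ is the weight and $B$ the bias we want to estimate.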
We could estimate W and B by fitting the expression to data, using gradient descent to learn the values of W and B that minimize the loss between predictions and observations in our dataset. This process gives us a single best-fit value for each of W and B.
How confident are we that these estimates accurately capture the underlying relationship between the rate of inflation and the rate of return? The loss only measures the accuracy of the estimate relative to our dataset, and it is reasonable to assume we have not accounted for all the variables that affect the rate of return.
Bayesian modeling enables us to quantify the uncertainty due to these missing signals. By treating W and B as random variables rather than point estimates, we learn a distribution rather than a single value for each variable. A distribution is a more faithful representation of a variable in a complex environment, because its value is subject to variation from influencing factors we cannot fully measure.
Data Collection
Let us start with a very basic example: assume there is a linear relationship between three fundamental metrics and the annualized rate of return of an equity.
The annualized rate of return (ARR) of an equity is the geometric average rate at which the investment grows per year over a given period, assuming compounding. If an equity's value changes from P_0 to P_T, the annualized rate of return is defined by

r_a = (P_T / P_0)^(1/T) - 1

where T is the number of years. We will define our time period on a monthly basis, so T = 1/12 for our toy example.
Next, let us look at appropriate independent variables. These are the values we will use to try to predict the dependent variable, the ARR. Since we are predicting monthly patterns, we need data that also changes monthly. I have chosen the following:
Inflation. The mean inflation rate over the last year was 3.58, with variance 0.11 and standard deviation 0.33. Inflation is measured in percent per year.
Money supply. From September 2024 to September 2025 the mean money supply in the United States was 21,681.98 billion dollars, with a variance of 91,575 and a standard deviation of 302.
Earnings per share (a measure of profitability per share). For the October 2024 to October 2025 year the mean earnings per share was 221.56, with variance 15.9 and standard deviation 3.99.
Defining the Model
Next let us transform our relationship defined in [1] into a Bayesian model:
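Spelled out (a sketch; the exact priors are a modeling choice), each weight and the bias get a prior distribution, and the observed return gets a likelihood centered on the linear mean:

```latex
W_j \sim \mathcal{N}(\mu_{0,j},\, \sigma_{0,j}), \qquad
B \sim \mathcal{N}(\mu_1,\, \sigma_1), \qquad
r_a \mid X \sim \mathcal{N}\!\Big(\textstyle\sum_j W_j X_j + B,\; \sigma\Big)
```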
We need to encode this into a program so we can train it. I have chosen to use NumPyro.
Step 1: Encode the model.
This is a function that accepts the independent variables X and returns a distribution.
Step 2: Generate Data.
Before we train the model by fitting real data to the observations, it is useful to synthesize data first so we can test that the model learns correctly. We synthesize the data by selecting values for our random variables μ_0,j, σ_0,j, μ_1, σ_1, W_j, and B to create a distribution, and then sampling N values from that distribution. Here I sample data from distributions parameterized with values from real data (see the previous section for the mean and standard deviation listings for earnings, inflation, and money supply) because I want the unit test to be as realistic as possible.
Note that the features have vastly different scales and will need to be normalized before training. Otherwise we could have gradient instability that could cause divergences.
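A sketch of the synthesis step. The feature means and standard deviations come from the data listed earlier; the "true" weights, bias, and noise level are arbitrary illustrative choices for the unit test:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# Feature means/stds from the listed data: inflation, money supply, EPS.
means = np.array([3.58, 21681.98, 221.56])
stds = np.array([0.33, 302.0, 3.99])
X = rng.normal(means, stds, size=(N, 3))

# "True" parameters chosen for the unit test (illustrative, not estimated).
W_true = np.array([-0.02, 0.001, 0.005])
B_true = 0.1
noise_sigma = 0.05
y = X @ W_true + B_true + rng.normal(0.0, noise_sigma, size=N)

# Standardize: the three columns differ by orders of magnitude, which
# would otherwise destabilize the sampler's gradients.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```

After standardization every column has mean 0 and standard deviation 1, so no single feature dominates the gradient.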
Step 3: Fit the Model
We then fit these samples to our model and ask the inference engine to learn the values for the random variables given the observations we generated. These values should match the values we chose to generate the samples. The distribution learned by the inference engine is called the posterior distribution because it models our weight and bias beliefs after observing the data.
I wrapped the training code in a Predictor class so that the normalizer is paired with the posterior distribution. Markov chain Monte Carlo (MCMC) is the algorithm we use to approximate the often intractable posterior distribution. MCMC depends on a sampler to propose samples from the posterior. The No-U-Turn Sampler (NUTS) is a Hamiltonian Monte Carlo variant that uses Hamiltonian dynamics to propose new states (i.e., a snapshot of the model's random variables).
Step 4: Testing the Model
We have our model, the data, and the inference engine. We can’t make predictions yet but we can test that the model learns appropriate distributions for our random variables.
Below is the full code and the output. Let’s step through what the output means so we can verify that the model inference is working as expected.
Step 5: Model Evaluation Via Posterior Convergence
The test from the previous section helps build intuition for what the inference engine calculates, and it is a good starting point for model introspection. In practice, though, we work with real data, not synthesized data, which means we do not have expected values for W and B. Instead, we judge whether the distributions the inference engine learned are meaningful by analyzing the posterior convergence of the model. This is usually done by inspecting two diagnostics: R-hat and the effective sample size.
Step 6: Making Predictions
In steps 4 and 5 we covered techniques for model introspection: is the model we built usable for collecting insights? But we have yet to actually use the model to collect insights! To add this feature, we need another method on our Predictor class, predict. This method will accept a feature vector X and draw outcomes from the distribution defined by the MCMC chains we analyzed in the previous section.
I called predict on the generated features, compared the predictions against the generated labels, and plotted the results: a mean and a standard deviation for each prediction. In this case the standard deviation was so small it would not render on my plot, so I scaled it by a factor of 5000 so you can see the results conceptually:
I admit this is not super useful, since the predictions (mean in red, standard deviation in blue) almost exactly match the labels (black), so it is difficult to see any delta between actual and predicted. But at least we now understand how to get predictions from our model. Now let's leave the playground and see what information we can gather from the real world!