1 Motivation
2 Defining The Problem
Let’s begin by defining the problem we wish to solve. Problem definitions require identifying three things:
Given information. This is the data we start with before any computation. For our modeling problem, we will need at a minimum data relevant to the prosperity of a nation along with trade policy data.
Unknown. The unknown is the set of values you want to identify. For a modeling problem, the unknowns are:
- the templated function we wish to learn, which defines the relationship between the variables in our data sets, and
- the values of the weights that render the templated function a specialized, concrete function characterized by our data.
Constraint. The constraint is a binary function over the unknown and given values: it decides whether a given outcome is acceptable. In the context of modeling, a reasonable constraint is an acceptance criterion on the error between the outcomes the model produces and the real-world outcomes it was trained on. The function measuring the error between the model's outcomes and reality is called the loss function.
The concrete problem definition is stated below.
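In outline:
Given: historical data relevant to the prosperity of a nation, along with trade policy data.
Unknown: the templated function relating trade policy to prosperity, and the weight values that specialize it to our data.
Constraint: the loss between the model's predicted outcomes and the observed real-world outcomes falls below an acceptance threshold.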
Let us consider the practicality of our problem. We are trying to learn a relationship between trade policy and prosperity but we have excluded intermediate economic mechanisms that may have more direct causal influence over prosperity outcomes.
Since tariff rates do not directly impact prosperity, we introduce an additional layer of economic indicators as mediating variables. Foreign policy changes, such as tariff adjustments, influence these indicators, which in turn affect overall prosperity. Explicitly modeling these intermediate variables helps prevent misattributing changes in prosperity to the wrong cause. Let’s redefine our problem with this intermediate layer in mind.
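In outline:
Given: data on trade policy, intermediate economic indicators, and prosperity.
Unknown: two templated functions and their weights; the first relates trade policy to the economic indicators, the second relates the economic indicators to prosperity.
Constraint: the loss between predicted and observed outcomes falls below an acceptance threshold.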
Now that we have an idea of what we need to create (the templated functions), what we need to compute (the weights) and how to measure the accuracy of the model, let us start to collect our data.
3 Defining the Model: Foreign Policy Effects on Economic Indicators
Let us progress to our next step: creating the templated function. In our final problem definition we had two templated functions to identify. The first related trade policy to economic indicators and the second related economic indicators to prosperity. We shall start with the first: relating trade policy to economic indicators.
Let us simplify the modeling task by selecting a single economic indicator. We will model the relationship between trade policy and the price of goods before expanding to other economic indicators such as wages and employment, tax revenues, and investment and productivity. Our goal then is to express prices as a function of tariff policy, so that we can learn how markets adapt to tariff changes over time.
Intermediate Variables
For example, imagine a training data set where high tariffs are paired with low prices and low import volume. Without modeling import volumes explicitly, the model might incorrectly learn that high tariffs cause prices to fall, when the true causal story might be that high tariffs increase prices when volume is high and have little effect when volume is low.
While it’s impossible to include every influencing factor, we can use economic reasoning to isolate the most important ones. In the model relating foreign policy to economic indicators, we will start by explicitly including import volume. Because volume depends on price, we will model volumes as a function of the previous year’s prices.
Selecting volume as an intermediate variable gives us a starting point for building a model, but at this stage we cannot confirm with certainty that we’ve captured all relevant intermediate variables. Once we have a functioning model with our variables of choice, we can test for key variable omission using the following techniques:
Economic Consistency Checks. After training the model we consider a real-world economic situation outside the data set we trained the model on. For example, suppose we know that in 1970 tariffs on coffee skyrocketed and the cost of coffee shot up because it could not be domestically produced, yet strong demand for the bean kept import volumes high. We could suppose that this happened again this year and ask our model to predict the price of coffee next year given the tax increase. If the model predicts that the price will fall, we have reason to believe that we are missing some governing variables in our model, specifically the ability of domestic producers to offer a less expensive option.
Sensitivity analysis. We consider other variables to add to the model, such as domestic production capacity and consumer demand. If the model’s predictions are sensitive to a variable’s inclusion, then the model was likely missing an important causal pathway.
Function Shape
We have identified the variables in our templated function but not the operators that join them into a prediction. Importantly, we will assume a nonlinear relationship between tariffs and volumes, and between volumes and prices, to reflect saturation effects (once volume reaches zero, further taxation has no effect) and behavioral thresholds (sudden drops in volume when a large number of consumers are no longer willing to pay a higher price).
I have chosen to use a Bayesian Neural Network rather than a traditional neural network because I would like to quantify prediction uncertainty. For example, under certain combinations of import volume and trade policy, prices may become volatile, making them more difficult to predict. This is not a limitation of modeling but a reality of the physical world. A traditional model may opaquely overfit to this noisy data. A Bayesian neural net, however, will include uncertainty in its prediction, giving us insight into how much confidence to place in its predictions.
The final model is specified below.
Figure 0
4 Model Implementation
The next step is to implement the model. First let us offer more details about Bayesian Neural Nets.
Definition
A Bayesian Neural Network (BNN) extends a traditional neural network by treating its weights and biases as probability distributions conditioned on data, rather than fixed point estimates.
In a traditional neural network, inference is performed by computing a linear mapping with fixed-value weights, followed by an activation function to handle nonlinearities. Concretely, each layer computes the following:
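z = φ(W·x + b)

where W is a fixed weight matrix, b a fixed bias vector, and φ the activation function. In a BNN, the weights and biases are instead random variables. Here, as in the code later in this section, each is given a Normal distribution:

W ~ Normal(μ_W, σ_W²),  b ~ Normal(μ_b, σ_b²)    [1]

Each forward pass s draws a fresh sample from these distributions and applies the usual layer computation:

f(s)(x) = φ(W(s)·x + b(s))    [2]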
Since each forward pass draws new samples W(s), b(s), the output becomes a random variable: each pass s yields a different value f(s)(x), drawn from a distribution of plausible outcomes. Repeating this process yields a predictive distribution over outputs. That is, a single query to a BNN requires multiple passes to build the outcome distribution.
By computing a distribution rather than a single point estimate for a prediction, a Bayesian model provides both the predicted value (mean of the distribution) and its uncertainty (the variance of the distribution). For example, if the relationship between prices and tariffs is volatile, a single predicted price may be misleading. In contrast, a full predictive distribution—potentially with wide or fat tails—captures the range of likely outcomes and their probabilities. This allows us to assess not just the expected price, but also how uncertain that estimate is, enabling more risk-aware decisions.
Architecture
A Bayesian neural net has a slightly different architecture than a traditional neural net during training. Instead of a single model function to optimize, the Bayesian neural net uses two. The first function, the model, specifies the generative process: how outputs f(s) are generated given inputs x and latent parameters W, b. The model function corresponds to equation [2] in the previous section. The second function, called the guide, represents the posterior of the weights under the observed data. The posterior distribution is the distribution of a random variable given observations. It is defined as
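p(θ | data) = p(data | θ) · p(θ) / p(data)    [3]

where p(data | θ) is the likelihood of the observations given the weights, p(θ) is the prior over the weights, and p(data) is the evidence.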
For example, in the context of our model we want the distribution of the weights conditioned on the observed tariff policies, import volumes, and prices; it is this distribution over weights that in turn yields a distribution over predicted prices.
Computing [3] directly is intractable so we approximate it using variational inference. Variational inference is a technique that transforms the problem of computing a complicated posterior distribution p(θ | data) into an optimization problem.
The optimization problem we solve instead of computing the posterior directly is the following:
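ɸ* = argmin over ɸ of KL( q(θ; ɸ) ‖ p(θ | data) )    [4]

That is, we search a family of tractable distributions q(θ; ɸ), parameterized by ɸ, for the member closest (in KL divergence) to the true posterior.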
The guide function defines q(θ; ɸ), the variational approximation to the true posterior, and corresponds to equation [1] in the previous section. To create the BNN, then, we define our model and guide functions and solve the above optimization problem to choose the parameters of our approximated posterior distribution (equation [1]). To illustrate how the guide and the model are used together to create a predictive function, I have included the following pseudo code, a simplified sketch of how a model built atop the NumPyro framework operates.
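```python
# Pseudo code only: run_and_record, run_with_fixed_weights, elbo, and
# update_variational_parameters are illustrative stand-ins for NumPyro
# machinery, not real API calls.

def svi_step(model, guide, X, Y):
    # 1. Run the guide, recording a sampled value for every weight site.
    sampled_weights = run_and_record(guide, X, Y)
    # 2. Replay these samples into the model and compute the log probabilities
    #    of the weights under the priors and of the data under the likelihood.
    log_probs = run_with_fixed_weights(model, sampled_weights, X, Y)
    # 3. Combine the log probabilities into the ELBO and step the optimizer.
    return update_variational_parameters(-elbo(log_probs))

def predictive(model, guide, X, num_samples):
    # Prediction: sample many weight sets from the trained guide, run the
    # model forward with each, and collect the outputs into a distribution.
    return [run_with_fixed_weights(model, run_and_record(guide, X), X)
            for _ in range(num_samples)]
```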
What does it mean to “replay these samples into the model and compute the log probabilities”? This is getting into NumPyro internals, but it does help to understand how this model learns, so I will include it. To make the explanation clear, I will include simplified code for the model and the guide, reduced to a single weight site to keep the traces below readable.
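```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def model(X, Y=None):
    # Prior over the single weight relating the input to the output.
    W1 = numpyro.sample("W1", dist.Normal(0.5, 0.1))
    mean = W1 * X
    # Fixed-variance Gaussian likelihood over the observed outputs.
    numpyro.sample("Y", dist.Normal(mean, 0.1), obs=Y)

def guide(X, Y=None):
    # Variational posterior for W1: a Normal with learnable loc and scale.
    loc = numpyro.param("W1_loc", 0.0)
    scale = numpyro.param("W1_scale", 0.1,
                          constraint=dist.constraints.positive)
    numpyro.sample("W1", dist.Normal(loc, scale))
```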
Step 1: Trace the Guide
NumPyro traces the guide to collect samples for the weight values. The trace is represented as a dictionary that records what was sampled (for example “W1”), from what distribution (“Normal(loc, scale)”), and the result of the sampling (“value”). The trace has something like the following structure:
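```python
# Simplified view; a real NumPyro trace is an OrderedDict whose sites
# carry additional bookkeeping fields. Values shown are illustrative.
guide_trace = {
    "W1": {
        "type": "sample",
        "fn": Normal(loc=0.48, scale=0.09),  # distribution sampled from
        "value": 0.51,                       # the sampled weight
    },
    # ...one entry per sample site in the guide
}
```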
Step 2: Replay the model with the trace
Execute the model function, but replace sampling statements in the model with results from the trace where applicable. For example, suppose the model samples “W1” from the prior Normal(0.5, 0.1), written something like this:
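```python
W1 = numpyro.sample("W1", dist.Normal(0.5, 0.1))
```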
Then during replay NumPyro sees that “W1” is in the guide trace, so instead of sampling from Normal(0.5, 0.1) it uses the value from trace["W1"]["value"].
The result of the model trace has something like the following form:
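```python
# Simplified view; log_prob is computed per site as
# site["fn"].log_prob(site["value"]). Values shown are illustrative.
model_trace = {
    "W1": {
        "type": "sample",
        "fn": Normal(0.5, 0.1),   # the prior
        "value": 0.51,            # replayed from the guide trace, not resampled
        "log_prob": 1.38,         # log p(W1) under the prior
    },
    "Y": {
        "type": "sample",
        "fn": Normal(mean, 0.1),  # likelihood given the replayed weights
        "value": Y,               # the observed data
        "is_observed": True,
        "log_prob": -31.2,        # log p(Y | X, W1): the key quantity
    },
}
```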
Of particular importance is the log_prob value at the "Y" site. This value represents the log-likelihood of observing Y given the input X and the weights (which are fixed via replay).
This is the quantity we aim to maximize during training. A higher value indicates that, under the current model (with the sampled weights), the observed data is more likely — meaning the model better explains the observed outcomes.
Step 3: Compute Evidence Lower Bound (ELBO)
The evidence lower bound is the objective function we maximize during training. It is a lower bound on the log-likelihood of the observed data Y conditioned on the known inputs X, written log p(Y|X). We want to maximize log p(Y|X), but it is intractable to do so directly. Below we work step by step toward the expression that is feasible to maximize instead: the lower bound (ELBO).
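Start from the marginal likelihood, which integrates the likelihood over all possible weights under the prior:

log p(Y|X) = log ∫ p(Y|X, W) · p(W) dW    [5]

Multiply and divide the integrand by the variational distribution q(W; θ):

log p(Y|X) = log ∫ q(W; θ) · [ p(Y|X, W) · p(W) / q(W; θ) ] dW    [6]

The integral is now an expectation under q(W; θ). Because log is concave, Jensen's inequality lets us move it inside the expectation, turning the equality into a lower bound:

log p(Y|X) ≥ ∫ q(W; θ) · log[ p(Y|X, W) · p(W) / q(W; θ) ] dW    [7]

Expanding the log of the ratio gives the ELBO:

ELBO(θ) = ∫ q(W; θ) · [ log p(Y|X, W) + log p(W) − log q(W; θ) ] dW    [8]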
Let’s consider how the loss is computed. The integral in [8] is calculated in practice using numerical methods, Monte Carlo sampling in our case. That is, we compute the following:
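ELBO(θ) ≈ (1/K) · Σ_{i=1..K} [ log p(Y|X, W[i]) + log p(W[i]) − log q(W[i]; θ) ],  W[i] ~ q(W; θ)    [9]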
To evaluate [9], we proceed as follows (a code sketch follows the steps):
Sample K weight vectors W[i] from the variational distribution q(W; θ).
For each sample W[i]:
Bind W[i] to the model weights.
Compute the model output Ŷ[i] = f(X; W[i]).
Evaluate the log-likelihood log p(Y | X, W[i]) by assuming a fixed-variance Gaussian likelihood: Y ~ N(Ŷ[i], σ²), with σ held constant.
Compute the log prior probability log p(W[i]).
Compute the log variational probability log q(W[i]; θ).
Aggregate the terms:
Compute the average over K samples of log p(Y | W[i]) + log p(W[i]) − log q(W[i]; θ).
Negate the result to form the loss (since most optimizers minimize rather than maximize).
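Here is a minimal sketch of this loop in JAX; sample_q, log_q, log_prior, and predict are hypothetical stand-ins for the guide's sampler and the model's densities.

```python
import jax
import jax.numpy as jnp

def elbo_loss(rng_key, X, Y, theta, K=10, sigma=0.1):
    # sample_q, log_q, log_prior, and predict are hypothetical helpers.
    total = 0.0
    for key in jax.random.split(rng_key, K):
        W = sample_q(key, theta)              # W[i] ~ q(W; theta)
        Y_hat = predict(X, W)                 # model output f(X; W[i])
        # Fixed-variance Gaussian log-likelihood log p(Y | X, W[i]).
        log_lik = jnp.sum(-0.5 * ((Y - Y_hat) / sigma) ** 2
                          - jnp.log(sigma * jnp.sqrt(2 * jnp.pi)))
        total = total + log_lik + log_prior(W) - log_q(W, theta)
    # Average over the K particles and negate, since optimizers minimize.
    return -total / K
```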
For example, I trained my model with K = 10 (sometimes referred to as the ‘particle count’) and 100 observations (N = 100). I found that it was difficult to reduce the loss beyond around 300. But what does this number 300 really mean?
For K = 10 particles and N = 100 observations, the mean loss contribution is −300/(10·100) = −0.3 per observation. A per-observation log-likelihood of −0.3 corresponds to a likelihood of e^(−0.3) ≈ 0.74, which is reasonably high under a Normal likelihood. It is unreasonable to expect any fixed set of weights to perfectly explain all observations, given that the data-generating process is noisy. The goal is not to fit the noise, but to capture the underlying distribution of responses. Thus a log-likelihood of −0.3 per observation may be considered acceptable.
Training Risks
Gradient descent produces an approximation of the minimum by iteratively updating a candidate x by the negative of its gradient, -▽f(x), scaled by a step size η. A smooth and stable surface with a single global minimum is thus ideal for this type of algorithm.
In practice the optimization surface may have high curvature, non-convexity, or many local minima. This section identifies symptoms of each of these problems and how to address them. In the next section we will show the full code example that incorporates remedies to each of these problems.
Problem 1: Parameter Value Gets “Stuck”.
Variance Across Minibatches.
In stochastic gradient descent, each update uses the gradient computed on a minibatch as an estimate of the gradient of the loss over the entire dataset. If the variance of the minibatch gradients is high, then the gradient of any one minibatch is unlikely to be a sufficient estimate of the full gradient. As a result, updating the parameter value with the gradient from one subset may reduce the loss contributions of that subset but increase the contributions of another, resulting in a loss that fails to converge.
A solution for this problem is increasing the minibatch size or ensuring that each minibatch has the same distribution as the full dataset. In my toy problem I did not use minibatching, so this was not an issue for me. However, as I scale the problem and start to rely on real-world data, I will need to incorporate this technique.
Variance Across Training Steps (Across time).
This type of gradient variance is computed over the set of gradients collected across the entire course of training, where each gradient in the set was calculated at a unique parameter value and time step. A low variance represents a fairly stable and smooth optimization landscape, while a high variance suggests a more volatile one, potentially with sharp curvature, frequent oscillations, or numerous local minima.
The problem with high variance is that it can cause updates in the wrong direction. Let’s look at two plots to illustrate the difference visually: the first shows an optimization landscape with high variance, the second one with low variance.
High variance among gradients across time steps is a particularly thorny problem when training Bayesian neural nets relative to traditional ones. In a traditional neural net, the gradient used to update the parameters is an estimate ĝ(t) computed from the minibatch. It can be written ĝ(t) = g(t) + 𝛜1(t), where g(t) is the full gradient and 𝛜1(t) is a zero-mean noise term whose variance is the variance of the minibatch estimator.
In Bayesian neural nets, not only do we have the gradient noise 𝛜1(t) from minibatch selection, but we also have noise 𝛜2(t) due to the parameter W being a random variable rather than a fixed point estimate. Note that in equation [9] we sample K parameter values from the variational distribution q(W; θ). If the variance of that distribution is high, the K samples will not produce a gradient representative of the full parameter space. That is, in the Bayesian neural net training loop we compute ĝ(t) = g(t) + 𝛜1(t) + 𝛜2(t). Because of this extra noise term, the risk of divergence when training a BNN is higher than in a traditional neural net.
Problem 2: NaN Parameters.
It’s common to observe a point in the training loop where a parameter becomes NaN and stays there. The parameter usually becomes NaN because either the gradient of the loss function grows too large or we have produced an input that is undefined under our loss function.
Exploding Parameter Norms
Recall the gradient descent update function:
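x_{t+1} = x_t − η·∇L(x_t)    [10]

If the gradient ∇L(x_t) is large, the update is large, and repeated large updates can grow the parameter norm without bound until it overflows to NaN.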
Exploding parameter norms can be diagnosed by inspecting the Hessian of the loss function. The Hessian of a function L quantifies the rate of change of its rate of change, i.e. the second derivative. By inspecting the sign and magnitude of the eigenvalues of the Hessian, we can predict whether the model will diverge during gradient descent. If any eigenvalue has a large magnitude, the slope of the loss function is increasing or decreasing sharply in the corresponding direction, possibly producing values too large to represent in the floating-point precision of our training loop, hence the NaN. If any eigenvalue is negative, then the loss has no local minimum along that direction, and the parameter value in that direction can diverge.
For example, consider a loss function that resembles a saddle: L(x, y) = x² − y². In the x direction it has a valley, but in the y direction it is more like a hill. The shape is crudely illustrated below.

The gradient is ∇L(x, y) = (2x, −2y), so update rule [10] gives:

x_{t+1} = x_t − 2η·x_t,   y_{t+1} = y_t + 2η·y_t

The y update multiplies y_t by (1 + 2η) at every step, so y grows geometrically: the hill direction has no minimum and the parameter diverges.
The Hessian can be used to guide choices like step size, but it is expensive to compute and does not guarantee the prevention of exploding gradients. Instead we will use gradient clipping, the bounding of gradients to a maximum value.
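In NumPyro this is a one-line change: swap the optimizer for ClippedAdam, which clips each gradient value into [−clip_norm, clip_norm] before the update. The step size and clip value below are illustrative.

```python
import numpyro.optim as optim
from numpyro.infer import SVI, Trace_ELBO

# ClippedAdam clips each gradient value into [-clip_norm, clip_norm]
# before applying the Adam update.
optimizer = optim.ClippedAdam(step_size=1e-3, clip_norm=10.0)
svi = SVI(model, guide, optimizer, loss=Trace_ELBO(num_particles=10))
```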
Undefined Values
Input scaling, as we suggested earlier, can also combat NaN gradients because it addresses underflow and overflow by moving extreme values into a more reasonable range. I also found that selecting appropriate initial values was important; otherwise the first proposed value can have a likelihood low enough to NaN out the log expression. Lowering the variance of the initial normal distributions for the variational parameters solved this.
Problem 3: Convergence is too Slow
Note the parameter η in our gradient descent equation [10]. Its value determines the size of our steps during training: too small and training takes too long; too large and we may jump past the minimum. Intuitively, we want to take larger steps where the curvature of the loss function is shallow and smaller steps where it is steep. Thus using the quantity that measures curvature, the Hessian, to adjust the step size is an effective way to keep steps appropriately sized.
Training Debugging
To test training I generated synthetic data from generators I control; the two plots below summarize it. The first illustrates that as tariff rates decrease, prices also decrease. The second illustrates the increase in import volumes as tariff rates decrease. This is admittedly a very simple, almost linear relationship and probably does not need the uncertainty and nonlinear modeling of the BNN...but we want to start with something simple! Now let us see if the model learns this relationship.
I started with a relatively simple training loop; a minimal version using NumPyro's SVI looks like this:
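```python
import jax.random as random
import numpyro.optim as optim
from numpyro.infer import SVI, Trace_ELBO

# X_train, Y_train: the synthetic tariff inputs and observed prices.
svi = SVI(model, guide,
          optim.ClippedAdam(step_size=1e-3, clip_norm=10.0),
          loss=Trace_ELBO(num_particles=10))  # K = 10 particles

state = svi.init(random.PRNGKey(0), X_train, Y_train)
for step in range(5000):
    state, loss = svi.update(state, X_train, Y_train)
    if step % 500 == 0:
        print(f"step {step:5d}  loss {loss:.1f}")

params = svi.get_params(state)
```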
To visualize the trained model, I queried it for price predictions. I created a policy in the form of an array of tariffs increasing over the next 20 time steps, and used the output of each time step as the input to the following query. I then plotted the results: the mean is a hard line and the variance is a faded band around it.
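Here is a sketch of that rollout using NumPyro's Predictive. The tariff schedule and starting price are illustrative; in the full model, the previous step's predicted price and volume would also be packed into each query's input.

```python
import jax.numpy as jnp
import jax.random as random
from numpyro.infer import Predictive

predictive = Predictive(model, guide=guide, params=params, num_samples=200)

tariffs = jnp.linspace(0.05, 0.40, 20)  # policy: tariffs rising over 20 steps
price = 1.0                             # illustrative starting price
means, stds = [], []
for t, tariff in enumerate(tariffs):
    X_t = jnp.array([tariff])           # full model would include `price` here
    samples = predictive(random.PRNGKey(t), X_t)["Y"]
    price = float(samples.mean())       # feed the prediction into the next step
    means.append(price)
    stds.append(float(samples.std()))
# Plot `means` as the hard line and ±`stds` as the faded band.
```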
The model correctly predicted the relationship of prices increasing when import volumes plummet. Because I control the data generators I can test more complex relationships and also test that the variance is learned correctly. Now that I have a model that learns from synthetic data, I can
- Add other economic indicators to this model
- Build out the second tier of the model that relates the economic indicators to prosperity
- Create more complex relationships in the test data to further test the model's ability
- Train the model on real world data