Bayesian Analysis of Coefficient Instability in Dynamic Regressions

This paper proposes a Bayesian regression model with time-varying coefficients (TVC) that makes it possible to estimate jointly the degree of instability and the time-path of regression coefficients. Thanks to its computational tractability, the model proves suitable for performing the first (to our knowledge) Monte Carlo study of the finite-sample properties of a TVC model. Under several specifications of the data generating process, the proposed model's estimation precision and forecasting accuracy compare favourably with those of other methods commonly used to deal with parameter instability. Furthermore, the TVC model leads to only small losses of efficiency under the null of stability and is robust to mis-specification, performing satisfactorily also when regression coefficients experience discrete structural breaks. As a demonstrative application, we use our TVC model to estimate the exposures of S&P 500 stocks to market-wide risk factors: we find that a vast majority of stocks have time-varying risk exposures and that the TVC model helps to forecast these exposures more accurately.


Introduction
There is widespread agreement that instability in regression coefficients represents a major challenge in empirical economics. In fact, many equilibrium relationships between economic variables are found to be unstable through time (e.g., Stock and Watson, 1996).
There are two main approaches to addressing instability in regression coefficients: 1. formulating and estimating regression models under the hypothesis of constant coefficients, testing for the presence of structural breaks (e.g., Chow, 1960, Brown, Durbin and Evans, 1975, Nyblom, 1989) and identifying the breakpoints (e.g., Andrews, Lee and Ploberger, 1996, Bai and Perron, 1998); 2. formulating regression models with time-varying coefficients (TVC) and estimating the path of their variation (e.g., Doan, Litterman and Sims, 1984, Stock and Watson, 1996, Cogley and Sargent, 2001). Approach (1) allows one to search for time spans over which the hypothesis of constant coefficients is not rejected by the data. However, it can happen that regression coefficients change so frequently that the hypothesis of constant coefficients does not fit any time span (or only time spans that are too short to be of any interest to the econometrician). In these cases, approach (2) can be utilized, as it is suitable for dealing also with frequently changing coefficients. On the other side of the coin, approach (2) often relies on dynamic specifications that are (at least in theory) not suitable for detecting infrequent and abrupt changes in regression coefficients.
In the absence of strong priors about the ways in which relationships between variables change, the two approaches can arguably be considered complementary and it seems reasonable to use them in conjunction. However, approach (1) is apparently much more frequently utilized than approach (2) in empirical work (e.g., Kapetanios, 2008).
One possible reason why TVC models are less popular is that tests for structural breaks are often quite easy to implement, while specifying and estimating TVC models is usually a difficult task that relies on complex and computationally intensive numerical techniques and requires careful specification of the dynamics of the coefficients. Even if the development of Markov chain Monte Carlo (MCMC) methods has somewhat facilitated the estimation of TVC models (e.g., Carter and Kohn, 1994, and Chib and Greenberg, 1995), the technical skills and the computing time required by these techniques are still far superior to those required to estimate regressions with constant coefficients.
In this paper, we propose a Bayesian TVC model that aims to fill this gap. The model has low computational requirements and makes it possible to compute analytically the posterior probability that the regression is stable, the estimates of the regression coefficients and several other quantities of interest. Furthermore, it requires minimal input from the econometrician, in the sense that priors are specified automatically: in particular, the only inputs required from the econometrician are regressors and regressands, as in plain-vanilla OLS regressions with constant coefficients.
Another possible reason why TVC models are less popular than OLS-based alternatives is that the properties of the former are thus far largely unknown, while the latter have been extensively studied both theoretically (e.g., Bai and Perron, 1998) and by means of Monte Carlo simulations (e.g., Hansen, 2000, and Perron, 2006). Thanks to the computational tractability of our TVC model, we are able to perform the first (to our knowledge) Monte Carlo study of the finite-sample properties of a TVC model.
The main goal of our Monte Carlo study is to address the concerns of an applied econometrician who suspects that the coefficients of a regression might be unstable, does not know what form of instability to expect and needs to decide what estimation strategy to adopt.
The first concern we address is the loss of efficiency under the null of stability. Suppose my data have indeed been generated by a regression with constant coefficients; how much do I lose, in terms of estimation precision and forecasting accuracy, when I estimate the regression using the TVC model in place of OLS? Our results suggest that the losses from using the TVC model are generally quite small and comparable to the losses from using frequentist breakpoint detection procedures, such as Bai and Perron's (1998 and 2003) sequential procedure and its model-averaging variant (Pesaran and Timmermann, 2007). Under most simulation scenarios, the mean squared estimation error increases by about 5 per cent when one of the proposed TVC estimators is used in place of OLS to estimate the coefficients of a stable regression.
Another concern is robustness to mis-specification. Suppose my data have been generated by a regression with a few discrete structural breaks; how much do I lose from using the TVC model instead of standard frequentist procedures for breakpoint detection? Our Monte Carlo evidence indicates that in this case too the estimation precision and the forecasting accuracy of the TVC model are comparable to those of standard frequentist procedures.
Finally, a third concern is efficiency under the null of instability. Even in the presence of frequently changing coefficients, does the TVC model provide better estimation precision and forecasting performance than other, possibly mis-specified, models? We find that it generally does and that in some cases this gain in efficiency can be quite large (TVC can reduce the mean squared estimation error by up to 60 per cent with respect to the best-performing OLS-based method).
All in all, the TVC model seems to be a valid complement to frequentist procedures for breakpoint detection, as the performances of the two approaches are, in general, comparable, but the TVC model fares better in the presence of frequently changing coefficients. There is, however, an important exception to this general result: when the regression includes a lag of the dependent variable and the autoregressive coefficient is near unity. In this case, the performance of the TVC model degrades steeply, and so, but to a lesser extent, does the performance of frequentist methods for breakpoint detection. We argue that this phenomenon is due to an identification problem (already pointed out in similar contexts by Hatanaka and Yamada, 1999, and Zhu, 2005) which can be alleviated by adding more regressors or increasing the sample size.
The Monte Carlo study is also complemented by a brief demonstration of how the TVC model can be applied to a real-world empirical problem. We consider a regression commonly employed to estimate how stock returns are related to market-wide risk factors. We find that the coefficients of this regression are unstable with high probability for a vast majority of the stocks included in the S&P 500 index. We also find that the TVC model helps to better predict the exposures of these stocks to the risk factors.
Our model belongs to the family of Class I multi-process dynamic linear models defined by West and Harrison (1997). In our specification there is a single mixing parameter ϑ that takes on finitely many values between 0 and 1. The parameter measures the stability of regression coefficients: if it equals 0, then the regression is stable (coefficients are constant); the closer it is to 1, the more unstable coefficients are.
We propose two measures of stability that can be derived analytically from the posterior distribution of the mixing parameter, one based on credible intervals and one based on posterior odds ratios. We analyze the performance of a simple decision rule based on these measures of stability: "use OLS if they do not provide enough evidence of instability, otherwise use TVC". We find that such a decision rule performs well across different scenarios, leading to the smallest losses under the null of stability while still being able to produce satisfactory results when coefficients are indeed unstable.
Some features of our model are borrowed from existing TVC models (in particular Doan, Litterman and Sims, 1984, Stock and Watson, 1996, and Cogley and Sargent, 2001), whereas other features are completely novel. First of all, we propose an extension of Zellner's (1986) g-prior to dynamic linear models. Thanks to this extension, posterior probabilities and coefficient estimates are invariant to re-scalings of the regressors (before arriving at the specification of priors proposed in this paper, we tried several other specifications and found that results can indeed be quite sensitive to re-scalings under other priors): this property is essential to obtain a completely automatic specification of priors. Another original feature of the model is the use of an invariant, geometrically spaced support for the prior distribution of the mixing parameter. We argue that this characteristic of the prior allows the model to capture both very low and very high degrees of coefficient instability, while retaining considerable parsimony. Our modelling choices have two main practical consequences: 1) the priors are specified in a completely automatic way, so that the regressand y (a T × 1 vector of observations on the dependent variable) and the regressors X (a T × K matrix) are the only inputs required from the final user; 2) the computational burden of the model is minimized, because analytical estimators are available both for the regression coefficients and for their degree of instability. To our knowledge, none of the existing models has these two characteristics, which allow the model to be used in large-scale applications such as Monte Carlo simulations. The paper is organized as follows: Section 2 presents the model; Section 3 describes the specification of priors; Section 4 introduces the two measures of (in)stability; Section 5 reports the results of the Monte Carlo experiments; Section 6 contains the empirical application; Section 7 concludes. Proofs and other technical details are relegated to the Appendix.

The Bayesian model
We consider a dynamic linear model (according to the definition given by West and Harrison, 1997) with time-varying regression coefficients:

y_t = x_t θ_t + v_t,   (1)

where x_t is a 1 × k vector of observable explanatory variables, θ_t is a k × 1 vector of unobservable regression coefficients and v_t is an i.i.d. disturbance with normal distribution having zero mean and variance V. Time is indexed by t and goes from 1 to T (T is the last observation in the sample). The vector of coefficients θ_t is assumed to evolve according to the following equation:

θ_t = θ_{t−1} + w_t,   (2)

where w_t is an i.i.d. k × 1 vector of normal disturbances with zero mean and covariance matrix W.
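To fix ideas, the following minimal sketch (our illustration, with hypothetical parameter values; not part of the model's estimation machinery) simulates data from equations (1) and (2):

    import numpy as np

    rng = np.random.default_rng(0)
    T, k = 200, 3                      # sample size and number of regressors (assumed)
    V = 1.0                            # variance of the observation disturbance v_t
    W = 0.01 * np.eye(k)               # covariance of the coefficient innovations w_t

    X = rng.normal(size=(T, k))        # observable regressors x_t
    theta = np.empty((T, k))           # unobservable time-varying coefficients
    theta[0] = rng.normal(size=k)
    for t in range(1, T):
        theta[t] = theta[t - 1] + rng.multivariate_normal(np.zeros(k), W)

    y = np.einsum('tk,tk->t', X, theta) + rng.normal(scale=np.sqrt(V), size=T)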

Notation
Let information available at time t be denoted by D_t. D_t is defined recursively by:

D_t = D_{t−1} ∪ {y_t, x_t},

and D_0 contains prior information on the parameters of the model (to be specified below).
We denote by (z | D_t) the distribution of a random vector z, given information at time t, and by p(z | D_t) its probability density (or mass) function.
If a random vector z has a multivariate normal distribution with mean m and covariance matrix S, given D_t, we write:

(z | D_t) ~ N(m, S).

If a k × 1 random vector z has a multivariate Student's t distribution with mean m, scale matrix S and n degrees of freedom, we write:

(z | D_t) ~ T(m, S, n),

and its density is parametrized as follows:

p(z | D_t) = Γ((n + k)/2) / [Γ(n/2) (nπ)^{k/2}] · |S|^{−1/2} [1 + (1/n)(z − m)′ S^{−1} (z − m)]^{−(n+k)/2}.

If z has a Gamma distribution with parameters V̂ and n, we write:

(z | D_t) ~ G(V̂, n),

and its density is parametrized as follows:

p(z | D_t) = [(n V̂ / 2)^{n/2} / Γ(n/2)] z^{n/2 − 1} exp(−n V̂ z / 2).

Finally, define W* = V^{−1} W and denote by X the design matrix whose t-th row is x_t.

Structure of prior information and updating
In this subsection we state the main assumptions on the structure of prior information and we derive the formulae for updating the priors analytically. The first set of assumptions regards θ_1, the vector of regression coefficients at time t = 1, and V, the variance of the regression disturbances. We impose on θ_1 and V a conjugate normal/inverse-gamma prior, i.e.: θ_1 has a multivariate normal distribution conditional on V, with known mean θ̂_{1,0} and covariance equal to V F*_{θ,1,0}, where F*_{θ,1,0} is a known matrix; the reciprocal of V has a Gamma distribution, with known parameters V̂_0 and n_0.
The second set of assumptions regards W, which is assumed to be proportional to the prior variance of θ_1:

W* = λ F*_{θ,1,0},   (3)

where λ is a coefficient of proportionality. When λ = 0, the covariance matrix of w_t is zero and the regression coefficients are stable. On the contrary, when λ > 0, w_t has a non-zero covariance matrix and the regression coefficients are unstable (i.e. they change through time). The higher λ is, the greater the variance of w_t and the more unstable the regression coefficients are.
The constant of proportionality λ is parametrized as:

λ = δ(ϑ),   (4)

where δ(·) is a strictly increasing function and ϑ is a random variable with finite support R = {ϑ_1, …, ϑ_q} ⊆ [0, 1]. The prior probabilities of the q possible values of ϑ are denoted by p_{0,1}, …, p_{0,q}. The discussion of how ϑ_1, …, ϑ_q and p_{0,1}, …, p_{0,q} are chosen is postponed to the next section.
The assumptions on the priors and the initial information are summarized as follows:

Assumption 1 The priors on the unknown parameters are:

(θ_1 | D_0, V) ~ N(θ̂_{1,0}, V F*_{θ,1,0}), (1/V | D_0) ~ G(V̂_0, n_0), p(ϑ_i | D_0) = p_{0,i}, i = 1, …, q,

and the initial information set is:

D_0 = {θ̂_{1,0}, F*_{θ,1,0}, V̂_0, n_0, p_{0,1}, …, p_{0,q}}.

Given the above assumptions, the posterior distributions of the parameters of the regression can be calculated as follows:

Proposition 2 Let priors and initial information be as in Assumption 1. Let p_{t,i} = p(ϑ = ϑ_i | D_t). Then:

(θ_t | D_t, ϑ = ϑ_i) ~ T(θ̂_{t,t,i}, V̂_{t,i} F*_{θ,t,t,i}, n_{t,i}), (1/V | D_t, ϑ = ϑ_i) ~ G(V̂_{t,i}, n_{t,i}), (y_t | x_t, D_{t−1}, ϑ = ϑ_i) ~ T(ŷ_{t,t−1,i}, V̂_{t−1,i} F*_{y,t,t−1,i}, n_{t−1,i}),

and, unconditionally, p(θ_t | D_t) = Σ_{i=1}^q p_{t,i} p(θ_t | D_t, ϑ = ϑ_i). The parameters of the above distributions are obtained recursively as:

F*_{θ,t,t−1,i} = F*_{θ,t−1,t−1,i} + δ(ϑ_i) F*_{θ,1,0},   (5)

ŷ_{t,t−1,i} = x_t θ̂_{t,t−1,i}, F*_{y,t,t−1,i} = x_t F*_{θ,t,t−1,i} x_t′ + 1, e_{t,i} = y_t − ŷ_{t,t−1,i},   (6)

P_{t,i} = F*_{θ,t,t−1,i} x_t′ / F*_{y,t,t−1,i}, θ̂_{t,t,i} = θ̂_{t,t−1,i} + P_{t,i} e_{t,i}, F*_{θ,t,t,i} = F*_{θ,t,t−1,i} − P_{t,i} x_t F*_{θ,t,t−1,i},   (7)

n_{t,i} = n_{t−1,i} + 1, V̂_{t,i} = (1/n_{t,i}) (n_{t−1,i} V̂_{t−1,i} + e²_{t,i} / F*_{y,t,t−1,i}),   (8)

with θ̂_{t,t−1,i} = θ̂_{t−1,t−1,i}, starting from the initial conditions θ̂_{1,0,i} = θ̂_{1,0}, F*_{θ,1,0,i} = F*_{θ,1,0}, V̂_{0,i} = V̂_0 and n_{0,i} = n_0, while the mixing probabilities are obtained recursively as:

p_{t,i} = p_{t−1,i} p(y_t | x_t, D_{t−1}, ϑ = ϑ_i) / Σ_{j=1}^q p_{t−1,j} p(y_t | x_t, D_{t−1}, ϑ = ϑ_j),

starting from the prior probabilities p_{0,1}, …, p_{0,q}.
The updated mixing probabilities in the above proposition can be interpreted as posterior model probabilities, where a model is a TVC regression with fixed ϑ. Hence, for example, p_{T,1} is the posterior probability of the regression model with stable coefficients (ϑ = 0). A crucial property of the framework we propose is that posterior model probabilities are known analytically: they can be computed exactly, without resorting to simulations.
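To illustrate the recursion, the sketch below (a simplified illustration of ours, not the paper's implementation) runs one Kalman filter per grid point and updates the model probabilities by Bayes' rule; for brevity it treats V as known and uses Gaussian one-step-ahead predictive densities, whereas the exact recursion of Proposition 2 integrates V out and uses Student's t predictives:

    import numpy as np

    def tvc_posterior_probs(y, X, lam_grid, V=1.0):
        # One Kalman filter per candidate value of lambda = delta(theta_i);
        # returns the posterior probabilities of the corresponding models.
        T, k = X.shape
        F0 = T * np.linalg.inv(X.T @ X)      # automatic g-prior covariance, g = T
        loglik = np.zeros(len(lam_grid))
        for i, lam in enumerate(lam_grid):
            m = np.zeros(k)                  # prior mean of theta_1
            P = V * F0.copy()                # prior covariance of theta_1
            W = lam * V * F0                 # coefficient-innovation covariance
            for t in range(T):
                x = X[t]
                if t > 0:
                    P = P + W                # time update: theta_t = theta_{t-1} + w_t
                e = y[t] - x @ m             # one-step-ahead forecast error
                fy = x @ P @ x + V           # predictive variance of y_t
                loglik[i] += -0.5 * (np.log(2 * np.pi * fy) + e * e / fy)
                K = P @ x / fy               # Kalman gain
                m = m + K * e                # measurement update of the mean
                P = P - np.outer(K, x @ P)   # measurement update of the covariance
        w = np.exp(loglik - loglik.max())    # uniform prior over the grid cancels
        return w / w.sum()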
In the above proposition, the priors on the regression coefficients θ_t in a generic time period t are updated using only information received up to that same time t. However, after observing the whole sample (up to time T), one might want to revise her priors on the regression coefficients θ_t in previous time periods (t < T), using the information subsequently received. This revision (usually referred to as smoothing) can be accomplished using the results of the following proposition:

Proposition 3 Let priors and initial information be as in Assumption 1. Then, for 0 ≤ t ≤ T − 1:

(θ_t | D_T, ϑ = ϑ_i) ~ T(θ̂_{t,T,i}, V̂_{T,i} F*_{θ,t,T,i}, n_{T,i}).

The mixing probabilities p_{T,i} and the parameters V̂_{T,i} and n_{T,i} are obtained from the recursions in Proposition 2, while the parameters θ̂_{t,T,i} and F*_{θ,t,T,i} are obtained from the backward recursions reported in the Appendix (Proposition 19).

Other important quantities of interest are known analytically, as shown by the following:

Lemma 4 For 1 ≤ t ≤ T and s ∈ {t − 1, t, T}, the following equalities hold:

E[θ_t | D_s] = Σ_{i=1}^q p_{s,i} θ̂_{t,s,i} and E[y_t | x_t, D_s] = Σ_{i=1}^q p_{s,i} x_t θ̂_{t,s,i},

and Var[θ_t | D_s] and Var[y_t | x_t, D_s] can be calculated analytically for each i as in Propositions 2 and 3.
Thus, parameter estimates (E[θ_t | D_s]) and predictions (E[y_t | D_s]) in any time period can be computed analytically and their variances are known in closed form. The probability distributions of θ_t and y_t in a certain time period, given information D_s, are mixtures of Student's t distributions. Their quantiles are not known analytically, but they are easy to simulate by Monte Carlo methods. For example, if the distribution of θ_T conditional on D_T is the object of interest, one can set up a Monte Carlo experiment where each simulation is conducted in two steps: 1) extract z from a uniform distribution on [0, 1] and find the smallest k such that Σ_{i=1}^k p_{T,i} ≥ z; 2) extract a value of θ_T from the Student's t distribution (θ_T | D_T, ϑ = ϑ_k).
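A sketch of this two-step scheme (helper names are ours; the inputs p_T, theta_hat, F_theta and n_dof are assumed to come from the filtering recursions):

    import numpy as np

    def sample_theta_T(p_T, theta_hat, F_theta, n_dof, n_draws, seed=0):
        # Step 1: pick a mixture component with probabilities p_T.
        # Step 2: draw from the corresponding multivariate Student's t,
        # via its normal / chi-square location-scale representation.
        rng = np.random.default_rng(seed)
        q, k = theta_hat.shape
        draws = np.empty((n_draws, k))
        comps = rng.choice(q, size=n_draws, p=p_T)
        for j, i in enumerate(comps):
            z = rng.multivariate_normal(np.zeros(k), F_theta[i])
            s = rng.chisquare(n_dof[i]) / n_dof[i]
            draws[j] = theta_hat[i] + z / np.sqrt(s)
        return draws  # empirical quantiles approximate those of (theta_T | D_T)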

The specification of priors
Our specification of priors aims to be: 1. objective, in the sense that it does not require elicitation of subjective priors; 2. fully automatic, in the sense that the model necessitates no inputs from the econometrician other than regressors and regressands, as in plain-vanilla OLS regressions with constant coefficients.
The above goals are pursued by extending Zellner's (1986) g-prior to TVC models and by parametrizing δ(·) in such a way that the support of ϑ is invariant (it need not be specified on a case-by-case basis).

The prior mean and variance of the coefficients
We use a version of Zellner's (1986) g-prior for the prior distribution of the regression coefficients at time t = 1:

Assumption 5
The prior mean is zero, corresponding to a prior belief of no predictability:

θ̂_{1,0} = 0,   (9)

while the prior covariance matrix is proportional to (X′X)^{−1}:

F*_{θ,1,0} = g (X′X)^{−1},   (10)

where g is a coefficient of proportionality.
Zellner's (1986) g-prior is widely used in model selection and model averaging problems similar to ours (we have a range of regression models featuring different degrees of instability), because it greatly reduces the sensitivity of posterior model probabilities to the specification of prior distributions (Fernandez, Ley and Steel, 2001), thus helping to keep the analysis as objective as possible. Furthermore, Zellner's (1986) g-prior has a straightforward interpretation: it can be interpreted as information provided by a conceptual sample having the same design matrix X as the current sample (Zellner, 1986; George and McCulloch, 1997; Smith and Kohn, 1996).
To keep the prior relatively uninformative, we follow Kass and Wasserman (1995) and choose g = T (see also Shively, Kohn and Wood, 1999):

Assumption 6 The coefficient of proportionality is g = T.
Thus, the amount of prior information (in the Fisher sense) about the coe¢ cients is equal to the amount of average information contained in one observation from the sample.
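For concreteness, a minimal sketch of this automatic prior construction (function name is ours):

    import numpy as np

    def g_prior(X):
        # Zellner-type g-prior with g = T: zero mean, covariance
        # proportional to (X'X)^{-1} up to the factor V.
        T, k = X.shape
        theta_mean = np.zeros(k)           # prior belief of no predictability
        F0 = T * np.linalg.inv(X.T @ X)    # prior carries one observation's worth
        return theta_mean, F0              # of average Fisher information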
Remark 7 Given that W* = δ(ϑ) F*_{θ,1,0} (equations (3) and (4)), Zellner's prior (10) implies that the covariance matrix of w_t is also proportional to (X′X)^{−1}. This proportionality condition has been imposed in a TVC model also by Stock and Watson (1996), who borrow it from Nyblom (1989). (They assume, however, that F*_{θ,1,0} is proportional to the identity matrix, while we assume that F*_{θ,1,0} too is proportional to (X′X)^{−1}. Furthermore, they do not estimate V: their analysis focuses on the one-step-ahead predictions of y_t, which can be computed without knowing V. They approach the estimation of λ in a number of different ways, but none of them allows a posterior distribution for λ to be derived analytically.) A similar hypothesis is adopted also by Cogley and Sargent (2001), in whose model the prior covariance of w_t is proportional to (X′X)^{−1}, with X the design matrix of a pre-sample not used for the estimation of the model.

Remark 8 Given (10) and (9), all the coefficients θ_t have zero prior mean and covariance proportional to (X′X)^{−1} conditional on D_0. This property will be used later, together with other properties of the priors, to prove that posterior model probabilities are scale invariant in the covariates.

The variance parameters V̂_0 and n_0

In objective Bayesian analyses, the prior usually assigned to V in conjunction with Zellner's (1986) g-prior (e.g., Liang et al., 2008) is the improper prior:

p(V | D_0) ∝ 1/V.

With this choice, the updating equations in Proposition 2 would have to be replaced with a different set of updating equations until the first non-zero observation of y_t is reached (see e.g. West and Harrison, 1997). Furthermore, the updating of the posterior probabilities would be slightly more complicated. To avoid the subtleties involved in using an improper prior, we adopt a simpler procedure, which yields almost identical results in reasonably sized samples:

Assumption 9 The first observation in the sample (denote it by y_0) is used to form the prior on V.

After using it to form the prior, we discard the first observation and start updating equations (5)-(8) from the following observation. If the first observation is zero (y_0 = 0), we discard it and use the next one to form the prior (repeating, if necessary, until the first non-zero observation is found).

The mixing parameter
We have assumed that W* = δ(ϑ) F*_{θ,1,0}, where ϑ is a random variable having finite support R = {ϑ_1, …, ϑ_q} ⊆ [0, 1], with ϑ_1 = 0, and δ(·) is strictly increasing in ϑ and such that δ(ϑ_1) = 0. We now propose a specification of the function δ(·) that satisfies the above requirements and allows for an intuitive interpretation of the parameter ϑ, while also facilitating the specification of a prior distribution for ϑ.
First, note that:

Var[x_t w_t | x_t, D_0, ϑ] = V δ(ϑ) g x_t (X′X)^{−1} x_t′.

Hence, given ϑ, x_t and the initial information D_0, the variance generated by innovations at time t is:

Var[x_t w_t + v_t | x_t, D_0, ϑ] = V (1 + δ(ϑ) g x_t (X′X)^{−1} x_t′).

Assumption 10 ϑ is the fraction of Var[x_t w_t + v_t | x_t, D_0, ϑ] generated on average by innovations to the regression coefficients:

ϑ = δ(ϑ) ω / (1 + δ(ϑ) ω), where ω = (g/T) Σ_{t=1}^T x_t (X′X)^{−1} x_t′ = k (since g = T),

which is equivalent to δ(ϑ) = ϑ / (ω − ωϑ). Given this assumption on ϑ, it is immediate to prove that δ is strictly increasing in ϑ and such that δ(0) = 0, as required. Hence, when ϑ = 0 the regression has stable coefficients. Furthermore, by an appropriate choice of ϑ, any degree of coefficient instability can be reproduced (when ϑ tends to 1, δ approaches infinity).
As far as the support of ϑ is concerned, we make the following assumption:

Assumption 11 The support of ϑ is the geometrically spaced grid

R = {ϑ_1, ϑ_2, …, ϑ_q}, with ϑ_1 = 0 and ϑ_i = ϑ_max c^{q−i} for i = 2, …, q,

where 0 < c < 1 and 0 < ϑ_max < 1.
Notice that ϑ_max cannot be chosen to be exactly equal to 1 (because δ(1) = ∞), but it can be set equal to any number arbitrarily close to 1.
Using a geometrically spaced grid is the natural choice when the order of magnitude of a parameter is unknown (e.g., Guerre and Lavergne, 2005, Horowitz and Spokoiny, 2001, Lepski, Mammen and Spokoiny, 1997): in our model, it makes it possible to consider simultaneously both regressions that are very close to being stable and regressions that are far from being stable, without requiring too fine a grid.
If the geometric grid is considered as an approximation of a finer set of points (possibly a continuum), the geometric spacing ensures that the maximum relative round-off error is constant on all subintervals [ϑ_i, ϑ_{i+1}] such that 1 < i < q. The maximum relative round-off error is approximately (1 − c)/2 on these subintervals and it can be controlled by an appropriate choice of c. On the contrary, the maximum relative round-off error cannot be controlled (it always equals 1) on the subinterval [ϑ_1, ϑ_2], because the latter contains the point ϑ_1 = 0. Only the absolute round-off error (equal to ϑ_max c^{q−2}/2) can be controlled on [ϑ_1, ϑ_2], by an appropriate choice of q. Therefore, setting the two parameters c and q can be assimilated to setting the absolute and relative error tolerances in a numerical approximation problem.
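A small sketch of the grid construction and of the mapping δ(·) (our illustration; the example values q = 100, c = 0.9 and ϑ_max = 0.999 are those used in the Monte Carlo section, and ω = k follows from g = T):

    import numpy as np

    def theta_grid(q=100, c=0.9, theta_max=0.999):
        # Geometrically spaced support: theta_i = theta_max * c**(q - i), i = 2..q,
        # plus the point theta_1 = 0 (stable regression).
        grid = theta_max * c ** np.arange(q - 2, -1, -1.0)
        return np.concatenate(([0.0], grid))

    def delta(theta, omega):
        # delta(theta) = theta / (omega * (1 - theta)); omega = k when g = T.
        return theta / (omega * (1.0 - theta))

    lam_grid = delta(theta_grid(), omega=3.0)   # e.g. a regression with k = 3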
Assuming prior ignorance on the order of magnitude of ϑ, we assign equal probability to each point in the grid:

Assumption 12
The prior mixing probabilities are assumed to be:

p(ϑ_i | D_0) = 1/q, i = 1, …, q.

Note that, given the above choices, the prior on ϑ and its support are invariant, in the sense that they do not depend on any specific characteristic of the data to be analyzed, but only on the maximum percentage round-off error (1 − c)/2. As a consequence, they allow the specification of priors to remain fully automatic.

Scale invariance
A crucial property of the automatic specification of priors proposed in the previous sections is that it guarantees scale invariance. The scale invariance property is satisfied if, when the regressors are multiplied by an invertible matrix R, the posterior distribution of the coefficients is re-scaled accordingly (it is multiplied by R^{−1}). Virtually all the TVC models we have found in the literature do not satisfy the scale invariance property, in the sense that they do not contemplate a mechanism to guarantee scale invariance by automatically re-scaling priors when the scale of the regressors is changed. Although scale invariance might seem a trivial property, it is indispensable to achieve one of the main goals of this paper: having a completely automatic model that requires only regressors and regressands as inputs from the econometrician. Furthermore, it guarantees replicability of results: two researchers using the same data, but on different scales, will obtain the same results.
Scale invariance is formally defined as follows:

Definition 13 Given the initial information set D_0, the information sets D_t = D_{t−1} ∪ {y_t, x_t}, t = 1, …, T, and a full-rank k × k matrix R, an initial information set D̃_0 is said to be R-scale invariant with respect to D_0 if and only if:

(θ_t | D̃_t) = (R^{−1} θ_t | D_t), t = 1, …, T,

where D̃_t = D̃_{t−1} ∪ {y_t, x_t R}.

Note that the initial information set D_0, which contains the priors, is automatically specified as a function of y_0 and (X′X)^{−1}. We can write:

D_0 = D(y_0, (X′X)^{−1}).   (11)

The following proposition, proved in the Appendix, shows in what sense our TVC model is scale-invariant:

Proposition 14 For any full-rank k × k matrix R, the initial information set D̃_0 defined by:

D̃_0 = D(y_0, (R′X′XR)^{−1})

is R-scale invariant with respect to the initial information set D_0, as defined in (11).

Measures of (in)stability
After computing the posterior distribution of ϑ, a researcher might naturally ask: how much evidence did the data provide against the hypothesis of stability? Here, we discuss some possible ways to answer this question.
The crudest way to evaluate instability is to look at the posterior probability that ϑ = 0. The closer to 1 this probability is, the more evidence of stability we have. However, a low posterior probability that ϑ = 0 does not necessarily constitute overwhelming evidence of instability. It might simply be the case that the sample is not large enough to discriminate satisfactorily, a posteriori, between stable and unstable regressions: in such cases, even if the true regression is stable, unstable regressions might be assigned posterior probabilities that are only marginally lower than the probability of the stable one. Furthermore, if R contains a great number of points, it can happen that the posterior probability that ϑ = 0 is close to zero, but still much higher than the posterior probability of all the other points.
We propose two measures of stability to help circumvent the above shortcomings. The first measure of stability, denoted by ρ, is based on credible intervals (e.g., Robert, 2007):

Definition 15 (ρ-stability) Let H be a higher posterior probability set defined as follows:

H = {ϑ_i ∈ R : p(ϑ = ϑ_i | D_T) > p(ϑ = 0 | D_T)},

i.e. H contains all points of R having higher posterior probability than ϑ = 0 (recall that ϑ = 0 means that regression coefficients are stable). The stability measure ρ is defined by:

ρ = 1 − p(ϑ ∈ H | D_T) / p(ϑ > 0 | D_T),

where we adopt the convention 0/0 = 0.

When ρ = 1, ϑ = 0 is a mode of the posterior distribution of ϑ: we attach to the hypothesis of stability a posterior probability that is at least as high as the posterior probability of any alternative hypothesis of instability. On the contrary, when ρ = 0, the posterior probability assigned to the hypothesis of stability is so low that all unstable models are more likely than the stable one, a posteriori. In the intermediate cases (0 < ρ < 1), ρ provides a measure of how far the hypothesis of stability is from being the most likely hypothesis (the lower ρ, the less likely stability is).
The second measure of stability, denoted by π, is constructed as a posterior odds ratio and is based on the probability of the posterior mode of ϑ.
Definition 16 (π-stability) Let p* be the probability of (one of) the mode(s) of the posterior distribution of ϑ:

p* = max_{i=1,…,q} p(ϑ = ϑ_i | D_T).

The stability measure π is defined by:

π = p(ϑ = 0 | D_T) / p*.

As with the previously proposed measure, when π = 1, ϑ = 0 is a mode of the posterior distribution of ϑ and stability is the most likely hypothesis, a posteriori. On the contrary, the closer π is to zero, the less likely stability is, when compared with the most likely hypothesis. For example, when π = 1/10, there is an unstable regression that is 10 times more likely than the stable one.
Both measures of stability (ρ and π) can be used to make decisions. For example, one can fix a threshold and decide to reject the hypothesis of stability if the measure of stability is below the threshold (ρ < ρ̄ or π < π̄). In case ρ is used, the procedure can be assimilated to a frequentist test of hypothesis, where 1 − ρ̄ represents the level of confidence.
ρ can be interpreted as a sort of Bayesian p-value (e.g., Robert, 2007): the lower ρ is, the higher the confidence with which we can reject the hypothesis of stability. In case π is used, one can resort to Jeffreys' (1961) scale to qualitatively assess the strength of the evidence against the hypothesis of stability (e.g., substantial evidence if 1/10 < π ≤ 1/3, strong evidence if 1/30 < π ≤ 1/10, very strong evidence if 1/100 < π ≤ 1/30). In the next section we explore the consequences of using these decision rules to decide whether to estimate a regression by OLS or by TVC.
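Both measures are immediate to compute once the posterior probabilities over the grid are available; the sketch below follows the definitions given above (helper name is ours; p_T[0] is the probability of ϑ = 0):

    import numpy as np

    def stability_measures(p_T):
        p_stable = p_T[0]
        H = p_T[1:][p_T[1:] > p_stable]     # points more likely than theta = 0
        p_unstable = p_T[1:].sum()
        rho = 1.0 - H.sum() / p_unstable if p_unstable > 0 else 1.0  # 0/0 := 0
        pi = p_stable / p_T.max()           # odds vs. the posterior mode
        return rho, pi

A decision rule like the one studied below then reduces to a threshold check, e.g. use the TVC estimates when rho < 0.1 (or pi < 0.1) and OLS otherwise.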

Performance when the DGP is a stable regression
In this subsection we present the results of a set of Monte Carlo simulations aimed at evaluating how much efficiency is lost when a stable regression is estimated with our TVC model. We compare the forecasting performance and the estimation precision of the TVC model with those of plain-vanilla OLS and of a standard frequentist procedure used to identify breakpoints and estimate regression coefficients in the presence of structural breaks. In particular, we consider the performance of Bai and Perron's (1998 and 2003) sequential procedure, as implemented by Pesaran and Timmermann (2002 and 2007). For our Monte Carlo experiments, we adapt a design that has already been employed in the literature on parameter instability (Hansen, 2000).
The design is as follows. Data generating process: y_t is generated according to:

y_t = α y_{t−1} + u_{t−1} + v_t, t = 1, …, T,

where y_0 = 0, u_t ~ T(0, 1, 5) i.i.d., v_t ~ N(0, 1) i.i.d., and u_t and v_t are serially and cross-sectionally independent.
Estimated equations: two equations are estimated. In the first case, a constant and the first lags of y_t and u_t are included in the set of regressors; hence, the estimated model is (1), where

x_t = [1  y_{t−1}  u_{t−1}].

In the second case, a constant and the first three lags of y_t and u_t are included in the set of regressors; hence, the estimated model is (1), where

x_t = [1  y_{t−1}  y_{t−2}  y_{t−3}  u_{t−1}  u_{t−2}  u_{t−3}].

Parameters of the design: simulations are conducted for three different sample sizes (T = 100, 200, 500), four different values of the autoregressive coefficient (α = 0, 0.50, 0.80, 0.99) and the two estimated equations detailed above, for a total of 24 experiments.
Each Monte Carlo experiment consists of 10,000 simulations. The loss in estimation precision is evaluated by comparing the estimate of the coefficient vector at time T (denote it by θ̃_T) with its true value. We consider seven different estimates:

- model averaging (TVC-MA) estimates, where:

θ̃_T = Σ_{i=1}^q p_{T,i} θ̂_{T,T,i};

- model selection (TVC-MS) estimates, where:

θ̃_T = θ̂_{T,T,j} and j = arg max_i p_{T,i},

i.e. only the model with the highest posterior probability is used to make predictions;

- estimates obtained from the regression model with stable coefficients when ρ ≥ 0.1 and from model averaging when ρ < 0.1 (denoted by TVC-ρ): i.e. coefficients are estimated with the TVC model only if there is enough evidence of instability (ρ < 0.1); otherwise, the standard OLS regression is used. This is intended to reproduce the outcomes of a decision rule whereby the econometrician uses the TVC model only if the TVC model itself provides enough evidence that OLS is inadequate;

- estimates obtained from the regression model with stable coefficients when π ≥ 0.1 and from model averaging when π < 0.1 (denoted by TVC-π): this estimator is similar to the previous one, but π is used in place of ρ to decide whether there is enough evidence of instability;

- estimates obtained from the regression model with stable coefficients (OLS);

- OLS estimates obtained from Bai and Perron's (1998 and 2003) sequential procedure (denoted by BP), using the SIC criterion to choose the number of breakpoints (Pesaran and Timmermann, 2002 and 2007). If τ̃ is the last estimated breakpoint date in the sample, then θ̃_T is the OLS estimate of θ_T obtained using all the sample points from τ̃ to T. (We estimate the breakpoint dates sequentially rather than simultaneously to achieve a reasonable computational speed in our Monte Carlo simulations. Denote by S̃ the number of breakpoints estimated by the sequential procedure and by S the number estimated by the simultaneous procedure. Given that we are using the SIC criterion to choose the number of breakpoints, if S ≤ 1, then S̃ = S; otherwise, if S > 1, then S̃ ≤ S. Therefore, in our Monte Carlo simulations, where the true number of breakpoints is either 0 or 1, the sequential procedure provides a better estimate of the number of breakpoints than the simultaneous procedure.);

- estimates obtained from Pesaran and Timmermann's (2007) model-averaging procedure (denoted by BP-MA): the location of the last breakpoint is estimated with Bai and Perron's procedure (as in the point above); if τ̃ is the last estimated breakpoint date in the sample, then:

θ̃_T = Σ_{τ ≤ τ̃} w_τ θ̃_{T,τ},

where θ̃_{T,τ} is the OLS estimate of θ_T obtained using all the sample points from τ to T, and w_τ is a weight proportional to the inverse of the mean squared prediction error committed when using only the sample points from τ onwards to estimate the regression and predict y_t (τ + k + 1 ≤ t ≤ T).
The Monte Carlo replications are used to estimate the mean squared error of the coefficient estimates:

MSE_θ^j = E[‖θ̃_T^j − θ_T‖²],

where ‖·‖ is the Euclidean norm and j = TVC-MA, TVC-MS, TVC-ρ, TVC-π, OLS, BP, BP-MA, depending on which of the above methods has been used to estimate θ_T.
The two parameters regulating the granularity of the grid for ϑ are chosen as follows: q = 100 and c = 0.9. To avoid degeneracies, rather than setting ϑ_max = 1 (the theoretical upper bound on ϑ), we choose a value that is numerically close to 1 (ϑ_max = 0.999). Thus, the relative round-off error is bounded at 5 per cent and the model is able to detect degrees of instability as low as ϑ ≃ 3 × 10⁻⁵ (for concreteness, this means that coefficient instability can be detected by the model even in cases in which less than 0.01 per cent of the total innovation variance is generated by coefficient instability).
Panel A of Table 1 reports the Monte Carlo estimates of MSE_θ^j for the case in which x_t includes only the first lags of y_t and u_t. Not surprisingly, the smallest MSE is in all cases achieved by the OLS estimates. As anticipated in the introduction, there are significant differences between the case in which the autoregressive component is very persistent (α = 0.99) and the other cases (α = 0, 0.50, 0.80). In the latter cases, the TVC-ρ coefficient estimates are those that yield the smallest increase in MSE with respect to OLS (in most cases under 5 per cent). The performance of BP-MA is the second best, being only slightly inferior to that of TVC-ρ, but slightly superior to that of TVC-π. The unsatisfactory performance of the TVC and BP estimates in the case of high persistence can arguably be explained by an identification problem. In the unit root case, the regression generating the data is:

y_t = y_{t−1} + u_{t−1} + v_t.

For any φ < 1, it can be rewritten as:

y_t = μ_t + φ y_{t−1} + u_{t−1} + v_t,

where μ_t = (1 − φ) y_{t−1} is an intercept following a random walk. Furthermore, its innovations (μ_t − μ_{t−1}) are contemporaneously independent of the innovations v_t. Therefore, if the estimated equation includes a constant and time-varying coefficients are not ruled out, it is not possible to identify whether the regression has a unit root and stable coefficients or a stationary autoregressive component and a time-varying intercept. When α is near unity, identification is possible, but it will presumably be weak, giving rise to very imprecise estimates of the coefficients and of their degree of stability. Note that the two equivalent (and unidentified) representations above obviously yield the same one-step-ahead forecasts of y_t. Therefore, if our conjecture that this weak identification problem is affecting our results is correct, we should find that the out-of-sample forecasts of y_t produced by the TVC model are not as unsatisfactory as its coefficient estimates. This is exactly what we find and document in the last part of this subsection.
Panel B of Table 1 reports the Monte Carlo estimates of MSE_θ^j for the case in which x_t includes three lags of y_t and u_t. In the case of low persistence, the BP-MA estimates are those that achieve the smallest increase in MSE with respect to the OLS estimates (on average below 2 per cent). The performance of the TVC-ρ estimates is only slightly inferior (around a 3 per cent increase in MSE with respect to OLS). All the other estimates (TVC-MA, TVC-MS, TVC-π and BP) are somewhat less efficient, but their MSEs seldom exceed those of the OLS estimates by more than 30 per cent. As far as the highly persistent case (α = 0.99) is concerned, we again observe a degradation in the performance of the TVC and (to a lesser extent) of the BP estimates. However, the degradation is less severe than the one observed in the case of fewer regressors. Intuitively, adding more regressors (even if their coefficients are 0) helps to alleviate the identification problem discussed before, because the added regressors have stable coefficients and hence help to pin down the stable representation of the regression.
The loss in forecasting performance is evaluated using a single out-of-sample prediction for each replication. In each replication, T + 1 observations are generated, the first T are used to update the priors, the vector of regressors x_{T+1} is used to predict y_{T+1} and the prediction (denote it by ỹ_{T+1}) is compared to the actual value y_{T+1}. As for coefficient estimates, we consider seven different predictions:

- model averaging (TVC-MA) predictions, where ỹ_{T+1} = Σ_{i=1}^q p_{T,i} x_{T+1} θ̂_{T,T,i};

- model selection (TVC-MS) predictions, where ỹ_{T+1} = x_{T+1} θ̂_{T,T,j} and j = arg max_i p_{T,i};

- predictions generated by the regression model with stable coefficients when ρ ≥ 0.1 and by model averaging when ρ < 0.1 (denoted by TVC-ρ);

- predictions generated by the regression model with stable coefficients when π ≥ 0.1 and by model averaging when π < 0.1 (denoted by TVC-π);

- predictions generated by the regression model with stable coefficients (OLS), by Bai and Perron's sequential procedure (BP) and by the model-averaging procedure (BP-MA), with coefficients estimated as described above.

The Monte Carlo replications are used to estimate the mean squared forecast error:

MSE_y^j = E[(ỹ_{T+1}^j − y_{T+1})²],

where j = TVC-MA, TVC-MS, TVC-ρ, TVC-π, OLS, BP, BP-MA, depending on which of the above methods has been used to forecast y_{T+1}.
To increase the accuracy of our Monte Carlo estimates of MSE_y^j, we use the fact that:

MSE_y^j = E[v²_{T+1}] + E[(ỹ_{T+1}^j − x_{T+1} θ_{T+1})²].

Since E[v²_{T+1}] is known, we use the Monte Carlo simulations to estimate only the second summand on the right-hand side of the above equation. Table 2 reports the Monte Carlo estimates of MSE_y^j. The variation in MSE_y^j across models and design parameters broadly reflects the variation in MSE_θ^j we have discussed above. To avoid repetitions, we point out the only significant difference, which concerns the highly persistent design (α = 0.99): while the TVC and BP estimates give rise to an MSE_θ^j that is around two orders of magnitude higher than MSE_θ^OLS, the part of their MSE_y^j attributable to estimation error (MSE_y^j − 1) compares much more favorably to its OLS counterpart, especially in the designs where x_t includes three lags of y_t and u_t. This might be considered evidence of the identification problem mentioned above.

Performance when the DGP is a regression with a discrete structural break
In this subsection we present the results of a set of Monte Carlo simulations aimed at understanding how our TVC model performs when regression coefficients experience a single discrete structural break. As in the previous subsection, we analyze both losses in forecasting performance and losses in estimation precision.
The Monte Carlo design is the same employed in the previous subsection, except for the fact that the data generating process is now subject to a discrete structural break at an unknown date. Data generating process: y_t is generated according to:

y_t = α y_{t−1} + b_t u_{t−1} + v_t, b_t = 1 + b · 1{t ≥ τ},

where y_0 = 0, u_t ~ T(0, 1, 5) i.i.d., v_t ~ N(0, 1) i.i.d. and u_t and v_t are serially and cross-sectionally independent; τ is the stochastic breakpoint date, extracted from a discrete uniform distribution on the set of sample dates (from 1 to T); b ~ N(0, 1) is the stochastic break in regression coefficients.
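For concreteness, a sketch of this DGP (helper name is ours; the indicator-based break follows the description above):

    import numpy as np

    def simulate_break_dgp(T, alpha, rng):
        u = rng.standard_t(df=5, size=T + 1)                # u_t ~ Student's t(5)
        v = rng.normal(size=T + 1)                          # v_t ~ N(0, 1)
        tau = rng.integers(1, T + 1)                        # breakpoint uniform on 1..T
        b = 1.0 + rng.normal() * (np.arange(T + 1) >= tau)  # coefficient jumps at tau
        y = np.zeros(T + 1)                                 # y_0 = 0
        for t in range(1, T + 1):
            y[t] = alpha * y[t - 1] + b[t] * u[t - 1] + v[t]
        return y, u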
The estimation precision and the forecasting performance are evaluated by comparing the estimates of the coefficient vector at time T and the predictions of y_{T+1} with their true values.
Panel A of Table 3 reports the Monte Carlo estimates of MSE_θ^j for the case in which x_t includes only the first lags of y_t and u_t. As before, we first discuss the cases in which α ≠ 0.99. The OLS estimates, which have the smallest MSEs in the stable case (see the previous subsection), are now those with the highest MSEs. Both the frequentist methods (BP and BP-MA) and the TVC methods (all four kinds) achieve a significant reduction of the MSE with respect to OLS. Although TVC-MA and TVC-MS perform slightly better than TVC-ρ and TVC-π, there is not a clear ranking between the former two and the two frequentist methods: their MSEs are on average comparable, but TVC-MA and TVC-MS tend to perform better when the sample size is small (T = 100), while BP and BP-MA tend to perform better when the sample size is large (T = 200, 500). This might be explained by the fact that BP and BP-MA require the estimation of a considerable number of parameters when one or more break-dates are found, and these parameters are inevitably estimated with low precision when the sample size is small. In the case in which α = 0.99, results are again substantially different: the MSEs of the TVC estimates (all four kinds) and of the BP estimates become much larger than the MSEs of the OLS estimates (and the BP estimates fare better than the TVC estimates), while the MSEs of the BP-MA estimates remain below those of the OLS estimates. The remarks about potential identification problems made in the previous subsection apply also to these results.
Panel B of Table 3 reports the Monte Carlo estimates of MSE_θ^j for the case in which x_t includes three lags of y_t and u_t. The patterns are roughly the same as those found in Panel A (see the previous paragraph), with the relative performance of the TVC methods and the frequentist methods depending on the sample size T. The only difference worth mentioning is that when α = 0.99 the increase in the MSEs is milder and the TVC-MA estimates are more precise than the BP estimates.
As far as out-of-sample forecasting performance is concerned (Table 4, Panels A and B), the patterns in MSE_y^j broadly reflect the patterns in MSE_θ^j. Again, there is an exception to this: when α = 0.99, high values of MSE_θ^j do not translate into high values of MSE_y^j; as a consequence, despite the aforementioned identification problem, the BP and the four TVC forecasts are much more accurate than the OLS forecasts (and in some cases also more accurate than the BP-MA forecasts).

Performance when the DGP is a regression with frequently changing coefficients
In this subsection we present the results of a set of Monte Carlo simulations aimed at understanding how our TVC model performs when regression coefficients experience frequent changes.
We analyze both losses in forecasting performance and losses in estimation precision, using the same Monte Carlo design employed in the previous two subsections. The only difference is that the data are now generated by a regression whose coefficients change at every time period. Data generating process: y_t is generated according to:

y_t = α y_{t−1} + b_t u_{t−1} + v_t, b_t = b_{t−1} + w_t, b_0 = 1,

where y_0 = 0, u_t ~ T(0, 1, 5) i.i.d., v_t ~ N(0, 1) i.i.d., w_t ~ N(0, W) i.i.d., and u_t, v_t and w_t are serially and cross-sectionally independent. To ease comparisons with the previous subsection, W is chosen in such a way that b_T ~ N(1, 1), irrespective of the sample size T:

W = 1/T.

Note that, although one coefficient of the regression is frequently changing (b_t), the other coefficient (α) is stable. As a consequence, the true DGP does not fit exactly any of the possible DGPs contemplated by the TVC model. We prefer to adopt this specification over one in which the TVC model is correctly specified, because the results obtained with the latter are trivial (the TVC estimates are the best possible estimates). Furthermore, controlling α (keeping it fixed) allows us to understand better its effects on model performance.
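A sketch of this DGP, highlighting the scaling W = 1/T that makes b_T ~ N(1, 1) for every sample size (helper name is ours):

    import numpy as np

    def simulate_rw_dgp(T, alpha, rng):
        u = rng.standard_t(df=5, size=T + 1)
        v = rng.normal(size=T + 1)
        w = rng.normal(scale=np.sqrt(1.0 / T), size=T + 1)  # per-period variance 1/T
        w[0] = 0.0                                          # b_0 = 1
        b = 1.0 + np.cumsum(w)                              # b_T ~ N(1, 1)
        y = np.zeros(T + 1)
        for t in range(1, T + 1):
            y[t] = alpha * y[t - 1] + b[t] * u[t - 1] + v[t]
        return y, u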
Panel A of Table 5 reports the Monte Carlo estimates of MSE_θ^j for the case in which x_t includes only the first lags of y_t and u_t. We first summarize the results obtained when α ≠ 0.99. The lowest MSEs are achieved by the TVC-MA estimates. The TVC-MS estimates are the second best (in some cases MSE_θ^{TVC-MS} is almost identical to MSE_θ^{TVC-MA}). TVC-ρ and TVC-π also have a performance comparable to that of TVC-MA (the increase in the MSEs is on average less than 5 per cent). The BP estimates are significantly less precise than the TVC estimates (their MSEs are roughly between 30 and 70 per cent higher than MSE_θ^{TVC-MA}). Finally, BP and BP-MA have a comparable performance when T = 100, but BP-MA is much less precise when the sample size increases (T = 200, 500). When α = 0.99, we again observe a sharp increase in the MSEs of the TVC estimates (all four kinds) and of the BP estimates: their MSEs become several times those of the OLS estimates. BP-MA achieves a significant reduction in MSE over OLS with larger sample sizes (T = 200, 500). Thus, also with frequently changing coefficients, BP-MA seems to be the only method capable of dealing simultaneously with coefficient instability and a highly persistent lagged dependent variable.
Panel B of Table 5 reports the Monte Carlo estimates of MSE_θ^j for the case in which x_t includes three lags of y_t and u_t. Similarly to what we found in the previous subsections, the only noticeable difference with respect to the one-lag case is that when α = 0.99 the increase in the MSEs is milder.
As far as out-of-sample forecasting performance is concerned (Table 6, Panels A and B), the patterns in MSE_y^j broadly reflect the patterns in MSE_θ^j. Again, the case α = 0.99 constitutes an exception: despite their high MSE_θ^j, the BP and the four TVC forecasts are more accurate than the OLS forecasts (and the TVC-MA and TVC-ρ forecasts are also more accurate than the BP-MA forecasts).

Empirical application: estimating common stocks' exposures to risk factors
In this section we briefly illustrate an empirical application of our TVC model. We use the model to estimate the exposures of S&P 500 constituents to market-wide risk factors. We track the weekly returns of the S&P 500 constituents for 10 years (from January 2000 to December 2009). An uninterrupted time series of returns is available for 432 of the 500 constituents (as of December 2009). The list of constituents and their returns are downloaded from Datastream. The risk factors we consider are the Fama and French (1993 and 1996) risk factors (the excess return on the market portfolio and the returns on the Small Minus Big and High Minus Low portfolios), downloaded from Kenneth French's website.
The exposures to the risk factors are the coefficients θ_t in the regression:

y_t = x_t θ_t + v_t,

where y_t is the excess return on a stock at time t, x_t = [1  r_{M,t} − r_{f,t}  SMB_t  HML_t], r_{M,t} is the return on the market portfolio at time t, r_{f,t} is the risk-free rate of return, and SMB_t and HML_t are the returns at time t on the SMB and HML portfolios respectively. The procedures illustrated in the previous sections are employed to understand whether the risk exposures θ_t are time-varying and whether the TVC model provides good estimates of these risk exposures.
For a vast majority of the stocks included in our sample, we find evidence that θ_t is indeed time-varying: ϑ = 0 is the posterior mode of the mixing parameter for only 11 stocks out of 432. Furthermore, ρ < 0.1 and π < 0.1 for 92% and 81% of the stocks respectively. On average, ρ is 0.046 and π is 0.010. The frequentist method also provides evidence that most stocks experience instability in their risk exposures: according to the BP sequential estimates, more than 78% of stocks experience at least one break in θ_t.
To evaluate the forecasting performance, we use the out-of-sample forecasts of y_t produced after the first 400 weeks. The methods used to make predictions are those described in the previous section (j = TVC-MA, TVC-MS, TVC-ρ, TVC-π, OLS, BP, BP-MA). For each stock i and for each prediction method j, the mean squared error is computed as:

MSE_{i,j} = (1/(T − T_0)) Σ_{t=T_0+1}^T (ỹ_{t,i,j} − y_{t,i})²,

where T_0 is the number of periods elapsed before the first out-of-sample forecast is produced, ỹ_{t,i,j} denotes the prediction of the excess return of the i-th stock at time t, conditional on x_t, produced by method j, and y_{t,i} is the corresponding realization.
To be able to compare the performance of the various methods across stocks, we use the performance of the OLS forecasts as a benchmark. Thus, the gain from using model j with stock i is defined as:

GAIN_{i,j} = 1 − MSE_{i,j} / MSE_{i,OLS},

i.e. GAIN_{i,j} is the average reduction in MSE achieved by using model j instead of OLS. A positive value indicates an improvement in forecasting performance. Table 7 reports some summary statistics of the sample distribution of GAIN_{i,j} (each stock i represents a sample point). All the TVC methods achieve a reduction in MSE and, among the TVC methods, TVC-MA achieves the maximum average reduction (approximately 3 per cent). BP performs very poorly (it actually causes a strong increase in MSE), while the average reduction achieved by BP-MA is similar to that of TVC-MA (again, approximately 3 per cent). The four TVC models have similar sample distributions of gains, characterized by a pronounced skew to the right (several small gains and a few very large gains); furthermore, all four have a more dispersed distribution than the BP-MA model.
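A minimal sketch of this evaluation step (our illustration; preds maps method labels to the out-of-sample forecast series of one stock, aligned with y_out):

    import numpy as np

    def forecast_gains(y_out, preds, benchmark='OLS'):
        # Per-method out-of-sample MSE and relative gain over the OLS benchmark.
        mse = {j: np.mean((np.asarray(y_out) - np.asarray(f)) ** 2)
               for j, f in preds.items()}
        return {j: 1.0 - mse[j] / mse[benchmark] for j in mse}  # GAIN_{i,j}

A positive gain means that method j reduces the out-of-sample MSE relative to OLS.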

Conclusions
We have proposed a Bayesian regression model with time-varying coefficients (TVC). With respect to existing TVC models, we have introduced some technical innovations aimed at making TVC models less computationally expensive and completely automatic (by completely automatic we mean that regressors and regressands are the only inputs required from the econometrician, so that he/she does not need to engage in technically demanding specifications of priors and model parametrizations).
We have conducted several Monte Carlo experiments to understand the pros and cons that might be encountered when using the TVC model in applied econometric analyses. We have found that the cons are generally limited, in the sense that the TVC model has satisfactory estimation precision and forecasting performance also when regression coefficients are indeed stable or when coefficient instability is present but the TVC model is mis-specified. In the presence of coefficient instability, there are potential rewards from using the TVC model: in some cases, its estimation precision and forecasting accuracy are significantly better than those of competing models.
To demonstrate a real-world application of our TVC model, we have used it to estimate the exposures of S&P 500 stocks to market-wide risk factors. We have found that a vast majority of stocks have time-varying risk exposures and that the TVC model helps to better forecast these exposures.
Before concluding, two remarks on the applicability of our TVC model are in order. First, we have confined attention to single-equation regression models, but the results presented in the paper can be extended in a straightforward manner to multiple-equation models (for example VARs), by imposing the usual normal / inverse-Wishart priors on the initial parameters. Second, we have not discussed the use of the model for the analysis of cross-sectional data: however, it is possible to use TVC models like ours to analyze cross-sectional data in the presence of non-linearities that are not explicitly captured by the regressors (see West and Harrison, 1997); this is usually accomplished by replacing the time index t with the rank statistic of the regressor that is presumably responsible for the non-linearity.

Proofs of propositions 2 and 3
In this section we derive the formulae presented in Propositions 2 and 3. To facilitate the exposition, we start from simpler information structures and then we tackle the more complex information structure assumed in Propositions 2 and 3 and summarized in Assumption 1.

V and ϑ known, θ_1 unknown
We start from the simple case in which V and ϑ are both known. The assumptions on the priors and the initial information are summarized as follows:

Case 17 (Priors and initial information)
The priors on the unknown parameters are:

(θ_1 | D_0) ~ N(θ̂_{1,0}, F_{θ,1,0}),

and the initial information set is:

D_0 = {θ̂_{1,0}, F_{θ,1,0}, V, ϑ}.

Note that also W* = δ(ϑ) F*_{θ,1,0} and W = V δ(ϑ) F*_{θ,1,0} are known, because ϑ and V are known. The information sets D_t satisfy the recursion D_t = D_{t−1} ∪ {y_t, x_t}, starting from the set D_0. Given the above assumptions, as new information becomes available, the posterior distribution of the parameters of the regression can be calculated using the following results:

Proposition 18 (Forward updating) Let priors and initial information be as in Case 17. Then:

(θ_t | D_{t−1}) ~ N(θ̂_{t,t−1}, F_{θ,t,t−1}), (θ_t | D_t) ~ N(θ̂_{t,t}, F_{θ,t,t}),

where the means and variances of the above distributions are calculated recursively as:

θ̂_{t,t−1} = θ̂_{t−1,t−1}, F_{θ,t,t−1} = F_{θ,t−1,t−1} + W,
e_t = y_t − x_t θ̂_{t,t−1}, F_{y,t,t−1} = x_t F_{θ,t,t−1} x_t′ + V,
P_t = F_{θ,t,t−1} x_t′ / F_{y,t,t−1},
θ̂_{t,t} = θ̂_{t,t−1} + P_t e_t, F_{θ,t,t} = F_{θ,t,t−1} − P_t x_t F_{θ,t,t−1},   (12)

starting from the initial values θ̂_{1,0} and F_{θ,1,0}.
Proof. Note that, given the above assumptions, the system formed by the observation equation y_t = x_t θ_t + v_t and the transition equation θ_t = θ_{t−1} + w_t is a Gaussian linear state-space system. Hence, the posterior distribution of the states can be updated using the Kalman filter. The recursive equations (12) are just the usual updating equations of the Kalman filter (e.g., Hamilton, 1994).
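For reference, a compact sketch of recursions (12) (our rendering of the standard filter, not the authors' code):

    import numpy as np

    def kalman_forward(y, X, m0, P0, W, V):
        # Standard Kalman filter for y_t = x_t theta_t + v_t, theta_t = theta_{t-1} + w_t.
        T, k = X.shape
        m, P = m0.copy(), P0.copy()
        means, covs = np.empty((T, k)), np.empty((T, k, k))
        for t in range(T):
            x = X[t]
            if t > 0:
                P = P + W                  # F_{theta,t,t-1} = F_{theta,t-1,t-1} + W
            e = y[t] - x @ m               # forecast error e_t
            fy = x @ P @ x + V             # predictive variance F_{y,t,t-1}
            K = P @ x / fy                 # gain P_t
            m = m + K * e                  # theta_hat_{t,t}
            P = P - np.outer(K, x @ P)     # F_{theta,t,t}
            means[t], covs[t] = m, P
        return means, covs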
The smoothing equations are provided by the following proposition:

Proposition 19 (Backward updating) Let priors and initial information be as in Case 17. Then:

(θ_t | D_T) ~ N(θ̂_{t,T}, F_{θ,t,T}),

where the means and the variances of the above distributions are calculated recursively (backwards) as follows:

B_t = F_{θ,t,t} F_{θ,t+1,t}^{−1},
θ̂_{t,T} = θ̂_{t,t} + B_t (θ̂_{t+1,T} − θ̂_{t+1,t}),
F_{θ,t,T} = F_{θ,t,t} + B_t (F_{θ,t+1,T} − F_{θ,t+1,t}) B_t′,

and the backward recursions start from the terminal values of the forward recursions (12).

ϑ known, θ_1 and V unknown
In this subsection we relax the assumption that V (the variance of v_t) is known and we impose a Gamma prior on the reciprocal of V. The assumptions on the priors and the initial information are summarized as follows:

Case 20 (Priors and initial information) The priors on the unknown parameters are:

(θ_1 | D_0, V) ~ N(θ̂_{1,0}, V F*_{θ,1,0}), (1/V | D_0) ~ G(V̂_0, n_0),

and the initial information set is:

D_0 = {θ̂_{1,0}, F*_{θ,1,0}, V̂_0, n_0, ϑ}.

Note that also W* = δ(ϑ) F*_{θ,1,0} is known, because ϑ is known. The information sets D_t satisfy the recursion D_t = D_{t−1} ∪ {y_t, x_t}, starting from the set D_0. Given the above assumptions, the posterior distributions of the parameters of the regression can be calculated as follows:

Proposition 21 (Forward updating) Let priors and initial information be as in Case 20. Then:

(θ_t | D_t) ~ T(θ̂_{t,t}, V̂_t F*_{θ,t,t}, n_t), (y_t | x_t, D_{t−1}) ~ T(ŷ_{t,t−1}, V̂_{t−1} F*_{y,t,t−1}, n_{t−1}), (1/V | D_t) ~ G(V̂_t, n_t),

where the parameters of the above distributions are calculated recursively as in (12) (applied to the starred quantities, with V set equal to 1) and as follows:

n_t = n_{t−1} + 1, V̂_t = (1/n_t) (n_{t−1} V̂_{t−1} + e_t² / F*_{y,t,t−1}).

Proof. The proof is by induction. At time t = 1, p(θ_1 | D_0, V) and p(1/V | D_0) are the conjugate normal / inverse gamma priors of a standard Bayesian regression model with constant coefficients (e.g., Hamilton, 1994). Therefore, the usual results on the updating of these conjugate priors hold. Since θ_2 = θ_1 + w_2 and w_2 is normal with covariance V W* conditional on V, then, by the additivity of normal distributions, (θ_2 | D_1, V) is normal with mean θ̂_{2,1} = θ̂_{1,1} and covariance V (F*_{θ,1,1} + W*). Therefore, at time t = 2, p(θ_2 | D_1, V) and p(1/V | D_1) are again the conjugate normal / inverse gamma priors of a standard Bayesian regression model with constant coefficients. Proceeding in the same way as for t = 1, one obtains the desired result for t = 2 and, inductively, for all the other periods.

Posterior distributions of the coefficients that take into account all information received up to time T are calculated as follows:

Proposition 22 (Backward updating) Let priors and initial information be as in Case 20. Then:

(θ_t | D_T) ~ T(θ̂_{t,T}, V̂_T F*_{θ,t,T}, n_T),

where V̂_T and n_T are calculated as in Proposition 21 and the other parameters of the above distributions are calculated recursively (backwards) as in Proposition 19.
Proof. From Proposition 19, we know that, conditional on V, (θ_t | D_T, V) is normal with mean θ̂_{t,T} and covariance V F*_{θ,t,T}; marginalizing V out using (1/V | D_T) ~ G(V̂_T, n_T) yields the Student's t distribution in the proposition.

θ_1, V and ϑ unknown

In this subsection we relax the assumption that ϑ is known, using the same priors and initial information as in the propositions in the main text of the article (Propositions 2 and 3):

Case 23
The priors on the unknown parameters are:

(θ_1 | D_0, V) ~ N(θ̂_{1,0}, V F*_{θ,1,0}), (1/V | D_0) ~ G(V̂_0, n_0), p(ϑ_i | D_0) = p_{0,i}, i = 1, …, q,

and the initial information set is:

D_0 = {θ̂_{1,0}, F*_{θ,1,0}, V̂_0, n_0, p_{0,1}, …, p_{0,q}}.

The information sets D_t satisfy the recursion D_t = D_{t−1} ∪ {y_t, x_t}, starting from the set D_0. Note that the assumptions introduced in Cases 17 and 20 in the previous subsections had the only purpose of introducing the more complex Case 23. Given the above assumptions, the posterior distributions of the parameters of the regression can be calculated as follows:

Proposition 24 Let priors and initial information be as in Case 23. Let p_{t,i} = p(ϑ = ϑ_i | D_t). Then:

p(θ_t | D_t) = Σ_{i=1}^q p_{t,i} p(θ_t | D_t, ϑ = ϑ_i),
p(1/V | D_t) = Σ_{i=1}^q p_{t,i} p(1/V | D_t, ϑ = ϑ_i),
p(y_{t+1} | x_{t+1}, D_t) = Σ_{i=1}^q p_{t,i} p(y_{t+1} | x_{t+1}, D_t, ϑ = ϑ_i).

The mixing probabilities are obtained recursively as:

p_{t,i} = p_{t−1,i} p(y_t | x_t, D_{t−1}, ϑ = ϑ_i) / Σ_{j=1}^q p_{t−1,j} p(y_t | x_t, D_{t−1}, ϑ = ϑ_j),

starting from the prior probabilities p_{0,1}, …, p_{0,q}. The conditional densities are calculated for each i as in Propositions 18 and 21.
Proof. Conditioning on ϑ = ϑ_i, the distributions of the parameters θ_t and V and of the observations y_t are obtained from Propositions 18 and 21 (it suffices to note that D_t ∪ {ϑ = ϑ_i} carries the same information as D_t in Cases 17 and 20). Not conditioning on ϑ = ϑ_i, the distributions of the parameters θ_t and V and of the observations y_t are obtained by marginalizing their joint distribution with ϑ. For example:

p(θ_t | D_t) = Σ_{i=1}^q p(θ_t, ϑ = ϑ_i | D_t) = Σ_{i=1}^q p(θ_t | D_t, ϑ = ϑ_i) p_{t,i}.

The mixing probabilities are obtained using Bayes' rule:

p_{t,i} = p(ϑ = ϑ_i | D_{t−1} ∪ {y_t, x_t}) ∝ p(y_t | x_t, D_{t−1}, ϑ = ϑ_i) p_{t−1,i}.

Proposition 2 in the main text is obtained by combining Propositions 18, 21 and 24 above. Proposition 3 results from Propositions 19, 22 and 24 above.

Proof of proposition 14 (scale invariance)
When x_t R is the vector of regressors, the prior covariance is:

F̃*_{θ,1,0} = g (R′X′XR)^{−1} = R^{−1} F*_{θ,1,0} (R^{−1})′.

The constant ω is unaffected by the rotation:

ω̃ = (g/T) Σ_{t=1}^T x_t R (R′X′XR)^{−1} R′ x_t′ = (g/T) Σ_{t=1}^T x_t (X′X)^{−1} x_t′ = ω.

Since R does not depend on the data and δ(ϑ) = ϑ/(ω − ωϑ), the fact that ω does not change implies that also R (the support of ϑ) remains unchanged. The prior probabilities assigned to the elements of R also do not depend on the data. So, the prior distribution of ϑ is unaffected by the rotation. As far as the recursive equations in Proposition 2 are concerned, note that the initial conditions θ̂_{1,0} = 0, V̂_0 and n_0 are not affected by the rotation, while the initial condition F*_{θ,1,0} changes (it is pre-multiplied by R^{−1} and post-multiplied by (R^{−1})′).
For t > 0, it can be easily checked that ŷ_{t,t−1,i}, F*_{y,t,t−1,i}, e_{t,i}, n_{t,i} and V̂_{t,i} remain unchanged, while F*_{θ,t,t−1,i} and F*_{θ,t,t,i} are pre-multiplied by R^{−1} and post-multiplied by (R^{−1})′, and θ̂_{t,t−1,i}, θ̂_{t,t,i} and P_{t,i} are pre-multiplied by R^{−1}. Therefore:

(θ_t | D̃_t, ϑ = ϑ_i) = (R^{−1} θ_t | D_t, ϑ = ϑ_i), for all t ≤ T and i = 1, …, q,

where D̃_t denotes the information sets generated by the rotated regressors. The model probabilities p_{t,1}, …, p_{t,q} depend only on ŷ_{t,t−1,i}, F*_{y,t,t−1,i}, n_{t−1,i} and V̂_{t−1,i}, which remain unchanged, so they remain unchanged as well. As a consequence, also unconditionally:

(θ_t | D̃_t) = (R^{−1} θ_t | D_t).

Using similar arguments on the backward recursions of Proposition 3, it is possible to prove that (θ_t | D̃_s) = (R^{−1} θ_t | D_s) for any s ≤ T.

Tables
This section gathers all the tables described in the paper. All estimates of population quantities obtained from Monte Carlo simulations are complemented by an estimate of the Monte Carlo standard error (in parentheses).