A Bayesian Method for Characterizing Population Heterogeneity

A stylized fact from laboratory experiments is that there is much heterogeneity in human behavior. We present and demonstrate a computationally practical non-parametric Bayesian method for characterizing this heterogeneity. In addition, we define the concept of behaviorally distinguishable parameter vectors, and use the Bayesian posterior to say what percentage of the population lies in meaningful regions. These methods are then demonstrated using laboratory data on lottery choices and the rank-dependent expected utility model.


1. Introduction.
A stylized fact from laboratory experiments is that there is much heterogeneity in the subject population. How to characterize that heterogeneity is an active research area among experimentalists and econometricians. The approaches include individual parameter estimation, random coefficient models, mixture models of different types, and Bayesian methods. It is not the intention of this paper to explore all the methods, but rather to present and demonstrate a computationally practical, non-parametric Bayesian method to characterize the heterogeneity in a population of subjects.
Let $x_i$ denote the observed behavior of subject $i$, and let $f(x_i \mid \theta)$ denote the probability of $x_i$ given parameter vector $\theta$. One approach is to find the $\hat{\theta}_i$ that maximizes $f(x_i \mid \theta)$ for each $i$, and to treat each MLE as a random sample from the population. A scatter plot of $\{\hat{\theta}_i,\ i = 1, \dots, N\}$ gives a view of the sample distribution of $\theta_i$ in the population. However, the uncertainty of the MLEs is not represented in such a plot. Standard kernel density estimation methods are inappropriate because they essentially assume a common variance-covariance matrix $\Sigma$. Estimating a matrix $\Sigma_i$ for each $i$ entails many more parameter estimates, and any density estimation using these $\Sigma_i$ matrices would still depend upon additional assumptions about the kernel of each $i$, such as normality: $N(\hat{\theta}_i, \Sigma_i)$.
Random coefficient models assume a parametric form for the population distribution, $g(\theta \mid \lambda)$, where $\lambda$ is a low-dimensional parameter vector. Typically, $g(\theta \mid \lambda)$ is a family of unimodal distributions in which $\lambda$ stands for the mean and the $\Sigma$ matrix. Obviously, these parametric restrictions could be very wrong. For example, the simple scatter plot of the individual MLEs $\{\hat{\theta}_i,\ i = 1, \dots, N\}$ may have clusters that suggest the true distribution is multimodal.
One way to embrace the multimodal possibility is a mixture model of $K$ distributions $g(\theta \mid \lambda_k)$ for $k \in \{1, \dots, K\}$. In addition, one has the mixture parameters $\{\alpha_k \ge 0\}$ such that $\sum_k \alpha_k = 1$, so $g(\theta) \equiv \sum_k \alpha_k\, g(\theta \mid \lambda_k)$. Then, the econometric task is to estimate the coefficients $\{\alpha_k, \lambda_k,\ k = 1, \dots, K\}$. Obviously, this method still suffers from potential mis-specification via the parametric restrictions on the distributions. In addition, there are many specifications of $K$ distributions (typically called types) that yield the same aggregated $g(\theta)$, so without further restrictions, the model is under-identified.
To review the Bayesian approach, let $G$ denote the space of distributions $g(\theta)$, and let $\Delta(G)$ denote the space of probability measures on $G$. The standard Bayesian approach requires us to have a prior belief $\mu_0 \in \Delta(G)$. Note that $\mu_0$ is a probability measure on $G$, whereas $g$ is a point in $G$ and a probability measure on $\Theta$. Given observed behavior $x \equiv \{x_i,\ i = 1, \dots, N\}$, the posterior belief according to Bayes rule is
$$\mu(g \mid x) = \frac{\left[\prod_{i=1}^{N} \int f(x_i \mid \theta)\, g(\theta)\, d\theta\right] \mu_0(g)}{\int_G \left[\prod_{i=1}^{N} \int f(x_i \mid \theta)\, g'(\theta)\, d\theta\right] d\mu_0(g')}. \qquad (1)$$
Since both $G$ and $\Delta(G)$ are infinite dimensional spaces, in practice this is an impossible calculation to carry out exactly.
The paper is organized as follows. Section 2 presents the Bayesian approach. Section 3 presents the encompassing econometric model and the Hey and Orme (1994; hereafter HO) data set. Section 4 presents the results of our Bayesian method. Section 5 develops the formal concept of "behavioral distinguishability". Section 6 asks and answers pertinent questions about behaviorally distinguishable types. For example, we find that of the subpopulation that is behaviorally distinguishable from uniformly random, 84.2% is not behaviorally distinguishable from the ordinary Expected Utility (EU) model. Section 7 concludes.

2. Our Bayesian Alternative.
To develop our Bayesian approach, let $x_i$ denote the observed behavior of subject $i$, and let $f(x_i \mid \theta)$ denote the probability of $x_i$ given parameter vector $\theta \in \Theta$. Given a prior $g_0$ on $\Theta$, by Bayes rule, the posterior on $\Theta$ is
$$g(\theta \mid x_i) = \frac{f(x_i \mid \theta)\, g_0(\theta)}{\int f(x_i \mid z)\, g_0(z)\, dz}. \qquad (2)$$
However, eq (2) does not use information from the other subjects, even though those subjects were randomly drawn from a common subject pool. Let $N$ be the number of subjects in the data set. When considering subject $i$, it is reasonable to use as a prior, not $g_0$, but
$$g_i(\theta) = \frac{1}{N-1} \sum_{h \ne i} g(\theta \mid x_h). \qquad (3)$$
In other words, having observed $N-1$ subjects, $g_i(\theta)$ is the probability that the $N$th random draw from the subject pool will have parameter vector $\theta$. We then compute
$$\hat{g}_i(\theta \mid x) = \frac{f(x_i \mid \theta)\, g_i(\theta)}{\int f(x_i \mid z)\, g_i(z)\, dz}, \qquad (4)$$
where $x$ denotes the entire $N$-subject data set. Finally, we aggregate these posteriors to obtain
$$g^*(\theta \mid x) = \frac{1}{N} \sum_{i=1}^{N} \hat{g}_i(\theta \mid x). \qquad (5)$$
We can interpret $g^*(\theta \mid x)$ as the probability density that a random draw from the subject pool will have parameter vector $\theta$. Note that eq (5) puts equal weight on each $x_i$, so we are using each individual's data effectively only once, in contrast to empirical Bayes methods. Also note that while MCMC methods could be used to simulate random draws from each $g(\theta \mid x_i)$, eq (5) requires that each $\hat{g}_i(\theta \mid x)$ be properly normalized, so MCMC methods cannot be used to simulate random draws from $g^*(\theta \mid x)$.
Electronic copy available at: https://ssrn.com/abstract=3346557
When implementing this approach, we construct a finite grid on the parameter space $\Theta$ and replace the integrals by summations over the points in that grid. However, we do not need to integrate over the space of distributions $G$, so we avoid the need for a grid on $G$, which would be computationally infeasible.
Since eq(4) uses a prior that is informed by the data of N-1 other individuals, the influence of g0 in the first step is overwhelmed by the influence of the data. Thus, the specification of g0 is much less an issue and can be chosen based on computational ease.
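The grid computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the HO data: it assumes that the prior for subject i is the simple average of the other N-1 subjects' individual posteriors, and the function name `population_posterior`, the toy binomial likelihood, and the 19-point grid are all hypothetical choices made for the example.

```python
import numpy as np

def population_posterior(lik, g0):
    """Grid version of the two-step Bayesian update.

    lik : (N, M) array, lik[i, m] = f(x_i | theta_m) on an M-point grid.
    g0  : (M,) array of prior weights on the grid (sums to 1).
    Returns (g_star, g_hat); g_hat[i] is subject i's normalized posterior.
    """
    n = lik.shape[0]
    # Step 1: individual posteriors under the common prior g0.
    g_ind = lik * g0
    g_ind /= g_ind.sum(axis=1, keepdims=True)
    # Step 2 (assumed form): leave-one-out prior for subject i is the
    # average of the other N-1 individual posteriors.
    g_loo = (g_ind.sum(axis=0, keepdims=True) - g_ind) / (n - 1)
    # Step 3: update the leave-one-out prior with subject i's own data.
    g_hat = lik * g_loo
    g_hat /= g_hat.sum(axis=1, keepdims=True)
    # Step 4: equal-weight average over subjects.
    g_star = g_hat.mean(axis=0)
    return g_star, g_hat

# Toy data (hypothetical): 30 subjects, each making 20 binary choices with
# an unknown success probability theta drawn from a two-point population.
rng = np.random.default_rng(0)
grid = np.linspace(0.05, 0.95, 19)
true_theta = rng.choice([0.3, 0.7], size=30)
successes = rng.binomial(20, true_theta)
lik = grid[None, :] ** successes[:, None] * (1 - grid[None, :]) ** (20 - successes[:, None])
g0 = np.full(grid.size, 1.0 / grid.size)

g_star, g_hat = population_posterior(lik, g0)
```

Because every step is a sum over grid points, each subject's posterior is normalized exactly, which is what the final averaging step requires. With long choice sequences the likelihoods become very small, so a practical version would likely work with log-likelihoods.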
3. The Rank-Dependent Expected Utility Model and the HO Data.

a. The Behavioral Model.
The Rank-Dependent Expected Utility (RDEU) model was introduced by Quiggin (1982, 1993). A convenient feature is that it nests EU and Expected Monetary Value (EMV).
RDEU allows subjects to modify the rank-ordered cumulative distribution function of lotteries as follows. Let $Y \equiv \{y_0, y_1, y_2, y_3\}$ denote the set of potential outcomes of a lottery, where the outcomes are listed in rank order from worst to best. Given the rank-ordered cumulative distribution $F \equiv (F_0, F_1, F_2, F_3)$ of a lottery on $Y$, with $F_3 = 1$, the subject applies a monotonic transformation $H$ to $F$ and derives the modified outcome probabilities
$$h_0 = H(F_0), \quad h_1 = H(F_1) - H(F_0), \quad h_2 = H(F_2) - H(F_1), \quad h_3 = 1 - H(F_2).$$
A widely used parametric specification of the transformation function, suggested by Tversky and Kahneman (1992), is
$$H(F_j) \equiv \frac{F_j^{\beta}}{\left[F_j^{\beta} + (1 - F_j)^{\beta}\right]^{1/\beta}},$$
where $\beta > 0$. Obviously, $\beta = 1$ corresponds to the identity transformation, in which case the RDEU model is equivalent to the EU model.
Given value function $v(y_j)$ for potential outcome $y_j$, the rank-dependent expected utility is
$$U(F) \equiv \sum_{j=0}^{3} h_j\, v(y_j).$$
To confront the RDEU model with binary choice data ($F^A$ vs. $F^B$), we assume a logistic choice function:
$$\text{Prob}(F^A) = \frac{1}{1 + \exp\{-\gamma\,[U(F^A) - U(F^B)]\}},$$
where $\gamma \ge 0$ is the precision parameter. Prob($F^A$) gives the probability that lottery $F^A$ is chosen rather than lottery $F^B$. As in EU theory, w.l.o.g. we can assign a value of 0 to the worst outcome and a value of 1 to the best outcome. Accordingly, for the data we specify $v_0 \equiv v(y_0) = 0$ and $v_3 \equiv v(y_3) = 1$. This leaves two free utility parameters, $v_1 \equiv v(y_1)$ and $v_2 \equiv v(y_2)$, subject to monotonicity $v_2 \ge v_1$. Hence, the empirical RDEU model entails four parameters: $(\gamma, v_1, v_2, \beta)$.
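To make the behavioral model concrete, the following Python sketch computes the probability of choosing lottery A over lottery B. It assumes the standard RDEU construction (decision weights are first differences of the transformed cumulative distribution, with the Tversky and Kahneman (1992) transformation) and a logistic rule in the utility difference scaled by the precision parameter. The function names and the example lotteries are illustrative, not taken from the HO design.

```python
import numpy as np

def tk_weight(F, beta):
    """Tversky-Kahneman (1992) transformation of cumulative probabilities."""
    F = np.asarray(F, dtype=float)
    return F**beta / (F**beta + (1.0 - F) ** beta) ** (1.0 / beta)

def rdeu(probs, v, beta):
    """Rank-dependent expected utility of one lottery.

    probs : probabilities of outcomes y0..y3, ordered worst to best.
    v     : values (v0, v1, v2, v3), with v0 = 0 and v3 = 1.
    """
    F = np.clip(np.cumsum(probs), 0.0, 1.0)   # rank-ordered cumulative distribution
    H = tk_weight(F, beta)
    H[-1] = 1.0                               # guard against rounding at F = 1
    h = np.diff(np.concatenate(([0.0], H)))   # h0 = H(F0), hj = H(Fj) - H(Fj-1)
    return float(h @ np.asarray(v, dtype=float))

def prob_choose_A(pA, pB, gamma, v1, v2, beta):
    """Logistic probability that lottery A is chosen over lottery B."""
    v = (0.0, v1, v2, 1.0)
    diff = rdeu(pA, v, beta) - rdeu(pB, v, beta)
    return 1.0 / (1.0 + np.exp(-gamma * diff))

# Illustrative lotteries over (0, 10, 20, 30): a sure middle prize versus
# a 50:50 gamble between the worst and best prizes.
lottery_A = (0.0, 1.0, 0.0, 0.0)
lottery_B = (0.5, 0.0, 0.0, 0.5)
p_emv = prob_choose_A(lottery_A, lottery_B, gamma=10.0, v1=1/3, v2=2/3, beta=1.0)
```

With `beta = 1` the transformation is the identity, so `rdeu` reduces to ordinary expected utility, and with `gamma = 0` every choice probability is one half.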

b. The HO Data.
The HO dataset contains 100 unique binary choice tasks. 11 Each task was a choice between two lotteries with three prizes drawn from the set {0£, 10£, 20£, 30£}. A crucial design factor was the ratio of (i) the difference between the probability of the high outcome for lottery A and the probability of the high outcome for lottery B to (ii) the difference between the probability of the low outcome for lottery A and the probability of the low outcome for lottery B. It is insightful to represent this choice paradigm in a Machina (1982) triangle, as shown in Figure 1.
10 As pointed out by Harrison and Swarthout (2014), this specification implicitly assumes the "compound independence axiom". Since we view EU and RDEU as behavioral models, we are comfortable with this implicit assumption.
11 These 100 tasks were presented to the same subjects again one week later. We do not consider that data here because the hypothesis that the model parameters that best fit the first 100 choices also best fit the second 100 choices is rejected. Possible explanations for this finding are (i) that learning took place between the sessions, (ii) that preferences changed due to a change in external (and unobserved) circumstances, and (iii) that the subjects did not have stable preferences. Therefore, we focus our attention on the first 100 choice tasks.

Figure 1. Example of Lottery Choice Pairs
The ratio for the A-B pair is the slope of the line connecting A and B, which is greater than 1. The ratio for the A'-B' pair is the slope of the line connecting A' and B', which is clearly less than 1. According to EU, indifference curves are parallel straight lines with positive slope in this triangle, and the indifference curves of a risk-neutral subject would have slope equal to 1. A wide range of ratios was used in order to identify indifference curves and to test the implications of EU (as well as alternative theories). After all choices were completed, one task was randomly selected and the lottery the subject chose was carried out to determine monetary payoffs. 12
12 Loomes and Sugden (1998) is a study similar to Hey and Orme (1994), except that their analysis of the data is based on non-parametric tests involving the number of "reversals" and violations of dominance. Harrison and Rutström (2009) replicate HO and also run a similar experiment using 30 unique tasks. Bruhin et al. (2010) also explore heterogeneity, but they elicit certainty equivalents, so the task is arguably different from binary choices as in the other studies.
One can estimate the four RDEU parameters for each subject in the HO data set. That approach entails $4 \times 80 = 320$ parameters, even without the corresponding variance-covariance matrices. Table 1 gives the population mean and standard deviation of the point estimates. The last column "LL" gives the sum of the individually maximized log-likelihood values. Note that there is substantial heterogeneity across subjects in the parameter estimates for $\gamma$ and $\beta$. These comparisons involve estimates of a large number of parameters. For each individual subject, we obtain point estimates of the parameters, but no confidence interval.
One could use a bootstrap procedure to obtain variance-covariance matrices for each individual, but that would be a computationally intensive task and entail 12 additional parameters per subject.
Further, the estimates for each subject would ignore the fact that the subjects are random draws from a population of potential subjects and that therefore the behavior of the other subjects contains information that is relevant to each subject. In contrast, the Bayesian approach is better suited to extract information from the whole sample population. Consequently, we turn to the Bayesian approach.
4. Implementing our Bayesian Method.
When implementing our Bayesian method, we specify the prior $g_0$ as follows. For the logit precision parameter, we specify $\gamma = 20\ln[p/(1-p)]$ with $p$ uniform on [0.5, 0.999]. In this formulation, $p$ can be interpreted as the probability that an option with a 5% greater value will be chosen. Since the mean payoff difference between lottery pairs in the HO data set is about 5%, this is a reasonable scaling factor. 15 The pair $(v_1, v_2)$ is uniform on the unit triangle such that $v_2 \ge v_1$. The prior on $\beta$ is defined on an interval [… (3)]. 16 These three distributions are assumed to be independent.
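The prior on the precision and utility parameters can be laid out on a grid as in the Python sketch below. The grid sizes (11 points per axis) and the helper names are assumptions chosen for illustration; the text does not state the grid resolution used. The prior on the fourth parameter is omitted because its interval is not recoverable here.

```python
import numpy as np

def gamma_from_p(p):
    """Precision implied by p, the chance of picking an option worth 5% more."""
    return 20.0 * np.log(p / (1.0 - p))

def p_from_gamma(gamma):
    """Inverse map: probability of choosing the option with a 5% higher value."""
    return 1.0 / (1.0 + np.exp(-0.05 * gamma))

# Uniform grid on p over [0.5, 0.999] (grid size is illustrative).
p_grid = np.linspace(0.5, 0.999, 11)
gamma_grid = gamma_from_p(p_grid)

# Uniform grid on the unit triangle v2 >= v1 (grid size is illustrative).
v_axis = np.linspace(0.0, 1.0, 11)
v_pairs = [(v1, v2) for v1 in v_axis for v2 in v_axis if v2 >= v1]

# Independent uniform priors give equal weight to every (p, (v1, v2)) cell.
g0 = np.full((p_grid.size, len(v_pairs)), 1.0 / (p_grid.size * len(v_pairs)))
```

Placing the uniform grid on p rather than on the precision itself spreads the prior mass evenly over choice probabilities, while the implied precision values grow quickly as p approaches 0.999.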
Since we cannot display a four-dimensional distribution, we present two two-dimensional marginal distributions. Figure 2 shows the marginal on $(p(\gamma), \beta)$, where $p(\gamma) \equiv 1/[1 + \exp(-0.05\gamma)]$. 17 From Figure 2 we see that the distribution is concentrated around $\beta = 0.95$, and that the precision values are large enough to imply that a 5% difference in value is behaviorally significant (i.e., $p(\gamma) > 2/3$).

Figure 2. Marginal of $g^*$ on $(p(\gamma), \beta)$.
15 The following graphs and results are robust to this specification of the prior on $\gamma$.
16 95% of the individual MLEs for $\beta$ lie in this range. Using a wider interval for the prior on $\beta$ has no noticeable effect on the Bayesian posterior and comes at the cost of more grid points.
17 Thus, $p(\gamma)$ is the probability that the subject will choose the option with the greater value whenever that value is 5% higher than that of the other option.
Figure 3 shows the marginal on $(v_1, v_2)$. From Figure 3 we see that the distribution is concentrated along the line $v_2 = (v_1 + 1)/2$, which implies that utility is essentially linear above 0.
We can also see a spike near the EMV point (1/3, 2/3).
Given $g^*(\theta \mid x)$ we can compute several statistics. First, the log-likelihood of the HO data is LL($g^*$) = -3215.55. In contrast, the log-likelihood of the four-parameter RDEU representative-subject model is -4423.62. Obviously, the heterogeneity implicit in $g^*$ fits the data much better than a representative-agent model. 18 Compared to -2828.46 (Table 1), the log-likelihood from the Bayesian method appears to be much worse. However, the direct comparison is inappropriate. LL($g^*$) is computed as if each subject were drawn independently from $g^*$. In contrast, -2828.46 is the sum of individually computed log-likelihoods using the subject-specific estimated parameters.
18 One can consider this Bayesian approach as an alternative random parameter model of the kind used by Wilcox (2008). However, in contrast to Wilcox, we assume that each subject draws from this distribution once and uses those parameters for all choice tasks, rather than drawing anew for each choice task. The latter can be viewed as a "diverse" representative agent model, while the former is a heterogeneous agent model.
To test for over-fitting, we compute $g^*$ based only on the first 50 tasks in the HO data, and use this $g^*$ to predict the behavior for the second 50 tasks. We find that the log-likelihood of the latter is -1538.05. In contrast, using individual parameter estimates from just the first 50 tasks, the log-likelihood of the second 50 tasks is -1851.34. This result suggests that the approach of individual parameter estimates is more susceptible to over-fitting and less reliable than the Bayesian approach.
The most productive use of $g^*(\theta \mid x)$ is to test hypotheses. For example, we can ask what percent of the subject pool has $\beta = 1$. The answer is 10.5%; however, this number is an artifact of the discrete grid used for computation. Assuming $g^*$ is absolutely continuous, as the grid becomes finer and finer, we would expect the percentage with $\beta = 1$ to approach 0.
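On the other hand, what we really want to know is the percent of the population that is behaviorally indistinguishable from EU (i.e., $\beta = 1$). To make this notion precise, consider two parameter vectors $\theta$ and $\theta'$. The behavior is simply the choice data $x_i$ for a random subject.
To assess whether these data were generated by $\theta$ or $\theta'$, we typically compute the log of the likelihood ratio, $\ln[f(x_i \mid \theta)/f(x_i \mid \theta')]$, and attribute the data to $\theta$ when it is positive and to $\theta'$ when it is negative. Let $X_1$ denote the set of $x_i$ attributed to $\theta'$ and $X_2$ the set attributed to $\theta$. The type-I error rate $er_1$ is the probability, under $f(x_i \mid \theta)$, that $x_i$ lies in $X_1$; the type-II error rate $er_2$ is the probability, under $f(x_i \mid \theta')$, that $x_i$ lies in $X_2$. If either of these error rates is too large, we might say that $\theta$ and $\theta'$ are behaviorally indistinguishable. Classical statistics suggests that a proper test statistic would have these error rates not exceed 5%.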
Of course, by increasing the number of observations in xi, we can drive these error rates lower and lower. However, practical considerations often limit the number of observations. In laboratory experiments, boredom, time limitations and budget constraints place severe upper bounds on the number of observations. The HO dataset with 100 tasks is unusually large.
Moreover, to test for overfitting we would select a subset, say 50, to use for estimation, and the remaining 50 to assess parameter stability and prediction performance. Therefore, for the illustrative purposes of this paper, we use 50 as a reasonable sample size upon which to judge behavioral distinguishability. With 50 binary choices, there are $2^{50}$ ($\approx 10^{15}$) possible $x_i$ vectors.
For the tests we want to conduct, generating all these possible $x_i$ vectors and computing $er_1$ and $er_2$ is obviously not feasible. Instead, we generate 1000 $x_i$ vectors from $f(x_i \mid \theta)$ and 1000 from $f(x_i \mid \theta')$. 19 Then, $er_1$ is approximated by the proportion of $x_i$ generated by $f(x_i \mid \theta)$ that lie in $X_1$, and $er_2$ is approximated by the proportion of $x_i$ generated by $f(x_i \mid \theta')$ that lie in $X_2$.
In summary, we define $\theta$ and $\theta'$ to be behaviorally distinguishable if both of the simulated type-I and type-II error rates are less than or equal to 5%, and to be behaviorally indistinguishable otherwise.
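The simulation test can be sketched as follows in Python. The sketch takes as inputs the per-task probabilities of choosing option A under each of the two parameter vectors, so it does not depend on any particular behavioral model. Two choices here are assumptions rather than details from the text: simulated choice vectors with a log-likelihood ratio of exactly zero are counted as errors in both directions (so that identical parameter vectors come out indistinguishable), and the function names and seed are arbitrary.

```python
import numpy as np

def simulated_error_rates(pA_theta, pA_alt, n_sim=1000, seed=0):
    """Approximate type-I and type-II error rates by simulation.

    pA_theta, pA_alt : per-task probabilities of choosing A under the two
                       parameter vectors being compared.
    """
    rng = np.random.default_rng(seed)
    p0 = np.clip(np.asarray(pA_theta, dtype=float), 1e-12, 1 - 1e-12)
    p1 = np.clip(np.asarray(pA_alt, dtype=float), 1e-12, 1 - 1e-12)

    def llr(x):
        # Log-likelihood of each simulated choice vector under the first
        # parameter vector minus its log-likelihood under the second.
        ll0 = (x * np.log(p0) + (1 - x) * np.log(1 - p0)).sum(axis=1)
        ll1 = (x * np.log(p1) + (1 - x) * np.log(1 - p1)).sum(axis=1)
        return ll0 - ll1

    x0 = (rng.random((n_sim, p0.size)) < p0).astype(float)  # generated by the first
    x1 = (rng.random((n_sim, p1.size)) < p1).astype(float)  # generated by the second
    er1 = float(np.mean(llr(x0) <= 0))  # first is true, data do not favor it
    er2 = float(np.mean(llr(x1) >= 0))  # second is true, data do not favor it
    return er1, er2

def distinguishable(pA_theta, pA_alt, level=0.05, **kwargs):
    """True if both simulated error rates are at or below the given level."""
    er1, er2 = simulated_error_rates(pA_theta, pA_alt, **kwargs)
    return bool(er1 <= level and er2 <= level)

# Illustrative comparison on 50 tasks: a decisive chooser versus 50:50
# random choice, and a nearly random chooser versus 50:50 random choice.
random_choice = np.full(50, 0.5)
decisive = distinguishable(np.full(50, 0.95), random_choice)
nearly_random = distinguishable(np.full(50, 0.52), random_choice)
```

Summing the posterior mass over all grid points for which `distinguishable` returns `False` against a reference parameter vector then gives the share of the population that is behaviorally indistinguishable from that reference.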
The questions we are interested in answering are easily framed in terms of our behaviorally indistinguishable relationship on the parameters. To begin, we want to know what percent of the population is behaviorally indistinguishable from 50:50 random choices (hereafter referred to as Level-0 behavior). Since the latter entails the simple restriction that $\gamma = 0$, we can compute whether $\theta = (\gamma, v_1, v_2, \beta)$ is behaviorally indistinguishable from $(0, v_1, v_2, \beta)$, and then sum $g^*(\gamma, v_1, v_2, \beta)$ over all the grid points $(\gamma, v_1, v_2, \beta)$ that are behaviorally indistinguishable from $(0, v_1, v_2, \beta)$. The answer is 4.0%, which leaves 96.0% that is behaviorally distinguishable from Level-0. We are not interested in dissecting Level-0 behavior. Therefore, all our subsequent questions are conditional on the parameters being behaviorally distinguishable from Level-0.
Since Figure 3 provides strong evidence that the utility function is Linear Above Zero (LAZ), our next question is what percent of the population is behaviorally distinguishable from Level-0 but behaviorally indistinguishable from LAZ. The latter criterion can be stated as: $(\gamma, v_1, v_2, \beta)$ is behaviorally indistinguishable from $(\gamma, v_1, (1+v_1)/2, \beta)$. The answer is 92.5%. Hence, of the subpopulation that is behaviorally distinguishable from Level-0, 96.4% (= 92.5/96.0) is behaviorally indistinguishable from LAZ. We will not further dissect the non-LAZ subpopulation.
19 We also made these computations with only 100 simulated $x_i$ vectors, and found virtually the same results. Therefore, we are confident that 1000 simulated $x_i$ vectors are adequate for our purposes.
Perhaps the question of most interest is what percent are behaviorally indistinguishable from EU. To answer this, we ask how much mass $g^*$ puts on the set of parameters $(\gamma, v_1, v_2, \beta)$ that are behaviorally distinguishable from Level-0 but indistinguishable from $(\gamma, v_1, (1+v_1)/2, 1)$. The answer is 78.2%.
Our fourth and final question concerns the apparent aversion to 0 payoffs. What percent of the population are pure EMVs (i.e., maximize EMV with no aversion to 0 payoffs)? This additional criterion can be stated as: $(\gamma, v_1, v_2, \beta)$ is behaviorally indistinguishable from $(\gamma, 1/3, 2/3, 1)$. The answer is 12.5%. Hence, of the EU and LAZ subpopulation, 16.0% (= 12.5/78.2) have no aversion to a 0 payoff and 84.0% are averse to it. Aversion to 0 is akin to loss aversion, and the latter is a common result in the psychology literature (e.g., Kahneman and Tversky, 1979; Erev, Ert and Yechiam, 2008).
The population therefore divides into five sections. The first section, labelled "Level-0", represents the 4.0% that are behaviorally indistinguishable from Level-0. The second section, labelled "not LAZ", represents the 3.5% (= 96.0 - 92.5) that are behaviorally distinguishable from Level-0 and LAZ. The third section, labelled "Not Beta=1", represents the 14.3% (= 92.5 - 78.2) that are behaviorally distinguishable from Level-0 and $\beta = 1$ but not from LAZ. The fourth section represents the 12.5% that are pure EMV maximizers. The fifth and final section represents the 65.7% (= 78.2 - 12.5) that are EMV maximizers but with an aversion to 0.
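The section sizes follow from successive differences of the nested shares reported above, which the short Python check below reproduces. The dictionary keys reuse the section labels from the text where they exist; the labels for the last two sections are placeholders introduced here.

```python
# Nested shares reported in the text, in percent of the population.
level0 = 4.0   # indistinguishable from Level-0
laz = 92.5     # distinguishable from Level-0, indistinguishable from LAZ
eu = 78.2      # additionally indistinguishable from EU
emv = 12.5     # additionally indistinguishable from pure EMV

sections = {
    "Level-0": level0,
    "not LAZ": (100.0 - level0) - laz,
    "Not Beta=1": laz - eu,
    "pure EMV": emv,                    # placeholder label
    "EMV with aversion to 0": eu - emv, # placeholder label
}
total = sum(sections.values())

# Share of the non-Level-0 subpopulation that is indistinguishable from LAZ.
laz_share_of_non_level0 = 100.0 * laz / (100.0 - level0)
```

The five sections are mutually exclusive and sum to the whole population, so each nested criterion only ever removes mass from the group defined by the previous one.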

7. Conclusions and Discussion.
This paper has demonstrated the feasibility and usefulness of Bayesian methods when confronting laboratory data, especially when addressing heterogeneous behavior. Specifically, we have presented a nonparametric, computationally feasible approach. To extend our approach to models with more parameters, statistical sampling techniques can be employed to tame the curse of dimensionality. 21
Our Bayesian analysis has characterized substantial heterogeneity in the subject population. On the other hand, it has revealed that 78.2% of the population is behaviorally indistinguishable from EU behavior (84.2% of the subpopulation that is behaviorally distinguishable from Level-0). Another interesting finding is that the vast majority of subjects are behaviorally indistinguishable from having a linear utility function from 10£ to 30£, although a majority exhibit an aversion to a 0 payoff.
21 E.g. see Rubinstein and Kroese (2016