STATISTICAL SAFETY MODELING

Ezra Hauer, 35 Merton Street, Toronto, M4S 3G4, Ezra.Hauer@utoronto.ca

5389 words + 7 figures + 1 table

Abstract

The hope is that by fitting multivariate statistical models to historical data about accidents, traffic, and traits of the road system, one can learn about the safety effect of many design elements. Whether what is hoped for will materialize is not clear. Success will be evident when several studies using diverse data yield similar results. In this paper some methodological suggestions are made on how to improve the chance of success in this quest. To do good modeling it is not necessary to master sophisticated statistical software; a lowly spreadsheet is sufficient. What is necessary is intuition, familiarity with the data and its origins, willingness to explore and backtrack, and a lot of patience.

1. INTRODUCTION

Statistical safety modeling (SSM) is the fitting of a (multivariate) statistical model to data. The data are about past accidents and traits for a set of road segments, intersections or other infrastructure elements. The result of SSM is an equation with the estimate of expected accident frequency on the left and a function of traits on the right. The purposes of statistical safety modeling are two:

A. To estimate the safety of an infrastructure element based on its traits;
B. To estimate the safety effect of a change in the traits of an infrastructure element.

On the surface, the required computations are similar for both purposes. For safety estimation (purpose A) one plugs trait values into the equation and computes the estimate of the expected accident frequency. For safety effect estimation (purpose B) one does so twice, once for each value of the trait whose effect is sought. This similarity in computation conceals a real difference between the two purposes; while purpose A is largely unproblematic, purpose B is fraught with difficulties. To illustrate, suppose that, holding all other values constant, a model estimates X accidents with 10 foot lanes and Y accidents with 11 foot lanes. In the mathematical equation the difference between X and Y is attributed to lane width alone. It does not follow that, when building two otherwise identical roads except that one has 11' lanes and the other 10' lanes, one should expect their accident frequencies to differ by Y-X. In the data, the roads with 10 foot lanes and those with 11 foot lanes may be of different vintage, located in differing jurisdictions, and may differ in many other factors that are either imperfectly represented in the model or absent from it altogether. Therefore, the difference between X and Y reflects partly the complex influence of all the missing and imperfectly accounted-for factors and only partly the causal influence of lane width.

Still, it would be surprising if the causal factors that influence the probability of accident occurrence were not reflected in data. It is important to hone the tools of analysis so that one can come closer to detecting and quantifying cause-effect relationships. The importance of the task derives mainly from the fact that cause-effect links for many design elements (cross-section, alignment, etc.) are practically difficult to investigate by any other means. The aim of this paper is to suggest a few ways of improving the SSM process so that multivariate models come closer to representing cause and effect. Whether this aim is achievable is unclear.
When models produced by different researchers and based on different data sets begin to produce similar results, it will be a sign that modeling is on the right track.

Many books have been written about multivariate statistical modeling and sophisticated statistical software packages are now in common use. The books tend to dwell on the distribution of the dependent variable (accident counts in SSM), on approaches to the estimation of model parameters, on the precision of parameter estimates, on measures of goodness of fit, etc. In this paper the emphasis is on an aspect of modeling that is less well represented in books - the question of choosing and improving the functional form of the model. The software packages, once acquired and mastered, make parameter estimation easy. This may be a mixed blessing. Because the cause-effect relationships are veiled by a multitude of interdependencies, modeling requires exploration, backtracking, intimate familiarity with the data, etc.; it is not well served by routine and by automation. This is why, in this paper, the use of canned software is shunned.

2. THE ELEMENTS

The central element of SSM is the model equation; it is an equation used to predict the number of accidents of some kind that may be expected to occur per unit of time on an entity (e.g. a road segment or an intersection) as a function of its traits (traffic, geometry and environment). The use of a mathematical equation may create the illusion of scientific rigor. However, there is (at present) no theory to guide us on how the number of accidents should increase as traffic increases or how accident frequency should be related to the radius of a curve. Therefore, SSM is no more than curve-fitting, and a model equation is only a convenient way to summarize some regularities in the data. Still, the important decisions in SSM are about the overall form of the model equation and the choice of the functional form for each of the variables in the model. These issues are discussed in section 3.

The second element of SSM is the distributional assumption; the assumption made about the probability distribution of recorded accident counts around the mean which the model equation attempts to represent. Third, there is the criterion of optimization. The most common optimization criteria are those of minimizing the sum of (weighted) squared residuals or of maximizing the likelihood function. It is the action of optimization (maximization or minimization) that facilitates the estimation of the unknown parameters in the model equation. In this paper, the second element (distribution of accident counts) and the third element (criterion of optimization) come together in section 4.

A model is not done when its parameters are estimated. One needs to ensure that the estimated model actually fits the data for all ranges of every variable and, if required, it needs to be revised and re-estimated. A new technique for the examination of residuals is described in section 5. Finally, there is the suitable modeling environment, which in this paper (section 6) will be provided by the lowly spreadsheet.

3. THE MODEL EQUATION

The generic model equation is

y = f(X1, X2, X3, . . ., Xn; β1, β2, β3, . . ., βm)   (1)

where y denotes the expected number of accidents per unit of time for entities with trait values X1, X2, X3, . . ., Xn, when the parameters are β1, β2, β3, . . ., βm, and f() is some function. In books on statistical modeling relatively little is said about choosing the variables and shaping the function f() so that it can represent the regularities hidden in the data. Not all functions f() can serve equally well.
Consider, e.g., the linear additive model often used in the past,

y = (Segment Length)·(β0 + β1X1 + β2X2 + . . . + βnXn).

(For intersections set Segment Length = 1.) To illustrate the main deficiency, think of X1 as representing traffic flow and X2 as representing lane width. The essential logic of the additive model form implies that a certain lane width adds a fixed number of accidents to y/(Segment Length) irrespective of whether AADT = 1,000 or 10,000. A reflection of the same deficiency is the logical blemish that even if there was no traffic on an entity, the model would predict some accidents. The second deficiency is rooted in the limitations of the linear building blocks. Suppose, e.g., that the heavier the traffic, the less likely an out-of-control vehicle is to run off the road without first colliding with another vehicle. If so, the number of single-vehicle accidents at AADT = 10,000 will be less than ten times the number at AADT = 1,000. In other words, the linear form βX cannot represent a non-linear reality. For these and similar reasons the linear additive model form is a poor choice.

More popular nowadays are multiplicative models of the form y = β0·X1^β1·X2^β2·. . . or y = β0·e^(β1X1 + β2X2 + . . .), and models consisting of products of factors X^β and e^(βX). Such models also have shortcomings. First, the factors X^β and e^(βX) can take only the shapes in Figures 1 and 2; they cannot represent a relationship that has a peak, a valley, or points of inflection. Thus, e.g., were it true that accident frequency diminishes when lane width increases from 10' to 11' but increases when lane width increases from 12' to 13', such a relationship could not be found in the modeling process using these factors.

Figure 1. The shapes X^β can take.

Figure 2. The shapes e^(βX) can take.

Second, the representation of road segment safety does require the use of some additive components. Point hazards (driveways, narrow bridges and the like) occupy only a short portion of a segment; the number of accidents which they generate depends mostly on the traffic and conditions in their vicinity and not on, say, the overall length of the segment. Accidents generated by point hazards need to be added to the number that would have materialized without the point hazard. To show that multiplication cannot replace addition, consider a segment on which 1 non-driveway related accident is expected and on which a driveway is expected to add 0.1 accidents. Thus, with 1 driveway there would be 1.1 accidents, with two driveways 1.2, and with four driveways 1.4 accidents, on the average. To represent the contribution of these driveways by a multiplicative factor, driveway density would have to be used. The doubling of driveway density from 1 to 2 should have a factor of 1.2/1.1 = 1.09 and a further doubling from 2 to 4 should have a factor of 1.4/1.2 = 1.17. Thus, there is no single multiplicative factor that can adequately represent the influence of driveway density.

These considerations make it possible to suggest a model equation form that is freer of obvious flaws. First, in principle, it should allow for multiplicative and additive components. The role of the multiplicative component is to represent the influence of factors that naturally apply to a stretch of road whereas the role of the additive component is to account for the influence of point hazards. Second, the building blocks of both component types should not be restricted to the functional forms βX, X^β or e^(βX). The choice of functional form for the building blocks should be guided by the need to replicate what the data suggest; a small illustrative sketch of such a combined form follows.
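To make the combined multiplicative-and-additive structure concrete, the following is a minimal Python sketch. The functional forms, variable names and parameter values in it are illustrative assumptions only; they are not the models estimated in this paper.

```python
import math

def predicted_accidents(length_mi, aadt, lane_width_ft, n_driveways,
                        b0, b1, b2, b3, b4):
    """Illustrative model equation: a multiplicative part that scales with
    segment length and traffic, times a factor for lane width, plus an
    additive term for point hazards (driveways).  All functional forms and
    parameter values here are assumptions chosen only for illustration."""
    # Multiplicative part: applies to the road stretch as a whole.
    flow_effect = b0 * (aadt / 10000.0) ** b1          # shape X^beta
    lane_effect = math.exp(b2 * (lane_width_ft - 12))  # shape e^(beta*X)
    multiplicative = length_mi * flow_effect * lane_effect
    # Additive part: each driveway contributes accidents of its own,
    # roughly in proportion to the traffic passing it.
    additive = n_driveways * b3 * (aadt / 10000.0) ** b4
    return multiplicative + additive

# Example use with made-up parameter values.
y = predicted_accidents(length_mi=1.0, aadt=8000, lane_width_ft=11,
                        n_driveways=3, b0=1.2, b1=0.8, b2=-0.05,
                        b3=0.04, b4=0.5)
print(round(y, 3))
```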
Thus, the general form of a model equation could be the sum of a multiplicative component and an additive component,

y = (multiplicative component) + (additive component),   (2)

in which the multiplicative component is a product of building blocks,

(annual scale parameter) · (Segment Length) · f(AADT) · g(X1) · h(X2) · . . .,   (3)

and the additive component is a sum of building blocks, one for each kind of point hazard.

In equation 2, y is the estimate of the annual number of accidents expected to occur on a road segment in a given year. The annual scale parameter may change from year to year, reflecting changes in weather, road user demography, accident reporting and the like. Segment Length is set to 1 for intersections, grade crossings and similar facilities. In equation 3, f(), g(), h(), i(), j(), . . . are functions of one or of several variables. In practice it is seldom possible to get cues from the data about the shape of a two- or more-dimensional function; most building blocks are therefore functions of a single variable.

3.1 Choosing Variables and Functional Forms

In building up the model equation, the role of each variable requires individual attention. A decision must be made concerning whether to retain a given variable in the final model and, if so, with what functional form. This is best accomplished by introducing single variables or variable pairs into the model, one after another. The functional form of each constituent building block is to be chosen so that it can replicate the important features of the relationship present in the data - its peaks, valleys, and points of inflection. This can be done as follows.

Suppose that some variables were already introduced into the model and that the question is whether the next variable, V, should be introduced and, if yes, by what functional form its influence should be represented. Begin by grouping the entities into bins such that each bin contains entities with the same or similar values of V = V1, V2, . . ., Vi, . . . . Thus, if V stands for Lane Width, one bin might contain the segments with 10 foot lanes, another those with 11 foot lanes, etc.; if V stands for %Trucks the bins might be for V1 = 0%-2%, V2 = 3%-4%, etc.

For each entity in a bin we have the recorded number of accidents and the number of accidents predicted by the current model, which does not yet include the variable V. Summing over all entities in each bin, compute the sum of recorded accidents Nrecorded(V) and the sum of accidents predicted by the current model Npredicted(V). By examining the relationship between Nrecorded(V) and Npredicted(V) over the bins one can judge whether V should enter the model and what functional form can represent it. Since variables can be represented either in the multiplicative or the additive portion of the model, two cases need to be examined.

Case a. The variable is to be represented in the multiplicative portion of the model. For all chosen values of V compute:

R(V) = Nrecorded(V)/Npredicted(V).   (4)

When R(V) > 1, too few accidents are predicted by the model for entities with that V; when R(V) < 1, too many accidents are predicted. To make predicted and recorded accidents equal, the current model equation needs to be multiplied by a factor similar to R(V). The process is illustrated for V = Lane Width with the numbers in Table 1, which are based on the modeling of on-the-road injury accidents for four-lane undivided urban roads described in (1). Thus, e.g., the group of road segments with 10 foot lanes in the data base recorded 163 accidents while the model, which at that point already accounted for six variables, predicted 160.9 accidents for the same road segments.

Table 1. R Ratios for Lane Width (Injury Accidents)

V=Lane Width (ft.)   Predicted   Recorded   R=Recorded/Predicted   σ      R+σ    R-σ
10                     160.9       163        1.01                 0.08   1.09   0.93
11                     719.4       698        0.97                 0.04   1.01   0.93
12                    1251.2      1278        1.02                 0.03   1.05   0.99
13                     308.7       307        0.99                 0.06   1.05   0.94
14                      90.9        82        0.90                 0.10   1.00   0.80
15                       3.0         6        2.02                 0.83   2.85   1.20

If the points R(V) form an orderly relationship when plotted against V, a functional form that would fit these points can be identified. This functional form is a good candidate to represent the influence of V in the model (provided that the multiplicative part of the model makes up the bulk of y, which is usually the case).
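The case a computation is easy to script outside the spreadsheet as well. The following minimal sketch (the column names, bin edges and the pandas-based layout are assumptions made only for illustration) bins the entities on V, sums recorded and predicted accidents per bin, and computes R(V):

```python
import pandas as pd

# Hypothetical data frame: one row per entity, with the recorded accident
# count, the prediction of the current model (which does not yet include V),
# and the candidate variable V (here, percent trucks).
df = pd.DataFrame({
    "recorded":   [3, 0, 1, 2, 5, 0, 4, 1],
    "predicted":  [2.4, 0.6, 1.1, 1.9, 4.2, 0.8, 3.1, 1.3],
    "pct_trucks": [1.5, 3.2, 2.8, 6.0, 7.5, 4.1, 9.3, 5.5],
})

# Bin the candidate variable V (bin edges are illustrative).
bins = [0, 2, 4, 6, 8, 10]
df["V_bin"] = pd.cut(df["pct_trucks"], bins=bins)

# Sum recorded and predicted accidents within each bin and compute R(V).
summary = df.groupby("V_bin", observed=True)[["recorded", "predicted"]].sum()
summary["R"] = summary["recorded"] / summary["predicted"]
print(summary)
# If the R values form an orderly relationship when plotted against V,
# a function fitted to them is a candidate multiplicative building block.
```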
The main consideration for introducing a variable into the model is whether the R(V) points form an orderly relationship with V; if they do, an appropriate functional form is chosen to represent it. If no orderly relationship is evident, the variable may not need to be introduced into the model. In Table 1 there is no evidence of an orderly relationship between R and Lane Width. Therefore this variable was not used. A different conclusion was reached in the same project (1) when the relationship between R and %Trucks was examined; the R points formed an orderly relationship, shown in Figure 3.

Figure 3. R versus %Trucks

To represent this relationship the functional form β1·exp[β2(%Trucks)] + β3(%Trucks) was used and its parameters estimated. The result is in Figure 4.

Figure 4. Multiplier for %Trucks variable

Case b. The variable is to be represented in the additive portion of the model. For all chosen values of V compute:

R(V) = [Nrecorded(V) - Npredicted(V)]/(number of entities in the bin).   (5)

When R(V) > 0, too few accidents are predicted by the model for entities with that V; when R(V) < 0, too many accidents are predicted. To make predicted and recorded accidents equal, R(V) needs to be added to the additive portion of the model equation. From there on the procedure and considerations parallel those of case a.

Variables used in road safety modeling are usually correlated. One principal path for correlation is through AADT; roads that are heavily traveled are usually built and maintained to more generous standards than lightly traveled roads. It follows that while the order in which variables are introduced into the model does not affect the parameter estimates (since all parameters are re-estimated when a new variable enters the model), the order of variable introduction may affect the choice of functional form. Thus, e.g., one may expect the values R(%Trucks) to depend on whether the predicted values were produced with only AADT in the model or, say, with AADT, Speed Limit, Parking and Horizontal Alignment in the model. This is why the choice of functional form for each variable may have to be revisited during the course of modeling. In the project used here for illustration, the existence and nature of the relationship between R and %Trucks was revisited after all 11 variables were in the model. It remained virtually unchanged. Because of the dominant role of AADT, it should be the first variable to be introduced.

3.2 Variables the Value of Which Changes Within a Segment

When preparing the data for SSM, the data record is usually a segment of road, and the aim is to make segments homogeneous, so that AADT, lane width, shoulder width, etc. remain the same throughout the segment. This is only partly achievable, for two reasons. First, the road segment should not be shorter than the precision with which accident location is determined. If the accident location is determined to the nearest tenth of a mile, then segments should not be shorter than 0.1 miles, with the 0.1-mile milepost at their center. Second, some variable values cannot really be kept constant. Thus, e.g., the grade changes continuously along a vertical curve. Similarly, a segment may be part straight and part horizontal curve. For these two reasons a modeling device needs to be created to accommodate variables the value of which changes within a segment.

To illustrate how this can be done, consider a road segment of length L. This segment is made up of two sub-segments (n = 2) such that for length L1 the grade is approximately G1 and for the remaining length L2 = L - L1 the grade is approximately G2. On sub-segment 1 one expects y1 accidents per year and on sub-segment 2 one expects y2 accidents per year. By equations 2 and 3 (leaving the additive component out of this illustration) one expects for the entire segment

y = y1 + y2 = [L1·g(G1) + L2·g(G2)] · (product of the remaining multiplicative factors).   (6)

Equation 6 can be re-arranged into:

y = L·[(L1/L)·g(G1) + (L2/L)·g(G2)] · (product of the remaining multiplicative factors).   (7)

In general, a segment may consist of n sub-segments, where the length of sub-segment i is Li and Vi is the value of some variable V on sub-segment i. The segment length L is the sum of the n values L1 + L2 + . . . + Ln. Let p(V) denote the joint multiplicative contribution of the V1, V2, . . ., Vn. Following the earlier reasoning,

p(V) = (L1/L)·g(V1) + (L2/L)·g(V2) + . . . + (Ln/L)·g(Vn),   (8)

that is, the length-weighted average of the building block g() evaluated at the sub-segment values.
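A minimal sketch of this length-weighted contribution in Python (the building block g(), the grades and the coefficient in the example are illustrative assumptions only):

```python
import math

def weighted_contribution(sub_lengths, sub_values, g):
    """Length-weighted multiplicative contribution p(V) of a variable whose
    value changes within a segment (a sketch of equation 8); g is whatever
    building-block function was chosen for the variable V."""
    total_length = sum(sub_lengths)
    return sum((L_i / total_length) * g(V_i)
               for L_i, V_i in zip(sub_lengths, sub_values))

# Illustration: a 1.0-mile segment, 0.3 miles at 2% grade and 0.7 miles at
# 5% grade, with a made-up building block g(grade) = e^(0.08*grade).
p = weighted_contribution([0.3, 0.7], [2.0, 5.0],
                          g=lambda grade: math.exp(0.08 * grade))
print(round(p, 4))
```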
The approach presented here was introduced into SSM by S-P. Miaou et al. (2).

4. THE NEGATIVE MULTINOMIAL LIKELIHOOD FUNCTION

It is common to assume that accident counts come from a Poisson or Negative Binomial distribution. This is adequate if the data represent a short enough period of time over which all the traits of the entities, including traffic, can be assumed to remain approximately constant. When this assumption is strained, a more general representation of the accident count distribution is appropriate. The following derivation of the negative multinomial likelihood function is based on Guo (3).

The mean accident frequency μij for entity i in time period j is given by the product

μij = ηij·θi.   (9)

In this, ηij is the mean of the μij in an imagined population of entities with the same measured traits as those of entity i. Multiplication by θi converts the population mean (ηij) into the mean of entity i (μij). The multiplier θi is a random variable, the probability density of which is in equation 10. For this distribution the mean is 1 and the variance is the reciprocal of its shape parameter.

For entity i we record accident counts ai1, ai2, . . ., aij, . . ., ai,ni in time periods 1, 2, . . ., j, . . ., ni. Assume that these accident counts are Poisson distributed with means ηij·θi. If so, for θi given,

P(ai1, . . ., ai,ni | θi) = Π_j [e^(-ηij·θi)·(ηij·θi)^aij / aij!].   (11)

However, when only the probability distribution in equation 10 is given, the probability to record the accident counts ai1, ai2, . . ., ai,ni is obtained by averaging equation 11 over that distribution (equation 12). With the notational shorthand of equation 13, equation 14 gives the probability to record ai1, . . ., ai,ni. Replacing ηij by yij from the model equation turns this probability into the likelihood of the data for entity i. Since we seek parameters that maximize the logarithm of the likelihood, the constant containing the factorials of the aij can be omitted. The contribution to the log-likelihood of entity i over time periods 1 to ni is then given by equation 15, and the maximum likelihood parameters are those that maximize the sum of these contributions over all entities (equation 16).

This formulation of the likelihood function has one principal advantage: it correctly accommodates accident counts of the same entity recorded in several time periods while the traits of the entity, most importantly traffic, change from period to period. This has the practical consequence that data sets for SSM can cover more than a few years and thereby become much richer, which may lead to more reliable models.
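Under the common assumption that θi follows a gamma distribution with mean 1 and shape φ (the symbol φ and this particular parameterization are assumptions of the sketch; the paper's own notation is not reproduced here), the averaging described above has a closed form, and the log-likelihood contribution of an entity can be computed as in the following minimal Python sketch:

```python
import math

def nm_loglik_entity(counts, predictions, phi):
    """Log-likelihood contribution of one entity under a Poisson-gamma
    (negative multinomial) model: counts in different years share a common
    gamma multiplier with mean 1 and variance 1/phi.  The additive constant
    involving the factorials of the counts is omitted, as in the text."""
    A = sum(counts)        # total recorded accidents of the entity
    Y = sum(predictions)   # total predicted accidents of the entity
    return (math.lgamma(phi + A) - math.lgamma(phi)
            + phi * math.log(phi)
            - (phi + A) * math.log(phi + Y)
            + sum(a * math.log(y) for a, y in zip(counts, predictions)))

def nm_loglik(all_counts, all_predictions, phi):
    """Total log-likelihood: the sum of entity contributions."""
    return sum(nm_loglik_entity(c, p, phi)
               for c, p in zip(all_counts, all_predictions))

# Illustration with two road segments observed for three years each.
counts      = [[2, 0, 1], [5, 3, 4]]
predictions = [[1.4, 1.2, 1.3], [4.1, 3.6, 3.9]]
print(round(nm_loglik(counts, predictions, phi=2.0), 3))
```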
5. RESIDUALS

Various measures of overall goodness of fit are in use. Because the SSMs are to be used for prediction, it is important that the predicted values be a good fit for all practical values of a variable, not only a good fit overall. The CURE (CUmulative REsiduals) procedure described below is useful in this circumstance. It was first published in (5).

The difference between the number of recorded and the number of predicted accidents for an entity and time period is the residual. Residuals are commonly examined to judge whether an appropriate functional form has been chosen and how well the model fits the data. Textbooks and statistical packages provide guidance in the matter. Unfortunately, in road safety applications the standard guidance is seldom useful. Consider, e.g., Figure 5, which is a conventional plot of residuals of PDO accidents against AADT for the road segments used in the earlier examples (1). There is not much that can be learned from this plot; in particular, it does not show whether the function X^β1·exp(β2X) that was chosen to represent the influence of AADT in that project makes for a good fit for all ranges of AADT.

Figure 5. Residuals versus AADT/10,000

Figure 6. Cumulative Residuals versus AADT/10,000

To produce a CURE plot for a variable, the residual is computed for each segment. The segments are then sorted in the ascending order of the variable. Once so sorted, the residuals are cumulated. This is the bold jagged line in Figure 6 when the data of Figure 5 are used. Thus, e.g., the sum of all residuals up to AADT = 20,000 is about -80 accidents. (The total number of accidents in this data set was 2754.)

A good CURE plot is one which moves up and down and oscillates around 0. Such a plot means that the model fits well in all ranges of the variable. A bad CURE plot is one which is all above or all below 0, except at the edges. If there is a wide range of the variable where the plot is steadily increasing (decreasing), as, e.g., between AADT 12,000 and 18,000 in Figure 6, then in that range the predicted values tend to be systematically smaller (larger) than the numbers of recorded accidents. In that range the model fit is poor and can perhaps be improved by a suitable modification of the functional form. A vertical jump in the CURE plot, such as that near AADT = 38,000, marks an unusually large residual - a data point for which the number of recorded accidents is very different from what the model predicts. This should trigger a more detailed investigation and, later, a decision whether or not to retain the data point for modeling.

Assuming that the residuals are statistically independent, one can find limits beyond which the CURE plot should stray only rarely. Such limits are shown in Figure 6 by symmetrical dotted lines. The CURE plot shown encroaches on the lower limit and is therefore only marginally acceptable. The encroachment here is partly the consequence of the outlier point which, in this case, was kept in the data. The theory behind the dotted limits is described below.

Let the residuals be numbered consecutively such that N is the total number of data points (residuals), n is an integer between 1 and N, and S(n) is the sum of the residuals from 1 to n. For a perfect model the mean of all residuals is 0 and therefore so is the mean of their sums. If so, the variance of S(n), to be denoted σ²(n), can be estimated from the graph of cumulative squared residuals such as that in Figure 7, which pertains to the data used in Figures 5 and 6.

Figure 7. Estimates of σ²(n)

Thus, e.g., with statistically independent residuals, at AADT = 20,000 (n = 721) σ²(n) is estimated as 860, at AADT = 30,000 (n = 1390) as 3800, and at AADT = 66,400 (n = 1193) as 10,720. Even if the individual residuals do not have the same variance, one can invoke some loose version of the central limit theorem to argue that the sum S(n) is approximately normally distributed with mean 0 and variance read off a graph such as Figure 7. With this assumption, the probability density for the realization that the random walk reaches S = s at n and returns to 0 at N is the product of two normal probability densities, one with mean 0 and variance σ²(n) and the other with mean 0 and variance σ²(N) - σ²(n):

f(s) ∝ exp{-s²/[2σ²(n)]}·exp{-s²/[2(σ²(N) - σ²(n))]}.   (17)

Collecting the exponents (equations 18 and 19) shows that the product in equation 17 represents a normal probability density with mean 0 and standard deviation σ*, where

σ*² = σ²(n)·[1 - σ²(n)/σ²(N)].

In a normal distribution the probability to exceed 2σ*, the limits shown in Figure 6, is less than 0.03.

The plot of cumulative squared residuals estimates σ²(n) on the assumption that E{S(n)} = 0. This will almost never be exactly true. If so, σ²(n) is in truth somewhat less than the value read from the cumulative squared residuals, and the 2σ* limits are correspondingly somewhat conservative.
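A minimal sketch of the CURE computation (using NumPy; the data values are made up for illustration): it sorts the residuals by the variable of interest, cumulates them, and computes the ±2σ* limits from the cumulative squared residuals, with σ*² = σ²(n)·[1 - σ²(n)/σ²(N)] as derived above.

```python
import numpy as np

def cure_plot_data(values, recorded, predicted):
    """CURE computation: sort residuals by the variable of interest,
    cumulate them, and compute the +/- 2*sigma* limits."""
    order = np.argsort(values)
    resid = np.asarray(recorded, float)[order] - np.asarray(predicted, float)[order]
    s = np.cumsum(resid)            # cumulative residuals S(n)
    var_n = np.cumsum(resid ** 2)   # estimate of sigma^2(n)
    var_N = var_n[-1]               # sigma^2(N)
    sigma_star = np.sqrt(var_n * (1.0 - var_n / var_N))
    return np.asarray(values, float)[order], s, 2 * sigma_star

# Illustration with made-up data.
aadt = [3000, 12000, 8000, 20000, 5000, 15000]
rec  = [1, 4, 2, 7, 1, 5]
pred = [1.2, 3.5, 2.4, 6.1, 1.5, 5.8]
x, cum_resid, limit = cure_plot_data(aadt, rec, pred)
for xi, si, li in zip(x, cum_resid, limit):
    print(f"AADT={xi:7.0f}  S(n)={si:6.2f}  +/-2sigma*={li:5.2f}")
```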
6. MODELING ON A SPREADSHEET

The data used in SSM are historical records, not designed experiments. Box et al. (6) devote ten pages to describing the multitude of obstacles that confront attempts to make sense of happenstance data by multivariate regression. The task is certainly daunting and the obstacles real. Unfortunately, the conduct of safety experiments with roads is difficult and rare; happenstance data is all we have. Success in making sense of such data is going to elude researchers if the task of interpretation is tackled by the routine application of canned software. What is required for success is intuition, exploration, visualization, backtracking and patience. For this, a common spreadsheet may be the best environment.

Most of what has been discussed in this paper can be done on a spreadsheet. A workable layout is to reserve, at the top of the spreadsheet, cells for the values of the unknown parameters (the β's, the annual scale parameters, and the dispersion parameter). Below this, each road segment (or intersection) is represented by a row. Each row will have three parts for three kinds of entries: a. data elements, b. components of the model equation, and c. computations for the log-likelihood expression. Data elements that change from year to year (accident counts, AADT, etc.) are in separate columns; data elements that remain constant over the years (Segment Length, degree of curve, etc.) get one column.

Once the data are in the spreadsheet, the building of the model equation can begin. Set aside sufficient columns to account for the influence of each variable. Thus, e.g., if six years of AADT and accident counts are used, six columns for the function f(AADT) are needed. The formula in each such cell refers to the data elements of the row (Segment Length, AADT_year) and to the unknown parameter values for which space has been reserved (Scale Parameter_year, β1, β2). Once some parameter values are entered, the formula will return a number. These numbers are assembled, as products or as sums, into the model predictions, each year with its own prediction. For road segment i and year j the model prediction has been called yij. The next part of the row is devoted to the elements of the log-likelihood given by equation 13. This too can be built part by part, and once an initial value of the dispersion parameter is entered, a number for the log-likelihood contribution of the row materializes in its last column. The sum of these over all rows is the current log-likelihood for the parameter values in the reserved cells. Using the spreadsheet's optimization tool (e.g. Solver) one can then find the parameter values that maximize the log-likelihood. Using the same spreadsheet tools one can easily prepare a CURE plot or a plot of R values for the next variable to be introduced. Once a promising functional form for the next variable is identified, it is entered in the reserved cells of the spreadsheet, added to the columns where the model equation is assembled, new parameters are estimated, and the process is repeated for as long as necessary.
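The same workflow can be mirrored outside the spreadsheet. The sketch below uses illustrative variable names, a deliberately simple model form, and a plain Poisson log-likelihood standing in for the negative multinomial of section 4; it assembles predictions, replaces the spreadsheet's Solver step with a numerical optimizer, and computes residuals for a CURE check.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: one row per segment, with length, AADT and a 3-year
# accident count.  The model form below (y = 3 * length * b0 * AADT^b1)
# and all names are illustrative assumptions, not the models of reference (1).
length = np.array([0.5, 1.2, 0.8, 2.0])
aadt   = np.array([4000., 12000., 7000., 20000.])
counts = np.array([1, 6, 2, 11])          # accidents recorded in 3 years

def predicted(params):
    b0, b1 = params
    return 3.0 * length * b0 * (aadt / 10000.0) ** b1   # 3 years of exposure

def neg_loglik(params):
    y = predicted(params)
    if np.any(y <= 0):
        return np.inf                      # keep the search in a valid region
    # Poisson log-likelihood with constant terms dropped; the negative
    # multinomial contribution of section 4 could be substituted here.
    return -np.sum(counts * np.log(y) - y)

# The spreadsheet's Solver step, done here with a numerical optimizer.
result = minimize(neg_loglik, x0=[1.0, 1.0], method="Nelder-Mead")
b0, b1 = result.x
print("estimates:", round(b0, 3), round(b1, 3))

# Residuals for a CURE check of the fitted model.
residuals = counts - predicted(result.x)
print("residuals:", np.round(residuals, 2))
```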
7. SUMMARY

Learning about reality from happenstance data is difficult, perhaps impossible. For road safety, the attempt to do so is important because other research approaches are presently not available. In this paper an attempt has been made to focus on what (to the author) seems important: the specification of the model equation and an approach to modeling that is free from the constraints of routine and automation.

References

1. Hauer, E., F.M. Council and Y. Mohammedshah. Safety Models for Urban Four-lane Undivided Road Segments. Submitted for presentation and publication at the 83rd Annual Meeting of the Transportation Research Board, 2004.
2. Miaou, S-P., P.S. Hu, T. Wright, A.K. Rathi and S.C. Davis. Relationship Between Truck Accidents and Geometric Design: A Poisson Regression Approach. Transportation Research Record 1376, Transportation Research Board, National Research Council, 1992, pp. 10-18.
3. Guo, G. Negative multinomial regression models for clustered event counts. Sociological Methodology, Vol. 26, 1996, pp. 113-132.
4. Hauer, E. Overdispersion in modelling accidents on road sections and in Empirical Bayes estimation. Accident Analysis and Prevention, Vol. 33, 2001, pp. 799-808.
5. Hauer, E. and J. Bamfo. Two tools for finding what function links the dependent variable to the explanatory variables. ICTCT 97, Lund, Sweden, 1997, pp. 1-19.
6. Box, G.E.P., W.G. Hunter and J.S. Hunter. Statistics for Experimenters. John Wiley & Sons, New York, 1978.

