1. STATISTICAL SAFETY MODELING.
Abstract.
The hope is that by fitting multivariate statistical models to historical data about accidents, traffic, and traits of the road system, one can learn about the safety effect of many design elements. Whether this hope will materialize is not clear. Success will be evident when several studies using diverse data yield similar results. In this paper some methodological suggestions are made on how to improve the chance of success in this quest. To do good modeling it is not necessary to master sophisticated statistical software; a lowly spreadsheet is sufficient. What is necessary is intuition, familiarity with the data and their origins, willingness to explore and backtrack, and a lot of patience.
1. INTRODUCTION
Statistical safety modeling (SSM) is the fitting of a (multivariate) statistical model to data. The data are about past accidents and traits for a set of road segments, intersections or other infrastructure elements. The result of SSM is an equation with the estimate of expected accident frequency on the left and a function of traits on the right. Statistical safety modeling has two purposes:
A. To estimate the safety of an infrastructure element based on its traits;
B. To estimate the safety effect of a change in the traits of an infrastructure element.
On the surface, the required computations are similar for both purposes. For safety estimation (purpose A) one plugs trait values into the equation and computes the estimate of the expected accident frequency. For safety effect estimation (purpose B) one does so twice, once for each value of the trait whose effect is sought. This similarity in computation conceals a real difference between the two purposes; while purpose A is largely unproblematic, purpose B is fraught with difficulties.
To illustrate, suppose that, holding all other values constant, a model estimates X accidents with 10 foot lanes and Y accidents with 11 foot lanes. In the mathematical equation, the change of lane width seems to have caused the change in accident frequency. Does this mean that when building two roads that are identical except that one has 11 foot lanes and the other 10 foot lanes, one should expect the ratio of their accident frequencies to be about Y/X? The correct answer is either "No" or "We do not know". The reason is that, in the data set from which the model was produced, roads with 10 foot and 11 foot lanes may be of different vintage, may be located in different jurisdictions, and may differ in many other factors that are either imperfectly represented in the model or absent from it altogether. Therefore, the difference between X and Y reflects partly the complex influence of all the missing and imperfectly accounted-for factors and only partly the causal influence of lane width.
Still, it would be surprising if the causal factors that influence the probability of accident occurrence were not reflected in the data. It is important to hone the tools of analysis so that one can come closer to detecting and quantifying cause-effect relationships. The importance of the task derives mainly from the fact that cause-effect links for many design elements (cross-section, alignment etc.) are practically difficult to investigate by any other means.
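The computations described for purposes A and B can be sketched in a few lines. This is only an illustration: the multiplicative model form, the parameter values, and the names `expected_accidents`, `aadt` and `lane_width_ft` are assumptions made here, not results of any fitted model.

```python
import math

# Hypothetical "fitted" parameters; illustrative values only, not estimates from real data.
ALPHA = 0.002      # scale parameter
BETA_FLOW = 0.8    # exponent on traffic flow
BETA_LANE = -0.05  # coefficient on lane width, per foot

def expected_accidents(length_mi, aadt, lane_width_ft):
    """Assumed model equation: length * alpha * flow^beta * exp(gamma * lane width)."""
    return length_mi * ALPHA * aadt ** BETA_FLOW * math.exp(BETA_LANE * lane_width_ft)

# Purpose A: plug the traits of one segment into the equation.
x = expected_accidents(length_mi=1.0, aadt=5000, lane_width_ft=10)

# Purpose B: evaluate the equation twice, changing only the trait of interest.
y = expected_accidents(length_mi=1.0, aadt=5000, lane_width_ft=11)
ratio = y / x

print(f"X = {x:.3f}, Y = {y:.3f}, Y/X = {ratio:.3f}")
```

With this assumed form the ratio Y/X reduces to exp(BETA_LANE) whatever the other traits are. That is a property of the equation; as argued above, it need not be the causal effect of widening lanes by one foot.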
The aim of this paper is to suggest a few ways of improving the SSM process so that multivariate models come closer to representing cause and effect. Whether this aim is achievable is unclear. When models produced by different researchers and based on different data sets begin to produce similar results, this will be a sign that modeling is on the right track.
Many books have been written about multivariate statistical modeling, and sophisticated statistical software packages are now in common use. The books tend to dwell on the distribution of the dependent variable (accident counts in SSM), on approaches to the estimation of model parameters, on the precision of parameter estimates, on measures of goodness of fit etc. In this paper the emphasis is on an aspect of modeling that is less well represented in books - the question of choosing and improving the functional form of the model. The software packages, once acquired and mastered, make parameter estimation easy. This may be a mixed blessing. Because the cause-effect relationships are veiled by a multitude of interdependencies, modeling requires exploration, backtracking, intimate familiarity with the data etc.; it is not well served by routine and by automation. This is why, in this paper, the use of canned software is shunned.
2. THE ELEMENTS
The central element of SSM is the model equation; it is an equation used to predict the number of accidents of some kind that may be expected to occur per unit of time on an entity (e.g. a road segment or an intersection) as a function of its traits (traffic, geometry and environment). The use of a mathematical equation may create the illusion of scientific rigor. However, there is (at present) no theory to guide us on how the number of accidents should increase as traffic increases or how accident frequency should be related to the radius of a curve. Therefore, SSM is no more than curve-fitting, and a model equation is only a convenient way to summarize some regularities in the data. Still, the important decisions in SSM are about the overall form of the model equation and the choice of the functional form for each of the variables in the model. These issues are discussed in section 3.
The second element of SSM is the distributional assumption: the assumption made about the probability distribution of recorded accident counts around the mean which the model equation attempts to represent. Third, there is the criterion of optimization. The most common optimization criteria are those of minimizing the (weighted) residuals or of maximizing the likelihood function. It is the action of optimization (maximization or minimization) that facilitates the estimation of the unknown parameters in the model equation. In this paper, the second element (distribution of accident counts) and the third element (criterion of optimization) come together in section 4.
A model is not done when its parameters are estimated. One needs to ensure that the estimated model actually fits the data for all ranges of every variable; if it does not, the model needs to be revised and re-estimated. A new technique for the examination of residuals is described in section 5.
Finally, there is the suitable modeling environment, which in this paper (section 6) will be provided by the lowly spreadsheet.
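To make the first three elements concrete, the following minimal sketch puts them side by side. Everything specific in it is an assumption made for illustration: the data are invented, the model form mu = alpha * length * flow^beta and the Poisson distribution are merely convenient choices, and a plain grid search stands in for whatever optimizer (for instance a spreadsheet solver) one would actually use. Python is used only for compactness; the same steps can be laid out in spreadsheet columns.

```python
import math

# Invented data for illustration: (segment length, traffic flow, recorded accident count).
segments = [
    (1.0, 1000, 1), (2.0, 2000, 3), (1.5, 4000, 4),
    (0.8, 8000, 4), (1.2, 16000, 9), (2.5, 3000, 5),
]

def mean(alpha, beta, length, flow):
    # Element 1, the model equation (assumed form): mu = alpha * length * flow^beta.
    return alpha * length * flow ** beta

def log_likelihood(alpha, beta):
    # Element 2, the distributional assumption: counts are Poisson around mu,
    # so log P(y | mu) = y*log(mu) - mu - log(y!).
    total = 0.0
    for length, flow, y in segments:
        mu = mean(alpha, beta, length, flow)
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total

def best_alpha(beta):
    # For a fixed beta the Poisson likelihood is maximized when the sum of
    # fitted values equals the sum of recorded counts.
    recorded = sum(y for _, _, y in segments)
    return recorded / sum(length * flow ** beta for length, flow, _ in segments)

# Element 3, the criterion of optimization: maximize the likelihood,
# here by a simple grid search over beta between 0 and 2.
candidates = [i / 1000 for i in range(0, 2001)]
beta_hat = max(candidates, key=lambda b: log_likelihood(best_alpha(b), b))
alpha_hat = best_alpha(beta_hat)

print(f"alpha = {alpha_hat:.6f}, beta = {beta_hat:.3f}")
```

Changing the functional form means editing only `mean` (and `best_alpha`, which relies on alpha being a pure multiplier). That makes the kind of exploration and backtracking advocated in this paper cheap to carry out.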