P(X = x) = p^x q^(1−x), x ∈ {0, 1}, where p is the probability of success (a crash) and q = (1 − p) is the probability of failure (no crash). In general, if there are N independent trials (vehicles passing through an intersection, road segment, etc.) that give rise to a Bernoulli distribution, then it is natural to consider the random variable Z that records the number of successes out of the N trials. Under the assumption that all trials are characterized by the same failure process (this assumption is revisited later in the paper), the appropriate probability model that accounts for a series of Bernoulli trials is known as the binomial distribution, and is given as:

P(Z = n) = [N!/(n!(N − n)!)] p^n (1 − p)^(N−n)    (1)
where n = 0,1,2, . . ., N. In Eq. (1), n is defined as the number of crashes or collisions (successes). The mean and variance of the binomial distribution are E(Z) = Np and VAR(Z) = Np(1−p) respectively.
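These moments can be verified with a short simulation (a minimal sketch; the trial count N and crash probability p below are illustrative values, not taken from the paper):

```python
import random

random.seed(42)

N = 100        # number of Bernoulli trials (e.g., vehicles passing a site)
p = 0.1        # common probability of a crash on any one trial (illustrative)
REPS = 5000    # number of simulated observation periods

# Each replication draws Z = number of successes out of N identical trials.
counts = [sum(random.random() < p for _ in range(N)) for _ in range(REPS)]

mean = sum(counts) / REPS
var = sum((c - mean) ** 2 for c in counts) / REPS

print(f"simulated mean     = {mean:.2f} (theory: Np = {N * p:.2f})")
print(f"simulated variance = {var:.2f} (theory: Np(1-p) = {N * p * (1 - p):.2f})")
```

Note that the simulated variance falls below the simulated mean, as expected, since Np(1 − p) < Np whenever p > 0.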
These and other characteristics affecting the crash process create inconsistencies with the approximation illustrated in Eq. (2). Outcome probabilities that vary from trial to trial are known as Poisson trials (note: Poisson trials are not the summation of independent Poisson distributions; the term designates Bernoulli trials with unequal event probabilities). As discussed by Feller (1968), count data that arise from Poisson trials do not follow a standard distribution. However, the mean and variance for these trials share similar characteristics with the binomial distribution when the number of trials N and the expected value E(Z) are fixed. Unfortunately, these assumptions are not valid for crash data analysis: N is not known with certainty (it is an estimated value) and varies for each site.
The only difference between these two distributions is the computation of the variance. Drezner and Farnum (1993) and Vellaisamy and Punnen (2001) provide additional information about the properties described by Feller (1968). Nedelman and Wallenius (1986) maintain that many phenomena observed in nature tend to exhibit convex relationships, concavity being very rare. As an example, they referred to a paper by Taylor (1961), who evaluated 24 studies on the sampling of biological organisms for determining population sizes. Taylor found that 23 of the 24 studies followed an NB distribution. As discussed previously, crash data have been observed with variance-to-mean ratios above 1 (Abbess et al., 1981; Poch and Mannering, 1996; Hauer, 1997). Barbour et al. (1992) proposed several methods for determining whether independent events with unequal probabilities can be approximated by a Poisson process. They used the Stein–Chen method combined with coupling methods to validate these approximations. One of these procedures states that:
"Thus, since the individual p_i values for road crash data are almost certainly very small, the Poisson approximation to the total number of crashes occurring on a given set of roads over a given time period should be excellent."

One important limitation of the methods proposed by Barbour et al. is that the probability of each event must be known. Unfortunately, the individual crash risk (p_i) cannot be estimated in field studies, since it varies for each driver–vehicle combination and across road segments. Although Eqs. (5)–(8) are not ideally suited for motor vehicle crash data analyses, they illustrate that the Poisson and NB distributions are approximations of the underlying motor vehicle crash process, which is derived from a Bernoulli distribution with unequal event probabilities (Nedelman and Wallenius, 1986; Barbour et al., 1992). Data that do not meet the equality in Eq. (7) will show over-dispersion, representing a convex relationship. Crash data commonly exhibit this characteristic (see also Hauer, 2001 for additional information). In contrast, over-dispersion resulting from other types of processes (not based on Bernoulli trials) can be explained by the clustering of data (neighborhoods, regions, wiring boards, etc.), unaccounted-for temporal correlation, and model mis-specification. The reader is referred to Gourieroux and Visser (1986), Poormeta (1999) and Cameron and Trivedi (1998) for additional information on the characteristics of over-dispersion for different types of data.

3. Zero-inflated models
Under this distribution, n = 0, 1, 2, . . ., K are inflated counts, while the rest of the distribution, K + 1, K + 2, . . ., N, follows a Poisson process. Two different types of regression or predictive models have been proposed in the literature for handling this type of data. The first type is known as the hurdle model (Cragg, 1971; Mullahy, 1986). This type of model has not been extensively applied in the statistical field and is therefore not the focus of this paper. The reader is referred to Cragg (1971), Mullahy (1986) and Schmidt and Witte (1989) for additional information about hurdle models. The zero-inflated count models (also called zero-altered probability models or count models with added zeros) represent an alternative way to handle data with a preponderance of zeros. Since their formal introduction by Lambert (1992) (who expanded the work of Johnson and Kotz (1969)), the use of these models has grown considerably and can be found in numerous fields, including traffic safety. For example, these models have been applied in manufacturing (Lambert, 1992; Li et al., 1999), economics (Green, 1994), epidemiology (Heilbron, 1994), sociology (Land et al., 1996), trip distribution (Terza and Wilson, 1990) and political science (Zorn, 1996), among others. Consider a transportation survey that asks how many times the respondent has taken mass transit to work during the past week. An observed zero could arise in two distinct ways. First, last week a respondent may have opted to take the vanpool instead of mass transit. Second, the respondent may never take mass transit to work, so that a zero is the only response he or she could give.

4. What empirical crash data tell us

It is possible that the low exposure may explain the preponderance of zeros in the data. The prevalence of zeros in the last dataset is probably explained by another phenomenon, which is discussed below. Table 2 summarizes the crash rate by functional class for rural highways in the United States. This table clearly shows that arterial and collector rural segments are more dangerous than interstate highways.
In fact, given the exposure, a person is about five times more likely to be involved in a crash on a minor collector, and about two times more likely on a principal arterial, than on a freeway. Other researchers have confirmed the results shown in Table 2. For instance, Amoros et al. (2003) found that rural collector roads are generally about twice as dangerous as rural freeway segments for eight counties in France. Brown and Baass (1995) evaluated crash rates by severity for different types of highway located within the vicinity of Montreal, Quebec. They reported that a driver is about two to three times more likely to be involved in a collision on rural principal and arterial roads than on freeways. The crashes were also found to be more severe on principal arterial roads. The characteristics illustrated in Tables 1 and 2 show interesting, if not counterintuitive, results. In a general sense, it is puzzling that a typical rural highway that has inherently safe segments would, at the same time, belong to a group classified as the most dangerous type of highway. This observation can be explained in one of two ways: either (1) rural highways include many inherently safe segments and a few tremendously dangerous segments, which results in an overall average crash rate much higher than that of freeway segments, or (2) crashes simply do not follow a dual-state process, but a single process characterized by low exposure. Another point worth mentioning is that rural highway segments characterized by high exposure or traffic volumes have never been found to follow a dual-state process (Hauer and Persaud, 1995; Harwood et al., 2000). Another characteristic detailed in Table 1 concerns the high percentage of zeros (80%) for approaches of signalized three-legged intersections located in Singapore (Kumara and Chin, 2003).
At first glance, this high percentage appears counterintuitive, especially since intersection-related crashes usually account for about half of all crashes occurring in urban environments (NHTSA, 2000). In another study, Lord (2000) found that three-legged signalized intersections in Toronto, Ontario, experienced on average 4.8 crashes per year, and only 10% of the intersections had zero crashes (all approaches combined). This observation merits further reflection: given the characteristics of each city, the effort each places on improving safety, and the same level of exposure, are three-legged signalized intersections in Toronto truly three times more dangerous than those in Singapore? Variations in driver behavior and traffic signal design may explain a portion of this difference. However, given that 80% of the approaches had no crashes, the excess zeros at signalized intersections in Singapore are more likely attributable to non-reported crashes, and perhaps to other differences to a lesser degree. Kumara and Chin (2003) acknowledged that the excess zeros could be explained by non-reported crashes.
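The implausibility of a single Poisson state generating 80% zeros can be checked with a quick back-of-the-envelope calculation (a sketch only; the Toronto mean of 4.8 crashes per year is from Lord (2000) as cited above, while the mean computed for Singapore is simply the value a Poisson process would need in order to match the reported zero fraction):

```python
import math

# Under a Poisson model with mean m, the probability of observing zero
# crashes in a period is P(0) = exp(-m).
toronto_mean = 4.8                      # crashes/year per intersection (Lord, 2000)
p0_toronto = math.exp(-toronto_mean)    # implied fraction of zero-crash sites

# Conversely, the Poisson mean that would yield the 80% zeros reported
# for Singapore approaches is m = -ln(0.80).
singapore_zero_fraction = 0.80
implied_mean = -math.log(singapore_zero_fraction)

print(f"Poisson P(0) at mean 4.8       : {p0_toronto:.4f}")
print(f"Poisson mean implying 80% zeros: {implied_mean:.3f}")
```

A single Poisson state producing 80% zeros would require a mean of only about 0.22 crashes per approach-year, while a mean of 4.8 implies well under 1% zeros; since the Toronto figure is per intersection and the Singapore figure is per approach, this is only an order-of-magnitude contrast, but it supports the view that the excess zeros reflect non-reported crashes or very low exposure rather than a dual-state crash process.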