B.3 Fitting Discrete Distributions
This section illustrates how to model and fit a discrete distribution to data. Although the steps in modeling discrete and continuous distributions are very similar, the processes and tools utilized vary somewhat. The first thing to truly understand is the difference between discrete and continuous random variables. Discrete distributions are used to model discrete random variables. Continuous distributions are used to model continuous random variables. This may seem obvious but it is a key source of confusion for novice modelers.
Discrete random variables take on any of a specified countable list of values. Continuous random variables take on any numerical value in an interval or collection of intervals. The source of confusion is that when looking at a file of the data, you might not be able to tell the difference. The discrete values may have decimal places, leading the modeler to think that the data is continuous. The modeling starts with what is being collected and how it is being collected (not with looking at a file!).
B.3.1 Fitting a Poisson Distribution
Since the Poisson distribution is very important in simulation modeling, the discrete input modeling process will be illustrated by fitting a Poisson distribution. Example B.1 presents data collected from an arrival process. As noted in Table B.1, the Poisson distribution is a prime candidate for modeling this type of data.
Since there are four 15-minute intervals in each hour of the two-hour time period, there are 8 observations per day. Since there are 5 days per week, we have 40 observations per week for a total of \(40\times 4= 160\) observations. A sample of observations per 15-minute interval is presented in Table B.2. The full data set is available with the chapter files. Check whether a Poisson distribution is an appropriate model for this data.
Week | Period | M | T | W | TH | F |
---|---|---|---|---|---|---|
1 | 9:00-9:15 am | 8 | 5 | 16 | 7 | 7 |
1 | 9:15-9:30 am | 8 | 4 | 9 | 8 | 6 |
1 | 9:30-9:45 am | 9 | 5 | 6 | 6 | 5 |
1 | 9:45-10:00 am | 10 | 11 | 12 | 10 | 12 |
1 | 10:00-10:15 am | 6 | 7 | 14 | 9 | 3 |
1 | 10:15-10:30 am | 11 | 8 | 7 | 7 | 11 |
1 | 10:30-10:45 am | 12 | 7 | 8 | 3 | 6 |
1 | 10:45-11:00 am | 8 | 9 | 9 | 8 | 6 |
2 | 9:00-9:15 am | 10 | 13 | 7 | 7 | 7 |
3 | 9:00-9:15 am | 5 | 7 | 14 | 8 | 8 |
4 | 9:00-9:15 am | 7 | 11 | 8 | 5 | 4 |
4 | 10:45-11:00 am | 8 | 9 | 7 | 9 | 6 |
The solution to Example B.1 involves the following steps:
1. Visualize the data.
2. Check if the week, period, or day of week influence the statistical properties of the count data.
3. Tabulate the frequency of the count data.
4. Estimate the mean rate parameter of the hypothesized Poisson distribution.
5. Perform goodness of fit tests to test the hypothesis that the number of arrivals per 15-minute interval has a Poisson distribution versus the alternative that it does not have a Poisson distribution.
B.3.2 Visualizing the Data
When analyzing a data set, it is best to begin by visualizing the data. We will analyze this data utilizing the R statistical software package. Assuming that the data is in a comma separated value (csv) file called PoissonCountData.csv in the R working directory, the following commands will read the data into the R environment, plot a histogram, plot a time series plot, and make an autocorrelation plot of the data.
p2 = read.csv("PoissonCountData.csv")
hist(p2$N, main="Computer Lab Arrivals", xlab = "Counts")
plot(p2$N,type="b",main="Computer Lab Arrivals", ylab = "Count", xlab = "Observation#")
acf(p2$N, main = "ACF Plot for Computer Lab Arrivals")
Table B.3 illustrates the layout of the data frame in R. The first column indicates the week, the 2nd column represents the period of the day, the 3rd column indicates the day of the week, and the last column, denoted by the variable \(N\), represents the count for the week, period, day combination. Notice that each period of the day is labeled numerically. To access a particular column within the data frame you use the $ operator. Thus, the reference \(p2\$N\) accesses all the counts across all of the week, period, day combinations. The variable \(p2\$N\) is subsequently used in the hist, plot, and acf commands.
Week | Period | Day | N |
---|---|---|---|
1 | 1 | M | 8 |
1 | 2 | M | 8 |
1 | 3 | M | 9 |
1 | 4 | M | 10 |
1 | 5 | M | 6 |
1 | 6 | M | 11 |
1 | 7 | M | 12 |
1 | 8 | M | 8 |
2 | 1 | M | 10 |
2 | 2 | M | 6 |
2 | 3 | M | 7 |
2 | 4 | M | 11 |
2 | 5 | M | 10 |
2 | 6 | M | 10 |
2 | 7 | M | 5 |
2 | 8 | M | 13 |
3 | 1 | M | 5 |
3 | 2 | M | 15 |
3 | 3 | M | 5 |
3 | 4 | M | 7 |
As can be seen in Figure B.3, the data has a somewhat symmetric shape with nothing unusual appearing in the figure. The shape of the histogram is consistent with possible shapes associated with the Poisson distribution.
The time series plot, shown in Figure B.4, illustrates no significant patterns. We see random looking data centered around a common mean value with no trends of increasing or decreasing data points and no cyclical patterns of up and down behavior.
An autocorrelation plot allows the dependence within the data to be quickly examined. An autocorrelation plot is a time series assessment tool that plots the lag-k correlation versus the lag number. In order to understand these terms, we need to provide some background on how to think about a time series. A time series is a sequence of observations ordered by observation number, \(X_{1}, X_{2},...X_{n}\). A time series, \(X_{1}, X_{2},...X_{n}\), is said to be covariance stationary if:
The mean of \(X_{i}\), \(E[X_{i}]\), exists and is a constant with respect to time. That is, \(\mu = E[X_{i}]\) for \(i = 1, 2, \ldots, n\).
The variance of \(X_{i}\), \(Var[X_{i}]\), exists and is constant with respect to time. That is, \(\sigma^{2} = Var[X_{i}]\) for \(i = 1, 2, \ldots, n\).
The lag-k autocorrelation, \(\rho_{k} = cor[X_{i},X_{i+k}]\), is not a function of time \(i\) but is a function of the distance \(k\) between points. That is, the correlation between any two points in the series does not depend upon where the points are in the series, it depends only upon the distance between them in the series.
Recall that the correlation between two random variables is defined as:
\[cor[X,Y] = \frac{cov[X,Y]}{\sqrt{Var[X] Var[Y]}}\]
where the covariance, \(cov[X,Y]\) is defined by:
\[cov[X,Y] = E[\left(X-E[X]\right)\left(Y-E[Y]\right)] = E[XY] - E[X]E[Y]\]
The correlation between two random variables \(X\) and \(Y\) measures the strength of linear association. The correlation has no units of measure and has a range: \(\left[-1, 1 \right]\). If the correlation between the random variables is less than zero then the random variables are said to be negatively correlated. This implies that if \(X\) tends to be high then \(Y\) will tend to be low, or alternatively if \(X\) tends to be low then \(Y\) will tend to be high. If the correlation between the random variables is positive, the random variables are said to be positively correlated. This implies that if \(X\) tends to be high then \(Y\) will tend to be high, or alternatively if \(X\) tends to be low then \(Y\) will tend to be low. If the correlation is zero, then the random variables are said to be uncorrelated. If \(X\) and \(Y\) are independent random variables then the correlation between them will be zero. The converse of this is not necessarily true, but an assessment of the correlation should tell you something about the linear dependence between the random variables.
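As a small numerical illustration of these definitions, the following sketch uses arbitrary made-up data to compute the correlation directly from the covariance and variances and compares it to R's built-in cor function:

# arbitrary example data to illustrate the definition of correlation
x = c(2, 4, 6, 8, 10)
y = c(1, 3, 2, 5, 4)
# correlation computed from the definition: cov[X,Y]/sqrt(Var[X]*Var[Y])
cov(x, y) / sqrt(var(x) * var(y))
# the same value from R's built-in correlation function
cor(x, y)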
The autocorrelation between two random variables that are \(k\) time points apart in a covariance stationary time series is given by:
\[\begin{split} \rho_{k} & = cor[X_{i},X_{i+k}] = \frac{cov[X_{i},X_{i+k}]}{\sqrt{Var[X_{i}] Var[X_{i+k}]}}\\ & = \frac{cov[X_{i},X_{i+k}]}{\sigma^2} \; \text{for} \; k = 1,2,\dots \; \end{split}\]
A plot of \(\rho_{k}\) for increasing values of \(k\) is called an autocorrelation plot. The autocorrelation function as defined above is the theoretical function. When you have data, you must estimate the values of \(\rho_{k}\) from the actual time series. This involves forming an estimator for \(\rho_{k}\). (Law 2007) suggests plotting:
\[\hat{\rho}_{k} = \frac{\hat{C}_{k}}{S^{2}(n)}\]
where
\[\begin{equation} \hat{C}_{k} = \frac{1}{n-k}\sum\limits_{i=1}^{n-k}\left(X_{i} - \bar{X}(n) \right)\left(X_{i+k} - \bar{X}(n) \right) \tag{B.1} \end{equation}\]
\[\begin{equation} S^{2}(n) = \frac{1}{n-1}\sum\limits_{i=1}^{n}\left(X_{i} - \bar{X}(n) \right)^2 \tag{B.2} \end{equation}\]
\[\begin{equation} \bar{X}(n) = \frac{1}{n}\sum\limits_{i}^{n} X_{i} \tag{B.3} \end{equation}\]
are the sample covariance, sample variance, and sample average respectively.
Some time series analysis books, see for examples Box, Jenkins, and Reinsel (1994), have a slightly different definition of the sample autocorrelation function:
\[\begin{equation} r_{k} = \frac{c_{k}}{c_{0}} = \frac{\sum\limits_{i=1}^{n-k}\left(X_{i} - \bar{X}(n) \right)\left(X_{i+k} - \bar{X}(n) \right)}{\sum\limits_{i=1}^{n}\left(X_{i} - \bar{X}(n) \right)^2} \tag{B.4} \end{equation}\]
where
\[\begin{equation} c_{k} = \frac{1}{n}\sum\limits_{i=1}^{n-k}\left(X_{i} - \bar{X}(n) \right)\left(X_{i+k} - \bar{X}(n) \right) \tag{B.5} \end{equation}\]
Notice that the numerator in Equation (B.4) has \(n-k\) terms and the denominator has \(n\) terms. A plot of \(r_{k}\) versus \(k\) is called a correlogram or sample autocorrelation plot. For the data to be uncorrelated, \(r_{k}\) should be approximately zero for all values of \(k\). Unfortunately, estimators of \(r_{k}\) are often highly variable, especially for large \(k\); however, an autocorrelation plot should still provide you with a good idea of independence.
Formal tests can be performed using various time series techniques and their assumptions. A simple test can be performed based on the assumption that the series is white noise, \(N(0,1)\), with all \(\rho_{k} = 0\). Box, Jenkins, and Reinsel (1994) indicate that for large \(n\), \(\text{Var}(r_{k}) \approx \frac{1}{n}\). Thus, a quick test of dependence is to check whether the sampled correlations fall within a reasonable confidence band around zero. For example, suppose \(n = 100\); then \(\text{Var}(r_{k}) \approx \frac{1}{100} = 0.01\) and the standard deviation is \(\sqrt{0.01} = 0.1\). Assuming an approximate \(90\%\) confidence level yields a confidence band of \(\pm 1.645 \times 0.1 = \pm 0.1645\) about zero. Therefore, as long as the plotted values for \(r_{k}\) do not fall outside of this confidence band, it is likely that the data are independent.
A sample autocorrelation plot can be easily developed once the autocorrelations have been computed. Generally, the maximum lag should be set to no larger than one tenth of the size of the data set because the estimation of higher lags is generally unreliable.
As we can see from Figure B.5, there does not appear to be any significant correlation with respect to observation number for the computer lab arrival data. The autocorrelation is well-contained within the lag correlation limits denoted with the dashed lines within the plot.
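This visual check can also be performed numerically. The following is a minimal sketch, assuming the data frame p2 from the earlier listing, that extracts the sample autocorrelations from R's acf function (with the maximum lag set to about one tenth of the sample size) and compares them to the approximate 95% white-noise band that the acf plot draws by default:

# compute sample autocorrelations without plotting (max lag about n/10)
r = acf(p2$N, lag.max = 16, plot = FALSE)
n = length(p2$N)
# approximate 95% band about zero used by R's default acf plot
band = qnorm(0.975) / sqrt(n)
# TRUE if any lag (k >= 1) has an estimated correlation outside the band
any(abs(r$acf[-1]) > band)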
Because arrival data often varies with time, it would be useful to examine whether or not the count data depends in some manner on when it was collected. For example, perhaps the computer lab is less busy on Fridays. In other words, the counts may depend on the day of the week. We can test for dependence on various factors by using a Chi-square based contingency table test. These tests are summarized in introductory statistics books. See (Montgomery and Runger 2006).
First, we can try to visualize any patterns based on the factors. We can do this easily with a scatter plot matrix within the lattice package of R. In addition, we can use the xtabs function to tabulate the data by week, period, and day. The xtabs function specifies a modeling relationship between the observations and factors. In the R listing, \(N \sim Week + Period + Day\) indicates that we believe that the count column \(N\) in the data, depends on the \(Week\), \(Period\), and the \(Day\). This builds a statistical object that can be summarized using the summary command.
library(lattice)
splom(p2)
mytable = xtabs(N~Week + Period + Day, data=p2)
summary(mytable)
Call: xtabs(formula = N ~ Week + Period + Day, data = p2)
Number of cases in table: 1324
Number of factors: 3
Test for independence of all factors:
Chisq = 133.76, df = 145, p-value = 0.7384
A scatter plot matrix plots the variables in a matrix format that makes it easier to view pairwise relationships within the data. Figure B.6 presents the results of the scatter plot. Within the cells, the data looks reasonably ‘uniform.’ That is, there are no discernible patterns to be found. This provides evidence that there is not likely to be dependence between these factors. To formally test this hypothesis, we can use the multi-factor contingency table test provided by applying the summary command to the output object, mytable, of the xtabs command.
The results of using summary(mytable) show that the chi-square test statistic has a very high p-value, \(0.7384\), when testing if \(N\) depends on \(Week\), \(Period\), and \(Day\). The null hypothesis, \(H_{0}\), of a contingency table test of this form states that the counts are independent of the factors versus the alternative, \(H_{a}\), that the counts are dependent on the factors. Since the p-value is very high, we should not reject \(H_{0}\). What does this all mean? In essence, we can now treat the arrival count observations as 160 independent observations. Thus, we can proceed with formally testing if the counts come from a Poisson distribution without worrying about time or factor dependencies within the data. If the results of the analysis indicated dependence on the factors, then we might need to fit separate distributions based on the dependence. For example, if we concluded that the days of the week were different (but the week and period did not matter), then we could try to fit a separate Poisson distribution for each day of the week. When the mean rate of occurrence depends on time, this situation warrants the investigation of a non-homogeneous (non-stationary) Poisson process. The estimation of the parameters of a non-stationary Poisson process is beyond the scope of this text. The interested reader should refer to (L. Leemis 1991) and other such references.
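Had the analysis indicated a day-of-week effect, a natural first step would be to estimate a separate mean rate for each day of the week. The following is a minimal sketch of that idea, assuming the data frame p2 from the earlier listings:

# hypothetical follow-up: estimate a separate Poisson rate for each day of the week
# (mean count per 15-minute interval, grouped by day)
dayRates = tapply(p2$N, p2$Day, mean)
dayRates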
B.3.3 Estimating the Rate Parameter for the Poisson Distribution
Testing if the Poisson distribution is a good model for this data can be accomplished using various statistical tests for a Poisson distribution. The basic approach is to compare the hypothesized distribution function, in the form of the PDF, PMF, or CDF, to a fit of the data. This implies that you have hypothesized a distribution and estimated its parameters in order to compare the hypothesized distribution to the data. For example, suppose that you hypothesize that the Poisson distribution will be a good model for the count data. Then, you need to estimate the rate parameter associated with the Poisson distribution.
There are two main methods for estimating the parameters of distribution functions: 1) the method of moments and 2) the method of maximum likelihood. The method of moments matches the empirical moments to the theoretical moments of the distribution and attempts to solve the resulting system of equations. The method of maximum likelihood attempts to find the parameter values that maximize the joint probability distribution function of the sample. Estimation of the parameters from sample data is based on important statistical theory that requires the estimators for the parameters to satisfy statistical properties (e.g. uniqueness, unbiasedness, invariance, and consistency). It is beyond the scope of this book to cover the properties of these techniques. The interested reader is referred to (Law 2007) or to (Casella and Berger 1990) for more details on the theory of these methods.
To make concrete the challenges associated with fitting the parameters of a hypothesized distribution, the maximum likelihood method for fitting the Poisson distribution will be used on the count data. Suppose that you hypothesize that the distribution of the count of the number of arrivals in the \(i^{th}\) 15 minute interval can be modeled with a Poisson distribution. Let \(N\) be the number of arrivals in a 15 minute interval. Note that the intervals do not overlap and that we have shown that they can be considered independent of each other. We are hypothesizing that \(N\) has a Poisson distribution with rate parameter \(\lambda\), where \(\lambda\) represents the expected number of arrivals per 15 minute interval.
\[f(n;\lambda) = P\{N=n\} = \frac{e^{-\lambda}\left(\lambda \right)^{n}}{n!}\]
Let \(N_{i}\) be the number of arrivals in the \(i^{th}\) interval of the \(k=160\) intervals. The \(N_{1}, N_{2},...,N_{k}\) form a random sample of size \(k\). Then, the joint probability mass function of the sample is:
\[L(\lambda) = g(n_{1}, n_{2},...,n_{k};\lambda) = f(n_1;\lambda)f(n_2;\lambda) \ldots f(n_k;\lambda) = \prod_{i = 1}^{k} f(n_i; \lambda)\]
The \((n_{1}, n_{2},...,n_{k})\) are observed (known values) and \(\lambda\) is unknown. The function \(L(\lambda)\) is called the likelihood function. To estimate the value of \(\lambda\) by the method of maximum likelihood, we must find the value of \(\lambda\) that maximizes the function \(L(\lambda)\). The interpretation is that we are finding the value of the parameter that maximizes the likelihood of the observed sample.
Substituting the definition of the Poisson distribution into the likelihood function yields:
\[\begin{aligned} L(\lambda) & = \prod_{i = 1}^{k}\frac{e^{-\lambda}\left(\lambda \right)^{n_i}}{n_{i}!}\\ & = \frac{e^{-k\lambda}\lambda^{\sum_{i=1}^{k}n_{i}}}{\prod_{i = 1}^{k}n_{i}!}\end{aligned}\]
It can be shown that maximizing \(L(\lambda)\) is the same as maximizing \(ln(L (\lambda))\). This is called the log-likelihood function. Thus,
\[ln(L (\lambda)) = -k \lambda + ln(\lambda)\sum_{i = 1}^{k} n_i - \sum_{i = 1}^{k} ln(n_{i}!)\]
Differentiating with respect to \(\lambda\), yields,
\[\dfrac{dln(L(\lambda))}{d\lambda} = -k + \dfrac{\sum_{i = 1}^{k} n_i}{\lambda}\]
When we set this equal to zero and solve for \(\lambda\), we get
\[0 = -k + \dfrac{\sum_{i = 1}^{k} n_i}{\lambda}\] \[\begin{equation} \hat{\lambda} = \dfrac{\sum_{i = 1}^{k} n_i}{k} \tag{B.6} \end{equation}\]
If the second derivative, \(\dfrac{d^2lnL(\lambda)}{d\lambda^2} < 0\) then a maximum is obtained.
\[\dfrac{d^2 lnL(\lambda)}{d \lambda^2} = \dfrac{-\sum_{i = 1}^{k} n_i}{\lambda^2}\]
Because the \(n_i\) are positive and \(\lambda\) is positive, the second derivative must be negative; therefore, a maximum is obtained and the maximum likelihood estimator for the parameter of the Poisson distribution is given by Equation (B.6). Notice that this is the sample average of the interval counts, which can be easily computed using the mean() function within R.
While the Poisson distribution has an analytical form for the maximum likelihood estimator, not all distributions will have this property. Estimating the parameters of distributions will, in general, involve non-linear optimization. This motivates the use of software tools such as R when performing the analysis. Software tools will perform this estimation process with little difficulty. Let’s complete this example using R to fit and test whether or not the Poisson distribution is a good model for this data.
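To illustrate the idea of numerically maximizing a likelihood, the following sketch (assuming the count data in p2$N and a search interval chosen for this data) maximizes the Poisson log-likelihood derived above, omitting the constant \(\sum_{i} \ln(n_{i}!)\) term, and compares the answer to the analytical estimator of Equation (B.6):

# Poisson log-likelihood, omitting the constant sum(log(n_i!)) term
loglik = function(lambda, counts) {
  -length(counts) * lambda + log(lambda) * sum(counts)
}
# numerically search a reasonable interval for the maximizing value of lambda
optimize(loglik, interval = c(0.01, 30), counts = p2$N, maximum = TRUE)
# analytical maximum likelihood estimate from Equation (B.6)
mean(p2$N)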
B.3.4 Overview of Hypothesis Testing for Goodness of Fit
In essence, a distributional test examines the hypothesis \(H_{0}: X_{i} \sim F_0\) versus the alternate hypothesis of \(H_{a}: X_{i} \nsim F_0\). That is, the null hypothesis is that data come from distribution function, \(F_0\) and the alternative hypothesis that the data are not distributed according to \(F_0\).
As a reminder, when using a hypothesis testing framework, we can perform the test in two ways: 1) pre-specifying the Type 1 error and comparing the test statistic to a critical value, or 2) pre-specifying the Type 1 error and comparing it to a p-value. For example, in the first case, suppose we let our Type 1 error be \(\alpha = 0.05\); then we look up a critical value appropriate for the test, compute the test statistic, and reject the null hypothesis \(H_{0}\) if the test statistic is too extreme (large or small depending on the test). In the second case, we compute the p-value for the test based on the computed test statistic, and then reject \(H_{0}\) if the p-value is less than or equal to the specified Type 1 error value.
This text emphasizes the p-value approach to hypothesis testing because computing the p-value provides additional information about the significance of the result. The p-value for a statistical test is the smallest \(\alpha\) level at which the observed test statistic is significant. This smallest \(\alpha\) level is also called the observed level of significance for the test. The smaller the p-value, the more the result can be considered statistically significant. Thus, the p-value can be compared to the desired significance level, \(\alpha\). When using the p-value approach to hypothesis testing, the testing criterion is:
If the p-value \(> \alpha\), then do not reject \(H_{0}\)
If the p-value \(\leq \alpha\), then reject \(H_{0}\)
An alternate interpretation for the p-value is that it is the probability, assuming \(H_{0}\) is true, of obtaining a test statistic at least as extreme as the observed value. Assuming that \(H_{0}\) is true, the test statistic will follow a known distributional model, so the p-value measures the chance of obtaining a more extreme value of the test statistic than the value actually obtained. Computing a p-value for a hypothesis test allows for a different mechanism for accepting or rejecting the null hypothesis. Remember that a Type I error is
\[\alpha = P(\text{Type 1 error}) = P(\text{rejecting the null when it is in fact true})\]
This represents the chance you are willing to take to make a mistake in your conclusion. The p-value represents the observed chance associated with the test statistic being more extreme under the assumption that the null is true. A small p-value indicates that the observed result is rare under the assumption of \(H_{0}\). If the p-value is small, it indicates that an outcome as extreme as observed is possible, but not probable, under \(H_{0}\). In other words, chance by itself may not adequately explain the result. So for a small p-value, you can reject \(H_{0}\) as a plausible explanation of the observed outcome. If the p-value is large, this indicates that an outcome as extreme as that observed can occur with high probability. For example, suppose you observed a p-value of \(0.001\). This means that, assuming \(H_{0}\) is true, there was a 1 in 1000 chance that you would observe a test statistic value at least as extreme as what you observed. It comes down to whether or not you consider a 1 in 1000 occurrence a rare or non-rare event. In other words, do you consider yourself that lucky to have observed such a rare event? If you do not think of yourself as lucky but you still actually observed it, then you might be suspicious that something is not “right with the world.” In other words, your assumption that \(H_{0}\) was true is likely wrong. Therefore, you reject the null hypothesis, \(H_{0}\), in favor of the alternative hypothesis, \(H_{a}\). The risk of making an incorrect decision here is what you consider a rare event, i.e. your \(\alpha\) level.
For a discrete distribution, the most common distributional test is the Chi-Squared goodness of fit test, which is the subject of the next section.
B.3.5 Chi-Squared Goodness of Fit Test
The Chi-Squared test divides the range of the data into \(k\) intervals (or classes) and tests if the number of observations that fall in each interval (or class) is close to the expected number that should fall in the interval (or class), given that the hypothesized distribution is the correct model. In the case of discrete data, the intervals are called classes and they are mapped to groupings along the domain of the random variable. For example, in the case of a Poisson distribution, a possible set of \(k=7\) classes could be {0}, {1}, {2}, {3}, {4}, {5}, {6 or more}. As a general rule of thumb, the classes should be chosen so that the expected number of observations in each class is at least 5.
Let \(c_{j}\) be the observed count of the observations contained in the \(j^{th}\) class. The Chi-Squared test statistic has the form:
\[\begin{equation} \chi^{2}_{0} = \sum\limits_{j=1}^{k} \frac{\left( c_{j} - np_{j} \right)^{2}}{np_{j}} \tag{B.7} \end{equation}\]
The quantity \(np_{j}\) is the expected number of observations that should fall in the \(j^{th}\) class when there are \(n\) observations.
For large \(n\), an approximate \(1-\alpha\) level hypothesis test can be performed based on a Chi-Squared test statistic that rejects the null hypothesis if the computed \(\chi^{2}_{0}\) is greater than \(\chi^{2}_{\alpha, k-s-1}\), where \(s\) is the number of estimated parameters for the hypothesized distribution. \(\chi^{2}_{\alpha, k-s-1}\) is the upper critical value for a Chi-Squared random variable \(\chi^{2}\) such that \(P\{\chi^{2} > \chi^{2}_{\alpha, k-s-1}\} = \alpha\). The p-value, \(P\{\chi^{2}_{k-s-1} > \chi^{2}_{0}\}\), for this test can be computed in a spreadsheet such as MS Excel using the following formula:
\[\text{CHISQ.DIST.RT}(\chi^{2}_{0},k-s-1)\]
In the statistical package \(R\), the formula is:
\[\text{pchisq}(\chi^{2}_{0},k-s-1,lower.tail=FALSE)\] The null hypothesis is that the data come from the hypothesized distribution versus the alternative hypothesis that the data do not come from the hypothesized distribution.
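For example, as a purely hypothetical illustration of the R formula, suppose a test produced \(\chi^{2}_{0} = 2.5\) with \(k = 9\) classes and \(s = 1\) estimated parameter; the p-value could then be computed as follows:

# p-value for a hypothetical chi-squared statistic of 2.5 with k = 9 classes and s = 1
pchisq(2.5, df = 9 - 1 - 1, lower.tail = FALSE)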
Let’s perform a Chi-squared goodness of fit test of computer lab data for the Poisson distribution. A good first step in analyzing discrete data is to summarize the observations in a table. We can do that easily with the R table() function.
# tabulate the counts
tCnts = table(p2$N)
tCnts
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1 3 2 7 13 16 21 25 20 21 11 7 5 6 1 1
From this analysis, we see that no intervals had a count of 0, 1 interval had a count of 1, 3 intervals had a count of 2, and so forth. Now, we will estimate the rate parameter of the hypothesized Poisson distribution using Equation (B.6) and tabulate the expected number of observations for a proposed set of classes.
# get number of observations
n = length(p2$N)
n
## [1] 160
# estimate the rate for Poisson from the data
lambda = mean(p2$N)
lambda
## [1] 8.275
# setup vector of x's across domain of Poisson
x = 0:15
x
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# compute the probability for Poisson
prob = dpois(x, lambda)
prob
## [1] 0.0002548081 0.0021085367 0.0087240706 0.0240638947 0.0497821822
## [6] 0.0823895116 0.1136288681 0.1343255548 0.1389429957 0.1277503655
## [11] 0.1057134275 0.0795253284 0.0548393410 0.0349073498 0.0206327371
## [16] 0.0113823933
# view the expected counts for these probabilities
prob*n
## [1] 0.04076929 0.33736587 1.39585130 3.85022316 7.96514916 13.18232186
## [7] 18.18061890 21.49208877 22.23087932 20.44005848 16.91414839 12.72405254
## [13] 8.77429457 5.58517596 3.30123794 1.82118293
We computed the probability associated with the Poisson distribution with \(\hat{\lambda} = 8.275\). Recall that the Poisson distribution has the following form:
\[ P\left\{X=x\right\} = \frac{e^{-\lambda}\lambda^{x}}{x!} \quad \lambda > 0, \quad x = 0, 1, \ldots \] Then, by multiplying each probability, \(P\left\{X=i\right\} = p_i\), by the number of observations, \(n=160\), we get the expected number, \(n\times p_i\), that we should observe under the hypothesized distribution. Since the expected values for \(x = 0, 1, 2, 3, 4\) are so small, we consolidate them into one class so that the expected number is at least 5. We will also consolidate the range \(x \geq 13\) into a single class and the values of 11 and 12 into a class.
# compute the probability for the classes
# the vector is indexed starting at 1, prob[1] = P(X=0)
cProb = c(sum(prob[1:5]), prob[6:11], sum(prob[12:13]), ppois(12, lambda, lower.tail = FALSE))
cProb
## [1] 0.08493349 0.08238951 0.11362887 0.13432555 0.13894300 0.12775037 0.10571343
## [8] 0.13436467 0.07795112
# compute the expected counts for each of the classes
expected = cProb*n
expected
## [1] 13.58936 13.18232 18.18062 21.49209 22.23088 20.44006 16.91415 21.49835
## [9] 12.47218
Thus, we will use the following \(k=9\) classes: {0,1,2,3,4}, {5}, \(\cdots\), {10}, {11,12}, and {13 or more}. Now, we need to summarize the observed frequencies for the proposed classes.
# transform tabulated counts to data frame
dfCnts = as.data.frame(tCnts)
# extract only frequencies
cnts = dfCnts$Freq
cnts
## [1] 1 3 2 7 13 16 21 25 20 21 11 7 5 6 1 1
# consolidate classes for observed
observed = c(sum(cnts[1:4]), cnts[5:10], sum(cnts[11:12]), sum(cnts[13:16]))
observed
## [1] 13 13 16 21 25 20 21 18 13
Now, we have both the observed and expected tabulated and can proceed with the chi-squared goodness of fit test. We will compute the result directly using Equation (B.7).
# compute the observed minus expected components
chisq = ((observed - expected)^2)/expected
# compute the chi-squared test statistic
sumchisq = sum(chisq)
# chi-squared test statistic
print(sumchisq)
## [1] 2.233903
# set the degrees of freedom, with 1 estimated parameter s = 1
df = length(expected) - 1 - 1
# compute the p-value
pvalue = 1 - pchisq(sumchisq, df)
# p-value
print(pvalue)
## [1] 0.9457686
Based on such a high p-value, we would not reject \(H_0\). Thus, we can conclude that there is not enough evidence to suggest that the lab arrival count data comes from some distribution other than the Poisson distribution. In this section, we did the Chi-squared test by hand. R has a package that facilitates distribution fitting called fitdistrplus. We will use this package to analyze the computer lab arrival data again in the next section.
B.3.6 Using the fitdistrplus R Package on Discrete Data
Fortunately, R has a very useful package for fitting a wide variety of discrete and continuous distributions called the fitdistrplus package. To install and load the package, do the following:
install.packages('fitdistrplus')
library(fitdistrplus)
plotdist(p2$N, discrete = TRUE)
The plotdist command will plot the empirical probability mass function and the cumulative distribution function as illustrated in Figure B.7.
To perform the fitting and analysis, you use the fitdist and gofstat commands. In addition, plotting the output from the fitdist function will provide a comparison of the empirical distribution to the theoretical distribution.
fp = fitdist(p2$N, "pois")
summary(fp)
plot(fp)
gofstat(fp)
fp = fitdist(p2$N, "pois")
summary(fp)
## Fitting of the distribution ' pois ' by maximum likelihood
## Parameters :
## estimate Std. Error
## lambda 8.275 0.2274176
## Loglikelihood: -393.7743 AIC: 789.5485 BIC: 792.6237
gofstat(fp)
## Chi-squared statistic: 2.233903
## Degree of freedom of the Chi-squared distribution: 7
## Chi-squared p-value: 0.9457686
## Chi-squared table:
## obscounts theocounts
## <= 4 13.00000 13.58936
## <= 5 13.00000 13.18232
## <= 6 16.00000 18.18062
## <= 7 21.00000 21.49209
## <= 8 25.00000 22.23088
## <= 9 20.00000 20.44006
## <= 10 21.00000 16.91415
## <= 12 18.00000 21.49835
## > 12 13.00000 12.47218
##
## Goodness-of-fit criteria
## 1-mle-pois
## Akaike's Information Criterion 789.5485
## Bayesian Information Criterion 792.6237
The output of the fitdist and summary commands provides the estimate \(\hat{\lambda} = 8.275\). The result object, fp, returned by fitdist can subsequently be used to plot the fit (plot) and to perform a goodness of fit test (gofstat).
The gofstat command performs a chi-squared goodness of fit test and computes the chi-squared statistic value (here \(2.233903\)) and the p-value (\(0.9457686\)). These are exactly the same results that we computed in the previous section. Clearly, the chi-squared test statistic p-value is quite high. Again, the null hypothesis is that the observations come from a Poisson distribution with the alternative that they do not. The high p-value suggests that we should not reject the null hypothesis and conclude that a Poisson distribution is a very reasonable model for the computer lab arrival counts.
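As a side note, base R's chisq.test function can reproduce the same chi-squared statistic from the observed class counts and hypothesized class probabilities developed in the by-hand analysis. However, chisq.test uses \(k-1\) degrees of freedom and does not account for the estimated rate parameter, so the p-value should be recomputed with \(k-s-1\) degrees of freedom. A minimal sketch, assuming the observed and cProb vectors from the earlier listings:

# chi-squared statistic from the observed class counts and hypothesized class probabilities
ct = chisq.test(x = observed, p = cProb)
ct$statistic
# recompute the p-value with k - s - 1 degrees of freedom (one estimated parameter)
pchisq(ct$statistic, df = length(observed) - 1 - 1, lower.tail = FALSE)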
B.3.7 Fitting a Discrete Empirical Distribution
We are interested in modeling the number of packages delivered on a small parcel truck to a hospital loading dock. A distribution for this random variable will be used in a loading dock simulation to understand the ability of the workers to unload the truck in a timely manner. Unfortunately, only a limited sample of observations is available, from 40 different truck shipments, as shown in Table B.4.
130 | 130 | 110 | 130 | 130 | 110 | 110 | 130 | 120 | 130 |
140 | 140 | 130 | 110 | 110 | 140 | 130 | 140 | 130 | 120 |
140 | 150 | 120 | 150 | 120 | 130 | 120 | 100 | 110 | 150 |
130 | 120 | 120 | 130 | 120 | 120 | 130 | 130 | 130 | 100 |
Developing a probability model will be a challenge for this data set because of the limited amount of data. Using the following \(R\) commands, we can get a good understanding of the data. Assuming that the data has been read into the variable, packageCnt, the hist(), stripchart(), and table() commands provide an initial analysis.
packageCnt = scan("data/AppDistFitting/TruckLoadingData.txt")
hist(packageCnt, main="Packages per Shipment", xlab="#packages")
stripchart(packageCnt,method="stack",pch="o")
table(packageCnt)
As we can see from the \(R\) output, the data ranges between 100 and 150. The histogram, shown in Figure B.9, illustrates the shape of the data, which appears to be slightly negatively skewed. Figure B.10 presents a dot plot of the observed counts. From this figure, we can clearly see that the only values obtained in the sample are 100, 110, 120, 130, 140, and 150. It looks as though shipments come in multiples of ten packages, starting at 100 units.
Because of the limited size of the sample and the limited variety of data points, fitting a distribution will be problematic. However, we can fit a discrete empirical distribution to the proportions observed in the sample. Using the table() command, we can summarize the counts. Then, by dividing by 40 (the total number of observations), we can get the proportions as shown in the \(R\) listing.
tp = table(packageCnt)
tp/length(packageCnt)
## packageCnt
## 100 110 120 130 140 150
## 0.050 0.150 0.225 0.375 0.125 0.075
Thus, we can represent this situation with the following probability mass and cumulative distribution functions.
\[P\{X=x\} = \begin{cases} 0.05 & x = 100\\ 0.15 & x = 110\\ 0.225 & x = 120\\ 0.375 & x = 130\\ 0.125 & x = 140\\ 0.075 & x = 150 \end{cases}\]
\[F(x) = \begin{cases} 0.0 & \mathrm{if} \ x < 100\\ 0.05 & \mathrm{if} \ 100 \le x < 110\\ 0.20 & \mathrm{if} \ 110 \le x < 120\\ 0.425 & \mathrm{if} \ 120 \le x < 130\\ 0.80 & \mathrm{if} \ 130 \le x < 140\\ 0.925 & \mathrm{if} \ 140 \le x < 150\\ 1.0 & \mathrm{if} \ x \geq 150 \end{cases} \]
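Since this empirical distribution will be used to drive the loading dock simulation, it is helpful to see how package counts could be generated from it. The following is a minimal sketch using R's sample function with the estimated proportions; the vector names are illustrative only:

# values and estimated probabilities of the discrete empirical distribution
pkgValues = c(100, 110, 120, 130, 140, 150)
pkgProbs = c(0.050, 0.150, 0.225, 0.375, 0.125, 0.075)
# generate 5 simulated package counts for truck shipments
sample(pkgValues, size = 5, replace = TRUE, prob = pkgProbs)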
The previous two examples illustrated the process for fitting a discrete distribution to data for use in a simulation model. Because discrete event dynamic simulation involves modeling the behavior of a system over time, the modeling of distributions that represent the time to perform a task is important. Since the domain of time is on the set of positive real numbers, it is a continuous variable. We will explore the modeling of continuous distributions in the next section.