Basan Shrestha's Diary: 2018

Monday, October 8, 2018

What Sample Proportion is a Statistically Significant Estimate of Population Proportion?, Statistical Note 37

Toss a coin 20 times in a sample and count the number of heads. Repeat the same process for seven samples, each consisting of 20 tosses. Calculate the sample proportion of heads for each sample and test whether each sample proportion is a significant estimate of the population proportion. Discuss what makes a sample proportion a statistically significant estimate of the population proportion.

Key Words: Population Proportion, Sample Proportion, Standard Error, z-score, Statistical Significance, Sample Size

Introduction

Not all sample proportions are statistically significant estimates of the population proportion. This raises two questions: which sample proportions are significant, and what makes them so? Tossing a coin is an example of a binary categorical random variable that can be used to explain the statistical significance of sample proportions. Refer to my Statistical Note 36 to know more about population and sample proportions and the normality of the sampling distribution of sample proportions.

Observed Data

I tossed a coin 20 times in a sample (S) and repeated the same process for seven samples (S1 to S7). Can one guess how many heads there will be in each sample? Table 1 presents the outcome of 20 tosses of a coin in each of the seven samples.

Table 1: Outcomes in 20 tosses of a coin in each of seven samples

Every sample of 20 tosses in Table 1 is from a population consisting of the large number of possible tosses. The section below discusses whether the sample proportions are statistically significant estimates of the population proportion.

Discussion

The population proportion, denoted by ‘p’, is 0.5 in coin tossing experiments. The sample proportion is denoted by ‘p^’ and read as ‘p-hat’. The sample proportion of heads in 20 tosses of a coin ranged from 0.35 to 0.65 (Table 1).

Mean of sample proportions ‘p^’, also called center, is the population proportion ‘p’. Symbolically, it is indicated by µ_p^=p. In coin tossing, 0.5 is the mean of sample proportions or the population proportion.

The Standard Deviation of Sample Proportions is the square root of the population proportion multiplied by one minus the population proportion, divided by the sample size. This is referred to as the spread or Standard Error (SE) of sample proportions. Symbolically, σ_p^ = √[p x (1-p)/n], where ‘p’ is the population proportion and ‘n’ is the sample size. In this example, SE = √[0.5 x (1-0.5)/20] = 0.111803.
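The SE calculation above can be sketched in a few lines of Python. The function name `standard_error` is only an illustrative choice and is not part of the original note.

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# Fair coin (p = 0.5) and samples of 20 tosses, as in this note.
se_20 = standard_error(0.5, 20)
print(round(se_20, 6))  # 0.111803
```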

The sampling distribution of sample proportions with the sample size of 20 tosses of a coin approximately follows a normal distribution with mean p=0.5 and standard error σ_p^=0.111803.

When the sampling distribution of sample proportions follows a normal distribution, the z-score or test statistic is calculated as the difference between the sample proportion and the population proportion divided by the SE of sample proportions. Symbolically, z = (p^ - p)/√[p x (1-p)/n]. In a normal distribution, a z-score of minus or plus 1.96 is a commonly used cut-off point, because 95 percent of samples of a given size have sample proportions within 1.96 SEs of the population proportion. The range of z-scores from minus 1.96 to plus 1.96 corresponds to the 95 percent confidence level. A z-score less than minus 1.96 or greater than plus 1.96 indicates that the sample proportion is statistically significant, that is, it differs from the population proportion by more than sampling variability alone would usually produce, and the sample may come from a different population. One point to note here is that, for a fixed difference between the sample and population proportions, the z-score is directly proportional to the square root of the sample size ‘n’, so that as ‘n’ increases, the z-score also increases in magnitude.

I calculated the z-score for all seven sample proportions in this example (Table 2). For example, for the sample proportion p^=0.35, z = (p^ - p)/σ_p^, where σ_p^ = √[p x (1-p)/n] = √[0.50 x 0.50/20] = 0.111803, so that z = (0.35-0.50)/0.111803 = -1.3416. Likewise, the z-score was calculated for each sample proportion and tabulated. Using the cut-off point of z-score equal to minus or plus 1.96, none of the sample proportions in this example was found to be a significant estimate of the population parameter of 0.50, given the sample size of 20 tosses of a coin.
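As a minimal sketch of the z-score calculation, the snippet below uses the two extreme sample proportions reported in this note (0.35 and 0.65) together with 0.50. The helper name `z_score` is an assumption made for illustration. The z-score for 0.35 is negative because that sample proportion lies below 0.50.

```python
import math

def z_score(p_hat: float, p: float, n: int) -> float:
    """z = (p_hat - p) / sqrt(p * (1 - p) / n)."""
    se = math.sqrt(p * (1 - p) / n)
    return (p_hat - p) / se

CUT_OFF = 1.96  # two-sided cut-off used in this note

for p_hat in (0.35, 0.50, 0.65):
    z = z_score(p_hat, 0.5, 20)
    # A proportion is flagged as significant only if |z| exceeds the cut-off.
    print(p_hat, round(z, 4), abs(z) > CUT_OFF)
```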

Table 2: Sample proportions and their significance to estimate population proportion of samples constituting 20 tosses of a coin

Sample proportions of 0.2809 or smaller and 0.7191 or bigger are statistically significant estimates of the population proportion of 0.50, given the sample size of 20 tosses of a coin. These cut-offs are 1.96 SEs below and above 0.50, that is, 0.50 minus or plus 1.96 x 0.111803.

If the sample size is increased, sample proportions closer to 0.50 than these cut-offs become statistically significant, because the SE becomes smaller. For example, if the sample size is increased to 50 tosses, sample proportions of 0.3614 or smaller and 0.6386 or bigger are statistically significant estimates of the population proportion of 0.50. Likewise, if the sample size is increased to 100 tosses, sample proportions of 0.4020 or smaller and 0.5980 or bigger are statistically significant estimates of the population proportion of 0.50.
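The effect of sample size on the cut-off proportions can be sketched as follows. The function name `significance_bounds` is hypothetical, and the cut-off of 1.96 is the one used in this note.

```python
import math

def significance_bounds(p: float, n: int, z_cut: float = 1.96):
    """Return the sample proportions lying z_cut standard errors below and above p."""
    margin = z_cut * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

bounds = {n: significance_bounds(0.5, n) for n in (20, 50, 100)}
for n, (low, high) in bounds.items():
    print(f"n={n}: significant if p_hat <= {low:.4f} or p_hat >= {high:.4f}")
```

The interval between the two cut-offs narrows as the number of tosses grows, which is why a sample proportion that is not significant with 20 tosses can become significant with 100 tosses.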

I am curious whether John Kerrich’s observed sample proportion of heads, 0.5067 in 10,000 tosses of a coin, is a statistically significant estimate of the population proportion of 0.5. With SE = √[0.5 x 0.5/10,000] = 0.005, the z-score is (0.5067-0.50)/0.005 = 1.34, which is lower than the cut-off point of 1.96. This indicates that his sample is among the 95 percent of samples of 10,000 tosses whose proportions fall within 1.96 SEs of 0.50. Thus, this sample proportion is not a statistically significant estimate of the population proportion, 0.50.
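The calculation for Kerrich’s sample can be checked with a short sketch. The variable names are illustrative; the figures (0.5067 and 10,000 tosses) are the ones quoted in this note.

```python
import math

n = 10_000       # number of tosses
p = 0.5          # population proportion for a fair coin
p_hat = 0.5067   # observed proportion of heads quoted in this note

se = math.sqrt(p * (1 - p) / n)   # standard error for 10,000 tosses
z = (p_hat - p) / se              # test statistic

print(round(se, 4), round(z, 2), abs(z) > 1.96)
```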

Conclusion

Not all sample proportions are statistically significant estimates of the population proportion. The sample size is pivotal for identifying whether the sample proportion is a statistically significant estimate of the population proportion or not.

Sunday, October 7, 2018

Sampling Distribution of One Sample Proportions, Statistical Note 36

Key Words: Population, Sample Space, Sampling Frame, (Population) Parameter, Expected Value, Proportion, Population Proportion, Sample, Subset, Random Sample, Sampling Unit, Independent Trials, (Sample) Statistic, Point Estimate, Sample Proportion, Likelihood Estimate, Maximum Likelihood Estimate, Sampling Variability, Sampling Error, Estimate, Estimator, Sample Mean, Center, Sample Variance, Sample Standard Deviation, Spread, Mean of Sample Proportions, Standard Deviation of Sample Proportions, Standard Error, Categorical Variable, Binary Random Variable, Binary Data, Binary Outcome, Bernoulli Random Variable, Binomial Random Variable, Frequency Distribution, Sampling Distribution, Shape, Normal Distribution, Law of Averages, Law of Large Numbers, Central Limit Theorem.

Introduction

Numerous samples of equal size can be drawn from a population. However, sample characteristics vary from one sample to another due to sampling variability. The sample proportion is widely used as a summary measure of a binary or two-category random variable. The sampling distribution of a sample proportion gives an idea of the population proportion, which usually remains unknown in the real world. Tossing a coin is an example of a binary categorical random variable that explains the sampling distribution of sample proportions, although the population proportion in coin tossing experiments is already known.

Observed Data

Table 1: Outcomes in 20 tosses of a coin in each of seven samples

Every sample of 20 tosses in Table 1 is representative of the large number of possible tosses. However, the sequence of outcomes differs from sample to sample, whether the number of heads out of 20 tosses is the same or different. The sections below discuss in detail what the large number of possible tosses means, how it is represented, and a lot more about the outcomes in Table 1.

Discussion

A population constitutes the entire set of possible cases or values. It is also referred to as the sample space. The population for coin tossing contains the results of tossing the coin a very large number of times. It is not certain how large is large. There are some records of tossing a coin a large number of times. The French naturalist Count Buffon (1707 - 1788) tossed a coin 4040 times. Likewise, the South African mathematician John Kerrich tossed a coin 10,000 times as an experiment to pass his time during internment in Denmark in the 1940s.

A parameter, also referred to as a population parameter or an expected value, is a population value that describes a characteristic of the population. Usually, the value of a parameter is unknown as the entire population is not enumerated. But, in the case of tossing an unbiased coin, the parameter is already known: either head or tail appears in a toss, each with a probability of half. Turning up of head and tail in a single toss is mutually exclusive. In addition, the outcome of one toss does not affect the outcome of any other toss. This is referred to as Independence, the third rule of sample proportion.

Proportion is a special case of mean for binary data. Population proportion is a population parameter. It is the ratio of the number of successes to the total number of cases in the population. It is the probability of success ‘p’, which ranges between zero and one. If a coin is tossed, either head or tail turns up. The population proportion of success, turning up head in a toss, is one face of head divided by two possible faces of head or tail, that is, half. Symbolically, population proportion is indicated by µ=p. By comparison, John Kerrich observed a proportion of heads equal to 0.5067 in 10,000 tosses of a coin.

A sample is a subset of the population selected for enumeration. A sample is drawn randomly, so that every individual object in the population has an equal chance of being selected, to avoid purposeful bias. This example of tossing a coin follows randomization, the first rule of sample proportion. A sampling unit is an individual unit or outcome of the sample. In this case, turning up of a head or tail in a toss of a coin is a sampling unit.

Statistic, also referred to as (sample) statistic, is a sample value or a measure that describes the sample such as sample mean and sample variance. Statistics vary from one sample to another due to sampling variability. A statistic is used to estimate an unknown parameter. It is called point estimate.

The sample proportion is the observed number of successes divided by the total sample size. It is a random variable that takes a value between zero and one. The sample proportion, denoted by ‘p^’ and read as ‘p-hat’, is the number of successes divided by the sample size ‘n’. The ‘hat’ is an indication of ‘estimate of’. Thus ‘p^’, a statistic, is an estimate of the parameter ‘p’. In this example, the number of successes is the number of heads out of the total tosses in a sample. The expected value of the sample proportion is equal to the population proportion. In this example, the sample proportion of heads in 20 tosses of a coin ranged from 0.35 to 0.65 (Table 2). Some sample proportions were smaller than the population proportion while others were equal or larger due to sampling variability or error.

Table 2: Sample proportion of heads in 20 tosses of a coin in each of seven samples

Sample proportions ranging from zero to one are all likelihood estimates of the population proportion. With ‘n’ tosses, the sample proportion can take any of the values 0/n, 1/n, 2/n and so on up to n/n. A single Bernoulli trial will have a sample proportion of either zero or one. As the sample size increases, sample proportions closer to the population proportion become more likely to occur. For a given sample, the value of ‘p’ under which the observed data are most likely is the sample proportion itself, which is why the sample proportion is called the Maximum Likelihood Estimate of the population proportion. In this example, the population proportion in coin tossing is 0.5, and the sample proportion equal to 0.5 is the most likely one to occur in 20 tosses; other sample proportions are less likely.

The Standard Deviation of Sample Proportions is the square root of the population proportion multiplied by one minus the population proportion, divided by the sample size. This is referred to as the spread or Standard Error (SE) of the sample proportion or sampling distribution. Symbolically, σ_p^ = √[p x (1-p)/n], where ‘p’ is the population proportion and ‘n’ is the sample size.

In this example, SE is calculated to be √[0.5 x (1-0.5)/20], equal to 0.111803. It indicates that a sample proportion typically differs from the population proportion by about 0.111803. Sample proportions within one SE of the population proportion range from 0.388197 to 0.611803. According to the normal distribution, 68 percent of samples of equal size will have the population proportion within one SE of their sample proportions. Likewise, 95 percent of samples of equal size will have the population proportion within two SEs of their sample proportions. In this case, sample proportions ranging between 0.276394 and 0.723606 have the population proportion of 0.50 within two SEs. These indicate that the samples of 20 tosses each in this example, with sample proportions from 0.35 to 0.65, are not surprising and fall among the 95 percent of samples of that size, provided the sample size is adequate to follow the normal distribution. It means that I am 95 percent sure that a sample proportion from 20 tosses will fall between 0.276394 and 0.723606. However, sample size is a decisive factor in calculating the SE. Sample proportions from bigger samples are less spread out than those from smaller samples.
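The one-SE and two-SE ranges around the population proportion can be sketched as below. The variable names are illustrative, and the values are printed to four decimal places because the last digits depend on whether the SE is rounded before it is doubled.

```python
import math

p, n = 0.5, 20
se = math.sqrt(p * (1 - p) / n)

one_se = (p - se, p + se)          # roughly 68 percent of samples fall here
two_se = (p - 2 * se, p + 2 * se)  # roughly 95 percent of samples fall here

print(f"within one SE:  {one_se[0]:.4f} to {one_se[1]:.4f}")
print(f"within two SEs: {two_se[0]:.4f} to {two_se[1]:.4f}")

# The observed range of sample proportions in this note (0.35 to 0.65)
# lies inside the two-SE range.
observed_inside = two_se[0] <= 0.35 and 0.65 <= two_se[1]
print(observed_inside)
```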

Sampling distribution is the frequency distribution of sample statistics. A sampling distribution lists the possible values of a statistic. The frequency distribution of sample proportions is called sampling distribution of sample proportions.

Sample proportions close to the population proportion are more likely to occur and sample proportions farther from the population proportion are less likely to occur. Thus, the sampling distribution of sample proportions bulges in the middle, close to the population proportion, and tapers farther from it, so that its shape comes close to that of the normal distribution.

Sample size determines the accuracy of the estimation of the population parameter. The larger the sample, the smaller the variability. In the formula, SE is inversely proportional to the square root of the sample size. This is related to the Law of Large Numbers, by which the sample proportion comes closer to the population proportion as the sample size grows. One may be interested to know how large a large sample size is. If both the product of the sample size and the population proportion and the product of the sample size and one minus the population proportion are more than or equal to five, the sampling distribution of the sample proportion is said to follow a normal distribution. Symbolically, the rule of thumb for normality has two criteria: 1) the expected number of heads (n*p) >= 5 and 2) the expected number of tails [n*(1-p)] >= 5. When both hold, the sample proportion ‘p^’ approximately follows a normal distribution with mean p and standard deviation equal to √[p x (1-p)/n]. This result follows from the Central Limit Theorem. Others argue that n*p and n*(1-p) should be greater than or equal to 10. This is called Normality, the second rule of sample proportion.

This example of samples of 20 tosses meets both criteria for normality. Both n*p and n*(1-p) are 10. Thus, we can conclude that the sampling distribution of sample proportions will approximately follow a normal distribution with mean of sample proportions, p=0.5, and standard error of sample proportions, σ_p^=0.111803.
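The rule of thumb for normality can be written as a small check. The function name `meets_normality_rule` and its `threshold` argument are illustrative assumptions; the thresholds of 5 and 10 are the two versions of the rule mentioned in this note.

```python
def meets_normality_rule(n: int, p: float, threshold: float = 5) -> bool:
    """True when both n*p and n*(1-p) reach the chosen threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

# 20 tosses of a fair coin: n*p = n*(1-p) = 10, so both versions of the rule hold.
print(meets_normality_rule(20, 0.5))                # threshold of 5
print(meets_normality_rule(20, 0.5, threshold=10))  # stricter threshold of 10
# 8 tosses of a fair coin: n*p = 4, so the rule is not met.
print(meets_normality_rule(8, 0.5))
```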

Conclusion

The population parameters are usually unknown. A sample statistic is used to estimate the population parameter. The sample proportion is used to estimate the population proportion. The sampling distribution of the sample proportion gives an idea of sampling variability. Following the Central Limit Theorem, the sampling distribution of the sample proportion in tossing a coin approximately follows a normal distribution when a sample consists of 20 tosses. The larger the sample size, the smaller the spread of the distribution. The observed sample proportions were among the 95 percent of samples of 20 tosses of a coin whose proportions lie between 0.276394 and 0.723606, that is, within two SEs of the population proportion of 0.5.

Saturday, September 15, 2018

Two-category Discrete Probability Distribution of Sampling Without Replacement, Observation and Theory, Statistical Note 35

Draw five cards without replacement from a deck of cards in an event and count the number of black cards. Repeat the same process for seven events each constituting five cards. Calculate the observed and theoretical discrete probability distributions of number of black cards.

The observed probability distribution is based on the real-time data. The theoretical probability distribution is based on an ideal situation. Using the observed data is important to understand the theory. The main objective of this note is to develop understanding of concepts on probability and two-category or binary probability distribution without replacement using a simple experiment.

Drawing of some cards without replacement from a deck of cards is an example of the two-category discrete probability distribution of sampling without replacement. Refer to my earlier Statistical Notes also for clarity on calculating the two-category discrete probability distribution of sampling without replacement using tree diagram, formula and Excel software function.

In this note, I first present the observed data and then present the probability and two-category discrete probability concepts using the event data. This note tries to clarify the concept of the two-category discrete probability distribution based on the observed data. Former notes first tried to clarify the theory and then discussed the observation. This note goes the other way round: it first discusses the observed data using the tree diagram, and then clarifies the theoretical distribution. This is done in order to analyze and interpret the observed data meaningfully based on the theory.

Observed Data

I drew five cards without replacement in an event (E) and the same process was repeated for seven events. Table 1 presents the outcome of drawing five cards without replacement in each of seven events. Black and red cards were coded B and R respectively for symbolic representation. Besides, a cell with a black card is shaded black and a cell with a red card is shaded red. I will discuss more on this table in the following sections.

Table 1: Outcomes in drawing five cards without replacement in each of seven events (Black card=B and Red card=R)

Queries

Several questions may arise looking at the outcomes data in Table 1. For example,

· How many unique outcomes each constituting five cards are there in Table 1?

· Why are some outcomes in the table same and others different?

· Is there any pattern of outcomes in drawing five cards without replacement?

· How many unique outcomes will there be theoretically of black and red cards of five cards drawn without replacement?

· How many groups of outcomes are there in the table of observed data?

· Why were there outcomes with only two to three black cards of five cards drawn without replacement? Why not fewer or more than that number of black cards?

· How many different groups of outcomes will there be theoretically having five black cards to no black card in drawing five cards without replacement?

· What is the probability of the first event E1 (B,R,R,B and R) in Table 1 that has black and red cards in exactly this order?

· What is the probability of two black cards out of five cards drawn without replacement in which the order does not matter whether a black or a red card occurs in which draw out of five cards?

· Looking at Table 1, what will be the observed discrete probability distribution of number of black cards?

· What will be the theoretical probability distribution of number of black cards in drawing five cards without replacement?

· How different will be the observed from the theoretical discrete probability distribution of number of black cards in drawing five cards without replacement?

Response

These questions can be answered using different tools – Tree Diagram, Binomial Expansion, Binomial Distribution function and Hypergeometric function. Look at specific statistical notes to get answers to these questions. Below are responses to the queries:

Questions: How many unique outcomes each constituting five cards are there in Table 1? Why are some event outcomes in the table same and others different? Is there any pattern of outcomes in drawing five cards without replacement?

Drawing the first card is unbiased such that black and red cards are equally likely to occur, with a probability of half for each. Drawing a card is a random experiment in which either a black or a red card is likely to be drawn. In Table 1, there are five unique outcomes each constituting five cards. The first and seventh outcomes (E1 and E7) are the same, and the third and fifth outcomes (E3 and E5) are the same. One outcome has three consecutive occurrences of red cards in five cards drawn (E6). Three outcomes have two consecutive occurrences of red cards in five cards (E1, E2 and E7). Black cards have not occurred consecutively in any of the seven events. One outcome has red and black cards occurring alternately (E5). These indicate that there is no pattern of occurrence of black and red cards.

Question: How many unique outcomes will there be theoretically of black and red cards of five cards drawn without replacement?

Refer to my Statistical Note 6 on the total number of possible outcomes of sampling without replacement. Each unique outcome differs from the others based on whether a black or a red card occurs in a certain draw. The total number of possible outcomes is calculated by using the formula ‘k to the power r’ or ‘k to the r^th power’ or ‘k^r’, where ‘k’ is the number of possible outcomes in an experiment or trial and ‘r’ is the number of times an experiment is conducted with replacement or the number of sampling units drawn without replacement. In this example, ‘k’ is two and ‘r’ is five so that the total number of possible outcomes is calculated by multiplying two possibilities (black or red card) in each of five cards drawn without replacement. This is calculated as 2 X 2 X 2 X 2 X 2 equal to 32, represented by ‘two to the power five’ or ‘two to the fifth power’, denoted by 2⁵. Here, the number of outcomes remains the same as that with replacement (see Statistical Note 34). This is clearly seen on tree diagram 1 as well. Since the cards were drawn without replacement, the principle of conditional probability applies to the second through subsequent cards drawn. The outcomes with the number of black cards ranging from five black cards, denoted by ‘5B’, to zero black cards (all five red cards), denoted by ‘0B’, are indicated by different colors on the third block from the right in tree diagram 1. Besides, the outcomes of the seven events of drawing five cards without replacement listed in Table 1 are among the 32 outcomes and are shown with the respective outcome numbers E1 to E7 in different colors at the rightmost part of tree diagram 1.
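The 32 possible ordered outcomes can be listed with a short sketch using the standard library. The codes B and R follow this note; the variable names are illustrative.

```python
from itertools import product

# Every ordered sequence of five cards, each coded B (black) or R (red).
outcomes = list(product("BR", repeat=5))

print(len(outcomes))          # 2 ** 5 = 32
print("".join(outcomes[0]))   # BBBBB
print("".join(outcomes[-1]))  # RRRRR

# The first event E1 of this note (B,R,R,B and R) is one of the 32 outcomes.
e1 = ("B", "R", "R", "B", "R")
print(e1 in outcomes)         # True
```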

Questions: How many groups of outcomes are there in the table of observed data? Why were there outcomes with only two to three black cards in drawing five cards without replacement? Why not fewer or more than that number of black cards?

There are two groups of outcomes, with two and three black cards in drawing five cards without replacement. Some events have the same number of black cards. Five events (E1, E2, E4, E6 and E7), each with five cards, have two black cards and two events (E3 and E5) have three black cards. If such events were repeated many more times, those events could have other numbers of black cards, ranging from zero to all five black cards in drawing five cards without replacement. Thus, there is no guarantee that a specified number of black cards will occur in any number of cards drawn.

Diagram 1: Tree Diagram Showing Outcomes in Drawing Five Cards Without Replacement from a Deck of Cards

Question: How many groups of outcomes will there be theoretically having five black cards to no black card in drawing five cards without replacement?

Refer to my Statistical Note 15 on the grouping of unique outcomes in which the order does not matter. The possible number of outcome groups is based on the number of black or red cards in the cards drawn, irrespective of the order in which a black card occurs. This is calculated using the formula C(k+r-1,r) = (k+r-1)!/[(k-1)! x r!], where ‘C’ refers to the combination, ‘k’ is the number of possible outcomes in an experiment or trial and ‘r’ is the number of times an experiment or a trial is conducted. In this example, ‘k’ is two and ‘r’ is five so that the number of groups of possible outcomes is calculated to be 6! divided by (1! x 5!), equal to 6.

The grouping of outcomes by the number of black cards is shown in tree diagram 1. There are six groups of outcomes, ranging from G1 to G6. G1 has only one outcome with five black cards in drawing five cards without replacement, G2 has five outcomes with four black cards, G3 has ten outcomes with three black cards, G4 has ten outcomes with two black cards, G5 has five outcomes with one black card and G6 has one outcome with no black card, which means all red cards in drawing five cards without replacement.
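The six groups and their sizes can be verified by counting black cards in each of the 32 ordered outcomes. This is a minimal sketch; the variable names are illustrative.

```python
from collections import Counter
from itertools import product
from math import comb

# Count how many of the 32 ordered outcomes contain 0, 1, ..., 5 black cards.
group_sizes = Counter(seq.count("B") for seq in product("BR", repeat=5))

# Sizes listed from five black cards (G1) down to no black card (G6).
sizes_g1_to_g6 = [group_sizes[b] for b in range(5, -1, -1)]
print(sizes_g1_to_g6)  # [1, 5, 10, 10, 5, 1]

# Number of groups from C(k+r-1, r) with k = 2 colors and r = 5 cards.
k, r = 2, 5
n_groups = comb(k + r - 1, r)
print(n_groups)        # 6
```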

This grouping can be shown using the Binomial Expansion formula. Let ‘B’ be the black card and ‘R’ be the red card. Since, five cards are drawn without replacement, a power five of sum of ‘B’ and ‘R’ is used for the Binomial expansion. The expansion of (B+R)⁵ is expressed as:

(B+R)⁵ = B⁵+5B⁴R+10B³R²+10B²R³+5BR⁴+R⁵

‘B⁵’ means there is one outcome G1 having five black cards in drawing five cards from a deck, in the same way ‘5B⁴R’ means there are five outcomes having four black cards and one red card, ‘10B³R²’ means 10 outcomes with three black cards and two red cards, ‘10B²R³’ means 10 outcomes with two black cards and three red cards, ‘5BR⁴’ means five outcomes with one black card and four red cards, and ‘R⁵’ means one outcome constituting five red cards in drawing five cards from a deck.

The number of outcomes in a certain group of outcomes or the Binomial coefficient can be identified using Pascal’s Triangle as discussed in my Statistical Note 16, as shown in Diagram 2.

Diagram 2: Number of cards drawn from a deck and Binomial Coefficient using Pascal’s Triangle

With an increase in the number of cards drawn, say 20 cards, it is difficult to draw the tree diagram as well as to write the Binomial expansion of (B+R)²⁰, particularly the coefficients of each group of outcomes. In such a case, the Binomial Distribution function is used to identify the coefficients. The formula is C(n,x)B^xR^(n-x), where ‘n’ is the number of trials, ‘x’ is the number of black cards, ‘B’ is the black card and ‘R’ is the red card. For example, I use the Binomial distribution formula to calculate the number of outcomes in the group constituting three black and two red cards in drawing five cards without replacement from a deck. It is C(5,3)B³R², equivalent to 10B³R², which is the same as the third term in the above Binomial expansion.

Using the Binomial distribution function, as discussed in my Statistical Note 16, the expansion is:

(B+R)ⁿ = C(n,0)Bⁿ + C(n,1)B^(n-1)R + C(n,2)B^(n-2)R² + C(n,3)B^(n-3)R³ + …… + C(n,x)B^(n-x)R^x + …… + C(n,n)Rⁿ

This expression can be used to calculate the number of outcomes in a certain group of black cards and ultimately the total number of outcomes for the given number of events or experiments without replacement.
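The coefficients of the expansion can be computed directly with `math.comb` from the Python standard library, which avoids drawing a tree diagram for a larger number of cards. The helper name `binomial_coefficients` is an illustrative assumption.

```python
from math import comb

def binomial_coefficients(n: int) -> list[int]:
    """Coefficients C(n, 0), C(n, 1), ..., C(n, n) of the expansion of (B+R)^n."""
    return [comb(n, x) for x in range(n + 1)]

coeffs_5 = binomial_coefficients(5)
print(coeffs_5)            # [1, 5, 10, 10, 5, 1]

coeffs_20 = binomial_coefficients(20)
print(coeffs_20[10])       # middle coefficient of (B+R)^20
print(sum(coeffs_20))      # total number of ordered outcomes, 2 ** 20
```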

The same expansion can be used to identify the number of outcomes in each group, that is, the Binomial coefficient, for the Hypergeometric distribution also. The main change lies in calculating the probability of an outcome: because cards are not replaced, the probability of the second and each subsequent card is a conditional probability whose denominator decreases by one with every draw (52, 51, 50, 49 and 48), and whose numerator is the number of cards of that color still left in the deck.

Question: What is the probability of the first event E1 (B,R,R,B and R) in Table 1 that has black and red cards in exactly this order?

The joint probability is calculated by multiplying the marginal probability of the first card and the conditional probabilities of the remaining cards drawn without replacement. There is only one outcome that has black cards in the first and fourth draws and red cards in the second, third and fifth draws of five cards without replacement from a deck of cards. Thus, the probability of the event E1 (B,R,R,B and R) is the product of the marginal probability of the first black card and the conditional probabilities of the remaining four cards (R,R,B and R). Using these probability values from tree diagram 1, P(B,R,R,B and R) is the product of 26/52, 26/51, 25/50, 25/49 and 24/48, equal to 0.032513. The last factor is 24/48 because two of the 26 red cards have already been drawn, leaving 24 red cards among the remaining 48.
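The joint probability of an ordered outcome can be sketched with exact fractions, assuming a standard deck of 52 cards with 26 black and 26 red cards. The function name `ordered_probability` is hypothetical. At every draw, the factor is the number of cards of that color still in the deck divided by the number of cards remaining.

```python
from fractions import Fraction

def ordered_probability(sequence, black: int = 26, red: int = 26) -> Fraction:
    """Joint probability of drawing the colors in `sequence`, in order, without replacement."""
    prob = Fraction(1)
    for color in sequence:
        total = black + red
        if color == "B":
            prob *= Fraction(black, total)
            black -= 1   # one black card fewer for later draws
        else:
            prob *= Fraction(red, total)
            red -= 1     # one red card fewer for later draws
    return prob

p_e1 = ordered_probability("BRRBR")
print(round(float(p_e1), 6))
```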

Question: What is the probability of two black cards out of five cards drawn without replacement in which the order does not matter whether a black or red card occurs in which draw out of five cards?

Looking at tree diagram 1, there are 10 outcomes constituting two black cards under group G4. They are: first outcome (B,B,R,R and R); second outcome (B,R,B,R and R); third outcome (B,R,R,B and R); fourth outcome (B,R,R,R and B); fifth outcome (R,B,B,R and R); sixth outcome (R,B,R,B and R); seventh outcome (R,B,R,R and B); eighth outcome (R,R,B,B and R); ninth outcome (R,R,B,R and B); and tenth outcome (R,R,R,B and B). The probability of two black cards, P(2B), out of five cards drawn without replacement is equal to the sum of the joint probabilities of these ten outcomes. Thus,

P(2B)= P(B∩B∩R∩R∩R)+P(B∩R∩B∩R∩R)+P(B∩R∩R∩B∩R)+P(B∩R∩R∩R∩B)+P(R∩B∩B∩R∩R)+P(R∩B∩R∩B∩R)+P(R∩B∩R∩R∩B)+P(R∩R∩B∩B∩R)+ P(R∩R∩B∩R∩B)+ P(R∩R∩R∩B∩B)

P(2B) = (26x25x26x25x24)/(52x51x50x49x48)+(26x26x25x25x24)/(52x51x50x49x48)+ (26x26x25x25x24)/(52x51x50x49x48)+(26x26x25x24x25)/(52x51x50x49x48)+ (26x26x25x25x24)/(52x51x50x49x48)+(26x26x25x25x24)/(52x51x50x49x48)+ (26x26x25x24x25)/(52x51x50x49x48)+(26x25x26x25x24)/(52x51x50x49x48)+ (26x25x26x24x25)/(52x51x50x49x48)+(26x25x24x26x25)/(52x51x50x49x48)

Each of the ten numerators contains the factors 26 and 25 for the two black cards and 26, 25 and 24 for the three red cards, so each is equal to 10,140,000.

P(2B) = (10 x 10,140,000)/(52x51x50x49x48) = 101,400,000/311,875,200 = 0.325130
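As a check on the sum over the ten orderings, the sketch below lists every distinct arrangement of two black and three red cards and adds their exact joint probabilities. It assumes a standard deck with 26 black and 26 red cards; the function and variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def ordered_probability(sequence, black: int = 26, red: int = 26) -> Fraction:
    """Joint probability of drawing the colors in `sequence`, in order, without replacement."""
    prob = Fraction(1)
    for color in sequence:
        total = black + red
        if color == "B":
            prob *= Fraction(black, total)
            black -= 1
        else:
            prob *= Fraction(red, total)
            red -= 1
    return prob

# Distinct orderings of two black and three red cards.
orderings = sorted(set(permutations("BBRRR")))
probabilities = [ordered_probability(o) for o in orderings]

p_two_black = sum(probabilities)
print(len(orderings))                   # 10
print(len(set(probabilities)))          # 1, every ordering has the same probability
print(round(float(p_two_black), 6))
```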

The probability of two black cards out of five cards drawn without replacement can be calculated using Hypergeometric distribution formula.

Let X be a random variable of interest (number of black cards) that takes the value two as the number of black cards in the sample of five cards drawn without replacement, denoted by ‘x’. The probability distribution of X depends on the parameters ‘n’ (the number of cards drawn), ‘M’ (the number of black cards in the deck) and ‘N’ (the total number of cards in the deck), and is given by the expression

P(X=x) = h(x;n,M,N) = [C(M,x) X C(N-M,n-x)]/C(N,n)

In this example, n=5, M=26, N=52 and ‘x’ takes the value 2. Putting these values in the above formula, one gets

P(X=2) = [C(26,2) X C(26,3)]/C(52,5) = (26! X 26! X 47! X 5!) / (24! X 2! X 23! X 3! X 52!) = 0.325130

This value is equal to the one calculated above. In the same way, the probability for other number of black cards in five cards drawn without replacement can be calculated. Excel software can also be used to calculate the probability using Hypergeometric formula and Hypergeometric function.
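The Hypergeometric formula can be sketched with `math.comb` from the Python standard library. The function name `hypergeom_pmf` is an illustrative choice; its arguments follow the symbols x, n, M and N used in this note.

```python
from math import comb

def hypergeom_pmf(x: int, n: int, M: int, N: int) -> float:
    """P(X = x) = C(M, x) * C(N - M, n - x) / C(N, n)."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Two black cards in five cards drawn from 52 cards, 26 of which are black.
p_two_black = hypergeom_pmf(2, 5, 26, 52)
print(comb(26, 2), comb(26, 3), comb(52, 5))  # 325 2600 2598960
print(round(p_two_black, 6))                  # 0.32513
```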

Question: Looking at Table 1, what will be the observed discrete probability distribution of number of black cards?

To summarize, the number of black cards out of five cards drawn without replacement in seven events ranged from two to three (Table 1). Two black cards were drawn in five of the seven events (E1, E2, E4, E6 and E7) and three black cards were drawn in two events (E3 and E5). Thus, drawing two black cards occurred most often, five out of seven times, with the observed probability P(X=2) = 0.714286, highlighted yellow in Table 2.
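The observed distribution can be tabulated from the counts stated in this note: two black cards in E1, E2, E4, E6 and E7, and three black cards in E3 and E5. The list below is written from that description, and the variable names are illustrative.

```python
from collections import Counter

# Number of black cards in events E1 to E7, as described in this note.
black_counts = [2, 2, 3, 2, 3, 2, 2]

frequency = Counter(black_counts)
observed = {x: frequency[x] / len(black_counts) for x in sorted(frequency)}

for x, prob in observed.items():
    print(x, frequency[x], round(prob, 6))
```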

Table 2: Number of Black Cards Out of Five Cards Drawn Without Replacement in Each of Seven Events and Observed Probability Distribution of Number of Black Cards

Question: What will be the theoretical probability distribution of number of black cards in drawing five cards without replacement?

The probability of every group of outcomes can be calculated as in the former section to present the probability distribution in Table 3. Besides, the Hypergeometric distribution formula and the Hypergeometric function in Excel are also used to calculate the two-category theoretical probability distribution without replacement (Table 4). Refer to my Statistical Notes 31 and 32, which discuss the calculation of the Theoretical Two-Category Discrete Probability Distribution.

Two or three black cards in drawing five cards without replacement have the highest probability and are thus the most likely to occur. These are highlighted yellow. The likelihood decreases on both sides of two or three black cards. The two extreme numbers of black cards, zero and five, have the least chance of occurrence.

Table 3: Number of Black Cards, Number of Outcome Groups and Theoretical Probability Distribution of Number of Black Cards

Table 4: Number of Black Cards, Number of Red Cards, Hypergeometric Distribution Formula and Function Used to Calculate the Theoretical Probability Distribution of Number of Black Cards of Five Cards Drawn Without Replacement
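The whole theoretical distribution can also be reproduced with a short Python loop over the possible numbers of black cards. This is a sketch under the same assumptions as the formula above (52 cards, 26 of them black, five cards drawn); the variable names are mine and the Excel tables remain the reference for this note.

```python
from math import comb

N, M, n = 52, 26, 5  # cards in the deck, black cards in the deck, cards drawn

# Theoretical probability of x black cards for x = 0, 1, ..., 5.
distribution = {
    x: comb(M, x) * comb(N - M, n - x) / comb(N, n)
    for x in range(n + 1)
}

for x, p in distribution.items():
    print(f"{x} black, {n - x} red: {p:.6f}")
```

Because the deck has as many black cards as red cards, the probabilities are symmetric: zero and five black cards share the smallest value, and two and three black cards share the largest.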

Question: How different will be the observed from the theoretical discrete probability distribution of number of black cards in drawing five cards without replacement?

Chart 1 compares the observed and theoretical two category discrete probability distribution of black cards in drawing five cards without replacement.

This clearly shows the bell-shaped curve, the symmetric line chart of the theoretical probability distribution, and how different the observed distribution and its chart are. By contrast, the observed distribution is positively skewed.

Conclusion

The tree diagram, Binomial expansion, Binomial distribution function and Hypergeometric function are important tools to calculate the number of outcomes and the probability of samples drawn without replacement. The observed two-category probability distribution differs from the theoretical distribution. The observed data could differ from one event to another because of non-uniformity in the conditions in which a card is drawn without replacement.

Saturday, September 8, 2018

Two-category Discrete Probability Distribution of Sampling With Replacement, Observation and Theory, Statistical Note 34

Toss a coin five times and count the number of heads. Repeat the same process seven times, giving seven sets of five tosses each. Calculate the observed and theoretical discrete probability distributions of the number of heads.

The observed probability distribution is based on the real-time data. The theoretical probability distribution is based on an ideal situation. Using the observed data is important to understand the theory. The main objective of this note is to develop understanding of complex probability and probability distribution concepts using a simple experiment.

Tossing of a coin is an example of the two-category discrete probability distribution of sampling with replacement. Refer to my earlier Statistical Notes also for clarity on calculating the two-category discrete probability using tree diagram, formula and Excel software function.

In this note, I first present the observed data and then present the probability and two-category discrete probability concepts using the trial data. This note tries to clarify the concept of the two-category discrete probability distribution based on the observed data. Former notes first tried to clarify the theory and then discussed the observation. This note goes the other way round: it first discusses the observed data using the tree diagram and then clarifies the theoretical distribution. The purpose is to analyze and interpret the observed data meaningfully based on the theory.

Observed Data

I tossed a coin five times in a set and the same process was repeated for seven sets. Table 1 presents the outcome of five tosses of a coin in each of the seven sets. Head and tail were coded H and T respectively for symbolic representation. Besides, a cell with a tail is shaded sky blue and a cell with a head is shaded brown. I will discuss this table further in the following sections.

Table 1: Outcomes in five tosses of a coin in each of seven sets (head=H and tail=T)

Queries

Several questions may arise looking at the outcomes data in Table 1. For example,
· Why is every outcome in the table different from the others?
· Is there any pattern of outcomes in five tosses of a coin?
· Why were there outcomes with only one to three heads in five tosses? Why not fewer or more heads than that?
· How many unique outcomes will there be in five tosses?
· Can unique outcomes be grouped?
· How many different groups of outcomes will there be, from five heads to no heads, in five tosses of a coin?
· What is the probability of the event in the first set S1 (T,T,T,H and H) in Table 1, which has tails and heads in exactly this order?
· What is the probability of two heads out of five tosses when the order of heads and tails does not matter?
· Looking at Table 1, what will be the observed discrete probability distribution of the number of heads?
· What will be the theoretical probability distribution of the number of heads in five tosses of a coin?
· How different will the observed be from the theoretical discrete probability distribution of the number of heads in five tosses of a coin?

Response

These questions can be answered using different tools – Tree Diagram, Binomial Expansion and Binomial Distribution functions. Look at specific statistical notes to get answers to these questions. Below are responses to the queries:

Questions: Why is every outcome in the table different from the others? Is there any pattern of outcomes in five tosses of a coin? Why were there outcomes with only one to three heads in five tosses? Why not fewer or more heads than that?

A coin is unbiased, such that head and tail are equally likely to turn up, each with a probability of one half. Every toss of a coin is a random experiment in which either side of the coin may turn up. Thus, every outcome in the table is different from the others. However, the same outcome could also appear in more than one set of tosses. There are some patterns of outcomes in five tosses. Four sets (S3 to S5 and S7), each with five tosses, have three heads, two sets (S1 and S2) have two heads and one set (S6) has one head. Other sets could have any other number of heads, ranging from zero to all five heads in five tosses. Thus, there is no guarantee that a specified number of heads will turn up in any number of tosses.

Question: How many unique outcomes will there be in five tosses?

Refer to my statistical notes 9 and 15 on the total number of possible outcomes of sampling with replacement. Each unique outcome differs from the others in which side of the coin turns up in a certain toss. The total number of possible outcomes is calculated by using the formula ‘k to the power r’ or ‘k to the r^th power’ or ‘k^r’, where ‘k’ is the number of possible outcomes in an experiment or trial and ‘r’ is the number of times an experiment or a trial is conducted. In this example, ‘k’ is two and ‘r’ is five, so the total number of possible outcomes is calculated by multiplying the two possibilities (head or tail) in each of the five tosses. This is calculated as 2 X 2 X 2 X 2 X 2, equal to 32, represented by ‘two to the power five’ or ‘two to the fifth power’, denoted by 2⁵. This is clearly seen on tree diagram 1 as well. The outcomes with the number of heads ranging from five heads, denoted by ‘5H’, to zero heads, denoted by ‘0H’, are indicated by different colors on the third block from the right in tree diagram 1. Besides, the seven outcomes of five tosses of a coin listed in Table 1 are among the 32 outcomes and are shown with the respective outcome numbers S1 to S7 in different colors at the rightmost part of tree diagram 1.

Diagram 1: Tree Diagram Showing Outcomes in Five Tosses of a Coin
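The leaves of the tree diagram can also be listed programmatically. The following is a small Python sketch of my own, not part of the original notes, that uses the standard-library function `itertools.product` to enumerate every ordered outcome of five tosses and confirm the count k^r.

```python
from itertools import product

k, r = 2, 5  # two sides of the coin, five tosses

# Every ordered outcome of five tosses, e.g. ('T', 'T', 'T', 'H', 'H').
outcomes = list(product("HT", repeat=r))

print(len(outcomes))  # 32
print(k ** r)         # 32
```

Each tuple corresponds to one path from the root of the tree to its rightmost block, so the seven observed sets S1 to S7 are seven of these 32 tuples.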

Questions: Can unique outcomes be grouped? How many groups of outcomes will there be having five heads to no heads in five tosses of a coin?

Refer to my statistical note 15 on the grouping of unique outcomes of sampling with replacement in which the order does not matter. The possible number of outcome groups is based on the number of heads or tails in the tosses, irrespective of the order in which the heads turn up. This is calculated using the formula C(k+r-1,r) = (k+r-1)!/[(k-1)! X r!], where ‘C’ refers to the combination, ‘k’ is the number of possible outcomes in an experiment or trial and ‘r’ is the number of times an experiment or a trial is conducted. In this example, ‘k’ is two and ‘r’ is five, so the number of groups of possible outcomes is C(6,5) = 6!/(1! X 5!), that is 6! divided by 5!, equal to 6.

The grouping of outcomes by the number of heads is shown in tree diagram 1. There are six groups of outcomes, ranging from G1 to G6. G1 has only one outcome with five heads in five tosses, G2 has five outcomes with four heads, G3 has 10 outcomes with three heads, G4 has 10 outcomes with two heads, G5 has five outcomes with one head and G6 has one outcome with no head, meaning all tails in five tosses.
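The six groups G1 to G6 can be checked by counting the heads in each of the 32 ordered outcomes. This is an illustrative Python sketch under the same assumptions as above (k=2, r=5); the variable names are mine.

```python
from collections import Counter
from itertools import product
from math import comb

k, r = 2, 5  # two sides of the coin, five tosses

# Group the 32 ordered outcomes by their number of heads.
group_sizes = Counter(outcome.count("H") for outcome in product("HT", repeat=r))

# Print from five heads (G1) down to no heads (G6).
for heads in range(r, -1, -1):
    print(f"{heads} heads: {group_sizes[heads]} outcomes")

print(len(group_sizes))    # 6
print(comb(k + r - 1, r))  # 6
```

The group sizes come out as 1, 5, 10, 10, 5 and 1, and the number of groups agrees with C(k+r-1,r).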

This grouping can be shown using the Binomial Expansion formula. Let ‘H’ be the head and ‘T’ be the tail. Since a coin is tossed five times, the fifth power of the sum of ‘H’ and ‘T’ is used for the Binomial expansion. The expansion of (H+T)⁵ is expressed as:

(H+T)⁵ = H⁵+5H⁴T+10H³T²+10H²T³+5HT⁴+T⁵

‘H⁵’ means there is one outcome G1 having five heads in five tosses of a coin, in the same way ‘5H⁴T’ means there are five outcomes having four heads and one tail, ‘10H³T²’ means 10 outcomes with three heads and two tails, ‘10H²T³’ means 10 outcomes with two heads and three tails, ‘5HT⁴’ means five outcomes with one head and four tails, and ‘T⁵’ means one outcome constituting five tails in five tosses of a coin.

The same expansion is used for the probability of head, ‘h’ and the probability of tail, ‘t’. The expansion is: (h+t)⁵ = h⁵+5h⁴t+10h³t²+10h²t³+5ht⁴+t⁵

The number of outcomes in a certain group of outcomes or the Binomial coefficient can be identified using Pascal’s Triangle as discussed in my Statistical Note 16, as shown in Diagram 2.

Diagram 2: Number of Tosses of a Coin and Binomial Coefficients Using Pascal’s Triangle

As the number of tosses increases, say to 20 tosses of a coin, it becomes difficult to draw the tree diagram as well as to write the Binomial expansion of (H+T)²⁰, particularly the coefficients of each group of outcomes. In such a case, the Binomial distribution function is used to identify the coefficients. The formula is C(n,x)H^x T^(n-x), where ‘n’ is the number of trials, ‘x’ is the number of heads, ‘H’ is the head and ‘T’ is the tail. For example, I use the Binomial distribution formula to calculate the number of outcomes in the group constituting three heads and two tails in five tosses. It is C(5,3)H³T², equivalent to 10H³T², which is the same as the third term in the above Binomial expansion.

Using the Binomial distribution function, as discussed in my Statistical Note 16, the expansion looks like this, where ‘x’ counts the tails in a term:

(H+T)ⁿ = C(n,0)Hⁿ + C(n,1)H^(n-1)T + C(n,2)H^(n-2)T² + C(n,3)H^(n-3)T³ + …… + C(n,x)H^(n-x)T^x + …… + C(n,n)Tⁿ

This expression can be used to calculate the number of outcomes in a certain group of heads and ultimately the total number of outcomes for the given number of trials or experiments with replacement.
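For 20 tosses, where the tree diagram is impractical, the coefficients of the expansion can be generated directly. The following Python sketch is my own illustration and uses only `math.comb` from the standard library.

```python
from math import comb

n = 20  # number of tosses

# Coefficient C(n, x) of each term of (H+T)^n, for x = 0, 1, ..., n.
coefficients = [comb(n, x) for x in range(n + 1)]

print(coefficients[:4])             # [1, 20, 190, 1140]
print(max(coefficients))            # 184756
print(sum(coefficients) == 2 ** n)  # True
```

The largest coefficient belongs to the group with 10 heads and 10 tails, and the coefficients add up to the total number of outcomes, 2 to the power 20.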

Question: What is the probability of an event in the first set S1 (T,T,T,H and H) in Table 1 that has tails and head in exactly this order?

Probability is calculated by dividing the number of favorable outcomes by the total number of possible outcomes. There is only one outcome that has three tails followed by two heads in five tosses of a coin. As already discussed at length above, there are 32 outcomes in five tosses. Thus, the probability of the event (T,T,T,H and H) is one divided by 32, equal to 0.03125.

Question: What is the probability of two heads out of five tosses in which the order does not matter whether a head or a tail occurs in which toss?

Looking at the tree diagram, the Binomial expansion and the Binomial distribution function as discussed above, there are 10 outcomes having two heads and three tails in five tosses of a coin. Thus, the probability of two heads is 10 divided by 32, equal to 0.3125.
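Both probabilities above follow from counting outcomes, which a few lines of Python can confirm. This is a minimal sketch of my own; it assumes an unbiased coin so that all 32 outcomes are equally likely.

```python
from fractions import Fraction
from math import comb

tosses = 5
total_outcomes = 2 ** tosses  # 32 equally likely ordered outcomes

# One specific order, such as T,T,T,H,H, is a single outcome.
p_exact_order = Fraction(1, total_outcomes)

# Two heads in any order: C(5,2) = 10 outcomes.
p_two_heads = Fraction(comb(tosses, 2), total_outcomes)

print(float(p_exact_order))  # 0.03125
print(float(p_two_heads))    # 0.3125
```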

Question: Looking at Table 1, what will be the observed discrete probability distribution of number of heads?

To summarize, the number of heads out of five tosses in seven sets ranged from one to three (Table 1). Two heads turned up in two of the seven sets of five tosses (S1 and S2), three heads turned up in four sets (S3 to S5 and S7) and one head turned up in one set (S6). Thus, three heads was the most frequent observed outcome, occurring four out of seven times with probability P(X=3)=0.571428, highlighted yellow in Table 2.

Table 2: Number of Heads Out of Five Tosses of a Coin in Each of Seven Sets and Observed Probability Distribution of Number of Heads
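The observed distribution can be tallied from the number of heads per set. The Python sketch below is my own illustration; the counts are the ones stated in the text for S1 to S7, and the dictionary name is an assumption.

```python
from collections import Counter

# Number of heads in each set of five tosses, as stated in the text.
heads_per_set = {"S1": 2, "S2": 2, "S3": 3, "S4": 3, "S5": 3, "S6": 1, "S7": 3}

counts = Counter(heads_per_set.values())
total_sets = len(heads_per_set)

# Observed probability = number of sets with that many heads / seven sets.
for heads in sorted(counts):
    share = counts[heads] / total_sets
    print(f"{heads} heads: {counts[heads]}/{total_sets} = {share:.6f}")
```

The sketch rounds to six decimals, so four out of seven prints as 0.571429, whereas the text truncates the same value to 0.571428.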

Question: What will be the theoretical probability distribution of number of heads in five tosses of a coin?

I also discussed the calculation of the Theoretical Two-Category Discrete Probability Distribution in my former Statistical Note 29. As discussed above, the number of outcomes under different groups of heads can be calculated using the tree diagram, the Binomial expansion and the Binomial distribution function. Here, I present only the table constituting the number of heads in five tosses and the respective probabilities (Table 3). The Binomial distribution function in Excel is also used to calculate the two-category theoretical probability distribution with replacement.

Table 3: Number of Heads, Number of Outcome Groups and Theoretical Probability Distribution of Number of Heads

Two or three heads in five tosses have the highest probability and are thus the most likely to occur. These are highlighted yellow. The likelihood decreases on both sides of two or three heads. The two extreme numbers of heads, zero and five, have the least chance of occurrence.
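The theoretical probabilities in Table 3 can be reproduced with the Binomial formula. This Python sketch is my own cross-check rather than the Excel calculation used in this note, and it assumes an unbiased coin with probability of head p=0.5.

```python
from math import comb

n, p = 5, 0.5  # five tosses of an unbiased coin

# P(X = x) = C(n, x) * p^x * (1 - p)^(n - x) for x = 0, 1, ..., 5 heads.
theoretical = {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}

for x, prob in theoretical.items():
    print(f"{x} heads: {prob:.5f}")
```

With p=0.5, each probability is simply the Binomial coefficient divided by 32, which is why two and three heads tie for the highest value and zero and five heads tie for the lowest.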

Question: How different will be the observed from the theoretical discrete probability distribution of number of heads in five tosses of a coin?

Chart 1 compares the observed and theoretical two-category discrete probability distributions of heads in five tosses of a coin. This clearly shows the bell-shaped curve, the symmetric line chart of the theoretical probability distribution, and how different the observed distribution and its chart are.

Conclusion

The tree diagram, Binomial expansion and Binomial distribution function are important tools to calculate the number of outcomes and the probability. The observed two-category probability distribution differs from the theoretical distribution. The observed data could differ from one set to another because of non-uniformity in the conditions in which a coin is tossed repeatedly.