Basan Shrestha's Diary: Two Category Discrete Probability Distribution of Sampling Without Replacement and EXCEL Function, Statistical Note 23

Friday, July 13, 2018

Two Category Discrete Probability Distribution of Sampling Without Replacement and EXCEL Function, Statistical Note 23

Among 40 participants in a training, 18 were vegetarians and 22 were non-vegetarians. 2 participants are selected at random one after another without replacement of the name of the first selected participant. Calculate the probability distribution of vegetarians in which the order does not matter whether the vegetarians are sampled or not in the first or second draw.

Counting the total and favorable numbers of outcomes constituting the specified number of objects sampled without replacement from the finite population is important to calculate the probability of favorable events. Excel software is commonly available in the desktop or the laptop and is an important means to calculate the discrete probability distribution. I take an example from my Statistical Note 18, and will show how calculate the discrete probability distribution without replacement using Excel function. Refer to that note for conceptual clarity on using the tree diagram and formula to calculate the Hypergeometric probability distribution.

EXCEL Function

Excel software has a ‘Formulas’ tab, which has ‘Insert Function’ command under ‘Function Library’ group. Select ‘HYPGEOM.DIST’ function from the list of all function. The ‘HYPGEOM.DIST’ formula has five fields ‘Sample-s’, ‘Number_sample’, ‘Population_s’, ‘Number_pop’ and Cumulative’ as shown in Diagram 1.

Diagram 1: Selection of ‘HYPGEOM.DIST’ function in Excel Data Worksheet

The field ‘Sample_s’ takes the number of successes in trials. In this example, this field takes the value from zero to two vegetarians as indicated from B2 to B4 cells in Diagram 1. Because, in the sample of two participants there could be no vegetarian, one vegetarian or two vegetarians selected. In this calculation, one vegetarian in two independent draws of participants has been used as shown in the cell B3 in the table as well as an argument box in Diagram 1.

The field ‘Number_sample’ is the sample size. In this example, two participants were sampled without replacement and thus two was entered into this field of the function arguments in Diagram 1.

The field ‘Population_s’ is the number of successes in the population. This example has 18 vegetarians out of 40 participants. Thus, 18 was entered into the field of the function argument box in Diagram 1.

The field ‘Number_pop’ is the population size. This example has a total of 40 participants. Thus, 40 was entered into this field of the function argument in the diagram 1.

The field ‘Cumulative’ is a logical value that determines the form of the function. If ‘Cumulative’ is ‘FALSE’, ‘HYPGEOM.DIST’ calculates the probability mass function (PMF), which gives the probability associated with the value assigned to the field ‘Sample_s’ as the number of successes. Thus, ‘FALSE’ was entered into the function argument box in Diagram 1.

Fixing all five fields in the function arguments, ‘HYPGEOM.DIST’ function calculated the PMF equal to 0.507. It means that there is 50.7 percent chance that one of two participants sampled without replacement will be a vegetarian. This is equal to the value manually calculated in my Statistical Note 18.

The cumulative distribution function (CDF) is another important probability calculation. If the field ‘Cumulative’ takes the value TRUE, the ‘HYPGEOM.DIST’ results in the CDF, which gives the probability up to the number of successes mentioned in the field ‘Sample_s’. In this example, the probability upto 1 vegetarian selected is the sum of probability of no vegetarian or non-vegetarian selected in both samples and the probability of one vegetarian selected between two participants sampled. The logical value ‘TRUE’ is used an shown in both the function argument box and the dataset with the cell value ‘N53’ that gives the CDF equal to 0.8038 as shown in Diagram 2.

Diagram 2: ‘HYPGEOM.DIST’ Function Arguments Using Dataset in Excel Worksheet and using ‘TRUE’ logical value in the field ‘Cumulative’

It means that there is 80.3 percent chance that up to one vegetarian will be selected in two participants sampled without replacement. This is equal to the one manually calculated in the Statistical Note 18.

PMFs and PDFs can be undertaken for all number of successes, that is the number of two non-vegetarians (no vegetarian), one vegetarian and two vegetarians out of two participants sampled without replacement as shown in tables and argument functions in both diagrams 1 and 2.

Discussion in this note and my former Statistical Note 18 indicates that both manual and auto calculation produce the same values and are useful to calculate the discrete probability distribution without replacement. Thus, conceptual understanding is a backbone and automatization is efficient. Thus, both are important knowledge and skill sets.