Basan Shrestha's Diary: Exploring Quantitative Data

Saturday, May 9, 2020

Exploring Quantitative Data

Exploring data is a kind of data health check to see whether outliers pull the distribution away from normal. Parametric inferential statistics assume that the data follow a normal distribution, so this check comes before any such analysis. Several summary functions and graphical displays are used to explore data. I illustrate these measures with the daily database on worldwide COVID-19 incidence as of April 29, 2020, which I downloaded from the European Centre for Disease Prevention and Control website. The dataset has 109 daily records. This article should be read along with my previous article entitled ‘Describing Quantitative Data’.

Five Number Summary

The minimum value, the first quartile (Q1) or 25th percentile, the median or 50th percentile, the third quartile (Q3) or 75th percentile, and the maximum value are the five summary numbers used to explore data. In a normal distribution the median is at the center of the distribution. Besides, the distance between the minimum value and the median is approximately equal to the distance between the median and the maximum value. Likewise, the distance between the minimum value and Q1 is approximately equal to the distance between Q3 and the maximum value, as shown by the black and red arrows in figure 1.

Figure 1: Five Number Summary in Normal Distribution

In a positively skewed distribution the distance between the minimum value and the median is smaller than the distance between the median and the maximum value, as shown by the arrows in figure 2. Likewise, the distance between the minimum value and Q1 is smaller than the distance between Q3 and the maximum value.

Figure 2: Five Number Summary in Positively Skewed Distribution

In a negatively skewed distribution the distance between the minimum value and the median is greater than the distance between the median and the maximum value, as shown by the black and red arrows in figure 3. Likewise, the distance between the minimum value and Q1 is greater than the distance between Q3 and the maximum value.

Figure 3: Five Number Summary in Negatively Skewed Distribution

In the given dataset, the minimum value is 1 and the maximum value is 101,728. Q1, the median, and Q3 are 1,761.5, 3,907, and 66,403.5, respectively. The difference between the minimum value and the median is 3,906, which is far smaller than the difference of 97,821 between the median and the maximum value. This is consistent with figure 2, indicating that the data are positively skewed.
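The comparison above can be sketched in a few lines of Python. This is only an illustration: the helper names `five_number_summary` and `skew_direction` and the sample list are my own assumptions, not the ECDC data. The quartile rule used here (`method="inclusive"`) is one of several conventions, so Q1 and Q3 may differ slightly from what another statistics package reports.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # Assumption: the "inclusive" quartile rule; other packages may use
    # a different rule and give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def skew_direction(minimum, median, maximum):
    """Compare the distances from the median to the two ends."""
    lower = median - minimum
    upper = maximum - median
    if upper > lower:
        return "positive"
    if upper < lower:
        return "negative"
    return "symmetric"

# Hypothetical daily counts for illustration only (not the ECDC data).
sample = [1, 4, 9, 20, 55, 130, 400, 1200, 5000]
summary = five_number_summary(sample)
print(summary)
print(skew_direction(summary[0], summary[2], summary[4]))

# The minimum, median, and maximum reported in this article.
print(skew_direction(1, 3907, 101728))
```

With the figures reported above, the upper distance (97,821) exceeds the lower distance (3,906), so the check returns "positive".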

Normality Checks

A box plot displays the five number summary, as shown in figure 4. The thick black line inside the box symbolizes the median. The lower and upper hinges, or boundaries, of the box symbolize Q1 and Q3, respectively. The whiskers below and above the hinges extend to the minimum and maximum values when no outliers are present. A box plot also shows outliers and extreme values. Outliers are values that lie more than one and a half times the interquartile range (the difference between Q3 and Q1) below Q1 or above Q3. The given dataset does not have statistically outlying values, as none are marked on the box plot. However, there are five extreme values above Q3, ranging from 86,046 to the maximum number of persons infected in a day. Likewise, there are five extreme values below Q1, ranging from 17 down to the minimum number of infected cases. Those values have led to the skewed distribution.
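The outlier rule can be checked against the quartiles reported earlier. The sketch below assumes the common 1.5 × IQR convention for box plots; the function name `tukey_fences` is my own, and the multiplier used by a particular statistics package may differ.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which values count as outliers."""
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Quartiles, minimum, and maximum as reported in this article.
q1, q3 = 1761.5, 66403.5
minimum, maximum = 1, 101728

lower, upper = tukey_fences(q1, q3)
has_outliers = minimum < lower or maximum > upper

print(lower, upper)   # limits of the non-outlying range
print(has_outliers)   # True only if the minimum or maximum falls outside it
```

The interquartile range is 64,642, so the limits fall at -95,201.5 and 163,366.5. Both the minimum and the maximum lie inside them, which agrees with the box plot showing no outliers.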

The stem and leaf plot shows the shape of the distribution. The values cluster below 10,000 cases per day, a range that holds 62 data points. Values from 50,000 to below 90,000 cases also cluster, as shown in figure 5. This shape shows that the distribution is positively skewed.
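For readers who want to build such a display by hand, here is a minimal sketch of a stem and leaf plot. The function name `stem_and_leaf`, the `stem_unit` parameter, and the sample values are assumptions for illustration; the layout will not match the output of a statistics package exactly, and the sketch handles non-negative whole numbers only.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Map each stem (value // stem_unit) to its sorted list of leaf digits."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, remainder = divmod(value, stem_unit)
        # The leaf is the first digit after the stem.
        leaf = remainder * 10 // stem_unit
        rows[stem].append(leaf)
    return dict(rows)

# Hypothetical values for illustration only (not the ECDC data).
sample = [3, 8, 12, 15, 17, 21, 24, 38, 52, 97]
rows = stem_and_leaf(sample)

# Print every stem in range, including empty ones, so gaps stay visible.
for stem in range(min(rows), max(rows) + 1):
    leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
    print(f"{stem:>3} | {leaves}")
```

For daily counts in the tens of thousands, a call such as `stem_and_leaf(counts, stem_unit=10000)` would group the values by ten-thousands, where `counts` stands for the list of daily figures.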

Figure 5: Stem and leaf plot

A normal Q-Q plot displays a straight line that represents the values expected under a normal distribution. The observed incidence values deviate distinctly from that line, indicating a positively skewed distribution, as shown in figure 6.
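The points behind a normal Q-Q plot can be computed with the Python standard library. This is a sketch under stated assumptions: the function name `normal_qq_points` and the sample values are mine, and the plotting positions (i - 0.5) / n are one common convention that may differ from the one a statistics package uses.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Pair each sorted observation with the value expected under normality."""
    data = sorted(values)
    n = len(data)
    # Normal distribution with the same mean and standard deviation as the data.
    fitted = NormalDist(mean(data), stdev(data))
    # Assumption: plotting positions (i - 0.5) / n.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, data))

# Hypothetical daily counts for illustration only (not the ECDC data).
sample = [1, 4, 9, 20, 55, 130, 400, 1200, 5000]
for expected, observed in normal_qq_points(sample):
    print(f"{expected:10.1f} {observed:8d}")
```

If the data were normal, the observed and expected columns would rise together along a straight line. In a positively skewed sample like this one, the largest observed values run far above their expected values.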

Figure 6: Normal Q-Q Plot

In a nutshell, exploring data summarizes distribution, identifies outliers, and checks normality. These procedures confirm that the distribution of the given dataset is positively skewed.