Exploring data is a kind of health check on the data: it shows whether outliers
are pulling the distribution away from normal. This matters because parametric
inferential statistics assume normally distributed data. Several summary
functions and graphical displays are used to explore data. I illustrate these
measures using the daily database on worldwide COVID-19 incidence as of April 29,
2020, which I downloaded from the European Centre for Disease Prevention and
Control website. The dataset has 109 daily records. This article should be read
alongside my previous article, ‘Describing Quantitative Data’.
Five Number Summary
The minimum value, first quartile (Q1) or 25th percentile,
median or 50th percentile, third quartile (Q3) or 75th
percentile, and maximum value are the five summary numbers used to explore data.
In normally distributed data the median is at the center of the distribution.
In addition, the difference between the minimum value and the median is roughly
equal to the difference between the median and the maximum value. Likewise, the
difference between the minimum value and Q1 is roughly equal to the difference
between Q3 and the maximum value, as shown by the black and red arrows in figure 1.
Figure 1: Five Number Summary in Normal Distribution
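The five numbers described above can be computed with a few lines of Python. This is a minimal sketch, not the procedure used for the figures in this article: the function name `five_number_summary` and the sample values are hypothetical, and the quartiles follow the default rule of Python's `statistics.quantiles`, which may differ slightly from the rule used by other statistical packages.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return ordered[0], q1, median, q3, ordered[-1]

# Illustrative, perfectly symmetric values (NOT the COVID-19 data).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
minimum, q1, median, q3, maximum = five_number_summary(sample)

print("Five number summary:", minimum, q1, median, q3, maximum)
# In symmetric data the two halves and the two tails have equal lengths.
print("Median - min:", median - minimum, "| Max - median:", maximum - median)
print("Q1 - min:", q1 - minimum, "| Max - Q3:", maximum - q3)
```

Because the sample is symmetric, the two distances in each printed pair match, which is the pattern figure 1 describes.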
In a positively skewed distribution, the distance between the minimum value and the median is smaller than the distance between the median and the maximum value, as shown by the arrows in figure 2. Likewise, the distance between the minimum value and Q1 is smaller than the distance between Q3 and the maximum value.
Figure 2: Five Number Summary in Positively Skewed Distribution
In a negatively skewed distribution, the distance between the minimum value and the median is greater than the distance between the median and the maximum value, as shown by the black and red arrows in figure 3. Likewise, the distance between the minimum value and Q1 is greater than the distance between Q3 and the maximum value.
Figure 3: Five Number Summary in Negatively Skewed Distribution
In the given dataset, the minimum value is 1 and the maximum value is 101,728. Q1, the median, and Q3 are 1,761.5, 3,907, and 66,403.5, respectively. The difference between the minimum value and the median is 3,906, which is far smaller than the difference of 97,821 between the median and the maximum value. This is consistent with figure 2, indicating that the data are positively skewed.
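The comparison above is simple arithmetic and can be reproduced from the five reported numbers. The sketch below uses only the summary values quoted in this article; the helper `skew_direction` is a hypothetical name, and its rule (compare the two distances) is just the visual rule of figures 1 to 3 written as code, not a formal skewness test.

```python
# Summary values as reported in the text for the COVID-19 incidence data.
minimum, q1, median, q3, maximum = 1, 1761.5, 3907, 66403.5, 101728

lower_half = median - minimum   # distance from minimum to median: 3906
upper_half = maximum - median   # distance from median to maximum: 97821
lower_tail = q1 - minimum       # distance from minimum to Q1: 1760.5
upper_tail = maximum - q3       # distance from Q3 to maximum: 35324.5

def skew_direction(lower, upper):
    """Name the skew suggested by comparing a lower and an upper distance."""
    if lower < upper:
        return "positive"
    if lower > upper:
        return "negative"
    return "symmetric"

print("Halves:", lower_half, "vs", upper_half, "->", skew_direction(lower_half, upper_half))
print("Tails: ", lower_tail, "vs", upper_tail, "->", skew_direction(lower_tail, upper_tail))
```

Both comparisons point the same way: the upper distance is the longer one, which is the signature of positive skew.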
Normality Checks
A box plot displays the five number summary, as shown in figure 4. The
thick black line inside the box represents the median. The lower and
upper hinges, or boundaries, of the box represent Q1 and Q3, respectively.
The whiskers below and above the hinges extend to the minimum and maximum
values. A box plot also shows outliers and extreme values. Outliers are values
that lie more than one and a half times the interquartile range (the difference
between Q3 and Q1) below Q1 or above Q3. The given dataset has no statistical
outliers, as none are marked on the box plot. However, the five highest
values range from 86,046 up to the maximum number of persons infected in a day.
Likewise, the five lowest values range from 17 down to the minimum number of
infected cases. These extreme values contribute to the skewed distribution.
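The absence of outliers can be checked directly from the reported quartiles. The sketch below assumes the conventional box plot rule, in which the cut-offs (often called fences) sit one and a half interquartile ranges beyond Q1 and Q3; the variable names are illustrative.

```python
# Quartiles and range as reported in the text.
minimum, q1, q3, maximum = 1, 1761.5, 66403.5, 101728

iqr = q3 - q1                    # interquartile range: 64642.0
lower_fence = q1 - 1.5 * iqr     # values below this would be flagged: -95201.5
upper_fence = q3 + 1.5 * iqr     # values above this would be flagged: 163366.5

has_low_outliers = minimum < lower_fence
has_high_outliers = maximum > upper_fence

print("IQR:", iqr)
print("Fences:", lower_fence, "to", upper_fence)
print("Outliers below:", has_low_outliers, "| Outliers above:", has_high_outliers)
```

Since both the minimum and the maximum fall inside the fences, no value is flagged, which agrees with the box plot.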
The stem and leaf plot shows the shape of the distribution. The values cluster below 10,000 cases per day, a range that holds 62 data points. Values from 50,000 to just under 90,000 cases form a second cluster, as shown in figure 5. This pattern shows that the distribution is positively skewed.
Figure 5: Stem and Leaf Plot
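For readers who have not built one before, a stem and leaf plot splits each value into a leading part (the stem) and a final digit (the leaf). The sketch below is a minimal illustration for non-negative whole numbers; the function `stem_and_leaf`, its `leaf_unit` parameter, and the sample values are all hypothetical and are not the data behind figure 5.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Group values into {stem: [leaves]}, one leaf digit per value."""
    groups = defaultdict(list)
    for value in sorted(values):
        scaled = int(value // leaf_unit)   # drop digits smaller than the leaf unit
        stem, leaf = divmod(scaled, 10)    # last remaining digit becomes the leaf
        groups[stem].append(leaf)
    return dict(groups)

# Illustrative right-skewed values (NOT the COVID-19 data).
sample = [12, 15, 17, 21, 24, 28, 29, 33, 35, 47, 58, 83]
plot = stem_and_leaf(sample)

# Print every stem in range, including empty ones, so gaps stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

In this illustrative sample the rows get shorter and sparser as the stems grow, which is how a long right tail appears in such a plot.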
A normal Q-Q plot displays a straight line that represents the values expected under a normal distribution. The observed incidence values deviate distinctly from that line, indicating a positively skewed distribution, as shown in figure 6.
Figure 6: Normal Q-Q Plot
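The points of a normal Q-Q plot can be sketched with Python's standard library. This is an illustration under stated assumptions, not the method used to draw figure 6: the function `normal_qq_points` is hypothetical, the plotting position (i - 0.5) / n is only one of several common conventions, and the sample values are made up to be right-skewed.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Pair each sorted observation with the value expected under normality."""
    ordered = sorted(values)
    n = len(ordered)
    # Normal distribution with the same mean and standard deviation as the data.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Assumed plotting position: (i - 0.5) / n for the i-th smallest value.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, ordered))

# Illustrative right-skewed values (NOT the COVID-19 data).
sample = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]
points = normal_qq_points(sample)

print("  expected  observed")
for expected, observed in points:
    print(f"{expected:10.2f}  {observed:8d}")
```

If the data were normal, each observed value would sit close to its expected value. In this skewed sample the largest observation lies well above what normality predicts, so the points bend away from the straight line.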
In a nutshell, exploring data summarizes the distribution, identifies outliers, and checks normality. These procedures confirm that the distribution of the given dataset is positively skewed.