Basan Shrestha's Diary: Exploring Data Across Groups

Exploring Data Across Groups

Exploring data is a kind of data health check: it describes the data and shows whether outliers distort the distribution, either for the whole dataset or across groups. Group here means disaggregation of data by certain categories. Parametric inferential statistics assume a normally distributed dataset, so the check matters before applying them. Several summary functions and graphical displays are used to explore data. I illustrate exploratory measures using the daily database on COVID-19 incidence worldwide as of April 29, 2020, which I downloaded from the European Centre for Disease Prevention and Control website. For more clarity on describing and exploring quantitative datasets, this article should be read along with my previous two articles.

Description

Descriptive statistics measure central tendency, dispersion, and distribution. The mean is an important measure of central tendency that numerically describes data. The mean number of persons infected per day ranged from 14,912 (America) to 106.91 (Oceania). Standard deviation (SD) is a measure of dispersion that quantifies how far data values typically deviate from the mean. The SD also varied widely across the continents, from 17,304.49 for America to 161.75 for Oceania. For a normal distribution, about 95% of data values fall within two standard deviations of the mean. Applying that condition, the lower boundaries were below zero for all continents, and the upper boundaries ranged from 49,520 (America) to 430 (Oceania). Likewise, for a normal distribution, outliers fall beyond 2.68 standard deviations from the mean. For the given dataset, the lower outlier boundaries were below zero for all continents, and the upper outlier boundaries ranged from 61,288 (America) to 540 (Oceania). Since the number of persons infected cannot be negative, the negative lower boundaries already suggest that the data are not normally distributed. These values are given in Table 1.
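The boundaries quoted above follow from simple arithmetic on the mean and SD. The sketch below reproduces them for America and Oceania from the figures reported in the text; the helper name `sd_bounds` and the small `sample` list are only illustrative assumptions and are not part of the original analysis.

```python
from statistics import mean, stdev

def sd_bounds(m, s, k):
    """Return (lower, upper) = mean minus/plus k standard deviations."""
    return m - k * s, m + k * s

# Mean and SD per continent as reported in the text (persons infected per day).
reported = {
    "America": (14912.0, 17304.49),
    "Oceania": (106.91, 161.75),
}

for continent, (m, s) in reported.items():
    low95, high95 = sd_bounds(m, s, 2)         # about 95% of a normal distribution
    low_out, high_out = sd_bounds(m, s, 2.68)  # outlier cut-offs used in the text
    print(f"{continent}: 2 SD = ({low95:.2f}, {high95:.2f}), "
          f"2.68 SD = ({low_out:.2f}, {high_out:.2f})")

# With raw data, the two inputs come straight from the values.
# The list below is hypothetical, not taken from the ECDC file.
sample = [1, 3, 8, 20, 45, 90, 150, 400]
print(sd_bounds(mean(sample), stdev(sample), 2))
```

Note that `stdev` uses the sample (n - 1) formula, which is what statistical packages usually report.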

Skewness measures the asymmetry of the distribution of data values. A normal distribution has zero skewness. For the given dataset, the skewness value ranged from positive 1.991 (Oceania) to positive 0.191 (Europe). The positive values indicate that the distribution was positively skewed, with a right tail, for all continents. The standard error of skewness varied from 0.302 (Africa) to 0.231 (Asia). Since all continents except Europe had a skewness value more than two times the standard error, their distributions were asymmetric. In terms of skewness, Europe had a distribution close to normal.
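As a minimal sketch of the rule of thumb above, the code below computes a bias-adjusted sample skewness and its standard error, then flags the distribution as asymmetric when the skewness exceeds twice the standard error. I am assuming the formulas commonly used by statistical packages; the function names and the `cases` list are hypothetical and are not drawn from the ECDC data.

```python
import math

def sample_skewness(values):
    """Bias-adjusted (Fisher-Pearson) sample skewness."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1

def se_skewness(n):
    """Standard error of skewness; depends only on the sample size n."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

# Hypothetical daily counts with a long right tail.
cases = [1, 1, 2, 3, 5, 8, 13, 40, 120]
skew = sample_skewness(cases)
se = se_skewness(len(cases))
print(f"skewness={skew:.3f}, SE={se:.3f}, asymmetric={abs(skew) > 2 * se}")
```

Because the standard error depends only on the number of observations, continents with fewer daily records have larger standard errors.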

Kurtosis is another measure of the distribution of data values. A normal distribution has a kurtosis value of zero when kurtosis is expressed as excess over the normal distribution, as it is here. For the given dataset, the kurtosis value ranged from positive 3.337 (Oceania) to negative 0.514 (Africa). The standard error of kurtosis ranged from 0.595 (Africa) to 0.459 (Asia). A positive value indicates that the distribution is more peaked than normal, with the data values closely clustered around the center of the distribution; this is also referred to as a leptokurtic distribution. A negative value indicates that the distribution is flatter than normal, with the data values less clustered around the center of the distribution; this is also referred to as a platykurtic distribution. The kurtosis value was more than two times the standard error for Europe and Oceania, so their distributions departed further from normal than those of the other three continents.
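The same comparison can be sketched for kurtosis. The code below computes a bias-adjusted excess kurtosis (zero for a normal distribution) and its standard error, again assuming the formulas commonly used by statistical packages. The function names and the `cases` list are hypothetical.

```python
import math

def excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - m) ** 4 for x in values) / n  # fourth central moment
    g2 = m4 / m2 ** 2 - 3
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

def se_kurtosis(n):
    """Standard error of kurtosis; depends only on the sample size n."""
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

# Hypothetical daily counts; a positive result means leptokurtic,
# a negative result means platykurtic.
cases = [1, 1, 2, 3, 5, 8, 13, 40, 120]
kurt = excess_kurtosis(cases)
se = se_kurtosis(len(cases))
print(f"kurtosis={kurt:.3f}, SE={se:.3f}, beyond 2 SE={abs(kurt) > 2 * se}")
```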

Table 1: Numerical Description

Five-Number Summary

The five-number summary consists of five statistics that describe the dataset: the minimum value, the first quartile (Q1) or 25th percentile, the median or 50th percentile, the third quartile (Q3) or 75th percentile, and the maximum value. For the given dataset, all continents had a minimum of one person infected in a day, as shown in Table 2. The maximum number of persons infected in a day varied widely, from 62,037 (America) to 662 (Oceania). Asia had the highest Q1 value of 1,208.50 and Oceania had the lowest value of 4.25. The median varied from 10,460 (Europe) to 31.5 (Oceania). Q3 varied from 32,666 (America) to 133 (Oceania). The mean is greater than the median for all continents (Tables 1 and 2). A mean greater than the median indicates that the distribution is positively skewed.
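A five-number summary per group can be sketched with the Python standard library as follows. The quartile method is an assumption: packages interpolate quartiles in different ways, so the values may differ slightly from those in Table 2. The group names and counts are hypothetical.

```python
from statistics import mean, quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

# Hypothetical daily counts for two groups; not the ECDC data.
daily_cases = {
    "Group A": [1, 3, 8, 20, 45, 90, 150, 400],
    "Group B": [1, 1, 2, 5, 9, 14, 30, 75],
}

for group, values in daily_cases.items():
    low, q1, med, q3, high = five_number_summary(values)
    # A mean above the median points to a right (positive) skew.
    print(f"{group}: min={low} Q1={q1} median={med} Q3={q3} max={high} "
          f"mean>median={mean(values) > med}")
```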

In a normal distribution the median is at the center of the distribution. The distributions of the given dataset are positively skewed for all continents, as the difference between the minimum value and the median is smaller than the difference between the median and the maximum value. In a normal distribution the difference between the minimum value and Q1 is equal to the difference between Q3 and the maximum value. In the given dataset, the difference between the minimum value and Q1 is likewise smaller than the difference between Q3 and the maximum value. America had the largest difference, followed by Europe, indicating that those continents had higher positive skewness.

Table 2: Five-number summary and interpretation

Normality Tests

Normality tests determine whether a dataset follows a normal distribution. There are many numerical and visual ways to determine the normality of a distribution.

Outliers distort the distribution of a dataset away from normality. The lower outlier boundary is calculated by subtracting from Q1 one and a half times the difference between Q3 and Q1, also referred to as the interquartile range (IQR). Likewise, the upper outlier boundary is calculated by adding one and a half times the IQR to Q3. For each continent, the five highest values (above Q3) and the five lowest values (below Q1) were examined as extreme values. The highest extreme values were not repeated for any continent. In contrast, the lowest extreme value of one was repeated five times in four continents (Africa, America, Europe, and Oceania) and twice in Asia (Table 3).
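The boundary calculation described above can be sketched as follows. The quartile method is an assumption (packages differ slightly in how they interpolate quartiles), and the function names and the `cases` list are hypothetical.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) outlier boundaries: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the values that fall outside the boundaries."""
    lower, upper = iqr_fences(values, k)
    return [x for x in values if x < lower or x > upper]

# Hypothetical daily counts with one unusually high day.
cases = [1, 2, 3, 4, 5, 100]
print(iqr_fences(cases))     # lower and upper boundaries
print(find_outliers(cases))  # values flagged as outliers
```

For count data that starts at one, the lower boundary is often negative, so only high values can be flagged.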

Table 3: Extreme values

A box plot displays the five-number summary, as shown in Figure 1. Due to the wide variation in data values across continents, the box plots are not clear for all continents. However, the given dataset had statistically outlying values only in Oceania, as indicated by their record numbers in the dataset. The outliers distort the distribution away from normality.

Figure 1: Box plots

The stem and leaf plots show the shape of the distributions and reveal positive skew (Figures 2 to 4). In Africa, values cluster below 1,000 cases per day, with 51 data points. In America, values cluster below 10,000 cases per day, with 47 data points. Likewise, in Asia, values cluster below 2,000 cases per day, with 65 data points. In Europe, values cluster below 10,000 cases per day, with 42 data points, and in Oceania, values cluster below 100 cases per day, with 52 data points. These showed that the distributions were positively skewed.

Figure 2: Stem and leaf plot 1

Figure 3: Stem and leaf plot 2

Figure 4: Stem and leaf plot 3

A normal Q-Q plot displays a straight line that represents the expected values for a normal distribution. The observed incidence values deviate distinctly from that line, indicating a positively skewed distribution, as shown in Figures 5 and 6.
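The points behind a normal Q-Q plot can be sketched without any plotting library: each sorted observation is paired with the value a normal distribution with the same mean and SD would be expected to produce at that rank. The plotting position (i - 0.5) / n is one common convention and is an assumption here, as packages use slightly different ones. The function name and the `cases` list are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each sorted observation with its expected normal value."""
    data = sorted(values)
    n = len(data)
    dist = NormalDist(mean(data), stdev(data))
    # Expected value at rank i, using plotting position (i - 0.5) / n.
    expected = [dist.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, data))

# Hypothetical right-skewed daily counts.
cases = [1, 1, 2, 3, 5, 8, 13, 40, 120]
for expected, observed in qq_points(cases):
    print(f"expected={expected:8.2f}  observed={observed}")
```

If the data were normal, observed and expected values would be close at every rank. In a positively skewed sample the largest observations lie well above their expected values, which is the departure from the straight line seen in the plot.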

Figure 5: Normal Q-Q Plot 1

Figure 6: Normal Q-Q Plot 2

In a nutshell, exploring data summarizes the distribution, identifies outliers, and checks normality. These procedures confirm that the distribution of the given dataset is positively skewed across all continents. Days with fewer persons infected than the mean value are more frequent. Oceania has the most positively skewed distribution and Europe the least. These distortions in the dataset are due to several outliers.