Family Tree

About Me

Kathmandu, Bagmati Zone, Nepal
I am Basan Shrestha from Kathmandu, Nepal. I use the term 'BASAN' as 'Balancing Actions for Sustainable Agriculture and Natural Resources'. I am a Design, Monitoring & Evaluation professional. I hold 1) MSc in Regional and Rural Development Planning, Asian Institute of Technology, Thailand, 2002; 2) MSc in Statistics, Tribhuvan University (TU), Kathmandu, Nepal, 1995; and 3) MA in Sociology, TU, 1997. I have more than 10 years of professional experience in socio-economic research, monitoring and documentation on agricultural and natural resource management. I worked at Lumle Agricultural Research Centre, western Nepal, from Nov. 1997 to Dec. 2000; CARE Nepal, mid-western Nepal, from Mar. 2003 to June 2006; WTLCP in far-western Nepal from June 2006 to Jan. 2011; Training Institute for Technical Instruction (TITI) from July to Sep. 2011; UN Women Nepal from Sep. to Dec. 2011; Mercy Corps Nepal from 24 Jan. 2012 to 14 August 2016; and CAMRIS International in Nepal, commencing 1 February 2017. I have published articles to my credit.

Saturday, May 9, 2020

Exploring Quantitative Data


Exploring data is a kind of data health check to see whether there are outliers that pull the distribution away from normal. Many parametric inferential statistics assume that the data follow the normal distribution. Several summary functions and graphical displays are used to explore data. I illustrate these measures with the daily database on COVID-19 incidence worldwide as of April 29, 2020, which I downloaded from the European Centre for Disease Prevention and Control website. The dataset has 109 daily records. This article should be read along with my previous article entitled ‘Describing Quantitative Data’.

Five Number Summary

The minimum value, first quartile (Q1) or 25th percentile, median or 50th percentile, third quartile (Q3) or 75th percentile, and maximum value are the five summary numbers used to explore data. In a normal distribution the median is at the center of the distribution. In a sample from such a distribution, the difference between the minimum value and the median is therefore roughly equal to the difference between the median and the maximum value. Likewise, the difference between the minimum value and Q1 is roughly equal to the difference between Q3 and the maximum value, as shown by the black and red arrows in figure 1.

Figure 1: Five Number Summary in Normal Distribution
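As a minimal sketch, the five numbers can be computed with Python's standard library. The list `sample` is hypothetical and is not the ECDC data. The "inclusive" quartile method is also an assumption; statistical packages use slightly different quartile rules, so Q1 and Q3 can differ a little between tools.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Hypothetical, perfectly symmetric sample (not the COVID-19 dataset).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(sample)
minimum, q1, median, q3, maximum = summary

# In a symmetric sample the two halves and the two tails match.
print(median - minimum, maximum - median)  # distances from the median
print(q1 - minimum, maximum - q3)          # distances in the tails
```

For this symmetric sample both pairs of distances come out equal, which is the pattern described for the normal distribution.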






In a positively skewed distribution, the distance between the minimum value and the median is smaller than the distance between the median and the maximum value, as shown by the arrows in figure 2. Likewise, the distance between the minimum value and Q1 is smaller than the distance between Q3 and the maximum value.

Figure 2: Five Number Summary in Positively Skewed Distribution






In a negatively skewed distribution, the distance between the minimum value and the median is greater than the distance between the median and the maximum value, as shown by the black and red arrows in figure 3. Likewise, the distance between the minimum value and Q1 is greater than the distance between Q3 and the maximum value.

Figure 3: Five Number Summary in Negatively Skewed Distribution





In the given dataset, the minimum value is 1 and the maximum value is 101,728. Q1, median, and Q3 are 1,761.5, 3,907, and 66,403.5, respectively. The difference between the minimum value and the median is 3,906, which is far smaller than the difference between the median and the maximum value, 97,821. Likewise, the difference between the minimum value and Q1 is 1,760.5, far smaller than the difference between Q3 and the maximum value, 35,324.5. This is consistent with figure 2, indicating that the data are positively skewed.
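The comparison of distances can be sketched in a few lines of Python using the five numbers reported above. The function name `skew_direction` and its three labels are illustrative choices of mine, not a standard statistical test.

```python
def skew_direction(minimum, q1, median, q3, maximum):
    """Compare distances below and above the centre of a five-number summary."""
    lower_half = median - minimum   # minimum to median
    upper_half = maximum - median   # median to maximum
    lower_tail = q1 - minimum       # minimum to Q1
    upper_tail = maximum - q3       # Q3 to maximum
    if upper_half > lower_half and upper_tail > lower_tail:
        return "positive"
    if upper_half < lower_half and upper_tail < lower_tail:
        return "negative"
    return "roughly symmetric or mixed"


# Five-number summary reported for the COVID-19 daily incidence data.
reported = (1, 1761.5, 3907, 66403.5, 101728)
direction = skew_direction(*reported)

print(3907 - 1, 101728 - 3907)           # 3906 97821
print(1761.5 - 1, 101728 - 66403.5)      # 1760.5 35324.5
print(direction)                         # positive
```

This rule of thumb only reads the direction of the skew from the five numbers; it does not measure how strong the skew is.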

Normality Checks

A box plot displays the five-number summary, as shown in figure 4. The thick black line inside the box marks the median. The lower and upper hinges, or boundaries, of the box mark Q1 and Q3, respectively. The whiskers below and above the hinges extend to the smallest and largest values that are not outliers; in this dataset they reach the minimum and maximum values. A box plot also flags outliers and extreme values. Outliers are values lying more than one and a half times the interquartile range (the difference between Q3 and Q1) below Q1 or above Q3. The given dataset has no statistical outliers, as none are marked on the box plot. However, the five most extreme values above Q3 range from 86,046 to the maximum number of persons infected in a day. Likewise, the five most extreme values below Q1 range from 17 down to the minimum number of infected cases. The much longer stretch toward the highest values is consistent with the positively skewed distribution.
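The outlier rule can be checked against the reported quartiles with a short sketch. It assumes the common convention of fences placed 1.5 times the interquartile range beyond Q1 and Q3; the helper name `tukey_fences` is mine.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the lower and upper fences at k times the interquartile range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and extremes reported for the dataset.
q1, q3 = 1761.5, 66403.5
minimum, maximum = 1, 101728

lower_fence, upper_fence = tukey_fences(q1, q3)
# A value is an outlier only if it falls outside the fences.
has_outliers = minimum < lower_fence or maximum > upper_fence

print(lower_fence, upper_fence)  # -95201.5 163366.5
print(has_outliers)              # False
```

Because the interquartile range here is very wide (64,642), even the maximum of 101,728 stays inside the upper fence, which agrees with the box plot marking no outliers.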























The stem and leaf plot shows the shape of the distribution. The values cluster below 10,000 cases per day, a range that holds 62 data points. Values from 50,000 to below 90,000 cases form a second cluster, as shown in figure 5. With most values piled at the low end and a long stretch toward high counts, the distribution is positively skewed.

Figure 5: Stem and leaf plot
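A simple stem and leaf display can be sketched in Python as below. The list `daily_cases` is hypothetical and much shorter than the real dataset, and `leaf_unit` is a parameter I introduce so that each leaf stands for one thousand cases. The sketch handles non-negative values only.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return the lines of a stem and leaf display with one-digit leaves."""
    stems = defaultdict(list)
    for value in sorted(values):
        scaled = int(value // leaf_unit)   # drop digits below the leaf unit
        stem, leaf = divmod(scaled, 10)    # last digit is the leaf
        stems[stem].append(leaf)
    lines = []
    # Print empty stems too, so gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines


# Hypothetical daily counts (not the ECDC data).
daily_cases = [120, 450, 980, 1500, 2300, 2800, 3900, 5200, 7400, 9100,
               52000, 68000, 86000]
lines = stem_and_leaf(daily_cases, leaf_unit=1000)
print("\n".join(lines))
```

With a leaf unit of 1,000, stem 0 collects every value below 10,000, so a long first row followed by sparse higher stems is the picture of a positive skew.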













A normal Q-Q plot displays a straight line that represents the values expected under a normal distribution. The observed incidence values deviate distinctly from that line, indicating a positively skewed distribution, as shown in figure 6.

Figure 6: Normal Q-Q Plot
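The points behind a normal Q-Q plot can be sketched with the standard library. The plotting position (i - 0.5) / n is an assumption and only one of several common choices, so other software may place the points slightly differently. The list `sample` is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Pair each sorted observation with its expected value under normality."""
    data = sorted(values)
    n = len(data)
    # Normal distribution with the same mean and standard deviation as the data.
    fitted = NormalDist(mean(data), stdev(data))
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, data))


# Hypothetical positively skewed sample (not the ECDC data).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
points = normal_qq_points(sample)

for expected, observed in points:
    # A positive gap means the observation sits above the straight line.
    print(round(expected, 1), observed, round(observed - expected, 1))
```

In this skewed sample the smallest and largest observations sit above their expected values while the middle ones sit below, which gives the curved pattern away from the straight line.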























In a nutshell, exploring data summarizes distribution, identifies outliers, and checks normality. These procedures confirm that the distribution of the given dataset is positively skewed.
