Family Tree

About Me

Kathmandu, Bagmati Zone, Nepal
I am Basan Shrestha from Kathmandu, Nepal. I use the term 'BASAN' as 'Balancing Actions for Sustainable Agriculture and Natural Resources'. I am a Design, Monitoring & Evaluation professional. I hold 1) an MSc in Regional and Rural Development Planning, Asian Institute of Technology, Thailand, 2002; 2) an MSc in Statistics, Tribhuvan University (TU), Kathmandu, Nepal, 1995; and 3) an MA in Sociology, TU, 1997. I have more than 10 years of professional experience in socio-economic research, monitoring and documentation on agricultural and natural resource management. I worked at Lumle Agricultural Research Centre, western Nepal, from Nov. 1997 to Dec. 2000; CARE Nepal, mid-western Nepal, from Mar. 2003 to June 2006; WTLCP in far-western Nepal from June 2006 to Jan. 2011; the Training Institute for Technical Instruction (TITI) from July to Sep. 2011; UN Women Nepal from Sep. to Dec. 2011; Mercy Corps Nepal from 24 Jan. 2012 to 14 August 2016; and CAMRIS International in Nepal from 1 February 2017. I have published articles to my credit.

Saturday, May 23, 2020

Exploring Data Across Groups


Exploring data is a kind of data health check: it describes the data and shows whether outliers cause a non-normal distribution, either for the whole dataset or across groups. Group here means disaggregation of data by certain categories. Parametric inferential statistics assume a normally distributed dataset. Several summary functions and graphical displays are used to explore data. I exemplify exploratory measures from the daily database on COVID-19 incidence worldwide as of April 29, 2020, which I downloaded from the European Centre for Disease Prevention and Control website. This article should be read along with my previous two articles for more clarity on describing and exploring quantitative datasets.

Description

Descriptive statistics measure central tendency, dispersion, and distribution. The mean is an important measure of central tendency that numerically describes data. The mean number of persons infected per day ranged from 14,912 (America) to 106.91 (Oceania). Standard deviation (SD) is a measure of dispersion that quantifies how far data values vary from the mean. SD also varied widely across the continents, from 17,304.49 for America to 161.75 for Oceania. For a normal distribution, about 95% of data values fall within two standard deviations of the mean. Applying that condition, the lower boundaries were below zero for all continents, and the upper boundaries ranged from 49,520 (America) to 430 (Oceania). Since a daily count cannot be negative, a lower boundary below zero is itself a sign that the data are not normally distributed. Likewise, for a normal distribution, outliers fall beyond 2.68 standard deviations from the mean. For the given dataset, the lower outlier boundaries were below zero for all continents, and the upper outlier boundaries ranged from 61,288 (America) to 540 (Oceania). These are given in Table 1.
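
The boundary arithmetic above can be reproduced in a few lines. This is only a sketch in Python, using the means and standard deviations quoted in the paragraph for America and Oceania; the function name `sd_bounds` is my own label, not something from the software used for the analysis.

```python
# Mean and standard deviation of daily cases, as quoted in the text.
stats = {
    "America": (14912.0, 17304.49),
    "Oceania": (106.91, 161.75),
}


def sd_bounds(mean, sd, k):
    """Return (lower, upper) = mean minus/plus k standard deviations."""
    return mean - k * sd, mean + k * sd


for continent, (mean, sd) in stats.items():
    low_95, high_95 = sd_bounds(mean, sd, 2.0)     # about 95% of a normal distribution
    low_out, high_out = sd_bounds(mean, sd, 2.68)  # outlier boundary used in the text
    print(f"{continent}: 95% range {low_95:.0f} to {high_95:.0f}; "
          f"outlier boundaries {low_out:.0f} and {high_out:.0f}")
```

The upper values agree with the figures in the text to within rounding, and both lower values come out negative.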

Skewness measures the asymmetry of the distribution of data values. For the given dataset, the skewness value ranged from positive 1.991 (Oceania) to positive 0.191 (Europe). A normal distribution has zero skewness. The positive values indicate that the distribution was positively skewed, with a right tail, for all continents. The standard error of skewness varied from 0.302 (Africa) to 0.231 (Asia). Since four continents, all except Europe, had a skewness value more than two times the standard error, their distributions were asymmetric. Europe had a distribution close to symmetric.

Kurtosis is another measure of the distribution of data values. For the given dataset, the kurtosis value ranged from positive 3.337 (Oceania) to negative 0.514 (Africa). A normal distribution has a kurtosis value of zero. The standard error of kurtosis ranged from 0.595 (Africa) to 0.459 (Asia). A positive value indicates that the distribution is more peaked than normal, with the data values closely clustered around the center of the distribution, which is also referred to as a leptokurtic distribution. A negative value indicates that the distribution is flatter than normal, with the data values less clustered around the center of the distribution, which is also referred to as a platykurtic distribution. The kurtosis value was more than two times the standard error for Europe and Oceania, so their distributions departed further from normality than those of the other three continents.
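
For readers who want to compute these two statistics themselves, here is a minimal sketch. I am assuming the bias-adjusted sample formulas that statistical packages commonly report, under which a normal distribution scores zero on both; the list `daily_cases` is a made-up example, not the ECDC data.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Bias-adjusted sample skewness (zero for a symmetric sample)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (zero expected for a normal distribution)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    z4 = sum(((x - m) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


# Hypothetical daily counts with one very large day (a long right tail).
daily_cases = [1, 2, 2, 3, 4, 5, 8, 40]
print(sample_skewness(daily_cases))         # positive: right tail
print(sample_excess_kurtosis(daily_cases))  # positive: more peaked than normal
```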

Table 1: Numerical Description
















Five-Number Summary

The five-number summary consists of five statistics that describe the dataset: the minimum value, the first quartile (Q1) or 25th percentile, the median or 50th percentile, the third quartile (Q3) or 75th percentile, and the maximum value. For the given dataset, all continents had a minimum of one person infected in a day, as shown in Table 2. The maximum number of persons infected in a day varied widely, from 62,037 (America) to 662 (Oceania). Asia had the highest Q1 value of 1,208.50 and Oceania had the lowest value of 4.25. The median varied from 10,460 in Europe to 31.5 in Oceania. Q3 varied from 32,666 (America) to 133 (Oceania). The mean is greater than the median for all continents (Tables 1 and 2). A mean greater than the median indicates that the distribution is positively skewed.

In a normal distribution the median is at the center of the distribution. The distributions of the given datasets are positively skewed for all continents, as the difference between the minimum value and the median is smaller than the difference between the median and the maximum value. In a normal distribution the difference between the minimum value and Q1 is also equal to the difference between Q3 and the maximum value. In the given datasets, the difference between the minimum value and Q1 is smaller than the difference between Q3 and the maximum value. America had the largest such gap, followed by Europe, although these gaps are in absolute case counts and so also reflect the larger scale of incidence on those continents.
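
A five-number summary can be obtained with Python's standard library, as sketched below. The data are hypothetical, and quartile conventions differ between packages, so values from `statistics.quantiles` may not match Table 2 to the last decimal.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, med, q3 = quantiles(values, n=4)  # default 'exclusive' quartile method
    return min(values), q1, med, q3, max(values)


daily_cases = [1, 2, 3, 4, 5, 6, 7]  # hypothetical example, not the ECDC data
print(five_number_summary(daily_cases))  # (1, 2.0, 4.0, 6.0, 7)
```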

Table 2: Five-number summary and interpretation









Normality Tests

Normality tests determine whether a dataset follows a normal distribution. There are many numerical and visual ways to determine the normality of a distribution.

Outliers distort the distribution of a dataset away from normality. The lower outlier boundary is calculated by subtracting from Q1 one and a half times the difference between Q3 and Q1, also referred to as the inter-quartile range (IQR). Likewise, the upper outlier boundary is calculated by adding to Q3 one and a half times the IQR. The given dataset had five extreme values above Q3 and five below Q1 for each continent. The highest extreme values were not repeated for any continent. In contrast, the lowest extreme value of one was repeated five times in four continents (Africa, America, Europe, and Oceania) and twice in Asia (Table 3).
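
As a worked example of the boundary calculation, the sketch below uses the Oceania quartiles quoted in the five-number summary (Q1 of 4.25 and Q3 of 133). The function name `iqr_fences` is my own label.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier boundaries: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_fences(4.25, 133.0)   # Oceania quartiles from the text
print(lower, upper)                      # -188.875 326.125

oceania_max = 662                        # maximum daily cases in Oceania
print(oceania_max > upper)               # True: the maximum is an outlier
```

The Oceania maximum of 662 lies above the upper boundary of about 326, so at least that day is an outlier by this rule.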

Table 3: Extreme values











A box plot displays the five-number summary for each continent, as shown in figure 1. Due to the wide variation in data values across continents, the box plots are not clear for all continents. However, the given dataset had statistically outlying values only in Oceania, which the plot marks with their record numbers in the dataset. Outliers distort the distribution away from normality.

Figure 1: Box plots























The stem and leaf plots show the shape of the distribution and reveal positive skew. In Africa, values cluster below 1,000 cases per day, with 51 data points. In America, values cluster below 10,000 cases per day, with 47 data points (Figures 2 to 4). Likewise, values cluster below 2,000 cases per day in Asia with 65 data points, below 10,000 cases per day in Europe with 42 data points, and below 100 cases per day in Oceania with 52 data points. These show that the distributions were positively skewed.

Figure 2: Stem and leaf plot 1









Figure 3: Stem and leaf plot 2













Figure 4: Stem and leaf plot 3


A normal Q-Q plot displays the straight line that represents expected values for normal distribution. The observed incidence values deviate distinctly from that line indicating positively skewed distribution as shown in Figures 5 and 6.

Figure 5: Normal Q-Q Plot 1










Figure 6: Normal Q-Q Plot 2














In a nutshell, exploring data summarizes distribution, identifies outliers, and checks normality. These procedures confirm that the distribution of the given dataset is positively skewed across all continents. Days with fewer infected persons than the mean are more frequent than days above it. Oceania has the most positively skewed distribution and Europe the least. These distortions in the dataset are due to several outliers.



Wednesday, May 20, 2020

Division Sign and Reality

Different symbols have different meanings although the meaning could be contextual and vary. Here, I want to discuss a divide symbol, which I guess most of us have been using since childhood. Divide symbol is usually used to indicate that a dividend or a numerator is divided by a divisor or denominator to get a quotient with or without remainder.

Reality does not flow like the horizontal line in the division sign. Ratan Tata said, "....a straight line even in an ECG means we are not alive." There are several ups and downs. Thus, the two dots above and below the straight line in the division sign symbolize the ups and downs in a life, which, as Tata said, keep us living.

Cross Mark and Laws of Demand and Supply

Different symbols have different meanings although the meaning could be contextual and vary. Here, I want to discuss a cross mark symbol, which I guess most of us have been using since childhood. A cross mark is usually used to indicate that an answer is wrong or that the response option at which the cross mark is placed is not applicable to the person cross marking.

A cross mark gave me another meaning also. If I place a cross mark on a graph, it gives me a meaning of the laws of demand and supply, although the cross mark may not exactly represent the demand and supply curves. The market determines the price of a good or service based on its demand and supply. Price is on the Y-axis and quantity demanded or supplied is on the X-axis of the chart.

The law of demand states that demand will be less at a higher price of a good or service, other things remaining constant. The demand curve is symbolized by the line of a cross mark that descends from left to right.

The law of supply states that supply will be more as price increases, other things remaining constant. The supply curve is symbolized by the line of a cross mark that ascends from left to right.

These two laws interact at an equilibrium price, a meeting point of demand and supply curves in a cross mark, at which the seller can sell the quantity s/he wants to sell and the buyer can buy the quantity s/he wants to buy. Thus, a cross mark symbolizes the laws of demand and supply.
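
To make the meeting point concrete, here is a small sketch with straight-line curves. The linear forms and all the numbers are assumptions chosen only for illustration; real demand and supply curves need not be straight lines.

```python
def equilibrium(a, b, c, d):
    """Solve a - b*P = c + d*P for the price P and return (price, quantity).

    Demand (descending line): Qd = a - b*P
    Supply (ascending line):  Qs = c + d*P
    """
    price = (a - c) / (b + d)
    quantity = a - b * price
    return price, quantity


# Hypothetical coefficients: demand Qd = 100 - 2P, supply Qs = 10 + 1P.
price, quantity = equilibrium(a=100, b=2, c=10, d=1)
print(price, quantity)  # 30.0 40.0
```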

Tick Mark and Resilience

Different symbols have different meanings although the meaning could be contextual and vary. Here, I want to discuss a tick mark symbol, which I guess most of us have been using since childhood. A tick mark is usually used to indicate that an answer is correct or that the response option at which the tick mark is placed is applicable to the person tick marking.
A tick mark gave me another meaning also. If I place a tick mark on a graph, it gives me a meaning of resilience. The term resilience has been a buzzword in academia and development: community resilience, household resilience, resilience to climate change, and so on and so forth. An entity is considered resilient if it bounces back, or has the capacity to bounce back, after being hit by a shock or stress. Once an entity is hit by a shock or stress, its current level of capacity could come down from the left tip of a tick mark to its lowest point. But as time elapses and the context improves, the capacity of the entity could gradually increase from the lowest point of the tick mark to its highest point. It could increase further, as there is no hard and fast limit at the tip of a tick mark. The resilient capacity could vary from one entity to another. Some may bounce back quickly and others may take a long time. However, in the natural process, an entity could face a shock or stress and come back to its previous level or a better situation, like a tick mark. Thus, I wish everyone to be like a tick mark.


Saturday, May 9, 2020

Exploring Quantitative Data


Exploring data is a kind of data health check to see whether there are outliers that cause a non-normal distribution. Parametric inferential statistics assume a normal distribution. Several summary functions and graphical displays are used to explore data. I exemplify these measures from the daily database on COVID-19 incidence worldwide as of April 29, 2020, which I downloaded from the European Centre for Disease Prevention and Control website. The dataset has 109 daily records. This article should be read along with my previous article entitled ‘Describing Quantitative Data’.

Five Number Summary

Minimum value, first quartile (Q1) or 25th percentile, median or 50th percentile, third quartile (Q3) or 75th percentile and maximum value are five summary numbers used to explore data. In a normal distribution the median is at the center of the distribution. Besides, the difference between the minimum value and median is equal to the difference between the median and maximum value. Likewise, the difference between the minimum value and Q1 is equal to the difference between Q3 and maximum value as shown by black and red arrows in figure 1.

Figure 1: Five Number Summary in Normal Distribution






In positively skewed distribution distance between minimum value and median is smaller than the distance between median and maximum value as shown by arrows in figure 2. Likewise, the distance between the minimum value and Q1 is smaller than the distance between Q3 and maximum value.

Figure 2: Five Number Summary in Positively Skewed Distribution






In negatively skewed distribution the distance between minimum value and median is greater than the distance between median and maximum value as shown by black and red arrows in figure 3. Likewise, the distance between the minimum value and Q1 is greater than the distance between Q3 and maximum value.

Figure 3: Five Number Summary in Negatively Skewed Distribution





In the given dataset, the minimum value is one and the maximum value is 101,728. Q1, median, and Q3 are respectively 1761.5, 3907, and 66403.5. The difference between minimum value and median is equal to 3906 which is far smaller than the difference between the median and maximum value equal to 97,821.  This is consistent with figure 2 indicating that data is positively skewed.
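
The comparison of distances can be written out as a short check. The five numbers are the ones quoted above; the function name `tail_gaps` is my own.

```python
def tail_gaps(minimum, q1, median, q3, maximum):
    """Distances on the lower and upper side of the five-number summary."""
    return {
        "min_to_median": median - minimum,
        "median_to_max": maximum - median,
        "min_to_q1": q1 - minimum,
        "q3_to_max": maximum - q3,
    }


gaps = tail_gaps(1, 1761.5, 3907, 66403.5, 101728)
print(gaps)

# Both upper-side distances exceed the lower-side ones: a right (positive) tail.
print(gaps["median_to_max"] > gaps["min_to_median"])  # True
print(gaps["q3_to_max"] > gaps["min_to_q1"])          # True
```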

Normality Checks

A box plot displays the five-number summary as shown in figure 4. The thick black line inside the box symbolizes the median. The lower and upper hinges, or boundaries, of the box symbolize Q1 and Q3, respectively. When there are no outliers, the whiskers below and above the hinges extend to the minimum and maximum values. A box plot also shows outliers and extreme values. Outliers are values below Q1 minus one and a half times the inter-quartile range (IQR, the difference between Q3 and Q1) or above Q3 plus one and a half times the IQR. With Q1 of 1,761.5 and Q3 of 66,403.5, the IQR is 64,642, so the boundaries are minus 95,201.5 and plus 163,366.5, and every daily value from 1 to 101,728 lies between them. The given dataset therefore does not have statistically outlying values, and none are flagged on the box plot. However, there are five extreme values above Q3, from 86,046 to the maximum number of persons infected in a day, and five extreme values below Q1, from 17 down to the minimum number of infected cases. Those values have led to the skewed distribution.























The stem and leaf plot shows the shape of the distribution and reveals positive skew: values cluster below 10,000 cases per day, with 62 data points. Values from 50,000 to below 90,000 cases also form a cluster, as shown in figure 5. This shows that the distribution is positively skewed.
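
For readers unfamiliar with the display, the sketch below builds a simple stem and leaf plot in Python. It is an illustration on made-up non-negative numbers, not the routine that produced figure 5, and the `leaf_unit` parameter is my own naming.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return the text lines of a stem and leaf display for non-negative values.

    Each value is truncated to a whole number of leaf units; the last digit
    of that number is the leaf and the leading digits form the stem.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v // leaf_unit), 10)
        groups[stem].append(leaf)
    lines = []
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines


# Hypothetical daily counts with a long right tail.
for line in stem_and_leaf([3, 7, 12, 15, 18, 21, 24, 29, 35, 62]):
    print(line)
```

Most leaves pile up on the low stems while a single value sits far out on stem 6, which is the visual signature of positive skew.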

Figure 5: Stem and leaf plot













A normal Q-Q plot displays the straight line that represents expected values for normal distribution. The observed incidence values deviate distinctly from that line indicating positively skewed distribution as shown in figure 6.
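
The points behind a normal Q-Q plot can be sketched as follows. The plotting position (i - 0.5)/n is one common convention and is an assumption here; statistical packages may use a slightly different one. The data are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Pair each sorted observation with its expected value under normality."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # (i - 0.5) / n is one common choice of plotting position.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, data))


# Hypothetical right-skewed daily counts.
points = normal_qq_points([1, 2, 2, 3, 4, 5, 8, 40])
for expected, observed in points:
    print(f"expected {expected:8.2f}   observed {observed:3d}")
```

If the data were normal, observed and expected values would be close at every point. In this right-skewed example the smallest and largest observations lie above their expected values while the middle ones lie below, which bends the points away from the straight line.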

Figure 6: Normal Q-Q Plot























In a nutshell, exploring data summarizes distribution, identifies outliers, and checks normality. These procedures confirm that the distribution of the given dataset is positively skewed.

Tuesday, May 5, 2020

Descriptive Measures of Quantitative Data

Descriptive statistics measure central tendency, dispersion, and distribution. I exemplify these measures from the daily database on COVID-19 incidence worldwide as of April 29, 2020, which I downloaded from the European Centre for Disease Prevention and Control website. The dataset has 109 daily records.

Measure of Central Tendency

The mean incidence per day is 28,016.6 persons. The median, or second quartile, is 3,907 persons and the mode is 17 persons infected in a day. Mean, median, and mode are equal for a normal distribution. Both the median and the mode are less than the mean, indicating that the dataset departs from normality.
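
The three measures are available in Python's standard `statistics` module. The numbers below are hypothetical and only show the same ordering of mode, median, and mean that appears in a right-skewed dataset.

```python
from statistics import mean, median, mode

daily_cases = [2, 3, 3, 5, 8, 13, 90]  # hypothetical, with one very large day

print(mean(daily_cases))    # pulled upward by the large value
print(median(daily_cases))  # 5
print(mode(daily_cases))    # 3

# Mode < median < mean is the typical ordering for a positively skewed dataset.
print(mode(daily_cases) < median(daily_cases) < mean(daily_cases))  # True
```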

Measure of Dispersion

The range of 101,727 is very wide, from one person to a maximum of 101,728 persons infected on a single day worldwide. The standard deviation is 34,079.4 persons. For a normal distribution, about 95 percent of the data values fall within two standard deviations of the mean. For this dataset, that interval runs from minus 40,142.2 to plus 96,175.4. Since a daily count cannot be negative, the negative lower limit is a further sign that the data are not normally distributed.























Measures of Distribution

The skewness value is positive 0.73. A normal distribution has zero skewness. The standard error of skewness is 0.23. The positive value indicates that the distribution is positively skewed with a right tail. Since the skewness value is more than two times its standard error, the distribution is asymmetric.

The kurtosis value is negative 1.24. A normal distribution has a kurtosis value of zero. The standard error of kurtosis is 0.45. The negative value indicates that the distribution is flatter than normal, with the data values less clustered around the center of the distribution, which is also referred to as a platykurtic distribution. Since the absolute kurtosis value is more than two times its standard error, the distribution departs from normality.
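
The two standard errors depend only on the number of records. The sketch below uses the usual large-package formulas, which is my assumption about how the reported values were obtained; with 109 records they give roughly 0.23 and 0.46, in line with the values quoted above.

```python
from math import sqrt


def se_skewness(n):
    """Standard error of sample skewness for n observations."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of sample excess kurtosis for n observations."""
    return 2 * se_skewness(n) * sqrt((n * n - 1) / ((n - 3) * (n + 5)))


n = 109  # daily records in the dataset
ses, sek = se_skewness(n), se_kurtosis(n)
print(round(ses, 3), round(sek, 3))

# Rule of thumb from the text: a statistic beyond two standard errors is notable.
print(0.73 / ses > 2)        # skewness
print(abs(-1.24) / sek > 2)  # kurtosis
```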

In a nutshell, descriptive statistics measure the central tendency, variability, and distribution of data. The given dataset is positively skewed and flatter than a normal distribution.

Saturday, May 2, 2020

Frequency, Tabulation and Graphical Presentation

Frequency is a summary measure that describes both categorical and quantitative data. It counts the number of times a category or a data value is repeated. The summary data can be presented in tables and graphs. I exemplify from the database on COVID-19 incidence and human death worldwide as of April 28, 2020, which I downloaded from the European Centre for Disease Prevention and Control website.

Categorical Data
I was curious to know how many countries in different continents or regions had reported deaths. I found that 169 out of 206 countries had deaths and 37 did not, as shown in Table 1. Most European countries had deaths and most African countries did not have deaths.
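
Counting categories like this is a one-liner with `collections.Counter`. The records below are a tiny made-up sample with assumed field meanings (continent, whether deaths were reported), not the ECDC country list.

```python
from collections import Counter

# Hypothetical (continent, had_deaths) records, one per country.
records = [
    ("Africa", False), ("Africa", True),
    ("Asia", True),
    ("Europe", True), ("Europe", True),
    ("Oceania", False),
]

# Frequency of each category overall.
overall = Counter("Deaths" if had_deaths else "No deaths"
                  for _, had_deaths in records)
print(overall)

# Cross-tabulation: frequency of each (continent, had_deaths) pair.
by_continent = Counter(records)
for (continent, had_deaths), count in sorted(by_continent.items()):
    print(continent, "deaths" if had_deaths else "no deaths", count)
```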

Table 1: Countries Reporting Death Due to COVID-19



















The above data can be visualized in a bar chart for better presentation as shown in Figure 1.











Quantitative Data

I was curious to know how many countries fell into each class interval of deaths due to COVID-19. I found that most countries reporting deaths (161 out of 169) had deaths within the class interval of one to less than 5,000. It was followed by three countries that had deaths in the interval of 5,000 to less than 10,000. One country had deaths within the range of 55,000 to less than 60,000 (Table 2).
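
Grouping values into class intervals of equal width can be sketched as below. The death counts are hypothetical, and unlike Table 2 the first interval here starts at zero rather than one, which makes no difference for countries that reported at least one death.

```python
from collections import Counter


def class_interval(value, width=5000):
    """Return the (lower, upper) class interval containing value; upper is excluded."""
    lower = (value // width) * width
    return lower, lower + width


# Hypothetical death counts for six countries.
deaths_by_country = [12, 340, 4999, 5000, 7200, 56000]

table = Counter(class_interval(d) for d in deaths_by_country)
for (lower, upper), count in sorted(table.items()):
    print(f"{lower:>6,} to less than {upper:>6,}: {count}")
```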

Table 2: Country and Death Due to COVID-19


















The above data can be presented in a histogram as shown in Figure 2. The data show that the distribution of deaths across countries is positively skewed.








I was curious to know how many persons died in each month. I found that 82 percent of the total 209,776 deaths occurred in April 2020 alone. The number of deaths appears to be increasing rapidly month by month. It could be because people who fell ill in the initial months started to die.

Table 3: Death by Month in 2020
The above data can also be visualized in the time series line chart in Figure 3.









In a nutshell, frequency is a process of summarizing data, which can be presented in tabular and graphical forms.