Descriptive statistics using R
Descriptive statistics are useful for making summary statements about the values of a variable. These summary statements are often more convenient to share than the original data values. This article describes calculations and scripts for applying descriptive statistics to your own data. For example, if you're about to make a presentation, this article will help you create summary statements and graphs rather than displaying every data value.
The statistics calculations are intended for data with discrete values and a normal distribution. This approach keeps the article short and practical. Readers are advised to use a textbook for coverage of continuous values and specialized distributions.
R is free open source software for statistical computing and graphics, which is usually installed on a personal computer. In this article, we will use an online toolbox which has adapted the R software for use within a web browser. There is no need for software download or installation.
The examples in this article use data from the UK Premier League during the 2010-11 football season. Specifically, we will use the number of points earned by each of the 20 football teams. The values are displayed below in alphabetical order of the team, starting with 68 points for Arsenal.
68,48,39,43,39,46,71,54,49,58,80,71,46,46,47,62,47,33,42,40
Central tendency
Mean, median and mode are 3 common measures of central tendency, which refers to the location (middle or center) where we typically find most of the values for a variable.
Mean is the sum of all values divided by the count of values for a variable. This is the most commonly used measure of central tendency and is often referred to as the average value.
The text box below is used to write the R commands, which collectively are called a script. The script displayed below contains 2 commands.
The script is short because R has a pre-defined command to calculate the mean. The first line of the script defines the values for the variable. All values must be included between the pair of parentheses and each value must be separated by a comma. Re-use the script by replacing the sample data with your own data.
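A minimal two-command sketch consistent with the description above. The variable name `team_points` is an assumption chosen for illustration; `c()` and `mean()` are standard R commands.

```r
# Points earned by each team, in alphabetical order of the team
team_points <- c(68,48,39,43,39,46,71,54,49,58,80,71,46,46,47,62,47,33,42,40)
# Sum of all values divided by the count of values
mean(team_points)  # 51.45
```

The 20 values add up to 1029, so the mean is 1029 / 20 = 51.45.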
Median is the middle value of a variable once it has been sorted. If there is an even number of values, then the middle 2 values are added and divided by 2 to calculate the median.
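As a sketch, the median can be calculated with R's pre-defined `median()` command, which sorts the values internally. The variable name `team_points` is an assumption for illustration; the `sort()` call is included only to show the middle values.

```r
team_points <- c(68,48,39,43,39,46,71,54,49,58,80,71,46,46,47,62,47,33,42,40)
# With 20 values, the middle 2 are the 10th and 11th sorted values
sort(team_points)[10:11]  # 47 47
# (47 + 47) / 2
median(team_points)       # 47
```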
Mode is the most frequently occurring value for a particular variable. The advantage of this measurement is that it can be used with any type of data. The script is slightly longer because R does not have a pre-defined command to calculate the statistical mode (the built-in mode command reports how a value is stored, not its most frequent value).
Please note that if more than one value ties for most frequent, there is no single mode and it is treated as undefined here. For example, if we replace the value 48 with 47, then 46 and 47 would each occur 3 times and the output would be NA to indicate an undefined mode.
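One possible way to write such a script is sketched below. It is not necessarily the original script: the function name `single_mode` and the variable names are hypothetical, and returning NA for ties simply follows the convention described above.

```r
team_points <- c(68,48,39,43,39,46,71,54,49,58,80,71,46,46,47,62,47,33,42,40)

# Hypothetical helper: returns the single most frequent value, or NA on a tie
single_mode <- function(x) {
  counts <- table(x)                                     # frequency of each distinct value
  most_frequent <- names(counts)[counts == max(counts)]  # value(s) with the highest count
  if (length(most_frequent) == 1) as.numeric(most_frequent) else NA
}

single_mode(team_points)  # 46, which occurs 3 times

# Replacing 48 with 47 makes 46 and 47 tie at 3 occurrences each
tied_points <- replace(team_points, team_points == 48, 47)
single_mode(tied_points)  # NA
```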
Variation
Another area of interest when summarizing data is the amount of variation between all values for a variable.
Starting with the simplest, range is the difference between the highest and lowest values.
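A short sketch of the range calculation, assuming the variable name `team_points`. Note that R's `range()` command returns the lowest and highest values rather than their difference, so `diff()` is used to subtract one from the other.

```r
team_points <- c(68,48,39,43,39,46,71,54,49,58,80,71,46,46,47,62,47,33,42,40)
# Lowest and highest values
range(team_points)        # 33 80
# Difference between the highest and lowest values
diff(range(team_points))  # 47
```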
After sorting the values from lowest to highest, they can be divided into 4 segments of equal size. The first quartile is the value at 25%, the second quartile is the value at 50%, the third quartile is the value at 75% and the fourth quartile is the value at 100%. The second quartile is the same as the median value. If the boundary of a quartile does not fall on a specific value, then the nearest 2 adjacent values are added and divided by 2 to calculate the quartile value.
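The quartiles can be sketched with R's `quantile()` command. The choice of `type = 2` here is an assumption: it averages the 2 adjacent values when a boundary falls between them, which matches the rule described above. R's default method (type 7) interpolates instead and gives slightly different first and third quartiles for this data.

```r
team_points <- c(68,48,39,43,39,46,71,54,49,58,80,71,46,46,47,62,47,33,42,40)
# Minimum, first, second (median), third and fourth quartiles
quartiles <- quantile(team_points, type = 2)
quartiles  # 0%: 33, 25%: 42.5, 50%: 47, 75%: 60, 100%: 80

# R's default interpolation method, shown for comparison
quantile(team_points)  # 25%: 42.75, 75%: 59
```

For example, the first quartile boundary falls between the 5th and 6th sorted values (42 and 43), so the averaged result is 42.5.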
Standard deviation is a measure of the spread of values for a variable. It is the square root of another statistic called variance, which is based on the squared difference between the mean and each value of the variable. This is an important measure of variation because a higher standard deviation indicates a greater amount of variation in the values.
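A minimal sketch, assuming the variable name `team_points`. R's pre-defined `var()` and `sd()` commands calculate the sample variance and sample standard deviation, which divide by the count of values minus 1. The manual calculation is included only to show the relationship between the two statistics.

```r
team_points <- c(68,48,39,43,39,46,71,54,49,58,80,71,46,46,47,62,47,33,42,40)

# Manual calculation: squared differences from the mean, divided by (n - 1)
squared_diffs <- (team_points - mean(team_points))^2
manual_variance <- sum(squared_diffs) / (length(team_points) - 1)
manual_sd <- sqrt(manual_variance)

# Pre-defined commands give the same results
var(team_points)  # approximately 163.31
sd(team_points)   # approximately 12.78
```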
A common method to visualize the concept of standard deviation is to create a graph which shows the range of possible values on the x axis and the probability of finding those values in the variable on the y axis.
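A graph of this kind might be sketched as follows. This is an assumption about the graph's construction, not a reproduction of it: it draws a normal distribution curve with the same mean and standard deviation as the team points, and marks 1 standard deviation on each side of the mean with red lines. The variable names are hypothetical.

```r
team_points <- c(68,48,39,43,39,46,71,54,49,58,80,71,46,46,47,62,47,33,42,40)
points_mean <- mean(team_points)
points_sd <- sd(team_points)

# x axis: possible values, from 3 standard deviations below to 3 above the mean
x_values <- seq(points_mean - 3 * points_sd, points_mean + 3 * points_sd, length.out = 200)
# y axis: density of a normal distribution with the same mean and standard deviation
y_values <- dnorm(x_values, mean = points_mean, sd = points_sd)

plot(x_values, y_values, type = "l", xlab = "Points", ylab = "Density")
# Red lines at 1 standard deviation below and above the mean
abline(v = c(points_mean - points_sd, points_mean + points_sd), col = "red")
```

With this data the red lines fall at approximately 38.67 and 64.23 points.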
The z-score represents the difference between a test value and the mean for a variable, where the difference is expressed in units of the standard deviation. For example, the graph above suggests that 1 positive standard deviation (red line to the right side) lies between the values 60 and 70. Looking at the team points, the nearest value is 62. The calculation below will show that the z-score for the value 62 is near to positive 1.
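A minimal sketch of the z-score calculation, assuming the variable names `team_points` and `test_value`: subtract the mean from the test value, then divide by the standard deviation.

```r
team_points <- c(68,48,39,43,39,46,71,54,49,58,80,71,46,46,47,62,47,33,42,40)
test_value <- 62
# Difference from the mean, expressed in units of standard deviation
z_score <- (test_value - mean(team_points)) / sd(team_points)
z_score  # approximately 0.83
```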
Change the test value to 39 and the calculation will show a z-score near to negative 1, which corresponds to 1 negative standard deviation (red line to the left side) on the graph.
http://hughesbennett.co.uk/ArticleDescriptiveStatisticsUsingR