MTH225-Exercise 0

The first step in analyzing a set of data is a cursory examination of some of its characteristics, usually including:

The mean or average, denoted by x_bar
The median or 50th percentile (the value for which 50% of the values are smaller and 50% are larger)
The minimum or smallest value
The maximum or largest value
The 25th percentile (also known as the first quartile)
The 75th percentile (also known as the third quartile)
The standard deviation (approximately the square root of the average squared deviation from x_bar)
The interquartile range (the difference between the first and third quartiles)
The number of data values
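
As a quick illustration of these definitions, the sketch below computes each characteristic for a small made-up vector. The name scores and its five values are hypothetical and are not the course data. The quartiles use R's default quantile method.

```r
# Hypothetical scores, made up for illustration (not the course data)
scores <- c(400, 450, 500, 550, 600)

mean(scores)             # mean (x_bar): 500
median(scores)           # median (50th percentile): 500
min(scores)              # minimum: 400
max(scores)              # maximum: 600
quantile(scores, 0.25)   # first quartile: 450
quantile(scores, 0.75)   # third quartile: 550
sd(scores)               # standard deviation: about 79.06
IQR(scores)              # interquartile range: 100
length(scores)           # number of data values: 5

# Remove the toy vector so it does not clutter the workspace
rm(scores)
```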

For the purpose of this assignment, suppose we have a collection of SAT scores, which are stored in a comma-delimited file called exercise0.csv.

The data file is linked from the Notes and Handouts section of the course website.

The task is to read the SAT scores into R and use R to calculate each of the characteristics listed above.

As it turns out, R has a built-in function called read.csv that is designed precisely for reading this type of file. Even better, we can specify the location of the file with a URL or web address.

The first step is to start R. Once R is running, you can access documentation through a web browser by entering the following command after the command prompt (>):

help.start()

Functionality in R is organized into collections of routines called packages. The package that contains most of the routines we will be using is called stats. You can find it by following the Packages link from the main documentation page and then selecting the stats link from the list of packages. Notice that R has quite a few packages and many routines.
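
The same information can be checked from the command prompt. The sketch below counts the objects in the stats package and confirms that several functions used later in this exercise are among them. Note that not everything comes from stats: read.csv, for example, belongs to the utils package.

```r
# How many objects the stats package provides (it is attached by default)
length(ls("package:stats"))

# Are the functions used later in this exercise part of stats?
c("median", "quantile", "sd", "IQR") %in% ls("package:stats")

# read.csv() lives in a different package
environmentName(environment(read.csv))   # "utils"
```

Typing ?sd (or help("sd")) at the prompt opens the help page for a single function.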

The next step is to read the SAT scores into R. The comma-delimited file consists of two columns of numbers, each beginning with a header giving a name for the data values in that column. By default, read.csv treats the first row as a header, but we will make this explicit by specifying header=TRUE in the read.csv statement.
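
To see what the header setting does without downloading anything, the sketch below passes a small made-up table to read.csv through its text argument. The three data rows are invented for illustration, and nothing is stored in the workspace.

```r
# header = TRUE: the first row supplies the column names X and x
str(read.csv(text = "X,x\n1,665\n2,551\n3,457", header = TRUE))

# header = FALSE: the first row is treated as data, so the columns get the
# default names V1 and V2 and are no longer numeric
str(read.csv(text = "X,x\n1,665\n2,551\n3,457", header = FALSE))
```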

The command below reads the data and stores it in a data structure called a data frame, named sat. It should be typed on a single line (the browser will probably break it into two lines on this page):

sat<-read.csv("http://www.sandgquinn.org/stonehill/MTH225/Spring2011/exercise0.csv",header=TRUE)

The easiest way to get the URL correct is to visit the course web page and copy the link location to the clipboard, then type

sat<-read.csv("",header=TRUE)

on a single line and paste the link location between the quotation marks before hitting enter.

As usual with R, if it works it will look like nothing happened; R simply displays a new prompt:

>

To verify that the data frame named sat was created, you can display the contents of the workspace with the command:

ls()

which should produce something like

[1] "sat"

One last step remains before we can begin examining the data values. R will assign names to the columns based on the values in the first row, and we want to be able to refer to those names directly. To see what the column names are, we will examine the structure of the data frame with the str() function:

str(sat)

The result should look like this:

'data.frame':	312 obs. of  2 variables:
 $ X: int  1 2 3 4 5 6 7 8 9 10 ...
 $ x: int  665 551 457 416 252 634 416 540 506 572 ...

Apparently the SAT scores are in a vector of integer values named x and the vector X contains sequence numbers. The command to allow us to refer to them directly is:

attach(sat)

Now we can easily display the values by entering

x

and

X
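
The columns can also be reached without attach(), which avoids confusion if the workspace already contains another object named x. The sketch below is one possible alternative rather than the method this exercise requires. It uses a hypothetical stand-in data frame called sat_demo, built from the first five rows shown in the str(sat) output above.

```r
# Stand-in data frame (hypothetical name), using the first five rows
# shown in the str(sat) output
sat_demo <- data.frame(X = 1:5, x = c(665L, 551L, 457L, 416L, 252L))

sat_demo$x                  # the x column as a vector
sat_demo[["x"]]             # the same column, selected by name
with(sat_demo, median(x))   # evaluate an expression using the column names
```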

To compute the mean, median, min, max, and first and third quartiles, we can use the summary() function:

summary(x)
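
The result of summary() is a named set of values, so individual entries can be picked out by name. The sketch below uses a made-up vector called scores (not the course data). Keep in mind that the printed summary is rounded, to roughly four significant digits by default, and that some older versions of R round the stored values as well, so mean() and median() are the safer choice when full precision is needed.

```r
# Hypothetical scores, made up for illustration
scores <- c(400, 450, 500, 550, 600)

s <- summary(scores)
names(s)         # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
s[["Mean"]]      # pick out the mean by name
s[["Median"]]    # pick out the median by name
```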

When calling summary(), be sure to pass the vector x and not the data frame sat. To compute the standard deviation, use

sd(x)
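
The word "approximately" in the definition of the standard deviation refers to the divisor: sd() divides the sum of squared deviations by n - 1 rather than n. The sketch below reproduces the calculation by hand on a made-up vector. The names scores, n, x_bar and by_hand are chosen for illustration only.

```r
# Hypothetical scores, made up for illustration
scores <- c(400, 450, 500, 550, 600)

n     <- length(scores)   # number of data values
x_bar <- mean(scores)     # the mean

# Square root of the sum of squared deviations divided by (n - 1)
by_hand <- sqrt(sum((scores - x_bar)^2) / (n - 1))

by_hand
sd(scores)                        # same value
all.equal(by_hand, sd(scores))    # TRUE
```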

Finally, the interquartile range is obtained using

IQR(x)
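
As a check on the definition, the interquartile range can also be computed as the third quartile minus the first. The sketch below does this for a made-up vector called scores, and also shows length(), which gives the last item in the list of characteristics, the number of data values.

```r
# Hypothetical scores, made up for illustration
scores <- c(400, 450, 500, 550, 600)

q <- quantile(scores, c(0.25, 0.75))   # first and third quartiles
unname(q[2] - q[1])                    # third quartile minus first: 100
IQR(scores)                            # same value

length(scores)                         # number of data values: 5
```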

Use these values to answer the questions posted on eLearn as Exercise 0.