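The first step in analyzing a set of data is a cursory examination of some of its characteristics, usually including the mean, median, minimum, maximum, first and third quartiles, standard deviation, and interquartile range.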
For the purpose of this assignment, suppose we have a collection of SAT scores, which are stored in a comma-delimited file called exercise0.csv.
The data file is linked from the Notes and Handouts section of the course website.
The task is to read the SAT scores into R and use R to calculate each of the characteristics listed above.
As it turns out, R has a built-in function called read.csv that is designed precisely for reading this type of file. Even better, we can specify the location of the file with a URL or web address.
The first step is to start R. Once R is running, you can access documentation through a web browser by entering the following command at the command prompt (>):
help.start()
Functionality in R is organized into collections of routines called packages. The package that contains most of the routines we will be using is called stats. You can find it by following the Packages link from the main documentation page and then selecting the stats link from the list of packages. Notice that R has quite a few packages and many routines.
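Not every routine used in this exercise lives in stats. As a small illustration, the sketch below asks R which package defines a given function. The helper name package_of is made up for this example and is not part of R.

```r
# package_of is a hypothetical helper name, not part of R itself.
# It reports the name of the package (namespace) in which a function is defined.
package_of <- function(f) environmentName(environment(f))

package_of(sd)        # standard deviation
package_of(IQR)       # interquartile range
package_of(read.csv)  # comma-delimited file reader
```

sd and IQR should report stats, while read.csv is provided by the utils package, which R also loads by default.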
The next step is to read the SAT scores into R. The comma-delimited file consists of two columns of numbers, each beginning with a header giving a name for the data values in that column. R will probably guess that this is the case because the elements in the first row are not numeric, but we will help it make the correct choice by specifying header=TRUE in the read.csv statement.
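To see what the header setting changes, here is a minimal sketch that reads a made-up three-row sample from a character string instead of a file. The sample text and the variable names are assumptions for illustration. Only the layout (a header row followed by rows of sequence number and score) mirrors the course file.

```r
# A made-up three-row sample in the same layout as the course file:
# a header row followed by rows of sequence number, score.
sample_text <- "X,x\n1,665\n2,551\n3,457"

# header=TRUE: the first row supplies the column names.
with_header <- read.csv(text = sample_text, header = TRUE)

# header=FALSE: the first row is treated as data, so R invents
# default column names and the columns are no longer numeric.
without_header <- read.csv(text = sample_text, header = FALSE)

str(with_header)
str(without_header)
```

With header=TRUE the data frame has three rows of integers. With header=FALSE it has four rows, because the names X and x are read as data values.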
The command to read the data and store it in a data structure called a data frame named sat, which should be typed on a single line (the browser will probably break it into two lines in this page, but you should type it in on a single line), is:
sat<-read.csv("http://www.sandgquinn.org/stonehill/MTH225/Spring2011/exercise0.csv",header=TRUE)
The easiest way to get the URL correct is to visit the course web page and copy the link location to the clipboard, then type
sat<-read.csv("",header=TRUE)
on a single line and paste the link location between the quotation marks before hitting enter.
As usual with R, if it works it will look like nothing happened:
>
To verify that the data frame named sat was created, you can display the contents of the workspace with the command:
ls()
which should produce something like
[1] "sat"
One last step remains before we can begin examining the data values. R will assign names to the columns based on the values in the first row, and we want to be able to refer to those names directly. To see what the column names are, we will examine the structure of the data frame with the str() function:
str(sat)
The result should look like this:
'data.frame': 312 obs. of 2 variables:
$ X: int 1 2 3 4 5 6 7 8 9 10 ...
$ x: int 665 551 457 416 252 634 416 540 506 572 ...
Apparently the SAT scores are in a vector of integer values named x and the vector X contains sequence numbers. The command to allow us to refer to them directly is:
attach(sat)
Now we can easily display the values by entering
x
and
X
To compute the mean, median, min, max, and first and third quartiles, we can use the summary() function:
summary(x)
(be sure to use the name of the vector and not the name of the data frame here). To compute the standard deviation, use
sd(x)
Finally, the interquartile range is obtained using
IQR(x)
Use these values to answer the questions posted on eLearn as Exercise 0.
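As a quick sanity check on the summary commands, here is a sketch that uses only the ten scores shown in the str() output rather than the full file. The data frame name demo and the variable scores are assumptions for illustration, and demo$x is used in place of attach() so the example does not depend on what is attached in your workspace.

```r
# The first ten scores displayed by str(sat); the full file has 312.
demo <- data.frame(X = 1:10,
                   x = c(665, 551, 457, 416, 252, 634, 416, 540, 506, 572))

# demo$x extracts the column directly, with no attach() needed.
scores <- demo$x

summary(scores)   # min, quartiles, median, mean, max
sd(scores)        # standard deviation
IQR(scores)       # interquartile range

# The IQR is the distance between the third and first quartiles.
q <- quantile(scores, c(0.25, 0.75))
unname(q[2] - q[1])
```

These values describe only the ten-score sample, so they will not match the answers for the full data set.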