05-436 / 05-836 / 08-534 / 08-734 Usable Privacy and Security

Basic Stats with R

R is a free statistics program that runs on Windows, Mac, and Linux.

You can download R from: http://lib.stat.cmu.edu/R/CRAN/

I have written an R script called tutorial.R (click to download) to demonstrate how to analyze categorical variables as both the independent and dependent variables (Chi-square test) and how to analyze a categorical independent variable and continuous dependent variable (ANOVA). You can probably just modify this file to do the stats part of Homework 3.

In that script, I read the data from ponies.csv (click to download), which is a CSV file containing values I made up. You can use any spreadsheet program (Microsoft Excel, LibreOffice Calc) to create CSV files, which are simply “comma-separated values” files.

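Each row of the CSV file represents a participant, while the columns represent the data points (variables) you collect. The first row of the file is a header, which specifies the name of each variable. NOTE: Do not put spaces in these header names! By default, R's read.csv rewrites a header such as "Ponies Owned" to Ponies.Owned, so the name you would have to type in R no longer matches the name in your file.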
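As a rough illustration of this layout, the sketch below reads a tiny CSV from an in-memory string instead of a file. The three rows are invented for this example and are not the contents of ponies.csv; only the column names follow the ones used in this tutorial.

```r
# Hypothetical three-participant CSV; values are made up for illustration.
csv_text <- "ParticipantID,LikePonies,PoniesOwned,Gender
1,Yes,3,Female
2,No,0,Male
3,Maybe,1,Female"

# read.csv(text = ...) parses a string exactly as it would parse a file
data <- read.csv(text = csv_text)
print(names(data))  # column names are taken from the header row
print(nrow(data))   # one row per participant

# A header containing a space is rewritten on import (check.names = TRUE by default)
spaced <- read.csv(text = "Ponies Owned,Gender\n3,Female")
print(names(spaced))
```

With a real file, you would pass the file name (for example "ponies.csv") as the first argument instead of text.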

I find it easiest to run R from the command line. Open a command prompt (and make sure that R's binary is in the path) and go to the directory that contains tutorial.R and ponies.csv. Then, run the following command: R CMD BATCH tutorial.R

This command tells R to run all of the lines of code in the specified R file in one shot. You will have two new files at the end. The first, Results.txt, contains the results of running your script. That is the important file. The second, tutorial.Rout, contains a log of the commands as they were run in batch mode. I never look at this file unless something went wrong (for example, Results.txt doesn't contain everything I think it should). If there is an error, tutorial.Rout will contain the error message.
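One common way for a script to send its output to a file such as Results.txt is R's sink() function. Whether tutorial.R does exactly this is an assumption on my part, so treat the following as a minimal sketch. It writes to a temporary directory so that it does not overwrite anything in your working directory.

```r
# Assumption: output is redirected with sink(); tutorial.R may do this differently.
out_file <- file.path(tempdir(), "Results.txt")  # temporary location for this demo

sink(out_file)                 # from here on, printed output goes to the file
print(summary(c(1, 2, 3, 4)))  # any printed result ends up in out_file
cat("Done\n")
sink()                         # stop redirecting; output goes back to its usual place

# Read the file back to confirm what was written
results <- readLines(out_file)
print(results)
```

Anything printed outside the sink() ... sink() pair, along with any error message, would show up in the .Rout file when the script is run in batch mode.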

In this script, I do a number of different things that are hopefully clear from the comments. After defining some options and a useful helper function, I read the data in from the CSV file and store it in a data frame called data. The columns in our data are ParticipantID, LikePonies (a categorical variable representing whether the participant likes ponies: "Yes", "Maybe", or "No"), PoniesOwned (a continuous variable containing how many ponies that participant owns), and Gender. To specify which column you're talking about, you can use the $ operator in R. That is, data$Gender refers to the "Gender" column of the data frame called data.
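To make the $ operator concrete, here is a small sketch that builds a data frame by hand with the same four column names. The six rows are invented for illustration and are not the values in ponies.csv.

```r
# Hypothetical stand-in for the data frame read from ponies.csv
data <- data.frame(
  ParticipantID = 1:6,
  LikePonies    = c("Yes", "Maybe", "No", "Yes", "Yes", "No"),
  PoniesOwned   = c(3, 1, 0, 4, 2, 1),
  Gender        = c("Female", "Male", "Male", "Female", "Male", "Female")
)

print(data$Gender)             # the Gender column, as a vector
print(mean(data$PoniesOwned))  # functions can be applied directly to a column
str(data)                      # overview of every column and its type
```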

Next, I use R to bin the responses "Maybe" and "No" about liking ponies into "NonYes", which I then save as a new variable LikePoniesBinary. I did this since it will probably be useful for you to recode/bin data this way if you have Likert responses. Next, I print out the counts and percentages of the responses for our categorical variables. Afterwards, I slice the responses for LikePoniesBinary by gender in a contingency table (the rows represent one variable, while the columns represent another). This lets us see whether one variable seems to be independent of the other. To quantify this, I run a Fisher's Exact Test, finding that p = 0.2553. Therefore, we fail to reject the null hypothesis that gender and liking ponies are independent (not associated). Recall that Fisher's Exact Test is similar to Pearson's Chi-square test, except that it calculates the p value exactly rather than relying on a large-sample approximation. Therefore, you should use Fisher's Exact Test when any of the cells in your contingency table contain fewer than 5 elements, as was the case here (strictly speaking, the usual rule of thumb concerns expected cell counts below 5). You may use Fisher's Exact Test even if all cells contain more than 5 elements, though.
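The steps in this paragraph can be sketched roughly as follows. The data are invented, so the p value will not match the 0.2553 reported above, and the use of ifelse() for binning is my choice for this sketch rather than necessarily what tutorial.R does.

```r
# Hypothetical responses from eight participants (not the values in ponies.csv)
data <- data.frame(
  LikePonies = c("Yes", "Maybe", "No", "Yes", "Yes", "No", "Maybe", "Yes"),
  Gender     = c("Female", "Male", "Male", "Female", "Male", "Female", "Male", "Female")
)

# Bin "Maybe" and "No" together as "NonYes"
data$LikePoniesBinary <- ifelse(data$LikePonies == "Yes", "Yes", "NonYes")

# Counts and percentages for the binned variable
counts <- table(data$LikePoniesBinary)
print(counts)
print(round(100 * prop.table(counts), 1))

# Contingency table: rows are Gender, columns are LikePoniesBinary
contingency <- table(data$Gender, data$LikePoniesBinary)
print(contingency)

# Fisher's Exact Test of independence between the two variables
result <- fisher.test(contingency)
print(result$p.value)
```

If all cells were comfortably large, chisq.test(contingency) would run Pearson's Chi-square test on the same table.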

Afterwards, I look at the continuous variable PoniesOwned overall, and then individually by gender. Note that the expression data$PoniesOwned[data$Gender=="Male"] tells R to give us a vector of the PoniesOwned values for only those participants whose Gender is "Male". I then run a Shapiro-Wilk normality test on the distribution of PoniesOwned for males and for females. All of the p values are greater than 0.05, so we fail to reject the null hypothesis that the data are normally distributed; that is, the distributions are consistent with being normal. Therefore, we can conduct an ANOVA test, which is the final thing I do.
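A minimal sketch of this part of the analysis is shown below. The twelve rows are invented for illustration, so the resulting p values say nothing about ponies.csv, and the variable names males, females, and fit are my own.

```r
# Hypothetical data: twelve participants, alternating Female/Male
data <- data.frame(
  PoniesOwned = c(3, 1, 0, 4, 2, 1, 5, 2, 3, 0, 4, 2),
  Gender      = rep(c("Female", "Male"), times = 6)
)

# Logical indexing: keep only the PoniesOwned values for one gender
males   <- data$PoniesOwned[data$Gender == "Male"]
females <- data$PoniesOwned[data$Gender == "Female"]

# Descriptive statistics overall and per group
print(summary(data$PoniesOwned))
print(summary(males))
print(summary(females))

# Shapiro-Wilk normality test for each group
sw_male   <- shapiro.test(males)
sw_female <- shapiro.test(females)
print(sw_male$p.value)
print(sw_female$p.value)

# One-way ANOVA: does mean PoniesOwned differ by Gender?
fit <- aov(PoniesOwned ~ Gender, data = data)
print(summary(fit))
```

The "Pr(>F)" column of the ANOVA summary holds the p value for the effect of Gender.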

Going along, you probably noticed a few aspects of R's syntax. First of all, "<-" is the assignment operator in R, although newer versions also support "=" for assignment. You'll also notice that we read our CSV file into a variable (data) that in our case represents a data frame. Conveniently, R treats each column of a data frame as a vector, and the columns are aligned by row, so the corresponding elements of data$Gender and data$LikePonies come from the same participant. The # sign starts a comment; R ignores everything from the # to the end of that line.
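These syntax points can be seen together in a few lines. The vectors gender and like are hypothetical stand-ins for data$Gender and data$LikePonies.

```r
x <- 5   # assignment with <-
y = 5    # assignment with = also works at the top level
print(x == y)

# Two parallel vectors: position i in each belongs to participant i
gender <- c("Male", "Female", "Male")
like   <- c("Yes", "No", "Yes")

# Comparisons are applied element by element, giving one result per participant
male_and_likes <- gender == "Male" & like == "Yes"
print(male_and_likes)
print(sum(male_and_likes))  # TRUE counts as 1, so this counts matching participants
```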