R is a free statistics program that runs on Windows, Mac, and Linux.

You can download R from: http://lib.stat.cmu.edu/R/CRAN/

I have written an R script called tutorial.R (click to download) to demonstrate how to analyze categorical variables as both the independent and dependent variables (Chi-square test) and how to analyze a categorical independent variable and continuous dependent variable (ANOVA). You can probably just modify this file to do the stats part of Homework 3.

In that script, I read the data from ponies.csv (click to download), which is a CSV file containing values I made up. You can use any spreadsheet program (Microsoft Excel, LibreOffice Calc) to create CSV files, which are simply “comma-separated values” files.

Each row of the CSV file represents a participant, while the columns represent the data points (variables) you collect. The first row of the file is a header, which specifies the name of each variable. NOTE: Do not put spaces in these headers! By default, R rewrites a header that contains a space (the space becomes a period), so the column name you see in R will no longer match what you typed.
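
To make the layout concrete, here is a minimal sketch that reads a small CSV typed directly into the code. The column names match ponies.csv, but the values are hypothetical and are not the ones in my file. The last two lines show what R does by default to a header that contains a space.

```r
# Hypothetical CSV contents; these are NOT the values in ponies.csv.
csv_text <- "ParticipantID,LikePonies,PoniesOwned,Gender
1,Yes,3,Female
2,No,0,Male
3,Maybe,1,Female"

# read.csv() treats the first line as the header by default (header = TRUE).
example <- read.csv(text = csv_text)

print(names(example))  # column names taken from the header row
print(nrow(example))   # one row per participant

# A header containing a space is rewritten by default (check.names = TRUE):
# the space in "Ponies Owned" becomes a period.
spaced <- read.csv(text = "Ponies Owned,Gender\n3,Female")
print(names(spaced))
```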

I find it easiest to run R from the command line. Open a command prompt (and make sure that R's binary is in the path) and go to the directory that contains tutorial.R and ponies.csv. Then, run the following command: *R CMD BATCH tutorial.R*

This command tells R to run all of the lines of code in the specified R file in one shot. You will have two new files at the end. *Results.txt* contains the results of running your script. That is the important file. The second file, *tutorial.Rout*, is R's log of the batch run: it echoes each command as it was executed, along with any messages. I never look at this file unless something went wrong while running my script (for example, Results.txt doesn't contain everything I think it should). If there is an error, tutorial.Rout will contain the error message.
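
If you are wondering how a script can send its output to a separate file, one common mechanism in R is sink(). The sketch below assumes that mechanism and writes to a temporary directory; check tutorial.R itself for how Results.txt is actually produced there.

```r
# Assumption: output is redirected with sink(); tutorial.R may do this differently.
# Hypothetical location, so this sketch does not overwrite a real Results.txt.
out_file <- file.path(tempdir(), "Results.txt")

sink(out_file)                 # start sending printed output to the file
print(summary(c(1, 2, 3, 4)))  # this goes to the file, not the console
sink()                         # stop redirecting; output returns to the console

# Show what ended up in the file.
cat(readLines(out_file), sep = "\n")
```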

In this script, I do a number of different things that are hopefully clear from the comments. After defining some options and a useful helper function, I read the data in from the CSV file and store it in the data frame *data*. The columns in our data are *ParticipantID*, *LikePonies* (a categorical variable representing whether the participant likes ponies: "Yes", "Maybe", or "No"), *PoniesOwned* (a continuous variable containing how many ponies that participant owns), and the participant's *Gender*. To specify which column you're talking about, you can use the $ operator in R. That is, *data$Gender* refers to the "Gender" column of the data frame called *data*.
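
As a quick illustration of the $ operator, the sketch below builds a small data frame by hand instead of reading ponies.csv. The column names are the real ones; the values are hypothetical.

```r
# Hypothetical data frame with the same column names as ponies.csv.
data <- data.frame(
  ParticipantID = 1:4,
  LikePonies    = c("Yes", "Maybe", "No", "Yes"),
  PoniesOwned   = c(3, 1, 0, 2),
  Gender        = c("Female", "Male", "Male", "Female")
)

print(data$Gender)          # the whole Gender column, one entry per participant
print(data$PoniesOwned[2])  # the second participant's PoniesOwned value
str(data)                   # overview of every column and its type
```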

Next, I use R to bin the responses "Maybe" and "No" about liking ponies into "NonYes", which I then save as a new variable *LikePoniesBinary*. I did this because it will probably be useful for you to recode/bin data this way if you have Likert responses. Next, I print out the counts and percentages of the responses for our categorical variables. Afterwards, I slice the responses for LikePoniesBinary by gender in a contingency table (the rows represent one variable, while the columns represent another). This lets us see whether one variable seems to be independent of the other. To quantify this, I run a Fisher's Exact Test, finding that p = 0.2553. Therefore, we fail to reject the null hypothesis that gender and liking ponies are independent (not associated). Recall that Fisher's Exact Test is similar to Pearson's Chi-square test, except that it computes the p value exactly rather than relying on an approximation. Therefore, you should use Fisher's Exact Test when any cell in your contingency table has a small count (the usual rule of thumb is an expected count below 5), as was the case here. You may also use Fisher's Exact Test when every cell is large enough for the Chi-square test.
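
The steps in this paragraph can be sketched as follows. This is a simplified version on hypothetical responses, not a copy of tutorial.R, and using ifelse() for the recoding is my choice for the sketch; the p value it prints will not match the 0.2553 reported above.

```r
# Hypothetical responses; these are NOT the values in ponies.csv.
data <- data.frame(
  LikePonies = c("Yes", "Maybe", "No", "Yes", "Yes", "No", "Maybe", "Yes"),
  Gender     = c("Female", "Male", "Male", "Female",
                 "Male", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Bin "Maybe" and "No" into "NonYes" and save the result as a new column.
data$LikePoniesBinary <- ifelse(data$LikePonies == "Yes", "Yes", "NonYes")

# Counts and percentages for one categorical variable.
counts <- table(data$LikePoniesBinary)
print(counts)
print(round(100 * prop.table(counts), 1))

# Contingency table: rows are Gender, columns are LikePoniesBinary.
contingency <- table(data$Gender, data$LikePoniesBinary)
print(contingency)

# Fisher's Exact Test of independence between the two variables.
result <- fisher.test(contingency)
print(result$p.value)
```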

Afterwards, I look at the continuous variable PoniesOwned overall, and then individually by gender. Note that the expression *data$PoniesOwned[data$Gender=="Male"]* tells R to give us a vector of the PoniesOwned values for only those participants whose Gender is male. I then run a Shapiro-Wilk normality test on the distribution of PoniesOwned for males and for females. All of the p values are greater than 0.05, so we fail to reject the hypothesis that the data are normally distributed; in other words, the distributions are consistent with normality. Therefore, we can conduct an ANOVA, which is the final thing I do.
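
Here is a minimal sketch of the same sequence: subset by gender, test each group for normality, then run the ANOVA. The numbers are hypothetical, and the formula interface to aov() is one standard way to write the test, not necessarily line for line what tutorial.R does.

```r
# Hypothetical data; these are NOT the values in ponies.csv.
data <- data.frame(
  Gender      = c(rep("Male", 6), rep("Female", 6)),
  PoniesOwned = c(1, 2, 2, 3, 3, 4, 2, 3, 4, 4, 5, 6)
)

# Logical indexing keeps only the rows where the condition is TRUE.
males   <- data$PoniesOwned[data$Gender == "Male"]
females <- data$PoniesOwned[data$Gender == "Female"]

print(summary(data$PoniesOwned))  # overall distribution
print(summary(males))
print(summary(females))

# Shapiro-Wilk normality test for each group.
print(shapiro.test(males))
print(shapiro.test(females))

# One-way ANOVA: does mean PoniesOwned differ by Gender?
fit <- aov(PoniesOwned ~ Gender, data = data)
print(summary(fit))
```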

Going along, you probably noticed a few aspects of R's syntax. First of all, "<-" is the assignment operator in R, although newer versions also accept "=" for assignment. You'll also notice that we read our CSV file into a variable (*data*) that in our case holds a data frame. Conveniently, R stores each column of a data frame as a vector, and the vectors line up row by row, so when you compare data$Gender and data$LikePonies, corresponding elements of the two vectors come from the same participant. The # sign starts a comment: R ignores everything after it on that line.
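
These syntax points fit in a few lines. The vectors below are hypothetical stand-ins for two columns of a data frame.

```r
# Everything after a # on a line is a comment and is ignored by R.
x <- 5   # assignment with <-
y = 5    # assignment with = also works here
print(x == y)

# Two hypothetical vectors; position i in each belongs to participant i.
gender <- c("Male", "Female", "Female")
likes  <- c("No", "Yes", "Maybe")

# Comparisons are element-wise and return one TRUE/FALSE per participant.
print(gender == "Female")

# That logical vector can pick out the matching elements of another vector.
print(likes[gender == "Female"])
```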