bananalysis

veganmama18 August 11, 2010 at 1:59pm

assignment 2a answer key
(thanks to our anonymous expert R coder for providing excellent code)

1. summarize and create a histogram for the average number of live births per woman in the 1983 china data. describe the distribution. hint: this information is part of the questionnaire data. also, since we are interested in live births per woman, how should we subset the data?

*let’s first read in the data for 1983:
Q83data <- read.csv("CH83Q.CSV",as.is=TRUE,na.strings=".",strip.white=TRUE)

*now, since we’re interested in live births per woman, we need to subset our data differently than we have in the past, and include only females. So let’s subset our data:
Q83f <- subset(Q83data,Sex=='F' & Xiang==3)
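to see what a two-condition subset like this keeps, here is a minimal sketch on a tiny made-up data frame. the column names Sex, Xiang and Q192 follow the code above; the County column, the object names toy and toy_f, and every value are assumptions for illustration only, not the real questionnaire data.

```r
# toy data frame: all values are made up for illustration
toy <- data.frame(
  County = c("AA", "AA", "AA", "BB", "BB", "BB"),
  Sex    = c("F", "M", "F", "F", "M", "F"),
  Xiang  = c(3, 3, 1, 3, 3, 2),
  Q192   = c(4.1, NA, 3.9, 5.2, NA, 5.0)
)

# keep only rows where BOTH conditions hold: female AND Xiang equal to 3
toy_f <- subset(toy, Sex == "F" & Xiang == 3)

nrow(toy)    # number of rows before subsetting
nrow(toy_f)  # number of rows after subsetting
toy_f$Q192   # the values that summary() and hist() would then work on
```

the & means a row has to pass both tests, so a female row with a different Xiang value is dropped just like a male row.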

*and now summarize the data:
summary(Q83f$Q192)

-- we get the following result:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  3.000   4.000   4.300   4.359   4.700   6.900   5.000
this tells us that the minimum average number of live births among all counties was 3, and the maximum was 6.9. the median is 4.3 and the mean is 4.4, and the NA's column shows that 5 values are missing. since the mean and median are roughly the same, the distribution of average live births is probably roughly symmetric, and may be close to normal, or “bell-shaped.”
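a quick aside on the NA's column: here is a small sketch, using a made-up vector x (not the real data), of how missing values show up in a summary and how they affect mean and median.

```r
# made-up values, with one missing entry coded as NA
x <- c(3.0, 4.0, NA, 4.3, 6.9)

summary(x)               # the NA's column counts the missing values
sum(is.na(x))            # the same count, computed directly
mean(x)                  # NA: one missing value makes the plain mean missing
mean(x, na.rm = TRUE)    # drop the missing values first
median(x, na.rm = TRUE)  # same idea for the median
```

summary() leaves the missing values out of its statistics and just reports how many there were, which is why we still got a mean and median for 1983.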

*to check this, let’s plot our histogram (recall that xlab tells R the label for the x-axis, and similarly ylab tells R the label for the y-axis):
hist(Q83f$Q192,main="Distribution of average # livebirths (1983)",xlab="average # livebirths",ylab="frequency")
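one thing to keep in mind: by default hist() puts counts (frequencies) on the y-axis, not percentages. if a percent axis is what we want, here is one possible sketch. it uses a made-up vector x standing in for Q83f$Q192, and rescaling the counts by hand is just one way to do it, not the only one.

```r
# made-up values standing in for Q83f$Q192
x <- c(3.0, 3.6, 3.9, 4.0, 4.1, 4.3, 4.3, 4.4, 4.7, 4.9, 5.5, 6.9)

# compute the bins without drawing anything yet
h <- hist(x, plot = FALSE)

# rescale the bin counts so that they add up to 100
h$counts <- 100 * h$counts / sum(h$counts)

# draw the rescaled histogram; now the y-axis really is in percent
plot(h, main = "Distribution of average # livebirths (toy data)",
     xlab = "average # livebirths", ylab = "percent (%)")
```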

the distribution isn’t perfectly normal: it looks like it might be very slightly right-skewed (or positively skewed). the median is usually smaller than the mean for right-skewed distributions, and indeed this is the case for our 1983 live birth data (median 4.3, mean 4.359).
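to see why a right skew tends to pull the mean above the median, here is a tiny sketch with made-up numbers (the vector x is an assumption for illustration, not the real data): a couple of large values on the right drag the mean up, while the median barely notices them.

```r
# made-up values with a longer tail on the right
x <- c(3.0, 3.5, 4.0, 4.0, 4.2, 4.3, 4.5, 5.0, 6.0, 6.9)

mean(x)              # 4.54
median(x)            # 4.25
mean(x) - median(x)  # positive, which hints at a right skew
```

a positive difference is only a hint, so it is still worth looking at the histogram.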

2. summarize and create a histogram for the average number of live births per woman in the 1989 china data. describe the distribution.

*let’s repeat what we did for the 1983 data:
Q89data <- read.csv("CH89Q.CSV",as.is=TRUE,na.strings=".",strip.white=TRUE)
Q89f <- subset(Q89data,Sex=='F' & Xiang==3)
summary(Q89f$Q192)
hist(Q89f$Q192,main="Distribution of average # livebirths (1989)",xlab="average # livebirths",ylab="frequency")

-- for the summary, we get the following result:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.800   3.600   4.100   4.116   4.500   6.500
we see that the median and mean are almost identical, so again we suspect the data are roughly symmetric and close to normally distributed.
-- when we look at our histogram, it indeed looks pretty “bell-shaped” but might still be a bit right-skewed.

3. describe the change in average number of live births per woman between 1983 and 1989. hint: use information on the mean, median, as well as the graphical figures to help you.
-- we see that compared to 1983, there are, on average, fewer live births in 1989: the mean drops from 4.359 to 4.116 and the median from 4.3 to 4.1.
-- we also see that the 1989 data are slightly closer to being normally distributed than the 1983 data
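
if you want to put a number on the change, here is a small sketch that uses only the medians and means from the two summary() outputs above. the names s83, s89, change and pct are made up for this example.

```r
# median and mean copied from the 1983 and 1989 summary() output
s83 <- c(median = 4.300, mean = 4.359)
s89 <- c(median = 4.100, mean = 4.116)

change <- s89 - s83           # absolute change in each statistic
pct    <- 100 * change / s83  # change relative to 1983, in percent

round(change, 3)
round(pct, 1)
```

both differences come out negative, which matches what we see in the histograms.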

bonus!
another way of looking at distributions is creating what is called "density plots." imagine a histogram with very, very (very, very) narrow bars. this might give us a pretty "jagged" looking histogram, which may be hard to interpret. so now imagine we "smooth" out the rough edges... and this is basically what a density plot is. let's just try a few lines of code:

par(mfrow=c(2,2))
plot(density(Q83f$Q192,bw=0.35,na.rm=TRUE),main="Density plot: 1983 livebirths",xlab="# livebirths")
plot(density(Q89f$Q192,bw=0.35),main="Density plot: 1989 livebirths",xlab="# livebirths")

check out the plots! we can see more clearly that both the 1983 and 1989 data for average livebirths are slightly right-skewed.
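
another option is to draw both density curves on the same axes, which makes a shift between two samples easier to spot. this is only a sketch: the vectors a and b are simulated, made-up data (not the china data), and bw=0.35 is simply reused from the code above.

```r
set.seed(1)  # simulated, made-up data so the example runs on its own
a <- rnorm(60, mean = 4.4, sd = 0.6)
b <- rnorm(60, mean = 4.1, sd = 0.6)

# same bandwidth for both curves so that they are comparable
d_a <- density(a, bw = 0.35)
d_b <- density(b, bw = 0.35)

# draw the first curve, with axis limits wide enough for both
plot(d_a, main = "Density plots overlaid (toy data)", xlab = "# livebirths",
     xlim = range(d_a$x, d_b$x), ylim = range(0, d_a$y, d_b$y))

# add the second curve as a dashed line, then a legend
lines(d_b, lty = 2)
legend("topright", legend = c("sample a", "sample b"), lty = c(1, 2))
```

a bigger bw gives a smoother curve and a smaller bw a bumpier one, so it is worth trying a few values.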

(all plots attached) key2a.pdf

pradtf August 11, 2010 at 8:07am

unit2 module 1: examining distributions

this is a 3 part module dealing with distributions of the two types of variables discussed in the introduction.

the first section introduces the idea of categorical variables whereas the second handles quantitative variables.

categorical variables are non-numerical in concept. they are essentially groupings into categories like apples and oranges (eg types of fruit), lending themselves nicely to displays like pie and bar charts.
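
for anyone who wants to try this in R, here is a minimal sketch with a made-up fruit vector (the data and the names fruit and counts are assumptions for illustration): table() counts how many items fall in each category, and barplot() and pie() draw those counts.

```r
# made-up categorical data: one entry per piece of fruit
fruit <- c("apple", "orange", "apple", "banana", "apple", "orange")

# count how many times each category appears
counts <- table(fruit)
counts

# the two displays mentioned above
barplot(counts, main = "types of fruit (toy data)", ylab = "count")
pie(counts, main = "types of fruit (toy data)")
```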

quantitative variables, on the other hand, involve quantities that represent measurements such as the number of seeds different cantaloupes contain. there are various pictorial opportunities possible here such as histograms, stemplots and boxplots as well as numerical analyses such as finding measures of central tendency (mean, median) and spread (variance, standard deviation). the concept of outliers comes into play as well: these are fringe results that can be puzzling, such as a cantaloupe having 1 seed or 30103 seeds (a prime example of palindromic excessiveness).
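
here is a small sketch of those numerical summaries in R. the seed counts are invented for illustration (only the 1 and the 30103 come from the example above), and boxplot.stats() is used as one common rule of thumb for flagging outliers, not the only possible definition.

```r
# made-up seed counts for seven cantaloupes, including the two odd ones
seeds <- c(1, 480, 495, 505, 510, 530, 30103)

mean(seeds)    # pulled far upward by the single huge value
median(seeds)  # 505: hardly affected by the extremes
var(seeds)     # variance
sd(seeds)      # standard deviation

# values beyond 1.5 times the hinge spread from the box are flagged
boxplot.stats(seeds)$out

# the same information as a picture
boxplot(seeds, main = "seeds per cantaloupe (toy data)")
```

notice how far apart the mean and the median end up once the outliers are included.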

this is an important module forming the foundation for statistical mechanisms which will be utilized regularly!

in friendship,
prad

veganmama18 > pradtf August 11, 2010 at 10:09am

very nice summary! and coming soon... answer key in R to assignment 2a, and a post for assignment 2b! just what you wanted - more homework! (^_^)

veganmama18 August 1, 2010 at 1:08pm

assignment 2a
let's build on what we've learned from assignment 1, and on what you have covered in unit 2 (exploratory data analysis), module 1 (exploring distributions using graphs). the objectives for this assignment are to:

* plot histograms
* describe the distribution of a selected variable
* describe how that distribution might change over time

here are the tasks with some hints:
1. summarize and create a histogram for the average number of live births per woman in the 1983 china data. describe the distribution. hint: this information is part of the questionnaire data. also, since we are interested in live births per woman, how should we subset the data?

2. summarize and create a histogram for the average number of live births per woman in the 1989 china data. describe the distribution.

3. describe the change in average number of live births per woman between 1983 and 1989. hint: use information on the mean, median, as well as the graphical figures to help you.

hints for code
in assignment #1, we learned how to read in a CSV file containing data, summarize the data using the summary function, and create histograms using the hist function. we will use the same functions for this assignment.

one of our participants is also our acting "super master R coder" for this course and figured out a quirk of R: read.csv can convert character columns into factors, which are stored internally as integer codes, unless we tell it not to (for this, R has been given a detention). so, to read in the data and make sure R keeps vectors like "county" as plain text, we should include as.is=TRUE when we use the function read.csv.
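
to see what as.is=TRUE changes, here is a minimal sketch that reads a tiny made-up CSV from a text string instead of a file. the column names, the values and the object names are all assumptions for illustration. the factor conversion is requested explicitly with stringsAsFactors=TRUE so the result does not depend on which default your version of R uses.

```r
# a tiny made-up CSV, given as text so no file is needed
csv_text <- "county,births\nAA,4.1\nBB,3.9\nCC,5.2"

# ask for the factor conversion explicitly
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# as.is=TRUE keeps character columns as plain character vectors
as_char <- read.csv(text = csv_text, as.is = TRUE)

class(as_factor$county)       # "factor"
as.integer(as_factor$county)  # the integer codes hiding behind the labels
class(as_char$county)         # "character"
class(as_char$births)         # the numeric column is read as numbers either way
```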

pradtf August 1, 2010 at 11:18am

that's too bad robert - but thx for letting us know and glad you are following along.
things will be summarized in the original post and linked into the discussion so it should be easy to find things.

it is easy to pick up on the course anytime and go at your own pace too - so join in whenever you can.

in friendship,
prad

veganmama18 July 31, 2010 at 8:17am

liam, you're officially out of detention. i am too. :-D yay!!

Liam G > veganmama18 July 31, 2010 at 8:36am

Yay! I think I'll complain about Mr Prad ;)

pradtf > Liam G July 31, 2010 at 9:44am

what?!!
i was the one who helped to get you out of there!
vm puts you in, i help you get out and you complain about me!!

excellent!
i rather like that actually - helps to establish my fearsome reputation further (as in the good old days)!!
my horns are a tingle!

in fiendship,
prad

Liam G > pradtf July 31, 2010 at 10:13am

Ah so it was Veganmama was it? She was giving me some questionable code. hmmmm.

Ok I trust you prad. No fear here. Sorry. Haha

pradtf > Liam G July 31, 2010 at 10:32am

well not quite - veganmama and i both enjoyed putting you in detention, but when she gave herself a detention, i thought it would be a good idea for her to bring you out, because otherwise we couldn't give you a detention again, could we?

however, it is always beneficial for you to complain about me because doing so helps maintain my fearsome reputation - without my having to actually do much. besides, right now you're the only student brave enough to post here that vm and i have to pick on. the others are hiding. :D

in friendship,
prad
