Speaking Notes

PADM 5502

October 15, 2009

Dr. Neubauer

 

WHERE WE ARE:

 

 

I will refer back to these notes FREQUENTLY for the rest of this semester.  These notes will be important to you in PADM 5905 next semester.

 

The name of the file is sample_data.xls

 

 

It is possible to enter data directly into SPSS but it is easier to enter it into an Excel spreadsheet and then import the spreadsheet into SPSS.

 

Notice that the first row is used for the identifiers of the questions.  ID is not a question.  It is the identification number written on the top of each survey instrument returned.  Most of the data is code numbers -- for example in the "race" variable shown above.  However, age was measured at the interval level.  These are actual ages -- not coding numbers.  "page2" is just a page break "dummy variable."  It is not data.

 

Once the data is entered you want to make sure it is "clean." 

 

 

Your paper for this course should include the following parts.

 

Cover page
Introduction

Literature Review (brief)

Research Question and Research Model
Methodology, Population and Sampling Strategy

Hypotheses and Rationales

Findings

Discussion

Conclusion
Appendix 1: IRB Forms
Appendix 2: Survey Research Instrument

 

I highly recommend that you each make a table showing the mapping between questions in your instrument and your independent and dependent variables.  It should look something like the following. 

 

question number    variable name    independent or dependent variable being measured

1                  ID               not applicable

2                  gender           gender

3                  v3               income

 

The following pages have some of the SPSS commands we will need.

http://statlab.stat.yale.edu/help/doco/spss_basics.jsp

http://its.unm.edu/introductions/Spss_tutorial/session3.html

http://www.lrz-muenchen.de/~wlm/wlmsreco.htm

 

It is common to use 9 as the code for missing data, or 99 for a two-digit variable such as age, unless that number might be a valid value.

 

For each of your variables you need to tell SPSS about missing values.  For example:

 

MISSING VALUES gender (9).

 

When all the available data has been entered, open SPSS and import the data.

 

Using the Syntax Editor in SPSS, run frequencies on all your variables.  The command is as follows.  Remember that every command in SPSS must end with a dot (period).

 

freq all.

 

Check to see that all of your values are within appropriate ranges.  For example, if you have a 5 as a value of your gender variable, something is incorrect.  Go back, find that survey, and correct the data.
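The same range check can be sketched outside SPSS.  The short Python example below is only an illustration: the survey rows, the variable names and the list of legal codes are all made up for this sketch, and it assumes 9 is the missing-data code.  It tallies the values of one variable (like "freq") and lists any survey whose value is not a legal code, so you know which survey to pull and correct.

```python
from collections import Counter

# Hypothetical coded responses; each dict is one returned survey.
surveys = [
    {"ID": 1, "gender": 1, "var14": 4},
    {"ID": 2, "gender": 2, "var14": 5},
    {"ID": 3, "gender": 5, "var14": 2},   # 5 is not a valid gender code
    {"ID": 4, "gender": 9, "var14": 3},   # 9 is the assumed missing-data code
]

# Assumed codebook: the set of legal codes for each variable (9 = missing).
valid_codes = {"gender": {1, 2, 9}, "var14": {1, 2, 3, 4, 5, 9}}


def frequencies(rows, variable):
    """Count how often each value of one variable occurs."""
    return Counter(row[variable] for row in rows)


def out_of_range(rows, codebook):
    """Return (ID, variable, value) for every value that is not a legal code."""
    return [
        (row["ID"], variable, row[variable])
        for row in rows
        for variable, legal in codebook.items()
        if row[variable] not in legal
    ]


gender_counts = frequencies(surveys, "gender")
problems = out_of_range(surveys, valid_codes)
print(gender_counts)
print(problems)
```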

 

Put variable labels on each of your variables.

 

variable labels var14 'concerned about crime'.

 

Put value labels on each of your variables.

 

value labels var14 1 's disagree' 2 'disagree' 3 'undecided' 4 'agree' 5 's agree'.

 

Since you will have very little data to work with, "compress" your Likert items from five categories to three using the following syntax, assuming the original variable is named var14.

 

compute newvar14 = var14.

recode newvar14 (1=1) (2=1) (3=2) (4=3) (5=3).

variable labels newvar14 'concerned about crime'.

value labels newvar14 1 'disagree' 2 'undecided' 3 'agree'.
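If it helps to see the recode as a lookup table, here is a minimal Python sketch of the same five-to-three mapping.  The sample responses are invented, and treating any value outside 1 through 5 as missing (None) is an assumption of this sketch, not a description of what SPSS does.

```python
# Mapping taken from the recode: 1 and 2 become 1, 3 becomes 2, 4 and 5 become 3.
COLLAPSE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}
NEW_LABELS = {1: "disagree", 2: "undecided", 3: "agree"}


def collapse_likert(value):
    """Return the three-category code, or None for anything outside 1-5."""
    return COLLAPSE.get(value)


# Hypothetical var14 responses; 9 stands for a missing answer.
var14 = [1, 2, 3, 4, 5, 9, 4]
newvar14 = [collapse_likert(v) for v in var14]

for old, new in zip(var14, newvar14):
    print(old, "->", new, NEW_LABELS.get(new, "missing"))
```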

 

Since most of our data will be ordinal or nominal, we will use contingency tables and the Chi-square statistic to test hypotheses.

 

The command is like the one shown below.

 

crosstabs tables = DV by IV/cells = count column/stat = chisq.

 

If you have fifteen independent variables and three "aspects" to your dependent concept (each operationalized with only one question), this means your findings will be based upon at least 45 contingency tables (and 45 values of the Chi-square statistic).

 

For each of your hypotheses you will either find . . .

 

            no support

            partial support 

            support

 

For there to be (partial) support, the Chi-square value must be STATISTICALLY SIGNIFICANT.  You cannot tell just by looking at a Chi-square value.  You look at the "p" value.  P stands for probability.  Small p's are good.  The general convention is that if p < .05 then the value of Chi-square is significant.
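One reason the Chi-square value alone does not tell you is that the p value also depends on the size of the table, through the degrees of freedom: (rows - 1) times (columns - 1).  SPSS reports the p value for you.  The Python sketch below is only an illustration of that dependence, using standard closed-form formulas for the upper tail of the Chi-square distribution; the function name chi2_p_value and the example value 6.0 are made up for this sketch.

```python
import math


def chi2_p_value(chi2, df):
    """Upper-tail probability (p value) of a Chi-square value, whole-number df."""
    if chi2 < 0 or df < 1:
        raise ValueError("chi2 must be >= 0 and df must be >= 1")
    half = chi2 / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k = 0 .. df/2 - 1 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum over k = 1 .. (df-1)/2
    # of (x/2)^(k - 1/2) / Gamma(k + 1/2)
    total = 0.0
    for k in range(1, (df - 1) // 2 + 1):
        total += half ** (k - 0.5) / math.gamma(k + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


# The same Chi-square value in tables of different sizes.
for rows, columns in [(2, 2), (3, 2), (3, 3)]:
    df = (rows - 1) * (columns - 1)
    p = chi2_p_value(6.0, df)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{rows} x {columns} table, df = {df}: p = {p:.3f} ({verdict})")
```

In this sketch a Chi-square of 6.0 falls below .05 in a 2 x 2 table but not in a 3 x 3 table, which is why you always read the p value.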

 

So, why are "small p's" good? 

 

Let's assume that somehow we know that in a given population there is no relationship between gender and liking peanut butter.  Is it possible to take one random sample of the population and find that in that sample men are more likely than women to like peanut butter?  Yes.  Is it possible to take one random sample from that population and find that 90 percent of the men like peanut butter and only 5 percent of the women like peanut butter?  Yes, but it is very unlikely to get such a random sample from a population in which there is no relationship between gender and liking peanut butter.

 

Now, in reality we don't know whether or not there is a relationship between gender and liking peanut butter in the population.  We draw a RANDOM SAMPLE and find that in the sample 90 percent of the men like peanut butter and only 5 percent of the women like peanut butter.  There is only a VERY SMALL CHANCE that a sample like this was drawn randomly from a population in which there is no relationship between gender and liking peanut butter.  Given that we have this sample we conclude that there is in fact a relationship between gender and liking peanut butter in the population. 
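The peanut butter argument can be tried out with a small simulation.  The Python sketch below is a rough illustration under assumptions that are not in these notes: half of all men and half of all women in the population like peanut butter (so there is no relationship), and each random sample has 20 men and 20 women.  It draws many samples and counts how often men come out ahead of women at all, and how often the split is as extreme as 90 percent versus 5 percent.

```python
import random


def draw_sample(rng, n_men=20, n_women=20, p_like=0.5):
    """Draw one random sample; return the percent of men and of women who like it."""
    men_yes = sum(rng.random() < p_like for _ in range(n_men))
    women_yes = sum(rng.random() < p_like for _ in range(n_women))
    return 100 * men_yes / n_men, 100 * women_yes / n_women


def simulate(n_samples=10_000, seed=5502):
    """Return the share of samples where men are ahead, and where the gap is extreme."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    men_higher = 0
    extreme = 0
    for _ in range(n_samples):
        men_pct, women_pct = draw_sample(rng)
        if men_pct > women_pct:
            men_higher += 1
        if men_pct >= 90 and women_pct <= 5:
            extreme += 1
    return men_higher / n_samples, extreme / n_samples


share_men_higher, share_extreme = simulate()
print("samples where men are more likely to like it:", share_men_higher)
print("samples with 90 percent of men and 5 percent of women:", share_extreme)
```

Men come out ahead in a large share of samples just by chance, while the extreme split essentially never turns up.  That is the sense in which a small p is evidence against "no relationship."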

THESE ARE THE SPSS COMMANDS YOU WILL NEED TO USE.  Notice that all SPSS commands end with a dot.

 freq all. 

(Used to make sure all your data values are within range.)

 variable labels var4 'Gender'. 

(Used to put a label on a variable so the results produced by SPSS are easier to understand.)

 value labels var4 1 'Male' 2 'Female'. 

(Used to put a label on each value of a variable so the results produced by SPSS are easier to understand.)

 missing values var4 (9).

(Tells SPSS that if it finds a value of 9 in this column it is missing data.)

 compute newvar32 = var32.

(Makes a new variable just like var32 with the same values in the new variable.)

 variable labels newvar32 'I am optimistic about the future'.

 recode newvar32 (1=1) (2=1) (3=2) (4=3) (5=3).

(Reduces the number of categories of response from five to three by combining the first two and by combining the last two categories of response.)

 value labels newvar32 1 'Disagree' 2 'Undecided' 3 'Agree'.

(Adds value labels to your new variable so the results produced by SPSS are easier to understand.)

 freq newvar32.

(Allows you to verify that your new variable is what you intended.  In other words, check that it has the correct variable label and that it has three categories of response with the value labels Disagree, Undecided and Agree.)

 freq age.

(Displays a frequency of ages of respondents.  I am assuming this is interval data and you have collected the age of each participant.)

 missing values age (99). 

(Tells SPSS that if it finds 99 in this column this is not a 99-year-old person.  This is a person who did not answer the question.  If you really do have a 99-year-old participant, enter 98 and don't worry about it.)

 compute newage = age.

(This creates a new variable named newage that looks just like the variable age.)

 recode newage (1 thru 20 = 1) (21 thru 40 = 2) (41 thru 60 = 3) (61 thru 98 = 4).

(This changes the values in the variable newage so that there are four groups.  In other words, it converts interval data into ordinal data so you can use crosstabs analysis and the Chi-square statistic to test hypotheses.) 

 value labels newage 1 'Under 21' 2 '21 to 40' 3 '41 to 60' 4 'Over 60'.

(This creates value labels for the four categories of age in the variable newage.)
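As a cross-check on the age bands, here is a minimal Python sketch of the same grouping into four categories.  It assumes 99 is the missing-data code, as above; the function name age_group and the sample ages are made up for this illustration.

```python
MISSING_AGE = 99  # assumed missing-data code for age
AGE_LABELS = {1: "Under 21", 2: "21 to 40", 3: "41 to 60", 4: "Over 60"}


def age_group(age):
    """Convert an age in years to one of four ordinal groups (None if missing)."""
    if age is None or age == MISSING_AGE:
        return None
    if 1 <= age <= 20:
        return 1
    if 21 <= age <= 40:
        return 2
    if 41 <= age <= 60:
        return 3
    if 61 <= age <= 98:
        return 4
    raise ValueError(f"age out of range: {age}")


# Hypothetical ages, including the boundary values of each band.
ages = [18, 21, 40, 41, 60, 61, 98, 99]
groups = [age_group(a) for a in ages]
for a, g in zip(ages, groups):
    print(a, "->", AGE_LABELS.get(g, "missing"))
```

Checking the boundary ages (20 and 21, 40 and 41, 60 and 61) is a quick way to catch a band that was given the wrong group number.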

 Let's say that we are testing the hypothesis that females are more likely than males to be optimistic about the future.  The variable newvar32 is ordinal.  The variable gender is nominal, but it can be treated as if it were ordinal because it has only two categories.  We can use crosstabs analysis.

 The following is the general form of the crosstabs command in SPSS where DV stands for the dependent variable and IV stands for the independent variable.  Basically this says to SPSS, "make a table with the values of the independent variable across the top and the values of the dependent variable down the side.  In each cell I want to see the number of cases and the column percentage.  I also want to see the Chi-sq statistic and the 'p' value indicating the presence or absence of statistical significance."

 crosstabs tables = DV by IV/cells=count column/stat=chisq.
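To make "number of cases and column percentage" concrete, the Python sketch below builds such a table by hand.  The twenty responses are invented for illustration (gender coded 1 = male, 2 = female; newvar32 coded 1, 2, 3), and this is a sketch of the arithmetic, not of how SPSS works internally.  Each column percentage is the cell count divided by the total for that column of the independent variable.

```python
from collections import Counter

# Hypothetical (IV, DV) pairs: (gender, newvar32) for 10 males and 10 females.
responses = (
    [(1, 1)] * 4 + [(1, 2)] * 2 + [(1, 3)] * 4 +
    [(2, 1)] * 2 + [(2, 2)] * 2 + [(2, 3)] * 6
)


def crosstab(pairs):
    """Count cases in each (DV, IV) cell: DV down the side, IV across the top."""
    return Counter((dv, iv) for iv, dv in pairs)


def column_percentages(counts):
    """Express each cell as a percentage of its IV column total."""
    column_totals = Counter()
    for (dv, iv), n in counts.items():
        column_totals[iv] += n
    return {cell: 100 * n / column_totals[cell[1]] for cell, n in counts.items()}


counts = crosstab(responses)
percents = column_percentages(counts)

print("newvar32   male          female")
for dv in (1, 2, 3):
    cells = [f"{counts[(dv, iv)]:2d} ({percents.get((dv, iv), 0.0):5.1f}%)" for iv in (1, 2)]
    print(f"{dv:8d}   " + "   ".join(cells))
```

Reading across a row then compares like with like: in this made-up data 60 percent of females but only 40 percent of males fall in category 3.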

 Consider this hypothesis.
H5: There is no relationship between gender and willingness to return to a medical care facility.

You should be able to explain and interpret each of the following:

 

 

 You must have an expected cell count of at least five (5) in every cell before you can claim statistical significance.  This is a good reason to reduce the number of values of Likert-scale questions from five down to three.
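The expected count for a cell is its row total times its column total, divided by the number of cases.  The Python sketch below works this out for a made-up table of 80 respondents (five Likert categories by two genders) and again after combining the five categories into three.  The counts and function names are hypothetical; the point is only to show how combining categories raises the smallest expected count.

```python
def expected_counts(table):
    """Expected count per cell: row total * column total / number of cases."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in column_totals] for r in row_totals]


def chi_square(table):
    """Pearson Chi-square: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(table, expected)
        for o, e in zip(observed_row, expected_row)
    )


def collapse_rows(table):
    """Combine rows 1+2 and rows 4+5 of a five-row table (five categories to three)."""
    first = [a + b for a, b in zip(table[0], table[1])]
    last = [a + b for a, b in zip(table[3], table[4])]
    return [first, list(table[2]), last]


# Hypothetical counts. Rows: s disagree ... s agree. Columns: male, female.
five_categories = [[3, 2], [7, 8], [6, 4], [20, 22], [4, 4]]
three_categories = collapse_rows(five_categories)

min_five = min(min(row) for row in expected_counts(five_categories))
min_three = min(min(row) for row in expected_counts(three_categories))
print("smallest expected count, five categories:", min_five)
print("smallest expected count, three categories:", min_three)
print("Chi-square, three categories:", round(chi_square(three_categories), 2))
```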

 Given a hypothesis such that only one variable is used to operationalize the DV, there are three possible results.

 Say you operationalize the DV using three questions and your hypothesis anticipated a "positive" relationship between the IV and the entire dependent concept.  The possible outcomes are as follows.

You don't say that you found support for a null hypothesis.  If you found an apparent relationship where you did not expect to find one, that is interesting and probably worth writing about.

 Say you find support for the hypothesis (all three contingency tables are significant in the direction expected)

 Remember, never claim that you have "proved" anything.  At best, you have found support for a hypothesis regarding a relationship between an IV and a DV.

Now, there is the "cells with expected count less than 5" problem.  If you get that message from SPSS you don't have enough data to claim statistical significance even if your p value is small.  The more cells there are in your contingency table (and the smaller your sample) the more likely you are to have this problem. 

I use the "butter and toast" analogy.  If you don't have much butter (data) and you have only four pieces of toast (cells) you may be okay.  But if you only have a little butter you can't expect to have enough to spread on many pieces of toast.  If you only have a sample of 100 then you probably need to try to "crunch" your ordinal variables down into only two categories.  If the independent variable has two values and the dependent variable has two values, and you have 100 surveys you may avoid the "cells with expected count less than 5" problem.
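The butter and toast arithmetic can be made explicit.  Expected counts always add up to the number of surveys, so the average expected count per cell is simply the number of surveys divided by the number of cells.  The Python sketch below (the function name is made up) shows that average for 100 surveys and several table sizes.  It is a best case: real tables are uneven, so some cells will fall below the average.

```python
def average_expected_count(n_surveys, n_rows, n_columns):
    """Average expected count per cell: surveys spread over rows * columns cells."""
    return n_surveys / (n_rows * n_columns)


# 100 surveys (butter) spread over tables with more and more cells (toast).
for n_rows, n_columns in [(2, 2), (3, 2), (3, 3), (5, 3), (5, 5)]:
    average = average_expected_count(100, n_rows, n_columns)
    note = "at least one cell must be below 5" if average < 5 else "may be okay"
    print(f"{n_rows} x {n_columns} table: average {average:.1f} per cell ({note})")
```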

 

The following questions refer to the SPSS output shown above.

1.         How many respondents were under 55 years of age?

2.         How many respondents under 55 years of age said they would return to the medical facility?

3.         What percentage of respondents 55 and older said they would return to the medical facility?

4.         What percentage of those who said they would return were under 55 years of age?        (be careful)

5.         What is the value of the Pearson Chi-Square? ________________

6.         Can we make a statement about the statistical significance based on only the value of the Pearson Chi-Square?

7.         What is the p value associated with the Pearson Chi-Square value?

8.         Is the p value less than .05?

9.         Is there a statistically significant relationship between the IV and the DV based upon the data shown above?

10.       Why or why not?