Building a Poll Part 8: Random sampling
It's been a while since we've had an update in the Building a Poll series, and I'm excited to pick it up again. When we last left off, we were talking about using a list strategically. I'd like to continue on the subject of using lists, but we're going to talk about a more mechanical aspect of using the list: building a random sample.
Granted, if you're a pollster, you're likely to have purchased a random sample from your data vendor, but it never hurts you to understand the mechanics of how this works. Heck, maybe one of your clients stuck you with a membership list or some other targeted list and asked you to call a random sample of that list. You can't very well send that list off to your data vendor and ask them to sample it for you!
This is easier to do if we pretend that we have a situation and a list from which we're working. Let's assume that we have the membership list of some fake organization - how about we call it the Association of Snood Players of America? Now, suppose that ASPA wants to conduct a poll of their membership, and they give you their entire membership list of 100,000 from which to pull a random sample of 500. What do you do?
Well, the first thing that you want to do is make sure that the list is completely and thoroughly randomized. (We'll get to why in a bit, but for now, let's just take it for granted.) Odds are that if it's a membership list, it's not at all randomized. It's probably organized by some pretty reasonable factors, like geography, age, gender, etc. This is because a good sorting makes it easy for ASPA to find the records that they're seeking in a hurry. Unfortunately for you, this doesn't make for easy random sampling. But have no fear! There's a really, really, really, easy solution: you can just randomize the list.
Randomizing the list consists of two steps: 1) assigning each record a random number and 2) sorting the list by that random number. Let's put this random number in a variable called RAND. So, if you're using SPSS, you can do it like this:
SET SEED RANDOM.
COMPUTE RAND=RV.UNIFORM(0,1).
FORMATS RAND (F8.6).
SORT CASES BY RAND.
Ta dah! Now all the records in your list are completely disorganized, or, randomized.
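If you don't have SPSS handy, the same two steps can be sketched in Python. This is only an illustration, not the method used in this series: the `randomize` function, the `members` list of made-up member IDs, and the fixed seed of 2024 are all assumptions for the sake of the example.

```python
import random

def randomize(records, rng=None):
    """Return a new list in random order by sorting on a per-record random key."""
    if rng is None:
        rng = random.Random()
    # Step 1: pair every record with a uniform random number in [0, 1), like RAND.
    keyed = [(rng.random(), record) for record in records]
    # Step 2: sort on that random number only, like SORT CASES BY RAND.
    keyed.sort(key=lambda pair: pair[0])
    return [record for _, record in keyed]

# Hypothetical stand-in for the ASPA list: member IDs 1..100,000 in sorted order.
members = list(range(1, 100_001))

# A fixed seed is used here only so the example is repeatable.
shuffled = randomize(members, random.Random(2024))
```

Every member is still on the list exactly once; only the order has changed. In everyday Python you would more likely reach for `random.shuffle`, but the key-and-sort version mirrors the SPSS syntax more closely.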
Step one was easy, but now comes step two: picking out those 500 cases. How do you do it? How do you decide exactly which ones to get in such a way that you don't introduce bias into the selection? Have no fear - there is an easy method. So, you have 100,000 records, of which you want 500. If you pick one record at random from the first 200, then skip down the list 200 more records, and keep doing this, you will end up with 500 randomly selected records. The starting point is called your seed. In case it's not obvious how I got the number 200, you divide the size of the list (100,000) by the number of cases that you want (500), and round to the nearest integer (200).
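To make the arithmetic concrete, here is a small Python sketch of the skip pattern. The variable names and the fixed random seed of 42 are assumptions for illustration only; the list size and sample size are the ASPA numbers from above.

```python
import random

list_size = 100_000   # records on the ASPA list
sample_size = 500     # cases we want

# Divide the list size by the sample size and round to the nearest integer.
interval = round(list_size / sample_size)

# Pick the starting point at random from the first block of records.
rng = random.Random(42)  # fixed only so the example is repeatable
start = rng.randint(1, interval)

# Take the starting record, then every interval-th record after it.
positions = list(range(start, list_size + 1, interval))
```

Whatever starting point between 1 and 200 comes up, `positions` holds exactly 500 list positions, each 200 apart.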
Perhaps now you see why it was important that we randomized the list. Had we not randomized the list, we would have had a sample that was heavily tilted towards whatever variables were sorting the list. This is what pollsters and statisticians call bias. By randomizing the list, we've largely avoided that. So, this is all great, but you probably want to know the mechanics of how to get this done.
Well, the first thing that you want to do is assign each record a number from 1 to 100,000 - this will be your index.
COMPUTE INDY=$CASENUM.
You have just now created a variable on every record in your dataset called INDY. INDY is equal to whatever position that record holds on the list.
Now, this is where things become a little bit more complicated, because we're going to implement what we just discussed above in SPSS syntax. So, here we go. The first thing that we're going to do is set our seed, which, to be convenient, should be between 1 and 200. Just pick a number randomly; in this case, because I'm a big fan of Illuminatus, I'm going to pick the number 5. Then, we want to tell SPSS that we need to add 200 to the seed to get the next number. This is called the length. So, what you need to do is divvy the whole thing up into blocks of 200, or intervals, and select the fifth record from each interval.
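* SEED is the starting record, LENGTH is the size of each block.
COMPUTE SEED=5.
COMPUTE LENGTH=200.
* INTERVAL is the zero-based block that each record falls in.
COMPUTE INTERVAL=TRUNC(INDY/LENGTH).
* The last record of a block divides evenly, so move it back into its own block.
IF (INTERVAL=(INDY/LENGTH)) INTERVAL=INTERVAL-1.
* SAMPLE is the position to pick within this record's block.
COMPUTE SAMPLE=(LENGTH*INTERVAL)+SEED.
* If SAMPLE is not a whole number, round it up to the next position.
IF (SAMPLE>TRUNC(SAMPLE)) SAMPLE=TRUNC(SAMPLE)+1.
EXECUTE.
SELECT IF SAMPLE=INDY.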
Ta dah! You have now selected 500 members from the membership list of the Association of Snood Players of America. Feel free to do whatever you want to do to these cases, like exporting them, matching them to other data, etc.
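If you'd like to convince yourself that the block logic really does pick 500 records, here is a rough Python rendering of the same per-record test. It's a sketch under a few assumptions: the function name `is_selected` is made up, integer division stands in for TRUNC, and the final rounding guard is left out because the seed and length are whole numbers here.

```python
def is_selected(indy, seed=5, length=200):
    """Return True if the record at list position indy falls in the sample."""
    # Zero-based block number, like TRUNC(INDY/LENGTH).
    interval = indy // length
    # The last record of a block divides evenly, so move it back into its own block.
    if indy % length == 0:
        interval -= 1
    # Position to pick within this record's block.
    sample = length * interval + seed
    return sample == indy

# Run the test over every list position from 1 to 100,000.
selected = [indy for indy in range(1, 100_001) if is_selected(indy)]
```

Here `selected` comes out as positions 5, 205, 405 and so on up to 99,805, which is 500 records in all.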
Now, these days, just about any number cruncher that you can buy will have an automated, graphical user interface way of randomly sampling a list such that you never have to do this on your own, but I think that it's important to understand the mechanics of how such things operate.
We're going to continue with a discussion of what we just did here in the next installment of Building a Poll.
Dirty D













