Using census data as a pollster

I'd like to follow up on Pluribus' brilliant post on downloading and using census data with an explanation of how these data can be useful to you from a polling perspective. You're starting to see some of this happen with voter file vendors, but, hey, remember, the point of this site is to help smaller groups who can't afford a Catalist subscription do all the crazy fun data things that we do. 


So, as Pluribus said,

This is, quite literally, a gold-mine of information. In addition to the formal census taken every 10 years, our government pays for smaller efforts to keep tabs on important demographic shifts and provides ongoing projections on the many diverse faces of America. Changes in demographics will affect elections for the foreseeable future in many complex and exciting ways.

When you're building a poll, you have to decide whom you're going to ask your questions. After all, the answers are only useful if the right people were asked! So, how do pollsters decide whom to ask? It's all summed up in one word: counts. We look at the area that we're going to poll, say, a Congressional District, and try to learn as much as we can about it. What is the ethnicity breakdown? What is the gender breakdown? What do we know about how frequently people vote? There are all kinds of questions that we ask. Now, the way that we answer these questions is by using lists as a sampling frame. We obviously cannot go into the district and ask everyone there every relevant thing about themselves, so we use files as a representation of the area, and make our determinations based on these lists.

Now, as Blue Leader can tell you, building these files isn't easy, and you frequently wind up leaving out lots of people. After all, the primary sources of the voter files are lists of registered voters collected by the various Secretaries of State, clerks, etc., and God knows they're not all that good at it. What that means is that when we're looking at counts from the list, there's a non-trivial possibility that the list is inaccurate, and that we'll be introducing some degree of error into the poll. For example, let's assume that the CD we're looking at is 45% Hispanic. Finding and registering Hispanic voters is notoriously hard, so there's no guarantee that the list actually has all of them. So, if we were to just set our quotas based on what the voter file tells us, our counts could be off, and the utility of the poll would be compromised. We can use census data as an independent measure of the CD to help correct what we learn from the lists.
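To make the quota problem concrete, here is a minimal sketch in Python. The 45% Hispanic figure is the example from above; the 38% voter-file share, the 600-interview sample size, and the function names are all made up for illustration. Keep in mind that census shares describe the population of the area, not registered voters, so treat this as a cross-check on the file rather than a drop-in replacement for it.

```python
def quota_targets(sample_size, shares):
    """Turn population shares into interview quotas that sum to sample_size."""
    raw = {group: sample_size * share for group, share in shares.items()}
    quotas = {group: int(value) for group, value in raw.items()}
    # Hand any leftover interviews to the groups with the largest remainders.
    leftover = sample_size - sum(quotas.values())
    by_remainder = sorted(raw, key=lambda g: raw[g] - quotas[g], reverse=True)
    for group in by_remainder[:leftover]:
        quotas[group] += 1
    return quotas


def coverage_ratios(file_shares, census_shares):
    """Census share divided by voter-file share.

    A ratio above 1 suggests the file under-represents that group.
    """
    return {g: census_shares[g] / file_shares[g] for g in census_shares}


# Shares for the example CD. The voter-file numbers are hypothetical.
census = {"Hispanic": 0.45, "Non-Hispanic": 0.55}
voter_file = {"Hispanic": 0.38, "Non-Hispanic": 0.62}

print(quota_targets(600, census))
print(coverage_ratios(voter_file, census))
```

If the ratio for a group comes out well above 1, that is a hint that quotas set straight from the file would shortchange that group.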

So, why is it that we use lists at all, rather than just the census? It's because census data and voter file lists are trying to solve different kinds of problems. The goal of building a voter file is to have a reliable list of individuals. The goal of the census is entirely different: it's trying to collect aggregated data about geographic areas. Because the census is concerned with areas, and not individuals, it is able to use modeling and statistics to a much greater extent than a voter file vendor can. That being the case, we can use its data to verify and correct for possible errors in the sampling frame.

There's another use, too, and as far as I know, no one else is doing this, but I think that it would be a great idea. One great thing that certain voter file vendors are doing these days is putting the latitude and longitude for an address on the file. What can we do with this? Well, for starters, once we have a dataset that has all the respondents' information from the file and their answers to the questions, we can use these coordinates to get even more information from the census. There are certain data that are incredibly useful to know, but that you can't get a good answer to from a poll. For example, you can't really get a good measure of the level of education in a neighborhood by asking someone, "How many of your neighbors have post-graduate degrees?" You also can't get a good measure of the ethnic mix of the area by asking. What you can do, however, is ask the U.S. Census! By using the coordinates, you can get all kinds of neat data about the area, and these are things that can really inform your interpretations.

Consider the following: you can look at the African-Americans in your poll and see how their responses to questions change based on the ethnic demographics of their area. Are African-Americans who live in areas that are more white likely to answer questions differently than African-Americans who live in areas that are more Asian-American? How do their answers differ based on the level of education in the area? All of these are things that tell us more about the respondents, and help us in our targeting.
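Here's a rough sketch of what that kind of analysis might look like in Python. Everything in it is hypothetical: the respondent rows, the tract IDs (which assume each respondent's coordinates have already been matched to a census area), the pct_postgrad field, and the 15% cutoff. It only shows the shape of the idea, which is attaching area-level figures to each respondent and then comparing answers across types of neighborhood.

```python
from collections import defaultdict

# Hypothetical poll rows. "tract" is assumed to have been derived from
# the latitude/longitude on the voter file; "support" is a 0/1 answer.
respondents = [
    {"id": 1, "tract": "A", "support": 1},
    {"id": 2, "tract": "A", "support": 0},
    {"id": 3, "tract": "B", "support": 1},
    {"id": 4, "tract": "B", "support": 1},
    {"id": 5, "tract": "C", "support": 0},  # no area data for this tract
]

# Hypothetical area-level figures, keyed by tract.
tract_stats = {
    "A": {"pct_postgrad": 0.05},
    "B": {"pct_postgrad": 0.30},
}


def support_by_education(respondents, tract_stats, cutoff=0.15):
    """Average support among respondents in low- vs high-postgrad areas."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [support sum, count]
    for row in respondents:
        stats = tract_stats.get(row["tract"])
        if stats is None:
            continue  # skip respondents we have no area data for
        if stats["pct_postgrad"] >= cutoff:
            bucket = "high_postgrad"
        else:
            bucket = "low_postgrad"
        totals[bucket][0] += row["support"]
        totals[bucket][1] += 1
    return {bucket: s / n for bucket, (s, n) in totals.items()}


print(support_by_education(respondents, tract_stats))
```

The same pattern would work for ethnic mix or any other area-level measure: swap in a different field and a different way of bucketing.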

The one thing that stops us from doing this, so far as I know, is that there's no really convenient way to query multiple sets of coordinates at a time.  Perhaps one of the tools that we can develop here is a good way to do that.
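As a starting point for such a tool, here is one possible shape for it in Python. This is only a sketch: lookup_one stands in for whatever single-point lookup you have available (it is a hypothetical function, not a real API), and fake_lookup below exists only so the example runs. The idea is simply to loop over many coordinates while avoiding repeat lookups for points that are effectively the same address.

```python
def lookup_many(coords, lookup_one, precision=4):
    """Look up an area for each (lat, lon) pair, reusing repeated points.

    lookup_one is any function taking (lat, lon) and returning area info.
    Coordinates are rounded to `precision` decimal places before lookup,
    so near-identical points share a single call.
    """
    cache = {}
    results = []
    for lat, lon in coords:
        key = (round(lat, precision), round(lon, precision))
        if key not in cache:
            cache[key] = lookup_one(*key)
        results.append(cache[key])
    return results


# Stand-in lookup used only for this example; it records each call.
calls = []

def fake_lookup(lat, lon):
    calls.append((lat, lon))
    return "north" if lat >= 40.0 else "south"


points = [(40.7128, -74.0060), (40.7128, -74.0060), (34.0522, -118.2437)]
areas = lookup_many(points, fake_lookup)
print(areas, len(calls))
```

Results come back in the same order as the input, so they can be attached straight back onto the respondent rows.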

DD