Building a Voter File: Quality Control

In the last part of the Voter File tutorial, we loaded the Adams County, Ohio file into SQL.  It should now be ready to use.  But first, we need to go through some basic quality control on the file, which I'll show you how to do below the jump.

Voter file work is exacting, and even small errors can have large impacts on your data.  So no matter how much you trust the source of your voter file, it behooves you to do some basic checks on their work.  Today, we're going to give the Adams County file a thorough once-over before we get down to actually using it.  Here are some basic checks you should think through whenever you receive a new file.

  • Is the file up to date, and does it contain new information? You're always going to want the most recent iteration of the file.  We can see on the OH download page that the files were updated on November 17 of this year, so grab that file and load it into SQL.  Make sure to give your table a distinctive name--for this example, I'll use adams_111708.  To check if this is actually the updated file, the easiest thing to do is look at registration dates.  The following code should work: select regdate,count(*) from adams_111708 group by 1 order by 1 desc; 

This will display a list of people, in reverse order by registration date.  The output looks like this:

  regdate   | count
------------+-------
 2008-10-06 |   158
 2008-10-03 |    33
 2008-10-02 |    13
 2008-10-01 |    17
 2008-09-30 |    21
 

We can see that the last registration date is October 6, which corresponds to Ohio's registration deadline; also, it's worth looking at the counts on each date to make sure they don't seem outrageous.  These appear to be in order.

  • Did the file load correctly? Any text editor will give you a line count on the original file we downloaded from the Secretary of State (in this case, 17,795).  Simply run select count(*) from adams_111708 and make sure that your table has the same number of rows.
  • Is there a unique ID, and is it working correctly? From the very beginning of this series, I've emphasized the importance of a unique identifier.  Since we're using the Sec. of State ID, and not generating our own, we should check to make sure that it's been applied correctly.  We know there are 17,795 rows in our table; to check the SOS ID, simply run select count(distinct voterid) from adams_111708;.  This will show how many unique values exist in the column voterid, where we've placed the SOS ID.  Again, we have 17,795, so this is working fine.
  • Is there good geographic coverage? We want to make sure that no areas were accidentally left out of the file.  On a statewide file, we can check this fairly easily, by making sure that all counties and state legislative districts are present and accounted for.  On a county file, it's probably wisest to use precincts.  First, we can see that there are 35 precincts in Adams County (a good place to get this information is usually the SOS election results page).  Next, let's count the number of precincts in our file: select count(distinct precinctcode) from adams_111708;.  There are 35, so no geographic area was excluded from this file.

These four checks don't exclude the possibility of other problems.  But they're by far the most common issues with a voter file, and other problems will turn up in the course of using the file.  So it makes sense to focus your checks on these areas.  Run a few others--address coverage can be problematic as well--but after these checks it's safe to say we have a working voter file.  Next week I'll show you what we can do with it.