Building a Voter File: Matching Lists

In my efforts to duplicate the straightforward plots of David Foster Wallace, I'm going to circle back yet again to one of my earlier posts.  There's more below the fold...

When I wrote about the process of appending outside consumer data to state or local voter files, I cautioned strongly about attempting to match multiple data sources:

Aside from gathering all this information into one place, these companies also have the expertise to match it to any data you might already possess (you do not have this expertise).  Matching records is tough--people move, get married, die, have children with the same name as them, and generally make life hard.  Gathering, collating and coordinating all this information is a massive enterprise

I'll mostly stand by this, but nevertheless there are going to be times when you will want to match two disparate sources of information.  I'm going to cover just one today: matching older and newer versions of the same file.  Incidentally, I've been stressing the importance of consistent IDs throughout this series--this should explain why.

Let's start with a problem that's all too common--two files for the same area, from the same source, that lack a consistent and unique identifier.  Maybe the state switched the way they generate the IDs.  Maybe they have different IDs for county and statewide, and only included one.  Maybe they don't like disclosing their voter IDs at all.  Point is, you have two versions of the same file, from the same source, without a consistent ID across the files.

It is still possible, however, to establish a unique and consistent mapping for both files.  Let's go back to our example of Adams county.  The most recent version of the file contains 17,795 records.  Let's say that this file is replacing an older file, with 16,950 records, which we had attached various bits of information to--contact information from a field program, better phone numbers, or what-have-you.  We don't want to lose the work we've put into the old file, but we still need to replace it. 

If we have an ID that's consistent across both files, it's a snap.  We can use a simple SQL operation called a join, which matches two tables based on any column(s) we specify that they have in common.  So, for instance, let's say our old table was called adams_031808, and that in addition to the columns from the most recent file, it contains a column called voteintent, based on our canvassing efforts.  The first thing to do is add this column to our most recent table.  This can be accomplished fairly simply--alter table adams_111708 add voteintent (varchar(5));.  Now that we have a target for the data we'll be bringing over, let's see how many of the people in our old file are in the new one.  To do this, simply run the following: select count(*) from adams_081808 inner join adams_111708 using (voterid);.

This should spit out the number of values in the voterid column that appear in both tables.  If everyone from the old file is still on the new one, the result of that query should be exactly 16,950; however, we can expect it to be a little lower, since people will have moved, dropped off the rolls, or otherwise become unavailable.  Assuming that number doesn't set off any alarm bells, it's easy to bring over the voterintent column: update adams_111708 a set a.voterintent=b.voterintent from adams_031808 b where b.voterid=a.voterid;.  The "a" and "b" after the tablenames are aliases, there for convenience, but the gist should be clear--set the new voterintent equal to the old voterintent where the IDs are the same.

What happens if you don't have a consistent ID? This gets a little trickier, but it does use roughly the same process.  First of all, make sure there's one column that has the same name across each table, which will end up containing the consistent ID--this makes things more convenient and easier to remember.  We'll use voterid, and we'll assume that adams_111708 has one and adams_031808 has no such column.  Create it for the older table.  Now, we have to match the files--how do we do that?

This, admittedly, is not an exact process.  Basically, we have to ask ourselves: what information has to be the same for me to assume that someone is the same person? Name is not good enough, especially if you're dealing with large datasets.  But on the other hand, if somebody has the same first name, last name, date of birth, street address, town and zipcode as someone on the old file, we can probably surmise that they're in fact the same individual (note that the date of birth is doing a lot of work here--without it, we might have a problem with Joe Schmo Jr. vs. Sr.).  Let's do a check on these criteria for matching: select count(*) from adams_031808 a inner join adams_111708 b on (a.fname=b.fname and a.lname=b.lname and a.birthyear=b.birthyear and a.regaddr=b.regaddr and a.regcity=b.regcity and a.rzip=b.rzip).

Right up front, we can see there will be problems with these criteria.  For starters, we lose anyone who's moved automatically.  Secondly, a typo, alternate spelling or bad standardization in any one of those fields would throw off the whole match.  In short, this is not going to catch everyone from the old file who appears on the new.

On the other hand, we don't want to set too loose a criteria either.  After all, a false match (like the Jr./Sr. confusion, or just someone who happens to have the same name) will also mess up your file.  It's not as big an issue for a smaller jurisdiction like this, but it can and will happen in larger files (e.g. states). 

So how do you strike a balance? Generally, for a statewide file, you'll want to match at least 90% of the records in the old file.  But sometimes this won't be possible, depending on file quality.  Get the best match you can, while at a minimum matching on first and last name and zipcode.  This will miss movers, yes, but it will catch most everyone else while avoiding a lot of the problems with address standardization that can plague this sort of match.  I'll be discussing those problems and others in an upcoming post, so stay tuned.

teeth whitening