Crunching the House Delegation Numbers

I thought it would be fun to talk about how I calculated the numbers in this post. If you are a developer, this will probably be pretty routine (although maybe you can tell me how to fix the problem with non-ASCII characters), but others may find it interesting.

The first thing I needed was a list of House members by state and party, preferably in a machine-readable format. The best source I found for this was https://www.govtrack.us/developers/api. I’m not going to attempt a general definition of API (Application Programming Interface), but this is an example of a public web-based API. It works pretty similarly to a web page: you can access it in a browser, and it basically treats the URL in a manner similar to a command line. Instead of of the server returning a web page–an HTML document designed to be readable by human beings–it returns a JSON document–a structured text document designed to be readable by computers. The output is still returned in the browser, from which you can copy and paste, save as a file, or interact with it in some other way.

The documentation for the API is a bit tricky to make sense of if you aren’t accustomed to API programming, but in fact it’s not all that complicated. Let’s examine the URL that I used:

https://www.govtrack.us/api/v2/role?current=true&role_type=representative&fields=person__firstname,person__lastname,state,party&limit=500

The first part–https://www.govtrack.us–is the same as any other URL. (The “https” part is called the protocol, and the “www.govtrack.us” part is called the hostname). In a simple static web page, the stuff after the hostname would be identifying a particular file on that web server. Here, the “/api/v2/role” part is telling the web server that you want to access the API. Everything after the question mark is the parameters you are sending to the API. Basically, the web server takes those parameters, converts them into a database query, pulls the information you have asked for out of the database, and then structures it for browser-based delivery.

The “current=true” part tells the web server that I’m only interested in currently serving officials.
The “role_type=representative” part says that I’m only interested in members of the House.
The part after “fields=” indicates the particular values I would like to get back. The database has a ton of fields, most of which weren’t relevant for this purpose. (To see them, try removing from the URL everything from “fields” to the next “&”, and then reloading the page.)
The data has a particular structure, and the part after “fields=” reflects that. For instance, data like name and gender live under “person”, and the double-underline is how you say “From within ‘person’, I want the ‘firstname’ field”.
The part after “limit=” tells the server how many records I want returned at once. The current membership of Congress is a small number for a computer, so it is no problem to ask for all of them at once, especially when I just want a handful of fields. For larger data sets, you might have to break up your requests.

The data comes back in a format called JSON, for JavaScript Object Notation. Although JSON syntax is based on JavaScript, many languages support the import of JSON data. The first part of the return file consists of metadata, i.e., data about the data. The “limit” of 500 corresponds to the limit specified in the URL. Had the number of matching records exceeded the limit, the “offset” would have been needed to track which “page” of results this was. The “total_count” is the total number of matching records; it is 440, rather than 435, because the database includes the non-voting delegates from DC, Guam, etc.

To import the data into Python, I first copy-pasted the text into a text editor, and saved it as a file with the name “house.json”, in my Python working directory. Then I made a few tweaks. First, I deleted the whole metadata section, which I didn’t need. Then, for clarity, I changed the generic label “objects” to the more specific label “representatives”.

The code to import the file is short:

import json
with open('house.json') as data_file:
    house_data = json.load(data_file)

Unfortunately, this generated the following error:

Traceback (most recent call last):
  File "/Users/stein/Documents/Python/House.py", line 12, in 
    house_data = json.load(data_file)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/__init__.py", line 265, in load
    return loads(fp.read(),
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6456: ordinal not in range(128)

This is the sort of thing that can make programming very frustrating. The source of the problem turns out to be that the data includes a number of names with accented characters, such as “Gutiérrez”. The accented é is a non-ASCII character, and non-ASCII characters can be quite difficult for data imports to handle. I’m sure there are ways to get it to work, but I tried a few different things, and none of them did. For this purpose, I don’t need representatives’ names to be rendered perfectly. Strictly speaking, I’m not really using them at all, although without them it would be hard to verify that my code was correct. So I resorted to an unusual-sounding command offered by my text editor (TextWrangler): “Zap Gremlins”. This provides the ability to replace all the non-ASCII characters with their closest ASCII equivalents: good enough for this purpose, and it did the trick.

The original data in the file is structured with a “wrapper” around it. This is a flexible mechanism, but if I left it that way, it would make for clunky references. For example, to reference the nth representative in the data set, I would need to refer to:
house_data[‘representatives’][n]

To avoid that, I created a new variable that is easier to reference:

representatives = house_data['representatives']

Now the nth representative is just:
representatives[n]

Next, I needed a JSON list of all the state abbreviations. It wouldn’t be all that much work to create one from scratch, but of course someone else has already done that. I borrowed one from here: https://gist.github.com/mshafrir/2646763, and imported it using the same technique as discussed for the representatives.

The next step is to create variables to store how many representatives from each state are Republicans, Democrats, or other. To do that, I used a quite cool data structure known in Python as a dictionary; it gets declared like this:

rCount = {}
dCount = {}
oCount = {}

The neat thing about a dictionary is that it permits reference by arbitrary values. If you were going to do this with only the very simplest variables, you might create variables like AL_rCount, GA_rCount, SC_rCount, etc. Iterating through them would be a nightmare! Moving up to a more sophisticated variable type, you could use an array. Then you would have rCount[0], rCount[1], rCount[2], etc. That works a lot better, but you still have to keep track of which state corresponds to each index in the array (e.g., AL, is 1, GA is 2, etc.) With a dictionary, you can instead reference the states directly: rCount[‘AL’], rCount[‘GA’], etc. Brilliant!

The first step is to iterate through the states, initializing all the values to 0:

for x in range (0,len(states)):
    rCount[states[x]['abbreviation']] = 0
    dCount[states[x]['abbreviation']] = 0                              
    oCount[states[x]['abbreviation']] = 0

Next, we iterate through all the representatives, tallying the counts by party:

for x in range (0,len(representatives)):
    if representatives[x]['party'] == 'Republican':
        rCount[representatives[x]['state']] += 1;
    elif representatives[x]['party'] == 'Democrat':
        dCount[representatives[x]['state']] += 1;
    else:
        oCount[representatives[x]['state']] += 1;

With these counts finished, we can iterate through the states a second time, this time identifying which party a majority of the House delegation belongs to:

majorityParty = {}
for x in range (0,len(states)):
    if rCount[states[x]['abbreviation']]>dCount[states[x]['abbreviation']]:
        majorityParty[states[x]['abbreviation']]='Republican'
    elif rCount[states[x]['abbreviation']]<dCount[states[x]['abbreviation']]:
        majorityParty[states[x]['abbreviation']]='Democrat'
    elif rCount[states[x]['abbreviation']]==dCount[states[x]['abbreviation']]:        
        majorityParty[states[x]['abbreviation']]='Tie'
    else:
        print("ERROR")

Finally, we need variables to count the number of states for which each party has a majority, so we can iterate through the states to do the count:

rMajorityCount = 0
dMajorityCount = 0
noMajorityCount = 0

for x in range (0, len(states)):
    if majorityParty[states[x]['abbreviation']]=='Republican':
        rMajorityCount += 1
    elif majorityParty[states[x]['abbreviation']]=='Democrat':
        dMajorityCount += 1
    elif majorityParty[states[x]['abbreviation']]=='Tie':
        noMajorityCount += 1
    else:
        print("ERROR")

The resulting numbers aren’t quite right, because the initial Representatives data set includes several non-voting delegates, while the States data set includes both the districts and territories that get delegates (e.g., the District of Columbia and Puerto Rico), and those that do not (e.g., the Northern Mariana Islands). For this purpose, though, it’s really not worth the effort it would take to fix this programatically. It’s easier to just count them, and subtract them from the computed numbers:

The final results are as follows:
Republican majority House delegations:
AL : 6 vs. 1
AK : 1 vs. 0
AZ : 5 vs. 4
AR : 4 vs. 0
CO : 4 vs. 3
FL : 17 vs. 10
GA : 10 vs. 4
ID : 2 vs. 0
IN : 7 vs. 2
IA : 3 vs. 1
KS : 4 vs. 0
KY : 5 vs. 1
LA : 5 vs. 1
MI : 9 vs. 5
MS : 3 vs. 1
MO : 6 vs. 2
MT : 1 vs. 0
NE : 2 vs. 1
NV : 3 vs. 1
NC : 10 vs. 3
ND : 1 vs. 0
OH : 11 vs. 4
OK : 5 vs. 0
PA : 13 vs. 5
SC : 6 vs. 1
SD : 1 vs. 0
TN : 7 vs. 2
TX : 25 vs. 11
UT : 4 vs. 0
VA : 8 vs. 3
WV : 3 vs. 0
WI : 5 vs. 3
WY : 1 vs. 0

Democratic majority House delegations:
CA : 39 vs. 14
CT : 5 vs. 0
DE : 1 vs. 0
HI : 2 vs. 0
IL : 10 vs. 8
MD : 7 vs. 1
MA : 9 vs. 0
MN : 5 vs. 3
NM : 2 vs. 1
NY : 18 vs. 9
OR : 4 vs. 1
RI : 2 vs. 0
VT : 1 vs. 0
WA : 6 vs. 4

Tied House delegations:
ME : 1 vs. 1
NH : 1 vs. 1
NJ : 6 vs. 6

Net figures:
R: 33 states
D: 14 states
Tied: 3 states

The complete code is here:

import json
with open('house.json') as data_file:
    house_data = json.load(data_file)
with open('states.json') as data_file:
    state_data = json.load(data_file, encoding='utf-8')
representatives = house_data['representatives']
states = state_data['states']
#print (states[0]['name'])
#print (len(states))

rCount = {}
dCount = {}
oCount = {}

for x in range (0,len(states)):
    rCount[states[x]['abbreviation']] = 0
    dCount[states[x]['abbreviation']] = 0                              
    oCount[states[x]['abbreviation']] = 0
        
for x in range (0,len(representatives)):
    if representatives[x]['party'] == 'Republican':
        rCount[representatives[x]['state']] += 1;
    elif representatives[x]['party'] == 'Democrat':
        dCount[representatives[x]['state']] += 1;
    else:
        oCount[representatives[x]['state']] += 1;

majorityParty = {}
for x in range (0,len(states)):
    if rCount[states[x]['abbreviation']]>dCount[states[x]['abbreviation']]:
        majorityParty[states[x]['abbreviation']]='Republican'
    elif rCount[states[x]['abbreviation']]<dCount[states[x]['abbreviation']]:
        majorityParty[states[x]['abbreviation']]='Democrat'
    elif rCount[states[x]['abbreviation']]==dCount[states[x]['abbreviation']]:        
        majorityParty[states[x]['abbreviation']]='Tie'
    else:
        print("ERROR")
        
rMajorityCount = 0
dMajorityCount = 0
noMajorityCount = 0

for x in range (0, len(states)):
    if majorityParty[states[x]['abbreviation']]=='Republican':
        rMajorityCount += 1
    elif majorityParty[states[x]['abbreviation']]=='Democrat':
        dMajorityCount += 1
    elif majorityParty[states[x]['abbreviation']]=='Tie':
        noMajorityCount += 1
    else:
        print("ERROR")

print ("Republican majority House delegations:", rMajorityCount)
print ("Democratic majority House delegations:", dMajorityCount)
print ("House delegations with no Majority Party:", noMajorityCount)

print ("Republican majority House delegations:")
for x in range (0,len(states)):
    if majorityParty[states[x]['abbreviation']]=='Republican':
        print(states[x]['abbreviation'],":",rCount[states[x]['abbreviation']], "vs. ",dCount[states[x]['abbreviation']])
    
print ("Democratic majority House delegations:")
for x in range (0,len(states)):
    if majorityParty[states[x]['abbreviation']]=='Democrat':
        print(states[x]['abbreviation'],":",dCount[states[x]['abbreviation']], "vs. ",rCount[states[x]['abbreviation']])

print ("Tied House delegations:")
for x in range (0,len(states)):
    if majorityParty[states[x]['abbreviation']]=='Tie':
        print(states[x]['abbreviation'],":",dCount[states[x]['abbreviation']], "vs. ",rCount[states[x]['abbreviation']])

An Electoral College Plurality

The electoral college system by which we in the U.S. elect our president is famously convoluted, but one aspect is largely unknown even to those who otherwise understand it: what happens if there is no electoral college majority? This has happened twice: in 1800 there was a dead tie between Aaron Burr and Thomas Jefferson, and in 1824 there was a 4-way race in which the top candidate received only 40% of the electoral vote.

So what happens then? Under the 12th Amendment, the House of Representatives picks the new president, choosing from the top three candidates (based on electoral votes). But wait! It gets better! The House votes not as individual representatives–the way they vote on every other matter–but rather as state delegations. That’s right: the 53 representatives from California would collectively cast a single vote, as would the lone representative from Wyoming. Although residents of Washington DC get to vote in presidential elections, their solitary non-voting delegate to the House would not receive a vote here.

While the electors meet and vote (in their respective states) on the first Monday after the second Wednesday in December (I swear, I’m not making this up), the operative date for this purpose is January 6th, when the House and Senate meet in joint session to conduct the official tally of electoral votes. It is at this point that, were there no majority, the House would select the next president. The precise timing is critically important, since under the 20th Amendment, old Congressional terms end and new ones begin on January 3rd, at noon. Thus it is the newly elected House that gets to select the President.

What would it take for this to happen? An electoral college tie is theoretically possible in a close two-way race, but extremely unlikely. By contrast, any remotely viable third-party candidacy could easily send an election to the House. Assuming a close race between the Democrat and Republican, a third party challenger who was able to win even just a few states could make it quite difficult for anyone to achieve an electoral majority.

What does the House look like from this perspective? In a few quick searches, I couldn’t find any tallies of House membership broken out in the way that would matter if the House were to pick the President. This is not surprising, since the House doesn’t function this way for any other purpose that I am aware of. Plus, if this were to happen, the relevant House wouldn’t even be the current House. Nonetheless, I thought it would be an interesting exercise to see what the current House looks like through this lens (I’ve thrown in the Senate for the sake of comparison):

House, by state delegation majority House, by representative Senate, by Senator
Republican 33 (66%) 246 (57%) 54 (54%)
Democratic 14 (28%) 188 (43%) 44 (44%)
Tied / Vacant / Ind 3 (6%) 1 (0%) 2 (2%)

To put it another way, under regular voting rules, the current ratio of Republican votes to Democratic votes in the House is about 1.3 to 1. Under the rules by which the House would select a president, and assuming party-line voting (with evenly-divided delegations deadlocked and unable to vote), this changes to an astounding 2.4 to 1!

In a later post, I may go into the underlying data and discuss how I calculated these numbers using an API from www.govtrack.us and a bit of Python. [Update: That post is now up.]

[Update 2: To think about the implications of the House voting in this manner today versus in 1788 or 1804, it is helpful to look at how the relative sizes of state delegations have changed over time. I created a data visualization that shows this.]

A Histogram of the Monster Data

And here is a histogram representing the distribution of the number of monsters that had to be killed in order to acquire all three types of gems, produced with the following code:

import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(monsterCount,bins=maxMonsterCount-1)

Monster-histogram

Note that this is for a different run of the code than the sample output I showed in the previous post. In this particular 100,000 trial run, there was one trial where it was necessary to kill 67 monsters in order to acquire all three gem types!

Also, copying the image produced a .png file with partial transparency, which has resulted in black-on-gray numbers here that aren’t very legible. Sorry about that. I think there are mechanisms for displaying Python output graphs in a browser window without just copying and pasting images, but I haven’t figured them out yet.

538’s Monsters and Gems Puzzle

538 has a weekly riddler post, and I thought I would try this week’s riddle:

A video game requires you to slay monsters to collect gems. Every time you slay a monster, it drops one of three types of gems: a common gem, an uncommon gem or a rare gem. The probabilities of these gems being dropped are in the ratio of 3:2:1 — three common gems for every two uncommon gems for every one rare gem, on average. If you slay monsters until you have at least one of each of the three types of gems, how many of the most common gems will you end up with, on average?

Here is the answer I submitted:

I couldn’t figure out how to reason about this with the probability theory that I know, so I thought I would try to simulate it. I’ve been playing around with Python recently, so I wrote a program that simulates an arbitrary number of trials. Running this repeatedly with 100,000 trials, rounded averages were mostly consistent to 2 decimal places, although sometimes the last decimal place varied by one: 3.65 common gems, 2.43 uncommon gems, 1.22 rare gems, 7.30 monsters killed.

On average it should take 6 dead monsters to get a rare gem. By then, most of the time we would expect that both common and uncommon gems would have appeared. So it’s surprising to me that the average is as much higher than 6 as it is. Intuitively, I think part of what is going on is that because it always takes at least 3 dead monsters to get all 3 gem types, the cases where a rare gem is drawn early on aren’t able to “balance out” the cases where it takes a really long time, as would be the case in a simpler probability distribution. About 15% of the time, it takes 12 or more kills to get all three gem types. I’m no closer, though, to a mathematical derivation of the exact answer.

I have pasted below a sample output from the program, and below that the program itself.

number of trials: 100000
Average number of common gems: 3.64636
Average number of uncommon gems: 2.43568
Average number of rare gems: 1.21557
Average number of monsters killed: 7.29761
max monster count 56
1 dead monsters: 0 (0.000% | 0.000%)
2 dead monsters: 0 (0.000% | 0.000%)
3 dead monsters: 16744 (16.744% | 16.744%)
4 dead monsters: 16700 (16.700% | 33.444%)
5 dead monsters: 13746 (13.746% | 47.190%)
6 dead monsters: 10839 (10.839% | 58.029%)
7 dead monsters: 8446 (8.446% | 66.475%)
8 dead monsters: 6552 (6.552% | 73.027%)
9 dead monsters: 5176 (5.176% | 78.203%)
10 dead monsters: 3937 (3.937% | 82.140%)
11 dead monsters: 3241 (3.241% | 85.381%)
12 dead monsters: 2689 (2.689% | 88.070%)
13 dead monsters: 2104 (2.104% | 90.174%)
14 dead monsters: 1653 (1.653% | 91.827%)
15 dead monsters: 1376 (1.376% | 93.203%)
16 dead monsters: 1200 (1.200% | 94.403%)
17 dead monsters: 907 (0.907% | 95.310%)
18 dead monsters: 807 (0.807% | 96.117%)
19 dead monsters: 620 (0.620% | 96.737%)
20 dead monsters: 529 (0.529% | 97.266%)
21 dead monsters: 443 (0.443% | 97.709%)
22 dead monsters: 382 (0.382% | 98.091%)
23 dead monsters: 312 (0.312% | 98.403%)
24 dead monsters: 305 (0.305% | 98.708%)
25 dead monsters: 221 (0.221% | 98.929%)
26 dead monsters: 178 (0.178% | 99.107%)
27 dead monsters: 139 (0.139% | 99.246%)
28 dead monsters: 129 (0.129% | 99.375%)
29 dead monsters: 95 (0.095% | 99.470%)
30 dead monsters: 99 (0.099% | 99.569%)
31 dead monsters: 60 (0.060% | 99.629%)
32 dead monsters: 58 (0.058% | 99.687%)
33 dead monsters: 46 (0.046% | 99.733%)
34 dead monsters: 50 (0.050% | 99.783%)
35 dead monsters: 33 (0.033% | 99.816%)
36 dead monsters: 31 (0.031% | 99.847%)
37 dead monsters: 25 (0.025% | 99.872%)
38 dead monsters: 25 (0.025% | 99.897%)
39 dead monsters: 19 (0.019% | 99.916%)
40 dead monsters: 19 (0.019% | 99.935%)
41 dead monsters: 14 (0.014% | 99.949%)
42 dead monsters: 15 (0.015% | 99.964%)
43 dead monsters: 6 (0.006% | 99.970%)
44 dead monsters: 5 (0.005% | 99.975%)
45 dead monsters: 6 (0.006% | 99.981%)
46 dead monsters: 4 (0.004% | 99.985%)
47 dead monsters: 4 (0.004% | 99.989%)
48 dead monsters: 1 (0.001% | 99.990%)
49 dead monsters: 2 (0.002% | 99.992%)
50 dead monsters: 0 (0.000% | 99.992%)
51 dead monsters: 3 (0.003% | 99.995%)
52 dead monsters: 3 (0.003% | 99.998%)
53 dead monsters: 1 (0.001% | 99.999%)
54 dead monsters: 0 (0.000% | 99.999%)
55 dead monsters: 0 (0.000% | 99.999%)
56 dead monsters: 1 (0.001% | 100.000%)

Here is my python 3 code:

import random
commonCount=[]
uncommonCount=[]
rareCount=[]
monsterCount=[]
numTrials = 100000

maxMonsterCount = 0;

for i in range (0,numTrials):
    commonCount.append(0)
    uncommonCount.append(0)
    rareCount.append(0)
    monsterCount.append(0)

for i in range (0,numTrials):
    while (commonCount[i]==0 or uncommonCount[i]==0 or rareCount[i]==0):
        monsterCount[i] +=1
        if (monsterCount[i] > maxMonsterCount):
            maxMonsterCount = monsterCount[i]
        monster = random.randrange(1,7)
 #randrange syntax includes the first parameter, but excludes the second       
        if (monster==1 or monster==2 or monster==3):
            commonCount[i] += 1
        elif (monster==4 or monster==5):
            uncommonCount[i] += 1
        elif (monster==6):
            rareCount[i] += 1
        else:
            print("ERROR!")            
# Don't run the following lines unless numTrials is small ...
#    print ("i =",i)
#    print ("common:", commonCount[i])
#    print ("uncommon:", uncommonCount[i])
#    print ("rare:", rareCount[i])
#    print ("total:", monsterCount[i])

commonTotal = 0;
uncommonTotal = 0;
rareTotal = 0;
overallTotal = 0;

for i in range (0, numTrials):
    commonTotal += commonCount[i]
    uncommonTotal += uncommonCount[i]
    rareTotal += rareCount[i]
    overallTotal += monsterCount[i]

commonAverage = commonTotal/numTrials
uncommonAverage = uncommonTotal/numTrials
rareAverage = rareTotal/numTrials
totalAverage = overallTotal/numTrials
print ("number of trials:",numTrials)
print ("Average number of common gems:",commonAverage)
print ("Average number of uncommon gems:",uncommonAverage)
print ("Average number of rare gems:",rareAverage)
print ("Average number of monsters killed:", totalAverage)

print("max monster count", maxMonsterCount);

killDistribution=[]
for a in range (0,maxMonsterCount+2):
    killDistribution.append(0)
for i in range (0,numTrials):
    killDistribution[monsterCount[i]] += 1

cumulativePercentage = 0;
for a in range (1,maxMonsterCount+1):
#1 and 2 should always be zero, but I like to leave them in anyway.
    cumulativePercentage += killDistribution[a]/numTrials*100
    print ('{: d} dead monsters: {: d} ({:04.3f}% | {:04.3f}%)'.format(a,killDistribution[a],killDistribution[a]/numTrials*100,cumulativePercentage))