Scoping Projects

We wrote about what a good Data Science for Social Good project looks like. Our post begins:

Data Science for Social Good is a summer program that requires year-round preparation. A successful summer requires a mix of good people and projects, and we spend a lot of time trying to find projects to solve and the people to solve them. In addition to reading over 800 applications from aspiring fellows, mentors, and project managers, we’ve spent numerous hours researching, pursuing, and scoping projects: exploring datasets, speaking with representatives, and wrangling with attorneys. Well over a hundred projects will cross our emails, phones, and eyes before we find the 12 to do next summer.

Read more here.

Criminal Justice and n Guilty Men

When classifying cases for all but trivial problems -- for example, classifying a student as failing, an armed conflict as war, or a recipe as Italian, Thai, or French cuisine -- we need to choose a tradeoff between true positives and false positives. This is true even in criminal justice, where we would like to classify people as criminals or law-abiding citizens.

One way to measure the justness of a criminal-justice system is to look at "n guilty men": how many guilty persons that system lets go free for every innocent person it punishes. North Korea cares far more about capturing true positives than about avoiding false positives, so it will arrest and punish people even when there isn't much evidence of wrongdoing. In contrast, democracies tend to err on the side of the accused, letting more guilty men go free to avoid incorrect imprisonment. Reasonable people can disagree about what n should be, but few would argue that North Korea's n, which is surely below 1, is more just than a democratic n, which is often claimed to be 1 or greater.

It turns out that getting a large n is often difficult. The table below illustrates this point with a hypothetical terrorist-identification program. Suppose there are 10,000 "terrorists" (probably more than there are in the US) in a population of 300,000,000 (smaller than the US).

Whether the program uses human experts or a sophisticated data-mining algorithm doesn't matter: even when we can correctly classify the vast majority of terrorists and civilians, n will be small.

US population                          300,000,000
Terrorists in the US                        10,000
% Terrorists Flagged as Terrorist             99.9
% Civilians Flagged as Civilian              99.99
% Flagged Who Are Civilians                   75.0
"n Guilty Men"                              0.0003

                     True Terrorist   True Civilian         Total
Flagged Terrorist             9,990          29,999        39,989
Flagged Civilian                 10     299,960,001   299,960,011
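
The arithmetic behind the table can be checked with a minimal Python sketch. It rebuilds the four cells of the confusion matrix from the population, the number of terrorists, and the two classification rates, then computes n as guilty people missed per innocent person flagged. The function and key names are illustrative assumptions, not part of any library.

```python
def n_guilty_men(population, terrorists, terrorist_hit_rate, civilian_clear_rate):
    """Rebuild the confusion matrix and compute n.

    terrorist_hit_rate: share of true terrorists flagged as terrorists.
    civilian_clear_rate: share of true civilians flagged as civilians.
    Assumes at least one civilian is wrongly flagged (otherwise n is undefined).
    """
    civilians = population - terrorists
    caught = round(terrorists * terrorist_hit_rate)    # true positives
    missed = terrorists - caught                       # guilty who go free
    cleared = round(civilians * civilian_clear_rate)   # true negatives
    wrongly_flagged = civilians - cleared              # innocents flagged
    flagged = caught + wrongly_flagged
    return {
        "caught": caught,
        "missed": missed,
        "cleared": cleared,
        "wrongly_flagged": wrongly_flagged,
        "share_flagged_who_are_civilians": wrongly_flagged / flagged,
        "n": missed / wrongly_flagged,
    }


# Inputs taken from the hypothetical program described above.
result = n_guilty_men(300_000_000, 10_000, 0.999, 0.9999)
print(result)
```

With these inputs the program misses 10 terrorists and wrongly flags 29,999 civilians, so n is about 0.0003 even though both classification rates exceed 99.9%.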

Ranking Hockey Fighters

Hockey-fight enthusiasts talk a lot about who they think the best fighters are, so I decided to take a look at the data. Using more than 7,000 crowdsourced win-loss-draw records (thanks to David Singer, the operator of the website that collects them, for letting me), I ranked over 1,400 fighters.

Some notes of interest:

  • A home-ice advantage appears. The home fighter wins 41% of his fights, loses 35%, and draws 24%.
  • The site only reports the results when ten people have voted. Lots of people vote almost as soon as a fight happens, but they don't go into the archives and vote on old fights. This means we don't have much data for fighters from past generations. Chris "Knuckles" Nilan, for example, fought 254 times, but we only have data for 6 of those fights.
  • What I did is similar to ranking football teams. The algorithm I used, penalized ordinal logistic regression, bumps a fighter up more for beating a quality opponent and less for beating a poor opponent. Marian Hossa is highly ranked even though he's only been in three rated fights because he won all three and one win was against John Erskine, a highly rated opponent.
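
To make the last note concrete, here is a minimal sketch of a penalized ordinal logistic model for paired contests. It is not the code behind the rankings: the symmetric cutpoint `cut`, the ridge `penalty`, the plain gradient-ascent fit, and the omission of a home-ice term are all simplifying assumptions, and the fighter names "A", "B", and "C" are hypothetical.

```python
import math

OUTCOME_CODE = {"loss": 0, "draw": 1, "win": 2}  # from the home fighter's view


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rate_fighters(fights, cut=0.5, penalty=1.0, step=0.1, iterations=2000):
    """Fit one rating per fighter with a ridge-penalized cumulative logit.

    fights: list of (home, away, outcome), outcome in {"win", "draw", "loss"}.
    Model (an assumption): with eta = rating[home] - rating[away],
      P(loss) = sigmoid(-cut - eta), P(loss or draw) = sigmoid(cut - eta).
    The small fixed step is fine for toy data; larger data may need tuning.
    """
    ratings = {}
    for home, away, _ in fights:
        ratings.setdefault(home, 0.0)
        ratings.setdefault(away, 0.0)

    for _ in range(iterations):
        # Gradient of the penalized log-likelihood; the ridge term pulls
        # every rating toward zero.
        grad = {name: -penalty * value for name, value in ratings.items()}
        for home, away, outcome in fights:
            eta = ratings[home] - ratings[away]
            f0 = sigmoid(-cut - eta)  # P(home loses)
            f1 = sigmoid(cut - eta)   # P(home loses or draws)
            y = OUTCOME_CODE[outcome]
            if y == 0:
                g = f0 - 1.0          # d log P(loss) / d eta
            elif y == 1:
                g = f0 + f1 - 1.0     # d log P(draw) / d eta
            else:
                g = f1                # d log P(win) / d eta
            grad[home] += g
            grad[away] -= g
        for name in ratings:
            ratings[name] += step * grad[name]
    return ratings


# Hypothetical toy records, for illustration only.
toy_fights = [("A", "B", "win"), ("C", "A", "loss"),
              ("B", "C", "win"), ("C", "B", "draw")]
ratings = rate_fighters(toy_fights)
for name in sorted(ratings, key=ratings.get, reverse=True):
    print(name, round(ratings[name], 3))
```

In this sketch the push a winner receives is `sigmoid(cut - eta)`, which grows as the opponent's rating rises relative to the winner's. That is the sense in which a win over a strong opponent counts for more, while the penalty keeps fighters with only a few fights from running away with extreme ratings.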

You can download the results here.

Hockey Teams that Fight More Lose More

Don Cherry often says teams that fight more win more (for example, here). I decided to take a longer look at the data. Since 1967-1968, when the NHL expanded, there have only been ten seasons in which the correlation between points and fights is positive. You can download the graphs here.
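
The per-season comparison can be sketched in a few lines of Python. This is a minimal illustration assuming one row per team-season with that team's points and fight count; the row layout and function names are assumptions, and no real NHL data is included.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_by_season(rows):
    """rows: iterable of (season, points, fights), one per team-season.

    Returns {season: correlation between team points and team fights}.
    """
    points = defaultdict(list)
    fights = defaultdict(list)
    for season, team_points, team_fights in rows:
        points[season].append(team_points)
        fights[season].append(team_fights)
    return {season: pearson(points[season], fights[season]) for season in points}


def positive_seasons(rows):
    """Seasons in which teams that fought more also earned more points."""
    return [s for s, r in correlation_by_season(rows).items() if r > 0]
```

Counting the seasons returned by `positive_seasons` gives the kind of tally quoted above.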

Visiting the Charlotte-Mecklenburg Police Department

Ayesha Mahmud, Kenny Joseph, and I wrote about the trip and how it affects our work. The post begins:

Police departments around the country have been in the spotlight recently because of several controversial, high-profile incidents. Tragic events in Ferguson, New York City, Baltimore, and elsewhere have highlighted the need for police departments to better address the issue of adverse interactions between the police and the public. Many police departments are working hard to avoid these negative interactions with new technologies and tactics, while others are leading new data collection efforts.

This summer, as part of the White House Police Data Initiative, fellows Sam Carton, Kenny Joseph, Ayesha Mahmud, and Youngsoo Park, technical mentor Joe Walsh, and project manager Lauren Haynes are working with the Charlotte-Mecklenburg Police Department (CMPD) on a novel approach: using data science to improve the department’s Early Intervention System (EIS) for flagging officers who may be at a high risk for being involved in an adverse interaction.

Read more here.

Text Re-Use in Scott Walker's Abortion Bill

Eugenia Giraudy, Matt Burgess, Julian Katz-Samuels, and I wrote another blog post for Data Science for Social Good. It starts:

On Monday, Wisconsin governor and 2016 presidential candidate Scott Walker signed into law a bill banning non-emergency abortions past the 19th week of pregnancy. Unsurprisingly, Walker’s move garnered support from one side, derision from the other, and media attention from both. However, journalists face a big hurdle when trying to provide context for a story such as this: it is time-consuming to figure out how many states have introduced similar legislation and where it originated.

Automated detection of copied legislation can help. Data Science for Social Good fellows Matt Burgess, Eugenia Giraudy, and Julian Katz-Samuels, technical mentor Joe Walsh, and project manager Lauren Haynes are working with the Sunlight Foundation to make it easier to find re-used text. Using Sunlight’s corpus of state legislation, our computational tools uncover textual similarities.

Read more here.

Finding Legislative Plagiarism

Eugenia Giraudy and I wrote a blog post introducing our Data Science for Social Good project:

In 2005, Florida implemented a new “Stand Your Ground” law, which legally protected the use of deadly force in self-defense. The law, which removes the “duty to retreat” when a person is threatened with serious bodily harm, gained national attention after George Zimmerman fatally shot Trayvon Martin in 2012.

Soon after its passage in Florida, Stand Your Ground laws went “viral,” spreading to other parts of the country. Currently, at least two dozen states have implemented a version of Florida’s legislation. These laws didn’t arise in response to broad, spontaneous popular demand. Interest groups, in particular the National Rifle Association and the American Legislative Exchange Council (ALEC), drafted a model bill to ease passage across the country. Ten states have passed nearly identical bills to the ones Florida used and ALEC promoted.

Read more here.

Simple (as Possible) Drake Installation

Factual has created a data-workflow tool called Drake. Drake lets the analyst outline her command-line instructions -- including data collection, pre-processing, analysis, validation, and visualization -- and easily run them together. If the analyst modifies code or data in the workflow, Drake automatically re-runs all instructions that depend on the modified piece. This makes for a cleaner, more efficient, more reproducible workflow.

Installing Drake requires a Java JDK, Leiningen, the Drake uberjar, and a shell script. Here I provide a series of steps that can install these things on an Ubuntu system. Note that I chose to put the Drake files in the /usr/local/bin/ directory, which resides in my PATH.

Continue reading

Adding New Users to Existing EC2s

It's somewhat straightforward to add a user to an AWS security group and then create an AWS instance that the new user can access. It's more difficult to grant a new user access to existing instances. I don't want to waste time trying to find an answer again (Amazon, your AWS documentation could use some work!), so I'm posting my solution here, where future me and other confused individuals can find it. Here are the steps:

  1. Create a new user in IAM.
  2. Go to 'users' in IAM and add the new user to the appropriate security group.
  3. Go to 'users' in OpsWorks and 'import IAM users' (at the bottom of the page).
  4. Choose the user(s) you'd like to add and click on 'import to OpsWorks'.
  5. Click on the user you just imported and copy the public key into the provided box. Also enable SSH access (a checkbox below) so the user can SSH into the instance.

OpsWorks automatically executes a recipe that pushes the new user's permissions to the instances, which takes a minute or two. The new user can then log in.

To remove the user's permissions, go to OpsWorks -> users -> [user's account] -> 'deny permission'.