Scoping Projects

We wrote about what a good Data Science for Social Good project looks like. Our post begins:

Data Science for Social Good is a summer program that requires year-round preparation. A successful summer requires a mix of good people and projects, and we spend a lot of time trying to find projects to solve and the people to solve them. In addition to reading over 800 applications from aspiring fellows, mentors, and project managers, we’ve spent numerous hours researching, pursuing, and scoping projects: exploring datasets, speaking with representatives, and wrangling with attorneys. Well over a hundred projects will cross our emails, phones, and eyes before we find the 12 to do next summer.

Read more here.

Criminal Justice and n Guilty Men

When classifying cases for all but trivial problems -- for example, classifying a student as failing, an armed conflict as war, or a recipe as Italian, Thai, or French cuisine -- we need to choose a tradeoff between true positives and false positives: catching more of the cases we care about usually means wrongly flagging more of the ones we don't. This is true even in criminal justice, where we would like to classify persons as criminals or law-abiding citizens.

One way to measure the justness of a criminal-justice system is to look at "n guilty men": how many guilty persons that system lets go free for every innocent person it punishes. North Korea cares far more about capturing true positives than about avoiding false positives, so it will arrest and punish people even when there isn't much evidence of wrongdoing. In contrast, democracies tend to err on the side of the accused, letting more guilty men go free to avoid incorrect imprisonment. Reasonable people can disagree about what n should be, but few would argue that North Korea's n, which is surely below 1, is more just than a democratic n, which is often claimed to be 1 or greater.

It turns out that getting a large n is often difficult. The table below tries to make this point using a hypothetical terrorist-identification program. Suppose there are 10,000 "terrorists" (probably more than there are in the US) in a population of 300,000,000 (smaller than the US population).

Whether the program uses human experts or a sophisticated data-mining algorithm doesn't matter: even when we can correctly classify the vast majority of terrorists and civilians, n will be small.

US population                        300,000,000
Terrorists in the US                 10,000
% Terrorists Flagged                 99.9
% Civilians Flagged as Civilians     99.99
% Flagged Who Are Civilians          75.0
"n Guilty Men"                       .0003

                     True Terrorist    True Civilian    Total
Flagged Terrorist    9,990             29,999           39,989
Flagged Civilian     10                299,960,001      299,960,011
Total                10,000            299,990,000      300,000,000
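
For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation. It assumes the figures from the text (10,000 terrorists in a population of 300,000,000, with 99.9% of terrorists flagged and 99.99% of civilians correctly left unflagged). The function name n_guilty_men and the dictionary keys are our own labels for illustration, not part of any real program.

```python
from fractions import Fraction


def n_guilty_men(population, guilty, pct_guilty_flagged, pct_innocent_cleared):
    """Return confusion-matrix counts and n for a hypothetical flagging program.

    Percentages are passed as strings (e.g. "99.9") so they can be turned
    into exact fractions and the counts come out as whole numbers.
    """
    innocent = population - guilty
    true_pos = guilty * Fraction(pct_guilty_flagged) / 100      # guilty and flagged
    true_neg = innocent * Fraction(pct_innocent_cleared) / 100  # innocent and not flagged
    false_neg = guilty - true_pos    # guilty but not flagged (go free)
    false_pos = innocent - true_neg  # innocent but flagged (punished)
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "true_neg": true_neg,
        # n: guilty persons let free for every innocent person punished
        "n": false_neg / false_pos,
        # share of everyone flagged who is actually innocent
        "share_flagged_innocent": false_pos / (true_pos + false_pos),
    }


result = n_guilty_men(300_000_000, 10_000, "99.9", "99.99")
print(f"terrorists flagged:          {int(result['true_pos']):,}")
print(f"civilians flagged:           {int(result['false_pos']):,}")
print(f"terrorists missed:           {int(result['false_neg']):,}")
print(f"% flagged who are civilians: {float(result['share_flagged_innocent']) * 100:.1f}")
print(f"n guilty men:                {float(result['n']):.4f}")
```

With these inputs the sketch gives 29,999 wrongly flagged civilians against 10 missed terrorists, so n is about .0003. Changing the two percentages shows how accurate the program must be before n gets anywhere near 1.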