Secondary Screening


February 04, 2005 | A Search By Any Other Name

Slate has been running excerpts from a book called Safe: The Race to Protect Ourselves in a Newly Dangerous World (affiliate url here, normal url here).

I plan to get to it soon, but first I want to talk about today's excerpt, which concerns data mining, Total Information Awareness and a technique called one-way hashing. I'll include a snippet from today's Slate story, but if you really want to follow the argument here, go read the whole piece first.

This excerpt is about technology created by Jeff Jonas, a computer scientist who founded a company called SRD, which gained venture capital from the CIA and was recently bought by IBM. His software was originally used to spot casino cheaters by finding hidden links between individuals.

But the death of TIA was not the end of data mining's application for security questions. In fact many of the most controversial TIA projects simply switched funding sources to classified ones. Finding a way to scan and exchange data remains an active interest of intelligence agencies. One question, then, is whether there are technical ways to mine networked data and preserve both secrecy and privacy at the same time. Jonas thinks he has an answer, which [Jeff Jonas] says came to him after he heard that the government had trouble keeping its watch-list data under wraps. He also knew -- from the TIA controversy and the firestorm of criticism over airlines such as JetBlue giving passenger data to the government -- that Americans are becoming increasingly skeptical of corporations handing over their personal data to the government. What Jonas came up with is a means to anonymize information but still allow it to be searched for links. He named it ANNA, and he says it's the answer to "how to know everything about everyone without knowing anything about anyone."

ANNA works like this: The software takes a set of data and applies a mathematical encryption formula that converts each piece of data -- a name, an address, a phone number -- into an indecipherable string of characters. The name al-Midhar, for example, could be transformed into cbd034409c22929518fa494f99dc9964. It's called a one-way hash, and in the case of ANNA, the hash function serves to create an anonymous version of the information stored in the database. Each string of numbers is unique, so if two pieces of data differ by even a letter or a comma, the resulting hash will be completely different. ANNA also takes the common data errors found by NORA -- misspellings of names, transposed birth dates -- and hashes them as well. Then it does the same for the names and other information on the watch list (which might include birth dates, addresses, or Social Security numbers). Once all the data is hashed, NORA or another system could search for matches between the unique numbers without ever revealing the underlying data.

Let's say the government is looking for a particular suspect, John Doe, and wants to find out if certain companies have any data about him. It runs a hash on "John Doe," his birth date, Social Security number, and any other information it has on him. The result is a string of letters and numbers. It then hands that string over to the companies, which have run the same hash function on all of their data. Then the company simply looks for matching strings in its database. If it finds one, it alerts the government, which then could obtain a court order to un-anonymize the data.
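For readers who want to see the mechanics the excerpt describes, here is a minimal sketch in Python. It is not ANNA's code: the excerpt does not name the hash function, so SHA-256 is an assumption, and the helper names, the normalization step and the sample records are all made up for illustration.

```python
import hashlib

def one_way_hash(value: str) -> str:
    # SHA-256 is an assumption; the excerpt does not say which function ANNA uses.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def hash_record(name: str, birth_date: str) -> str:
    # Hypothetical normalization: trim and lowercase so formatting noise still matches.
    normalized = "|".join(part.strip().lower() for part in (name, birth_date))
    return one_way_hash(normalized)

# The same input always gives the same digest; a two-letter swap gives an unrelated one.
same = one_way_hash("al-Midhar") == one_way_hash("al-Midhar")
different = one_way_hash("al-Midhar") != one_way_hash("al-Mihdar")

# The government hashes its watch-list entry and hands over only the digest.
watch_list_hashes = {hash_record("John Doe", "1970-01-01")}

# The company runs the same function over its own (invented) customer records.
company_records = [
    ("Jane Roe", "1982-05-17"),
    ("john doe ", "1970-01-01"),   # same person, sloppy formatting
    ("John Doe", "1971-01-01"),    # same name, different birth date
]
hits = [rec for rec in company_records if hash_record(*rec) in watch_list_hashes]

print(same, different)
print(hits)
```

Only the second company record comes back as a hit. The third differs by a single digit in the birth date, which is exactly why the excerpt says the system also has to hash likely misspellings and transpositions.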

First, Jeff Jonas is a really smart guy. I've met him before and interviewed him for at least one story.

Jonas used to be a severe critic of Total Information Awareness-style data mining, which looks for patterns of behavior to find possible suspects. He contrasted that with his system, which starts with a suspicion about an individual and then looks to see who that individual is connected to. I assume, though I don't know, that he is still of this opinion.

Now, Jonas's system would anonymize data, but despite the attempt to look for misspellings, transposed digits and variations on names, there is huge room for error in such a matching system. For instance, every David Nelson would share the same hash number. Databases also differ significantly in their ability to differentiate between individuals. A National Rifle Association email list may only contain a name and an email address, while a bank would have much more. Unless every individual has a unique identifier attached to their every transaction, there is a huge problem with incorrect identification.
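The David Nelson problem is easy to demonstrate. The sketch below assumes SHA-256 as the hash and uses two invented people; `hash_fields` is a hypothetical helper, not part of any real system.

```python
import hashlib

def hash_fields(*fields: str) -> str:
    # Join the normalized fields and hash them; SHA-256 is an assumption.
    joined = "|".join(f.strip().lower() for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Two hypothetical people who share a name and nothing else.
nelson_a = {"name": "David Nelson", "birth_date": "1951-03-02"}
nelson_b = {"name": "David Nelson", "birth_date": "1988-11-20"}

# Hashing the name alone cannot tell them apart.
name_only_collides = hash_fields(nelson_a["name"]) == hash_fields(nelson_b["name"])

# Adding a second attribute separates them, but only if the database has one.
name_plus_dob_collides = (
    hash_fields(nelson_a["name"], nelson_a["birth_date"])
    == hash_fields(nelson_b["name"], nelson_b["birth_date"])
)

print(name_only_collides)
print(name_plus_dob_collides)
```

A list that holds nothing but a name and an email address has no second attribute to add, so hashing does nothing to reduce the ambiguity that was already in the data.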

Moreover, what kinds of databases does Jonas envision his system having immediate access to? As part of the Markle Task Force report on the need for a centralized national security IT structure, he wrote (.pdf) this:

Counterterrorism officers should be able to identify known associates of the terrorist suspect within 30 seconds, using shared addresses, records of phone calls to and from the suspect’s phone, emails to and from the suspect’s accounts, financial transactions, travel history and reservations, and common memberships in organizations, including (with appropriate safeguards) religious and expressive organizations.

Here Jonas is talking about a system that would have access to your emails, your online behavior, your purchase records, lists of what numbers you called, where you have traveled and what political and religious groups you belong to.

Maybe that's something the country would agree to, but make no mistake: it would be a massive change.

Two other points. First, anonymization was much talked about with Total Information Awareness.

Just because government agencies or algorithms don't know your name, that does not mean you are not being surveilled.

Imagine if a little spider robot sneaked into your house every day, poking around for drugs and plugging into your computer's USB port to search for child pornography or unauthorized MP3s.

It doesn't know your name, but if it finds something suspicious, it alerts an officer, who gets authorization from his superior or a judge to reveal your identity and further search your house.

Now, to be fair, that's a real-world analogy to an anonymized Total Information Awareness model that has suffered from mission creep.

In the original TIA conception, the little partly-blind spider would only look in your computer and those of every company and hospital for indicators that you were part of a terrorist conspiracy, though it might come by a couple of times a day.

In Jonas's ANNA model, the blinkered spider would only visit your house, your work and your church if you were a terrorism suspect or if you had a connection to a terrorism suspect, such as living in the same apartment building or visiting the same chat room.

And finally, there is nothing that prohibits the government from using information about other possible crimes when conducting a legitimate search, and there's no technical barrier to using a system like ANNA to track down mobsters, file traders or recreational drug users.

Those are political questions. I don't mean to disparage the idea of anonymization using one-way hashes -- it could be a very useful tool for protecting privacy and civil liberties.

However, I'm skeptical of its misuse in political arguments, and I'm distrustful of Wired Magazine-style techno-evangelism.

That said, I'm still planning on buying Safe later today.

(And just a note about the history of Total Information Awareness -- it was not The New York Times that first revealed the program's existence in November 2002, as today's Slate excerpt would have it. Wired News freelancer Elliot Borin beat the newspaper of record by almost three months.)

Posted by Ryan Singel at February 4, 2005 08:45 AM

Hi Ryan!

I thought I would comment on a few points you made in your February 4, 2005 article entitled “A Search by Any Other Name”.

YOU STATED: “Now, Jonas's system would anonymize data, but despite the attempt to look for misspellings, transposed digits and variations on names, there is huge room for error in such a matching system. For instance, every David Nelson would share the same hash number.”

MY THOUGHTS: There is an unacceptable “huge room for error” in systems involving “name only matching.” Ours is not such a system, and I have been on record firmly opposed to name only matches (particularly when used against large data sets) because of the civil liberties invasiveness associated with the resulting number of false positives. The ANNA technique is designed to match identities on multiple values. For example, even two records with the same SSN do not constitute a match (because SSNs are often mistyped). It is only responsible to match when, for example, SSN and full name match, or SSN, first name, and date of birth match. The novelty of ANNA is its ability to achieve such fuzzy-like matching across a number of values while using only one-way hashes. It is in beta-testing now and the results are extremely promising.
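A minimal sketch of the two example rules above (SSN plus full name, or SSN plus first name plus date of birth), expressed over hashes only. This illustrates the general idea and is not ANNA's implementation: the hash function (SHA-256), the rule labels, the helper names and the sample records are all assumptions.

```python
import hashlib

def h(*fields: str) -> str:
    # Hash a normalized combination of fields; SHA-256 is an assumption.
    joined = "|".join(f.strip().lower() for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def match_keys(rec: dict) -> set:
    # One composite hash per rule; the rule label keeps the two rules from colliding.
    first_name = rec["name"].split()[0]
    return {
        h("ssn+name", rec["ssn"], rec["name"]),
        h("ssn+first+dob", rec["ssn"], first_name, rec["dob"]),
    }

def is_match(a: dict, b: dict) -> bool:
    # Two records match if they share at least one composite hash.
    return not match_keys(a).isdisjoint(match_keys(b))

# Invented records for illustration.
watch = {"name": "John Doe", "ssn": "123-45-6789", "dob": "1970-01-01"}
surname_typo = {"name": "John Dough", "ssn": "123-45-6789", "dob": "1970-01-01"}
ssn_only = {"name": "Mary Major", "ssn": "123-45-6789", "dob": "1985-06-30"}

print(is_match(watch, surname_typo))
print(is_match(watch, ssn_only))
```

The misspelled surname still matches through the second rule, while the record that shares only the SSN does not. Only the composite hashes would need to change hands, not the underlying values.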

YOU STATED: “Databases also differ significantly in their ability to differentiate between individuals. A National Rifle Association email list may only contain a name and an email address, while a bank would have much more. Unless, every individual has a unique identifier attached to their every transaction, there is a huge problem with incorrect identification.”

MY THOUGHTS: This has been solved. What we have figured out is a technique I call “accumulating context at ingestion”. In short, this is a model whereby new information received is used to determine if any prior decision was wrong … and if so, fixes it in real time. It is my belief that matching systems should first err on the side of false negatives (i.e., maximum reduction of false positives) then, while “accumulating context at ingestion,” new data is used to detect and correct false negatives. Notably, being able to match on a plurality of attributes while accumulating context obviates the need for a problematic national ID card.
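One way to picture the idea is the toy resolver below. It is only a sketch under stated assumptions, not the actual technique: the `Resolver` class, the two-attribute threshold and the sample records are hypothetical, and it works on plain values rather than hashes to keep the example readable.

```python
def shared(a: dict, b: dict) -> int:
    # Count the attributes that appear in both records with equal values.
    return sum(1 for key in a if key in b and a[key] == b[key])

class Resolver:
    def __init__(self, threshold: int = 2):
        self.threshold = threshold   # hypothetical rule: link on 2+ shared attributes
        self.entities = []           # each entity is a list of records

    def ingest(self, record: dict) -> None:
        keep, merged = [], [record]
        for entity in self.entities:
            # The new record may link to one entity, several, or none.
            if any(shared(record, r) >= self.threshold for r in entity):
                merged.extend(entity)   # earlier "different people" decision is revised
            else:
                keep.append(entity)
        self.entities = keep + [merged]

resolver = Resolver()
# Invented records: the first two share only a name, so they stay apart at first.
resolver.ingest({"name": "David Nelson", "phone": "555-0100"})
resolver.ingest({"name": "David Nelson", "address": "12 Elm St"})
count_before = len(resolver.entities)

# A later record shares two attributes with each earlier one and bridges them.
resolver.ingest({"name": "David Nelson", "phone": "555-0100", "address": "12 Elm St"})
count_after = len(resolver.entities)

print(count_before, count_after)
```

The resolver starts conservative, treating the two name-only lookalikes as separate people, and corrects that false negative only when a later record supplies enough context to join them.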

YOU STATED: “Now, Jonas is here talking about a system that would have access to your emails, online behavior, your purchase records, lists of what numbers you called, where you have traveled and what political and religious groups you belong to.”

MY THOUGHTS: Our technology is not about behavioral, transactional, or lifestyle data: rather, we specialize in loading and matching person identifiers (e.g., names, addresses, etc.). In any case, I feel we should not be sending private-sector transactional data to the government for aggregation into some giant database. And where information is required to flow (in any direction), then I’d rather have it anonymized (the reason I invented the ANNA capability).

As an aside, I also believe that data mining (predicting who will behave a certain way based on their socio-demographics and transactional behavior) is an ineffective way to predict who is going to be a bad guy.

Any and all comments welcome!

jeffjonas@us.ibm.com

Posted by: Jeff Jonas at March 6, 2005 11:26 AM
