Monday, February 06, 2006

Could the NSA wiretap everybody?

Recently, President Bush ordered the NSA to wiretap U.S. citizens without a warrant. In theory, this order is used only to eavesdrop on terrorists and their associates, but I started to wonder: What would it take for the NSA to record and monitor ALL long-distance telephone traffic in the U.S.? How expensive would a computing infrastructure to analyze and search all of those phone calls be?

The NSA doesn't release specifics on their technology, resource allocation, or what they've learned from the Aliens at Area 51. (Kidding!) And what little information is out there could just as well be disinformation. There's going to be some handwaving in this article. Indeed, there are some relevant technologies, such as special-purpose ASICs and vector processors, that I will ignore, but only in the interest of coming up with a worst-case estimate.

There are several alternative systems one can imagine, but I'm going to model a network of modern CPUs doing the first-pass voice-to-text transcription. The system would also need to search for suspicious conversations, perhaps by something as simple as keyword searching. On the other hand, there are more sophisticated techniques they could use; the anti-spam community has proven that Bayesian filtering does a great job as an automated first pass for classifying text, and the same approach could be used to recognize suspicious conversations. Eventually, any conversation flagged would be forwarded to human intelligence analysts.
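To make the Bayesian first pass a little more concrete, here is a minimal sketch of a naive Bayes text scorer of the kind spam filters use. Everything in it is an assumption for illustration: the function names `train` and `flag_probability`, the labels "flag" and "ok", the whitespace tokenizer, and the four toy training sentences. It says nothing about what the NSA actually runs.

```python
import math
from collections import Counter


def train(docs):
    """Count words per label. docs is a list of (text, label) pairs,
    where label is "flag" or "ok" (hypothetical labels)."""
    word_counts = {"flag": Counter(), "ok": Counter()}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def flag_probability(text, model):
    """Return the naive Bayes estimate of P(flag | text)."""
    word_counts, doc_counts = model
    vocab = set(word_counts["flag"]) | set(word_counts["ok"])
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in ("flag", "ok"):
        total_words = sum(word_counts[label].values())
        # Start from the prior: the share of training docs with this label.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        log_score[label] = score
    # Convert the two log scores into a probability, subtracting the
    # maximum first for numerical stability.
    top = max(log_score.values())
    weight = {k: math.exp(v - top) for k, v in log_score.items()}
    return weight["flag"] / (weight["flag"] + weight["ok"])


# Toy training transcripts, invented for this example.
model = train([
    ("wire the money to the offshore account tonight", "flag"),
    ("move the package before the deadline", "flag"),
    ("pick up milk on the way home", "ok"),
    ("see you at the game tonight", "ok"),
])

print(flag_probability("wire the money", model))
print(flag_probability("pick up milk", model))
```

A real filter would be trained on far more text and would have to cope with transcription errors, but the scoring step is cheap, which is why it is plausible as a first pass ahead of human analysts.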

Phone Traffic Volume
This section analyzes the available numbers on phone traffic; it's a bit dry. If you're bored, skip to the last paragraph in this section.

How much traffic do we need to record? The FCC issues a yearly report on "Statistics of Communications Common Carriers" summarizing phone usage in this country. There's some jargon in these reports, but to grossly oversimplify: InterLATA calls are long-distance; IntraLATA calls are medium-haul calls that don't cross LATA boundaries; and the closest calls are classified as local.

In the U.S. in 2004 (see page 21), there were 72 billion InterLATA calls, with a total of 602 billion billed minutes. (Thus the average phone call was about 8 minutes 20 seconds long.) There were also 10 billion IntraLATA calls, and 381 million local calls. The FCC doesn't publish minutes for local and IntraLATA calls, but let's assume 8 minutes 20 seconds for those calls, too. That leads us to roughly 87 billion additional minutes (88 billion if each estimate is rounded up, as in the table below), or a grand total of 689 billion minutes. Let's call it 700 billion.

The international calls for 2004 are listed on page 108 of the same document. There were 7.34 billion calls billed in the U.S., with a total of 42.7 billion minutes. (Average call length 5:48). There were also 3.6 billion calls billed in foreign countries, with a total of 15.6 billion minutes. (Average call length 4:21). There were additionally 1.7 billion minutes listed as "transiting the U.S.".

Approximate U.S. phone traffic, by type
Traffic type          # of calls     Total minutes       Time per call
InterLATA             72 billion     602 billion         8:20
IntraLATA             10 billion     84 billion (est.)   8:20 (est.)
Local                 381 million    4 billion (est.)    8:20 (est.)
International         7.34 billion   42.7 billion        5:48
Transiting the U.S.   -              1.7 billion         -
Total                 90 billion     750 billion         -

That was a long walk to arrive at about 700 billion minutes of call time within the U.S., and 60 billion minutes of international call time. (That's about 1.33 million years of telephone traffic within the U.S., and about 114,000 years of additional international traffic!)
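For anyone who wants to check the arithmetic, here is a small sketch that reproduces the totals from the FCC figures quoted above. The constant and function names are my own, and it assumes a plain 365-day year of 525,600 minutes, so the year counts can differ slightly from figures computed with a slightly longer year.

```python
# Figures quoted from the FCC report for 2004, in billions.
INTERLATA_CALLS = 72.0
INTERLATA_MINUTES = 602.0
INTRALATA_CALLS = 10.0
LOCAL_CALLS = 0.381

# International minutes, in billions.
US_BILLED_MINUTES = 42.7
FOREIGN_BILLED_MINUTES = 15.6
TRANSIT_MINUTES = 1.7

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (assumes a 365-day year)


def domestic_minutes():
    """Billions of domestic minutes, applying the InterLATA average
    call length to IntraLATA and local calls (the post's assumption)."""
    average_call = INTERLATA_MINUTES / INTERLATA_CALLS
    estimated = (INTRALATA_CALLS + LOCAL_CALLS) * average_call
    return INTERLATA_MINUTES + estimated


def international_minutes():
    """Billions of international minutes, including transit traffic."""
    return US_BILLED_MINUTES + FOREIGN_BILLED_MINUTES + TRANSIT_MINUTES


def years_of_audio(billions_of_minutes):
    """Convert billions of call minutes into years of continuous audio."""
    return billions_of_minutes * 1e9 / MINUTES_PER_YEAR


print(f"Domestic: {domestic_minutes():.1f} billion minutes")
print(f"International: {international_minutes():.1f} billion minutes")
print(f"700 billion minutes is {years_of_audio(700):,.0f} years")
print(f"60 billion minutes is {years_of_audio(60):,.0f} years")
```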

Computing Power
There are real-time continuous voice recognition systems that run well on a 200 MHz Pentium-class computer. Let's wave our hands here and say that this implies that a modern 2 GHz computer could do about 10 times as much work, and recognize voice on 10 streams at once, as well as do some simple Bayesian keyword analysis on the resulting text. This might be an oversimplification of the benefits of a processor speed boost, but on the other hand, the NSA's got a lot of people focused on this type of problem. They've got smart people, and they've been working on similar problems for a long time. The NSA's web pages even claim that the NSA is the "largest employer of mathematicians in the United States and perhaps the world."

A year is 525,600 minutes long, so a single 2 GHz computer handling 10 streams at once can process about 5.256 million minutes of traffic a year.

For U.S. traffic: 700 billion / 5.256 million = 133,181 computers.
For international traffic: 60 billion / 5.256 million = 11,415 computers.

Thus it would take a system of 145,000 computers to scan all of the U.S. and international phone traffic. That may sound like a lot of hardware, but coincidentally, it's about equal to the estimated number of computers that Google had last year!
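The machine count is easy to reproduce. The sketch below uses the post's assumption of 10 simultaneous streams per machine and a 525,600-minute year; the function name and the `peak_factor` parameter are hypothetical additions of mine. With the default `peak_factor` of 1.0 it assumes calls are spread evenly around the clock (or that recordings are queued and processed later), which is the same simplification the estimate above makes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600
STREAMS_PER_MACHINE = 10          # the post's hand-waved assumption


def machines_needed(total_minutes, streams=STREAMS_PER_MACHINE, peak_factor=1.0):
    """Machines required to transcribe total_minutes of audio in one year.

    peak_factor is a hypothetical multiplier for sizing the system to
    busy-hour load instead of the yearly average.
    """
    minutes_per_machine = streams * MINUTES_PER_YEAR
    return total_minutes / minutes_per_machine * peak_factor


us = machines_needed(700e9)
international = machines_needed(60e9)
print(f"U.S. traffic: {us:,.0f} machines")
print(f"International traffic: {international:,.0f} machines")
print(f"Total: {us + international:,.0f} machines")
```

Raising `peak_factor` to, say, 2.0 doubles the hardware, which still leaves the total in the low hundreds of thousands of machines.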

There are other costs associated with such a system, not least the networking costs of routing a copy of all of the phone traffic in the country to an NSA system! (Although apparently they've got quite a sophisticated network built just for that purpose.)

Additionally, the system above just performs a simple scan; in a real system, flagged calls would probably be routed to a more sophisticated system for better analysis. I've ignored the fact that the voice-recognition systems I know about don't cope well with mumbling, two people talking at once, foreign languages, or voices they haven't been "trained" on. But once again, these technical hurdles, while not trivial, seem well within the capabilities of the NSA.

The entire voice-recognition system I outlined above would cost about $250 million, at civilian prices. The NSA has an annual estimated budget of $7.5 billion. Every year, they spend over $21 million on electricity alone! With a budget like that, not only is it possible; it would be easy for the NSA to build a system to listen in on every phone call in the country.
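As a sanity check on those numbers, this sketch works out what the $250 million figure implies per machine and how it compares with the estimated budget. The names are mine; the inputs are the figures quoted in this post, with the machine count rounded to 145,000.

```python
MACHINES = 145_000        # rounded machine count from the estimate above
SYSTEM_COST_USD = 250e6   # estimated system cost at civilian prices
NSA_BUDGET_USD = 7.5e9    # estimated annual NSA budget
ELECTRICITY_USD = 21e6    # quoted yearly electricity spending


def cost_per_machine():
    """Implied hardware cost per computer, in dollars."""
    return SYSTEM_COST_USD / MACHINES


def budget_share():
    """System cost as a fraction of one year's estimated budget."""
    return SYSTEM_COST_USD / NSA_BUDGET_USD


print(f"Implied cost per machine: ${cost_per_machine():,.0f}")
print(f"Share of one year's budget: {budget_share():.1%}")
print(f"Years of electricity spending: {SYSTEM_COST_USD / ELECTRICITY_USD:.1f}")
```

So the estimate works out to roughly $1,700 per machine and a few percent of a single year's budget.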
