What Can We Learn From Spying On Our Own Metadata?
Why let the NSA have all the fun?
Okay, so the National Security Agency is sitting on a treasure trove of all your metadata. What exactly can they learn about you from something as vague as the time and duration of your calls?
Performing simple data-mining on an individual level is becoming much easier, thanks to numerous prediction libraries available in just about any programming language and powerful cloud-based tools like Google’s Prediction API. “To understand exactly what the government can do with this metadata, I decided to beat the NSA at its game by spying on my own data,” Stein writes.
While the government can get your Verizon cell data with ease, it is naturally much harder to get hold of it yourself, so Stein wrote a Ruby script to mine his own metadata from his Google Voice account instead. His goal: to figure out whether he could identify the gender of a caller based solely on the time of day of a call and how long it lasted.
He pulled 20 random phone numbers from his call history and marked whether each belonged to a man or a woman. Then he used all the calls from those 20 numbers, along with the time and duration of each call, as his training examples. After training on those 861 examples, Google’s Prediction API gave his model a 67 percent confidence level in predicting the gender of a caller. Though by scientific standards that’s not particularly accurate, Stein “found it surprisingly good at determining a caller’s gender.”
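To make the idea concrete, here is a rough sketch of that kind of experiment in Python. It is not Stein's Ruby script and it does not call Google's Prediction API; it swaps in a simple nearest-neighbor vote written with the standard library. The call records, the function names (`scale`, `predict`, `leave_one_out_accuracy`) and the choice of three neighbors are all made up for illustration, and the pattern in the sample data is invented rather than taken from anyone's real call history.

```python
from collections import Counter
from math import hypot

# Hypothetical call records: (hour of day 0-23, duration in minutes, label).
# These values are invented for illustration; they are not Stein's data.
TRAINING_CALLS = [
    (9, 2.0, "man"),
    (10, 3.5, "man"),
    (11, 1.5, "man"),
    (14, 4.0, "man"),
    (15, 2.5, "man"),
    (18, 27.0, "woman"),
    (19, 22.0, "woman"),
    (20, 35.0, "woman"),
    (21, 18.0, "woman"),
    (22, 40.0, "woman"),
]


def scale(hour, minutes):
    """Put both features on a comparable 0-1 scale (durations capped at 60 min).

    Simplification: this ignores that 23:00 and 00:00 are adjacent times.
    """
    return hour / 23.0, min(minutes, 60.0) / 60.0


def predict(hour, minutes, training=TRAINING_CALLS, k=3):
    """Return the majority label among the k training calls closest to the query."""
    qx, qy = scale(hour, minutes)

    def distance(call):
        cx, cy = scale(call[0], call[1])
        return hypot(cx - qx, cy - qy)

    nearest = sorted(training, key=distance)[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(training=TRAINING_CALLS, k=3):
    """Hold out each call in turn, predict it from the rest, and report the hit rate."""
    hits = 0
    for i, (hour, minutes, label) in enumerate(training):
        rest = training[:i] + training[i + 1:]
        if predict(hour, minutes, rest, k) == label:
            hits += 1
    return hits / len(training)


if __name__ == "__main__":
    print(predict(10, 3.0))
    print(leave_one_out_accuracy())
```

The toy data is deliberately clean, so the sketch will look far more accurate than anything built on real call logs. With a real export, the interesting number is the held-out accuracy, which is the closest analogue here to the 67 percent figure Stein reported.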
As he points out, his results might be skewed by his small, individually specific sample, but it’s a testament to exactly how much you could find out about a person with more precise algorithms and access to huge amounts of data:
“Most importantly, if that’s what I can do with a limited set of my own data, imagine what the NSA can do with the datasets it has access to. If you don’t think determining an anonymous caller’s gender is particularly useful, think about the other things you might find out from a better set of data and more precise algorithms, like which callers are likely to be related to one another (I’m going to try that one on myself next), or with location data, where they’re likely to be at any given time.”
For aspiring spooks, he’s taking suggestions on how he should spy on himself next.