Clearing the Cloud
Rediscovering—and sometimes ditching—the stuff we store online
By Andrea Poet
When we want to store or share digital files, the cloud is an appealing choice. Free data allowances on services such as Google Drive, Dropbox, iCloud, and Amazon Drive make it easy to save it and forget it.
The question is: what happens to those files several months or even years down the road? Stashing everything in the cloud can create security and privacy risks. What’s more, much of this data stops being needed as time passes and simply takes up space.
“In the pre-digital society, you’d have maybe a shoebox full of mementos—you can go through it in an afternoon and that’s what you keep,” said Chris Kanich, associate professor in UIC’s computer science department. “It’s a big shift.”
Kanich, along with Assistant Professor Elena Zheleva and University of Chicago collaborator Blase Ur, is creating a tool to help us wade through our cloud files so we can decide what to keep and what to jettison. Importantly, the tool will flag files that could cause security risks if they fell into the wrong hands, spurring us to act before the worst happens.
Kanich said the process to develop the tool began with 100 interviews. The researchers talked to people about a sample of 10 files in their Dropbox or Google Drive accounts, accessed with consent: among them photos, text documents, and emails. Kanich and his team asked people what each file was, whether they had remembered it was there, and whether they had shared the file with anyone. The researchers also asked participants if they wanted to keep the files and, if so, if the items should be encrypted.
In many cases, people had forgotten they had stored a particular file or didn’t even know what it was, Kanich said. With shared documents, access may have been granted years ago to former colleagues who either no longer needed it or should no longer have it.
In terms of whether people wanted to retain their cloud-stored materials, many were ambivalent.
“I had this ‘digital packrat’ idea that nobody will want to delete anything. That was not the case,” Kanich said.
Kanich, a cybersecurity expert, also hypothesized that people would want to remove items from the cloud mainly for security and privacy reasons. But for the vast majority of the people interviewed, it boiled down to resource allocation: the unneeded files were taking up space.
Storage may seem limitless, and new data centers are continuously coming online, but keeping files in perpetuity isn’t cheap.
“I have a 10-terabyte hard drive that I bought for $200, and it may cost me five dollars in electricity per year,” Kanich noted. “But to store that amount of data at, say, Google? It would cost $2,760 per year.”
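The arithmetic behind that comparison can be laid out in a few lines. The sketch below uses only the figures Kanich quotes; the five-year horizon is an assumption added for illustration, and the calculation ignores backups, drive failure, and any change in cloud pricing.

```python
# Figures quoted by Kanich in the article.
DRIVE_PRICE_USD = 200                # one-time cost of a 10-terabyte drive
DRIVE_ELECTRICITY_USD_PER_YEAR = 5   # estimated yearly power cost
CLOUD_USD_PER_YEAR = 2760            # quoted yearly cost for the same capacity
CAPACITY_TB = 10


def local_cost(years):
    """Total cost of owning the drive for the given number of years."""
    return DRIVE_PRICE_USD + DRIVE_ELECTRICITY_USD_PER_YEAR * years


def cloud_cost(years):
    """Total cost of renting the same capacity for the given number of years."""
    return CLOUD_USD_PER_YEAR * years


def implied_cloud_rate_per_tb_month():
    """Monthly price per terabyte implied by the quoted yearly figure."""
    return CLOUD_USD_PER_YEAR / (CAPACITY_TB * 12)


if __name__ == "__main__":
    years = 5  # assumed horizon, not from the article
    print(f"Local drive over {years} years: ${local_cost(years)}")
    print(f"Cloud storage over {years} years: ${cloud_cost(years)}")
    print(f"Implied cloud rate: ${implied_cloud_rate_per_tb_month():.2f} per TB per month")
```

On these numbers, the quoted cloud price works out to $23 per terabyte per month, and the gap between the two options widens every year the files are kept.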
Beyond concerns about space and security, leaving a digital footprint can be risky for other reasons.
“An email or Facebook post I wrote 15 years ago is going to have a fundamentally different meaning today than when I wrote it,” Kanich said. “That doesn’t have to do with the information itself—that post is, bit for bit, identical with whatever it is I wrote 15 years ago. The information has not changed, but the world around it has drastically changed.”
To help everyday users grapple with issues like these, Kanich and his team designed a machine-learning algorithm that classifies data through the lens of whether it is useful or sensitive and prioritizes the most important files for review. Like humans, machine-learning algorithms learn from experience. The algorithms find patterns or correlations in data to help you make better decisions and predictions. By training their algorithm using the data from the participant-interview phase, Kanich and colleagues created a tool that ferrets out items with labels or keywords that are likely to be sensitive or important to cloud users, calling them out for follow-up.
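To make the idea of prioritizing files for review concrete, here is a minimal sketch. It is not the researchers' algorithm: their tool is trained on interview data, while this stand-in uses a hand-written keyword list. The keyword weights, the `CloudFile` fields, and the extra weight for shared files are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword weights; a trained model would learn these from data.
SENSITIVE_KEYWORDS = {
    "passport": 3.0,
    "tax": 3.0,
    "ssn": 3.0,
    "password": 2.5,
    "invoice": 1.0,
}


@dataclass
class CloudFile:
    name: str             # file name as stored in the cloud account
    text: str             # extracted text content, if any
    shared: bool = False  # whether anyone else has access


def sensitivity_score(f):
    """Sum keyword weights over the file's name and text."""
    tokens = re.findall(r"[a-z]+", (f.name + " " + f.text).lower())
    score = sum(SENSITIVE_KEYWORDS.get(t, 0.0) for t in tokens)
    # Assumption: a shared file is a bigger exposure, so weight it more.
    return score * 1.5 if f.shared else score


def review_queue(files, top_n=3):
    """Return up to top_n files with a nonzero score, highest first."""
    ranked = sorted(files, key=sensitivity_score, reverse=True)
    return [f for f in ranked[:top_n] if sensitivity_score(f) > 0]


if __name__ == "__main__":
    files = [
        CloudFile("vacation.jpg", "beach photo"),
        CloudFile("tax_return_2015.pdf", "federal tax return", shared=True),
        CloudFile("notes.txt", "wifi password for the office"),
    ]
    for f in review_queue(files):
        print(f.name, sensitivity_score(f))
```

In this toy example the shared tax return rises to the top of the queue and the vacation photo is left out entirely, which mirrors the goal described above: surface the handful of files that most deserve a second look.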
Kanich noted that for sensitive files stored in the cloud—say, a tax return or digital copy of a passport—services such as Microsoft’s OneDrive Personal Vault can provide additional layers of security. Microsoft’s product encrypts data and requires a key or two-factor authentication to decrypt it. With documents that people don’t access regularly, that extra hurdle can add a great deal of protection.
Corporations are doing a similar housecleaning on the data they store on customers and clients. In fact, new privacy and data-protection laws compel them to do so. Sriya Potham, a 2017 UIC graduate and a cloud security architect at Newell Brands, has worked on compliance with the European Union’s General Data Protection Regulation, or GDPR, and is building a new platform for e-commerce.
“It’s a huge task to determine what information we have on consumers: where is it stored, what type of data is it, and is it sensitive,” she said. “From a big-enterprise perspective, it’s a huge undertaking.” Tools that are able to do something similar to Kanich’s, she said, are very expensive at the corporate level.
Kanich, Zheleva, and Ur’s tool is geared toward individuals, but the approach can be applied at the single-person level in corporate settings, encouraging employees to manage their own data and storage.
“We need to reimagine our relationship with data over long periods of time,” Kanich said.