Researchers at IBM's Almaden, California research lab are building what will be the world's largest data array--a monstrous repository of 200,000 individual hard drives all interlaced. All together, it has a storage capacity of 120 petabytes, or 120 million gigabytes.
There are plenty of challenges inherent in building this kind of groundbreaking array, which, IBM says, is destined for, as Technology Review writes, "an unnamed client that needs a new supercomputer for detailed simulations of real-world phenomena." For one thing, IBM had to rely on water-cooling units rather than traditional fans, as this many hard drives create heat that can't be subdued in the normal manner. There's also a sophisticated backup system that senses the number of hard disk failures and adjusts the speed of rebuilding data accordingly--the more failures, the faster it rebuilds. According to IBM, that should allow it to operate with the absolute minimum of data loss, even none.
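The rebuild behavior described above can be pictured with a minimal sketch. The function name, the linear scaling, and every number in it are illustrative assumptions, not IBM's actual design; the only thing taken from the article is the idea that more failed drives means a faster rebuild.

```python
def rebuild_rate(failed_drives, base_rate_mb_s=50.0, max_rate_mb_s=400.0):
    """Return a rebuild throughput (MB/s) that grows with the failure count.

    The rates and the linear scaling are made-up illustrative values,
    not figures from IBM.
    """
    if failed_drives <= 0:
        return 0.0  # nothing has failed, so nothing needs rebuilding
    # More failures means less remaining redundancy, so rebuild faster,
    # up to a cap that leaves bandwidth for normal reads and writes.
    return min(base_rate_mb_s * failed_drives, max_rate_mb_s)


for n in (0, 1, 3, 10):
    print(n, rebuild_rate(n))
```

The cap reflects the tradeoff any such scheme has to make: rebuilding competes with ordinary traffic for the same drives.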
IBM's also using a new filesystem, designed in-house, that writes individual files to multiple disks so different parts of the file can be read and written to at the same time.
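As a rough illustration of writing one file across several disks, here is a minimal round-robin striping sketch. It is not IBM's filesystem or its real layout; the `stripe` and `unstripe` helpers, the block size, and the use of in-memory lists to stand in for disks are all assumptions made for the example.

```python
def stripe(data: bytes, num_disks: int, block_size: int):
    """Split data into fixed-size blocks and deal them round-robin across disks."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % num_disks].append(data[offset:offset + block_size])
    return disks


def unstripe(disks):
    """Reassemble the original bytes by reading blocks back in round-robin order."""
    out = []
    longest = max(len(d) for d in disks)
    for row in range(longest):
        for d in disks:
            if row < len(d):  # the last row may not reach every disk
                out.append(d[row])
    return b"".join(out)


data = b"abcdefghij"
disks = stripe(data, num_disks=3, block_size=2)
print(disks)
print(unstripe(disks) == data)
```

Because consecutive blocks land on different disks, a real system could read or write them in parallel, which is the benefit the article describes.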
This kind of array is bottlenecked pretty severely by the speed of the drives themselves, so IBM has to rely on software improvements like the new recovery system and filesystem to boost speed and enable the use of so many different drives at once.
Arrays like this could be used for all kinds of high-intensity work, especially data-heavy duties like weather and seismic monitoring (or people monitoring)--though of course we're curious as to what this particular array will be used for.
What happens when all these hard drives get infected by the "Brain" virus? Bra-Waaa-haaa Ha!
Will it blend? That is the question.
120 yottabytes. that should be large enough.
why learn from your own mistakes, when you could learn from the mistakes of others?
Is knowing and remembering ever enough. What about love? BIG HUG! BIG BIG HUG! SQUISH!
So it sounds like, on average, they are using 600 GB hard drives. They do make 2 and even 4TB hard drives these days, which would reduce the number of components, while still providing quite a bit of fault tolerance and parallelism in reading the data. Right now, for example, it seems like the 3TB drives are at a pretty cheap price point compared to the 4TB. Add a multi-level cache of several GB per drive and most of the time you'll be working in RAM instead of on slow drives.
FYI: The "Brain" virus was one of the oldest viruses that would ruin a hard drive.
I'm curious as to why 200,000 drives are needed to provide 120 PB of storage.
Are any of the drives spares that replace failing disks transparently? What is the error detection and correction architecture? How many drives does it use for that purpose?
What sort of architecture does the array use? Could it be that the array migrates older, infrequently used files to a lower hierarchy of larger, but slower disks, while frequently used files reside higher up in smaller but faster disks? Older supercomputers used a similar scheme, differing in that old files get written to tape.
But it would appear that interleaving the disks, and smart use of caches will make this unnecessary, but I'm not familiar with disk array architecture (let alone top-end, leading-edge stuff like this), and I don't remember much of the chapter about storage in Computer Architecture: A Quantitative Approach.
So many questions!
They're probably using anywhere from 1 to 2 TB drives so that they can build in redundancy.
Of course, 200,000 x 600GB = 120PB, and 200,000 x 1 or 2TB would be way higher than 120PB. But I'm sure they have it in a raid config for read speed improvements and hot swapping of bad drives. Plus, they'll probably have some redundancy of their own built in. If this is used with linux, they may have even decided to use a very large swap drive. All of these things would eat up useable disk space (some more than others).
So 200,000 x 1TB = 200PB. If the usable disk space is 120PB, then 60% of the disk space is useable.
And if they're using 2TB drives, they might be doing 100% redundancy. so 2TB x 200,000 = 400PB. 120PB with 100% redundancy would mean 240PB of the 400PB is used, leaving 160PB as a swap drive, which seems like a lot. So I'm guessing they're not doing that. But 200% redundancy would be 360PB, leaving 40PB as a swap drive.
SO if I had to guess, I'd say they're using 1.5 or 2TB drives.
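For anyone who wants to check the numbers in this guesswork, here is a small sketch. It assumes decimal units (1 PB = 1,000,000 GB) and reads "100% redundancy" as two full copies of the data and "200% redundancy" as three; the function names are made up for the example.

```python
DRIVES = 200_000
USABLE_PB = 120
GB_PER_PB = 1_000_000  # decimal units


def raw_capacity_pb(drive_gb):
    """Total raw capacity of the array for a given per-drive size in GB."""
    return DRIVES * drive_gb / GB_PER_PB


def left_over_pb(drive_gb, copies):
    """Raw capacity minus the space taken by `copies` full copies of the data."""
    return raw_capacity_pb(drive_gb) - USABLE_PB * copies


print(raw_capacity_pb(600))               # 600 GB drives, no redundancy
print(USABLE_PB / raw_capacity_pb(1000))  # usable fraction with 1 TB drives
print(left_over_pb(2000, copies=2))       # 2 TB drives, "100% redundancy"
print(left_over_pb(2000, copies=3))       # 2 TB drives, "200% redundancy"
```

These reproduce the figures above: 120 PB raw with 600 GB drives, 60% usable with 1 TB drives, and 160 PB or 40 PB left over with 2 TB drives.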
This array is using 600GB 15K RPM Fibre Channel disks. No one builds a massive supercomputer and stuffs the data on slow SATA disks.
So 200000 x 600GB = 120PB - Remember this is a marketing number. The net usable will be much lower as you would expect the volumes to be RAID10.
All very irrelevant - it probably won't work anyway ... well at least not as a single array ...
But it makes good PR though ...
Why isn't IBM using SSDs?
SSD has a limited lifespan on top of costing at the bare minimum $1 per gigabyte. If my abacus is correct that would cost One hundred and twenty trillion dollars..... oops I accidentally carried the bead.
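With the bead put back, the arithmetic works out as follows. This is a quick sketch that takes the $1 per gigabyte figure above at face value and uses the article's own conversion of 120 petabytes to 120 million gigabytes.

```python
CAPACITY_PB = 120
GB_PER_PB = 1_000_000      # the article's conversion: 120 PB = 120 million GB
PRICE_PER_GB_USD = 1       # the "bare minimum" price quoted in the comment

total_gb = CAPACITY_PB * GB_PER_PB
total_cost_usd = total_gb * PRICE_PER_GB_USD

print(f"{total_gb:,} GB -> ${total_cost_usd:,}")
```

That is about 120 million dollars for the flash alone, not 120 trillion.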
It would make sense for IBM to have some SSD drives in the array. But, as others have pointed out, they are more expensive, wear out quickly compared to hard drives, and also can't do things like block transfers as quickly. They beat hard drives for small, random data transfers, but may not for high volumes of chunked data.
So why did IBM sell their hard drive mfg years ago??
Why not just say 120,000 terabytes, since most users today are using 1 to 3 TB drives?
IBM sold their hard drive business because it had low profit margins. Hard drive R&D and manufacturing is expensive, and competition between the hard drive manufacturers keeps prices low, so every hard drive only earns a few dollars of profit (IIRC, and this was a few years ago, I'm not sure about the current situation -- whether it's less or not).
I'm pretty sure a 5 year old can put two and two together and crack this "mystery". The client is Lockheed Martin. Why? Lockheed Martin just recently contracted D-Wave Systems to build the world's first 128 qubit quantum computer. A quantum computer of this magnitude will need enormous data storage capabilities. There are multiple purposes for this, but its primary purpose will probably be the NGI (Next Generation Identification) for the FBI, CIA, et al. They also teamed up with Lockheed Martin for a Census contract called the Decennial Response Integration System (DRIS) 2010 data collection contract. All of this combined basically means that for "security" reasons every American will be cataloged. All the data will be available on a new cloud computing network for the Feds called the Federal Community Cloud. The roots of this go back to when Google developed image recognition software for the first D-Wave supercomputer. That was to identify cars in images. This algorithm could easily be adapted to human biometric data.
Well, they have to back up and store all their DVDs. Oh and if you own a DVD, you do have the legal right to back up and protect your own media. Of course after you've removed that copyright protection, then temptation comes into play to share your videos with others; don't go there, it's a bad place.
I brought up this last little point about backing up Blu-rays and DVDs to my hard drive to see what other people thought about it, and I am also interested in what other people use for decrypting and backup software. I have backed up all my home DVDs to my 2 TB hard drive and watch them on my TV remotely using a Western Digital Media Player Plus. The WD TV Live Plus HD Media Player from Western Digital is a digital media player for TVs or HDTVs. It can be used to play back video, audio, and pictures over your home network or from any USB hard drive or storage device. Two USB ports are built into the device, allowing you to connect multiple hard drives simultaneously. It also lets you make use of a lot of internet sites, Netflix, Youtube and many others...
"According to IBM, that should allow it to operate with the absolute minimum of data loss, even none."
That many hard drives and there's still a possibility of data loss?? Seriously?? Doesn't IBM know about redundancy?
IBM must be fibbing me. All they got there is a zero and a one.
I would like to know what filesystem they have.
@rettaH_daM
The PopSci article is wrong in regards to the reliability of the array. The MIT Technology Review article linked to by PopSci quotes Bruce Hillsberg as saying, "the result is a system that should not lose any data for a million years without making any compromises on performance."
@jefro
The same MIT article says the filesystem the array uses is IBM's GPFS.
Doesn't that much data storage seem a little.... excessive?
@ZeroCool100, I see your point on quantity. If I go to the Library of Congress and see all the books, I still can only read a few at a time. I suppose the real usefulness of storing large quantities of data is to be able to cross reference and retrieve what you want quickly. Then the science of what makes a good search engine becomes highly important. I personally believe many internet search engines are biased towards money and marketing and give you the answer that will send you to a link that benefits the search engine ($$$). I am not confident they give you the best answer to your question first on the page. You may have to click through 50 or more pages or try several different search engines to find the best answer to your question. One of the first things I notice that is wrong with any search engine, say Google, is that I always get a response in English first. But isn't it connecting to the world? The second thing I notice is that local results seem to be delivered in the greatest quantity. I believe there is much bias going on in the background of search engines and most people do not even realize it. Finally, many people make a lot of decisions based on the information they find on an internet search engine, and they forget that a lot of businesses, government agencies, hospitals and so forth do not put all their ideas on the internet. Sometimes if the question is really important to you, you need to call someone or see them in person to get the best answer; maybe even read a book or two; go to the location and check it out yourself.
@BubbaGump, First, I love reading your comments. And I understand what you're saying, but I was just saying that I think they are building that much storage just because they can. (Which I would probably do too, if I'm honest.)
I hope it's SSD technology...
And all that heat will get dumped into the air!
they could heat a city with this.
But do they? NO!
you could spread them out too.
just link them with fiber optics.
Oh heck just build one big arse ssd and be done with it. Not rocket science. lol jk