“Today, we’re announcing the completion of a new web indexing system called Caffeine. Caffeine provides 50 percent fresher results for web searches than our last index, and it’s the largest collection of web content we’ve offered. Whether it’s a news story, a blog or a forum post, you can now find links to relevant content much sooner after it is published than was possible ever before.”
How is this even possible? Unless I’m not aware of some incredible storage technology that the government has been keeping behind closed doors, they must be fitting new hard drives to machines non-stop, 24/7.
That’s 100 1TB HDDs a day. A day. And you have to have somewhere to plug those in, so they must have new racks going in each day. Where do the physical limits kick in!?
It simply must be magic. That, or Google has server farms of humans like something out of The Matrix.
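For perspective, here's a quick back-of-envelope on that "100 drives a day" figure. Note the ~100 TB/day rate is this thread's guess at what Google's announcement implies, not an official number:

```python
# Back-of-envelope sanity check for the "100 1TB HDDs a day" claim.
# The ~100 TB/day ingest rate is the thread's assumption, not an
# official Google figure.

TB_PER_DAY = 100        # assumed daily index growth, in terabytes
DRIVE_SIZE_TB = 1       # 1 TB drives, as in the comment above

drives_per_day = TB_PER_DAY / DRIVE_SIZE_TB
drives_per_year = drives_per_day * 365
petabytes_per_year = TB_PER_DAY * 365 / 1000

print(f"{drives_per_day:.0f} drives/day, "
      f"{drives_per_year:.0f} drives/year, "
      f"~{petabytes_per_year:.1f} PB of raw growth per year")
# -> 100 drives/day, 36500 drives/year, ~36.5 PB of raw growth per year
```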
They might be counting re-indexed sites, which you would expect to overwrite the old index entries. So, for example, 100 TB being recorded wouldn't mean an additional 100 TB of disk space.
Plus they probably save some space by removing old/dead sites, images, products, etc.
And for all we know, those stats are probably pre-compression too.
However, even if all three of those points are true, it's still an epic amount of storage each day, and that's not taking redundancy and backups into account (unless of course Google has already done that in their stats, hehe).
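To put toy numbers on those three caveats: the overwrite fraction, compression ratio and replication factor below are pure guesses for illustration, not anything Google has published.

```python
# Rough model of how a headline 100 TB/day could translate into net
# disk growth. Every number below is a made-up assumption for
# illustration only.

RAW_TB_PER_DAY = 100      # headline figure discussed in this thread
OVERWRITE_FRACTION = 0.6  # assumed share that just overwrites re-indexed pages
COMPRESSION_RATIO = 0.5   # assumed on-disk size after compression
REPLICATION_FACTOR = 3    # assumed copies kept for redundancy/backups

new_data = RAW_TB_PER_DAY * (1 - OVERWRITE_FRACTION)   # genuinely new content
on_disk = new_data * COMPRESSION_RATIO                 # after compression
with_replicas = on_disk * REPLICATION_FACTOR           # across all replicas

print(f"net new content: {new_data:.0f} TB/day")
print(f"compressed:      {on_disk:.0f} TB/day")
print(f"with replicas:   {with_replicas:.0f} TB/day")
# With these guesses: 40 TB/day new, 20 TB/day compressed, 60 TB/day replicated.
```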
Their fingers reach into our society like fungus threading into rotten, old wood.
I thought that was rather poetic.
They most likely are guessing the number, just as they guess the number of results for a search.
Like when Google says there are 1,000,000 hits for your search, but they all suddenly disappear after 10-20 pages.
I very much doubt a company that excels in collating and indexing statistics (as you would have to if you were a high-profile search engine) would resort to guesswork when it comes to the data volumes of its core business.
In fact, I doubt any company of a reasonable size would.
You can get 4U and 5U rackmount cases that hold 48 hard drives. Fill those with 2 TB drives (or 3 TB drives once they're released shortly) and you have almost 100 TB of storage per server.
You only need one storage server per day to cover the extra storage needs.
A 48U rack (on the tall side; 42U is more common) can hold 12 of these, for over a petabyte of storage per rack.
All you need is to add roughly one rack of storage every week and a half to cover this expansion. And that's just in one data centre. They're opening/expanding data centres all the time. Depending on how it's distributed and the level of redundancy, they'd only need a couple of servers added per week, per data centre.
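A minimal sketch of that arithmetic, assuming the ~100 TB/day figure from earlier in the thread. The chassis size, drive capacity and rack height are just typical figures for the time, not a description of Google's actual hardware:

```python
# Back-of-envelope rack math for the scheme sketched above.
# All hardware figures are typical 2010-era assumptions.

DRIVES_PER_CHASSIS = 48   # a 4U storage server with 48 drive bays
DRIVE_TB = 2              # 2 TB drives (3 TB drives would push this further)
RACK_UNITS = 48           # a tall rack; 42U is the more common size
CHASSIS_UNITS = 4

tb_per_server = DRIVES_PER_CHASSIS * DRIVE_TB        # 96 TB per server
servers_per_rack = RACK_UNITS // CHASSIS_UNITS       # 12 servers per rack
tb_per_rack = tb_per_server * servers_per_rack       # 1152 TB per rack

DAILY_GROWTH_TB = 100                                # assumed ingest rate
days_per_rack = tb_per_rack / DAILY_GROWTH_TB        # how long one rack lasts

print(f"{tb_per_server} TB/server, {tb_per_rack} TB/rack, "
      f"one new rack roughly every {days_per_rack:.1f} days")
# -> 96 TB/server, 1152 TB/rack, one new rack roughly every 11.5 days
```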
Now, this may not be the exact way that Google handles storage, but it’s similar.
It’s not insurmountable. You just have to think BIG and distributed.
Yeah, that's a pretty huge leap in indexing. Well, if they can live up to their claims, then I'd say they've got some really leet storage and indexing systems…