To break Google’s monopoly on search, make its index public

Thom Holwerda 2019-07-15 Google 35 Comments

Fortunately, there is a simple way to end the company’s monopoly without breaking up its search engine, and that is to turn its “index”—the mammoth and ever-growing database it maintains of internet content—into a kind of public commons.
There is precedent for this both in law and in Google’s business practices. When private ownership of essential resources and services—water, electricity, telecommunications, and so on—no longer serves the public interest, governments often step in to control them. One particular government intervention is especially relevant to the Big Tech dilemma: the 1956 consent decree in the U.S. in which AT&T agreed to share all its patents with other companies free of charge. As tech investor Roger McNamee and others have pointed out, that sharing reverberated around the world, leading to a significant increase in technological competition and innovation.

This is an interesting proposition. I don’t know if this will increase competition in any meaningful way, or if it’ll just lead to a shift in power from Google to the other major technology companies without really creating opportunities for newcomers, but it’s certainly yet another proposal on how to deal with the ever growing power these companies wield.

35 Comments

2019-07-15 8:04 pm

Alfman
Thom Holwerda,

This is an interesting proposition. I don’t know if this will increase competition in any meaningful way, or if it’ll just lead to a shift in power from Google to the other major technology companies without really creating opportunities for newcomers, but it’s certainly yet another proposal on how to deal with the ever growing power these companies wield.

This is an interesting proposal. The political stumbling blocks are high, but to the extent that Google’s web indexes are scraped from public sources, maybe this isn’t so far-fetched. I do actually think it could help increase competition. Heck, even I’ve had lots of ideas for search engines in the past, but it was always futile because I simply don’t have Google’s reach. There’s probably tons of 3rd party innovation to be had here – things that Google is unwilling or unable to do because of its corporate DNA.

2019-07-16 12:56 am

Kochise
I also presume there is a little bit more to it than just serving a bunch of links. Not only is the crawling algorithm theirs, but the way they weight a selection of links is also their own recipe. I have heard that some governments, such as Germany’s, have requested that Google disclose its algorithms. Why not have every company open up its secret formulas? Let’s go full open! Seriously…

2019-07-15 9:11 pm

cb88
Water, electricity, telecommunications: all of these are almost entirely privately owned, with the exception of municipal water. In my area, which is about 40 minutes out of a large metro area, about 50% of the electricity is supplied by a membership corporation, essentially a non-profit, where the members are the owners.

Government control of almost anything is an extremely bad idea…. with few exceptions.

2019-07-16 6:46 am

kwan_e
cb88,

where the members are the owners.

You know that’s kind of what (a democratically elected) government is, right?

Government control of almost anything is an extremely bad idea…. with few exceptions.

Because Enron did such a bang-up job, and the rolling blackouts were such a great idea.

2019-07-16 8:34 am

ahferroin7
> You know that’s kind of what (a democratically elected) government is, right?

Not really, unless you’re only looking at true democracy, which is remarkably scarce in the modern world (Switzerland’s somewhat dumbfounding continued existence as a coherent nation-state aside). In the US for example, it may be the people who elect politicians, but beyond the local municipal level, it’s largely a case of lobbyists and big corporations owning the government, not the people.

2019-07-17 3:29 pm

CaptainN-
That was almost completely incomprehensible, but I’ll respond anyway. There are private sector success stories for democracy too – check out Mondragon and Arrasate in Spain for an example.

It’s really impossible to even discuss this stuff with puritanical hardliners. The phrase “true democracy” is vomit-inducing, for its hard-line extremism – its purity. How can you have any kind of balanced thoughts with this sort of either/or thought process? Of course “true democracy”, which I can only assume means mob rule to you, won’t work. But that’s not really what it’s about, and direct, immediate democracy is not the only kind of real democracy.

2019-07-15 9:43 pm

sukru
As a Google engineer, I would be conflicted commenting on the proposal; however, what I can tell from the overview is that the author has a very limited technical understanding of “the index”.

A web search engine (at least the backend) has the following parts [roughly]:

1 – The raw crawled data. While this is the source of everything, it is also the easiest part. Even working alone during my PhD, I crawled my own – albeit limited – index. And the open source crawlers have improved significantly since then. If you have a simple Hadoop cluster, you can easily reach an index of significant proportions.

2 – The data structures. These are what make the index actually useful, and everything is built around them. Anyone involved in information retrieval would know about the trees, lookup tables, and other mechanisms used. And once again, many of these are available in open source systems, including Apache’s Nutch and Solr indices.

3 – Additional ranking information. There is significant value added by the algorithms like PageRank, or many machine learning models. There are many open source efforts in this area (including some from Google), and if engineered well it is easy to build a custom search answering system utilizing these techniques.

And on top of these are mobile and web front ends, customer APIs, and some load balancing / sharding mechanisms.
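For readers who want to see how parts 2 and 3 above fit together, here is a minimal toy sketch in Python. It is only an illustration under stated assumptions: the three documents, the link graph, and the function names are all made up for this example, the ranking is the textbook power-iteration form of PageRank, and none of it reflects how Google, Nutch, or Solr actually implement these pieces.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict of page -> outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

def search(query, index, rank):
    """Return ids of documents containing every query token, best rank first."""
    tokens = query.lower().split()
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches, key=lambda d: (-rank.get(d, 0.0), d))

# Hypothetical toy corpus and link graph, purely for illustration.
docs = {
    "a": "open search index",
    "b": "search ranking signals",
    "c": "open web crawler",
}
links = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

index = build_inverted_index(docs)
rank = pagerank(links)
print(search("search", index, rank))
```

The inverted index answers "which documents contain these words", and the rank scores decide the order of the matches. A real engine layers many more signals on top of this.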

As it would not be right for me to go into detail, I will stop here and will not draw any conclusions about the article itself.

[ Also this is just me personally commenting after work ].

2019-07-16 12:52 am

winter
I think I would disagree with #1 being the easiest. While crawling the net is easy, doing it at scale is not trivial. There are many barriers in place, from figuring out what to crawl and how often, to being blocked by webmasters for spamming their sites.

Webmasters see significant value in having Google crawl their sites. However, this is not likely to be the case for other companies that crawl, if webmasters don’t see an ROI from it.
2019-07-16 3:08 am

flypig
That’s a nice technical breakdown of the technology involved. The original article talks about the index, but I think there’s an underlying message that is worth pulling out. As someone who understands where the real value is, how would you extend the message to work with the complete search stack? Is there a way to allow others access to these technologies in order to reduce Google’s dominance, but without undermining the value Google provides?

2019-07-16 11:37 pm

sukru
The search stack is as open as possible. The algorithms are already well known (and most are published by Google). There are APIs to access various parts of the system, and even core C++ libraries are shared as open source (https://github.com/abseil/abseil-cpp).

However, it would not make sense to share actual implementation details. For example, take the spam signals used in ranking. If they were open, spammers would use this information to try to hide themselves. Likewise, publishing the actual calculated PageRank values would open ranking to gaming by scammers. It would help neither users nor competitors (who would have their own versions of these signals).

Bing, Baidu, Yandex, and others already have their own implementations and a strong presence in their markets. (For example, Bing is over 30% in the US; see https://twitter.com/MSFTAdvertising/status/898208047578849280 for the figure.) They would neither need nor benefit from access to internal Google data.

2019-07-17 12:04 pm

flypig
The search stack is as open as possible. The algorithms are already well known (and most are published by Google). There are APIs […]

If opening up access won’t be particularly beneficial to other companies (or the technologies are even already accessible), but will persuade lawmakers and regulators that Google is making some sort of concession, then that seems ideal for Google. Google should be jumping on the suggestion in the original article.

2019-07-17 3:45 pm

sukru
Unfortunately product and legal matters are beyond my expertise.

2019-07-15 9:58 pm

dark2
If simply copying search results would make you competitive, Bing would be popular by now. It’s now about the entire ecosystem that Google or Apple provides: Gmail, Docs, Maps, etc., all integrated well into your smartphone.

2019-07-18 11:32 am

CaptainN-
Exactly. Google’s algorithm serves as a core of how they do business. They sell ad placement, and their “algorithm” helps them do that – no matter what the interface – search result, banner ads all over the web, devices, apps, whatever. Thinking of Google as a search company demonstrates a massive lack of understanding of Google’s business.

2019-07-16 12:10 am

djitanium
“essential resources”

Search engines/results are not “essential resources”. You will not die from not being able to see pictures of cute dogs or recipes for bread pudding. Crank the hysterics down from 11.

2019-07-16 12:53 am

winter
This shows a lack of understanding of exactly how important search engines are for navigating the internet.

2019-07-16 1:26 pm

Alfman
winter,

This shows a lack of understanding of exactly how important search engines are for navigating the internet.

Exactly. Controlling what people see gives Google extreme power; combine that with troves of personal data and you’ve got one of the most powerful think tanks in existence. They have the power to determine which services and vendors will succeed or fail through hidden algorithms that determine who gets an audience. Google can and does make or break businesses. It’s not just e-commerce anymore; even in the physical world, Google is directing people and traffic. Some may view Google as altruistic, but we should still admit the enormous power they have over our lives. Many regimes can only dream of having Google’s powers. Even though Google hasn’t set out to do evil, money corrupts everything in subtle ways.

2019-07-16 2:59 am

flypig
After reading the OSNews teaser I almost immediately went for my keyboard to comment that I don’t see why Google would mind this at all. With the most obvious API access, Google would still be able to track searches, build up profiles, etc. It could work out well for them. Thankfully I read the full article to see that the author comes to the same conclusion.

However, more generally I don’t see how Google’s invasiveness and power can be reined in without hurting the company’s bottom line. The same for Facebook. Their success is built on ubiquitous data collection/analysis, and the network effect, and it’s hard to see how either of those could be restricted without affecting their business models.

2019-07-16 1:42 pm

Alfman
flypig,

With the most obvious API access, Google would still be able to track searches, build up profiles, etc. It could work out well for them. Thankfully I read the full article to see that the author comes to the same conclusion.

I disagree. If 3rd parties were explicitly allowed to take Google’s index for themselves, it would become trivial for 3rd party services to anonymize access such that Google couldn’t track individuals anymore; at best, Google would only track aggregate data pipes. Furthermore, plenty of companies might download and use parts of the index completely outside of Google’s control. For better or worse, Google would lose the ability to track users of the index.

However, more generally I don’t see how Google’s invasiveness and power can be reined in without hurting the company’s bottom line. The same for Facebook. Their success is built on ubiquitous data collection/analysis, and the network effect, and it’s hard to see how either of those could be restricted without affecting their business models.

I agree. There’s just no way adding more competition wouldn’t hurt their bottom line. They’d still make billions upon billions, just not as many. When we look at all the consolidation that’s happened in tech and elsewhere, we must ask ourselves: is this what we want for the future? We cannot fix these monopolies without implicitly taking away wealth and power at the top. If regulators do nothing, the money and wealth will remain at the top and consolidation will continue to get worse.

2019-07-17 11:54 am

flypig
I disagree. If 3rd parties were explicitly allowed to take google’s index for themselves, it would become trivial for 3rd party services to anonymize access such that google couldn’t track individuals anymore, at best google would only track aggregate data pipes.

In essence this is what StartPage is already doing, so I can’t totally disagree. But I think Google is smart enough — smarter than the people trying to hold it to account — to put together an API and a contract that allows it to continue tracking users. When you consider how much visibility Google already has of the Web (any site using Analytics or running adverts, any user using Android, Chrome or 8.8.8.8, etc.), if competing search engines were required to send back analytics (e.g. which result was selected), Google could tie it all together. I’m not saying a search engine couldn’t obfuscate it, but not all alternative search engines have a privacy focus.

2019-07-17 1:08 pm

Alfman
flypig,

In essence this is what StartPage is already doing, so I can’t totally disagree. But I think Google is smart enough — smarter than the people trying to hold it to account — to put together an API and a contract that allows it to continue tracking users. When you consider how much visibility Google already has of the Web (any site using Analytics or running adverts, any user using Android, Chrome or 8.8.8.8, etc.), if competing search engines were required to send back analytics (e.g. which result was selected), Google could tie it all together. I’m not saying a search engine couldn’t obfuscate it, but not all alternative search engines have a privacy focus.

Well, you’re talking about all the tracking bugs that Google has installed all over the web, and yes, those all help reinforce its monopoly. You’re right though: Google’s monopoly is so powerful today that it has actually become nontrivial to find ways to break it up. That’s the problem with monopolies. By the time we get around to doing anything about them, they’ve already established parasitic roots everywhere. There’s got to be a better way of taking preventative measures against mass consolidation and monopolization. The main issue as I see it is that the regulators are corrupted by the very industries they’re tasked with overseeing, which gives corporate monopolies nearly absolute power. Unfortunately neither party is doing anything to address this corruption, so I don’t really predict a breakthrough.

2019-07-17 2:33 pm

flypig
Yes, that does seem to be the underlying issue, and I also agree with what you say about the broader dangers of monopolies and the difficulties of preventing them in your reply to Brenden below.

2019-07-16 5:03 am

Artem S. Tashkinov
Everyone can index the web, so what the f are we talking about?

Google does NOT have a monopoly on indexing.

2019-07-16 8:48 am

ahferroin7
Indexing is indeed something anybody can do.

The problem though, is that not everybody has:

* The bandwidth required to do so efficiently. Just a rough estimate, we’re talking something on the order of dozens of terabytes of data per month that Google is pulling in _just_ for indexing.
* The storage space to actually keep that index around. Again, this is a huge factor of scale we’re dealing with here, probably on the order of double digit petabytes.
* The hardware required to efficiently utilize that bandwidth and storage. Indexing has to scale out and not up to be efficient, period, so at minimum you’re talking about a decentralized Beowulf cluster, which is not cheap once you factor in the bandwidth and storage requirements.
* The know-how to be able to keep the index up to date in a reasonable manner.
* A sufficiently good reputation that they won’t get blocked by a sizable percentage of websites. There are a lot of sites that only allow a couple of specific search crawlers and explicitly block everything else, because it’s just so much easier that way to keep bandwidth usage reasonable.

Put differently, while indexing is indeed technically easy and freely available, it’s also the most expensive part of actually creating and running a search engine. The front-ends are trivial, and the ranking logic is not hard to get good enough to be acceptable, but the indexing is going to cost you a huge amount of time and money to deal with.
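The last bullet above, about sites that allow only a couple of specific crawlers, can be illustrated with robots.txt, which is one common way (though not the only one) that sites express such a policy. The sketch below uses Python’s standard urllib.robotparser. The robots.txt content, the example.com URL, and the crawler name "ExampleNewBot" are hypothetical and exist only for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is welcome, everyone else is not.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

def is_allowed(robots_txt, user_agent, url):
    """Check whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The established crawler and a made-up newcomer asking for the same page.
googlebot_ok = is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/page")
newcomer_ok = is_allowed(ROBOTS_TXT, "ExampleNewBot", "https://example.com/page")
print("Googlebot allowed:", googlebot_ok)
print("ExampleNewBot allowed:", newcomer_ok)
```

A well-behaved new crawler that honors a policy like this simply never gets to see the site, which is the reputation barrier described in the list.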

2019-07-16 9:40 am

Artem S. Tashkinov
> The know-how to be able to keep the index up to date in a reasonable manner.

> The front-ends are trivial, and the ranking logic is not hard to get good enough to be acceptable, but the indexing is going to cost you a huge amount of time and money to deal with.

I’ve used Bing several times and, no, the ranking logic is anything but trivial. Google invests quite a lot to return relevant results and these two things are even more important than everything else on your list. Bing cannot even search for queries in quotes. This is just ugly.
2019-07-16 1:15 pm

Anonymous
Considering the number of crawlers that hit the sites I control, I don’t think the barrier is that high. Then you have the storage and computational burden needed to manage and process the raw search data. My gut reaction is that if you can handle the second part, then you probably can handle the first part.

2019-07-16 2:06 pm

Alfman
jockm,

Considering the number of crawlers that hit the sites I control, I don’t think the barrier is that high. Then you have the storage and computational burden needed to manage and process the raw search data. My gut reaction is that if you can handle the second part, then you probably can handle the first part.

I think it’s harder and obviously far more time consuming to crawl millions of websites than to index the data. While servers can be expensive, the limiting factor for most individuals would be all the bandwidth. While nothing about crawling is particularly difficult, it’s just very tedious and expensive to do at internet-wide scales. Even google itself is rather slow to crawl all of the pages I’m a webmaster for, consequently it can take months for them to pick up changes. Having immediate access to pre-indexed data could enable a lot of independent projects that wouldn’t otherwise be feasible.

2019-07-17 4:04 am

Brendan
To find the right solution to a problem, the first step is to understand the problem and its root cause. So, what is the problem they’re trying to solve?

On its own, just being a monopoly is not a problem.

Voters being manipulated (by Russia, by “fake news”, by politicians spending millions of $$ on marketing campaigns, by organisations like the NRA, etc) is a problem; but it’s a fundamental flaw in democracy that has always existed and making Google’s index open won’t make any difference at all.

The “search bubble” (e.g. people finding information they already agree with) is also something that has existed for centuries. It used to be your circle of friends; now it’s much wider and it’s a lot easier to stumble across something you don’t agree with. It’s just how society (and its tendency towards cliques) works, and making Google’s index open won’t really make any difference.

In my opinion the 2 biggest problems with Google (and Facebook, etc) are spam and privacy (and the link between them – intrusive data collection practices for the purpose of customizing spam). Making Google’s index open won’t make any difference for this either – it’s more likely to encourage other companies (who start using Google’s search index) to implement their own intrusive data collection practices for the purpose of customizing spam, and likely to make things worse.

What we really need are laws that limit data collection and retention, plus a new version of “the web” with built-in micro-transactions that give content providers a better option than “provide content for free and get funding by assisting spammers/advertisers” and encourage them to “provide higher quality content to attract more $$ from micro-transactions”.

2019-07-17 12:47 pm

Alfman
Brendan,

To find the right solution to a problem, the first step is to understand the problem and its root cause. So, what is the problem they’re trying to solve?

On its own, just being a monopoly is not a problem.

Voters being manipulated (by Russia, by “fake news”, by politicians spending millions of $$ on marketing campaigns, by organisations like the NRA, etc) is a problem; but it’s a fundamental flaw in democracy that has always existed and making Google’s index open won’t make any difference at all.

I think you are right to point out that there are a lot of issues at stake. However, on its own, just being a monopoly IS a problem. Capitalism is no better than any other economic system if it does not produce vibrant competition. Having lots of competition is the whole justification for capitalism: many companies competing against each other provides consumers with the best value. No private company deserves to be a monopoly (or oligopoly for that matter). Competition must always be healthy and viable, otherwise we become dependent on a few ultra-powerful corporations who control everything, and that runs against the core tenets of capitalism. If our values are to mean anything and be more than illusory, then we ought to recognize that monopolies in and of themselves are fundamentally problematic for healthy free markets.

2019-07-18 5:17 am

Brendan
Capitalism is about using greed (the desire to increase profits) as a tool to improve (the availability, price, quality, features, … of) the goods and services available to consumers.

If you decide a monopoly on its own is “bad” and that all companies must be split up and/or punished just for being successful, then you can expect that companies will refuse to improve goods and services to avoid becoming a monopoly. In other words, if you decide a monopoly on its own is “bad” then you’re trying to undermine capitalism.

Things that actually are bad for capitalism are things that prevent fair competition. One of these things is using a dominant position (not necessarily a monopoly) to get an unfair advantage over competitors (e.g. using “bundling”, discounts, etc). There are already laws in place for this (although they’re not necessarily the best laws, and not necessarily enforced well).

There are also a lot of things that are bad in general (false advertising, theft, fraud, …), and a large number of laws in place for those. For some things (privacy) the laws aren’t quite good enough (not updated in response to modern “data mining” practices), but they can and should be fixed. This has nothing to do with whether or not the company violating your privacy is a monopoly, or is dominant, or is neither.

2019-07-24 6:25 pm

Alfman
Brendan,

If you decide a monopoly on its own is “bad” and that all companies must be split up and/or punished just for being successful, then you can expect that companies will refuse to improve goods and services to avoid becoming a monopoly. In other words, if you decide a monopoly on its own is “bad” then you’re trying to undermine capitalism.

Well, we shouldn’t necessarily assume the solution to this problem has to be breaking them up or punishing them per se. We might instead create a “progressive” economy to level the playing field.

There’s always a lot of posturing by those at the top threatening to leave if they’re not so highly compensated, but in reality it’s an empty threat because the top is still the most rewarding position. They make ridiculous sums of money and there are more than enough takers should they step down. We don’t need them and they know it. If they do in fact step down because they lost their monopoly advantages, then so what? We don’t need supreme leaders of industry. If we can open up the market to more and healthier competition, then so much the better! History shows that monopolies always result in abuse and a decline in market opportunities. If incumbents insist on having only a few players sitting at the head of non-competitive markets, then we’re better off letting them drop so we can renew healthier competitive markets with more opportunities.

Things that actually are bad for capitalism are things that prevent fair competition. One of these things is using a dominant position (not necessarily a monopoly) to get an unfair advantage over competitors (e.g. using “bundling”, discounts, etc). There are already laws in place for this (although they’re not necessarily the best laws, and not necessarily enforced well)…

I agree there are other things that prevent fair competition in addition to monopolies.

2019-07-18 11:29 am

CaptainN-
On its own, being a monopoly is a problem….

2019-07-17 3:31 pm

CaptainN-
This is maybe the dumbest idea I’ve read all week. Not only could you not adequately define how this would be done in any useful way, it wouldn’t actually achieve the stated goal at all. In broad terms, the only thing “interesting” about this at all is the general idea that maybe we should do something about a monopoly. And it’s less that it’s interesting and more that I agree with it. Beyond that, complete nonsense.
2019-07-18 4:17 pm

yoko-t
“2019-07-17 3:31 pm
CaptainN-

This is maybe the dumbest idea I’ve read all week. Not only could you not adequately define how this would be done in any useful way, it wouldn’t actually achieve the stated goal at all. In broad terms, the only thing “interesting” about this at all is the general idea that maybe we should do something about a monopoly. And it’s less interesting, and more that I agree with it. Beyond that, completely nonsense.”

It’s worse than this. You’re basically setting things up so that every Tom, Dick, and Harry can steal this data for spamming purposes.

Think things are bad now? Just release this info in this manner and just watch what happens.
2019-07-24 12:41 am

nguyentuanpc
Thanks for sharing! I like it.