The internet is awash with fears over the impact AI-powered products and services like ChatGPT will have on life as we know it. Will large language models (LLMs) take over everyone’s jobs? Will AIs control lethal autonomous weapons, or even lead to the extinction of humanity? These are important questions, but here’s a simpler one: what impact will the incorporation of LLM technology into search engines have on the web?
Creative Destruction Comes Around
To answer that question, let’s take a look at recent history. Twenty-five years ago, newspapers were a highly profitable business. But the advent of free news sites drove many out of business, and forced most of the rest to restructure their operations. Web sites that offered news as a side hustle (like Meta and Google) continue to prosper, but the pure news sites that disrupted the incumbents are now under financial pressure themselves, with Vox announcing layoffs and Vice Media – once valued by venture capitalists at $5.7 billion – filing for bankruptcy after failing to find a buyer at any price.
Meanwhile, what about the millions of blogs and other content sites that blossomed during the internet age, many of which have become trusted sources within their areas of specialization? Take, for example, SCOTUSblog, one of the most respected sources of news and analysis about the U.S. Supreme Court. Originally a labor of love by a husband and wife team, it now offers up the work of up to 100 authors a year, free of charge, as a public service.
Until now, such dedication made sense, because traditional search engines sent traffic directly to the source. Users could discover sites like SCOTUSblog and add them to their list of trusted authorities. But what if search engines no longer display those sites at all?
Citations Are About More Than Credibility
Thus far, the dialogue around whether LLM-powered search engines should cite sources has focused on allowing users to verify the summaries they receive. That’s an important feature, but equally important is whether rich data sets will continue to exist and be updated so that search engines can access fresh and reliable information.
That’s a real concern, because if LLM-powered search engines summarize sources without footnoting them, what motivation will the sites they scrape have to soldier on? Given today’s dramatically reduced newsrooms at for-profit news sites, the result is likely to be yet more news voids, with more of what we read coming from fewer and more broadly syndicated sources.
And Also About More Than Money
Previously, the public wars over scraping data from news sites have focused on whether news aggregators like Meta and Google should be required to share advertising profits with their sources. In some countries abroad, they have been made to do so. Now big content owners like Getty Images are asserting similar claims when their copyrighted materials are used without their permission to train LLMs. Perhaps Congress or the courts will come to the rescue and provide relief, but that remains to be seen.
For the millions of small sites, though, it’s not about money at all, but survival. While some have found ways to monetize their content, the only reward the vast majority of small sites (like this one) receive is the attention and respect their efforts earn. Take that away, and these authors are relegated to the role of ghostwriters for everyone from venture capital-backed startups to the largest technology companies in the world. And that’s a dead-end street.
The Answer Is to Supplement Rather Than Replace Traditional Search
As the debates swirl over whether and how large language models must compensate the sources of their training data, users and content providers should unite to demand that chatbots supplement traditional search results rather than replace them. Even with such a requirement, fewer and fewer search engine users can be expected to consult original sources as chatbots get better and better. But without any visibility to search engine users, those users will have no choice but to trust the bots, and content creators will have no reason to create the information without which the bots will be useless.
If that happens, we’ll all be poorer for it – and that will be the outcome if this issue isn’t given the attention it deserves.