Google puts publishers between a rock & a hard place over AI

Publishers complain that Google’s dominance of internet search has put them “between a rock and a hard place” when it comes to the use of their copyrighted work as the basis for its artificial intelligence models.

They claim that Google has a huge advantage over its competitors because businesses are afraid that if they block Google’s AI “crawlers” they will lose valuable traffic.

Most publishers, including media organisations, have blocked OpenAI’s crawler, the bot that sucks up their content and feeds it to ChatGPT. They fear, however, that blocking Google’s equivalent, which supplies its Bard chatbot, would hurt them in the long term by making their information harder to find through traditional Google search.

One person said: “We do not want to take any action that will result in less traffic when Google combines AI with search. So we have turned off OpenAI’s crawler, but we haven’t turned off Google’s. We are caught between a rock and a hard place.”

Google announced at the end of last year that it would separate its crawlers so that publishers can choose whether their data is scraped for its AI systems or only for its search engine. Its search engine also has a new version, Search Generative Experience (SGE), a hybrid of generative AI paragraphs and traditional search results. Publishers fear that if they block Google’s crawlers they will be removed from those results pages, and copyright holders say they are powerless in the dispute because they rely so heavily on Google for traffic.

Owen Meredith, chief executive of the News Media Association in Britain, said: “Individual publishers will inevitably take a business view as to whether they opt out or not based on their own individual business model. Many publishers will face the challenge of interdependency on Big Tech platforms in every aspect of their business, from advertising to discoverability. Publishers could feel vulnerable about the reaction of Big Tech if they opt out.”

Google says it understands the importance of generative AI returning traffic to content creators and believes the new feature will provide more options for users searching for information. “As we continue to develop LLM-powered features, we will continue to prioritise experiences that send valuable traffic to the news ecosystem,” a spokesperson said.

The spokesperson added that Search Generative Experience is intended to provide a starting point for users to explore the web, with Google showing more links in search and on the results pages, drawn from a wider variety of sources, to create new possibilities for content discovery.

ChatGPT, launched in November 2022, brought generative AI into the public consciousness. Since then OpenAI, which is backed by Microsoft, together with Google and Meta, has dominated the market: these companies have the resources and computing capacity needed to build the large language models that underpin engines able to produce everything from images to natural-sounding text.

The rights to content used in AI are a major issue for the creative industries and technology companies around the world. In Britain the Intellectual Property Office was initially given the job of bringing the two sides to an agreement; when those talks proved unsuccessful, the task passed to the Department for Science, Innovation and Technology.

There are also several court cases that will help shape the debate. In the United States, The New York Times has sued OpenAI for alleged copyright violations, claiming that the company used the newspaper’s content to train its artificial intelligence models without permission or compensation, undermining the paper’s business model and threatening independent journalism. In Britain, Getty Images has sued Stability AI in the High Court for copyright infringement, claiming it “unlawfully scraped” millions of images to train its picture-creation system.

Search engines use crawlers to scan and index the content of web pages across the internet (Adam MacVeigh and Katie Prescott write).

Imagine a crawler working as a librarian in an incredibly large library. The librarian catalogues each book (webpage) and keeps notes on its main themes and chapters (subpages). When asked about a particular topic, they can search the catalogue to find the relevant pages and subpages.
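
The “librarian” idea can be sketched in a few lines of code. The example below is a rough illustration rather than how any real search engine works: it fetches a page, files it in a word-to-page catalogue and follows its links. The start URL and page limit are arbitrary choices.

```python
# A minimal "librarian" sketch: fetch pages, catalogue their words, follow their links.
# Standard library only; the start URL and limits are illustrative.
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects the hyperlinks and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def crawl(start_url, max_pages=5):
    """Breadth-first crawl that builds a word -> pages 'catalogue'."""
    catalogue = defaultdict(set)
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except OSError:
            continue                           # skip pages that cannot be fetched
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in " ".join(parser.text).lower().split():
            catalogue[word].add(url)           # file this page under each word it contains
        for link in parser.links:
            queue.append(urljoin(url, link))   # follow links, just as a reader could
    return catalogue


if __name__ == "__main__":
    # A real crawler would also respect robots.txt (see below).
    index = crawl("https://example.com")
    print(sorted(index.get("domain", set())))
```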

A search engine’s index is constantly updated with new information and webpages. Newer, more popular pages may be ranked higher in search results, while older or less relevant “books” sit at the back of the stacks.

It is difficult to estimate the number of websites in existence, but most sources agree that only a small percentage are indexed and the vast majority are not.

There are steps that can be taken to stop crawling, but no single method solves the problem: if a person can navigate from your home page to another page, a crawler will be able to do so too.

Websites can give crawlers instructions on whether or not they may access their content, and they are increasingly doing so to tell AI companies that the information cannot be used in large language models. Most reputable companies follow these rules and do not crawl the information of sites that ask them not to.
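
These instructions are usually published in a file called robots.txt at the root of a site. Below is a minimal sketch, using Python’s standard-library robots.txt parser, of how a publisher might refuse OpenAI’s GPTBot and Google’s Google-Extended token (the control Google introduced for AI training) while leaving other crawlers alone; the domain and URL are invented for illustration.

```python
# Sketch of the robots.txt opt-out mechanism using Python's standard library.
# GPTBot (OpenAI) and Google-Extended (Google's AI-training token) are real
# user-agent tokens; the publisher domain below is made up.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GPTBot", "https://example-publisher.com/story"))           # False
print(parser.can_fetch("Google-Extended", "https://example-publisher.com/story"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example-publisher.com/story"))     # True
```

Compliance is voluntary: robots.txt is a convention that well-behaved crawlers choose to follow, not a technical barrier.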

Search engines can block users who repeatedly violate their guidelines based on internet protocol (IP) address, the unique number assigned to every internet-connected device. If multiple requests are received from the same IP in a short period, it can be detected and temporarily blocked.
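
The throttling idea can be sketched with a simple sliding-window counter, as below. The window length, request limit and block duration are invented for illustration and are not any search engine’s actual policy.

```python
# Hedged sketch of IP-based rate limiting: count recent requests per IP and
# temporarily block an address that sends too many. All limits are illustrative.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # look at the last minute of traffic
MAX_REQUESTS = 100       # hypothetical allowance per window
BLOCK_SECONDS = 600      # hypothetical length of a temporary block

_recent = defaultdict(deque)   # ip -> timestamps of recent requests
_blocked_until = {}            # ip -> time when its block expires


def allow_request(ip, now=None):
    """Return True if this IP may be served, False if it is (still) blocked."""
    now = time.time() if now is None else now
    if _blocked_until.get(ip, 0) > now:
        return False                              # inside an existing block
    window = _recent[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                          # forget requests outside the window
    window.append(now)
    if len(window) > MAX_REQUESTS:
        _blocked_until[ip] = now + BLOCK_SECONDS  # too many requests: block temporarily
        return False
    return True


if __name__ == "__main__":
    # 101 rapid requests from one address: the last one is refused.
    answers = [allow_request("203.0.113.7", now=1000.0) for _ in range(101)]
    print(answers[-1])   # False
```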

Devices also generate a unique fingerprint based on a combination of hardware, software and device settings. This fingerprint helps to identify someone who makes multiple requests while trying to hide their identity with a VPN or by deleting cookies, and it stops users circumventing blocks simply by changing their IP address.
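
A simplified sketch of the fingerprinting idea is shown below: the attributes a browser exposes are combined into a single hash that stays the same even when the visitor switches IP address or clears cookies. The attribute names and values are purely illustrative.

```python
# Hedged sketch of device fingerprinting: hash a set of browser/device
# attributes into a stable identifier. All field names and values are invented.
import hashlib
import json


def fingerprint(attributes):
    """Hash a dictionary of device and browser attributes into a short ID."""
    canonical = json.dumps(attributes, sort_keys=True)   # stable key ordering
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


visitor = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "accept_language": "en-GB,en;q=0.9",
    "screen": "2560x1440",
    "timezone": "Europe/London",
    "installed_fonts": 312,
}

print(fingerprint(visitor))             # same attributes -> same ID
visitor_after_vpn = dict(visitor)       # switching IP or clearing cookies does not
print(fingerprint(visitor_after_vpn))   # change these attributes, so the ID persists
```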

There are services, however, that rotate both IP addresses and fingerprints to avoid detection. Scrapers and content producers are in a constant race for technological superiority.

If someone wants to use your content, they will find a way.