The capabilities of artificial intelligence systems are increasing with each passing day. From "simply" conversing with us on trivial topics, AIs are now also able to see, hear and recognize us.
But, as we have explained on other occasions, for this to happen, companies must train their language models with huge amounts of data: data that, in many cases, comes from people, companies and websites that have not given their express consent.
Just today we were talking about how Meta has used its Facebook and Instagram users' posts to train Meta AI, its new chatbot. Google is another big technology company that does the same, but using all the websites it indexes. As of this week, however, there is a way to keep Google from using your website to train Bard or other language models.
In a blog post, Google announced the launch of Google-Extended, a new control that publishers can use to allow or disallow data from their sites to be used to train Bard and Vertex AI, as well as “future generations of models that power these products.”
In other words, publishers who want to prevent Google from using their web content to train its artificial intelligence need only disallow the "Google-Extended" user agent in their robots.txt file, the document that tells automated web crawlers what content they can access.
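Based on Google's announcement, a minimal robots.txt along these lines should opt a site out; the specific paths shown are hypothetical, while the Google-Extended token itself comes from Google's post:

```
# Block Google-Extended, the crawler token Google checks before
# using a site's content for Bard and Vertex AI.
User-agent: Google-Extended
Disallow: /

# Regular Search crawling is controlled separately: Googlebot
# can still index the site as before.
User-agent: Googlebot
Allow: /
```

Because Google-Extended is a standalone user agent token, blocking it does not affect how the site appears in Google Search results.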
Google never uses the word "train" in its blog post; instead it talks about "helping to improve generative APIs". A euphemism that clearly tries to soften the bad reputation this and other companies are earning for not asking people's express permission to train their AIs. "Train our AIs"? No, no: "help us improve our tools".