The AI has learned to encode messages that humans are not capable of understanding

This discovery isn't official, but it's still frightening.

The AI has learned to encode messages that humans are not capable of understanding
Chema Carvajal Sarabia

Chema Carvajal Sarabia

  • Updated:

AI is becoming more present in our lives every day. And it’s becoming smarter every day. Now, we’ve discovered that large language models can write between the lines, encode messages right under our noses without us even realizing it.


A recent study (though not yet peer-reviewed) conducted by the AI alignment research group Redwood Research discovered that large language models (LLMs) are incredibly adept at a type of steganography nicknamed “encoded reasoning.”

Essentially, according to the study, LLMs can be trained to use secret messages that conceal their thinking processes step by step, a practice that, intriguingly, could make their outputs more accurate and simultaneously more deceptive.

The danger posed by this discovery is significant

The encoded reasoning could undermine the transparency of the model’s reasoning process and make it harder to supervise the behavior of AI systems.

But what exactly is encoded reasoning? A widely adopted technique to enhance the accuracy of Large Language Models (LLMs) is Chain of Thought reasoning (CoT). This method instructs the model to unveil its logic step by step, leading to the final response. Studies show that CoT improves the performance of LLMs in many tasks and aids in interpreting their outputs.

However, “encoded reasoning” allows models to benefit from the performance boost of CoT reasoning without revealing the logic behind their outputs.

The authors acknowledge that learning encoded reasoning poses a challenge because this type of reasoning rarely appears in the pretraining data of LLMs.

The reasoning encoded in practice

The researchers present an example of a proof of concept in which models learn encoded reasoning. They fine-tuned GPT-3.5-Turbo to learn a simple scheme of steganographic encoding. This scheme hides the model’s chain of thought while generating much more convincing responses than those produced by a model trained to directly output its final answer.

The researchers discovered that fine-tuning LLMs could promote encoded reasoning for two reasons. Firstly, correct responses are usually rewarded. Secondly, some intermediate steps of reasoning might be penalized because they either elongate the outcome or do not align with user expectations, even if they lead to correct answers.

They also observed that LLMs trained on their own generations, such as those trained with reinforcement learning, can learn to execute behaviors that are rare in human text, as long as they are rewarded for doing so.

This isn’t unprecedented. In 2016, DeepMind’s AlphaGo, initially trained on games played by humans, learned to surpass its human trainers and even discovered moves that baffled human experts in that age-old game.

“The LLMs could learn to do steganography if they are capable of exploring their way to it,” the researchers write.

The researchers caution that encoded reasoning could undermine the transparency of the AI’s reasoning process. It could also facilitate hidden communication between LLM instances, especially when multiple LLMs interact within an AI agent ecosystem.

“Both capabilities would undermine the oversight of advanced AI systems that will be developed in the coming years,” they warn.

According to the study, paraphrasing is a potential countermeasure against encoded reasoning. This strategy involves a defender attempting to rephrase and standardize the LLM’s response in a way that removes hidden information while preserving valuable semantic information. The difference between encoded and paraphrased responses can then be used to decipher the hidden information.


The researchers employed two paraphrasing techniques. The default technique instructed GPT-3.5-Turbo to rephrase the original response while preserving semantic information. A second method, called “contextually aware paraphrasing,” included the original question in the prompt and directed GPT-3.5-Turbo to rephrase the response while retaining only the information relevant to the original question.

Their experiments showed that, for certain tasks, paraphrasing can effectively remove hidden information from the LLM’s output without compromising its quality.

However, research in this area of LLMs is still in its early stages, and there is much work to be done. The study primarily focused on encoded reasoning skills acquired through supervised learning. Future research could expand on this proof of concept and explore when and how LLMs trained with reinforcement learning might develop their own steganographic strategies.

Chema Carvajal Sarabia

Chema Carvajal Sarabia

Journalist specialized in technology, entertainment and video games. Writing about what I'm passionate about (gadgets, games and movies) allows me to stay sane and wake up with a smile on my face when the alarm clock goes off. PS: this is not true 100% of the time.

Latest from Chema Carvajal Sarabia

Editorial Guidelines