The Advent Of ‘Thinking Tokens’ Causes Unforeseen Inflationary Impact On Generative AI



Hidden inside AI LLMs are so-called thinking tokens (TTs). It turns out these TTs have an inflationary impact on generative AI. It's an AI insider scoop.

In today’s column, I examine an AI-insider topic, relatively unknown outside the AI community, that has rather startling inflationary impacts on the overall costs of running generative AI and large language models.

I will walk you through the technical underpinnings and explain the core considerations, which have to do with a controversial approach encompassing “thinking tokens”. This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities.

Construe the tokens as residing on a conveyor belt and moving along on a kind of assembly or processing line. As each token arrives at the core area for processing, the AI is designed to give it a fixed amount of time. The token gets its allotted slice, and then the conveyor belt moves things along. One by one, the tokens coming up to the processing spot all get their same and fair share of time.

But what if the tokens are part of a really tough question that is being addressed? The fixed amount of time might not be sufficient for the AI to consider a larger range of possibilities. In a sense, the fixed amount of time is going to stifle the AI from doing enough processing to figure out a more well-rounded answer.

Which do you think is more important: getting the AI to work swiftly, or getting better answers? Let’s assume that users are more likely to tolerate a bit of latency if they are getting better answers to their questions, particularly when the AI is handling tough questions.

A means to give the AI more processing time would be to toss into the mix a special kind of token that somewhat pauses the streaming of the answer and allows the internal processing to get an added stint of computer time. Here’s what we can do. A special token that serves only to provide breathing room will be added to the assembly line of tokens moving along on the conveyor belt. The special token has no other notable purpose and requires no processing of its own.
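To make the conveyor-belt idea concrete, here is a minimal Python sketch of how a pause-style token could be spliced into a token stream. The `<TT>` marker, the function name, and the insertion policy are illustrative assumptions for this column, not any vendor's actual implementation.

```python
# Illustrative sketch only: splice a pause-style "thinking token" into a
# token stream so the model gets extra compute between real tokens.

PAUSE = "<TT>"  # hypothetical thinking-token marker

def insert_thinking_tokens(tokens, every=1):
    """Insert PAUSE after every `every`-th real token."""
    padded = []
    for i, tok in enumerate(tokens, start=1):
        padded.append(tok)
        if i % every == 0:
            padded.append(PAUSE)
    return padded

stream = ["The", "dog", "barked", "at", "the", "cat"]
print(insert_thinking_tokens(stream, every=2))
# A pause after every real token (every=1) doubles the stream length, and
# hence roughly doubles the total fixed-slice processing time:
print(len(insert_thinking_tokens(stream, every=1)))  # 12, vs. 6 originally
```

The `every` knob stands in for the judgment call discussed below: sprinkle in a few pauses for a tough prompt, or none at all for an easy one.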
All that this token does is act as an “uh” or “you know” and create a kind of gap between the tokens that are flowing along. Researchers tried this approach and observed that the special tokens were indeed giving the AI added processing time to cope with the true tokens. We might therefore put these special tokens into a stream of tokens so that, from moment to moment, the AI is getting boosted processing time.

I will illustrate this with a brief example. Assume that the AI is processing the sentence “The dog barked at the cat.” Let’s assume that each word is turned into a token, and each of those tokens will get the same fixed amount of processing time inside the AI. Now suppose we insert special tokens, yielding something like “The dog <TT> barked <TT> at the cat.” The way to interpret this is that after the words “The dog” there is a special token that does nothing other than allow the AI to further process the “The dog” part of the sentence. The same thing happens after encountering the word “barked”. The special token itself does not chew up any time. It merely creates a breathing space for the AI to continue the processing it already had underway.

How many of these special tokens should be inserted? Well, it depends. If the prompt entails an easy question, it might be prudent not to toss any of the special tokens into the mix. Just let the AI do its usual processing. If the user has provided a tough question, we might place a few special tokens into the mix. We might even go so far as placing a special token after every token of the prompt, getting us a lot of added processing time, for example, “The <TT> dog <TT> barked <TT> at <TT> the <TT> cat <TT>.” That is going to have the AI essentially double the amount of processing time. This doesn’t guarantee a better response by the AI, but if the prompt entails something tough, it might end up producing a better answer.

Some enterprising researchers went ahead and experimented with the inclusion of special tokens into the internal process of generative AI and LLMs. I’ll show you their initial findings. Before I do so, I’ll mention a small fracas about this.
They could have referred to the special tokens as pausing tokens, or take-a-break tokens, but instead they named the special tokens so-called thinking tokens. Not everyone likes that naming convention. Here’s why. The takeaway that you might have is that these tokens somehow intrinsically embody thinking, but the real purpose is to shift the AI away from processing the special token and instead allot more time to the other, true tokens that are being processed.

Furthermore, there are other kinds of special tokens that are intended to provide added value intrinsically by themselves. You might be tempted to refer to those tokens as thinking tokens. To avoid that kind of confusion, the general parlance is that those are reasoning tokens. The naming problem regrettably gets worse, since an AI developer might sloppily refer to reasoning tokens as thinking tokens or refer to thinking tokens as reasoning tokens. It’s messy.

The now-classic research paper that caught attention on the thinking tokens topic was entitled “Thinking Tokens for Language Modeling” by David Herel and Tomas Mikolov, which made these salient points:

“Our approach is to introduce special ’thinking tokens’ after each word in a sentence whenever a complex problem is encountered.”

“The core idea is that each ’thinking token’ would buy more time for the model before an answer is expected, which would be used to run additional computations to better answer a complex problem that was presented.”

“This concept has great potential in recurrent neural networks due to their architecture, because it enables the RNN to perform multiple in-memory operations in a single step, meaning that extra calculations can be run in the hidden layer multiple times.”

“Experiments execution has successfully produced numerous examples where the usage of ’thinking tokens’ leads to an improvement in the model’s judgment.
Preliminary results show that sentences that require non-trivial reasoning have the biggest improvement in perplexity when ’thinking tokens’ are used compared to the standard model.”

The beauty of this approach is that you don’t have to do much to implement it. You don’t need to utterly rejigger the guts of the AI system. Instead, just make a modest modification to allow for a special token that, when encountered, is not processed for itself and instead allows more processing time for the other nearby real tokens.

Since this clever trickery tends to get better answers from the AI, and you don’t need to completely turn the AI upside down to adopt the approach, many AI makers were eager to put it into practice. Envision that we go ahead and add the special tokens to all manner of user prompts being entered into the AI. The user doesn’t see that we are doing so. It is all done internally within the AI. A user is going to potentially experience added latency. Their answer doesn’t appear as quickly as it might have. Maybe the user notices, maybe they don’t.

Another factor is cost. The AI is doing more processing because the special token is prodding it to do so. If a user is paying by the number of tokens processed or by the amount of computing time consumed, they are going to see an increase in their usage billing.

An individual user might not be able to discern an uptick in their billing or realize that their answers are taking slightly longer to appear. But what about when we have thousands of users, or maybe millions? You might be aware that OpenAI touts the claim that they have 800 million weekly active users. It’s an impressively large number. The gist is that at scale, the consumption of computer processing time is bound to be a lot higher than it otherwise would have been. This is happening on a global basis. Servers in vast data centers are churning away, including doing so because of the inserted special tokens.
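To get a feel for the at-scale effect, here is a back-of-envelope cost sketch. Every figure in it (the user count, token volume, per-token price, and the hidden-overhead fraction) is a made-up assumption purely for illustration and does not reflect any provider's real pricing or real overhead.

```python
# Back-of-envelope sketch of how hidden pause tokens could inflate usage
# billing at scale. All numbers are illustrative assumptions.

def monthly_token_cost(users, tokens_per_user, price_per_1k_tokens, hidden_overhead):
    """Fleet-wide monthly cost, inflating token volume by a hidden overhead fraction."""
    effective_tokens = tokens_per_user * (1 + hidden_overhead)
    return users * effective_tokens / 1000 * price_per_1k_tokens

baseline = monthly_token_cost(1_000_000, 50_000, 0.002, hidden_overhead=0.0)
inflated = monthly_token_cost(1_000_000, 50_000, 0.002, hidden_overhead=0.25)
print(f"${baseline:,.0f} vs ${inflated:,.0f}")  # $100,000 vs $125,000
```

The point of the arithmetic is that an overhead a single user would never notice becomes a large absolute sum once multiplied across millions of users.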
The electrical energy consumed goes up, as does the use of whatever cooling method might be employed, such as water cooling.

One perspective is that the AI makers ought to be more judicious in how they use thinking tokens. Some are very careful; others are more loosey-goosey. The crux is that there is an ROI or tradeoff in employing the thinking tokens. Use them sparingly, some insist. Others reply that thinking tokens simply reflect the natural need to ensure that enough processing time is being devoted to what users ask. Use thinking tokens adroitly and don’t worry about the worrywarts.

Some researchers have critically assessed the value of thinking tokens and emphasized that other approaches might be a better way to go. One such study asserted that the use of chain-of-thought gets you more bang for the buck, as noted in “Rethinking Thinking Tokens: Understanding Why They Underperform in Practice” by Sreeram Vennam, David Valente, David Herel, and Ponnurangam Kumaraguru, arXiv, November 18, 2024, per these points:

“Thinking Tokens have been proposed as an unsupervised method to facilitate reasoning in language models.”

“However, despite their conceptual appeal, our findings show that TTs marginally improves performance and consistently underperforms compared to Chain-of-Thought reasoning across multiple benchmarks.”

“We hypothesize that this underperformance stems from the reliance on a single embedding for TTs, which results in inconsistent learning signals and introduces noisy gradients.”

“When a single embedding is used for TTs, during backpropagation the model receives inconsistent learning signals, leading to noisy gradient updates.
This noise disrupts learning, particularly in tasks that require structured intermediate steps, such as arithmetic reasoning or multi-hop commonsense tasks.”

I am assuming that you might be somewhat startled to now realize that there are hidden tokens potentially stoking greater consumption of computing processing, and that these out-of-view tokens are considered an inflationary element of modern-era AI. As the famed Arthur Conan Doyle once remarked: “It has long been an axiom of mine that the little things are infinitely the most important.” That is equally true when it comes to what is happening inside contemporary AI.


ForbesTech


 
