Experimenting with budget forcing and test-time scaling [WIP]
[IMPORTANT]: Actively working on this blog.

From the paper (https://arxiv.org/pdf/2501.19393), Figure 3: "Budget forcing with s1-32B. The model tries to stop after '…is 2.', but we suppress the end-of-thinking token delimiter, instead appending 'Wait', leading s1-32B to self-correct its answer."

Interesting nuggets from the s1 paper (methodology)

The authors collected a dataset of 1k examples with reasoning traces from a Google Gemini model and performed SFT (supervised fine-tuning). At test time they control response length: appending "Wait" when the model tries to stop nudges it to generate a longer chain of thought, verify, and correct itself, while forcing the EOT (end-of-thinking) token delimiter halts token generation early. The authors call this technique "budget forcing."

Budget forcing

Not to be picky or pedantic, but budget forcing (BF) is still not a parallel inference-scaling technique (as seen in o1 or Gemini Thinking). As the authors point out, BF is better thought of as a sequential inference-scaling technique. Even with the "Wait" and end-of-thinking delimiter tokens inserted at the appropriate steps, the model still generates one token at a time; the only difference is the total number of tokens it produces. ...
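To make the mechanism concrete, here is a minimal sketch of sequential budget forcing on top of the Hugging Face transformers API. The model name, the "</think>" delimiter string, the "Wait" continuation string, the prompt format, and the chunked-generation loop are assumptions for illustration, not the exact s1-32B setup or serving stack.

```python
# Minimal budget-forcing sketch (assumed: model name, "</think>" as the
# end-of-thinking delimiter, "Wait" as the continuation string).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-32B-Instruct"  # placeholder; swap in your reasoning model
END_OF_THINKING = "</think>"              # assumed end-of-thinking delimiter
WAIT_STRING = " Wait"                     # appended instead of the delimiter
MAX_WAITS = 2                             # how many times we suppress the delimiter

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def budget_forced_generate(prompt: str, chunk_tokens: int = 512) -> str:
    """Sequentially extend the reasoning trace: whenever the model emits the
    end-of-thinking delimiter, strip it and append "Wait" (up to MAX_WAITS
    times) so the model re-examines and possibly corrects its reasoning."""
    text = prompt
    for attempt in range(MAX_WAITS + 1):
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=chunk_tokens, do_sample=False)
        new_text = tokenizer.decode(
            out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False
        )
        if END_OF_THINKING in new_text and attempt < MAX_WAITS:
            # Budget forcing: suppress the delimiter and keep the model thinking.
            text += new_text.split(END_OF_THINKING)[0].rstrip() + WAIT_STRING
        else:
            # Either the wait budget is spent or the model never tried to stop;
            # keep whatever was generated and return.
            text += new_text
            break
    return text

# Example (assumed prompt format): ask a question, then open the thinking block.
print(budget_forced_generate("How many of the first 10 primes are even?\n<think>\n"))
```

The same knob works in the other direction: to cap thinking, the end-of-thinking delimiter can be appended early so the model stops reasoning and moves on to its answer, which is the "halt token generation" case described above.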