
    SmolLM3 - A Small Reasoner with Tool Use.

    The Hugging Face team has recently unveiled SmolLM3, a new 3 billion parameter model that marks a significant advance in local model performance and design. The model stands out not only for its impressive capabilities at its size but also for the unusual transparency of its development, with Hugging Face releasing a comprehensive "blueprint" detailing every step of its training. SmolLM3 is designed to run efficiently on a range of local devices, including mobile phones, making advanced AI accessible without relying on proprietary models. It introduces architectural features such as NoPE (dropping rotary positional embeddings in a subset of layers) and a dual-think reasoning system that lets users control the model's reasoning depth. Training covered an extensive 11 trillion tokens, distributed across a three-phase pre-training process that emphasizes different data types such as web content, code, and mathematics. The model also incorporates advanced alignment techniques, including a variant of DPO, and uses model merging to consolidate checkpoints. Its ability to perform function calling and agentic tasks makes it a promising candidate for local AI applications.

    SmolLM3 Model Overview

    Hugging Face has launched SmolLM3, a 3 billion parameter (3B) model available in base, instruct, and ONNX versions. Community-driven GGUF versions are also emerging for use with Ollama and LM Studio. The model sits between Qwen 3 1.7B and Qwen 3 4B in size, making it suitable for deployment on many mobile devices. Preliminary benchmarks indicate that it outperforms older models such as Qwen 2.5 3B and Llama 3.2 3B. The model was trained on an impressive 11 trillion tokens, a testament to the extensive computational effort involved. Hugging Face claims it represents the state of the art for 3B models and is highly competitive even with 4B models. A notable feature is its dual-think reasoning system, which allows users to enable or disable reasoning.

    This is a new model that came out today from Hugging Face, actually trained and created by Hugging Face. It builds on the previous SmolLM models they've released in the past, and it's slightly bigger than the earlier ones at 3B.
    While advertised as multilingual, the current implementation supports only six European languages, which may not meet broad expectations for "multilingual" functionality. Furthermore, SmolLM3 supports a long context window of up to 128K tokens, with potential for up to 256K, though its effectiveness at the higher end is yet to be fully determined.

    The Blueprint: An Unprecedented Level of Transparency

    One of the most significant aspects of the SmolLM3 release is the accompanying "blueprint," which details every step of the model's training process.

    And this blueprint is exactly how they've done each step of the training. Even with the open-weights models from DeepSeek and Qwen, they're usually quite good at telling us at least roughly what they did, unlike the papers coming out of the proprietary labs, which really aren't giving you any real facts and at best drop a vague clue every now and then. This is a full blueprint of exactly what they did at each step, from the pre-training recipe, how they set up the distributed training, and how they designed the model architecture, right through to things like the long context training and all the post-training recipes for putting this together.
    This level of transparency sets it apart from many other open-source models, which often provide only general insights into their methodologies, and stands in stark contrast to proprietary models that typically offer only vague details. The blueprint covers everything from the pre-training recipe and distributed training setup to the model architecture, long context training, and all post-training recipes. This comprehensive documentation provides invaluable information for researchers and developers looking to understand and replicate the training process of small language models.

    Architectural Innovations and Training Details

    SmolLM3 uses an architecture similar to some Llama 3 designs, incorporating Grouped Query Attention (GQA). A particularly interesting choice is NoPE, which simplifies positional handling by dropping Rotary Positional Embeddings (RoPE) from a subset of layers, leaving those layers to rely on the causal attention mask alone for position information. The development team also drew on projects like OLMo 2 for improving training stability, specifically by removing weight decay from embedding layers. The training budget for SmolLM3 was remarkably efficient: 384 NVIDIA H100 GPUs for 24 days, roughly 220,000 GPU hours. This translates to a training cost in the few hundreds of thousands of dollars, far below the millions often associated with training larger, proprietary models. The cost-effectiveness is partly attributable to falling rental prices for H100 GPUs, making advanced model training more accessible.
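As a rough illustration of the NoPE idea described above, the sketch below marks which transformer layers keep RoPE and which drop it. The every-fourth-layer interval is an assumption made here for illustration; the blueprint specifies the exact pattern the team used.

```python
# Sketch of a NoPE layer pattern: rotary positional embeddings (RoPE) are
# dropped from a regular subset of layers, which then rely on the causal
# attention mask alone for position information.
# NOPE_INTERVAL = 4 is an assumed interval, not a confirmed detail.

NOPE_INTERVAL = 4  # assumed: one NoPE layer per four transformer layers

def uses_rope(layer_idx: int) -> bool:
    """Return True if this layer applies RoPE, False if it is a NoPE layer."""
    return (layer_idx + 1) % NOPE_INTERVAL != 0

# Layer plan for the first eight layers under this assumption.
layer_plan = ["RoPE" if uses_rope(i) else "NoPE" for i in range(8)]
print(layer_plan)
```

Under this scheme positional information still reaches the NoPE layers indirectly, through the representations produced by the RoPE layers before them.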

    Data Mixes and Pre-training Phases

    The blueprint reveals a sophisticated three-phase pre-training strategy with an annealing phase at the end, which is a rare level of detail for open model releases.

    So, it's really nice to see here that they're talking about doing a three-phase pre-training with annealing at the end of that, but more interestingly, looking at the splits of data as they go through this. You can see at the start it's very web-heavy for that long first phase, but then in phase two and phase three, they actually increase the code and increase the math quite a lot, just for this pre-training phase.
    The initial phase heavily relies on web data, followed by successive phases (Phase 2 and Phase 3) that significantly increase the proportion of code and mathematical data. This structured approach helps in building a versatile model capable of handling diverse tasks. The reasoning capabilities of SmolLM3 appear to be heavily influenced by existing high-quality reasoning traces, specifically those from DeepSeek R1 and Qwen 3, used to generate synthetic reasoning data. While some reinforcement learning with verifiable rewards (RLVR) might be involved, it does not seem to be as extensive as at some other labs. The model also leverages a variant of Direct Preference Optimization (DPO) for alignment and employs model merging to combine various checkpoints into an optimized "super checkpoint." Hugging Face has made the base and instruction-tuned models publicly available, and there's hope that intermediate checkpoints from each training phase will be released in the future to facilitate further research.
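The checkpoint-merging step can be sketched as simple parameter averaging. This is a minimal illustration, not the recipe from the blueprint: real merges operate on full weight tensors (typically with torch), and the weighting scheme need not be uniform. Plain floats stand in for tensors here.

```python
# Minimal "model soup"-style sketch: average parameters with the same name
# across several checkpoints to form one merged checkpoint.
# Uniform averaging is an assumption for illustration.

def merge_checkpoints(checkpoints: list[dict[str, float]]) -> dict[str, float]:
    """Average each named parameter across all given checkpoints."""
    merged = {}
    for name in checkpoints[0]:
        merged[name] = sum(ckpt[name] for ckpt in checkpoints) / len(checkpoints)
    return merged

# Two toy checkpoints with scalar "weights" standing in for tensors.
ckpt_a = {"layers.0.weight": 0.25, "layers.1.weight": -0.5}
ckpt_b = {"layers.0.weight": 0.75, "layers.1.weight": 0.0}
super_ckpt = merge_checkpoints([ckpt_a, ckpt_b])
print(super_ckpt)  # {'layers.0.weight': 0.5, 'layers.1.weight': -0.25}
```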

    Code Demo and Performance Insights

    Setting up SmolLM3 for local use is straightforward, integrating seamlessly with Hugging Face's Transformers library, SGLang, and vLLM. The dual-think reasoning system is controlled through the system prompt: by default the model produces detailed thought processes, often appearing as long, multi-step thinking tags before the final answer, while adding "/no_think" to the system prompt yields direct answers without visible reasoning. While the thinking can be extensive, particularly for planning tasks like outlining a successful lemonade stand, it doesn't always break down into numbered sections the way some DeepSeek and Qwen 3 models do. For certain tasks, such as code generation or factual recall, the model may produce empty thinking or omit the reasoning output altogether, suggesting it skips the thought process for simpler or memorized information. Despite this, the quality of the thinking, when present, is impressive for a 3B model, often providing comprehensive breakdowns and justifications. Its performance on GSM8K-style mathematical problems is particularly noteworthy, outperforming much larger models from a year or two ago.
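The reasoning toggle can be sketched as plain message construction. The helper below is hypothetical; it only shows where the "/no_think" flag sits in the system prompt. Actual generation would pass these messages through the tokenizer's chat template and the model.

```python
# Sketch of the dual-think prompt convention: reasoning is on by default,
# and appending "/no_think" to the system prompt disables it. The message
# layout follows the usual chat format consumed by apply_chat_template;
# no model weights are loaded here.

def build_messages(user_prompt: str,
                   system_prompt: str = "You are a helpful assistant.",
                   think: bool = True) -> list[dict[str, str]]:
    """Assemble one chat turn, toggling reasoning via the system prompt."""
    if not think:
        system_prompt = f"{system_prompt} /no_think"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("Plan a successful lemonade stand.", think=False)
print(messages[0]["content"])  # system prompt now ends with "/no_think"
```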

    Tool Use and Agentic Capabilities

    One of SmolLM3's most compelling features is its function-calling ability, making it suitable for agentic applications. Users can define tool schemas, and the model can accurately interpret user queries to invoke the correct tool with appropriate arguments. For example, when asked "What's the weather like in Copenhagen?", the model successfully identifies the `get_weather` tool and the "Copenhagen" argument. Similarly, for queries like "When will OpenAI release the open weights model?", it correctly identifies the need for a web search and formulates relevant keywords such as "OpenAI open weights model release date rumors."

    So, when we come in and look at the tool use, we can basically just define a tool. You can see here that we've just defined the schema for the tool. We can then pass in a message, and sure enough, the output message will actually include a tool call. In this case, the tool call is get_weather, and it worked out that the argument was Copenhagen, which is what it should have been. I tried this out for a number of different functions.
    While SmolLM3's knowledge cutoff is June 2024, it sometimes opts to use a search tool for factual questions that it theoretically should know, like "Who won the Nobel Prize in Chemistry (2024)?". This behavior is considered a positive attribute, as it indicates a preference for seeking up-to-date information when uncertainty exists, rather than providing potentially outdated or incorrect memorized facts. The model also handles situations where tools are not needed, correctly choosing not to invoke any tools when asked general questions that don't require external information. However, there were some instances where it attempted to use a search tool when it wasn't strictly necessary, suggesting that tool descriptions and prompt engineering might need fine-tuning for optimal agentic performance. The model is also expected to be available on platforms like Ollama and LM Studio soon.
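The tool-use flow from the demo can be sketched end to end in a few lines. The schema mirrors the get_weather/Copenhagen example above; the model reply shown is a hypothetical stand-in for whatever the chat template actually emits, since the exact tool-call format varies by template and serving stack.

```python
import json

# Sketch of the tool-use loop: declare a schema, assume the model's reply
# carries a JSON tool call, then dispatch it locally.

get_weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub in place of a real weather API

TOOLS = {"get_weather": get_weather}

# Hypothetical model reply containing a tool call, mirroring the demo.
model_reply = '{"name": "get_weather", "arguments": {"city": "Copenhagen"}}'

call = json.loads(model_reply)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # Sunny in Copenhagen
```

In a real agent loop, the tool result would be appended to the conversation as a tool message and the model called again to produce the final answer.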

    Takeaways

    1. Transparency in AI Development: Hugging Face's release of a "blueprint" for SmolLM3 provides unprecedented detail on its training process, from pre-training recipes to post-training techniques, setting a new standard for openness in the AI community.
    2. Efficient Small Model Performance: At 3 billion parameters, SmolLM3 demonstrates state-of-the-art performance, competing with larger 4B models on a cost-effective training budget of a few hundred thousand dollars, making it viable for local deployment, including on mobile devices.
    3. Innovative Reasoning and Tool Use: The dual-think reasoning system allows for controllable depth of thought, and function-calling capabilities enable SmolLM3 to act as an agent, accurately invoking tools based on user queries, a significant step towards practical local AI applications.


    This article was AI generated. It may contain errors and should be verified with the original source.

    © 2025 ClarifyTube. All rights reserved.