Move Over LLaMA: Tencent's New Open LLM is Ready to Self-Host

Jonas Scholz - Co-Founder of sliplane.io
5 min

Tencent just released a new open-source model called Hunyuan-A13B-Instruct. It has open weights (I'm not sure about the training code), and it runs locally (well, if you have a B200 GPU). If you're curious about how it performs and want to try it out yourself, here's how to set it up on a rented GPU in a few minutes.


What is Hunyuan-A13B?

Hunyuan-A13B is a Mixture-of-Experts (MoE) model with 80 billion total parameters, but only 13 billion active at a time. This means inference is much cheaper than a full dense model.

Mixture-of-Experts (MoE) is a neural network architecture where only a subset of specialized "expert" sub-networks are activated for each input, reducing computation while increasing model capacity. A gating mechanism dynamically selects which experts to use based on the input, allowing the model to scale efficiently without always using all parameters.
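
To make the routing idea concrete, here's a tiny, illustrative top-k MoE layer in PyTorch. This is not Hunyuan's actual implementation; the expert count, dimensions, and top-k value are made-up numbers just to show the mechanism.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Toy top-k MoE layer: each token is routed to 2 of 8 expert MLPs."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # gating network scores every expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.gate(x)                           # (num_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])

Only the selected experts ever run for a given token, which is why Hunyuan-A13B's 80B total parameters cost roughly as much per token as a 13B dense model.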

Some highlights:

  • Supports 256K context out of the box
  • Fast and slow thinking modes
  • Grouped Query Attention (GQA) for more efficient inference
  • Agent-oriented tuning, with benchmark results on BFCL-v3 and τ-Bench
  • Quantization support, including GPTQ

So far, it looks like a solid candidate for local experimentation, especially for long-context or agent-type tasks. I'm still testing how it compares to other models like LLaMA 3, Mixtral, and Claude 3.


Step 1: Spin Up a RunPod Instance

The easiest way to try it is RunPod (this link gives you between $5 and $500 in credits!). You'll need:

  • A 300 GB network volume
  • A B200 GPU (I don't think anything smaller works; you need ~150 GB of VRAM)
  • A supported PyTorch image

Create a Network Volume

  • Region: use one where B200 is available (currently eu-ro-1)
  • Size: 300 GB
  • Cost: around $21/month (billed even if unused)

Create a Pod

  • GPU type: B200
  • Image: runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04 ⚠️ Earlier versions didn't work in my testing
  • GPU Count: 1
  • Enable SSH + Jupyter
  • Attach your network volume


Step 2: Install Dependencies

In the notebook terminal:

%pip install transformers tiktoken accelerate gptqmodel optimum
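
Before kicking off a 150 GB download, a quick sanity check of the environment doesn't hurt. Nothing here is Hunyuan-specific:

import torch, transformers

print(transformers.__version__)            # the install above should make this importable
print(torch.cuda.is_available())           # True if the B200 is visible to PyTorch
print(torch.cuda.get_device_name(0))       # e.g. "NVIDIA B200"
print(torch.cuda.get_device_properties(0).total_memory / 1e9, "GB of VRAM")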

Step 3: Load the Model

Set the cache path so that downloads go to the mounted volume instead of the default root directory:

import os

# Point the Hugging Face cache at the network volume so the ~150 GB download survives pod restarts
os.environ['HF_HOME'] = '/workspace/hf-cache'

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import re

model_path = 'tencent/Hunyuan-A13B-Instruct'

tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    cache_dir='/workspace/hf-cache/',
    local_files_only=False,
    device_map="auto",            # spread the weights across available GPU memory
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": "What does the frog say?"
    },
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,   # append the assistant turn so the model starts replying
    return_tensors="pt",
    enable_thinking=True          # toggle thinking mode (default: True)
)

outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=5000)
output_text = tokenizer.decode(outputs[0])
print(output_text)
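
With thinking mode enabled, the output contains the model's reasoning as well as the final reply. Assuming the reasoning is wrapped in <think>…</think> and the reply in <answer>…</answer> tags (check your own output; treat the exact tag names as an assumption), the re import above lets you split them:

# Split reasoning from the final reply, assuming <think>/<answer> tags in the output
think = re.findall(r"<think>(.*?)</think>", output_text, flags=re.DOTALL)
answer = re.findall(r"<answer>(.*?)</answer>", output_text, flags=re.DOTALL)

print("Reasoning:", think[0].strip() if think else "(none found)")
print("Reply:", answer[0].strip() if answer else output_text)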

Notes:

  • First run will download ~150 GB of weights
  • VRAM usage is ~153 GB during inference
  • Loading into VRAM takes a few minutes
  • If GPU utilization (not just VRAM) goes up, it's actually generating; see the quick check below
  • You can set device_map="cpu" if you want to test on CPU only, but make sure you have around 200 GB of RAM and a capable CPU
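
A quick way to confirm that from another cell while generate() is running (or right after), using only standard tooling:

import subprocess
import torch

print(f"Allocated by PyTorch: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
print(f"Reserved by PyTorch:  {torch.cuda.memory_reserved() / 1e9:.1f} GB")

# nvidia-smi reports overall utilization; it should sit near 100% during generation
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used", "--format=csv"],
    capture_output=True, text=True,
).stdout)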

Costs

  • B200 pod: $6.39/hour
  • Network volume: $21/month, even if unused
  • Suggestion: shut the pod down when not in use x)
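
As a rough worked example: a three-hour test session is about 3 × $6.39 ≈ $19.17 of GPU time, plus the $21/month for the volume, so the GPU dwarfs everything else if you forget to stop the pod.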

Tooling Notes

  • llama.cpp support is not there yet. PR in progress: #14425
  • Works fine in Python with transformers and bfloat16
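
Since the model advertises GPTQ support (and the pip install above already pulls in gptqmodel and optimum), loading a quantized checkpoint should look roughly like the sketch below. The repo name is an assumption on my part; check Hugging Face for whatever Tencent actually publishes:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo name for a GPTQ-quantized upload; replace with the real one
quant_path = "tencent/Hunyuan-A13B-Instruct-GPTQ-Int4"

tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    quant_path,
    device_map="auto",       # int4 weights should need a fraction of the bf16 VRAM
    trust_remote_code=True,
)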

Benchmark

The official benchmarks are available on Hugging Face and were evaluated with the TRT-LLM backend.

| Benchmark | Hunyuan-Large | Qwen2.5-72B | Qwen3-A22B | Hunyuan-A13B |
| --- | --- | --- | --- | --- |
| MMLU | 88.40 | 86.10 | 87.81 | 88.17 |
| MMLU-Pro | 60.20 | 58.10 | 68.18 | 67.23 |
| MMLU-Redux | 87.47 | 83.90 | 87.40 | 87.67 |
| BBH | 86.30 | 85.80 | 88.87 | 87.56 |
| SuperGPQA | 38.90 | 36.20 | 44.06 | 41.32 |
| EvalPlus | 75.69 | 65.93 | 77.60 | 78.64 |
| MultiPL-E | 59.13 | 60.50 | 65.94 | 69.33 |
| MBPP | 72.60 | 76.00 | 81.40 | 83.86 |
| CRUX-I | 57.00 | 57.63 | - | 70.13 |
| CRUX-O | 60.63 | 66.20 | 79.00 | 77.00 |
| MATH | 69.80 | 62.12 | 71.84 | 72.35 |
| CMATH | 91.30 | 84.80 | - | 91.17 |
| GSM8k | 92.80 | 91.50 | 94.39 | 91.83 |
| GPQA | 25.18 | 45.90 | 47.47 | 49.12 |

Hunyuan-A13B-Instruct has achieved highly competitive performance across multiple benchmarks, particularly in mathematics, science, agent domains, and more. We compared it with several powerful models, and the results are shown below. - Tencent

| Topic | Bench | OpenAI-o1-1217 | DeepSeek R1 | Qwen3-A22B | Hunyuan-A13B-Instruct |
| --- | --- | --- | --- | --- | --- |
| Mathematics | AIME 2024 | 74.3 | 79.8 | 85.7 | 87.3 |
| Mathematics | AIME 2025 | 79.2 | 70 | 81.5 | 76.8 |
| Mathematics | MATH | 96.4 | 94.9 | 94.0 | 94.3 |
| Science | GPQA-Diamond | 78 | 71.5 | 71.1 | 71.2 |
| Science | OlympiadBench | 83.1 | 82.4 | 85.7 | 82.7 |
| Coding | LiveCodeBench | 63.9 | 65.9 | 70.7 | 63.9 |
| Coding | FullStackBench | 64.6 | 71.6 | 65.6 | 67.8 |
| Coding | ArtifactsBench | 38.6 | 44.6 | 44.6 | 43 |
| Reasoning | BBH | 80.4 | 83.7 | 88.9 | 89.1 |
| Reasoning | DROP | 90.2 | 92.2 | 90.3 | 91.1 |
| Reasoning | ZebraLogic | 81 | 78.7 | 80.3 | 84.7 |
| Instruction Following | IF-Eval | 91.8 | 88.3 | 83.4 | 84.7 |
| Instruction Following | SysBench | 82.5 | 77.7 | 74.2 | 76.1 |
| Text Creation | LengthCtrl | 60.1 | 55.9 | 53.3 | 55.4 |
| Text Creation | InsCtrl | 74.8 | 69 | 73.7 | 71.9 |
| NLU | ComplexNLU | 64.7 | 64.5 | 59.8 | 61.2 |
| NLU | Word-Task | 67.1 | 76.3 | 56.4 | 62.9 |
| Agent | BFCL v3 | 67.8 | 56.9 | 70.8 | 78.3 |
| Agent | τ-Bench | 60.4 | 43.8 | 44.6 | 54.7 |
| Agent | ComplexFuncBench | 47.6 | 41.1 | 40.6 | 61.2 |
| Agent | C3-Bench | 58.8 | 55.3 | 51.7 | 63.5 |

Conclusion

This is one of the more interesting open MoE models out right now. It supports long contexts, has some thoughtful design choices, and it's easy enough to run. I'm still evaluating how good it actually is, especially compared to Mistral Magistral and other recent models. If you want to test it yourself, this setup gets you going quickly.

Cheers,

Jonas, Co-Founder of sliplane.io

Welcome to the container cloud

Sliplane makes it simple to deploy containers in the cloud and scale up as you grow. Try it now and get started in minutes!