
2. Token Processing for Efficiency and Compatibility: To maintain compatibility with our existing Megatron-LM pipeline and improve efficiency, we modified MaxText’s token processing logic. Our data preparation method constructs each training sequence by appending the first token of the subsequent sequence. This creates overlapping, continuous sequences, ensuring that no information is lost at the boundaries and maximizing data utilization.
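To illustrate this packing scheme, here is a minimal, hypothetical sketch (the function and variable names are ours, not MaxText's): each training sequence of length seq_len carries one extra token borrowed from the start of the next sequence, so the last position of every sequence still has a valid next-token target.

```python
import jax.numpy as jnp

def build_overlapping_sequences(token_ids, seq_len):
    """Pack a token stream into sequences of length seq_len + 1, where each
    sequence ends with the first token of the next one.  Illustrative sketch
    only; this is not MaxText's actual token-processing code."""
    sequences = [token_ids[start : start + seq_len + 1]
                 for start in range(0, len(token_ids) - seq_len, seq_len)]
    return jnp.stack(sequences)

# Example: a stream of 13 tokens packed into length-4 sequences (+1 overlap token).
stream = jnp.arange(13)
print(build_overlapping_sequences(stream, seq_len=4))
# [[ 0  1  2  3  4]
#  [ 4  5  6  7  8]
#  [ 8  9 10 11 12]]
```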
To validate our new TPU-based workflow, we trained two models. First, we trained the Kanana 2.1B-parameter model from scratch; our MaxText implementation achieved performance comparable to our existing GPU-based Megatron-LM pipeline at each stage. Second, we performed depth upscaling with continued pre-training, expanding our existing 8B model into a 9.8B architecture. Both runs succeeded and showed consistent improvements across various benchmarks, confirming that our GPU results were effectively reproduced on TPU.
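Depth upscaling grows a model by duplicating some of its existing transformer layers before continuing pre-training. The exact recipe used for the 8B-to-9.8B model is not described here, so the snippet below is only a hypothetical sketch of the general idea.

```python
def depth_upscale(layer_params, repeat_top=8):
    """Hypothetical depth-upscaling sketch: grow a transformer by duplicating
    a block of its existing layers (here, the top `repeat_top` layers) and
    then continuing pre-training on the larger model."""
    # layer_params: list of per-layer parameter pytrees, ordered bottom to top.
    return layer_params + layer_params[-repeat_top:]
```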
Advancing our approach: Training Mixture-of-Experts (MoE) models with MaxText
With the core pipeline validated, we began experimenting with more advanced architectures, specifically MoE, to build inference-efficient models that maintain strong performance. Our objectives were to explore upcycling an existing dense model into an MoE structure and to evaluate how well the TPU and MaxText stack suited this task.
For the experiment, we upcycled our 2.1B dense model into a 13.4B parameter (2.3B active) MoE architecture with 64 experts and 8 active experts per token. We trained this model on the exact same dataset as the original dense model to isolate the impact of the architectural change. The training was performed on v5e TPUs using MaxText with Fully Sharded Data Parallelism (FSDP).
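To make the upcycling step concrete, the sketch below copies a dense model's feed-forward weights into each of 64 experts and adds a freshly initialized router that activates 8 experts per token. All names, shapes, and helper functions are hypothetical and do not reflect MaxText's actual MoE implementation or checkpoint layout.

```python
import jax
import jax.numpy as jnp

def upcycle_ffn_to_moe(dense_ffn_params, d_model=2048, num_experts=64, seed=0):
    """Sparse-upcycling sketch: every expert starts as an exact copy of the
    dense FFN, so the upcycled MoE initially behaves like the dense model,
    while the router is new and randomly initialized."""
    expert_params = jax.tree_util.tree_map(
        lambda w: jnp.broadcast_to(w, (num_experts,) + w.shape), dense_ffn_params)
    router_kernel = 0.02 * jax.random.normal(
        jax.random.PRNGKey(seed), (d_model, num_experts))
    return {"experts": expert_params, "router": router_kernel}

def route_tokens(x, router_kernel, k=8):
    """Top-k token-choice routing: each token is dispatched to its k
    highest-scoring experts, which is why only a fraction of the total
    parameters (2.3B of 13.4B) is active per token."""
    logits = x @ router_kernel                              # [tokens, num_experts]
    weights, expert_ids = jax.lax.top_k(jax.nn.softmax(logits), k)
    return weights, expert_ids
```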
The implementation process was straightforward. We found that MaxText’s flexible design, built on Flax, Optax, and Orbax, was well-suited for the wide range of ablations required for MoE research. Specifically:
Source Credit: https://cloud.google.com/blog/products/infrastructure-modernization/kakaos-journey-with-jax-and-cloud-tpus/