# OmniCodec

OmniCodec: Low Frame Rate Universal Audio Codec with Semantic–Acoustic Disentanglement

- Demo Page: OmniCodec Demo Page
- Huggingface: Huggingface
- Arxiv: https://arxiv.org/abs/2603.20638
## Overview

This repo contains:

- Training: `train.py` (Accelerate + GAN / WavLM-related losses per config)
- Dataset: `dataset.py` (multi-domain mixing; loads audio paths from scp)
- Inference: `infer.py` (reconstructs audio with a pretrained checkpoint)
- Config: `config/config_omnicodec.yaml`
## Environment

### Requirements

Install Python dependencies:

```bash
pip install -r requirements.txt
```

Note: `requirements.txt` contains an editable install line `-e OmniCodec/transformers-main`. Make sure the referenced path exists in your environment, or adjust/remove that line if you already have `transformers` installed.
## Data preparation (scp)

The training config expects three scp files (one per domain): speech / music / sound.

Each line in an scp file can be either:

```
utt_id /abs/or/rel/path/to/audio.wav
/abs/or/rel/path/to/audio.wav
```

In the second form, the utt id is inferred from the filename.

Example:

```
utt0001 /data/speech/utt0001.wav
utt0002 /data/speech/utt0002.wav
```
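The two line formats above can be handled by a small parser. A minimal sketch (the helper name `parse_scp_line` is illustrative, not part of this repo):

```python
import os

def parse_scp_line(line: str) -> tuple[str, str]:
    """Parse one scp line into (utt_id, path).

    Supports both formats: "utt_id /path/to/audio.wav" and a bare
    "/path/to/audio.wav", where utt_id falls back to the filename stem.
    """
    parts = line.strip().split(maxsplit=1)
    if len(parts) == 2:
        utt_id, path = parts
    else:
        path = parts[0]
        utt_id = os.path.splitext(os.path.basename(path))[0]
    return utt_id, path

# Both formats resolve to the same (utt_id, path) pair:
print(parse_scp_line("utt0001 /data/speech/utt0001.wav"))
print(parse_scp_line("/data/speech/utt0001.wav"))
```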
## What the dataset does

For each item, `dataset.py` will:

- load audio with `librosa.load(..., sr=sample_rate, mono=True)`
- apply `librosa.util.normalize(wav) * 0.95`
- crop/pad/repeat to `segment_size` (default: 240000 samples @ 24 kHz = 10 s)
- return a dict: `{"wav": Tensor[T], "utt": str, "text": None}`

Failed samples return `None` and are filtered out by `collate_fn` in `train.py`.
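The per-item steps above amount to the following sketch. This is an approximation for orientation, not the repo's exact code: it uses plain `numpy` to stay self-contained (`librosa.util.normalize` divides by the peak absolute value), and it crops from the head where the real dataset may pick a random offset:

```python
import numpy as np

def prepare_segment(wav: np.ndarray, segment_size: int = 240000) -> np.ndarray:
    """Peak-normalize to 0.95, then crop or repeat-pad to segment_size samples."""
    # librosa.util.normalize(wav) scales by the peak absolute value
    peak = np.max(np.abs(wav))
    if peak > 0:
        wav = wav / peak * 0.95
    if len(wav) >= segment_size:
        wav = wav[:segment_size]                 # crop
    else:
        reps = int(np.ceil(segment_size / len(wav)))
        wav = np.tile(wav, reps)[:segment_size]  # repeat-pad short clips
    return wav

short = np.ones(100, dtype=np.float32)           # a clip much shorter than 10 s
out = prepare_segment(short, segment_size=240000)
print(out.shape)  # (240000,)
```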
## Configure

Edit `config/config_omnicodec.yaml`:

- Data
  - `data.speech_train_shards_dir`: path to `speech.scp`
  - `data.music_train_shards_dir`: path to `music.scp`
  - `data.sound_train_shards_dir`: path to `sound.scp`
  - `data.sample_rate`: default `24000`
  - `data.segment_size`: default `240000`
- Pretrained SSL (WavLM)
  - `model.wavlmloss.ckpt_path`: default `pretrain_model/ssl/wavlm-base-plus`
  - `wav_lm_model`: default `pretrain_model/ssl/wavlm_model/wavlm`
- Output
  - `train.save_dir`: default `./exps/omnicodec`
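Before launching training, it can help to verify that the configured scp paths actually exist on disk. A minimal sketch (the helper `check_data_paths` is not part of the repo; key names follow the list above):

```python
import os

SCP_KEYS = ("speech_train_shards_dir", "music_train_shards_dir", "sound_train_shards_dir")

def check_data_paths(cfg: dict) -> list[str]:
    """Return the data keys whose configured path is missing or unset."""
    missing = []
    for key in SCP_KEYS:
        path = cfg.get("data", {}).get(key)
        if not path or not os.path.exists(path):
            missing.append(key)
    return missing

# Example with a dummy config dict (load the real one with yaml.safe_load):
cfg = {"data": {"speech_train_shards_dir": "/data/scp/speech.scp"}}
print(check_data_paths(cfg))
```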
## Training

Run training with the provided config:

```bash
python train.py -c config/config_omnicodec.yaml
```

Checkpoints and logs are written to `train.save_dir` (default: `./exps/omnicodec`).
## Inference (reconstruction)

### Prepare checkpoint

`infer.py` loads the checkpoint from:

```
pretrained_model/omnicodec.pth
```

Place your pretrained weights at that path (or edit `infer.py` to point to your checkpoint).

### Run

Put test audio files in `./testset/speech/`, then run:

```bash
python infer.py -c config/config_omnicodec.yaml
```

Reconstructed audio will be written to `./outputs/`.
Project structure
.
├─ config/
│ └─ config_omnicodec.yaml
├─ dataset.py
├─ train.py
├─ infer.py
├─ models/
├─ modules/
├─ quantization/
├─ discriminators/
├─ losses/
├─ utils/
└─ requirements.txt
## Acknowledgements

This repo benefits from moshi, Qwen3Omni, DAC, BigVGAN, and SpeechTokenizer.
## Citation

If you use this work, please cite:

```bibtex
@misc{hu2026omnicodeclowframerate,
  title={OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement},
  author={Jingbin Hu and Haoyu Zhang and Dake Guo and Qirui Zhan and Wenhao Li and Huakang Chen and Guobin Ma and Hanke Xie and Chengyou Wang and Pengyuan Xie and Chuan Xie and Qiang Zhang and Lei Xie},
  year={2026},
  eprint={2603.20638},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2603.20638},
}
```
## License

See the repository license.