Complete installation guide for Fish Speech (OpenAudio) with RTX 50-series GPU support on Windows WSL2.

Prerequisites

  • Windows with WSL2 installed
  • NVIDIA RTX 5070 GPU
  • NVIDIA drivers installed on Windows host
  • Miniconda/Anaconda installed

Issue Overview

RTX 50-series GPUs have the sm_120 compute capability and therefore require a PyTorch build with CUDA 12.8 support. Standard PyTorch builds ship without sm_120 kernels and fail with CUDA kernel errors on these GPUs.
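The requirement can be expressed as a simple membership test: the GPU's compute capability must appear in the kernel architectures the PyTorch build was compiled for. The sketch below is pure Python with the `torch` calls mocked out as plain lists, so the helper name `build_supports` and the example arch lists are illustrative, not part of any API.

```python
# Check whether a PyTorch build's compiled kernel list covers a given GPU.
# arch_list mirrors what torch.cuda.get_arch_list() returns, e.g. ['sm_90', 'sm_120'];
# (major, minor) mirrors torch.cuda.get_device_capability() -- (12, 0) for RTX 50-series.
def build_supports(arch_list, major, minor):
    return f"sm_{major}{minor}" in arch_list

# On a working install you would call:
#   import torch
#   build_supports(torch.cuda.get_arch_list(), *torch.cuda.get_device_capability())
print(build_supports(["sm_80", "sm_90", "sm_120"], 12, 0))  # True
```

If this returns False for your GPU, you have a build without sm_120 kernels and need the cu128 wheels installed in step 2 below.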

Installation Steps

1. Create Conda Environment

# Create new environment for Fish Speech
conda create -n fish-speech python=3.10
conda activate fish-speech

2. Install PyTorch with CUDA 12.8 Support (Critical)

# Remove any existing PyTorch installation
pip uninstall torch torchvision torchaudio -y

# Install PyTorch with CUDA 12.8 (essential for RTX 50-series)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Test GPU compatibility
python -c "
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA version: {torch.version.cuda}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU: {torch.cuda.get_device_name()}')
    x = torch.randn(1000, 1000).cuda()
    y = x @ x
    print('✅ GPU computation successful!')
"

Expected Output:

PyTorch version: 2.7.1+cu128
CUDA version: 12.8
CUDA available: True
GPU: NVIDIA GeForce RTX 5070
✅ GPU computation successful!

3. Install Fish Speech

# Clone Fish Speech repository
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech

# Install Fish Speech dependencies
pip install -e .

4. Download Models

# Download Fish Speech 1.5 model
huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5

# Optional: Login to Hugging Face for OpenAudio models (requires access)
huggingface-cli login
# Then download OpenAudio S1-mini (if you have access)
huggingface-cli download fishaudio/OpenAudio-S1-mini --local-dir checkpoints/openaudio-s1-mini

5. Launch Web UI

# Start Fish Speech Web UI
python tools/run_webui.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"

# Alternative with explicit decoder path
python tools/run_webui.py \
  --llama-checkpoint-path "checkpoints/fish-speech-1.5" \
  --decoder-checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

Web Interface Access: http://localhost:7860

Voice Cloning Setup

Recording Reference Audio

  1. Duration: 10-30 seconds
  2. Quality: Clear, natural speech
  3. Language: Any language (cross-lingual cloning supported)
  4. Format: WAV, MP3, or FLAC
  5. Content: Natural conversation, avoid reading monotonously

Cross-lingual Voice Cloning Process

  1. Upload your reference audio (e.g., Dutch voice)
  2. Enter text in target language (e.g., English)
  3. Select language and voice settings
  4. Generate speech with your voice in the new language

Performance Expectations on RTX 5070

Hardware Utilization

  • VRAM Usage: 4-6GB during inference
  • Processing Speed: ~5× real time (1 minute of audio takes about 5 minutes to generate)
  • First Run: Slower due to model loading and caching
  • Subsequent Runs: Significantly faster

Generation Times

  • Short sentences (10-20 words): 30-60 seconds
  • Medium texts (50-100 words): 2-3 minutes
  • Long texts (200+ words): 3-5 minutes

Features and Capabilities

Multilingual Support

  • English, Chinese, Japanese, German, Arabic, Russian, Dutch, Italian, Portuguese
  • Cross-lingual voice cloning: Use voice from one language to speak another
  • No phonetic dependencies: Handles multiple scripts naturally

Emotional Controls

Fish Speech supports emotional markers in text:

(happy) Hello there!
(sad) I miss you so much.
(angry) This is unacceptable!
(laughing) Ha, ha, ha, that's hilarious!
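When generating many lines programmatically, the marker syntax above is easy to apply with a small helper. This is a sketch of our own: the "(emotion) text" format matches the examples, but the set of emotion names a given model version recognizes is an assumption you should verify against the model documentation.

```python
# Tag text with a Fish Speech-style emotion marker: "(emotion) text".
# The emotion set below is illustrative; check which markers your model supports.
EMOTIONS = {"happy", "sad", "angry", "laughing"}

def with_emotion(emotion, text):
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion marker: {emotion}")
    return f"({emotion}) {text}"

print(with_emotion("happy", "Hello there!"))  # (happy) Hello there!
```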

Voice Quality

  • #1 ranking on TTS-Arena2 benchmark
  • 0.008 WER and 0.004 CER on English text
  • Natural prosody and intonation
  • Emotion and style transfer capabilities

Environment Management

Daily Usage

# Activate environment
conda activate fish-speech

# Navigate to Fish Speech directory
cd ~/fish-speech

# Start web interface
python tools/run_webui.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"

Create Startup Script

# Create convenient startup script
cat << 'EOF' > ~/start_fish_speech.sh
#!/bin/bash
# "conda activate" needs the conda shell hook in non-interactive scripts
eval "$(conda shell.bash hook)"
conda activate fish-speech
cd ~/fish-speech
python tools/run_webui.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"
EOF

chmod +x ~/start_fish_speech.sh

# Usage: ./start_fish_speech.sh

Troubleshooting

Common Issues

  1. CUDA kernel errors

    • Solution: Ensure PyTorch 2.7+ with CUDA 12.8 is installed
    • Verify: python -c "import torch; print(torch.version.cuda)"
  2. FileNotFoundError for models

    • Check: ls -la checkpoints/fish-speech-1.5/
    • Re-download: huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5
  3. Slow processing on first run

    • Normal behavior: First inference loads models and creates cache
    • Subsequent runs: Much faster due to caching
  4. Web UI not accessible

    • Check: Process is running and listening on port 7860
    • Access: http://localhost:7860 or http://127.0.0.1:7860
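For issue 4, the "listening on port 7860" check can be scripted instead of eyeballed. A minimal sketch using only the standard library (the helper name `port_open` is ours):

```python
import socket

# Return True if something is accepting connections on host:port.
def port_open(host="127.0.0.1", port=7860, timeout=1.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False while the Web UI process is running, check the process logs for a different port or a bind error.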

GPU Optimization

# Verify GPU usage during inference
nvidia-smi

# Monitor GPU utilization
watch -n 1 nvidia-smi

Integration Options

API Server

# Start HTTP API server for programmatic access
python tools/api_server.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"

# API endpoint: http://localhost:8080
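A client can talk to the API server with nothing beyond the standard library. The endpoint path and payload field names below are assumptions for illustration, not the confirmed Fish Speech API schema; check tools/api_server.py in your checkout for the actual routes and request format.

```python
import json
import urllib.request

# Build a TTS request payload. Field names here are hypothetical --
# verify them against tools/api_server.py before relying on this.
def build_tts_payload(text, reference_audio=None, fmt="wav"):
    payload = {"text": text, "format": fmt}
    if reference_audio is not None:
        payload["reference_audio"] = reference_audio
    return payload

def synthesize(base_url, payload, endpoint="/v1/tts"):
    req = urllib.request.Request(
        base_url + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # raw audio bytes

# Usage (with the server from above running):
#   audio = synthesize("http://localhost:8080", build_tts_payload("Hello!"))
#   open("out.wav", "wb").write(audio)
```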

Command Line Interface

# Direct inference via command line
python fish_speech/models/text2semantic/inference.py \
  --text "Your text here" \
  --prompt-text "Reference text" \
  --prompt-tokens "reference_audio.npy"

Key Success Factors

  • PyTorch 2.7+ with CUDA 12.8 (essential for RTX 50-series)
  • Clean conda environment (avoids dependency conflicts)
  • Proper model downloads (verify checkpoints directory)
  • Quality reference audio (10-30 seconds, clear speech)
  • Adequate VRAM (RTX 5070’s 12GB is perfect)
  • Patient first run (subsequent runs are much faster)

Next Steps

  • Test voice cloning with different languages
  • Experiment with emotional markers in text
  • Set up API integration for automated workflows
  • Optimize reference audio for best cloning results
  • Explore batch processing for multiple texts

Fish Speech (OpenAudio) provides state-of-the-art voice cloning capabilities with excellent cross-lingual support, making it ideal for multilingual content creation and voice synthesis applications.