Complete installation guide for Fish Speech (OpenAudio) with RTX 50-series GPU support on Windows WSL2.

Prerequisites

  • Windows with WSL2 installed
  • NVIDIA RTX 5070 GPU
  • NVIDIA drivers installed on Windows host
  • Miniconda/Anaconda installed

Issue Overview

RTX 50-series GPUs have the sm_120 compute capability and therefore require a PyTorch build with CUDA 12.8 support. Standard PyTorch builds ship without sm_120 kernels and fail with CUDA kernel errors on these GPUs.
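The requirement can be expressed as a simple membership test: the GPU's compute capability must appear in the kernel architectures the PyTorch build was compiled for. The sketch below is pure Python with the `torch` calls mocked out as plain lists, so the helper name `build_supports` and the example arch lists are illustrative, not part of any API.

```python
# Check whether a PyTorch build's compiled kernel list covers a given GPU.
# arch_list mirrors what torch.cuda.get_arch_list() returns, e.g. ['sm_90', 'sm_120'];
# (major, minor) mirrors torch.cuda.get_device_capability() -- (12, 0) for RTX 50-series.
def build_supports(arch_list, major, minor):
    return f"sm_{major}{minor}" in arch_list

# On a working install you would call:
#   import torch
#   build_supports(torch.cuda.get_arch_list(), *torch.cuda.get_device_capability())
print(build_supports(["sm_80", "sm_90", "sm_120"], 12, 0))  # True
```

If this returns False for your GPU, you have a build without sm_120 kernels and need the cu128 wheels installed in step 2 below.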

Installation Steps

1. Create Conda Environment

# Create new environment for Fish Speech
conda create -n fish-speech python=3.10
conda activate fish-speech

2. Install PyTorch with CUDA 12.8 Support (Critical)

# Remove any existing PyTorch installation
pip uninstall torch torchvision torchaudio -y

# Install PyTorch with CUDA 12.8 (essential for RTX 50-series)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Test GPU compatibility
python -c "
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA version: {torch.version.cuda}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU: {torch.cuda.get_device_name()}')
    x = torch.randn(1000, 1000).cuda()
    y = x @ x
    print('✅ GPU computation successful!')
"

Expected Output:

PyTorch version: 2.7.1+cu128
CUDA version: 12.8
CUDA available: True
GPU: NVIDIA GeForce RTX 5070
✅ GPU computation successful!

3. Install Fish Speech

# Clone Fish Speech repository
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech

# Install Fish Speech dependencies
pip install -e .

4. Download Models

# Download Fish Speech 1.5 model
huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5

# Optional: Login to Hugging Face for OpenAudio models (requires access)
huggingface-cli login
# Then download OpenAudio S1-mini (if you have access)
huggingface-cli download fishaudio/OpenAudio-S1-mini --local-dir checkpoints/openaudio-s1-mini

5. Launch Web UI

# Start Fish Speech Web UI
python tools/run_webui.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"

# Alternative with explicit decoder path
python tools/run_webui.py \
  --llama-checkpoint-path "checkpoints/fish-speech-1.5" \
  --decoder-checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

Web Interface Access: http://localhost:7860

Voice Cloning Setup

Recording Reference Audio

  1. Duration: 10-30 seconds
  2. Quality: Clear, natural speech
  3. Language: Any language (cross-lingual cloning supported)
  4. Format: WAV, MP3, or FLAC
  5. Content: Natural conversation, avoid reading monotonously

Cross-lingual Voice Cloning Process

  1. Upload your reference audio (e.g., Dutch voice)
  2. Enter text in target language (e.g., English)
  3. Select language and voice settings
  4. Generate speech with your voice in the new language

Performance Expectations on RTX 5070

Hardware Utilization

  • VRAM Usage: 4-6GB during inference
  • Processing Speed: ~5× real time (1 minute of audio takes about 5 minutes to generate)
  • First Run: Slower due to model loading and caching
  • Subsequent Runs: Significantly faster

Generation Times

  • Short sentences (10-20 words): 30-60 seconds
  • Medium texts (50-100 words): 2-3 minutes
  • Long texts (200+ words): 3-5 minutes

Features and Capabilities

Multilingual Support

  • English, Chinese, Japanese, German, Arabic, Russian, Dutch, Italian, Portuguese
  • Cross-lingual voice cloning: Use voice from one language to speak another
  • No phonetic dependencies: Handles multiple scripts naturally

Emotional Controls

Fish Speech supports emotional markers in text:

(happy) Hello there!
(sad) I miss you so much.
(angry) This is unacceptable!
(laughing) Ha, ha, ha, that's hilarious!
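When generating many lines programmatically, the marker syntax above is easy to apply with a small helper. This is a sketch of our own: the "(emotion) text" format matches the examples, but the set of emotion names a given model version recognizes is an assumption you should verify against the model documentation.

```python
# Tag text with a Fish Speech-style emotion marker: "(emotion) text".
# The emotion set below is illustrative; check which markers your model supports.
EMOTIONS = {"happy", "sad", "angry", "laughing"}

def with_emotion(emotion, text):
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion marker: {emotion}")
    return f"({emotion}) {text}"

print(with_emotion("happy", "Hello there!"))  # (happy) Hello there!
```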

Voice Quality

  • #1 ranking on TTS-Arena2 benchmark
  • 0.008 WER and 0.004 CER on English text
  • Natural prosody and intonation
  • Emotion and style transfer capabilities

Environment Management

Daily Usage

# Activate environment
conda activate fish-speech

# Navigate to Fish Speech directory
cd ~/fish-speech

# Start web interface
python tools/run_webui.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"

Create Startup Script

# Create convenient startup script
cat << 'EOF' > ~/start_fish_speech.sh
#!/bin/bash
# "conda activate" needs the conda shell hook in non-interactive scripts
eval "$(conda shell.bash hook)"
conda activate fish-speech
cd ~/fish-speech
python tools/run_webui.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"
EOF

chmod +x ~/start_fish_speech.sh

# Usage: ./start_fish_speech.sh

Troubleshooting

Common Issues

  1. CUDA kernel errors

    • Solution: Ensure PyTorch 2.7+ with CUDA 12.8 is installed
    • Verify: python -c "import torch; print(torch.version.cuda)"
  2. FileNotFoundError for models

    • Check: ls -la checkpoints/fish-speech-1.5/
    • Re-download: huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5
  3. Slow processing on first run

    • Normal behavior: First inference loads models and creates cache
    • Subsequent runs: Much faster due to caching
  4. Web UI not accessible

    • Check: Process is running and listening on port 7860
    • Access: http://localhost:7860 or http://127.0.0.1:7860
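For issue 4, the "listening on port 7860" check can be scripted instead of eyeballed. A minimal sketch using only the standard library (the helper name `port_open` is ours):

```python
import socket

# Return True if something is accepting connections on host:port.
def port_open(host="127.0.0.1", port=7860, timeout=1.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False while the Web UI process is running, check the process logs for a different port or a bind error.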

GPU Optimization

# Verify GPU usage during inference
nvidia-smi

# Monitor GPU utilization
watch -n 1 nvidia-smi

Integration Options

API Server

# Start HTTP API server for programmatic access
python tools/api_server.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"

# API endpoint: http://localhost:8080
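A client can talk to the API server with nothing beyond the standard library. The endpoint path and payload field names below are assumptions for illustration, not the confirmed Fish Speech API schema; check tools/api_server.py in your checkout for the actual routes and request format.

```python
import json
import urllib.request

# Build a TTS request payload. Field names here are hypothetical --
# verify them against tools/api_server.py before relying on this.
def build_tts_payload(text, reference_audio=None, fmt="wav"):
    payload = {"text": text, "format": fmt}
    if reference_audio is not None:
        payload["reference_audio"] = reference_audio
    return payload

def synthesize(base_url, payload, endpoint="/v1/tts"):
    req = urllib.request.Request(
        base_url + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # raw audio bytes

# Usage (with the server from above running):
#   audio = synthesize("http://localhost:8080", build_tts_payload("Hello!"))
#   open("out.wav", "wb").write(audio)
```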

Command Line Interface

# Direct inference via command line
python fish_speech/models/text2semantic/inference.py \
  --text "Your text here" \
  --prompt-text "Reference text" \
  --prompt-tokens "reference_audio.npy"

Key Success Factors

  • PyTorch 2.7+ with CUDA 12.8 (essential for RTX 50-series)
  • Clean conda environment (avoids dependency conflicts)
  • Proper model downloads (verify checkpoints directory)
  • Quality reference audio (10-30 seconds, clear speech)
  • Adequate VRAM (RTX 5070’s 12GB is perfect)
  • Patient first run (subsequent runs are much faster)

Next Steps

  • Test voice cloning with different languages
  • Experiment with emotional markers in text
  • Set up API integration for automated workflows
  • Optimize reference audio for best cloning results
  • Explore batch processing for multiple texts

Fish Speech (OpenAudio) provides state-of-the-art voice cloning capabilities with excellent cross-lingual support, making it ideal for multilingual content creation and voice synthesis applications.