Complete installation guide for Fish Speech (OpenAudio) with RTX 50-series GPU support on Windows WSL2.
Prerequisites
- Windows with WSL2 installed
- NVIDIA RTX 5070 GPU
- NVIDIA drivers installed on Windows host
- Miniconda/Anaconda installed
Issue Overview
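RTX 50-series GPUs have compute capability sm_120, which is only supported by PyTorch builds compiled against CUDA 12.8 or newer. Builds targeting older CUDA versions contain no kernels for this architecture and typically fail at runtime with errors such as "no kernel image is available for execution on the device".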
Installation Steps
1. Create Conda Environment
# Create new environment for Fish Speech
conda create -n fish-speech python=3.10
conda activate fish-speech
2. Install PyTorch with CUDA 12.8 Support (Critical)
# Remove any existing PyTorch installation
pip uninstall torch torchvision torchaudio -y
# Install PyTorch with CUDA 12.8 (essential for RTX 50-series)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Test GPU compatibility
python -c "
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA version: {torch.version.cuda}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU: {torch.cuda.get_device_name()}')
    x = torch.randn(1000, 1000).cuda()
    y = x @ x
    # Wait for the kernel to finish so any CUDA error surfaces here
    torch.cuda.synchronize()
    print('✅ GPU computation successful!')
"
Expected Output:
PyTorch version: 2.7.1+cu128
CUDA version: 12.8
CUDA available: True
GPU: NVIDIA GeForce RTX 5070
✅ GPU computation successful!
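As an additional check, the installed build can be asked which GPU architectures it was compiled for. The sketch below compares that list with the device's compute capability. The helper names `device_arch` and `build_supports` are illustrative, and the exact-match comparison is a simplification (it ignores binary compatibility between related architectures), so treat a negative result as a hint rather than proof.

```python
def device_arch(capability):
    """Turn a (major, minor) compute capability into an arch tag, e.g. (12, 0) -> 'sm_120'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_supports(arch_list, capability):
    """Return True if the build lists kernels for exactly this architecture."""
    return device_arch(capability) in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            archs = torch.cuda.get_arch_list()          # e.g. ['sm_80', ..., 'sm_120']
            cap = torch.cuda.get_device_capability(0)   # e.g. (12, 0)
            print(f"Build targets: {archs}")
            print(f"Device arch:   {device_arch(cap)}")
            print(f"Supported:     {build_supports(archs, cap)}")
        else:
            print("CUDA is not available")
```

If the device architecture is missing from the build targets, reinstalling PyTorch from the cu128 index is the first thing to try.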
3. Install Fish Speech
# Clone Fish Speech repository
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech
# Install Fish Speech dependencies
pip install -e .
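Installing a project's dependencies can, in principle, replace an already installed PyTorch build; whether that happens here depends on the version pins in the repository, which this guide does not state. A minimal sketch for confirming that the cu128 build is still in place after this step (the helper name `has_cuda_tag` is illustrative):

```python
from importlib import metadata


def has_cuda_tag(version, tag="cu128"):
    """Return True if the local part of the version (after '+') equals the tag."""
    _, _, local = version.partition("+")
    return local == tag


if __name__ == "__main__":
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        print("torch is not installed in this environment")
    else:
        status = "OK" if has_cuda_tag(installed) else "not a cu128 build, reinstall from step 2"
        print(f"torch {installed}: {status}")
```

If the check fails, repeating step 2 restores the CUDA 12.8 build.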
4. Download Models
# Download Fish Speech 1.5 model
huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5
# Optional: Login to Hugging Face for OpenAudio models (requires access)
huggingface-cli login
# Then download OpenAudio S1-mini (if you have access)
huggingface-cli download fishaudio/OpenAudio-S1-mini --local-dir checkpoints/openaudio-s1-mini
5. Launch Web UI
# Start Fish Speech Web UI
python tools/run_webui.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"
# Alternative with explicit decoder path
python tools/run_webui.py \
--llama-checkpoint-path "checkpoints/fish-speech-1.5" \
--decoder-checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"
Web Interface Access: http://localhost:7860
Voice Cloning Setup
Recording Reference Audio
- Duration: 10-30 seconds
- Quality: Clear, natural speech
- Language: Any language (cross-lingual cloning supported)
- Format: WAV, MP3, or FLAC
- Content: Natural, conversational speech; avoid monotone reading
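The duration guideline above can be checked before uploading. The following sketch uses only the Python standard library and therefore handles WAV files only; MP3 and FLAC would need a third-party decoder. The function names and the file name `reference.wav` are placeholders, not part of Fish Speech.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_length(seconds, minimum=10.0, maximum=30.0):
    """Check a duration against the 10-30 second guideline."""
    return minimum <= seconds <= maximum


if __name__ == "__main__":
    import sys

    # Placeholder default; pass the real recording as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "reference.wav"
    try:
        seconds = wav_duration_seconds(path)
    except (FileNotFoundError, wave.Error) as exc:
        print(f"Could not read {path}: {exc}")
    else:
        verdict = "within" if is_good_length(seconds) else "outside"
        print(f"{path}: {seconds:.1f} s ({verdict} the 10-30 s guideline)")
```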
Cross-lingual Voice Cloning Process
- Upload your reference audio (e.g., Dutch voice)
- Enter text in target language (e.g., English)
- Select language and voice settings
- Generate speech with your voice in the new language
Performance Expectations on RTX 5070
Hardware Utilization
- VRAM Usage: 4-6GB during inference
- Processing Speed: roughly 5x slower than real time (1 minute of audio takes about 5 minutes to generate)
- First Run: Slower due to model loading and caching
- Subsequent Runs: Significantly faster
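The processing-speed figure above translates into a simple estimate. This sketch just applies the factor of 5 quoted in this guide; actual times vary with text, settings, and whether the model is already loaded.

```python
def estimate_processing_seconds(audio_seconds, factor=5.0):
    """Estimate generation time from output length, using the guide's ~5x figure."""
    return audio_seconds * factor


if __name__ == "__main__":
    for audio in (10, 60, 120):
        minutes = estimate_processing_seconds(audio) / 60
        print(f"{audio:>4} s of audio -> about {minutes:.1f} min of processing")
```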
Generation Times
- Short sentences (10-20 words): 30-60 seconds
- Medium texts (50-100 words): 2-3 minutes
- Long texts (200+ words): 3-5 minutes
Features and Capabilities
Multilingual Support
- English, Chinese, Japanese, German, Arabic, Russian, Dutch, Italian, Portuguese
- Cross-lingual voice cloning: Use voice from one language to speak another
- No phonetic dependencies: Handles multiple scripts naturally
Emotional Controls
Fish Speech supports emotional markers in text:
(happy) Hello there!
(sad) I miss you so much.
(angry) This is unacceptable!
(laughing) Ha, ha, ha, that's hilarious!
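When text is generated programmatically, a small helper can prepend the marker in the format shown above. This is only a string-formatting sketch: `KNOWN_MARKERS` contains just the four markers listed in this guide, not a complete list of what the model accepts, and `with_marker` is a hypothetical name.

```python
# Only the markers shown in this guide; the model may support others.
KNOWN_MARKERS = ("happy", "sad", "angry", "laughing")


def with_marker(marker, text):
    """Prefix text with an emotional marker, e.g. '(happy) Hello there!'."""
    if marker not in KNOWN_MARKERS:
        raise ValueError(f"unknown marker {marker!r}; expected one of {KNOWN_MARKERS}")
    return f"({marker}) {text}"


if __name__ == "__main__":
    print(with_marker("happy", "Hello there!"))
    print(with_marker("sad", "I miss you so much."))
```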
Voice Quality
- #1 ranking on TTS-Arena2 benchmark
- 0.008 WER and 0.004 CER on English text
- Natural prosody and intonation
- Emotion and style transfer capabilities
Environment Management
Daily Usage
# Activate environment
conda activate fish-speech
# Navigate to Fish Speech directory
cd ~/fish-speech
# Start web interface
python tools/run_webui.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"
Create Startup Script
# Create convenient startup script
cat << 'EOF' > ~/start_fish_speech.sh
#!/bin/bash
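# Make "conda activate" available in a non-interactive shell
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate fish-speech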
cd ~/fish-speech
python tools/run_webui.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"
EOF
chmod +x ~/start_fish_speech.sh
# Usage: ./start_fish_speech.sh
Troubleshooting
Common Issues
CUDA kernel errors
- Solution: Ensure PyTorch 2.7+ with CUDA 12.8 is installed
- Verify:
python -c "import torch; print(torch.version.cuda)"
FileNotFoundError for models
- Check:
ls -la checkpoints/fish-speech-1.5/
- Re-download:
huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5
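The same check can be scripted. The sketch below reports which expected files are absent from the checkpoint directory. Only the decoder file name is taken from this guide (the launch command in step 5); any other file names to verify are an assumption to be filled in from the model repository, and `missing_files` is an illustrative helper.

```python
from pathlib import Path


def missing_files(checkpoint_dir, expected):
    """Return the expected file names that are not present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Decoder file name as used in the launch command of this guide.
    expected = ["firefly-gan-vq-fsq-8x1024-21hz-generator.pth"]
    missing = missing_files("checkpoints/fish-speech-1.5", expected)
    if missing:
        print("Missing files, re-download the model:", ", ".join(missing))
    else:
        print("All expected files are present")
```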
Slow processing on first run
- Normal behavior: First inference loads models and creates cache
- Subsequent runs: Much faster due to caching
Web UI not accessible
- Check: Process is running and listening on port 7860
- Access:
http://localhost:7860 or http://127.0.0.1:7860
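To confirm that something is actually listening on the port, a plain TCP connection attempt is enough. This sketch assumes the default host and port used in this guide; `is_listening` is an illustrative helper and says nothing about whether the listener is really the Web UI.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for host in ("localhost", "127.0.0.1"):
        state = "open" if is_listening(host, 7860) else "closed"
        print(f"{host}:7860 is {state}")
```

If the port is closed while the process is running, check the terminal output of `run_webui.py` for the address it reports.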
GPU Optimization
# Verify GPU usage during inference
nvidia-smi
# Monitor GPU utilization
watch -n 1 nvidia-smi
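For logging rather than watching, `nvidia-smi` can emit machine-readable values. The sketch below queries memory use and utilization and parses the CSV output. It assumes every field is reported as a plain number; on some setups a field may be reported as unavailable, in which case parsing fails and the error is printed instead.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(line):
    """Parse one CSV line such as '5120, 12227, 87' into a dict of floats."""
    used, total, util = (float(field.strip()) for field in line.split(","))
    return {"used_mib": used, "total_mib": total, "util_percent": util}


def read_gpu_stats():
    """Return one stats dict per GPU, or an empty list if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_gpu_stats(line) for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    try:
        for index, stats in enumerate(read_gpu_stats()):
            print(f"GPU {index}: {stats}")
    except (subprocess.CalledProcessError, ValueError) as exc:
        print(f"Could not read GPU stats: {exc}")
```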
Integration Options
API Server
# Start HTTP API server for programmatic access
python tools/api_server.py --llama-checkpoint-path "checkpoints/fish-speech-1.5"
# API endpoint: http://localhost:8080
Command Line Interface
# Direct inference via command line
python fish_speech/models/text2semantic/inference.py \
--text "Your text here" \
--prompt-text "Reference text" \
--prompt-tokens "reference_audio.npy"
Key Success Factors
✅ PyTorch 2.7+ with CUDA 12.8 (essential for RTX 50-series)
✅ Clean conda environment (avoids dependency conflicts)
✅ Proper model downloads (verify checkpoints directory)
✅ Quality reference audio (10-30 seconds, clear speech)
✅ Adequate VRAM (the RTX 5070's 12GB comfortably covers the 4-6GB used during inference)
✅ Patient first run (subsequent runs are much faster)
Next Steps
- Test voice cloning with different languages
- Experiment with emotional markers in text
- Set up API integration for automated workflows
- Optimize reference audio for best cloning results
- Explore batch processing for multiple texts
Fish Speech (OpenAudio) provides state-of-the-art voice cloning capabilities with excellent cross-lingual support, making it ideal for multilingual content creation and voice synthesis applications.