LearnSpeak uses Azure Cognitive Services Speech SDK for text-to-speech audio generation. This guide will help you set up your Azure TTS credentials.
- Azure account (free tier available)
- Access to Azure Portal
- macOS only: Azure Speech SDK C library (automated install script provided)
Before configuring Azure credentials, you need to install the Azure Speech SDK C library on macOS:
cd backend
./setup-speech-sdk.sh
This script will:
- Download Azure Speech SDK v1.43.0 for macOS
- Extract it to backend/lib/speechsdk/
- Clean up temporary files
The SDK is about 12MB and is required for the Go Speech SDK to work.
Note: The lib/ directory is in .gitignore and won't be committed to git.
- Go to Azure Portal
- Click Create a resource
- Search for Speech
- Select Speech by Microsoft
- Click Create
Fill in the following details:
- Subscription: Choose your subscription
- Resource group: Create new or select existing
- Region: Choose the closest region (e.g., eastus, westus2, eastasia)
- Name: Give it a unique name (e.g., learnspeak-tts)
- Pricing tier:
- Free (F0): 5 audio hours/month, up to 500k characters/month
- Standard (S0): Pay-as-you-go, $16 per 1M characters for neural voices (Cantonese uses neural voices)
Click Review + Create, then Create
- Once deployed, go to your Speech resource
- Click Keys and Endpoint in the left menu
- Copy one of the keys (KEY 1 or KEY 2)
- Note the Location/Region (e.g., eastus)
- Copy .env.example to .env in the backend directory:
cp backend/.env.example backend/.env
- Edit backend/.env and set:
AZURE_TTS_KEY=your-key-from-step-3
AZURE_TTS_REGION=eastus  # Your region from step 3
TTS_VOICE_CANTONESE=zh-HK-HiuGaaiNeural  # Default Cantonese voice
TTS_CACHE_ENABLED=true  # Enable caching to reduce costs
- Start the backend:
cd backend
sh run.sh
- Log in to the frontend and try creating a word with a translation
- Click 🎙️ Generate Audio. The audio should be generated and cached.
- 5 audio hours per month (300 minutes)
- 500,000 characters per month
- Good for development and small-scale testing
- Neural voices (high quality): $16 per 1M characters
- Standard voices: $4 per 1M characters
- Cantonese uses neural voices
- 1 word (average 3 characters): $0.000048
- 1000 words: $0.048 (less than 5 cents)
- 10,000 words: $0.48 (less than 50 cents)
With caching: Same audio reused across all students, so cost is one-time per unique word/phrase.
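The per-word figures above follow from multiplying the character count by the neural price of $16 per 1M characters. The short Go sketch below reproduces that arithmetic; the function and constant names are illustrative only, and the price is the one quoted in this guide, which may not match current Azure pricing.

```go
package main

import "fmt"

// Price per one million characters for neural voices, as quoted in this guide.
const neuralUSDPerMillionChars = 16.0

// EstimateCostUSD returns the pay-as-you-go cost of synthesizing
// the given number of characters with a neural voice.
func EstimateCostUSD(chars int) float64 {
	return float64(chars) * neuralUSDPerMillionChars / 1_000_000
}

func main() {
	const avgCharsPerWord = 3 // average word length assumed in this guide
	for _, words := range []int{1, 1000, 10000} {
		cost := EstimateCostUSD(words * avgCharsPerWord)
		fmt.Printf("%6d words: $%.6f\n", words, cost)
	}
}
```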
LearnSpeak defaults to zh-HK-HiuGaaiNeural (female voice), but you can change it in .env:
- zh-HK-HiuGaaiNeural - Female, friendly
- zh-HK-HiuMaanNeural - Female, calm
- zh-HK-WanLungNeural - Male, clear
Preview voices at: https://speech.microsoft.com/portal/voicegallery
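If you want to guard against typos in the voice name, one option is to validate the configured value against the voices listed above. The Go sketch below is an assumption about how such a check might look, not LearnSpeak's actual behavior; `ResolveVoice` and `IsNeuralVoice` are hypothetical helpers, and falling back to the default voice is just one possible policy.

```go
package main

import (
	"fmt"
	"strings"
)

// cantoneseVoices holds the voices named in this guide and their descriptions.
var cantoneseVoices = map[string]string{
	"zh-HK-HiuGaaiNeural": "Female, friendly",
	"zh-HK-HiuMaanNeural": "Female, calm",
	"zh-HK-WanLungNeural": "Male, clear",
}

const defaultVoice = "zh-HK-HiuGaaiNeural"

// ResolveVoice returns the requested voice if it is one of the listed
// voices, and the default voice otherwise.
func ResolveVoice(requested string) string {
	requested = strings.TrimSpace(requested)
	if _, ok := cantoneseVoices[requested]; ok {
		return requested
	}
	return defaultVoice
}

// IsNeuralVoice reports whether the voice name ends with "Neural".
func IsNeuralVoice(name string) bool {
	return strings.HasSuffix(name, "Neural")
}

func main() {
	v := ResolveVoice("zh-HK-WanLungNeural")
	fmt.Println(v, "-", cantoneseVoices[v], "- neural:", IsNeuralVoice(v))
}
```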
LearnSpeak caches generated audio using MD5 hashing:
- Cache key: MD5(text + voice + language)
- Storage: backend/uploads/tts-cache/[hash].mp3
- Benefits:
- Same word = same audio file (no duplicate API calls)
- Reused across all students
- Persists across restarts
- Check AZURE_TTS_KEY is set in .env
- Verify the key is correct (no extra spaces)
- Restart the backend after changing .env
- Verify AZURE_TTS_REGION matches your Speech resource region
- Check your Azure subscription is active
- Verify you haven't exceeded free tier limits
- Check browser console for errors
- Verify audio file exists in uploads/tts-cache/
- Check file permissions (should be readable)
- Azure neural voices are high quality by default
- Check speechConfig.SetSpeechSynthesisOutputFormat() is set to MP3 16kHz
- Verify you're using a neural voice (ends with Neural)
- Never commit the .env file to git
- Use environment variables in production
- Rotate keys periodically
- Use Azure Key Vault for production deployments
- Monitor usage in Azure Portal to detect anomalies