Similar to how DALL-E works, please add support for Whisper to transcribe audio and video.
Upload an audio or video file -> get the transcription back.
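
A minimal sketch of the upload -> transcription flow, just to make the request concrete. Everything named here is an assumption for illustration: `handle_upload`, the `MEDIA_EXTENSIONS` allow-list, and the `transcribe` callback are hypothetical, and the callback stands in for whichever Whisper backend ends up being used (hosted API or a local model).

```python
from pathlib import Path
from typing import Callable

# Hypothetical allow-list; the request only says "audio and video".
MEDIA_EXTENSIONS = {".mp3", ".wav", ".m4a", ".mp4", ".webm"}


def handle_upload(path: str, transcribe: Callable[[str], str]) -> str:
    """Return the transcription for an uploaded audio or video file.

    `transcribe` is a placeholder for the actual Whisper call: it takes
    a file path and returns the recognized text.
    """
    suffix = Path(path).suffix.lower()
    if suffix not in MEDIA_EXTENSIONS:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    # Trim leading/trailing whitespace from the backend's output.
    return transcribe(path).strip()
```

Keeping the backend behind a callback means the upload handling does not depend on how Whisper is hosted. With the open-source `openai-whisper` package, for example, the callback could probably be something like `lambda p: model.transcribe(p)["text"]` where `model = whisper.load_model("base")`.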
Thanks!