
Transcribe - From audio file

POST /transcribe

Generate a transcript from an audio file. Only audio/* mime types are supported. The maximum duration is 10 minutes. If you have longer files, please use the asynchronous equivalent.
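A client can check the 10-minute limit locally before choosing between this endpoint and the asynchronous one. The sketch below is only an illustration: it handles uncompressed WAV input through Python's standard `wave` module, it assumes the limit is inclusive, and the helper names (`wav_duration_seconds`, `fits_sync_endpoint`, `make_silent_wav`) are hypothetical and not part of the API.

```python
import io
import wave

MAX_SYNC_DURATION_S = 10 * 60  # limit stated above: 10 minutes


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an uncompressed WAV payload in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_sync_endpoint(wav_bytes: bytes) -> bool:
    """True if the audio is short enough for the synchronous endpoint.

    Assumption: the 10-minute limit is inclusive.
    """
    return wav_duration_seconds(wav_bytes) <= MAX_SYNC_DURATION_S


def make_silent_wav(seconds: float, rate: int = 8000) -> bytes:
    """Build a mono 16-bit silent WAV in memory (demonstration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


if __name__ == "__main__":
    clip = make_silent_wav(2.0)
    print(wav_duration_seconds(clip), fits_sync_endpoint(clip))
```

Other audio formats (MP3, OGG, and so on) would need a decoder to measure duration; that is outside the scope of this sketch.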

Request

Body

required

    request_parameters  object  required

    The object containing all the information needed along with the audio file to transcribe.

    speech_locale  speech_locale (string)  required

    The spoken or written locale of the transcript, representing both the language and its specific regional variant.

    Possible values: [ENGLISH_US, ENGLISH_UK, SPANISH_ES, SPANISH_MX, FRENCH_FR, ARABIC_EG, ARABIC_LB, ARABIC_MA, ARABIC_SA, ARMENIAN_AM, BENGALI_IN, CANTONESE_CN, CROATIAN_HR, FILIPINO_PH, GERMAN_DE, GREEK_GR, GUJARATI_IN, HEBREW_IL, HINDI_IN, ITALIAN_IT, JAPANESE_JP, KHMER_KH, KOREAN_KR, MANDARIN_CN, PERSIAN_IR, POLISH_PL, PORTUGUESE_PT, PUNJABI_IN, RUSSIAN_RU, SERBIAN_RS, TAMIL_IN, TELUGU_IN, THAI_TH, URDU_IN, VIETNAMESE_VN]

    Example: ENGLISH_US
    split_by_sentence  boolean

    Indicates whether to segment transcription results at sentence boundaries. Default is false, meaning that a single transcript item may span multiple sentences as long as no pause (silence) in the audio separates them.

    Default value: false
    file  binary  required
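As a rough illustration of how the body fields above fit together, here is a sketch that builds the request with the Python standard library only. Several points are assumptions, not statements from this reference: that the body is sent as `multipart/form-data`, that `request_parameters` travels as a JSON-encoded part, and that the host is `https://api.example.com` (a placeholder). Authentication is omitted because this section does not describe it. The function name `build_transcribe_request` is hypothetical.

```python
import json
import uuid
import urllib.request

API_BASE_URL = "https://api.example.com"  # hypothetical placeholder host


def build_transcribe_request(
    audio: bytes,
    filename: str,
    mime_type: str,
    speech_locale: str,
    split_by_sentence: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a POST /transcribe request.

    Assumption: multipart/form-data with a JSON `request_parameters`
    part and a binary `file` part.
    """
    if not mime_type.startswith("audio/"):
        raise ValueError("only audio/* mime types are supported")

    params = json.dumps(
        {"speech_locale": speech_locale, "split_by_sentence": split_by_sentence}
    )
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="request_parameters"\r\n',
        b"Content-Type: application/json\r\n\r\n",
        params.encode("utf-8"),
        b"\r\n",
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {mime_type}\r\n\r\n".encode(),
        audio,
        b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        API_BASE_URL + "/transcribe",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


if __name__ == "__main__":
    req = build_transcribe_request(b"RIFF", "visit.wav", "audio/wav", "ENGLISH_US")
    print(req.get_method(), req.full_url)
    # Sending would be: urllib.request.urlopen(req)  (not executed here)
```

Building the request separately from sending it keeps the encoding step easy to inspect and test without network access.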

Responses

Results of processing the audio file.
Schema

    transcript  object[]  required

  • Array [

    text  string  required

    The transcribed text.

    Example: Also, I'm allergic to peanuts.

    speaker  copilot_speaker (string)  required

    Who said the text in this transcript item.

    Possible values: [doctor, patient, unspecified]

    Example: doctor
    start_offset_ms  integer  required

    Start time of this transcription item as the offset, in milliseconds, from the start of the audio file.

    Example: 65100
    end_offset_ms  integer  required

    End time of this transcription item as the offset, in milliseconds, from the start of the audio file. Equals start_offset_ms plus the duration of the related transcribed audio portion.

    Example: 69300
  • ]
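To show how the response schema above might be consumed, the following sketch parses a response body into typed items and derives each item's duration from the two offsets. The sample payload reuses the example values from this page; the `TranscriptItem` class and the `parse_transcript` and `format_line` helpers are hypothetical names, and the sketch assumes the response is a JSON object with a top-level `transcript` array.

```python
import json
from dataclasses import dataclass


@dataclass
class TranscriptItem:
    text: str
    speaker: str  # one of: doctor, patient, unspecified
    start_offset_ms: int
    end_offset_ms: int

    @property
    def duration_ms(self) -> int:
        """Length of the transcribed audio portion, in milliseconds."""
        return self.end_offset_ms - self.start_offset_ms


def parse_transcript(response_body: str) -> list[TranscriptItem]:
    """Convert a JSON response body into a list of TranscriptItem."""
    payload = json.loads(response_body)
    return [
        TranscriptItem(
            text=item["text"],
            speaker=item["speaker"],
            start_offset_ms=item["start_offset_ms"],
            end_offset_ms=item["end_offset_ms"],
        )
        for item in payload["transcript"]
    ]


def format_line(item: TranscriptItem) -> str:
    """Render one item as '[<start in seconds>s] <speaker>: <text>'."""
    return f"[{item.start_offset_ms / 1000:.1f}s] {item.speaker}: {item.text}"


# Sample body assembled from the example values in the schema above.
SAMPLE_RESPONSE = json.dumps({
    "transcript": [
        {
            "text": "Also, I'm allergic to peanuts.",
            "speaker": "doctor",
            "start_offset_ms": 65100,
            "end_offset_ms": 69300,
        }
    ]
})

if __name__ == "__main__":
    for item in parse_transcript(SAMPLE_RESPONSE):
        print(format_line(item), f"({item.duration_ms} ms)")
```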