Windows Media Audio Voice

From MultimediaWiki

The voice codec from Microsoft's line of Windows Media audio codecs.

WMAVoice packets are generally encapsulated in an ASF container stream. Voice codecs are optimized for speech data transmitted at low sample rates (e.g. 8000 Hz) and bitrates (6-20 kbps). Aside from some interesting peculiarities, the format is very similar to other members of the CELP family of speech codecs (QCELP, ACELP, SIPRO, etc.).

Stream header

The ASF stream header specifies a WAVEFORMATEX structure, containing fields such as sample_rate, block_align, channels, and (codec-specific) extradata.

Extradata

The extradata contains the following information (46 bytes):

  • 0-18: WMAPro-extradata (for WMAPro-in-WMAVoice packet data)
  • 19-22: stream flags, in LE format [32 bits]. Meaning of bits, from least to most significant:
    • 0: whether synthesized speech data should go through a postfilter before being presented to the application requesting decoding
    • 1: a hint to the decoder as to whether the postfilter should be applied "inline" during sample synthesis (i.e. "per-block") or delayed until sample synthesis is finished (i.e. "per-frame"); the decoder is free to ignore this
    • 2-5: postfilter frequency domain, specifies which particular preset table should be used during frequency domain filtering [0-15]
    • 6: whether the postfilter should apply spectrum tilt correction (clipping protection)
    • 7-10: postfilter projection mode [0-15]
    • 11: unused
    • 12: if set, indicates a (LSP/LPC) filter order of 16; if unset, the filter order is 10
    • 13: specifies which table should be used for interpolation of residual LSPs
    • 14: specifies which table should be used for mean LSP values
    • 15-18: unused
    • 19-20: channel-related options for WMAPro-in-WMAVoice
    • 21-23: hint to decoder on how many postfilter loops to use per frame
      • 21: 8 (each loop filters 20 samples)
      • 22: 4 (each loop filters 40 samples; only if 21 is unset, or default if none are set)
      • 23: 2 (each loop filters 80 samples; only if 21 and 22 are unset)
    • 24-31: unused
  • 23-46: frame type VLC setup bytes
    • 17x3 bits (+trailing zeroes). WMAVoice has 17 frame types (see below), and most streams only use a small subset of these. They use VLC codes to minimize the number of bits used to encode the frame type. To parse the init code, read each 3-bit value (triplet: [0-7]). Keep track of how often each value has occurred; this should be at most 3 for [0-6] and at most 4 for 7. For the VLC table, the triplet value * 3 + times_occurred is the VLC table index, and the triplet number (i.e. [0-16]) is the value it encodes. So for a linear (identity) VLC table, the triplets would code 0-0-0-1-1-1-2-2-2-3-3-3-[etc].
    • the VLC codes themselves are read two bits at a time. If the two bits are both 1, read another two bits, up to a maximum of 7 times (so at most 14 bits are read). The VLC index is then (n_bits/2) * 3 + value_of_last_two_bits, and the value it encodes can be deduced as specified above.
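The extradata layout above can be sketched in code. This is a hedged illustration, not a reference implementation: the offsets follow this article, and all function and variable names (parse_extradata, lsp_order, vlc_map, etc.) are illustrative.

```python
import struct

def parse_extradata(extradata: bytes):
    """Parse the 46-byte WMAVoice extradata per the layout above (a sketch)."""
    # Stream flags: 32 bits, little-endian, at byte offset 19.
    flags = struct.unpack_from("<I", extradata, 19)[0]
    use_postfilter = bool(flags & 1)               # bit 0
    lsp_order = 16 if flags & (1 << 12) else 10    # bit 12: filter order 16 vs. 10

    # Frame-type VLC setup: 17 3-bit triplets, MSB-first, starting at byte 23.
    occurrences = [0] * 8
    vlc_map = {}          # VLC table index -> frame type [0-16]
    bitpos = 23 * 8
    for frame_type in range(17):
        byte = bitpos >> 3
        # Read 3 bits MSB-first (the triplet may straddle a byte boundary).
        window = int.from_bytes(extradata[byte:byte + 2].ljust(2, b"\0"), "big")
        triplet = (window >> (16 - (bitpos & 7) - 3)) & 7
        # Table index = triplet value * 3 + number of prior occurrences.
        vlc_map[triplet * 3 + occurrences[triplet]] = frame_type
        occurrences[triplet] += 1
        bitpos += 3
    return flags, lsp_order, use_postfilter, vlc_map
```

With the identity triplet sequence 0-0-0-1-1-1-2-2-2-... described above, vlc_map comes out as the identity mapping over [0-16].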

Stream data

Each stream packet contains block_align (as specified in the ASF stream header) bytes, or a multiple thereof, and has a header at the beginning of each block_align bytes. The packet as a whole should be read as a bitstream, where the most significant bit is read first.
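The MSB-first bitstream convention above can be captured in a minimal reader. This is a sketch; the class and method names are illustrative and not taken from any real decoder.

```python
class BitReader:
    """Minimal MSB-first bit reader over a byte buffer."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # bit position from the start of the buffer

    def get_bits(self, n: int) -> int:
        """Read n bits, most significant bit of each byte first."""
        value = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

# Example: the byte 0b10110100 read as two 4-bit fields.
br = BitReader(bytes([0b10110100]))
assert br.get_bits(4) == 0b1011
assert br.get_bits(4) == 0b0100
```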

Packet header

Each packet starts with a packet header:

  • 4: packet sequence number - should increment by one for each packet
  • 1: whether the superframes in this packet use residual LSP coding (1), or only independent LSP coding (0)
  • n x 6: number of superframes in this packet (including the start of a cross-packet superframe at the end of the packet, but excluding the end of a cross-packet superframe at the beginning of the packet). To parse this value, read 6-bit values until a value is read in which not all 6 bits are 1 (i.e. not 0x3F). n_superframes is the sum of all six-bit values read.
  • s_bits: The number of bits at the beginning of the packet which represent the end of a cross-packet superframe. When resyncing, this many bits can be skipped to arrive at the start of a new superframe. The number of bits used to code this value is derived as ceil(log2(block_align_in_bits)), where block_align_in_bits is block_align (as specified in the ASF header) * 8.
  • The packet header is followed by a number of superframes (where that number is specified in the packet header). Superframes can cross packet boundaries. In that case, the decoder is responsible for concatenating the start of a superframe (at the end of a packet) to the end of a superframe (at the beginning of the next packet), so that it can be decoded at once.
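The packet header fields above can be sketched as follows, assuming an MSB-first get_bits function like the bitstream convention described earlier. All names are illustrative.

```python
import math

def parse_packet_header(get_bits, block_align: int):
    """Parse the WMAVoice packet header per the layout above (a sketch).

    get_bits(n) is assumed to read n bits MSB-first from the packet.
    """
    seq = get_bits(4)            # packet sequence number
    residual_lsps = get_bits(1)  # residual (1) vs. independent-only (0) LSP coding

    # Superframe count: sum of 6-bit values, terminated by the first
    # value that is not all ones (0x3F).
    n_superframes = 0
    while True:
        v = get_bits(6)
        n_superframes += v
        if v != 0x3F:
            break

    # s_bits: tail of a superframe begun in the previous packet.
    # Field width is ceil(log2(block_align * 8)).
    s_bits_len = math.ceil(math.log2(block_align * 8))
    skip_bits = get_bits(s_bits_len)
    return seq, residual_lsps, n_superframes, skip_bits
```

For example, with block_align = 64, block_align_in_bits is 512 and the s_bits field is 9 bits wide.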

Superframes

Superframes are the primary means of voice data coding in the WMAVoice codec. Superframes code for 480 samples of speech data, divided over 3 frames (see below). In addition to the contained frames, superframes also encode LSPs and some header data. Superframe data is parsed as follows:

  • 1: WMAPro-in-WMAVoice-bit - if set, the remainder of the superframe should be parsed as a WMAPro superframe (instead of a WMAVoice superframe). If unset, parse as a WMAVoice superframe.
  • 1: custom-nr-of-samples: if set, the superframe specifies a custom number of samples encoded by this superframe. If not set, the superframe encodes 480 samples.
  • [12]: (only if custom-nr-of-samples == 1) number of samples in this superframe. Should be >= 0 and <= 480.
  • [48 or 60]: (if residually coded, i.e. LSPs are coded for all frames in the superframe at once) LSPs. See below for details on LSP coding.
    • If filter order is 10, this is 48 bits (24 bits for the independently-coded LSPs of the third frame, and another 24 bits for the information needed to derive the residually-coded LSPs for the first two frames);
    • If filter order is 16, this is 60 bits (34 bits for the independently-coded LSPs of the third frame, and another 26 bits for the information needed to derive the residually-coded LSPs for the first two frames).
  • After that, the data for 3 frames follows. If LSPs are independently-coded (i.e. not residually), then each frame is preceded by 24 or 34 bits coding for the independent LSPs. LSP coding is detailed below.

Frames

Frames code 160 samples of speech data, which are divided over a number of blocks (1, 2, 4 or 8, depending on the frame type). These are coded either by a hardcoded, fixed codebook, or by a combination of fixed and adaptive codebook, plus a gain for each. After all of that is decoded, the resulting excitation signal can be used to run a speech synthesis filter (which uses the earlier-mentioned LSPs, converted to LPCs). The result of that filter is the wave signal representing the source audio. Frame data is parsed as follows:

  • [..]