Windows Media Audio Voice

From MultimediaWiki

The voice codec from Microsoft's line of Windows Media audio codecs.

WMAVoice packets are generally encapsulated in an ASF container stream. Voice codecs are optimized for speech data transmitted at low samplerates (e.g. 8000Hz) and bitrates (6-20kbps). Except for some interesting peculiarities, the format is very similar to other members of the CELP-family of speech codecs (QCELP, ACELP, SIPRO, etc.).

Stream header

The ASF stream header specifies WAVEFORMATEX, containing things like sample_rate, block_align, channels, and (codec-specific) extradata.
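
As a rough illustration of the fields mentioned above, the following sketch reads a little-endian WAVEFORMATEX structure (18 bytes, followed by cbSize bytes of extradata) from a byte buffer. The type and function names (WaveFormatEx, parse_waveformatex, rl16, rl32) are made up for this example, and how the structure is located inside the ASF stream header is not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* WAVEFORMATEX fields, plus a pointer to the codec-specific extradata. */
typedef struct {
    uint16_t format_tag;
    uint16_t channels;
    uint32_t sample_rate;
    uint32_t avg_bytes_per_sec;
    uint16_t block_align;
    uint16_t bits_per_sample;
    uint16_t extradata_size;   /* cbSize */
    const uint8_t *extradata;  /* points into the caller's buffer */
} WaveFormatEx;

static uint16_t rl16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rl32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the buffer is too short. */
static int parse_waveformatex(const uint8_t *buf, size_t size, WaveFormatEx *out)
{
    assert(buf && out);
    if (size < 18)
        return -1;
    out->format_tag        = rl16(buf);
    out->channels          = rl16(buf + 2);
    out->sample_rate       = rl32(buf + 4);
    out->avg_bytes_per_sec = rl32(buf + 8);
    out->block_align       = rl16(buf + 12);
    out->bits_per_sample   = rl16(buf + 14);
    out->extradata_size    = rl16(buf + 16);
    if (size - 18 < out->extradata_size)
        return -1;
    out->extradata = buf + 18;
    return 0;
}
```

A WMAVoice decoder would then check that extradata_size is 46 before interpreting the extradata.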

Extradata

The extradata contains the following information (46 bytes):

  • 0-17: WMAPro extradata (for WMAPro-in-WMAVoice packet data)
  • 18-21: stream flags, in LE format [32 bits]. Meaning of bits, from least to most significant:
    • 0: whether synthesized speech data should go through a postfilter before being presented to the application requesting decoding
    • 1: a hint to the decoder as to whether the postfilter should be applied "inline" during sample synthesis (i.e. "per-block"), or delayed until sample synthesis is finished (i.e. "per-frame"); the decoder is free to ignore this hint
    • 2-5: postfilter frequency domain, specifies which particular preset table should be used during frequency domain filtering [0-15]
    • 6: whether the postfilter should apply spectrum tilt correction (clipping protection)
    • 7-10: postfilter projection mode [0-15]
    • 11: unused
    • 12: if set, indicates a (LSP/LPC) filter order of 16; if unset, the filter order is 10
    • 13: specifies which table should be used for interpolation of residual LSPs
    • 14: specifies which table should be used for mean LSP values
    • 15-18: unused
    • 19-20: channel-related options for WMAPro-in-WMAVoice
    • 21-23: hint to decoder on how many postfilter loops to use per frame
      • 21: 8 (each loop filters 20 samples)
      • 22: 4 (each loop filters 40 samples; only if 21 is unset, or default if none are set)
      • 23: 2 (each loop filters 80 samples; only if 21 and 22 are unset)
    • 24-31: unused
  • 22-45: frame type VLC setup bytes
    • 17 x 3 bits (+trailing zeroes). WMAVoice has 17 frame types (see below), and most streams only use a small subset of these, so VLC codes are used to minimize the number of bits spent on the frame type. To parse the setup data, read 17 3-bit values (triplets, each in [0-7]). Keep track of how often each value has occurred so far; a value in [0-6] may occur at most 3 times and the value 7 at most 4 times. The VLC table index is triplet_value * 3 + times_occurred (counting earlier occurrences only), and the value it encodes is the triplet number (i.e. [0-16]). So for a linear (identity) VLC table, the triplets would code 0-0-0-1-1-1-2-2-2-3-3-3-[etc].
    • the VLC codes themselves are coded as pairs of two bits. If both bits of a pair are 1, read another pair, up to a maximum of 7 pairs (so at most 14 bits are read). The VLC index is then n_prefix * 3 + value_of_last_two_bits, where n_prefix is the number of all-ones pairs preceding the last pair read, and the frame type it encodes can be deduced as specified above.
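
The flag layout and the frame type VLC scheme described above can be sketched as follows. This is a minimal illustration, not a reference implementation: BitReader, read_bits, StreamFlags and all function names are hypothetical helpers introduced here. The caller is assumed to have already read the 32-bit little-endian flags word and to pass a pointer to the VLC setup bytes. The VLC index is computed as (number of all-ones pairs read before the final pair) * 3 + value of the final pair, so that the first code group maps to indices 0-2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical MSB-first bit reader; not part of the format description. */
typedef struct { const uint8_t *buf; size_t size_bits, pos; } BitReader;

static unsigned read_bits(BitReader *br, unsigned n)
{
    unsigned v = 0;
    assert(n <= 32 && br->pos + n <= br->size_bits);
    while (n--) {
        v = (v << 1) | ((br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1u);
        br->pos++;
    }
    return v;
}

typedef struct {
    int postfilter, postfilter_inline, freq_preset, tilt_correction;
    int projection_mode, lsp_order, lsp_interp_table, lsp_mean_table;
} StreamFlags;

/* Splits the 32-bit stream flags word into the fields listed above. */
static StreamFlags parse_stream_flags(uint32_t flags)
{
    StreamFlags f;
    f.postfilter        =  flags        & 1;
    f.postfilter_inline = (flags >> 1)  & 1;
    f.freq_preset       = (flags >> 2)  & 0xF;
    f.tilt_correction   = (flags >> 6)  & 1;
    f.projection_mode   = (flags >> 7)  & 0xF;
    f.lsp_order         = (flags >> 12) & 1 ? 16 : 10;
    f.lsp_interp_table  = (flags >> 13) & 1;
    f.lsp_mean_table    = (flags >> 14) & 1;
    return f;
}

#define VLC_SLOTS 22 /* 3 slots each for triplet values 0-6, 4 for value 7 */

/* Builds the VLC-index -> frame-type table; returns -1 on invalid setup data. */
static int build_frame_type_table(const uint8_t *setup, size_t size,
                                  int8_t table[VLC_SLOTS])
{
    BitReader br = { setup, size * 8, 0 };
    int seen[8] = { 0 };

    assert(size * 8 >= 17 * 3);
    memset(table, -1, VLC_SLOTS); /* -1 marks an unassigned code */
    for (int type = 0; type < 17; type++) {
        unsigned t = read_bits(&br, 3);
        if (seen[t] >= 3 + (t == 7))
            return -1;
        table[t * 3 + seen[t]++] = (int8_t)type;
    }
    return 0;
}

/* Reads one frame type (0-16), or -1 if the code is unassigned. */
static int read_frame_type(BitReader *br, const int8_t table[VLC_SLOTS])
{
    for (int prefix = 0; prefix < 7; prefix++) {
        unsigned pair = read_bits(br, 2);
        if (pair != 3 || prefix == 6)
            return table[prefix * 3 + pair];
    }
    return -1; /* not reached */
}
```

With the identity setup (triplets 0-0-0-1-1-1-...), frame type 0 is coded as the 2-bit code 00 and frame type 4 as the 4-bit code 11 01.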

Stream data

Each stream packet contains block_align (as specified in the ASF stream header) bytes, or a multiple thereof, and has a header at the beginning of each block_align bytes. The packet as a whole should be read as a bitstream, where the most significant bit is read first.

Packet header

Each packet starts with a packet header:

  • 4: packet sequence number - should increment by one for each packet
  • 1: whether the superframes in this packet use residual LSP coding (1), or only independent LSP coding (0)
  • n x 6: number of superframes in this packet (including the start of a cross-packet superframe at the end of the packet, but excluding the end of a cross-packet superframe at the beginning of the packet). To parse this value, keep reading 6-bit values until one of them is not 0x3F (i.e. not all 6 bits are 1). n_superframes is the sum of all 6-bit values read.
  • s_bits: the number of bits at the beginning of the packet that belong to the end of a cross-packet superframe. When resyncing, this number of bits can be skipped to arrive at the start of a new superframe. The width of this field is ceil(log2(block_align_in_bits)) bits, where block_align_in_bits is block_align (as specified in the ASF header) * 8.
  • The packet header is followed by a number of superframes (where that number is specified in the packet header). Superframes can cross packet boundaries. In that case, the decoder is responsible for concatenating the start of a superframe (at the end of a packet) to the end of a superframe (at the beginning of the next packet), so that it can be decoded at once.
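
The packet header fields listed above could be read along these lines. This is a sketch only: BitReader, read_bits, ceil_log2, PacketHeader and parse_packet_header are hypothetical names, and the bounds checks are an addition that the description does not mandate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader; not part of the format description. */
typedef struct { const uint8_t *buf; size_t size_bits, pos; } BitReader;

static unsigned read_bits(BitReader *br, unsigned n)
{
    unsigned v = 0;
    assert(n <= 32 && br->pos + n <= br->size_bits);
    while (n--) {
        v = (v << 1) | ((br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1u);
        br->pos++;
    }
    return v;
}

/* Smallest n with 2^n >= v, i.e. ceil(log2(v)) for v >= 1. */
static unsigned ceil_log2(unsigned v)
{
    unsigned n = 0;
    assert(v >= 1);
    while ((1u << n) < v)
        n++;
    return n;
}

typedef struct {
    unsigned sequence;       /* 4-bit packet sequence number */
    int      residual_lsps;  /* 1: superframes use residual LSP coding */
    unsigned n_superframes;  /* superframes starting in this packet */
    unsigned spillover_bits; /* bits that finish the previous packet's superframe */
} PacketHeader;

/* Returns 0 on success, -1 if the packet is too short for its header. */
static int parse_packet_header(BitReader *br, unsigned block_align, PacketHeader *h)
{
    unsigned s_bits = ceil_log2(block_align * 8);
    unsigned v;

    if (br->size_bits - br->pos < 5)
        return -1;
    h->sequence      = read_bits(br, 4);
    h->residual_lsps = (int)read_bits(br, 1);
    h->n_superframes = 0;
    do {
        if (br->size_bits - br->pos < 6 + s_bits)
            return -1;
        v = read_bits(br, 6);
        h->n_superframes += v;
    } while (v == 0x3F); /* an all-ones value means another 6-bit value follows */
    h->spillover_bits = read_bits(br, s_bits);
    return 0;
}
```

For block_align = 480, the spillover field is ceil(log2(3840)) = 12 bits wide.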

Superframes

Superframes are the primary means of voice data coding in the WMAVoice codec. Superframes code for 480 samples of speech data, divided over 3 frames (see below). In addition to the contained frames, superframes also encode LSPs and some header data. The superframe data is parsed as follows:

  • 1: WMAPro-in-WMAVoice-bit - if set, the remainder of the superframe should be parsed as a WMAPro superframe (instead of a WMAVoice superframe). If unset, parse as a WMAVoice superframe.
  • 1: custom-nr-of-samples: if set, the superframe specifies a custom number of samples encoded by this superframe. If not set, the superframe encodes 480 samples.
  • [12]: (only if custom-nr-of-samples == 1) number of samples in this superframe. Should be >= 0 and <= 480.
  • [48 or 60]: (if residually coded, i.e. LSPs are coded for all frames in the superframe at once) LSPs. See below for details on LSP coding.
    • If filter order is 10, this is 48 bits (24 bits for the independently-coded LSPs of the third frame, and another 24 bits for the information needed to derive the residually-coded LSPs for the first two frames);
    • If filter order is 16, this is 60 bits (34 bits for the independently-coded LSPs of the third frame, and another 26 bits for the information needed to derive the residually-coded LSPs for the first two frames).
  • After that, the data for 3 frames follows. If LSPs are independently-coded (i.e. not residually), then each frame is preceded by 24 or 34 bits coding for the independent LSPs. LSP coding is detailed below.
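
To make the LSP bit budget above concrete, the following small helper (the name lsp_bits_per_superframe is made up for this example) returns the total number of LSP bits in one superframe for each combination of coding mode and filter order.

```c
#include <assert.h>

/* Total number of bits spent on LSPs in one superframe. */
static unsigned lsp_bits_per_superframe(int residual_lsps, int lsp_order)
{
    assert(lsp_order == 10 || lsp_order == 16);

    /* one independently coded LSP set */
    unsigned independent = lsp_order == 16 ? 34 : 24;

    if (residual_lsps) /* coded once, before the three frames */
        return independent + (lsp_order == 16 ? 26u : 24u);
    return 3 * independent; /* one set in front of each of the 3 frames */
}
```

Residual coding therefore costs 48 or 60 bits per superframe, against 72 or 102 bits when every frame carries independently coded LSPs.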

Frames

Frames code 160 samples of speech data, which are divided over a number of blocks (1, 2, 4 or 8, depending on the frame type). These are coded either by a hardcoded, fixed codebook, or by a combination of fixed and adaptive codebook, plus a gain for each. After all of that is decoded, the resulting excitation signal can be used to run a speech synthesis filter (which uses the earlier-mentioned LSPs, converted to LPCs). The result of that filter is the wave signal representing the source audio. Frame data is parsed as follows:

  • n x 2: frames start with a VLC-coded frame type indicator, as discussed earlier in the extradata-section. WMAVoice has 17 frame types, which differ in their number of blocks per frame, adaptive codebook (ACB) type, fixed codebook (FCB) type and/or number or distribution of pulses within the fixed codebook.
num_blocks |  ACB-type  |       FCB-type        | double_pulses
========== |  ========  |       ========        | =============
    1      |    N/A     |       silence         |  N/A
    2      |    N/A     |      hardcoded        |  N/A
    2      | Asymmetric | pitch-adaptive window |  N/A
    2      | Asymmetric |   excitation pulses   |   2
    2      | Asymmetric |   excitation pulses   |   5
    4      | Asymmetric |   excitation pulses   |   0
    4      | Asymmetric |   excitation pulses   |   2
    4      | Asymmetric |   excitation pulses   |   5
    2      |  Hamming   |   excitation pulses   |   0
    2      |  Hamming   |   excitation pulses   |   2
    2      |  Hamming   |   excitation pulses   |   5
    4      |  Hamming   |   excitation pulses   |   0
    4      |  Hamming   |   excitation pulses   |   2
    4      |  Hamming   |   excitation pulses   |   5
    8      |  Hamming   |   excitation pulses   |   0
    8      |  Hamming   |   excitation pulses   |   2
    8      |  Hamming   |   excitation pulses   |   5
  • [8]: (if FCB-type == silence), the next 8 bits specify the gain for the hardcoded / fixed codebook that will be used to generate noise signal. The position at which this codebook will be read is not actually specified in the bitstream (and it's not particularly relevant for comfort noise). The gain calculated from the index ([0-255]) is quasi-logarithmic with a max. value of ~4e-1, starting at ~2e-5.
  • (if ACB-type == ASYMMETRIC), next follows the pitch-per-frame coding. For frame-based pitch coding, we assume that the pitch is within the following range:
min_pitch   = 0.0025 * samplerate;
max_pitch   = 0.0185 * samplerate;
pitch_range = max_pitch - min_pitch;

The number of bits required for pitch coding can be derived from this range, and is then used to read the pitch from the bitstream:

frame_pitch_nbits = ceil(log2(pitch_range));
frame_pitch       = min_pitch + read_bits(frame_pitch_nbits);

The official meaning of "frame_pitch" is "the pitch of the last sample of the last block of this frame". You can cache the previous frame's pitch to calculate a gradually increasing/decreasing value for use between blocks (or even samples):

for (n = 0; n < num_blocks /* 2 or 4 */; n++)
    // the two weights, (2n + 1) and (2 * num_blocks - (2n + 1)), sum to 2 * num_blocks
    block_pitch[n]  = ((n * 2 + 1) * frame_pitch + (num_blocks * 2 - (n * 2 + 1)) * prev_frame_pitch) / (num_blocks * 2);
for (n = 0; n < num_samples /* 160 */; n++)
    sample_pitch[n] = prev_frame_pitch + n * (frame_pitch - prev_frame_pitch) / num_samples;
  • (if FCB-type == pitch-adaptive window), we use this pitch to derive the original offset/position of pulses used in pitch-adaptive window coding, which is a technique optimized for ultra low-bitrate (<= 8 kbps) streams. The original offset of the first pulse (relative to the start of this frame) is provided in the bitstream, as an index in a table with 94 entries. The range of entries in the table is from -11 to 15 (with a stepping of 2), then 18, 17, 19-33 with a stepping of 1 and 35-159 with a stepping of 2. The position can be read as:
pos = read_bits(6);
if (pos >= 54)
   pos += (pos - 54) * 3 + read_bits(2);

Then from this, the original position of the pulse relative to this frame, and its distribution over the two blocks making up pitch-adaptive window-coded frames can be calculated.

  • After this, block parsing starts. Every block is preceded by block-pitches if ACB-type == Hamming. These pitch values are coded in a slightly different manner, in that lower pitch values have a higher precision than higher pitch values (i.e. semi-logarithmic). The range of potential log-pitch values can be divided into four domains as follows:
conv[0] = (pitch_range * 25) >> 6; // ~0.4
conv[1] = (pitch_range * 44) >> 6; // ~0.7
block_pitch_range       = max_pitch + conv[1] + 2 * conv[0] - 4 * min_pitch;
block_pitch_nbits       = ceil(log2(block_pitch_range));
block_pitch_delta_range = pitch_range >> 7 << 5;
block_pitch_delta_nbits = ceil(log2(block_pitch_delta_range));

Then the pitch is calculated as follows:

if (block_num == 0) {
    block_pitch = read_bits(block_pitch_nbits);
} else {
    block_pitch_delta = read_bits(block_pitch_delta_nbits) - block_pitch_delta_range / 2;
    block_pitch = last_block_pitch + block_pitch_delta;
}
last_block_pitch = clip(block_pitch, block_pitch_delta_range / 2,
                        block_pitch_range - block_pitch_delta_range / 2);

The block_pitch is currently still in its logarithmic / nonlinear scale. To convert it to linear units, use:

t1 = (conv[0]   - min_pitch) * 4;
t2 = (conv[1]   - conv[0])   * 2;
t3 =  max_pitch - conv[1];
if (block_pitch < t1) {
    pitch = min_pitch + block_pitch / 4;
} else {
    block_pitch -= t1;
    if (block_pitch < t2) {
        pitch = conv[0] + block_pitch / 2;
    } else {
        block_pitch -= t2;
        if (block_pitch < t3)
            pitch = conv[1] + block_pitch;
        else
            pitch = max_pitch - 1;
    }
}
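
As a worked example of the pitch-related formulas above, the sketch below derives all ranges and bit counts from the sample rate and converts a semi-logarithmic block pitch to linear units. The struct and function names are hypothetical. Two further assumptions are made: the products 0.0025 * samplerate and 0.0185 * samplerate are truncated to integers, and the divisions by 4 and 2 in the conversion keep their fractional part. The sketch also assumes pitch_range >= 128 (i.e. a sample rate of at least 8000 Hz), since otherwise block_pitch_delta_range would be 0.

```c
#include <assert.h>

typedef struct {
    int min_pitch, max_pitch, pitch_range, frame_pitch_nbits;
    int conv[2];
    int block_pitch_range, block_pitch_nbits;
    int block_pitch_delta_range, block_pitch_delta_nbits;
} PitchParams;

/* Smallest n with 2^n >= v, i.e. ceil(log2(v)) for v >= 1. */
static int ceil_log2(int v)
{
    int n = 0;
    assert(v >= 1);
    while ((1 << n) < v)
        n++;
    return n;
}

static PitchParams pitch_params(int sample_rate)
{
    PitchParams p;
    p.min_pitch   = sample_rate / 400;       /* 0.0025 * samplerate, truncated (assumption) */
    p.max_pitch   = sample_rate * 37 / 2000; /* 0.0185 * samplerate, truncated (assumption) */
    p.pitch_range = p.max_pitch - p.min_pitch;
    p.frame_pitch_nbits = ceil_log2(p.pitch_range);
    p.conv[0] = (p.pitch_range * 25) >> 6;
    p.conv[1] = (p.pitch_range * 44) >> 6;
    p.block_pitch_range = p.max_pitch + p.conv[1] + 2 * p.conv[0] - 4 * p.min_pitch;
    p.block_pitch_nbits = ceil_log2(p.block_pitch_range);
    p.block_pitch_delta_range = p.pitch_range >> 7 << 5;
    p.block_pitch_delta_nbits = ceil_log2(p.block_pitch_delta_range);
    return p;
}

/* Converts a semi-logarithmic block pitch value to a linear pitch. */
static float block_pitch_to_linear(const PitchParams *p, int block_pitch)
{
    int t1 = (p->conv[0] - p->min_pitch) * 4; /* quarter-sample steps */
    int t2 = (p->conv[1] - p->conv[0]) * 2;   /* half-sample steps */
    int t3 = p->max_pitch - p->conv[1];       /* whole-sample steps */

    if (block_pitch < t1)
        return p->min_pitch + block_pitch / 4.0f;
    block_pitch -= t1;
    if (block_pitch < t2)
        return p->conv[0] + block_pitch / 2.0f;
    block_pitch -= t2;
    if (block_pitch < t3)
        return (float)(p->conv[1] + block_pitch);
    return (float)(p->max_pitch - 1);
}
```

At 8000 Hz this gives min_pitch = 20, max_pitch = 148, a 7-bit frame pitch, an 8-bit first block pitch (block_pitch_range = 256) and 5-bit block pitch deltas.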

Blocks for frames using hardcoded codebook

Hardcoded codebook blocks are fairly simple.

  • If the block encodes comfort noise during silence, then we can simply use a (semi-random) index into a hardcoded codebook (any index is fine), and as a gain we use the per-frame coded value (see above). The product of these is our excitation output.
  • If the block encodes actual excitation signal (FCB-type == hardcoded), then the next 8 bits in the bitstream encode the position in a hardcoded codebook, as chosen by the encoder. The next 6 bits specify the index into a universal gain table, which is (like the comfort noise gain value) semi-logarithmic, going from ~0.0 to 0.22. The product of these is our excitation output.
  • After this, the excitation output can directly be moved to the speech synthesis filter to produce the output sound.

Blocks for frames using excitation pulses and adaptive codebooks

We start block-based decoding by extracting the fixed codebook from the bitstream:

  • (if FCB-type == pitch-adaptive window), then we will use the window position that we calculated for this frame (see above) to create excitation pulses. From this value (interpreting it as a repetitive pulse, starting at that position with pitch as interval), we can calculate the distribution (positions and number) of pulses in each block. If the start offset (which is coded in a table, as described above) is higher than the position of this block, then no specific pulses are located within this block. This is called windowless coding. Else we use windowed coding. The number of bits used to encode the values used here is 12, except if this is the first block and the window position was encoded in 8 bits (i.e. extended_window_coding = 1), in which case it is 10.
    • For windowed coding, 2 modes are present, depending on the pitch value. If pitch[0] > 32 and pitch[1] > 32, we call this a "high-pitch" frame, else it's a "low-pitch" frame:
pitch type:  low | high
             === | ====
  # pulses:   4  |   3
bits/pulse:   3  |   4
excl. bits:  16  |  24

We can then parse each pulse as a self-repeating pulse located in the periphery of the position that we calculated earlier as the "approximate pulse start position in this block".

int bitval = read_bits(12 or 10);
for (n = n_pulses - 1; n >= 0; n--, bitval >>= bits_per_pulse) {
    // the top bit of each bits_per_pulse-wide field is the sign, the remaining bits are the position index
    float sign         = (bitval & (1 << (bits_per_pulse - 1))) ? -1.0 : 1.0;
    int start_position = pulse_off[block_idx] + n + (bitval & ((1 << (bits_per_pulse - 1)) - 1)) * n_pulses;
}
    • For windowless blocks, we use a different method (because no pulse_off[] is set for this block). Note also that, unlike windowed pulses, these pulses are non-repetitive:
int bitval    = read_bits(12 or 10);
float sign1   = (bitval & (1 << 9)) ? -1.0 : 1.0;
float sign2   = (bitval & 1) ? -sign1 : sign1;
int pos       = (bitval & ((1 << 9) - 1)) >> 1; // [0 - 0xFF]
int delta, idx;
if (pos < 1 * 79)      { delta = 1; idx = pos + 1; }
else if (pos < 2 * 78) { delta = 3; idx = pos + 1 - 1 * 77; }
else if (pos < 3 * 77) { delta = 5; idx = pos + 1 - 2 * 76; }
else                   { delta = 7; idx = pos + 1 - 3 * 75; }
// and if blocks were larger, then this would continue ad infinitum
int position1 = idx - delta;
int position2 = idx;
    • After either one of these, an extra set of pulses follows in a different domain. We set up a bitmask consisting of 80 items, and exclude all values in the range [pulse_off, pulse_off + excl_bits], repeating this exclusion at an interval of pitch. Then, we read a number of bits that depends on whether the first pulse set was coded using windowed (5 bits for the first block, 3 bits for the second) or windowless coding (4 bits/block). These bits form an index value. The position at which the next pulse should start is that of the "value"th bit that was not excluded by the previous operation. Using that as a position, we can derive the sign by reading one more bit. If set, the sign of this pulse is -1.0, else it is 1.0.
  • (if FCB-type == excitation pulses), then we apply a set of non-repetitive pulses to generate the excitation codebook:
int nbits = 5 - log2(num_blocks);
for (n = 0; n < 5; n++) {
    float sign1   = read_bits(1) ? 1.0 : -1.0;
    int position1 = read_bits(nbits) * 5 + n;
    if (n < double_pulses) {
        int position2 = read_bits(nbits) * 5 + n;
        float sign2   = position1 < position2 ? -sign1 : sign1;
    }
}
  • The fixed and adaptive codebook gain values share the same index value (7 bits), and are nearly identical to AMR-NB gain calculation.
  • The adaptive codebook is calculated using the pitch, in a manner very similar to AMR-NB / SIPRO. The calculations are only slightly different for Hamming vs. Asymmetric ACB-type, basically differing in the symmetry of the interpolation table.
  • The fixed and adaptive codebook are then merged (using their respective gain values) to create the excitation signal, which is used as input for the speech synthesis filter. This is the same kind of LPC synthesis filter as used by the other CELP-family voice codecs mentioned above.
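
The exclusion-mask lookup used for the extra pulse in pitch-adaptive window blocks (described above) can be sketched as follows. The function name nth_free_position is made up for this example. The sketch treats each excluded range as excl_bits consecutive positions starting at pulse_off, which is an assumption about the exact end point of the interval, and it uses a 0-based index; a real decoder may differ in both details.

```c
#include <assert.h>

#define BLOCK_SIZE 80 /* samples per block in pitch-adaptive window frames */

/*
 * Returns the position of the idx-th (0-based) sample in the block that is
 * not covered by an excluded range, or -1 if there are not enough of them.
 */
static int nth_free_position(int pulse_off, int excl_bits, int pitch, int idx)
{
    unsigned char excluded[BLOCK_SIZE] = { 0 };

    assert(pitch > 0 && excl_bits > 0 && idx >= 0);

    /* mark [start, start + excl_bits) for every repetition of the pulse */
    for (int start = pulse_off; start < BLOCK_SIZE; start += pitch)
        for (int i = start; i < start + excl_bits && i < BLOCK_SIZE; i++)
            if (i >= 0)
                excluded[i] = 1;

    /* walk the block and count only the samples that were not excluded */
    for (int pos = 0; pos < BLOCK_SIZE; pos++)
        if (!excluded[pos] && idx-- == 0)
            return pos;
    return -1;
}
```

For example, with pulse_off = 10, excl_bits = 16 and pitch = 40, positions 10-25 and 50-65 are excluded, so index 10 maps to position 26. The sign of the pulse is then taken from the next bit in the bitstream.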

Averaging projection filter (APF) / postfilter

  • TODO

LSP coding

  • TODO