AMR-NB-WIP
AMR narrow band decoder
This text aims to be a simpler and more explicit document of the AMR narrow band decoding processes to aid in development of a decoder. Reference to sections of the specification will be made in the following format: (c.f. §5.2.5). Happy reading.
Nomenclature weirdness
Throughout the specification, a number of references are made to the same (or very similar) items with fairly confusing variation. They are listed below to aid understanding of the following text but efforts will be made to consistently use one item name throughout or to use both with the lesser used name in parenthesis.
- Pitch / Adaptive codebook
- Fixed / Innovative (also algebraic when referring to the codebook)
- Quantified means not quantised
Summary
- Mode dependent bitstream parsing
- Indices parsed from bitstream
- Indices decoded to give LSF vectors, fractional pitch lags, innovative code vectors and the pitch and innovative gains
- LSF vectors converted to LP filter coefficients at each subframe
- Subframe decoding
- Excitation vector = adaptive code vector * adaptive (pitch) gain + innovative code vector * innovative gain
- Excitation vector filtered through an LP synthesis filter to reconstruct speech
- Speech signal filtered with adaptive postfilter
Bitstream parsing
Documented on http://wiki.multimedia.cx/index.php?title=AMR-NB and in 26.101
For implementation, see http://svn.mplayerhq.hu/soc/amr/amrnbdec.c?view=markup
Decoding of LP filter parameters
The received indices of LSP quantization are used to reconstruct the quantified LSP vectors. (c.f. §5.2.5)
12.2kbps mode summary
- indices into code books are parsed from the bit stream
- indices give elements of split matrix quantised (SMQ) residual LSF vectors from the relevant code books
- prediction from the previous frame is added to obtain the mean-removed LSF vectors
- the mean is added
- the LSF vectors are converted to cosine domain LSP vectors
Indices give elements of split matrix quantised (SMQ) residual LSF vectors from the relevant code books
The elements of the SMQ vectors are stored at an index into a code book that varies according to the mode. There are 5 code books for the 12.2kbps mode corresponding to the 5 indices. These tables will be referred to as:
lsf_m_n
- m
- the number of indices parsed according to the mode
- n
- the index 'position' i.e. 1 for the first index, etc
The 5 indices are stored using 7, 8, 8 + sign bit, 8, 6 bits respectively. The four elements of a 'split quantized sub-matrix' stored at the index position in the appropriate code book are:
- 1st index in 1st code book
- r1_1, r1_2, r2_1, r2_2
- 2nd index in 2nd code book
- r1_3, r1_4, r2_3, r2_4
- 3rd index in 3rd code book
- r1_5, r1_6, r2_5, r2_6
- 4th index in 4th code book
- r1_7, r1_8, r2_7, r2_8
- 5th index in 5th code book
- r1_9, r1_10, r2_9, r2_10
With rj_i :
- j
- the first or second residual lsf vector
- i
- the coefficient of a residual lsf vector ( i = 1, ..., 10 )
- rj_i
- residual line spectral frequencies (LSFs) in Hz
Prediction from the previous frame is added to obtain the mean-removed LSF vectors
zj(n) = rj(n) + 0.65*^r2(n-1)
- zj(n)
- a mean-removed LSF vector from the current frame (denoted n)
- ^r2(n-1)
- the quantified 2nd residual vector of the last frame (denoted n-1)
The mean is added
fj = zj + lsf_mean_m
- lsf_mean_m
- a table of the means of the LSF coefficients
- m
- the number of indices parsed according to the mode
- fj
- the LSF vectors
The LSF vectors are converted to cosine domain LSP vectors
qk_i = cos( fj_i * 2 * π / f_s )
- qk_i
- line spectral pairs (LSPs) in the cosine domain
- k
- the two lsf vectors give the LSP vectors q2, q4 at the 2nd and 4th subframes; k = 2*j
- fj_i
- ith coefficient of the jth LSF vector; [0,4000] Hz
- f_s
- sampling frequency in Hz (8kHz)
Other active modes summary
The process for the other modes is similar to that for the 12.2kbps mode.
- indices into code books are parsed from the bit stream
- indices give elements of a split vector quantised (SVQ) residual LSF vector from the relevant code books
- prediction from the previous frame is added to obtain the mean-removed LSF vector
- the mean is added
- the LSF vector is converted to a cosine domain LSP vector
Indices give elements of a split vector quantised (SVQ) residual LSF vector from the relevant code books
The 3 indices are stored with the following numbers of bits:
Mode (kbps) | 1st index (bits) | 2nd index (bits) | 3rd index (bits) |
---|---|---|---|
10.2 | 8 | 9 | 9 |
7.95 | 9 | 9 | 9 |
7.40 | 8 | 9 | 9 |
6.70 | 8 | 9 | 9 |
5.90 | 8 | 9 | 9 |
5.15 | 8 | 8 | 7 |
4.75 | 8 | 8 | 7 |
The elements of the sub-vector stored at the index position in the appropriate code book (three for the first two code books, four for the third) are:
- 1st index in 1st code book
- r_1, r_2, r_3
- 2nd index in 2nd code book
- r_4, r_5, r_6
- 3rd index in 3rd code book
- r_7, r_8, r_9, r_10
- r_i
- residual LSF vector (Hz)
- i
- the coefficient of vector ( i = 1, ..., 10 )
Prediction from the previous frame is added to obtain the mean-removed LSF vector
z_i(n) = r_i(n) + pred_fac_i * ^r_i(n-1)
- z_i(n)
- the mean-removed LSF vector from the current frame (denoted n)
- pred_fac_i
- the prediction factor for the ith LSF coefficient
- ^r_i(n-1)
- the quantified residual vector of the last frame (denoted n-1)
These processes give the LSP vector at the 4th subframe (q4)
The available LSP vector(s) are used to linearly interpolate vectors for the other subframes (c.f. §5.2.6)
12.2 kbps mode
    q1(n) = 0.5*q4(n-1) + 0.5*q2(n)
    q3(n) = 0.5*q2(n)   + 0.5*q4(n)
Other modes
    q1(n) = 0.75*q4(n-1) + 0.25*q4(n)
    q2(n) = 0.5 *q4(n-1) + 0.5 *q4(n)
    q3(n) = 0.25*q4(n-1) + 0.75*q4(n)
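A minimal C sketch of both interpolation schemes is given below. The function names and the layout `q[subframe][coefficient]` (with `q[0]` holding q1 and `q[3]` holding q4) are assumptions made for this example.

```c
#define LP_ORDER 10

/* Hypothetical helper for the 12.2 kbps mode: q[1] (q2) and q[3] (q4) must
 * already hold the decoded vectors; q[0] (q1) and q[2] (q3) are filled in. */
static void interpolate_lsp_12k2(const double q4_prev[LP_ORDER],
                                 double q[4][LP_ORDER])
{
    for (int i = 0; i < LP_ORDER; i++) {
        q[0][i] = 0.5 * q4_prev[i] + 0.5 * q[1][i];
        q[2][i] = 0.5 * q[1][i]    + 0.5 * q[3][i];
    }
}

/* Hypothetical helper for the other modes: only q[3] (q4) is decoded;
 * q[0], q[1] and q[2] are interpolated between q4(n-1) and q4(n). */
static void interpolate_lsp_other(const double q4_prev[LP_ORDER],
                                  double q[4][LP_ORDER])
{
    for (int i = 0; i < LP_ORDER; i++) {
        q[0][i] = 0.75 * q4_prev[i] + 0.25 * q[3][i];
        q[1][i] = 0.50 * q4_prev[i] + 0.50 * q[3][i];
        q[2][i] = 0.25 * q4_prev[i] + 0.75 * q[3][i];
    }
}
```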
The LSP vector is converted to LP filter coefficients (c.f. §5.2.4)
    for i=1..5
        f1_i = 2*f1_(i-2) - 2 * q_(2i-1) * f1_(i-1)
        for j=i-1..1
            f1_j += f1_(j-2) - 2 * q_(2i-1) * f1_(j-1)
        end
    end
With initial values:
    f1_-1 = 0; f1_0 = 1;
Same for f2_i with q_(2i) instead of q_(2i-1)
    for i=1..5
        f'1_i = f1_i + f1_(i-1)
        f'2_i = f2_i - f2_(i-1)
    end
    for i=1..5
        a_i = 0.5*f'1_i + 0.5*f'2_i
    end
    for i=6..10
        a_i = 0.5*f'1_(11-i) - 0.5*f'2_(11-i)
    end
- a_i
- the LP filter coefficients
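The recursion can be sketched in C as below. The function names and the storage trick (each polynomial array is shifted by one so that element 0 holds the value at index -1) are choices made for this example only. The sketch uses the initial values f(-1) = 0 and f(0) = 1.

```c
#define LP_ORDER   10
#define HALF_ORDER (LP_ORDER / 2)

/* Hypothetical helper: build f(0..5) from every second LSP, starting at
 * q[first] (0-based). f is stored shifted by one: f[k + 1] holds f(k),
 * so f[0] is f(-1). */
static void lsp_to_poly(const double *q, int first, double f[HALF_ORDER + 2])
{
    f[0] = 0.0; /* f(-1) */
    f[1] = 1.0; /* f(0)  */
    for (int i = 1; i <= HALF_ORDER; i++) {
        double qv = q[first + 2 * (i - 1)];
        /* f(i) = 2*f(i-2) - 2*q*f(i-1) */
        f[i + 1] = 2.0 * f[i - 1] - 2.0 * qv * f[i];
        /* f(j) += f(j-2) - 2*q*f(j-1), for j = i-1 down to 1 */
        for (int j = i - 1; j >= 1; j--)
            f[j + 1] += f[j - 1] - 2.0 * qv * f[j];
    }
}

/* Hypothetical helper: convert one LSP vector q (cosine domain) into the
 * LP filter coefficients a_1..a_10, stored in a[0..9]. */
static void lsp_to_lpc(const double q[LP_ORDER], double a[LP_ORDER])
{
    double f1[HALF_ORDER + 2], f2[HALF_ORDER + 2];

    lsp_to_poly(q, 0, f1); /* uses q_1, q_3, ..., q_9  */
    lsp_to_poly(q, 1, f2); /* uses q_2, q_4, ..., q_10 */

    for (int i = 1; i <= HALF_ORDER; i++) {
        double f1p = f1[i + 1] + f1[i]; /* f'1_i = f1_i + f1_(i-1) */
        double f2p = f2[i + 1] - f2[i]; /* f'2_i = f2_i - f2_(i-1) */
        a[i - 1]        = 0.5 * f1p + 0.5 * f2p; /* a_i,      i = 1..5 */
        a[LP_ORDER - i] = 0.5 * f1p - 0.5 * f2p; /* a_(11-i), i.e. 10..6 */
    }
}
```

As a sanity check, with every LSP set to 0 both polynomials reduce to (1 + z^-2)^5, so the coefficients come out as 0, 5, 0, 10, 0, 10, 0, 5, 0, 1.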
Decoding of the adaptive (pitch) codebook vector
- indices parsed from bitstream
- indices give integer and fractional parts of the pitch lag
- adaptive codebook vector v(n) is found by interpolating the past excitation u(n) at the pitch lag using an FIR filter. (c.f. §5.6)
Note: division in this section is integer division!
Indices give integer and fractional parts of the pitch lag
12.2kbps mode - 1/6 resolution pitch lag
First and third subframes
In the first and third subframes, a fractional pitch lag is used with resolutions:
- 1/6 in the range [17 3/6, 94 3/6]
- 1 in the range [95, 143]
...encoded using 9 bits.
For [17 3/6, 94 3/6] the pitch index is encoded as:
pitch_index = (pitch_lag_int - 17)*6 + pitch_lag_frac - 3;
- pitch_lag_int
- integer part of the pitch lag in the range [17, 94]
- pitch_lag_frac
- fractional part of the pitch lag in 1/6 units in the range [-2, 3]
so...
    if(pitch_index < (94 4/6 - 17 3/6)*6) {
        // fractional part is encoded in range [17 3/6, 94 3/6]
        pitch_lag_int = (pitch_index + 5)/6 + 17;
        pitch_lag_frac = pitch_index - pitch_lag_int*6 + (17 3/6)*6;
    }
And for [95, 143] the pitch index is encoded as:
pitch_index = (pitch_lag_int - 95) + (94 4/6 - 17 3/6)*6;
- pitch_lag_int
- integer pitch lag in the range [95, 143]
so...
    else {
        // only integer part encoded in range [95, 143], no fractional part
        pitch_lag_int = pitch_index - (94 4/6 - 17 3/6)*6 + 95;
        pitch_lag_frac = 0;
    }
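Putting the two branches together with the mixed fractions evaluated, (94 4/6 - 17 3/6)*6 = 463 and (17 3/6)*6 = 105, gives the C sketch below. The function name and the output-pointer interface are assumptions made for this example.

```c
#include <assert.h>

/* Hypothetical helper: decode the 9-bit pitch index of the 1st or 3rd
 * subframe in the 12.2 kbps mode. lag_frac is in 1/6 units, in [-2, 3]. */
static void decode_pitch_lag_12k2_abs(int pitch_index,
                                      int *lag_int, int *lag_frac)
{
    /* a value parsed from 9 bits is always within this range */
    assert(pitch_index >= 0 && pitch_index <= 511);

    if (pitch_index < 463) {
        /* 1/6 resolution in [17 3/6, 94 3/6]; integer division throughout */
        *lag_int  = (pitch_index + 5) / 6 + 17;
        *lag_frac = pitch_index - *lag_int * 6 + 105;
    } else {
        /* integer resolution in [95, 143] */
        *lag_int  = pitch_index - 463 + 95;
        *lag_frac = 0;
    }
}
```

For example, index 0 decodes to 17 3/6 and index 1 to 18 - 2/6, i.e. 17 4/6.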
Second and fourth subframes
In the second and fourth subframes, a pitch lag resolution of 1/6 is always used in the range [T1 - 5 3/6, T1 + 4 3/6], where T1 is the nearest integer to the fractional pitch lag of the previous (1st or 3rd) subframe. The search range is bounded by [18, 143]. In this case the pitch delay is encoded using 6 bits.
So the search range for the pitch lag is:
    search_range_min = max(pitch_lag_int_prev - 5, 18);
    search_range_max = search_range_min + 9;
    if(search_range_max > 143) {
        search_range_max = 143;
        search_range_min = search_range_max - 9;
    }
- pitch_lag_int_prev
- the integer part of the pitch lag from the previous subframe
The pitch index is encoded as:
pitch_index = (pitch_lag_int - (search_range_min - 1))*6 + pitch_lag_frac - 3;
- pitch_lag_int
- the integer part of the pitch lag in the range [search_range_min - 1, search_range_max]
- pitch_lag_frac
- the fractional part of the pitch lag in the range [-2, 3]
So the pitch lag is calculated through:
    pitch_lag_int = ((pitch_index + 5)/6 - 1) + search_range_min;
    pitch_lag_frac = (pitch_index - 3) - ((pitch_index + 5)/6 - 1)*6;
Q: Why? See the following, but it could be better understood and better described.
    search_range_min is in the range [18, 143-9]
    search_range_max is in the range [18+9, 143]
    pitch_lag_int is in the range [search_range_min-1, search_range_max] = [17, 143]
    pitch_lag_frac is in the range [-2, 3]
To reach the desired range for the pitch lag, if pitch_lag_int == 17 (i.e. search_range_min-1) then pitch_lag_frac must be 3. This means the pitch_index is in the range [0,60].
(pitch_index + 5)/6 - 1
...maps the pitch_index into the range [-1, 9] which is the correct offset of pitch_lag_int from search_range_min.
pitch_index - ((pitch_index + 5)/6 - 1)*6
...removes the integer part of the pitch_index and then 3 is subtracted to shift the results into the range [-2, 3] as required.
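The search range computation and the index decoding for the second and fourth subframes can be combined into one C sketch. The function name and interface are assumptions made for this example. The sketch only transcribes the formulas above and does no validation of the index.

```c
/* Hypothetical helper: decode the 6-bit pitch index of the 2nd or 4th
 * subframe in the 12.2 kbps mode, relative to the integer pitch lag of the
 * previous subframe. lag_frac is in 1/6 units, in [-2, 3]. */
static void decode_pitch_lag_12k2_rel(int pitch_index, int pitch_lag_int_prev,
                                      int *lag_int, int *lag_frac)
{
    int search_range_min = pitch_lag_int_prev - 5;
    if (search_range_min < 18)
        search_range_min = 18;

    int search_range_max = search_range_min + 9;
    if (search_range_max > 143) {
        search_range_max = 143;
        search_range_min = search_range_max - 9;
    }

    /* offset of the integer lag from search_range_min, in [-1, 9] for
     * indices in [0, 60]; integer division */
    int offset = (pitch_index + 5) / 6 - 1;

    *lag_int  = offset + search_range_min;
    *lag_frac = (pitch_index - 3) - offset * 6;
}
```

With a previous integer lag of 50, search_range_min is 45, so index 0 decodes to 44 3/6 and index 33 to exactly 50.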
Other modes - 1/3 resolution pitch lag
First and third subframes
In the first and third subframes, a fractional pitch lag is used with resolutions:
- 1/3 in the range [19 1/3, 84 2/3]
- 1 in the range [85, 143]
...encoded using 8 bits.
For [19 1/3, 84 2/3] the pitch lag is encoded as:
pitch_index = pitch_lag_int*3 + pitch_lag_frac - (19 1/3)*3;
- pitch_lag_int
- integer part of the pitch lag in the range [19, 84]
- pitch_lag_frac
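- fractional part of the pitch lag in 1/3 units in the range [0, 2]; note that the decoding formulas for this range return the equivalent nearest-integer form instead, with pitch_lag_int in [19, 85] and pitch_lag_frac in [-1, 1]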
so...
    if(pitch_index < (85 - 19 1/3)*3) {
        // fractional part is encoded in range [19 1/3, 84 2/3]
        pitch_lag_int = (pitch_index + 2)/3 + 19;
        pitch_lag_frac = pitch_index - pitch_lag_int*3 + (19 1/3)*3;
    }
And for [85, 143] the pitch index is encoded as:
pitch_index = pitch_lag_int - 85 + (85 - 19 1/3)*3;
- pitch_lag_int
- integer pitch lag in the range [85, 143]
so...
    else {
        // only integer part encoded in range [85, 143], no fractional part
        pitch_lag_int = pitch_index - (85 - 19 1/3)*3 + 85;
        pitch_lag_frac = 0;
    }
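With the mixed fractions evaluated, (85 - 19 1/3)*3 = 197 and (19 1/3)*3 = 58, the two branches can be sketched in C as follows. The function name and interface are assumptions made for this example. As written, the formulas return the fraction in the range [-1, 1] around the nearest integer lag.

```c
#include <assert.h>

/* Hypothetical helper: decode the 8-bit pitch index of the 1st or 3rd
 * subframe in the modes with 1/3 resolution. lag_frac is in 1/3 units. */
static void decode_pitch_lag_third_abs(int pitch_index,
                                       int *lag_int, int *lag_frac)
{
    /* a value parsed from 8 bits is always within this range */
    assert(pitch_index >= 0 && pitch_index <= 255);

    if (pitch_index < 197) {
        /* 1/3 resolution in [19 1/3, 84 2/3]; integer division throughout */
        *lag_int  = (pitch_index + 2) / 3 + 19;
        *lag_frac = pitch_index - *lag_int * 3 + 58;
    } else {
        /* integer resolution in [85, 143] */
        *lag_int  = pitch_index - 197 + 85;
        *lag_frac = 0;
    }
}
```

For example, index 0 decodes to 19 + 1/3 and index 196 to 85 - 1/3, i.e. 84 2/3.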
Second and fourth subframes
In the second and fourth subframes, the pitch lag resolution varies depending on the mode as follows:
- 7.95 kbps mode
- resolution of 1/3 is always used in the range [T1 - 10 2/3, T1 + 9 2/3]
- encoded using 6 bits
- 10.2 and 7.40 kbps modes
- resolution of 1/3 is always used in the range [T1 - 5 2/3, T1 + 4 2/3]
- encoded using 5 bits
- 6.70, 5.90, 5.15 and 4.75 kbps modes
- resolution of 1 is used in the range [T1 - 5, T1 + 4]
- resolution of 1/3 is used in the range [T1 - 1 2/3, T1 + 2/3]
- encoded using 4 bits
T1 is the nearest integer to the fractional pitch lag of the previous (1st or 3rd) subframe. The search range is bounded by [20, 143].
So the search range for the pitch lag is:
    lower_bound = 5;
    range = 9;
    if(mode == 7.95) {
        lower_bound = 10;
        range = 19;
    }
    search_range_min = max(pitch_lag_int_prev - lower_bound, 20);
    search_range_max = search_range_min + range;
    if(search_range_max > 143) {
        search_range_max = 143;
        search_range_min = search_range_max - range;
    }
- pitch_lag_int_prev
- the integer part of the pitch lag from the previous subframe
For modes 7.40, 7.95 and 10.2 the pitch index is encoded as:
pitch_index = (pitch_lag_int - search_range_min)*3 + pitch_lag_frac + 2;
- pitch_lag_int
- the integer part of the pitch lag in the range [search_range_min - 1, search_range_max + 1]
- pitch_lag_frac
- the fractional part of the pitch lag in 1/3 units in the range [-1, 1]
So the pitch lag is calculated through:
    temp = (pitch_index + 2)/3 - 1;
    pitch_lag_int = temp + search_range_min;
    pitch_lag_frac = pitch_index - temp*3 - 2;
Q: Why? See the following, but it could be better understood and better described.
    search_range_min is in the range [20, 143-range]
    search_range_max is in the range [20+range, 143]
    pitch_lag_int is in the range [search_range_min-1, search_range_max+1]
    pitch_lag_frac is in the range [-1, 1]
To reach the desired range for the pitch lag, if pitch_lag_int == search_range_min-1 then pitch_lag_frac must be 1, and if pitch_lag_int == search_range_max+1 then pitch_lag_frac must be -1. This means the pitch_index is in the range [0, 3*range+4], i.e. [0,31] for the 10.2 and 7.40 kbps modes and [0,61] for the 7.95 kbps mode.
    (pitch_index + 2)/3 - 1
...maps the pitch_index into the range [-1, range+1] which is the correct offset of pitch_lag_int from search_range_min.
    pitch_index - ((pitch_index + 2)/3 - 1)*3
...removes the integer part of the pitch_index and then 2 is subtracted to shift the results into the range [-1, 1] as required.
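For the 7.40, 7.95 and 10.2 kbps modes, the search range and the decoding formulas can be combined into the C sketch below. The function name and the `is_mode_795` flag are assumptions made for this example. The sketch transcribes the formulas as given and, as written, they return the fraction in 1/3 units in the range [-1, 1].

```c
/* Hypothetical helper: decode the pitch index of the 2nd or 4th subframe
 * for the 7.40, 7.95 and 10.2 kbps modes, relative to the integer pitch lag
 * of the previous subframe. is_mode_795 is non-zero for the 7.95 kbps mode. */
static void decode_pitch_lag_third_rel(int pitch_index, int pitch_lag_int_prev,
                                       int is_mode_795,
                                       int *lag_int, int *lag_frac)
{
    int lower_bound = is_mode_795 ? 10 : 5;
    int range       = is_mode_795 ? 19 : 9;

    int search_range_min = pitch_lag_int_prev - lower_bound;
    if (search_range_min < 20)
        search_range_min = 20;

    int search_range_max = search_range_min + range;
    if (search_range_max > 143) {
        search_range_max = 143;
        search_range_min = search_range_max - range;
    }

    /* offset of the integer lag from search_range_min; integer division */
    int temp = (pitch_index + 2) / 3 - 1;

    *lag_int  = temp + search_range_min;
    *lag_frac = pitch_index - temp * 3 - 2;
}
```

With a previous integer lag of 50 in the 10.2 kbps mode, search_range_min is 45, so index 0 decodes to 44 + 1/3 and index 2 to exactly 45.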
Decoding of the algebraic (innovative or fixed) codebook vector
The parsed algebraic codebook index is used to find the positions and amplitudes (signs) of the excitation pulses and to find the algebraic codebook vector c(n).
If the integer part of the pitch lag, T, is less than the subframe size of 40 samples, the pitch sharpening procedure is applied, which translates into c(n) += βc(n−T), where β is the decoded pitch gain, ^g_p, bounded by [0.0,1.0] or [0.0,0.8], depending on mode.
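A minimal C sketch of the pitch sharpening step follows. The function name and the use of doubles are assumptions. It also assumes the update is done in place in increasing n, so that for T < 20 a sample may already include an earlier contribution, and that the caller has already clamped β to the range for the current mode.

```c
#include <assert.h>

#define SUBFRAME_SIZE 40

/* Hypothetical helper: apply c(n) += beta * c(n - T) for n = T..39 when the
 * integer pitch lag T is shorter than the subframe. */
static void pitch_sharpen(double c[SUBFRAME_SIZE], int pitch_lag_int,
                          double beta)
{
    assert(pitch_lag_int > 0);

    if (pitch_lag_int >= SUBFRAME_SIZE)
        return; /* lag not shorter than the subframe: nothing to do */

    /* in-place update in increasing n (assumption, see the text above) */
    for (int n = pitch_lag_int; n < SUBFRAME_SIZE; n++)
        c[n] += beta * c[n - pitch_lag_int];
}
```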
Decoding of the adaptive and fixed codebook gains
12.2kbps and 7.95kbps - scalar quantised gains
The received indices are used to find the adaptive codebook gain, ^g_p, and the algebraic codebook gc factor, ^γ_gc (gc for gain correction), from the corresponding quantisation tables.
Other modes - vector quantised gains
The received index gives both the adaptive codebook gain, ^g_p, and the algebraic codebook gc factor, ^γ_gc. The estimated algebraic codebook gain gc′ is found as described in clause 5.7.
Smoothing of the fixed codebook gain
10.2, 6.70, 5.90, 5.15, 4.75 kbit/s modes
Adaptive smoothing of fixed codebook gain. (c.f. §6.1 part 4)
Anti-sparseness processing
7.95, 6.70, 5.90, 5.15, 4.75 kbit/s modes
An adaptive anti-sparseness postprocessing procedure is applied to the fixed codebook vector c(n) in order to reduce perceptual artifacts arising from the sparseness of the algebraic fixed codebook vectors with only a few non-zero samples per subframe. The anti-sparseness processing consists of circular convolution of the fixed codebook vector with an impulse response. Three pre-stored impulse responses are used and a number impNr = 0,1,2 is set to select one of them. A value of 2 corresponds to no modification, a value of 1 corresponds to medium modification, while a value of 0 corresponds to strong modification. The selection of the impulse response is performed adaptively from the adaptive and fixed codebook gains. (c.f. §6.1 part 5)
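The circular convolution itself can be sketched in C as below. The function name is hypothetical, the impulse response `h` is supplied by the caller (the pre-stored responses are not reproduced here), and the sketch assumes the response has the same length as the subframe. The adaptive selection of impNr is not shown.

```c
#define SUBFRAME_SIZE 40

/* Hypothetical helper: circularly convolve the fixed codebook vector c with
 * an impulse response h of subframe length, writing the result to out. */
static void apply_anti_sparseness(const double c[SUBFRAME_SIZE],
                                  const double h[SUBFRAME_SIZE],
                                  double out[SUBFRAME_SIZE])
{
    for (int n = 0; n < SUBFRAME_SIZE; n++)
        out[n] = 0.0;

    for (int k = 0; k < SUBFRAME_SIZE; k++) {
        if (c[k] == 0.0)
            continue; /* the codebook vector is sparse: skip empty positions */
        /* each pulse adds a copy of h, wrapped around the subframe end */
        for (int n = 0; n < SUBFRAME_SIZE; n++)
            out[(k + n) % SUBFRAME_SIZE] += c[k] * h[n];
    }
}
```

With h set to a unit impulse at position 0 the output equals the input, which corresponds to the "no modification" case.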
Computing the reconstructed speech
Construct excitation:
u(n) = ^g_p.v(n) + ^g_c.c(n)
(c.f. §6.1 part 6)
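As a minimal C sketch of the excitation formula (the function name and the use of doubles are assumptions made for this example):

```c
#define SUBFRAME_SIZE 40

/* Hypothetical helper: u(n) = ^g_p * v(n) + ^g_c * c(n) for one subframe.
 * v is the adaptive codebook vector, c the fixed codebook vector. */
static void build_excitation(const double v[SUBFRAME_SIZE],
                             const double c[SUBFRAME_SIZE],
                             double pitch_gain, double fixed_gain,
                             double u[SUBFRAME_SIZE])
{
    for (int n = 0; n < SUBFRAME_SIZE; n++)
        u[n] = pitch_gain * v[n] + fixed_gain * c[n];
}
```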
Additional instability protection
(c.f. §6.1 part 7)
Adaptive post-filtering
(c.f. §6.2.1)
High-pass filtering and upscaling
(c.f. §6.2.2)