Sorenson Video 3

From MultimediaWiki

Video codec apparently based on an early H.264 draft.

Unlike H.264, this codec has more motion modes (e.g. even a 4x4 part of a macroblock can have its own motion vector), and motion compensation may be performed with full-pixel, halfpel or thirdpel precision, which is selected independently for each macroblock.

Decoding Process

This codec extensively uses Golomb coding.

Sequence Header

The sequence header is stored in the SMI atom in extradata (the name might have something to do with the company name). Its contents are:

 "SEQH"
 3 bits - frame size code
 (for code = 7) 12 bits - width
 (for code = 7) 12 bits - height
 1 bit - motion vectors may have halfpel precision
 1 bit - motion vectors may have thirdpel precision
 4 bits - unknown
 1 bit - stream does not contain B-frames
 optional data in form 1 bit - has more data, 8 bits - data, repeat
 1 bit - stream is protected

Standard frame dimensions for codes 0-6 are, in order: 160x120, 128x96, 176x144, 352x288, 704x576, 240x180, 320x240.
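The frame size code can be resolved with a small lookup. The following is a minimal sketch, assuming the standard dimensions correspond to codes 0-6 in the order listed; the names `svq3_std_sizes` and `svq3_frame_size` are hypothetical, and reading the explicit 12-bit width and height for code 7 is left to the caller.

```c
/* Standard dimensions, assumed to be indexed by frame size codes 0..6. */
static const struct { int width, height; } svq3_std_sizes[7] = {
    { 160, 120 }, { 128,  96 }, { 176, 144 }, { 352, 288 },
    { 704, 576 }, { 240, 180 }, { 320, 240 },
};

/* Returns 0 and fills width/height for codes 0..6.
 * Returns 1 for code 7 (explicit 12-bit width and height follow),
 * -1 for a value that does not fit in 3 bits. */
static int svq3_frame_size(unsigned code, int *width, int *height)
{
    if (code > 7)
        return -1;
    if (code == 7)
        return 1;
    *width  = svq3_std_sizes[code].width;
    *height = svq3_std_sizes[code].height;
    return 0;
}
```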

For protected streams there's this additional data:

 variable-length code - watermark width
 variable-length code - watermark height
 variable-length code - unknown
 8 bits - unknown
 2 bits - unknown
 variable-length code - unknown
 padding to byte boundary
 deflated watermark image, its checksum (which looks like CCITT 16-bit CRC) will be used to decrypt data

Slice Header

Frame data is organised into slices, each slice has a single byte header (0xFF means frame data end).

Slice data is stored in permuted form. In the header byte, bits 0-4 give the slice header version (only versions 1 and 2 are known) and bits 5-7 give the size of the slice size field in bytes. The header byte is followed by 1-3 bytes of slice size and then by most of the payload; the first 0-2 payload bytes (size of the slice size field minus one) are stored at the very end of the slice instead. Additionally, slice data may be further scrambled, probably in order to prevent unauthorised playback.
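Splitting the slice header byte into its two fields might look as follows. This is only a sketch of the bit layout described above; the type `Svq3SliceByte` and the function `svq3_parse_slice_byte` are hypothetical names, and undoing the payload permutation is not attempted here.

```c
#include <stdint.h>

typedef struct {
    unsigned version;   /* bits 0-4: slice header version (1 or 2 known) */
    unsigned size_len;  /* bits 5-7: bytes used by the slice size field  */
} Svq3SliceByte;

/* Returns 0 on success, 1 for the end-of-frame marker 0xFF,
 * -1 if a field is outside the ranges described in the text. */
static int svq3_parse_slice_byte(uint8_t b, Svq3SliceByte *out)
{
    if (b == 0xFF)
        return 1;
    out->version  = b & 0x1F;
    out->size_len = b >> 5;
    if (out->version != 1 && out->version != 2)
        return -1;
    if (out->size_len < 1 || out->size_len > 3)
        return -1;
    return 0;
}
```

For example, a header byte of 0x42 would describe a version 2 slice with a two-byte size field, so one payload byte would be found at the end of the slice.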

Slice header data:

 frame code (variable length, 0 - P frame, 1 - B frame, 2 - I frame)
 (version 1 only) 1 bit - probably "has more slices" flag
 (version 2 only) (maximum of log2(num_mbs) or 6 bits) - probably macroblock offset of the current slice
 8 bits - frame number
 5 bits - slice quantiser
 1 bit  - delta quantiser may be present
 1 bit  - unknown
 (if data is protected) 1 bit - unknown
 optional data in form 1 bit - has more data, 8 bits - data, repeat
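The "has more data" construct at the end of the header (also used in the sequence header) can be skipped with a simple loop. The sketch below assumes bits are read most significant bit first; the `BitReader` type and both functions are hypothetical helpers, not part of any existing decoder API.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *buf;
    size_t size_bits;   /* number of valid bits in buf */
    size_t pos;         /* current bit position */
} BitReader;

/* Reads n bits, MSB first (assumption); bits past the end read as 0. */
static unsigned read_bits(BitReader *br, unsigned n)
{
    unsigned v = 0;
    while (n--) {
        unsigned bit = 0;
        if (br->pos < br->size_bits)
            bit = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
        br->pos++;
        v = (v << 1) | bit;
    }
    return v;
}

/* Skips "1 bit - has more data, 8 bits - data" groups.
 * Returns the number of data bytes that were skipped. */
static unsigned skip_optional_data(BitReader *br)
{
    unsigned count = 0;
    while (br->pos < br->size_bits && read_bits(br, 1)) {
        read_bits(br, 8);
        count++;
    }
    return count;
}
```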

Macroblock layer

Each macroblock starts with a Golomb code signalling the MB type.

For I-frames the types are:

  • 0 - macroblock with luma DCs coded in separate 4x4 block
  • 1-24 - macroblock with predefined coded block pattern and intra prediction mode (virtually the same as in H.264 but with different order for intra prediction mode)
  • 25 - macroblock with luma DCs coded in separate 4x4 block and no other blocks coded

For P-frames there are more types:

  • 0 - skip block
  • 1 - 16x16 inter block
  • 2 - inter block with MVs for each 8x16 part (codes two motion vectors)
  • 3 - inter block with MVs for each 16x8 part
  • 4 - inter block with MVs for each 8x8 part
  • 5 - inter block with MVs for each 4x8 part
  • 6 - inter block with MVs for each 8x4 part
  • 7 - inter block with MVs for each 4x4 part (codes sixteen motion vectors)
  • 8-33 - intra modes (the same as above)
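The number of motion vectors coded for each P-frame inter type follows from the partition size. A small sketch, assuming one vector per partition of a 16x16 macroblock (which agrees with the counts given for types 2 and 7); `svq3_p_part` and `svq3_p_mv_count` are hypothetical names:

```c
/* Partition sizes for P-frame MB types 1..7, taken from the list above. */
static const struct { int w, h; } svq3_p_part[8] = {
    {  0,  0 },                                   /* 0: skip, nothing coded */
    { 16, 16 }, { 8, 16 }, { 16, 8 }, { 8, 8 },   /* types 1..4 */
    {  4,  8 }, { 8,  4 }, {  4, 4 },             /* types 5..7 */
};

/* Number of motion vectors for an inter MB type.
 * Returns 0 for skip and -1 for types 8 and above (intra modes). */
static int svq3_p_mv_count(unsigned mb_type)
{
    if (mb_type == 0)
        return 0;
    if (mb_type > 7)
        return -1;
    return (16 * 16) / (svq3_p_part[mb_type].w * svq3_p_part[mb_type].h);
}
```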

B-frame MB types:

  • 0 - direct block (motion vector for each 4x4 block is calculated from the next reference frame MV and frame distances) with coded residue
  • 1 - forward block
  • 2 - backward block
  • 3 - bidirectionally predicted block (codes two motion vectors)
  • 4-29 - intra modes

Coefficients are stored in 4x4 (sub)blocks except for chroma DCs which are stored in 2x2 blocks.

Dezigzag pattern (from H.264):

 o-->o-->o   o
         |  /|
 o   o   o / o
 | / |   |/  |
 o   o   o   o
   /
 o-->o-->o-->o

This pattern is used only for luma blocks in 4x4 intra MBs when the quantiser is less than 24. Otherwise the normal zigzag is used.

Coefficient decoding

Each coefficient is stored as a Golomb codeword whose last bit is the coefficient sign; code = 0 means the end of nonzero coefficients.

Codes for 2x2 chroma DC block:

 code    run          value
 0-2     0            code
 3       1            1
 4-...   code & 0x3   ((code + 9) >> 2) - run

Codes for blocks using an alternative scan:

 code     run          value
 0        0            0
 1        0            1
 2        1            1
 3        0            2
 4        2            1
 5        0            3
 6        0            4
 7        0            5
 8        3            1
 9        4            1
 10       1            2
 11       1            3
 12       0            6
 13       0            7
 14       0            8
 15       0            9
 16-...   code & 0x7   (code >> 3) + intra_run[run]

Please note that in this case the block is coded in two parts of up to eight coefficients each, corresponding to each half-scan.

Codes for all other block types:

 code     run          value
 0        0            0
 1        0            1
 2        1            1
 3        2            1
 4        0            2
 5        3            1
 6        4            1
 7        5            1
 8        0            3
 9        1            2
 10       2            2
 11       6            1
 12       7            1
 13       8            1
 14       9            1
 15       0            4
 16-...   code & 0xF   (code >> 4) + inter_run[run]

Run correction values:

 intra_run = { 8, 2, 0, 0, 0, -1, -1, -1, [minus  ones] };
 inter_run = { 4, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, [zeroes] };
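The three mappings can be combined into one lookup routine. The sketch below takes the table code as input (sign handling is left out) and adds the run correction value, so that code 16 continues the run-0 sequence of each table (value 10 after 9 for the alternative scan, 5 after 4 for the other blocks). The enum, the `RunLevel` type and `svq3_run_level` are hypothetical names.

```c
typedef struct { int run, level; } RunLevel;

/* Codes 0..15 for blocks using the alternative scan. */
static const RunLevel svq3_intra_rl[16] = {
    {0,0},{0,1},{1,1},{0,2},{2,1},{0,3},{0,4},{0,5},
    {3,1},{4,1},{1,2},{1,3},{0,6},{0,7},{0,8},{0,9},
};
/* Codes 0..15 for all other block types. */
static const RunLevel svq3_inter_rl[16] = {
    {0,0},{0,1},{1,1},{2,1},{0,2},{3,1},{4,1},{5,1},
    {0,3},{1,2},{2,2},{6,1},{7,1},{8,1},{9,1},{0,4},
};
/* Run correction values, truncated to the range the run masks can produce. */
static const int svq3_intra_run[8]  = { 8, 2, 0, 0, 0, -1, -1, -1 };
static const int svq3_inter_run[16] = { 4, 2, 2, 1, 1, 1, 1, 1,
                                        1, 1, 0, 0, 0, 0, 0, 0 };

enum { SVQ3_CHROMA_DC, SVQ3_ALT_SCAN, SVQ3_OTHER };

/* Maps a table code to a (run, level) pair for the given block kind. */
static RunLevel svq3_run_level(int kind, unsigned code)
{
    RunLevel rl;
    if (kind == SVQ3_CHROMA_DC) {
        if (code < 3) {
            rl.run = 0; rl.level = (int)code;
        } else if (code == 3) {
            rl.run = 1; rl.level = 1;
        } else {
            rl.run = (int)(code & 0x3);
            rl.level = (int)((code + 9) >> 2) - rl.run;
        }
    } else if (kind == SVQ3_ALT_SCAN) {
        if (code < 16)
            return svq3_intra_rl[code];
        rl.run = (int)(code & 0x7);
        rl.level = (int)(code >> 3) + svq3_intra_run[rl.run];
    } else {
        if (code < 16)
            return svq3_inter_rl[code];
        rl.run = (int)(code & 0xF);
        rl.level = (int)(code >> 4) + svq3_inter_run[rl.run];
    }
    return rl;
}
```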

Intra macroblock information decoding

In the case of 4x4 prediction, the intra mode is predicted in the following order:

 ( 0,  1)  ( 4,  5)
 ( 2,  3)  ( 6,  7)
 ( 8,  9)  (12, 13)
 (10, 11)  (14, 15)

Prediction is performed by reading a variable-length code which corresponds to one of the following pairs:

 { 0, 0 },
 { 1, 0 }, { 0, 1 },
 { 0, 2 }, { 1, 1 }, { 2, 0 },
 { 3, 0 }, { 2, 1 }, { 1, 2 }, { 0, 3 },
 { 0, 4 }, { 1, 3 }, { 2, 2 }, { 3, 1 }, { 4, 0 },
 { 4, 1 }, { 3, 2 }, { 2, 3 }, { 1, 4 },
 { 2, 4 }, { 3, 3 }, { 4, 2 },
 { 4, 3 }, { 3, 4 },
 { 4, 4 }

Each element of the pair is then used as an index in the prediction table (the proper order is pred_table[top + 1][left + 1][idx]). When predictors lie outside of the slice, -1 is used instead. For 16x16 intra and any inter blocks, a value of 2 is used as the predictor. If the table value is -1, then either the input data was incorrect or the intra modes were predicted incorrectly.

   { { 2, -1, -1, -1, -1 }, { 2, 1, -1, -1, -1 }, { 1, 2, -1, -1, -1 },
     { 2,  1, -1, -1, -1 }, { 1, 2, -1, -1, -1 }, { 1, 2, -1, -1, -1 } },
   { { 0,  2, -1, -1, -1 }, { 0, 2,  1,  4,  3 }, { 0, 1,  2,  4,  3 },
     { 0,  2,  1,  4,  3 }, { 2, 0,  1,  3,  4 }, { 0, 4,  2,  1,  3 } },
   { { 2,  0, -1, -1, -1 }, { 2, 1,  0,  4,  3 }, { 1, 2,  4,  0,  3 },
     { 2,  1,  0,  4,  3 }, { 2, 1,  4,  3,  0 }, { 1, 2,  4,  0,  3 } },
   { { 2,  0, -1, -1, -1 }, { 2, 0,  1,  4,  3 }, { 1, 2,  0,  4,  3 },
     { 2,  1,  0,  4,  3 }, { 2, 1,  3,  4,  0 }, { 2, 4,  1,  0,  3 } },
   { { 0,  2, -1, -1, -1 }, { 0, 2,  1,  3,  4 }, { 1, 2,  3,  0,  4 },
     { 2,  0,  1,  3,  4 }, { 2, 1,  3,  0,  4 }, { 2, 0,  4,  3,  1 } },
   { { 0,  2, -1, -1, -1 }, { 0, 2,  4,  1,  3 }, { 1, 4,  2,  0,  3 },
     { 4,  2,  0,  1,  3 }, { 2, 0,  1,  4,  3 }, { 4, 2,  1,  0,  3 } },
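Put together, the lookup could be sketched as below. The tables are copied from the text; the names `svq3_pred_pairs`, `svq3_pred_table` and `svq3_pred_mode` are hypothetical, and which element of a pair belongs to which block of the pair is not covered here.

```c
#include <stdint.h>

/* Index pairs selected by the variable-length code (25 entries). */
static const int8_t svq3_pred_pairs[25][2] = {
    { 0, 0 },
    { 1, 0 }, { 0, 1 },
    { 0, 2 }, { 1, 1 }, { 2, 0 },
    { 3, 0 }, { 2, 1 }, { 1, 2 }, { 0, 3 },
    { 0, 4 }, { 1, 3 }, { 2, 2 }, { 3, 1 }, { 4, 0 },
    { 4, 1 }, { 3, 2 }, { 2, 3 }, { 1, 4 },
    { 2, 4 }, { 3, 3 }, { 4, 2 },
    { 4, 3 }, { 3, 4 },
    { 4, 4 },
};

/* Indexed as [top + 1][left + 1][idx]. */
static const int8_t svq3_pred_table[6][6][5] = {
    { { 2, -1, -1, -1, -1 }, { 2, 1, -1, -1, -1 }, { 1, 2, -1, -1, -1 },
      { 2,  1, -1, -1, -1 }, { 1, 2, -1, -1, -1 }, { 1, 2, -1, -1, -1 } },
    { { 0,  2, -1, -1, -1 }, { 0, 2,  1,  4,  3 }, { 0, 1,  2,  4,  3 },
      { 0,  2,  1,  4,  3 }, { 2, 0,  1,  3,  4 }, { 0, 4,  2,  1,  3 } },
    { { 2,  0, -1, -1, -1 }, { 2, 1,  0,  4,  3 }, { 1, 2,  4,  0,  3 },
      { 2,  1,  0,  4,  3 }, { 2, 1,  4,  3,  0 }, { 1, 2,  4,  0,  3 } },
    { { 2,  0, -1, -1, -1 }, { 2, 0,  1,  4,  3 }, { 1, 2,  0,  4,  3 },
      { 2,  1,  0,  4,  3 }, { 2, 1,  3,  4,  0 }, { 2, 4,  1,  0,  3 } },
    { { 0,  2, -1, -1, -1 }, { 0, 2,  1,  3,  4 }, { 1, 2,  3,  0,  4 },
      { 2,  0,  1,  3,  4 }, { 2, 1,  3,  0,  4 }, { 2, 0,  4,  3,  1 } },
    { { 0,  2, -1, -1, -1 }, { 0, 2,  4,  1,  3 }, { 1, 4,  2,  0,  3 },
      { 4,  2,  0,  1,  3 }, { 2, 0,  1,  4,  3 }, { 4, 2,  1,  0,  3 } },
};

/* top and left are neighbouring modes in -1..4 (-1 = outside the slice),
 * idx is one element of a decoded pair.
 * Returns the intra mode, or -1 for invalid input or a -1 table entry. */
static int svq3_pred_mode(int top, int left, int idx)
{
    if (top < -1 || top > 4 || left < -1 || left > 4 || idx < 0 || idx > 4)
        return -1;
    return svq3_pred_table[top + 1][left + 1][idx];
}
```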

For 4x4 predicted blocks there's also CBP present, it is coded the same way as in H.264.

And finally before actual coefficient data you may have a quantiser delta coded as signed variable-length code.

Inter macroblock information decoding

If motion vectors are present for the macroblock, precision and motion vector differences are coded before coefficients.

Precision for inter macroblock (in P-frame) can be determined as:

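  // a bit is read only when the corresponding flag is set (&& short-circuits),
  // so each branch consumes at most one bit
  if has_thirdpel && get_bit() != has_halfpel {
    use thirdpel mode
  } else if has_halfpel && get_bit() != has_thirdpel {
    use halfpel mode
  } else {
    use fullpel mode
  }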

Motion vector differences are coded as signed variable-length codes, Y component first.

CBP is coded the same way as in H.264.

Macroblock transform and dequantization

Transform coefficients:

 13  17  13   7
 13   7 -13 -17
 13  -7 -13  17
 13 -17  13  -7

Dequantization is performed by multiplying every coefficient by the same value determined by the quantizer. For some blocks the first coefficient may be dequantized slightly differently:

For intra luma blocks without separate DC coefficients block:

   dc = 13 * 13 * 1538 * block[0]

For chroma blocks:

   dc = (svq3_dequant_coeff[Q] * (block[0] >> 3)) >> 1;

Please note that chroma DCs need to be transformed first using the following matrix:

  8  8
  8 -8

Quantizer table (from svq3.c)

 static const uint32_t svq3_dequant_coeff[32] = {
   3881,  4351,  4890,  5481,  6154,  6914,  7761,  8718,
   9781, 10987, 12339, 13828, 15523, 17435, 19561, 21873,
  24552, 27656, 30847, 34870, 38807, 43747, 49103, 54683,
  61694, 68745, 77615, 89113,100253,109366,126635,141533
 };

Dequantization formula (dc=0 if not defined otherwise):

 out = (coeff * svq3_dequant_coeff[Q] + dc + 0x80000) >> 20;
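As a sketch, the formula translates directly to C. The function name `svq3_dequant` is hypothetical; 64-bit arithmetic is used here only to rule out overflow, and the right shift of a negative value is assumed to be arithmetic (true for common compilers, but implementation-defined in C).

```c
#include <stdint.h>

static const uint32_t svq3_dequant_coeff[32] = {
     3881,  4351,  4890,  5481,  6154,  6914,  7761,  8718,
     9781, 10987, 12339, 13828, 15523, 17435, 19561, 21873,
    24552, 27656, 30847, 34870, 38807, 43747, 49103, 54683,
    61694, 68745, 77615, 89113,100253,109366,126635,141533
};

/* out = (coeff * svq3_dequant_coeff[Q] + dc + 0x80000) >> 20
 * q is masked to 0..31; pass dc = 0 when no DC term is defined. */
static int svq3_dequant(int coeff, unsigned q, int64_t dc)
{
    int64_t v = (int64_t)coeff * svq3_dequant_coeff[q & 31] + dc + 0x80000;
    return (int)(v >> 20);
}
```

The constant 0x80000 is half of 1 << 20, so the shift rounds to nearest instead of truncating.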

Intra prediction

Intra prediction is the same as in H.264 except for the following quirks:

  • 4x4 diagonal down prediction is performed as
 a b c c
 b c c c
 c c c c
 c c c c

where a = (left[1] + top[1]) / 2, b = (left[2] + top[2]) / 2 and c = (left[3] + top[3]) / 2.

  • 16x16 plane prediction is the same as in H.264 but transposed for some reason;
  • 8x8 chroma always uses DC prediction.
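The 4x4 diagonal down quirk can be sketched as follows, assuming `top` and `left` hold the four neighbouring samples above and to the left of the block, indexed from 0; the function name is hypothetical.

```c
#include <stdint.h>

/* Fills a 4x4 block with the pattern
 *   a b c c
 *   b c c c
 *   c c c c
 *   c c c c
 * where a, b and c are averages of left[i] and top[i] for i = 1, 2, 3. */
static void svq3_pred4x4_diag_down(uint8_t dst[4][4],
                                   const uint8_t top[4],
                                   const uint8_t left[4])
{
    const uint8_t a = (uint8_t)((left[1] + top[1]) / 2);
    const uint8_t b = (uint8_t)((left[2] + top[2]) / 2);
    const uint8_t c = (uint8_t)((left[3] + top[3]) / 2);

    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            const int d = x + y;   /* index of the anti-diagonal */
            dst[y][x] = d == 0 ? a : d == 1 ? b : c;
        }
    }
}
```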

Motion Compensation

Since P-frame macroblocks can have different motion vector precision (it is always halfpel precision in B-frames), motion vectors are stored and predicted in units of one sixth of a pixel and then rounded to the desired base.

Thirdpel interpolation in one direction uses formula ((2 * A + B + 1) * 0x2AB) >> 11 and two-dimensional interpolation uses matrix

 4 3
 3 2

and ((4 * A + 3 * B + 3 * C + 2 * D + 6) * 0xAAB) >> 15 for the output.
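Both formulas replace a division by a multiply and shift: 0x2AB / 2^11 is approximately 1/3 and 0xAAB / 2^15 is approximately 1/12, the latter matching the sum of the matrix weights. A minimal sketch with hypothetical function names, taking 8-bit sample values as input:

```c
#include <stdint.h>

/* One-dimensional thirdpel interpolation: roughly (2 * A + B) / 3, rounded. */
static uint8_t svq3_tpel_1d(int a, int b)
{
    return (uint8_t)(((2 * a + b + 1) * 0x2AB) >> 11);
}

/* Two-dimensional thirdpel interpolation with weights 4, 3, 3, 2:
 * roughly (4 * A + 3 * B + 3 * C + 2 * D) / 12, rounded. */
static uint8_t svq3_tpel_2d(int a, int b, int c, int d)
{
    return (uint8_t)(((4 * a + 3 * b + 3 * c + 2 * d + 6) * 0xAAB) >> 15);
}
```

For inputs in the range 0..255 the results stay within 0..255, so no clipping is needed in this sketch.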

Packetization

The first byte of an SVQ3 RTP packet, which comes after the standard RTP headers, indicates the packet type(s).

 1st byte value   packet type
 0x40             config
 0x20             start
 0x10             end

Note that a packet may have more than one type. Config packets are in practice transmitted individually since they do not contain additional video data beyond extradata.

The second byte of an SVQ3 RTP packet is ignored. This is likely reserved for future use.

All subsequent bytes are payload data.

SVQ3 does not use SDP FMTP attributes to carry codec-specific extradata; instead, extradata is carried within the RTP stream itself in config packets. SVQ3 decoders also expect extradata to be prefixed with the marker bytes "SEQH", followed by another 4 bytes indicating the length of the extradata. Neither is provided by the payload, so both must be inserted by the depacketizer prior to decoding. The rest of the payload data is standard SVQ3 extradata.
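A depacketizer could rebuild the extradata from a config packet roughly as follows. This is a sketch under two assumptions that the text does not state: the 4-byte length is written big-endian, and it counts only the payload bytes that follow the two header bytes. The enum and the function `svq3_rtp_config_to_extradata` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    SVQ3_RTP_CONFIG = 0x40,
    SVQ3_RTP_START  = 0x20,
    SVQ3_RTP_END    = 0x10
};

/* pkt points at the first byte after the standard RTP headers.
 * Writes "SEQH", a 4-byte length (assumed big-endian) and the payload
 * into out. Returns the number of bytes written, or 0 if the packet is
 * not a config packet or out is too small. */
static size_t svq3_rtp_config_to_extradata(const uint8_t *pkt, size_t pkt_len,
                                           uint8_t *out, size_t out_cap)
{
    size_t payload_len;

    if (pkt_len < 2 || !(pkt[0] & SVQ3_RTP_CONFIG))
        return 0;
    payload_len = pkt_len - 2;          /* the second byte is ignored */
    if (out_cap < payload_len + 8)
        return 0;

    memcpy(out, "SEQH", 4);
    out[4] = (uint8_t)(payload_len >> 24);
    out[5] = (uint8_t)(payload_len >> 16);
    out[6] = (uint8_t)(payload_len >> 8);
    out[7] = (uint8_t)payload_len;
    memcpy(out + 8, pkt + 2, payload_len);
    return payload_len + 8;
}
```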

Start packets come with a new RTP timestamp, and config packets may be periodically re-transmitted before a keyframe.

End packets simply indicate that the payload data constitutes the remainder of an SVQ3 frame.